Plenary Keynote: The Power of the Electronic Scientific Thesis

Speaker: Peter Murray-Rust, Unilever Centre for Molecular Sciences Informatics, Department of Chemistry, University of Cambridge, CB2 1EW, UK

Much of the laboratory and computational work in science is done by graduate students. A PhD thesis represents three or more years' work, and much of its text consists of detailed accounts of scientific experiments, facts, recipes and methodology. I shall refer to this as "data" and argue that it is factual information which, when published, belongs to the scientific commons. I suspect this data represents between 25% and 50% of published science in fact-based domains such as chemistry, much of the rest being contributed by postdoctoral workers.

Yet our own work in the SPECTRa project has shown that 80% (or more) of scientific data is never published. This is an enormous loss to funders, academia and wealth-creating industries. The reasons are laziness and the barrier created by scientific publishers, who have little current interest in publishing data. Very worryingly, some publishers are starting to copyright scientific data, despite clear indications that this is illegal and unethical. Yet few publishers actually take any part in reviewing data quality.

Electronic theses have the power to change all this. The thesis has several major advantages over current methods of publication.

There are technical and sociopolitical barriers.

My utopian vision is that students prepare their theses in XML. This solves all the technical problems. It will also help students to prepare better theses faster. For example, students are often criticised for omitting scientific units, leaving scales and labels off diagrams, missing out critical information, etc. We have developed an XML toolkit, RACSO, for authoring chemical theses, which effectively checks and publishes the experimental part of synthetic chemistry. Synthetic chemistry is one of the easiest disciplines to start with as it is formulaic, but other subjects have many large sections which use template-like thinking and could be adapted.
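To make the idea of machine-checkable authoring concrete, here is a minimal Python sketch of one such check: flagging quantities that have been written without units. The element and attribute names (`experiment`, `quantity`, `name`, `units`) are invented for this illustration and are not taken from the toolkit mentioned above or from any real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup invented for this sketch; not the schema of any real toolkit.
SAMPLE = """
<experiment id="prep-1">
  <quantity name="mass" units="g">1.25</quantity>
  <quantity name="yield" units="percent">82</quantity>
  <quantity name="melting_point">131</quantity>
</experiment>
"""


def find_missing_units(xml_text):
    """Return the names of quantity elements that lack a units attribute."""
    root = ET.fromstring(xml_text)
    missing = []
    for quantity in root.iter("quantity"):
        # An absent or empty units attribute counts as missing.
        if not quantity.get("units"):
            missing.append(quantity.get("name", "(unnamed)"))
    return missing


if __name__ == "__main__":
    for name in find_missing_units(SAMPLE):
        print(f"quantity '{name}' has no units")
```

A check of this kind could run each time the student saves a chapter, so that omissions are caught during writing rather than at examination.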

Even with PDF or Word theses it is possible to scrape useful data and metadata, and I shall show how our SPECTRa-T project is able to do this.
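As a rough illustration of what scraping can mean once the text has been extracted from a PDF or Word file, the Python sketch below pulls melting points out of a sentence of experimental prose with a regular expression. It is an assumption-laden toy, not a description of how SPECTRa-T works: the pattern, the function name `extract_melting_points` and the sample sentence are all invented here.

```python
import re

# Matches phrases such as "m.p. 131-133 °C" or "mp 88 °C".
# \u00b0 is the degree sign and \u2013 is an en dash, which is common in typeset ranges.
MP_PATTERN = re.compile(
    r"\bm\.?p\.?\s*(\d+(?:\.\d+)?)\s*(?:[-\u2013]\s*(\d+(?:\.\d+)?))?\s*\u00b0\s*C"
)

# Invented example sentence in the style of a synthetic chemistry experimental section.
SAMPLE = "White solid (0.82 g, 74%), m.p. 131-133 \u00b0C. The acid had mp 88 \u00b0C."


def extract_melting_points(text):
    """Return a list of (low, high) melting points in degrees Celsius."""
    results = []
    for match in MP_PATTERN.finditer(text):
        low = float(match.group(1))
        # A single value is reported as a range with equal bounds.
        high = float(match.group(2)) if match.group(2) else low
        results.append((low, high))
    return results


if __name__ == "__main__":
    for low, high in extract_melting_points(SAMPLE):
        print(f"melting point: {low} to {high} degrees C")
```

Formulaic prose makes this kind of extraction feasible, but a pattern like this one will miss unconventional phrasings, which is one reason structured authoring is preferable to scraping after the fact.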

I suggest the following simple rules:

The talk will be illustrated with examples of how robots can read theses and will suggest how a graduate student could author a thesis in XML.

We thank JISC for support for SPECTRa and SPECTRa-T.