ETD 2007: Added values to e-theses

Plenary Key Note: The Power of the Electronic Scientific Thesis

Speaker: Peter Murray-Rust, Unilever Centre for Molecular Sciences Informatics, Department of Chemistry, University of Cambridge, CB2 1EW, UK

Much of the laboratory and computational work in science is actually done by graduate students. A PhD thesis represents three or more years' work, and much of the text consists of detailed accounts of scientific experiments, facts, recipes and methodology. I shall refer to this as "data" and argue that it is factual information which, when published, belongs to the scientific commons. I suspect this data represents somewhere between 25% and 50% of published science in fact-based domains such as chemistry, much of the rest being contributed by postdoctoral workers.

Yet our own work in the SPECTRa project has shown that 80% (or more) of scientific data is never published. This is an enormous loss to funders, academia and wealth-creating industries. The reasons are laziness and the barrier created by scientific publishers, who have little current interest in publishing data. Very worryingly, some publishers are starting to copyright scientific data, despite clear indications that this is illegal and unethical. Yet few publishers actually take any part in reviewing data quality.

Electronic theses have the power to change all this. The thesis has several major advantages over current methods of publication:

the author and/or their institution retain complete control over the copyright of the work and are not forced to hand it over to the publisher
there is a strict quality control system of internal and external examiners. The candidate has to convince them that the data are fit for purpose.
the student cannot be "lazy" about the means of authoring. If a university insists on XML then the student will have to do it.
an electronic thesis can (and I argue must) be openly available in an institutional repository.
an unlimited amount of supporting data can be copublished.

There are technical and sociopolitical barriers:

the thesis is often produced in some form of e-paper (TIFF or PDF), which completely destroys all semantics
XML tools are not yet universal
there is no metadata for the scientific data
the authors and their supervisors are afraid that someone might read the thesis and (a) show there are errors or (b) re-use it in clever ways, thus "scooping" the authors. (This is sometimes conflated with the problems of patents and confidential human information, but there are well-accepted mechanisms for these.) There are no moral reasons why the average thesis should not be fully visible to the world and re-usable under the BOAI declaration.
some universities have medieval rules of ownership and copyright, although enlightened ones now routinely post their theses.

My utopian vision is that students prepare their thesis in XML. This solves all the technical problems. It also will help the students to prepare better theses faster. For example students are often criticised for not having scientific units, omitting scales and labels on diagrams, missing out critical information, etc. We have developed an XML toolkit RACSO for authoring chemical theses which effectively checks and publishes the experimental part of synthetic chemistry. Synthetic chemistry is one of the easiest disciplines to start with as it is formulaic but other subjects have many large sections which use template-like thinking and could be adapted.
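As a minimal sketch of the kind of check that XML authoring makes possible, the following flags measured quantities that carry no units. The `<quantity>` element and its `name` and `units` attributes are invented for illustration only; they are not CML and are not the format used by our toolkit.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: <quantity>, name and units are assumptions made for
# this illustration; they are not CML and not any real thesis schema.
SAMPLE = """
<experiment>
  <quantity name="yield" units="percent">85</quantity>
  <quantity name="melting_point">120</quantity>
  <quantity name="mass" units="mg">42.0</quantity>
</experiment>
"""


def find_missing_units(xml_text):
    """Return the names of <quantity> elements with no non-empty units attribute."""
    root = ET.fromstring(xml_text.strip())
    missing = []
    for quantity in root.iter("quantity"):
        # Treat an absent or blank units attribute as a reportable omission.
        if not quantity.get("units", "").strip():
            missing.append(quantity.get("name", "(unnamed)"))
    return missing


if __name__ == "__main__":
    for name in find_missing_units(SAMPLE):
        print(f"missing units: {name}")
```

A check like this is only possible because the markup says which numbers are quantities; the same values embedded in a PDF page give a program nothing to test.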

Even with PDF or Word theses it is possible to scrape useful data and metadata, and I shall show how our SPECTRa-T project is able to do this.

I suggest the following simple rules:

invest in XML authoring technology for theses (it is then automatic to create PDFs)
invest in communal XML languages (MathML, CML, SVG...) for the major scientific domains, and use them to check the quality of material
develop departmental awareness and practices for capturing data at source. Our SPECTRa project has done this for crystallography, computational chemistry and spectroscopy.
until then ALWAYS co-deposit a Word or LaTeX document, never just the PDF
add a copyright notice such as Science/Creative Commons to protect the data from being appropriated by publishers

The talk will be illustrated with examples of how robots can read theses and will suggest how a graduate student could author a thesis in XML.

We thank JISC for support for SPECTRa and SPECTRa-T.