DiVA Document Format

Version 1.0: Reference Description (Draft)

Introduction
The Global Structure of a DiVA Document
Metadata
Fulltext Contents
References

Introduction

The DiVA Document Format is a general XML [XML] document type especially developed for, but not limited to, scientific publications. The format is developed and maintained by the Electronic Publishing Centre [EPC] at Uppsala University Library within the DiVA project. A general background of this project and the format is given in D-Lib Magazine (November 2003) [DLib].

The Global Structure of a DiVA Document

A DiVA XML document consists of metadata describing a publication and may also contain the fulltext contents. The root element is always documents, so that several documents can be included in a single file. Each individual document is contained within a document element. If the fulltext is included, it appears within the contents element. A date element and a time element, giving the creation date and time of the particular file, are also required.

<documents>
   <date type="creation" timezone="UTC+1">
      <year>2004</year>
      <month>01</month>
      <day>27</day>
   </date>
   <time type="creation" timezone="UTC+1">14:28</time>
   <document>
       ...the metadata ...
      <contents>...the fulltext contents...</contents>
   </document>
</documents>
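
Because the root element is documents, one file can carry more than one publication. The following is a minimal sketch of such a file, built only from the elements shown above; the ellipses are placeholders, and the number of document elements is chosen for illustration.

<documents>
   <!-- creation date and time of this file, as in the example above -->
   <date type="creation" timezone="UTC+1">
      <year>2004</year>
      <month>01</month>
      <day>27</day>
   </date>
   <time type="creation" timezone="UTC+1">14:28</time>
   <!-- first document: metadata only -->
   <document>
       ...the metadata ...
   </document>
   <!-- second document: metadata and fulltext -->
   <document>
       ...the metadata ...
      <contents>...the fulltext contents...</contents>
   </document>
</documents>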

Detailed Description of the DiVA Document Structure.

Metadata

The DiVA Document Format uses an internal metadata format that was developed in the DiVA project, since the other existing formats that were considered did not include all the features needed. The format, described in an XML Schema [diva], is component based and extensible. Some inspiration has been gathered from the work on Functional Requirements for Bibliographic Records, FRBR [frbr], by IFLA. For instance, all formats of the document (printed as well as electronic) are described within the same record as "manifestations". In the current version of the format, manifestations describe the different representations and file formats, all with the same contents, in which a document exists. Revised editions or other language versions of a document are considered to be other documents rather than manifestations. Common elements are used for information that is valid for all manifestations of a document.

Detailed Description of Metadata Components.

Common elements

The common elements for all documents are the following:

properties: A container element for the document types of the document, each given in a property element. A document can have several document types.

<properties>
   <property>book</property>
   <property>thesis</property>
</properties>

identifiers: A container element for identifier elements that belong to all manifestations. These contain the identifiersType component.

specifics: This element contains elements that belong to a certain document type, e.g. the degree or the date and place of the defence of a thesis. The document type is specified as a value in the type attribute.

<specifics type="thesis">
   <degree>
      <identifiers>...an identifier...</identifiers>
      <descriptions>...a description...</descriptions>
   </degree>
</specifics>

languages: A container element for elements that describe what languages are used in the document. The language or languages of the main part of the document are contained within the documentLanguages element and summaries in the summaryLanguages element. These contain the languageType component.
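
As a rough sketch, the languages element might be laid out as follows. Only the element names come from the description above; that documentLanguages and summaryLanguages are direct children of languages is an assumption, and the placeholders stand for content defined by the languageType component, which is not reproduced here.

<languages>
   <!-- language(s) of the main part of the document -->
   <documentLanguages>...a language...</documentLanguages>
   <!-- language(s) of the summaries; nesting assumed, see lead-in -->
   <summaryLanguages>...a language...</summaryLanguages>
</languages>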

creators: A container element for the creators of the document containing creator elements for each creator according to the creatorType component.

contributors: A container element for the contributors of the document containing contributor elements for each contributor according to the contributorType component.

classificationCategories: A container element for the classification categories of the document containing classificationCategory elements for each classification according to the classificationCategoryType component.
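
The three container elements above follow the same pattern: a plural container holding one singular element per entry. A minimal sketch, with placeholders where the creatorType, contributorType, and classificationCategoryType components would supply the actual content:

<creators>
   <!-- one creator element per creator -->
   <creator>...a creator...</creator>
   <creator>...another creator...</creator>
</creators>
<contributors>
   <!-- one contributor element per contributor -->
   <contributor>...a contributor...</contributor>
</contributors>
<classificationCategories>
   <!-- one classificationCategory element per classification -->
   <classificationCategory>...a classification...</classificationCategory>
</classificationCategories>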

titles: A container element for the titles of the document containing title elements for each title. The title element is divided into the maintitle and, if applicable, the subtitle element.

<titles>
   <title>
      <maintitle xml:lang="en">...maintitle... </maintitle>
   </title>
</titles>

listsOfReferences: A container element for the listOfReferences element. This element is used for references to the papers included in Swedish theses, also known as the "List of Papers".

<listsOfReferences>
   <listOfReferences type="listOfPapers">
      <references number="1">
         <reference>...a reference... </reference>
      </references>
   </listOfReferences>
</listsOfReferences>

abstracts: A container element for the abstract element that contains abstracts in different languages. The abstracts are divided into paragraph elements.

<abstracts>
   <abstract xml:lang="en">
      <paragraph>...a paragraph...</paragraph>
   </abstract>
</abstracts>
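
Since abstracts can be given in different languages, a document with an English and a Swedish abstract could be sketched like this. The language code sv and the number of paragraphs are chosen for illustration only.

<abstracts>
   <!-- one abstract element per language, distinguished by xml:lang -->
   <abstract xml:lang="en">
      <paragraph>...first paragraph in English...</paragraph>
      <paragraph>...second paragraph in English...</paragraph>
   </abstract>
   <abstract xml:lang="sv">
      <paragraph>...a paragraph in Swedish...</paragraph>
   </abstract>
</abstracts>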

contents: Contains all other parts of the document (e.g. the fulltext).

<contents>...the fulltext contents...</contents>

note: This element can contain any comments about the document.

<note>...a note...</note>

Manifestations

The manifestations element is a container for one or more manifestation elements that contain information about a particular format of the document (printed or electronic). The manifestation element can contain the following elements:

properties: A container element for the attributes or the properties of the manifestation contained in property elements.

<properties>
   <property>book</property>
   <property>physicalMedium</property>
</properties>

serialIssues: The series that the manifestation belongs to is specified under this element. serialIssues is a container element for serialIssue elements that contain the serialPublicationType component.

date: Dates of publication and availability in the date element according to the dateType component.

time: Times of publication and availability in the time element according to the timeType component.

edition: The edition of the manifestation.

<edition>Second edition</edition>

numberOfCopies: The number of manufactured/distributed copies of the manifestation.

<numberOfCopies>300</numberOfCopies>

publishers: A container element for publisher elements that contain the publisherType component.

distributors: A container element for distributor elements that contain the publisherType component.

archivers: A container element for archiver elements that contain the publisherType component.

identifiers: A container element for manifestation specific identifier elements that contain the identifiersType component.

extent: Describes pages and/or filesize of the manifestation.

<extent type="pages">35</extent>
<extent type="filesize">2041465</extent>
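
Putting some of these elements together, a printed manifestation might be sketched as below. The element names and values are taken from the examples above; the order of the child elements is an assumption and should be checked against the XML Schema [diva]. Elements whose content depends on components not shown here (such as publishers or identifiers) are left out.

<manifestations>
   <!-- one manifestation element per format of the document -->
   <manifestation>
      <properties>
         <property>book</property>
         <property>physicalMedium</property>
      </properties>
      <!-- child order below is assumed, not confirmed by this description -->
      <edition>Second edition</edition>
      <numberOfCopies>300</numberOfCopies>
      <extent type="pages">35</extent>
   </manifestation>
</manifestations>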

Formatting

Inline text formatting can be created in the title, maintitle, subtitle, note, and paragraph elements using the formattedText component.

Mappings

The DiVA Document Format for metadata has been mapped to a number of other metadata formats.

Sample files:

DiVA Document Format Metadata (original XML file)

Other formats created through transformations of the above file:

Dublin Core/RDF (XML)
MARCXML (XML)
METS (XML)
MODS (XML)
TEI Header (XML)
Dublin Core/HTML (text)
Endnote (text)
MARC 21 (text)
Reference Manager (text)

Fulltext Contents

The DiVA Document Format uses a subset of DocBook V4.3 [DocBook], as described in DocBook: The Definitive Guide [DefinitiveGuide], for the structural mark-up of the fulltext documents. This subset conforms to the word processor templates that are used in the DiVA Publishing System for the creation of fulltext contents.

The selected DocBook elements have not been modified and other elements have not been added. However, a stricter validation of elements may occur. The DiVA DocBook subset is defined by an XML schema [dbdiva] which is imported into the XML schema defining the DiVA metadata format. New DocBook elements may be added to the subset as the templates are being developed.

The Mathematical Markup Language (MathML) Version 2.0 (Second Edition) [MathML] is used for mathematical formulas.

Detailed Description of Fulltext Elements.

The Global Structure of the Fulltext Contents

The root element of the fulltext is book. It can contain four different subelements: dedication, chapter, bibliography, and index. The text is normally divided into several chapter elements.

<book>
   <dedication>...dedication...</dedication>
   <chapter>...first chapter text...</chapter>
   <chapter>...second chapter text...</chapter>
   <bibliography>...bibliography...</bibliography>
   <index>...index...</index>
</book>

The chapter element can in turn be divided into the subelements sect1 -- sect4.

<chapter>
   <sect1>...section 1...
      <sect2>...section 2...
         <sect3>...section 3...
            <sect4>...section 4...</sect4>
         </sect3>
      </sect2>
   </sect1>
</chapter>

Headings

Headings are created from the title element. Five heading levels can be created in chapters. The title element can also appear in several other contexts.

<chapter>
   <title>...heading level 1...</title>
   <sect1>...section 1...
      <title>...heading level 2...</title>
      <sect2>...section 2...
        <title>...heading level 3...</title>
        <sect3>...section 3...
           <title>...heading level 4...</title>
           <sect4>...section 4...
              <title>...heading level 5...</title>
           </sect4>
         </sect3>
      </sect2>
   </sect1>
</chapter>

Block Elements in Chapters and Sections

The chapter and sect elements may include the following block elements:

Lists

Lists are contained within the itemizedlist or the orderedlist elements. In an itemized list, each listitem is marked with a disc, circle, or square; the value is set in the mark attribute. In an ordered list, each listitem is marked with a numeral, letter, or other sequential symbol using the numeration attribute.

Each member of the list is contained in the listitem element. This element can contain the same block elements as chapters and sections, normally para.

<orderedlist numeration="upperroman">
   <listitem>
      <para>...item I...</para>
   </listitem>
   <listitem>
      <para>...item II...</para>
   </listitem>
</orderedlist>
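
An itemized list follows the same pattern. In this sketch the mark value disc is one of the three values named above:

<itemizedlist mark="disc">
   <!-- each listitem is rendered with the mark given in the attribute -->
   <listitem>
      <para>...first item...</para>
   </listitem>
   <listitem>
      <para>...second item...</para>
   </listitem>
</itemizedlist>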

Tables

Tables are contained within the table element. This element has two subelements: title (an optional title of a table) and tgroup (which surrounds a logically complete portion of a table).

The tgroup element contains colspec elements, whose attributes specify the presentation characteristics of the entries in a column, and the tbody element, which is a container for the table rows. Optionally, thead (table header) or tfoot (table footer) may be added.

The tbody, thead, and tfoot elements contain row elements, each holding one row of the table. Within a row, each entry element holds one cell. An entry can contain either text or most block and inline elements, except another table.

<table>
   <title>...title of table...</title>
   <tgroup cols="2">
      <colspec colnum="1" colname="col1"/> 
      <colspec colnum="2" colname="col2"/> 
      <tbody>
         <row>
            <entry>...cell 1 in table...</entry>
            <entry>...cell 2 in table...</entry>
         </row>
         <row>
            <entry>...cell 3 in table...</entry>
            <entry>...cell 4 in table...</entry>
         </row>    
      </tbody>
   </tgroup>
</table>
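
A table with a header row can be sketched as follows. In the DocBook table model, thead is placed before tbody, and the cols attribute of tgroup gives the number of columns; whether the DiVA subset schema enforces cols is an assumption here.

<table>
   <title>...title of table...</title>
   <!-- cols: number of columns, per the DocBook table model -->
   <tgroup cols="2">
      <colspec colnum="1" colname="col1"/>
      <colspec colnum="2" colname="col2"/>
      <!-- optional header rows come before the body -->
      <thead>
         <row>
            <entry>...heading of column 1...</entry>
            <entry>...heading of column 2...</entry>
         </row>
      </thead>
      <tbody>
         <row>
            <entry>...cell 1 in table...</entry>
            <entry>...cell 2 in table...</entry>
         </row>
      </tbody>
   </tgroup>
</table>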

Footnotes

Footnotes can be put in para or entry elements using the footnote element. The footnote can contain several subelements, normally para.

<footnote>
    <para>...footnote text...</para>
</footnote>

A cross reference to a footnote (often used in tables) can be created with the footnoteref element. Its linkend attribute is an IDREF that points to the id of a footnote, and it generates the same mark or link as the footnote to which it points.

<row>
   <entry>
      ... text...
      <footnote id='fn1a'>
         <para>... footnote text... </para>
      </footnote>
   </entry>
</row>
<row>
   <entry>
      ... text...
      <footnoteref linkend='fn1a'/>
   </entry>
</row>

Links to External Files

Links to external, non-text files can be created using the mediaobject element (a block element that can include a caption) or the inlinemediaobject element (used inside another element).

The types that can be used are: audioobject, imageobject, and videoobject which, in turn, contain the corresponding audiodata, imagedata or videodata element including the required attribute fileref which contains the URI to the object.

<mediaobject>
    <imageobject>   
        <imagedata fileref="...URI..." />   
    </imageobject>
    <caption>...caption...</caption>
</mediaobject>
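
The inline variant and the other object types follow the same pattern. The sketch below shows an inlinemediaobject inside a para and a mediaobject that links to a video file; the URIs are placeholders, as in the example above.

<para>
   ...text...
   <!-- inline object: no caption, placed inside another element -->
   <inlinemediaobject>
      <imageobject>
         <imagedata fileref="...URI..." />
      </imageobject>
   </inlinemediaobject>
   ...text...
</para>

<mediaobject>
   <!-- videoobject wraps videodata, which carries the required fileref -->
   <videoobject>
      <videodata fileref="...URI..." />
   </videoobject>
   <caption>...caption...</caption>
</mediaobject>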

Mathematical Formulas

The following elements can contain mathematical formulas expressed in MathML: equation (a block element with a title), informalequation (a block element without a title), and inlineequation (used inside another element).

<equation>
    <title>...caption...</title>
    <mml:math>...math...</mml:math>
</equation>
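
For illustration, an inline formula for x squared could be written as below. The mml prefix is bound to the MathML namespace; declaring it on the math element itself, rather than on an ancestor element, is an assumption made to keep the sketch self-contained.

<para>
   ...text...
   <inlineequation>
      <!-- namespace declared locally for this sketch -->
      <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML">
         <mml:msup>
            <mml:mi>x</mml:mi>
            <mml:mn>2</mml:mn>
         </mml:msup>
      </mml:math>
   </inlineequation>
   ...text...
</para>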

Formatting

The emphasis element is used for inline text formatting together with the subscript and superscript elements. The role attribute of emphasis can contain the values bold, italic or underlined.

<emphasis role="bold">...bold text...</emphasis>

Nested elements are used for multiple formatting:

<emphasis role="bold">
   <emphasis role="italic">...bold italics...</emphasis>
</emphasis>

If subscript or superscript text is itself formatted, the subscript or superscript element is placed inside emphasis:

<emphasis role="italic">
   <subscript>...subscript italics...</subscript>
</emphasis>
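
Without further formatting, subscript and superscript are used directly inside the text. The sample text here is chosen for illustration only:

<para>
   <!-- plain subscript and superscript, no emphasis wrapper needed -->
   H<subscript>2</subscript>O and 10<superscript>3</superscript>
</para>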

The role attribute can be used for block formatting in the para, blockquote, itemizedlist, and orderedlist elements. The value can be set to indent or, in the case of para, to preceedingLineBreak.

<para role="indent">...indented paragraph...</para>
<para role="preceedingLineBreak">...empty line before this paragraph...</para>

Bibliography

Bibliographies are contained within the bibliography element. Sections in bibliographies are created within the bibliodiv element. The DocBook bibliomixed model is used for each item in the bibliography.

<bibliography>
   <title>References</title>
   <bibliodiv>
      <bibliomixed>
         <bibliomset relation="article">
            <author>...author of article...</author>
            <title>...title of article...</title>
            <pubdate>...publication date of article...</pubdate>
            <pagenums>...pages of article...</pagenums>
            <biblioid class="uri">...uri to article...</biblioid>
         </bibliomset>
         <bibliomset relation="journal">
            <title>...title of journal...</title>
            <volumenum>...volume of journal...</volumenum>
            <issuenum>...issue of journal...</issuenum>
         </bibliomset>      
      </bibliomixed>
   </bibliodiv>
</bibliography>

Index

Index terms, which identify text that is to be placed in the index, are contained within the indexterm element. Index terms can be primary, secondary, or tertiary, and can also carry see and seealso entries.

<para>
   ...text... 
   <indexterm>
      <primary>...primary index term...</primary>
      <secondary>...secondary index term...</secondary>
   </indexterm>
   ...text... 
</para>
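
A see or seealso entry is placed inside the same indexterm. The following sketch, with placeholder text, shows one term that redirects to another and one that adds a related term:

<para>
   ...text...
   <!-- redirects the reader to another index term -->
   <indexterm>
      <primary>...primary index term...</primary>
      <see>...preferred index term...</see>
   </indexterm>
   <!-- points to a related index term in addition to this one -->
   <indexterm>
      <primary>...primary index term...</primary>
      <seealso>...related index term...</seealso>
   </indexterm>
   ...text...
</para>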

Indexes are contained within the index element. Sections in indexes are created within the indexdiv element. The indexentry element wraps all of the entries associated with a particular primary index term, which is given in the primaryie element. These can include any number of secondaryie and tertiaryie elements as well as seeie and seealsoie elements.

<index>
   <title>Index</title>
   <indexdiv>
      <title>...title of index section...</title>
      <indexentry>
         <primaryie>...primary index term...</primaryie>
         <secondaryie>...secondary index term...</secondaryie>
      </indexentry>
   </indexdiv>
</index>
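
An entry with all three levels and a cross reference might look as follows. The ordering used here (seealsoie directly after primaryie, tertiaryie after its secondaryie) follows the DocBook content model for indexentry; whether the DiVA subset restricts it further is not stated in this description.

<indexentry>
   <primaryie>...primary index term...</primaryie>
   <!-- cross reference belonging to the primary term -->
   <seealsoie>...related index term...</seealsoie>
   <secondaryie>...secondary index term...</secondaryie>
   <!-- tertiary entries follow the secondary entry they belong to -->
   <tertiaryie>...tertiary index term...</tertiaryie>
</indexentry>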

References

[XML]: Extensible Markup Language (XML)
[EPC]: Electronic Publishing Centre at Uppsala University Library
[DLib]: The DiVA Project - Development of an Electronic Publishing System, Eva Müller, Uwe Klosa, Stefan Andersson, and Peter Hansson, D-Lib Magazine, November 2003
[diva]: DiVA Metadata XML Schema, Version 1.0
[frbr]: Functional Requirements for Bibliographic Records Final Report, 1998
[DocBook]: DocBook official homepage
[DefinitiveGuide]: DocBook: The Definitive Guide, Norman Walsh and Leonard Muellner, 2003
[dbdiva]: DiVA DocBook Subset XML Schema, Version 1.0
[MathML]: Mathematical Markup Language (MathML) Version 2.0 (Second Edition), W3C Recommendation
