Support for DOI bibliographic information auto-fetch through CrossRef

Author’s picture Nick Nicholas Author’s picture Andrei Kislichenko on 28 Dec 2022

Introduction

Digital Object Identifier and the DOI system

Since 1997, the Digital Object Identifier (DOI) service has been part of the essential infrastructure of digital publishing, registering and managing a unique identifier (DOI identifier) for all publications.

The DOI identifier scheme was ratified as an ISO standard as ISO 26324 (ISO 26324:2012, ISO 26324:2022), with the actual implementation managed by the DOI Foundation, which delegates registration matters to DOI registration agencies. DOI identifiers are lodged with a DOI registration agency.

As publishers and agencies have embraced the DOI, these identifiers have become ubiquitous, and have been adopted not only for most new publications, whether electronic or hardcopy, but increasingly for past publications as well, especially as those publications are reissued online.

One of the main advantages to a widely used identifier for publications is that value-added services can use that identifier to provide information about the publication.

Crossref

Among a wide range of online services that operate on the DOI identifier, Crossref is of particular interest in the data it collects and offers in the topic of bibliography.

Crossref, now a DOI registration agency, was originally established by a consortium of member publishers intended to facilitate stable cross-references (hence, "cross-ref") and citations across academic journal articles.

Therefore it is in the interests of Crossref, not only to register DOIs for publications, and use them in online publication, but also to associate those DOIs with bibliographic information that can be used in citations.

As a result, Crossref has one of the largest online bibliographic databases in existence, counting more than 141 million records today.

Note
It is worth noting that out of the millions of DOI identifiers, still only a fraction of them have bibliographic information on Crossref. So if Crossref doesn’t return information for a particular DOI identifier, don’t fret.

Crossref bibliographic API

Crossref has made this database publicly available in recent years through an API, and is committed to making citation information openly available.

The Crossref API data is natively in a Crossref JSON Schema, but it can readily be exported into popular bibliographic export formats.

In fact, a search on the Crossref home page for DOI 10.1515/9783110889406.257 returns on its landing page links not only to the Native Crossref JSON record and the URI of the associated resource, but also exports of the JSON record (via Javascript, as "Cite"), both machine-readable (BibTeX, RIS), and as human-readable renderings (APA, Harvard, MLA, Vancouver, Chicago).

Relaton integration with Crossref

Relaton-DOI (the relaton-doi gem) integrates the Crossref bibliographic database through the Crossref API, and Crossref bibliographic records are now available to the Relaton ecosystem.

In the following examples, we show the method of access to a Crossref bibliographic record given its DOI identifier, and how to use it in your chosen environment.

We also outline some of the enhancements we have brought to Crossref data, and some of the caveats to bear in mind while using it.

Fetching Crossref data from Relaton CLI

Like all Relaton gems, a reference can be fetched from relaton-doi through the Relaton CLI.

Install Relaton-DOI at the command line
$ gem install relaton-cli
Fetching bibliographic data using a DOI identifier in Relaton XML format
$ relaton fetch doi:10.1515/9783110889406.257
# returns record for 10.1515/9783110889406.257 in Relaton XML
Fetching bibliographic data using a DOI identifier in Relaton YAML format
$ relaton fetch doi:10.1515/9783110889406.257 -f yaml
# returns record for 10.1515/9783110889406.257 in Relaton YAML
Fetching bibliographic data using a DOI identifier in BibTeX format
$ relaton fetch doi:10.1515/9783110889406.257 -f bibtex
# returns record for 10.1515/9783110889406.257 in Bibtex

As with any command line tool, the output of these commands can be redirected to a text file and stored.

The command line tool consumes identifiers from a range of sources, and it requires doi: to be prefixed to each DOI, so they can be recognized as such.

So, if you wish to build up a BibTeX database of bibliographic entries given some DOIs, you might write a shell script to loop through the DOIs, and write each BibTeX object fetched into a file, like this:

bash script using Relaton to fetch Crossref data using DOI identifiers into a BibTeX file
#!/bin/bash
StringVal="10.1515/9783110889406.257 10.6028/nist.ir.8245 doi:10.1215/9781478007609-047"
echo "" > my.bibtex
for doi in $StringVal; do
  relaton fetch doi:$doi -f bibtex >> my.bibtex
done

Ruby

The relaton-doi gem can be used natively in your Ruby code, although this presupposes that you are already coding within the relaton Ruby ecosystem.

The received objects will be native Relaton objects, which will eventually need to be exported to XML, YAML, or BibTeX.

Ruby code that fetches Relaton bibliographic objects using DOI identifiers
> require 'relaton_doi'
=> true

# get NIST standard
> RelatonDoi::Crossref.get "doi:10.6028/nist.ir.8245"
[relaton-doi] ["doi:10.6028/nist.ir.8245"] fetching...
[relaton-doi] ["doi:10.6028/nist.ir.8245"] found 10.6028/nist.ir.8245
=> #<RelatonNist::NistBibliographicItem:0x00007ff22420d820
...

Enhancements and caveats

Challenges with Crossref data

Crossref is an unprecedented boon to users of electronic bibliographies: it is centralised source offering access to a huge and growing number of bibliographic records, with a universal and broadly discoverable identifier.

That said, there are downsides to the bibliographic metadata from Crossref:

  • Quality of bibliographic data is highly variable;

  • Metadata provided by Crossref is not normalized;

  • Encoding of metadata is often inconsistent;

  • Critical data is often missing.

These caveats stem from the fact that Crossref metadata is entered by the publishers themselves, and there is no effort in normalizing data provided by the different publishers.

We encourage users to use Crossref data as a first pass to populating bibliographic information quickly and at scale, but we also recommend that users review the bibliographic data fetched, and fill in any gaps they can identify.

Relaton repairs Crossref data

Relaton-DOI implements a number of enhancement mechanisms on top of the Crossref API data to remedy some of the clearly missing information.

Missing host items

Crossref does not normally in the fetched record any details of the host item — the edited volume or proceedings that a paper may be included in.

So the Crossref record will include the title of the volume the chapter or paper is included in, as well as the publisher and place of publication. However, it will not include the editors of the edited volume or proceedings.

So a search for 10.1515/9783110889406.257 natively returns, in BibteX:

Host item missing in Crossref bibliographic data
@incollection{Heller,
  doi = {10.1515/9783110889406.257},
  url = {https://www.degruyter.com/document/doi/10.1515/9783110889406.257/html},
  publisher = {{DE} {GRUYTER} {MOUTON}},
  author = {Monica Heller},
  title = {Gender and public space in a bilingual school},
  booktitle = {Multilingualism, Second Language Learning, and Gender}
}

But the editors of the volume Multilingualism, Second Language Learning, and Gender are not included in the Crossref record. Relaton-DOI makes a best-effort search, and will in most cases return the editors as well as the author.

So in the foregoing case, Relaton-DOI will look up the record for the edited volume Multilingualism, Second Language Learning, and Gender, and the record it returns will include its editors.

Host item restored to Crossref bibliographic data
@inbook{heller-a,
  title = {Gender and public space in a bilingual school},
  author = {Heller, Monica},
  editor = {Pavlenko, Aneta and Blackledge, Adrian and Piller, Ingrid and Teutsch-Dwyer, Marya},
  booktitle = {Multilingualism, Second Language Learning, and Gender},
  publisher = {DE GRUYTER MOUTON},
  address = {Berlin},
  timestamp = {2023-01-15},
  doi = {10.1515/9783110889406.257},
  url = {https://www.degruyter.com/document/doi/10.1515/9783110889406.257/html}
}

Stray delimiters in data fields

Some Crossref records, particularly from older bibliographic sources, have messy data incorporating delimiters.

For example, the record for 10.5962/bhl.title.124254, a publication from the year 1852, includes trailing punctuation in its title, place of publication, and publisher fields. These issues should not exist in the data.

Crossref fields containing stray delimiters
@book{kuster1852a,
  title = {Die Gattungen Pupa, Megaspira, Balea und Tornatellina : in Abbildungen nach der Natur mit Beschreibungen /},
  author = {H. C. Kuster and Johann Hieronymus Chemnitz and Friedrich Heinrich Wilhelm Martini},
  publisher = {Verlag von Bauer und Raspe (Julius Merz),},
  year = {1852},
  address = {Nürnberg :},
  doi = {http://dx.doi.org/10.5962/bhl.title.124254},
  url = {http://www.biodiversitylibrary.org/bibliography/124254}
}

Relaton-DOI cleans up the fields to the extent reasonable:

Crossref data in Relaton cleaned up of stray delimiters
@book{kuster1852a,
  title = {Die Gattungen Pupa, Megaspira, Balea und Tornatellina : in Abbildungen nach der Natur mit Beschreibungen},
  author = {Kuster, H. C. and Chemnitz, Johann Hieronymus and Martini, Friedrich Heinrich Wilhelm},
  publisher = {Verlag von Bauer und Raspe (Julius Merz)},
  year = {1852},
  address = {Nürnberg},
  timestamp = {2023-02-02},
  doi = {http://dx.doi.org/10.5962/bhl.title.124254},
  url = {http://www.biodiversitylibrary.org/bibliography/124254}
}

Capitalization of names

On occasion, author and editor names appear in all capital letters.

Relaton-DOI will change these to title case, so long as the name is more than two letters long.

You may need to review the results, to catch camel case exceptions such as "MacDonald".

Uncorrectable and irrecoverable Crossref data (caveats)

There are however areas where Crossref is missing information, and for which nothing can be done but to emend the record after fetching it.

In particular:

Page numbers of journal articles missing

Crossref quite often omits the page numbers of a journal article, even though it retains the volume and issue of the article.

Note
Page numbers are no longer essential for online access; but if a journal is published in print, or even in a medium emulating print (PDF), page numbers are still expected in citations.

Missing series information

Crossref usually does not include the series of a monograph in its data.

Undifferentiated volume numbering for restarted journals

Some journals restart their volume numbering while keeping the same title; in a few cases, they do so multiple times.

Bibliographies indicate this where applicable, before the volume number ("New Series", "3rd Series"). Crossref does not differentiate between different runs (i.e. numberings) of journals, so that information cannot be included in any citations.

Missing host item

The details of the host item may be impractical to retrieve, particularly if the host item is a large reference work, containing many items each with their own DOI.

Non-normalized place of publication

The place of publication is free text, and is not systematically broken down into city, region, country.

Author names not broken down properly

Occasionally, a personal name is not broken down into forename and surname, but is presented in its entirety as the surname.

Non-normalized organization names and their abbreviations

Records may use abbreviations of organisations instead of the full names, and some records may mix the two in different fields (e.g. both "IEEE" and "Institute of Electric and Electronic Engineers").

Missing publication identifiers

The identifiers of standards and reports are often not included in the record.

Conclusion

Relaton now supports bibliographic information fetching using DOI identifiers through the Crossref service.

Despite the inherent data deficiencies of the Crossref database, it is nonetheless a boon for users who wish to retrieve citable information.

As well said in old adage, something is better than nothing. And to keep us feeling warm and fuzzy, any improvement in Crossref metadata by the publishers will benefit all Relaton users!