An Online Data-Sharing Solution
Dataverse Citations

The Dataverse Network Project standardizes data set citation. Until this project, citations to data were inconsistent or nonexistent in many publications, leaving future access and scholarly recognition highly uncertain. The citation standard used here offers proper recognition to authors as well as permanent identification (via global persistent identifiers rather than only URLs, which tend to change frequently). Through universal numerical fingerprints (UNFs), it also gives the scholarly community a guarantee that future researchers will be able to verify that the data retrieved is identical to that used in a publication decades earlier, even if the storage media, operating systems, hardware, and statistical program format have changed.

Here's a real example of a replication data-set citation (from King and Zeng, 2007, International Studies Quarterly, p. 209):

Gary King; Langche Zeng, 2006, "Replication Data Set for `When Can History be Our Guide? The Pitfalls of Counterfactual Inference'" hdl:1902.1/DXRXCFAWPK UNF:3:DaYlT6QSX9r0D50ye+tXpA== Murray Research Archive [distributor]

This citation has six components: three readable by humans (the author, title, and year), two that are machine-readable (the identifier and the fingerprint), and one that is optional. The unique global identifier begins "hdl:" (which refers to the international Handle System); the universal numerical fingerprint starts "UNF:". The identifier is designed to persist even if URLs -- or the web itself -- are replaced with something else. When the citation appears online, the identifier is hot-linked to the URL that resolves it, which works in browsers available today; when in print, the URL is also included in the citation. Four features make the UNF especially useful:

  • The UNF algorithm's cryptographic technology ensures that the alphanumeric identifier will change when any portion of the data set changes. Not only does this assure future researchers that they're using the same data set referenced in a 10-year-old journal article, it allows the data set's owner to keep track of each iteration of his or her research. If the original data set is updated or incorporated into a new, related data set, the algorithm will generate a unique UNF each time.
  • The UNF is determined by the content of the data, not the format in which it was stored. Let's say the data set was created in Stata or R. Five years later, you'd like to look at it again, but the data has been converted to the next big thing (NBT). You can use NBT, recompute the UNF, and verify that the data set you're downloading is the same one created in Stata or R. That is, the UNF will not change.
  • Knowing only the UNF, a journal's editors can be confident that they're discussing one specific data set, since any change to the data would produce a different UNF. This holds even if they do not have permission to see the data. In a sense, the UNF is the ultimate summary statistic.
  • The UNF's non-invertible cryptographic properties guarantee that learning the UNF of a data set conveys no information about the content of the data. Authors can take advantage of this property to distribute the full citation of a data set -- including the UNF -- even if the data is proprietary or highly confidential, all without the risk of disclosure.
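
The first two properties above (the fingerprint changes with the content but not with the storage format) can be illustrated with a small sketch. This is not the UNF algorithm: the function name `toy_fingerprint`, the rounding to seven significant digits, the choice of SHA-256, and the "TOY:" prefix are all assumptions made for illustration only. It simply normalizes each value to a canonical text form, hashes the result with a one-way function, and encodes the digest as text.

```python
import base64
import hashlib


def toy_fingerprint(values, digits=7):
    """Illustrative content-based fingerprint. NOT the real UNF algorithm."""
    hasher = hashlib.sha256()
    for value in values:
        # Normalize every value to the same canonical text form, so the
        # original storage format (int, float, or string) does not matter.
        canonical = format(float(value), f".{digits - 1}e")
        hasher.update(canonical.encode("utf-8") + b"\n")
    # Keep 128 bits of the one-way digest and encode them as printable text.
    return "TOY:" + base64.b64encode(hasher.digest()[:16]).decode("ascii")


# The same content held in two different representations.
as_floats = [1.5, 2.0, 3.25]
as_strings = ["1.50", "2", "3.25"]
# One value edited.
modified = [1.5, 2.0, 3.26]

if __name__ == "__main__":
    print(toy_fingerprint(as_floats) == toy_fingerprint(as_strings))  # same content
    print(toy_fingerprint(as_floats) == toy_fingerprint(modified))    # changed content
```

In this sketch the two representations of the same numbers produce the same fingerprint, while the edited data set produces a different one. A real UNF implementation defines its own normalization and hashing rules, so fingerprints from this toy function are not comparable with actual UNFs.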

Citations may also have optional features, in standard format, such as "Murray Research Archive [distributor]", which gives a type in square brackets selected from a given controlled vocabulary.
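
As a rough illustration of how the machine-readable parts can be pulled out of such a citation, here is a minimal sketch. The regular expression, the field names, and the helper functions `parse_citation` and `handle_url` are hypothetical and fitted only to the layout of the example citation above; they are not part of the citation standard. The resolver address `https://hdl.handle.net/` is assumed here as the public Handle System proxy.

```python
import re

# The example citation from the text, copied verbatim.
CITATION = (
    "Gary King; Langche Zeng, 2006, \"Replication Data Set for "
    "`When Can History be Our Guide? The Pitfalls of Counterfactual Inference'\" "
    "hdl:1902.1/DXRXCFAWPK UNF:3:DaYlT6QSX9r0D50ye+tXpA== "
    "Murray Research Archive [distributor]"
)

# Hypothetical pattern: authors, year, quoted title, "hdl:" identifier,
# "UNF:" fingerprint, then an optional trailing component.
CITATION_RE = re.compile(
    r'^(?P<authors>.+?), (?P<year>\d{4}), "(?P<title>.+)" '
    r"(?P<identifier>hdl:\S+) (?P<unf>UNF:\S+)"
    r"(?: (?P<optional>.+))?$"
)


def parse_citation(text):
    """Split a citation in the assumed layout into named components."""
    match = CITATION_RE.match(text.strip())
    if match is None:
        raise ValueError("citation does not follow the assumed layout")
    return match.groupdict()


def handle_url(identifier, resolver="https://hdl.handle.net/"):
    """Build a browser-usable URL from an 'hdl:' identifier (assumed resolver)."""
    return resolver + identifier[len("hdl:"):]


if __name__ == "__main__":
    fields = parse_citation(CITATION)
    for name, value in fields.items():
        print(f"{name}: {value}")
    print(handle_url(fields["identifier"]))
```

Because the identifier and the fingerprint each start with a fixed prefix, they can be located without understanding the human-readable parts. Citations in other layouts would need a different pattern.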

Learn more: Micah Altman and Gary King. 2007. "A Proposed Standard for the Scholarly Citation of Quantitative Data," D-Lib Magazine, Vol. 13, No. 3/4 (March).

(C) Dataverse Network Project