How Hypothesis interacts with document metadata

2019-11-11T10:47:20-08:00

Hypothesis uses common metadata conventions to identify and alias documents so we can sync annotations across multiple versions of the same document. Among the metadata conventions used are:

  • Search-oriented HTML tags
  • DOI-related HTML tags
  • PDF metadata
  • PDF-related HTML metadata

Some of these conventions are deployed to guide search engines; others to help organize scholarly literature. You can also use them to influence how Hypothesis identifies and aliases documents.

This help article will explain how the Hypothesis system uses common metadata conventions, and provide recommendations to ensure your documents are identified and aliased appropriately by Hypothesis.

Search-oriented HTML tags

The most common search-oriented HTML tags are <link rel="canonical"> and <link rel="alternate">.

Briefly, <link rel="canonical"> is used by Google to consolidate content that can be accessed at multiple URLs, and <link rel="alternate"> points to a syndication feed or to alternate versions of content. Hypothesis unifies documents that contain the same <link rel="canonical"> tag and does not unify documents that contain the same <link rel="alternate"> tag (with one exception, detailed below).

Google offers the following guidance for <link rel="canonical"> tags:

If you have a single page accessible by multiple URLs, or different pages with similar content (for example, a page with both a mobile and a desktop version), Google sees these as duplicate versions of the same page. Google will choose one URL as the canonical version and crawl that, and all other URLs will be considered duplicate URLs and crawled less often.

When an HTML page is annotated, if this tag is present, Hypothesis will use it to create an internal identifier for the document and associate two aliases with it:

  • The URL of the document being annotated (document URL)
  • The URL named in the <link rel="canonical"> tag (canonical link)

If the document URL isn’t already among the aliases associated with the canonical link, Hypothesis adds the missing alias. From then on, annotations posted to either version accrue to the same document identifier.

To show how this works using Google’s example of a page with desktop and mobile versions, let’s assume you’ve created HTML pages with the following URLs:

  • http://example.com/desktop.html
  • http://example.com/mobile.html

If you want Google to prefer the desktop version, you’ll add <link rel="canonical" href="http://example.com/desktop"> to the <head> of both pages.

Let’s assume no one has annotated your website before, so it’s not currently in our database.

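The first time someone annotates http://example.com/desktop.html, Hypothesis creates an internal identifier for the document and associates two aliases with it: the document URL and the canonical link.

Next, let’s say someone annotates http://example.com/mobile.html. Because that page declares the same canonical link, Hypothesis finds the existing identifier and adds http://example.com/mobile.html as the missing alias. From then on, annotations posted to either page accrue to the same document.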

Note that there is currently no way for a Hypothesis user to remove aliases interactively or by means of the Hypothesis API – meaning once a document has been associated with a canonical URL, there isn’t a way to “un-associate” it.
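The desktop and mobile walkthrough can be sketched as a toy in-memory model. This is only an illustration of the aliasing behavior described in this article, not Hypothesis’s actual implementation; the AliasStore class and its method names are hypothetical.

```python
class AliasStore:
    """Toy model of document aliasing (hypothetical, not Hypothesis's code)."""

    def __init__(self):
        self._doc_for_uri = {}  # maps each known URI to a document id
        self._next_id = 1

    def record(self, document_url, canonical_link=None):
        """Register an annotated page and return its document id."""
        uris = [document_url]
        if canonical_link:
            uris.append(canonical_link)
        # Reuse the id of any URI already seen; otherwise mint a new one.
        doc_id = next(
            (self._doc_for_uri[u] for u in uris if u in self._doc_for_uri), None
        )
        if doc_id is None:
            doc_id = self._next_id
            self._next_id += 1
        # Add whichever aliases are missing. Nothing is ever removed,
        # mirroring the note above that aliases cannot be un-associated.
        for u in uris:
            self._doc_for_uri.setdefault(u, doc_id)
        return doc_id

    def aliases(self, doc_id):
        return sorted(u for u, d in self._doc_for_uri.items() if d == doc_id)


store = AliasStore()
CANONICAL = "http://example.com/desktop"  # the href used in the example above
first = store.record("http://example.com/desktop.html", CANONICAL)
second = store.record("http://example.com/mobile.html", CANONICAL)
print(first == second)
print(store.aliases(first))
```

Both calls return the same identifier, because the second page is matched through the canonical link that the first page already registered.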

The Mozilla Developer Network offers the following guidance for <link rel="alternate">:

If the type is set to application/rss+xml or application/atom+xml, the <link> defines a syndication feed. The first one defined on the page is the default.

Otherwise, the <link> defines an alternative page, of one of these types:

  • for another medium, like a handheld device (if the media attribute is set)
  • in another language (if the hreflang attribute is set)
  • in another format, such as a PDF (if the type attribute is set)
  • a combination of these

In general, Hypothesis does not use the alternate link relation to alias documents.

Let’s say you have two web pages, http://example.com/english and http://example.com/spanish. Annotations that anchor to phrases on the /english page won’t anchor to phrases on the /spanish page, so it doesn’t make sense for Hypothesis to treat these pages as the same document. Even if both pages contain the same <link rel="alternate"> tags, annotations made on one page will not show on the other.

There is one case where Hypothesis will handle an alternate link relation in the same way as a canonical link relation: if your http://example.com/mobile.html page includes <link rel="alternate" href="/desktop.html">, the Hypothesis system will alias your /mobile.html and /desktop.html versions.

DOI-related HTML tags

In the scholarly world, the digital object identifier (DOI) can unify various online appearances of the same article. For example, a research paper might be hosted by its journal of origin at one URL and by PubMed Central at another URL. The DOI provides a common way to cite the article.

There are two ways for a scholarly web page to declare its URL as an alias for a DOI. The most common method is a de facto standard known as the Highwire Press tag set, popular because it’s supported by Google Scholar. Another way comes from the Dublin Core Metadata Initiative. It’s common for both of these to be included in scholarly metadata. For the Hypothesis system, either is sufficient to establish a DOI/URL mapping for the purpose of annotation.

Highwire Press tags

An article published by Cell includes the following tag: <meta name="citation_doi" content="10.1016/j.ajhg.2017.02.007">.

That same article at PubMed Central includes the same tag.

When someone annotates this article, the Hypothesis client tells the Hypothesis server that the article’s URL is an alias of the DOI. The first time the server sees such an assertion, it adds the alias.
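As a rough illustration of the step where the client discovers the DOI, the following sketch reads the citation_doi tag from a page and forms the doi: URI that appears in the search query in this article. It uses only Python’s standard html.parser module; the MetaCollector class and doi_uri_from_html function are hypothetical names, and the real Hypothesis client is not written this way.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <meta name=... content=...> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}  # lower-cased tag name -> list of content values

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content:
            self.meta.setdefault(name.lower(), []).append(content)


def doi_uri_from_html(html):
    """Return 'doi:<value>' for the first citation_doi tag, or None."""
    collector = MetaCollector()
    collector.feed(html)
    dois = collector.meta.get("citation_doi", [])
    return f"doi:{dois[0]}" if dois else None


page = (
    "<html><head>"
    '<meta name="citation_doi" content="10.1016/j.ajhg.2017.02.007">'
    "</head><body></body></html>"
)
print(doi_uri_from_html(page))
```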

When the Hypothesis client then searches for annotations on the article, it searches for two URIs at once: the URL of the article at PubMed Central or Cell, and the DOI. Here’s the API query the client makes at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5384036/:

https://hypothes.is/api/search?uri=https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5384036/&uri=doi:10.1016/j.ajhg.2017.02.007
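A query like the one above can be assembled with the standard library. This is a minimal sketch: build_search_url is a hypothetical helper, and the only things taken from this article are the search endpoint and the repeated uri parameter. Note that urlencode percent-encodes the values, so the resulting string looks different from the unencoded URL shown above while carrying the same two uri values.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(uris, api_root="https://hypothes.is/api/search"):
    """Build a search URL that repeats the uri parameter once per alias."""
    query = urlencode([("uri", u) for u in uris])
    return f"{api_root}?{query}"


url = build_search_url(
    [
        "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5384036/",
        "doi:10.1016/j.ajhg.2017.02.007",
    ]
)
print(url)

# Decoding the query string recovers both aliases.
print(parse_qs(urlsplit(url).query)["uri"])
```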

(You can make the same queries in the Hypothesis dashboard. A search for the journal URL, a search for the DOI, and a search for both all return the same result.)

Dublin Core metadata

When implementing Dublin Core metadata for articles with DOIs, a meta tag like this one is added to the <head> of a document:

<meta name="dc.identifier" content="10.1016/j.ajhg.2017.02.007">

Unlike citation_doi in the Highwire Press tags, dc.identifier is not always a DOI. The value of dc.identifier should arguably be prefixed with doi: when referring to a DOI, but in practice it commonly is not. So the Hypothesis system checks the dc.identifier value, and creates a URL-to-DOI alias only when the value matches DOI syntax. Patterns that match include:

<meta name="dc.identifier" content="10.000/123">
<meta name="dc.identifier" content="doi:10.000/123">

Note that the following patterns are also valid DOI syntax:

<meta name="dc.identifier" content="http://dx.doi.org/10.000/123">
<meta name="dc.identifier" content="https://doi.org/10.000/123">

But in these cases, the aliased URI does not coalesce. If you’re using Dublin Core metadata to establish URL aliases for DOIs and the annotations are not syncing as expected, check the formatting of your tags to ensure that a resolver prefix such as http://dx.doi.org/ or https://doi.org/ is not included in the content attribute of the dc.identifier tag.
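The distinction between the bare and resolver-URL forms can be checked with a small validator. The regular expression below is an assumption made for illustration (an optional doi: prefix, then 10., a numeric registrant code, a slash, and a suffix); it is not the pattern Hypothesis actually uses, and doi_alias is a hypothetical helper.

```python
import re

# Assumed pattern: optional "doi:" prefix, "10.", registrant code, "/", suffix.
DOI_VALUE = re.compile(r"^(?:doi:)?(10\.\d+(?:\.\d+)*/\S+)$", re.IGNORECASE)


def doi_alias(identifier):
    """Return a 'doi:...' alias for a bare or doi:-prefixed DOI, else None."""
    match = DOI_VALUE.match(identifier.strip())
    return f"doi:{match.group(1)}" if match else None


for value in (
    "10.000/123",
    "doi:10.000/123",
    "http://dx.doi.org/10.000/123",
    "https://doi.org/10.000/123",
):
    print(value, "->", doi_alias(value))
```

Under this sketch the first two values produce an alias and the two resolver URLs do not, which is one way to spot tags that need reformatting.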

Since the value of the dc.identifier tag needn’t contain a DOI, it can be used with dc.relation.ispartof to bind annotations to a document in a way that’s future-proofed against URL change. In this way, Dublin Core metadata is preferable to canonical links when creating aliases for web documents.

For HTML documents that don’t have DOIs, the dc.identifier can be a string or number generated by a formal identification system, or just a URL. The dc.relation.ispartof will connect your document to your site. This is illustrated nicely by eLIFE.

The scholarly articles there gain URL independence thanks to the DOI-related metadata tags we’ve already seen. But not every article on the site has a DOI. One blog post that doesn’t have a DOI instead makes these two declarations:

<meta name="dc.identifier" content="blog-article/e3d858b3">
<meta name="dc.relation.ispartof" content="elifesciences.org">

Together they form this URL-independent identifier:

urn:x-dc:elifesciences.org/blog-article/e3d858b3

It works, in Hypothesis, like the URL-independent identifier for an article with a DOI:

doi:10.1126/science.51.1305.8

And like a PDF fingerprint:

urn:x-pdf:db49e0a7b073bbadeb889a910835b716

When Hypothesis sees the combination of dc.identifier and dc.relation.ispartof it joins the values of the two tags to create a URL-independent identifier.
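The joining step can be sketched as follows. The output format is taken from the eLIFE example in this article; the dc_urn function is a hypothetical name, and the sketch does no escaping of the values, which a real implementation may well do.

```python
def dc_urn(meta):
    """Join dc.relation.ispartof and dc.identifier into a urn:x-dc: identifier.

    `meta` maps tag names to content values. Returns None unless both
    tags are present, since the identifier needs both parts.
    """
    identifier = meta.get("dc.identifier")
    is_part_of = meta.get("dc.relation.ispartof")
    if not identifier or not is_part_of:
        return None
    return f"urn:x-dc:{is_part_of}/{identifier}"


tags = {
    "dc.identifier": "blog-article/e3d858b3",
    "dc.relation.ispartof": "elifesciences.org",
}
print(dc_urn(tags))
```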

In this example it’s still domain-dependent, insofar as elifesciences.org appears in urn:x-dc:elifesciences.org/blog-article/e3d858b3. But eLIFE could move that post to another URL tomorrow without losing annotations bound to it. And that same declaration can occur in a copy of the blog post that’s published to blog.elifesciences.org, or to a site that syndicates the article from eLIFE, or to a completely different domain operated by eLIFE in the future.

PDF metadata

The Hypothesis system makes use of two different kinds of PDF-related metadata.

PDF fingerprints

The fingerprint of a PDF is embedded in the document itself, and is used by Hypothesis to sync annotations across copies of a PDF served from various locations. PDF fingerprinting is what allows Hypothesis users to collaboratively annotate local copies of PDFs.

If you want annotations to sync across multiple copies of a PDF, you’ll want to check the fingerprint of each copy to ensure they’re all the same.

Sometimes it’s desirable to prevent annotations from syncing across PDF copies. If you’re a teacher providing a PDF for two sections of a course, and you want to segment discussion on a per-section basis, you’ll need to create two versions of the PDF with different fingerprints.
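One way to compare copies is to read the identifier embedded in each file. The sketch below rests on an assumption this article does not state: that the fingerprint corresponds to the first element of the /ID array in the PDF trailer, written as hex strings. It is a rough check with a hypothetical function name, it scans raw bytes rather than parsing the PDF, and it will return None for files that store the ID in another form or omit it.

```python
import re

# Matches a trailer entry such as: /ID [<0123abcd...> <4567ef01...>]
ID_ARRAY = re.compile(rb"/ID\s*\[\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*\]")


def pdf_id_fingerprint(data):
    """Return the first /ID element as lowercase hex, or None if not found."""
    matches = ID_ARRAY.findall(data)
    if not matches:
        return None
    # Incrementally updated files can contain several trailers; use the last.
    return matches[-1][0].decode("ascii").lower()


# A fabricated trailer fragment for illustration, not a complete PDF.
sample = (
    b"trailer\n<< /Size 4 /ID [<DB49E0A7B073BBADEB889A910835B716> "
    b"<0F1E2D3C4B5A69788796A5B4C3D2E1F0>] >>\n"
)
print(pdf_id_fingerprint(sample))
```

To compare two local copies, read each file in binary mode (for example with pathlib.Path(path).read_bytes()) and check whether the two returned values are equal.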

PDF-related HTML metadata

Publishers can use an HTML <meta> tag to point from an HTML version of an article to a corresponding PDF. For example, https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0168597 includes:

<meta name="citation_pdf_url"
content="https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0168597&type=printable">

This declaration sets up an equivalence, in the Hypothesis server, between the HTML and PDF versions of the article. Annotations posted to either show up on both. Note that the value of the tag’s content attribute is a URL that points to a PDF directly, not to an HTML page that embeds the PDF. This form of aliasing coexists with, and complements, fingerprint-based aliasing.

Appendix

Examples of search-oriented tags

See Effects of search metadata on annotation

Examples of scholarly tags

See Effects of scholarly metadata on annotation
