Hypothesis uses common metadata conventions to identify and alias documents so we can sync annotations across multiple versions of the same document. Among the metadata conventions used are:
- Search-oriented HTML tags
- DOI-related HTML tags
- PDF metadata
- PDF-related HTML metadata
Some of these conventions are deployed to guide search engines; others to help organize scholarly literature. You can also use them to influence how Hypothesis identifies and aliases documents.
This help article will explain how the Hypothesis system uses common metadata conventions, and provide recommendations to ensure your documents are identified and aliased appropriately by Hypothesis.
Search-oriented HTML tags
The most common search-oriented HTML tags are <link rel="canonical"> and <link rel="alternate">. Google uses <link rel="canonical"> to consolidate content that can be accessed at multiple URLs, while <link rel="alternate"> points to a syndication feed or to alternate versions of content. Hypothesis unifies documents that contain the same <link rel="canonical"> tag and does not unify documents that contain the same <link rel="alternate"> tag (with one exception, detailed below).
Google offers the following guidance for <link rel="canonical"> tags:
If you have a single page accessible by multiple URLs, or different pages with similar content (for example, a page with both a mobile and a desktop version), Google sees these as duplicate versions of the same page. Google will choose one URL as the canonical version and crawl that, and all other URLs will be considered duplicate URLs and crawled less often.
When an HTML page is annotated, if this tag is present, Hypothesis will use it to create an internal identifier for the document and associate two aliases with it:
- The URL of the document being annotated (document URL)
- The URL named in the <link rel="canonical"> tag (canonical link)
If the document URL isn’t already among the aliases associated with the canonical link, Hypothesis adds the missing alias. From then on, annotations posted to either version accrue to the same document identifier.
To show how this works using Google’s example of a page with desktop and mobile versions, let’s assume you’ve created HTML pages with the following URLs:
- http://example.com/desktop.html
- http://example.com/mobile.html
If you want Google to prefer the desktop version, you’ll add <link rel="canonical" href="http://example.com/desktop.html"> to the <head> of both pages.
Let’s assume no one has annotated your website before, so it’s not currently in our database.
The first time someone annotates http://example.com/desktop.html, the following will happen:
- Hypothesis creates a “document” in its database
- This “document” is given two aliases: the document URL (http://example.com/desktop.html) and the canonical URL (in this case, also http://example.com/desktop.html).
- The annotations made here will show at http://example.com/desktop.html, and at http://example.com/mobile.html, because the /mobile.html page has <link rel="canonical" href="http://example.com/desktop.html"> in the <head>.
Next, let’s say someone annotates http://example.com/mobile.html. Here’s what will happen:
- Hypothesis sees the <link rel="canonical"> in the <head> of the page
- Hypothesis adds the document URL, http://example.com/mobile.html, as an alias for the canonical link (http://example.com/desktop.html)
- The annotations made here will show up at http://example.com/mobile.html and at http://example.com/desktop.html because both URLs are associated in the Hypothesis system with the canonical link, http://example.com/desktop.html.
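The two scenarios above can be sketched as a toy in-memory model. This is only an illustration of the aliasing behavior described in this article, not the Hypothesis server’s implementation; the AliasStore class and its method names are hypothetical.

```python
class AliasStore:
    """Toy model: every URI maps to a document, and annotations hang off the document."""

    def __init__(self):
        self._doc_for_uri = {}   # URI -> document id
        self._annotations = {}   # document id -> list of annotation texts
        self._next_id = 1

    def _document_for(self, uris):
        # Reuse a document already known under any of these URIs.
        # (Merging two pre-existing documents is out of scope for this sketch.)
        doc_id = next((self._doc_for_uri[u] for u in uris if u in self._doc_for_uri), None)
        if doc_id is None:
            doc_id = self._next_id
            self._next_id += 1
            self._annotations[doc_id] = []
        for u in uris:
            # Add any missing alias without disturbing existing ones.
            self._doc_for_uri.setdefault(u, doc_id)
        return doc_id

    def annotate(self, document_url, canonical_url, text):
        """Record an annotation made on document_url, whose page names canonical_url."""
        doc_id = self._document_for([document_url, canonical_url])
        self._annotations[doc_id].append(text)

    def annotations_at(self, *uris):
        """Look up annotations by any URI a page declares (its URL, its canonical link)."""
        for u in uris:
            if u in self._doc_for_uri:
                return list(self._annotations[self._doc_for_uri[u]])
        return []


desktop = "http://example.com/desktop.html"
mobile = "http://example.com/mobile.html"

store = AliasStore()
store.annotate(desktop, desktop, "note made on the desktop page")
# The mobile page has not been annotated yet, but it names the same canonical link.
print(store.annotations_at(mobile, desktop))

store.annotate(mobile, desktop, "note made on the mobile page")
# Both URLs now resolve to the same document.
print(store.annotations_at(mobile))
print(store.annotations_at(desktop))
```

Note that the model has no method for removing an alias, which mirrors the limitation described in this section.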
Note that there is currently no way for a Hypothesis user to remove aliases interactively or by means of the Hypothesis API – meaning once a document has been associated with a canonical URL, there isn’t a way to “un-associate” it.
The Mozilla Developer Network offers the following guidance for <link rel="alternate"> tags:
If type is set to application/rss+xml or application/atom+xml, the <link> defines a syndication feed. The first one defined on the page is the default.
Otherwise, the <link> defines an alternative page, of one of these types:
- for another medium, like a handheld device (if the media attribute is set)
- in another language (if the hreflang attribute is set)
- in another format, such as a PDF (if the type attribute is set)
- a combination of these
In general, Hypothesis does not use the alternate link relation to alias documents.
Let’s say you have two web pages, http://example.com/english and http://example.com/spanish. Annotations that anchor to phrases in the /english page won’t anchor to phrases on the /spanish page, so it doesn’t make sense for Hypothesis to treat these pages as the same document. Even if both pages contain the same <link rel="alternate"> tags, annotations made on one page will not show on the other.
There is one case where Hypothesis will handle an alternate link relation in the same way as a canonical link relation: if your http://example.com/mobile.html page includes <link rel="alternate" href="/desktop.html">, the Hypothesis system will alias your /mobile.html page to /desktop.html.
DOI-related HTML tags
In the scholarly world, the digital object identifier (DOI) can unify various online appearances of the same article. For example, a research paper might be hosted by its journal of origin at one URL and by PubMed Central at another URL. The DOI provides a common way to cite the article.
There are two ways for a scholarly web page to declare its URL as an alias for a DOI. The most common method is a de facto standard known as the Highwire Press tag set, popular because it’s supported by Google Scholar. Another way comes from the Dublin Core Metadata Initiative. It’s common for both of these to be included in scholarly metadata. For the Hypothesis system, either is sufficient to establish a DOI/URL mapping for the purpose of annotation.
Highwire Press tags
An article published by Cell includes the following tag:
<meta name="citation_doi" content="10.1016/j.ajhg.2017.02.007">.
That same article at PubMed Central includes the same tag.
When someone annotates this article, the Hypothesis client tells the Hypothesis server that the article’s URL is an alias of the DOI. The first time the server sees such an assertion, it adds the alias.
When the Hypothesis client then searches for annotations on the article, it searches for two URIs at once: the URL of the article at PubMed Central or Cell, and the DOI. Here’s the API query the client makes at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5384036/:
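As a rough illustration of a two-URI search, the sketch below builds a query string with one uri parameter per alias. The endpoint address, the repeated uri parameter, and the doi: form of the second value are assumptions made for this example and may not match the client’s actual request; check the Hypothesis API documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed endpoint, used here only for illustration.
SEARCH_ENDPOINT = "https://api.hypothes.is/api/search"


def build_search_url(uris):
    """Build a search URL that carries one `uri` parameter per equivalent URI."""
    return SEARCH_ENDPOINT + "?" + urlencode([("uri", u) for u in uris])


url = build_search_url([
    "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5384036/",  # the page being viewed
    "doi:10.1016/j.ajhg.2017.02.007",                          # the DOI from citation_doi
])
print(url)
```

The function only builds the URL and does not send a request, so you can inspect the result before trying it against the API.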
Dublin Core metadata
When implementing Dublin Core metadata for articles with DOIs, a meta tag like this one is added to the <head> of a document:
<meta name="dc.identifier" content="10.1016/j.ajhg.2017.02.007">.
Unlike citation_doi in Highwire Press tags, the dc.identifier is not always a DOI. The value for dc.identifier should arguably be prefixed with doi: when referring to a DOI, but in practice it commonly is not. So the Hypothesis system checks the dc.identifier value and creates a URL-to-DOI alias only when the value matches DOI syntax. Patterns that match include:
<meta name="dc.identifier" content="10.000/123">
<meta name="dc.identifier" content="doi:10.000/123">
Note that the following patterns are valid DOI syntax:
<meta name="dc.identifier" content="http://dx.doi.org/10.000/123">
<meta name="dc.identifier" content="https://doi.org/10.000/123">
But in these cases, the aliased URI does not coalesce. If you’re using Dublin Core metadata to establish URL aliases for DOIs and the annotations are not syncing as expected, check the formatting of your tags to ensure that https://doi.org/ is not included in the content attribute of the dc.identifier tag.
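If you want to check your own dc.identifier values before publishing, a small script along these lines may help. The regular expression is an assumption that approximates the behavior described above (bare and doi:-prefixed DOIs are accepted, resolver URLs are not); it is not the pattern Hypothesis itself uses, and doi_alias is a hypothetical helper name.

```python
import re

# Assumed pattern: optional "doi:" prefix, then "10.", a registrant code, "/", and a suffix.
DOI_PATTERN = re.compile(r"^(?:doi:)?(10\.[\d.]+/\S+)$", re.IGNORECASE)


def doi_alias(identifier):
    """Return "doi:<DOI>" for a bare or doi:-prefixed DOI, otherwise None."""
    match = DOI_PATTERN.match(identifier.strip())
    return f"doi:{match.group(1)}" if match else None


for value in [
    "10.000/123",
    "doi:10.000/123",
    "http://dx.doi.org/10.000/123",
    "https://doi.org/10.000/123",
]:
    print(f"{value!r} -> {doi_alias(value)!r}")
```

In this sketch the first two values yield doi:10.000/123 and the two resolver URLs yield None, which is the situation where annotations would not sync as expected.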
Because the value of the dc.identifier tag needn’t contain a DOI, it can be used with dc.relation.ispartof to bind annotations to a document in a way that’s future-proofed against URL change. In this way, Dublin Core metadata is preferable to canonical links when creating aliases for web documents.
For HTML documents that don’t have DOIs, the dc.identifier can be a string or number generated by a formal identification system, or just a URL. The dc.relation.ispartof will connect your document to your site. This is illustrated nicely by eLIFE:
The scholarly articles there gain URL independence thanks to the DOI-related metadata tags we’ve already seen. But not every article on the site has a DOI. Here’s one that doesn’t. Instead, it makes these two declarations:
<meta name="dc.identifier" content="blog-article/e3d858b3">
<meta name="dc.relation.ispartof" content="elifesciences.org">
Together they form this URL-independent identifier:
urn:x-dc:elifesciences.org/blog-article/e3d858b3
It works, in Hypothesis, like the URL-independent identifier for an article with a DOI, and like a PDF fingerprint.
When Hypothesis sees the combination of dc.identifier and dc.relation.ispartof, it joins the values of the two tags to create a URL-independent identifier.
In this example it’s still domain-dependent, insofar as elifesciences.org appears in urn:x-dc:elifesciences.org/blog-article/e3d858b3. But eLIFE could move that post to another URL tomorrow without losing annotations bound to it. And that same declaration can occur in a copy of the blog post that’s published to blog.elifesciences.org, or to a site that syndicates the article from eLIFE, or to a completely different domain operated by eLIFE in the future.
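To preview the identifier your own tags would produce, you could join the two values yourself. The sketch below reads both meta tags from a page’s HTML and joins them in the urn:x-dc: form shown in this article. It is an approximation for checking your markup: the class and function names are hypothetical, and Hypothesis may normalize or encode the values differently.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects the first content value seen for each <meta name="..."> tag."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content:
            self.meta.setdefault(name.lower(), content)


def dc_identifier_urn(html):
    """Join dc.relation.ispartof and dc.identifier, or return None if either is missing."""
    parser = MetaTagParser()
    parser.feed(html)
    identifier = parser.meta.get("dc.identifier")
    part_of = parser.meta.get("dc.relation.ispartof")
    if identifier and part_of:
        return f"urn:x-dc:{part_of}/{identifier}"
    return None


head = """<head>
<meta name="dc.identifier" content="blog-article/e3d858b3">
<meta name="dc.relation.ispartof" content="elifesciences.org">
</head>"""
print(dc_identifier_urn(head))  # urn:x-dc:elifesciences.org/blog-article/e3d858b3
```

Because the result depends only on the two tag values, the same HTML served from a different URL yields the same identifier.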
PDF metadata
The Hypothesis system makes use of two different kinds of PDF-related metadata.
The fingerprint of a PDF is embedded in the document itself, and is used by Hypothesis to sync annotations across copies of a PDF served from various locations. PDF fingerprinting is what allows Hypothesis users to collaboratively annotate local copies of PDFs.
If you want annotations to sync across multiple copies of a PDF, you’ll want to check the fingerprint of each copy to ensure they’re all the same.
Sometimes it’s desirable to prevent annotations from syncing across PDF copies. If you’re a teacher providing a PDF for two sections of a course, and you want to segment discussion on a per-section basis, you’ll need to create two versions of the PDF with different fingerprints.
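One way to compare copies is to read the identifier stored in each file. The sketch below assumes the fingerprint corresponds to the first entry of the /ID array in the PDF trailer, written as a hex string; this article does not say how the fingerprint is derived, so treat that as an assumption. Files that lack an /ID entry, or that store it in another form, will return None here even if Hypothesis can still fingerprint them.

```python
import re

# Matches a trailer entry such as: /ID [<0123abcd...> <4567ef89...>]
ID_PATTERN = re.compile(rb"/ID\s*\[\s*<([0-9A-Fa-f]+)>")


def first_id_entry(pdf_bytes):
    """Return the first hex string of the last /ID array found (lowercased), or None."""
    matches = ID_PATTERN.findall(pdf_bytes)
    return matches[-1].decode("ascii").lower() if matches else None


def same_fingerprint(path_a, path_b):
    """True only when both files expose an /ID entry and the first entries are equal."""
    ids = []
    for path in (path_a, path_b):
        with open(path, "rb") as handle:
            ids.append(first_id_entry(handle.read()))
    return ids[0] is not None and ids[0] == ids[1]


# A tiny stand-in for the tail of a PDF file, used only to exercise the parser.
sample = b"trailer\n<< /Size 1 /ID [<ABCDEF01> <12345678>] >>\n%%EOF"
print(first_id_entry(sample))
```

For real files, call same_fingerprint with the paths of two copies. A False result suggests the copies would not share annotations under the assumption stated above.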
For more on Hypothesizing with PDFs, see:
- How to OCR-Optimize PDFs
- Annotating Locally-Saved PDFs
- Hosting PDFs for Annotation
PDF-related HTML metadata
Publishers can use an HTML <meta> tag to point from an HTML version of an article to a corresponding PDF. For example, https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0168597 includes:
This declaration sets up an equivalence, in the Hypothesis server, between the HTML and PDF versions of the article. Annotations posted to either show up on both. Note that the value of the tag’s content attribute is a URL that points to a PDF directly, not to an HTML page that embeds the PDF. This form of aliasing coexists with, and complements, fingerprint-based aliasing.
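To confirm what a page declares, you can extract the PDF link from its HTML. The sketch below assumes the declaration uses the Highwire Press citation_pdf_url tag, which is not named in this article, and it uses placeholder example.org URLs rather than the real PLOS markup. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collects content values of <meta name="citation_pdf_url"> tags (assumed tag name)."""

    def __init__(self):
        super().__init__()
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if tag == "meta" and name == "citation_pdf_url" and attr.get("content"):
            self.pdf_urls.append(attr["content"])


def declared_pdf(page_url, html):
    """Return the absolute URL of the first declared PDF, or None if none is declared."""
    parser = PdfLinkParser()
    parser.feed(html)
    if not parser.pdf_urls:
        return None
    return urljoin(page_url, parser.pdf_urls[0])


# Placeholder page and markup, not taken from a real publisher.
page_url = "https://example.org/articles/42"
html = '<head><meta name="citation_pdf_url" content="/articles/42.pdf"></head>'
print(declared_pdf(page_url, html))  # https://example.org/articles/42.pdf
```

Opening the returned URL in a browser is a quick way to confirm that it serves the PDF itself rather than an HTML page that embeds it.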