How Hypothesis interacts with document metadata

2018-11-12T09:44:42+00:00

Various HTML tags influence how the Hypothesis system identifies and aliases documents. Some of these tags are commonly deployed to guide search engines, others to help organize the scholarly literature. If you’re already using such tags as prescribed by the standards and conventions that govern them, then you needn’t change that practice in order to accommodate Hypothesis. But you’ll want to be aware of how such tags influence the Hypothesis system, so we’ll document all the relevant behaviors here.

If you’re not already publishing such tags, but would like to influence how Hypothesis identifies and aliases documents, you can adopt them for that purpose. Or you can use a less-common approach that we’ll explain at the end.

But first, let’s review how Hypothesis piggybacks on common metadata conventions to identify and alias documents.

How Hypothesis interacts with search-oriented HTML tags

rel="canonical"

The most common of these is <link rel="canonical">. Google offers this guidance:

If you have a single page accessible by multiple URLs, or different pages with similar content (for example, a page with both a mobile and a desktop version), Google sees these as duplicate versions of the same page. Google will choose one URL as the canonical version and crawl that, and all other URLs will be considered duplicate URLs and crawled less often.

If you don’t explicitly tell Google which URL is canonical, Google will make the choice for you, or might consider them both of equal weight, which might lead to unwanted behavior.

Let’s take Google’s example and consider a desktop version of a page at http://example.com/desktop.html and a mobile version at http://example.com/mobile.html. You want Google to prefer the desktop version, so you add <link rel="canonical" href="http://example.com/desktop.html"> to both pages.

The first time either version is annotated, Hypothesis creates an internal identifier for the document they commonly represent, and associates two aliases with that identifier: the URL of the document itself, and the URL named in the canonical link relation. When either version is subsequently annotated, Hypothesis looks up the canonical URL. If the document URL isn’t already among the aliases — as happens if there is one annotation on /desktop.html and none on /mobile.html, or vice versa — Hypothesis adds the missing alias. From then on, annotations posted to either version accrue to the same document identifier.

When the Hypothesis client searches for annotations, whether from /desktop.html or from /mobile.html, it searches for the canonical URL, /desktop.html, and finds all annotations posted to either version.

If you subsequently republish both versions without the link tag that declares /desktop.html to be canonical, a search from either version will continue to find annotations posted to either, because /desktop.html and /mobile.html are retained as aliases. There is as yet no way for a Hypothesis user to remove aliases interactively or by means of the Hypothesis API.
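The alias behavior described above can be pictured with a small toy model. This is only a sketch of the bookkeeping, not the Hypothesis server's implementation: the AliasIndex class, its method names, and the integer document ids are all hypothetical, and the model ignores cases such as merging two documents that already exist.

```python
class AliasIndex:
    """Toy model: one internal id per document, plus a growing set of URI aliases."""

    def __init__(self):
        self._doc_of = {}   # alias URI -> internal document id (hypothetical)
        self._next_id = 1

    def record(self, page_url, canonical_url=None):
        """Register the URIs a page declares; return the document id they share."""
        uris = [page_url] + ([canonical_url] if canonical_url else [])
        # Reuse an existing document if any declared URI is already a known alias.
        doc_id = next((self._doc_of[u] for u in uris if u in self._doc_of), None)
        if doc_id is None:
            doc_id = self._next_id
            self._next_id += 1
        for u in uris:
            # Aliases are only ever added, never removed.
            self._doc_of.setdefault(u, doc_id)
        return doc_id

    def aliases(self, doc_id):
        return sorted(u for u, d in self._doc_of.items() if d == doc_id)


index = AliasIndex()
desktop = "http://example.com/desktop.html"
mobile = "http://example.com/mobile.html"

first = index.record(desktop, canonical_url=desktop)   # first annotation, desktop page
second = index.record(mobile, canonical_url=desktop)   # later annotation, mobile page
third = index.record(mobile)                           # republished without the link tag
```

In this model all three calls return the same document id, because the mobile URL was stored as an alias the first time it appeared alongside the canonical link.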

(That’s generally true for Hypothesis document metadata, as of October 2018. For example, the HTML title element inside the head element controls the document titles displayed in the Hypothesis dashboard. If you republish a page with a changed title, the dashboard will not reflect that change.)

rel="alternate"

Google offers this guidance:

Add <link rel="alternate" hreflang="lang_code"… > elements to your page header to tell Google all of the language and region variants of a page. This is useful if you don’t have a sitemap or the ability to specify HTTP response headers for your site.

It wouldn’t make sense for Hypothesis to handle the alternate link relation in the same way as the canonical link relation. Annotations that anchor to phrases in the English version of a page won’t anchor to the corresponding phrase in the French version. So, Hypothesis doesn’t try to do that.

Here is the Mozilla Developer Network’s guidance for another use of the alternate link relation:

If the type is set to application/rss+xml or application/atom+xml, the link defines a syndication feed. The first one defined on the page is the default.

Suppose that http://example.com/rss.xml is an RSS feed, and that /desktop1.html and /desktop2.html are links in that feed. Both pages might include <link rel="alternate" type="application/rss+xml" href="/rss.xml">. Again it wouldn’t make sense for Hypothesis to handle the alternate link relation in the same way as the canonical link relation. The result would be that the Hypothesis client would search for, and try to anchor, annotations on any page declaring that alternate link relation. On any given page, annotations for that page would anchor, and all others would show up in the client’s Orphans tab. So, again, Hypothesis doesn’t create aliases based on this link relation.

There is a case where Hypothesis handles the alternate link relation in the same way as the canonical link relation. If /mobile.html includes <link rel="alternate" href="/desktop.html">, the effect on annotation is the same as with <link rel="canonical" href="/desktop.html">. But there’s no annotation-related reason to use that pattern. If an alternate link doesn’t refer to a syndication feed, MDN says, then it:

defines an alternative page, of one of these types:

  • for another medium, like a handheld device (if the media attribute is set)
  • in another language (if the hreflang attribute is set),
  • in another format, such as a PDF (if the type attribute is set)
  • a combination of these

We’ve covered the media and hreflang cases. To declare PDF and HTML variants as aliases there’s another tag, citation_pdf_url, which we discuss below. It’s true that the alternate link relation behaves the same way as the canonical link relation with respect to Hypothesis annotation. But there’s no reason to use it for that purpose. Just use the canonical link relation.

How Hypothesis interacts with scholarly metadata

In the scholarly world, the digital object identifier (DOI) can unify various online appearances of the same article. For example, a research paper might be hosted by its journal of origin at one URL and by PubMed Central at another URL. The DOI provides a common way to cite the article.

There are two ways for a scholarly web page to declare its URL as an alias for a DOI. The most common method is a de facto standard known as the Highwire Press tag set, popular because it’s supported by Google Scholar. For example, an article published by Cell includes the tag <meta name="citation_doi" content="10.1016/j.ajhg.2017.02.007">. That same article at PubMed Central includes the same tag.

When the Hypothesis client posts an annotation to that article, it tells the Hypothesis server that the article’s URL is an alias of the DOI. The first time the server sees such an assertion, it adds the alias.

When the Hypothesis client then searches for annotations on the article, it searches for two URLs at once. First, for the URL of the article at PubMed or Cell. Second, for the DOI. Here’s the API query the client makes at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5384036/:

https://hypothes.is/api/search?
uri=https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5384036/&
uri=doi:10.1016/j.ajhg.2017.02.007

(You can make the same queries in the Hypothesis dashboard. Here’s a search for the journal URL. Here’s one for the DOI. Here’s one for both. They all get the same result.)
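If you want to reproduce that query from a script, the sketch below builds the same URL with Python's standard library. The endpoint and the repeated uri parameter come from the example above; the build_search_url helper is a name made up for illustration. Note that urlencode percent-encodes the colons and slashes in each value, which is equivalent to the unencoded form shown above.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://hypothes.is/api/search"

def build_search_url(uris):
    """Build a search URL that repeats the uri parameter once per alias."""
    # A list of (key, value) pairs lets the same key appear more than once.
    query = urlencode([("uri", u) for u in uris])
    return f"{SEARCH_ENDPOINT}?{query}"

url = build_search_url([
    "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5384036/",
    "doi:10.1016/j.ajhg.2017.02.007",
])
print(url)
```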

Another way to get the same result, based on the Dublin Core Metadata Initiative, uses this kind of tag: <meta name="dc.identifier" content="10.1016/j.ajhg.2017.02.007">. The citation_doi and dc.identifier tags are often both included in scholarly metadata. For the Hypothesis system, either is sufficient to establish a DOI/URL mapping for the purpose of annotation.

One difference between the two methods is worth mentioning. When a page declares citation_doi, the tag’s value is unambiguously a DOI. The Hypothesis server creates a DOI/URL mapping if none already exists. When a page declares dc.identifier, though, the tag’s value is not necessarily a DOI. The value should arguably be prefixed with doi: but in practice commonly is not. So the Hypothesis server checks the dc.identifier value sent from the client, and creates aliases only when the value matches DOI syntax. Patterns that match include:

<meta name="dc.identifier" content="10.000/123">

<meta name="dc.identifier" content="doi:10.000/123">

Note that the filter also matches these patterns:

<meta name="dc.identifier" content="http://dx.doi.org/10.000/123">

<meta name="dc.identifier" content="https://doi.org/10.000/123">

But in these cases, the aliased URI does not coalesce.
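To get a feel for which dc.identifier values would pass such a check, here is a rough sketch of a DOI-syntax test. The regular expression is our own assumption, not the filter the Hypothesis server actually uses, and extract_doi is a hypothetical helper. It models only the matching step, not what the server stores afterward.

```python
import re

# Assumed shape of a DOI: "10.", a numeric registrant code, "/", then a suffix.
DOI_CORE = r"10\.\d+(?:\.\d+)*/\S+"

# Optional "doi:" prefix or doi.org resolver URL in front of the DOI itself.
DOI_LIKE = re.compile(
    rf"^(?:doi:|https?://(?:dx\.)?doi\.org/)?({DOI_CORE})$",
    re.IGNORECASE,
)

def extract_doi(value):
    """Return the bare DOI if the value looks like one, else None."""
    match = DOI_LIKE.match(value.strip())
    return match.group(1) if match else None

for candidate in [
    "10.000/123",
    "doi:10.000/123",
    "http://dx.doi.org/10.000/123",
    "https://doi.org/10.000/123",
    "blog-article/e3d858b3",
]:
    print(candidate, "->", extract_doi(candidate))
```

A value such as blog-article/e3d858b3 fails the check, so under this model it would not produce a DOI alias.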

How Hypothesis interacts with PDF metadata

The Hypothesis system makes use of two different kinds of PDF-related metadata. One kind is embedded in the PDF, most notably the fingerprint that Hypothesis uses to coalesce annotations across copies served from various locations.

Sometimes it’s desirable to prevent such coalescence. If you’re a teacher providing a PDF for two sections of a course, and you want to segment discussion on a per-section basis, you’ll need to create two versions of the PDF with different fingerprints. To do that:

  1. Load the PDF in Chrome
  2. Click on the Print button
  3. Use the Save as PDF option in the Print dialog

The saved copy will have a different fingerprint.

To check a fingerprint:

  1. Load the PDF into Chrome (e.g. by dragging it from your computer’s file viewer into Chrome)
  2. Open Chrome’s Customize and Control menu (upper right, vertical triple dots)
  3. Select More Tools -> Developer Tools -> Console
  4. You’ll see something like this: PDF fcfbb33577a82236301d15889a77df36 [1.3 Mac OS X 10.12.3 Quartz PDFContext / Adobe InDesign CS2 (4.0.5)] (PDF.js: 1.1.215)

The fingerprint in this case is fcfbb33577a82236301d15889a77df36.

For more on Hypothesizing with PDFs, see:

https://web.hypothes.is/help/how-to-ocr-optimize-pdfs/

https://web.hypothes.is/help/annotating-locally-saved-pdfs/

https://web.hypothes.is/help/hosting-pdfs-for-annotation/

http://jeremydean.org/blog/getting-started/hosting-and-annotating-pdfs-in-wordpress/

Publishers can use an HTML meta tag to point from an HTML version of an article to a corresponding PDF. For example, https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0168597 includes:

<meta name="citation_pdf_url"
content="https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0168597&type=printable">

This declaration sets up an equivalence, in the Hypothesis server, between the HTML and PDF versions of the article. Note that the value of the tag’s content attribute is a URL that points to a PDF directly, not to an HTML page that embeds the PDF. Annotations posted to either show up on both. Note that this form of aliasing coexists with, and complements, fingerprint-based aliasing.

What happens when URLs change?

So far we’ve seen how Hypothesis interacts with metadata conventions that web publishers may already be using to influence search engines and/or coalesce citation of syndicated content. In both cases, Hypothesis leverages a common convention to identify, and alias, annotated documents.

Thanks to these mechanisms, annotations can be future-proofed against URL change in two ways. If an annotation’s target document declares a DOI, Hypothesis will bind the annotation to all documents that declare the same DOI. Similarly, if an annotation’s target document is a PDF, Hypothesis will bind it to all copies of that PDF. In both cases, the annotation binds to the document in a way that does not depend on the document’s URL.

Ideally, of course, every annotation would bind to a URL-independent identifier and be future-proofed against URL change. Hypothesis has a way to make that happen, and it’s illustrated nicely at eLIFE. The scholarly articles there gain URL independence thanks to the DOI-related metadata tags we’ve already seen. But not every article on the site has a DOI. Here’s one that doesn’t. Instead, it makes these two declarations:

<meta name="dc.identifier" content="blog-article/e3d858b3">

<meta name="dc.relation.ispartof" content="elifesciences.org">

Together they form this URL-independent identifier:

urn:x-dc:elifesciences.org/blog-article/e3d858b3

It works, in Hypothesis, like the URL-independent identifier for an article with a DOI:

doi:10.1126/science.51.1305.8

And like a PDF fingerprint:

urn:x-pdf:db49e0a7b073bbadeb889a910835b716

When Hypothesis sees the combination of dc.identifier and dc.relation.ispartof it joins the values of the two tags to create a URL-independent identifier. In this example it’s still domain-dependent, insofar as elifesciences.org appears in urn:x-dc:elifesciences.org/blog-article/e3d858b3. But eLIFE could move that post to another URL tomorrow without losing annotations bound to it. And that same declaration can occur in a copy of the blog post that’s published to blog.elifesciences.org, or to a site that syndicates the article from eLIFE, or to a completely different domain operated by eLIFE in the future.
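The three identifier forms shown above follow simple string patterns, which the sketch below reproduces. The function names are hypothetical, and any escaping or normalization that the real Hypothesis client applies to the tag values is not modeled here.

```python
def dc_urn(identifier, is_part_of):
    """Join dc.relation.ispartof and dc.identifier into a urn:x-dc identifier."""
    return f"urn:x-dc:{is_part_of}/{identifier}"

def doi_uri(doi):
    """Prefix a bare DOI with the doi: scheme."""
    return f"doi:{doi}"

def pdf_urn(fingerprint):
    """Wrap a PDF fingerprint in a urn:x-pdf identifier."""
    return f"urn:x-pdf:{fingerprint}"

print(dc_urn("blog-article/e3d858b3", "elifesciences.org"))
print(doi_uri("10.1126/science.51.1305.8"))
print(pdf_urn("db49e0a7b073bbadeb889a910835b716"))
```

None of the three results contains the URL of the page that declared the metadata, which is why they survive a URL change.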

Recommendations

If you’re publishing HTML documents that have DOIs, Hypothesis is designed to work with existing metadata conventions. If you are using one or both of citation_doi and dc.identifier for their intrinsic purposes, annotations will bind to your documents in a URL-independent way. If you’re not already using them, you can implement either or both to achieve that effect.

If you’re publishing HTML documents that point to corresponding PDFs, Hypothesis again is designed to work with an existing convention, citation_pdf_url, that you can leverage or newly adopt to coalesce annotations across HTML and PDF versions. Note that when you associate an HTML URL with a PDF URL in this way, you are also binding both to the set of annotations made on all copies of the PDF, since they share a common fingerprint.

If you’re publishing PDF documents, be aware that annotations coalesce across copies of the same PDF file. If you want disjoint sets of annotations for the same document, make a new version using the method shown above.

Finally, if you’re publishing a web document that doesn’t have a DOI, and isn’t a PDF, but you want to bind annotations to that document in a way that’s future-proofed against URL change, you can use dc.identifier with dc.relation.ispartof as shown in the eLIFE example.
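Before relying on any of these conventions, you may want to check which of the tags a page actually declares. The sketch below does that with Python's built-in HTML parser. The MetadataAudit class, the audit function, and the sample page (including its PDF URL) are made up for illustration; this is not a Hypothesis tool, and it reports only the tags discussed in this article.

```python
from html.parser import HTMLParser

# Meta tag names discussed in this article.
WATCHED_META = {
    "citation_doi",
    "dc.identifier",
    "dc.relation.ispartof",
    "citation_pdf_url",
}

class MetadataAudit(HTMLParser):
    """Collect the canonical link and the watched meta tags from a page."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.found["canonical"] = attrs.get("href")
        elif tag == "meta" and (attrs.get("name") or "").lower() in WATCHED_META:
            self.found[attrs["name"].lower()] = attrs.get("content")

def audit(html):
    parser = MetadataAudit()
    parser.feed(html)
    return parser.found

# Hypothetical page head used only to exercise the parser.
sample = """<head>
<link rel="canonical" href="http://example.com/desktop.html">
<meta name="citation_doi" content="10.1016/j.ajhg.2017.02.007">
<meta name="citation_pdf_url" content="http://example.com/article.pdf">
</head>"""

report = audit(sample)
for name, value in report.items():
    print(name, "=", value)
```

A tag that is missing from the report, such as dc.identifier for the sample page, is one the page does not declare.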

Appendix

Examples of search-oriented tags

See Effects of search metadata on annotation

Examples of scholarly tags

See Effects of scholarly metadata on annotation
