Hypertext links generally point to the top of documents and other web-connected media. Intra-document anchors exist, but these usually are established at fixed places by the author of the original document, and are not useful for others in pointing to arbitrary locations in documents they do not create.
Recently, annotation has emerged as an important new way for others to point into things and contribute independent thinking and other kinds of useful information. Central to the success and interoperability of annotation standards such as Open Annotation (openannotation.org) is the adoption of a long-term, stable approach (or possibly approaches) to anchoring annotations by selecting fragments that are robust against changes over time in the underlying document. This is referred to as intra-document anchoring.
A related but different problem is that documents often exist in multiple formats (HTML, PDF, DOC, TXT, etc.), or may be accessible in paginated or unpaginated versions. It would be helpful if an annotation created in one were automatically visible in the others. Also, the same or nearly the same content (for instance with line numbers in one place, but without in another) is often available in many places (e.g. the Bible, Shakespeare, the US Constitution, popular song lyrics, news stories, or other examples of widely disseminated information). Should we seek ways of enabling the shared annotation of these types of documents wherever they exist? This class of problems is collectively referred to as inter-document anchoring, or content-based anchoring (aka the content addressable web).
Over the years there have been a range of suggested solutions to some or all of these problems, ranging from XPointer (XPath) to the NY Times' Emphasis project. (We've included a reading list at the end.) Also, importantly, some have proposed strategies for how to combine solutions with a confidence metric for whether the anchor is still valid, depending on minor or major changes in the underlying document, and how to fail gracefully if the anchor can no longer be reliably matched.
We acknowledge the wide range of previous work in this field over the past decades, and we can imagine using a variety of existing strategies in combination to provide a robust solution to these problems. We can also envision generalized approaches that try to solve intra- and inter-document challenges simultaneously (for instance by using content hashes). Proposers are free to tackle any of the above, either singly, or in combination.
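As a purely illustrative sketch of the content-hash idea mentioned above (not a proposed solution), the following Python snippet normalizes a passage and hashes it, so that two copies differing only in case, punctuation, or whitespace yield the same key. The function names `normalize` and `content_key`, and the particular normalization rules, are assumptions made for this example.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Reduce text to a canonical form so trivially different copies match."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    return " ".join(text.split())        # collapse runs of whitespace


def content_key(passage: str) -> str:
    """SHA-256 of the normalized passage (hypothetical lookup key)."""
    return hashlib.sha256(normalize(passage).encode("utf-8")).hexdigest()


# Two copies of the same passage with different line breaks and casing.
html_copy = "We the People of the United States,\n   in Order to form a more perfect Union"
txt_copy = "WE THE PEOPLE of the United States, in order to form a more perfect union"

print(content_key(html_copy) == content_key(txt_copy))  # True
```

Note that a single changed word produces a completely different key, so an exact hash like this would likely need to be paired with a fuzzier technique to tolerate real edits.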
What we're sure of is that a) this is a problem that many have thought about for much longer than we have, b) this is a grand challenge that many need solved, and c) this is a perfect opportunity to bring these twin forces together.
Ours is an open-source, mission-driven effort. Proposers commit in advance that all strategies and algorithms will be donated to the public domain in perpetuity without restriction, for the benefit of humanity, and released as open source software via GitHub (or a similar open source repository). Also, all proposers agree that their submissions (winning or not) can be made public in the original form they provide.
Please be sure to place your proposal in the context of the prior work in this space.
Proposers also should indicate whether they would be willing to participate in a peer-review process, by reviewing a number of (for instance, five) randomly selected other proposals, selecting one or more to go on to the next stage. This is optional and will not affect your proposal.
Winning submissions will ultimately be selected by a panel of judges as well as ourselves. We may select a number of different proposals that taken together represent the best overall approach.
Winners should be willing to work together to produce a revised, comprehensive algorithm and overall strategy which can be implemented in code. It is our hope, but not a requirement, that those individuals might become long term contributors to this effort.
Winners are also strongly encouraged to publish their work separately.
The best strategies may have some of the following properties:
- They are conceptually and architecturally simple.
- They are built with an eye towards the future. What is likely to remain constant over time? Can we imagine the same approach working 100 years from now? (A good question is: What has remained relatively stable in the past?)
- They can be implemented now: they don't depend on new technologies or standards yet to come.
- They acknowledge that because of the inherent instability of web content, anchors can never be known for certain, and we may take different actions based on how confident we are.
- They outline how we fail gracefully when we cannot reattach anchors.
- New anchors can be determined solely from information available on the document to which they refer.
- They are not expensive to compute on a per instance basis.
- Queries to see whether annotations exist for a given anchor do not require excessive bandwidth or complex multipart negotiations with remote services. (Simple, short, fast queries are OK.)
- They contemplate the diversity of text content and formats, including the lowest common denominator problem: in other words, some pages are highly structured (and that structure can be traversed to efficiently rediscover things that have moved), but some pages are not.
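To make the confidence and graceful-failure properties above concrete, here is a minimal sketch in Python. It is one possible illustration under stated assumptions, not a recommended algorithm: the function name `reattach`, the 0.7 threshold, and the brute-force sliding window are all hypothetical choices, and the anchor is reduced to the quoted text alone (surrounding context, which would help disambiguate repeated passages, is omitted for brevity).

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple


def reattach(doc: str, quote: str,
             threshold: float = 0.7) -> Optional[Tuple[int, int, float]]:
    """Return (start, end, confidence), or None if no match is good enough."""
    # 1. An exact match of the quote gets full confidence.
    pos = doc.find(quote)
    if pos != -1:
        return pos, pos + len(quote), 1.0

    # 2. Otherwise slide a window over the document and keep the most
    #    similar span. Brute force: fine for a sketch, too slow for real use.
    width = len(quote)
    best_pos, best_score = -1, 0.0
    for i in range(max(1, len(doc) - width + 1)):
        window = doc[i:i + width]
        score = SequenceMatcher(None, quote, window, autojunk=False).ratio()
        if score > best_score:
            best_pos, best_score = i, score

    # 3. Fail gracefully: below the threshold, report no anchor at all
    #    rather than attaching the annotation to the wrong text.
    if best_score >= threshold:
        return best_pos, best_pos + width, best_score
    return None


quote = "quick brown fox"
print(reattach("Yesterday the quick brown fox jumped over a lazy dog.", quote))
print(reattach("The quick browne fox jumps over the lazy dog.", quote))
print(reattach("Completely different content here.", quote))
```

A caller could act differently on each outcome: display the annotation normally on an exact match, flag it as approximate on a fuzzy match, and list it as orphaned when `None` is returned.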
Deadline: November 16, 2012
Submissions will be accepted through November 16, 2012. Once submissions have been received, we will review them internally, and depending on volume, may request that proposers assist in the initial peer review phase. Winners will be announced by December 17, 2012.
A $10,000 prize will be split evenly amongst no more than three winners. The funding for this prize is part of an overall grant that Hypothes.is received from the Alfred P. Sloan Foundation.
Corpus of data / Examples
Here is a spreadsheet that includes three tabs with examples of each of the three classes of problems: many versions, many formats and many copies.
Suggested Reading List
- Robust Intra-Document Locations
(Phelps / Wilensky)
- Linked-Data Aware URI Schemes for Referencing Text Fragments
(Hellmann, et al)
- Robust annotation positioning in Digital documents
(Brush, et al — Also see patent US 7747943)
- Post-mortem on EPUB3 Spec for Canonical Fragment Identifiers
- A debate on EPUB3 Canonical Fragment Identifiers
- Using CSS Selectors as Fragment Identifiers
- Fragment Identifiers for Plain Text Files
(Wilde and Baschnagel)
- NY Times Emphasis Project
- Robust Web Content Extraction
(Kowalkiewicz, et al)
- Content Permanence via Versioning and Fingerprinting
(Simonson, et al)
- A Content-based Approach for Discovering Missing Anchor Text for Web Search
(Yi and Allan)
- The Heart of Connection – Hypermedia Unified by Transclusion
- Referential Integrity of Links in Open Hypermedia Systems (Davis)
- HTTP Extensions for a Content Addressable Web
- On the resemblance and containment of documents
- Copy Detection Mechanisms for Digital Documents
- Shingleprinting code for estimating document similarity
- Finding Similar Things Quickly
- Heavy Metal Umlaut animation contest