Robust Anchoring

Challenge Background

Hypertext links generally point to the top of documents and other web-connected media. Intra-document anchors exist, but these usually are established at fixed places by the author of the original document, and are not useful for others in pointing to arbitrary locations in documents they do not create.

Recently, annotation has emerged as an important new way for others to point into things and contribute independent thinking and other kinds of useful information. Central to the success and interoperability of annotation standards such as Open Annotation (openannotation.org) is the adoption of a long-term, stable approach (or possibly approaches) to anchoring annotations by selecting fragments that are robust against changes over time in the underlying document. This is referred to as intra-document anchoring.

A related but different problem is that documents often exist in multiple formats (HTML, PDF, DOC, TXT, etc.), or may be accessible in paginated or unpaginated versions. It would be helpful if an annotation created in one version were automatically visible in the others. Also, the same or nearly the same content (for instance with line numbers in one place, but without in another) may be available in many places (e.g. the Bible, Shakespeare, the US Constitution, popular song lyrics, news stories, or other examples of widely disseminated information). Should we seek ways of enabling the shared annotation of these types of documents wherever they exist? This class of problems is collectively referred to as inter-document anchoring, or content-based anchoring (aka the content addressable web).

Solve It!

Over the years there have been many suggested solutions to some or all of these problems, ranging from XPointer (XPath) to the NY Times' Emphasis project. (We've included a reading list at the end.) Importantly, some have also proposed strategies for combining solutions with a confidence metric for whether the anchor is still valid, depending on minor or major changes in the underlying document, and for failing gracefully if the anchor can no longer be reliably matched.
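As a rough illustration of what a confidence metric with graceful failure could look like, here is a minimal Python sketch. It is not taken from any of the projects named above. The function name reattach and the 0.75 threshold are assumptions made for this example, and the brute-force sliding window is far too slow for long documents; it only shows the shape of the idea.

```python
from difflib import SequenceMatcher


def reattach(quote, document, threshold=0.75):
    """Return (start, end, confidence), or None if nothing is similar enough.

    Hypothetical helper: the name and the threshold are illustrative only.
    """
    # Fast path: an exact match is reported with full confidence.
    exact = document.find(quote)
    if exact != -1:
        return exact, exact + len(quote), 1.0

    # Slow path: slide a window of the quote's length over the document
    # and keep the window most similar to the quote.
    window = len(quote)
    best = None
    for start in range(0, max(1, len(document) - window + 1)):
        candidate = document[start:start + window]
        score = SequenceMatcher(None, quote, candidate, autojunk=False).ratio()
        if best is None or score > best[2]:
            best = (start, start + len(candidate), score)

    # Graceful failure: below the threshold the anchor is reported as
    # lost rather than attached to a doubtful location.
    if best is None or best[2] < threshold:
        return None
    return best
```

A caller could show a high-confidence result normally, flag a lower-confidence one for the reader, and list the annotation as orphaned when the function returns None.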

We acknowledge the wide range of previous work in this field over the past decades, and we can imagine using a variety of existing strategies in combination to provide a robust solution to these problems. We can also envision generalized approaches that try to solve intra- and inter-document challenges simultaneously (for instance by using content hashes). Proposers are free to tackle any of the above, either singly, or in combination.
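To make the content-hash idea concrete, here is a minimal sketch in Python. It assumes one particular normalization (Unicode NFKC, case folding, punctuation and whitespace dropped) and SHA-256 as the hash; both are choices made for this example, and the names normalize and content_key are hypothetical. The point is only that a key derived from the content itself can be the same for copies of a passage that differ in formatting.

```python
import hashlib
import re
import unicodedata


def normalize(text):
    """Reduce text to lowercase words separated by single spaces."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    # \w+ keeps runs of letters, digits and underscores; punctuation,
    # line breaks and repeated spaces are dropped.
    return " ".join(re.findall(r"\w+", folded))


def content_key(passage):
    """Return a hex digest that depends only on the normalized content."""
    return hashlib.sha256(normalize(passage).encode("utf-8")).hexdigest()


# Two copies of the same passage with different layout and punctuation.
html_copy = "To be, or not to be:\n   that is the question."
text_copy = "to be or not to be that is the question"
same_key = content_key(html_copy) == content_key(text_copy)
```

An exact hash like this is brittle: a single changed word, or line numbers embedded in one copy, produces a different key. A workable proposal would likely need to pair it with something more tolerant of edits.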

What we're sure of is that a) this is a problem that many have thought about for much longer than we have, b) this is a grand challenge that many need solved, and c) this is a perfect opportunity to bring these twin forces together.

Process

Ours is an open-source, mission-driven effort. Proposers commit in advance that all strategies and algorithms will be donated to the public domain in perpetuity without restriction, for the benefit of humanity, and released as open source software via GitHub (or a similar open source repository). Also, all proposers agree that their submissions (winning or not) can be made public in the original form they provide.

Please submit a one-page executive summary of your proposal, along with a longer description as necessary. Implementations in code (JavaScript, for instance) are strongly encouraged.

Please be sure to place your proposal in the context of prior work in this space.

Peer-review

Proposers also should indicate whether they would be willing to participate in a peer-review process, by reviewing a number of (for instance, five) randomly selected other proposals, selecting one or more to go on to the next stage. This is optional and will not affect your proposal.

Judging

Winning submissions will ultimately be selected by a panel of judges as well as ourselves. We may select a number of different proposals that taken together represent the best overall approach.

Winners should be willing to work together to produce a revised, comprehensive algorithm and overall strategy which can be implemented in code. It is our hope, but not a requirement, that those individuals might become long term contributors to this effort.

Winners are also strongly encouraged to publish their work separately.

Considerations

The best strategies may have some of the following properties:

They are conceptually and architecturally simple.
They are built with an eye towards the future. What is likely to remain constant over time? Can we imagine the same approach working 100 years from now? (A good question is: What has remained relatively stable in the past?)
They are able to be implemented now; they don't depend on new technologies or standards yet to come.
They acknowledge that because of the inherent instability of web content, anchors can never be known for certain, and we may take different actions based on how confident we are.
They outline how we fail gracefully when we cannot reattach anchors.
New anchors can be determined solely from information available on the document to which they refer.
They are not expensive to compute on a per instance basis.
Queries to see whether annotations exist for a given anchor would not require excessive bandwidth or complex multipart negotiations with remote services. (Simple, short, fast queries are OK.)
They contemplate the diversity of text content and formats, including the lowest common denominator problem: in other words, some pages are highly structured (and that structure can be traversed to efficiently rediscover things that have moved), but some pages are not.
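Several of these properties can be illustrated together with a small Python sketch. It assumes an anchor made of the selected quote plus a few characters of surrounding context, all read from the document itself; the Anchor fields, the function names and the 16-character CONTEXT size are hypothetical choices for this example, not a proposed standard. It uses plain text only, so it does not rely on page structure, and it returns None when the quote has disappeared.

```python
from dataclasses import dataclass

CONTEXT = 16  # assumed number of context characters kept on each side


@dataclass(frozen=True)
class Anchor:
    quote: str
    prefix: str
    suffix: str
    position: int  # offset at creation time, used only as a tie-breaker


def create_anchor(document, start, end):
    """Build an anchor using nothing but the document text."""
    return Anchor(
        quote=document[start:end],
        prefix=document[max(0, start - CONTEXT):start],
        suffix=document[end:end + CONTEXT],
        position=start,
    )


def _shared_prefix(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def _shared_suffix(a, b):
    return _shared_prefix(a[::-1], b[::-1])


def locate(anchor, document):
    """Return the start offset of the best occurrence of the quote, or None."""
    best_start, best_key = None, None
    start = document.find(anchor.quote)
    while start != -1:
        end = start + len(anchor.quote)
        before = document[max(0, start - CONTEXT):start]
        after = document[end:end + CONTEXT]
        # Score each occurrence by how much of the stored context still
        # surrounds it; ties go to the occurrence nearest the old offset.
        score = (_shared_suffix(before, anchor.prefix)
                 + _shared_prefix(after, anchor.suffix))
        key = (score, -abs(start - anchor.position))
        if best_key is None or key > best_key:
            best_start, best_key = start, key
        start = document.find(anchor.quote, start + 1)
    return best_start
```

For example, an anchor created on the second "the cat" in "the cat sat. later, the cat ran away." still resolves to that second occurrence after text is inserted at the top of the document, because the stored context distinguishes it from the first.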

Deadline: November 16, 2012

Submissions will be accepted through November 16, 2012. Once submissions have been received, we will review them internally, and depending on volume, may request that proposers assist in the initial peer review phase. Winners will be announced by December 17, 2012.

Prize: $10,000

A $10,000 prize will be split evenly amongst no more than three winners. The funding for this prize is part of an overall grant that Hypothes.is received from the Alfred P. Sloan Foundation.

Corpus of data / Examples

Here is a spreadsheet that includes three tabs with examples of each of the three classes of problems: many versions, many formats and many copies.