SciBot: Machine and human annotators working together

By judell | 4 March, 2016

Last April, at I Annotate Hack Days, several of the developers who showed up wanted to use the Hypothesis API not only to read annotations but also to create them. With help from Randall Leeds, Raymond Yee built an API wrapper that included a way to make authenticated calls to the Hypothesis API. Since then, variants of that library — in Python and Perl — have been used by a handful of developers working on projects that require programmatic writing of annotations, and/or reading of private annotations such as those in groups.

But the method Raymond used — through no fault of his own — wasn’t ideal. His code had to pretend to be a user, and log in as a user would in order to get hold of the API token that the Hypothesis client sends to the server. We’d rather have offered a proper way for developers to acquire such tokens but hadn’t yet made that possible.

Now we have. If you’re logged in to Hypothesis, visit https://hypothes.is/profile/developer to find yours. If you’re not a developer you’ll never need this token. But if you are, it’s your ticket to a world of Hypothesis integration.
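Here is a minimal sketch of how a script might use that token, assuming the API accepts it as a bearer token in the Authorization header and exposes search and annotation-creation endpoints under https://hypothes.is/api. The helper names (auth_headers, search_request, create_request) are made up for illustration, and the code only builds the requests without sending them.

```python
import json
import urllib.parse
import urllib.request

# Assumed API root; check the current Hypothesis API docs before relying on it.
API_ROOT = "https://hypothes.is/api"


def auth_headers(token):
    """Headers that carry the developer token as a bearer credential."""
    return {
        "Authorization": "Bearer " + token,
        "Content-Type": "application/json",
    }


def search_request(token, **params):
    """Build (but do not send) an authenticated search request."""
    url = API_ROOT + "/search?" + urllib.parse.urlencode(params)
    return urllib.request.Request(url, headers=auth_headers(token))


def create_request(token, payload):
    """Build (but do not send) an authenticated request that posts an annotation."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        API_ROOT + "/annotations",
        data=data,
        headers=auth_headers(token),
        method="POST",
    )
```

Passing either request to urllib.request.urlopen would actually send it. The token itself is best read from an environment variable or a config file rather than written into the script.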

To illustrate use of the token I’ll focus here on a tool called SciBot, which is the brainchild of our biosciences director Maryann Martone and some of her colleagues at the Neuroscience Information Framework (NIF) project. NIF is “a dynamic inventory of Web-based neuroscience resources: data, materials, and tools.” As one of its core activities, NIF has defined and promoted a mechanism to identify such resources when mentioned in scientific papers. It entails a registry of Research Resource Identifiers (RRIDs) and a protocol for including RRIDs in scientific papers.

Here’s an example of some RRIDs cited in “Dopaminergic lesioning impairs adult hippocampal neurogenesis by distinct modification of a-synuclein”:

Free-floating sections were stained with the following primary antibodies: rat monoclonal anti-BrdU (1:500; RRID:AB_10015293; AbD Serotec, Oxford, United Kingdom), rabbit polyclonal anti-Ki67 (1:5,000; RRID:AB_442102; Leica Microsystems, Newcastle, United Kingdom), mouse monoclonal antineuronal nuclei (NeuN; 1:500; RRID:AB_10048713; Millipore, Billerica, MA), rabbit polyclonal antityrosine hydroxylase (TH; RRID:AB_1587573; Millipore), goat polyclonal anti-DCX (1:250; RRID:AB_2088494; Santa Cruz Biotechnology, Santa Cruz, CA), and mouse monoclonal anti-a-syn (1:100; syn1; clone 42; RRID:AB_398107; BD Bioscience, Franklin Lakes, NJ).

The term “goat polyclonal anti-DCX” is not necessarily unique. So the author has added the identifier RRID:AB_2088494, which corresponds to this record in NIF’s registry. RRIDs are embedded directly in papers, rather than attached as metadata, because, as Dr. Martone says, “papers are the only scientific artifacts that are guaranteed to be preserved.”

RRIDs included directly in articles are guaranteed to be preserved, but there is no guarantee they’ll be correct. An RRID might be misspelled. Or a correct RRID might point to a flawed record in the registry. Could annotation enable a process of computer-assisted validation? Thus was born the idea of SciBot. It’s a machine/human partnership that involves:

A bookmarklet

When clicked by a validator who is viewing an article in a browser, the bookmarklet relays the text of the article to the SciBot service.

The SciBot service

When it receives the article sent from the bookmarklet, the service:

1. Scans the article for RRIDs.

2. Queries the Hypothesis API to check whether each found RRID has already been annotated. If so, it won’t add a duplicate annotation.

3. Calls the NIF registry for each found RRID and retrieves its record there.

4. Calls the Hypothesis API to post an annotation that includes the lookup result and anchors to the first occurrence of that RRID in the text of the article. The annotation is posted to a private group where the team of NIF validators is collaborating, and carries a tag (e.g. RRID:AB_2088494) to indicate SciBot thinks it has found a valid RRID.

The API token is needed for steps 2 and 4. In step 2, SciBot is querying a private Hypothesis group. So it must authenticate using the token of a Hypothesis user who belongs to that group. In step 4, SciBot is posting an annotation. That too requires authentication as a user, whether or not — as in this case — the annotation targets a group.
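The four steps above can be sketched roughly as follows. This is not SciBot's actual code: the regular expression is a simplified stand-in that only matches identifiers shaped like AB_2088494, the lookup argument is a hypothetical callable standing in for the NIF registry call, and the payload fields (uri, text, tags, group, permissions, target with a TextQuoteSelector) reflect my understanding of what the Hypothesis API accepts. No network calls are made here; the functions only work out what would be posted.

```python
import re

# Illustrative pattern only; the real service tunes its own expression.
RRID_PATTERN = re.compile(r"RRID:\s*([A-Z]+_\d+)")


def find_rrids(text):
    """Map each RRID tag to (matched text, offset) of its first occurrence."""
    first = {}
    for m in RRID_PATTERN.finditer(text):
        tag = "RRID:" + m.group(1)
        if tag not in first:
            first[tag] = (m.group(0), m.start())
    return first


def build_annotation(uri, text, tag, exact, start, lookup_result, group_id):
    """Build a payload anchored to the quoted RRID, visible to one group."""
    end = start + len(exact)
    selector = {
        "type": "TextQuoteSelector",
        "exact": exact,
        "prefix": text[max(0, start - 32):start],
        "suffix": text[end:end + 32],
    }
    return {
        "uri": uri,
        "text": lookup_result,
        "tags": [tag],
        "group": group_id,
        "permissions": {"read": ["group:" + group_id]},
        "target": [{"source": uri, "selector": [selector]}],
    }


def plan_annotations(uri, text, already_tagged, lookup, group_id):
    """Steps 1-4 without the network: scan, skip duplicates, look up, build."""
    payloads = []
    for tag, (exact, start) in find_rrids(text).items():
        if tag in already_tagged:
            continue  # step 2: this RRID was annotated earlier
        payloads.append(
            build_annotation(uri, text, tag, exact, start, lookup(tag), group_id)
        )
    return payloads
```

In a real run, already_tagged would come from an authenticated search of the private group, and each payload would be sent in an authenticated POST. Those are the two places where the token comes into play.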

Human validation and troubleshooting

The validation team developed a tag vocabulary to classify the results. RRIDCUR:Validated means the identifier, and its connection to the registry, are both as expected. RRIDCUR:Unrecognized means that an identifier that should have been recognized was not. Searching for these tags provides feedback used to tune the regular expression that matches RRIDs. RRIDCUR:Unresolved means that an identifier that should have been found in the registry wasn’t.
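To give a feel for how that vocabulary can be put to work, here is a small sketch that tallies curation tags across a set of annotations. It assumes each annotation is a dictionary with a "tags" list, which is how I'd expect rows from a search result to look. The function name and the sample rows in the usage lines are invented for illustration.

```python
from collections import Counter

CURATION_PREFIX = "RRIDCUR:"


def curation_summary(annotations):
    """Count curation outcomes (Validated, Unrecognized, ...) across annotations."""
    counts = Counter()
    for ann in annotations:
        for tag in ann.get("tags", []):
            if tag.startswith(CURATION_PREFIX):
                counts[tag[len(CURATION_PREFIX):]] += 1
    return counts


# Made-up rows, shaped like search results, purely for illustration.
rows = [
    {"tags": ["RRID:AB_2088494", "RRIDCUR:Validated"]},
    {"tags": ["RRIDCUR:Unrecognized"]},
    {"tags": ["PMID:26609158"]},
]
summary = curation_summary(rows)
```

A high count for Unrecognized would suggest the matching pattern needs another look, while Unresolved points at the registry instead.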

Human validators can also, in some cases, enhance the available information. The tag PMID:26609158, for example, means that the article being validated is available in PubMed at that ID. Depending on the context in which an article appears, such cross-references are often, but not always, available. Annotation is a way to enrich article metadata and help interconnect syndicated copies of articles.

The door is open!

You can find SciBot on GitHub, and that’s the best place to ask about and discuss the details of how it works. We feature it here because it’s a great example of the kinds of integrations that become possible when you can create annotations programmatically. Raymond Yee cracked that door open last April. Now, with official API tokens, it’s wide open. If you step through and create an interesting and useful integration, let us know. Annotating this blog post is a great way to do that!