
Annotating to extract findings from scientific papers

By judell | 15 December 2015

David Kennedy is a neurobiologist who periodically reviews the literature in his field and extracts findings, which are structured interpretations of statements in scientific papers. He recently began using Hypothesis to mark up the raw materials for these findings, which he then manually compiles into a report that looks like this:

 

[Image: kennedy_0, Dr. Kennedy's compiled findings report]

 

The report says that two analysis methods were used: Voxel-based morphometry (VBM) and voxel-based relaxometry (VBR). The relevant statement in the paper is:

“Voxel-based morphometry (VBM) and voxel-based relaxometry (VBR) were subsequently performed.”

To extract these two facts, Dr. Kennedy annotates the phrases “Voxel-based morphometry (VBM)” and “voxel-based relaxometry (VBR)” with comments like “Analysis method: VBM” and “Analysis method: VBR”. You can see such annotations here:

 

[Image: kennedy_1, annotations on the VBM and VBR phrases]

 

This was really just a form of note-taking. To create the final report, Dr. Kennedy had to review those notes and laboriously compile them. How might we automate that step? To explore that possibility we defined a protocol that relies on a controlled set of tags and a convention for expressing sets of name/value pairs. Here’s that same article annotated according to that protocol:

[Image: the article annotated according to the protocol]

The tag MonthlyMorphometryReport identifies the set of annotations that belong in the report. The prefix AcquisitionMethod: has no formal meaning, but we agreed that it targets that section of the report and that tags carrying the prefix can otherwise be freeform, so annotations tagged with AcquisitionMethod:VBR and AcquisitionMethod:VBM will both land in that section.
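To make the convention concrete, here is a minimal sketch in Python of how a script might bucket report annotations by tag prefix. The tags MonthlyMorphometryReport and AcquisitionMethod: come from the protocol described above; the annotation dictionaries and the helper function are hypothetical illustrations, not the actual implementation:

```python
from collections import defaultdict

REPORT_TAG = "MonthlyMorphometryReport"

def bucket_by_prefix(annotations):
    """Group report annotations by tag prefix (e.g. AcquisitionMethod)."""
    sections = defaultdict(list)
    for ann in annotations:
        tags = ann.get("tags", [])
        if REPORT_TAG not in tags:
            continue  # only annotations tagged for this report
        for tag in tags:
            if ":" in tag:
                prefix, value = tag.split(":", 1)
                sections[prefix].append(value)
    return sections

# Hypothetical annotations illustrating the convention.
anns = [
    {"tags": [REPORT_TAG, "AcquisitionMethod:VBM"]},
    {"tags": [REPORT_TAG, "AcquisitionMethod:VBR"]},
]
print(sorted(bucket_by_prefix(anns)["AcquisitionMethod"]))
# -> ['VBM', 'VBR']: both annotations land in the same report section
```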

Sometimes tags aren’t enough. Consider the statement:

VBM detected significant tissue changes within the substantia nigra, midbrain and dentate together with significant cerebellar atrophy in patients (FWE, p < 0.05). Iron deposition in the caudate head and cavitation in the lateral globus pallidus correlated with UDRS score (p < 0.001). There were no differences between groups with VBR.

It expresses a set of findings, such as “There were no differences between groups with VBR,” and a set of related facts. Here’s how we annotated the findings and the associated facts:

 

[Image: kennedy_3, annotated findings and their associated facts]

 

We use Finding:VBM1 for facts extracted from the sentence “VBM detected significant tissue changes within the substantia nigra, midbrain and dentate together with significant cerebellar atrophy in patients (FWE, p < 0.05)” and Finding:VBM2 for facts extracted from the sentence “Iron deposition in the caudate head and cavitation in the lateral globus pallidus correlated with UDRS score (p < 0.001)”.
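The same idea extends to findings: a script can use the Finding:<id> tags to group related facts. The sketch below assumes, per the name/value convention mentioned above, that each annotation body carries "Name: value" lines; the sample data paraphrases the quoted sentences, and the helper is hypothetical:

```python
from collections import defaultdict

def facts_by_finding(annotations):
    """Collect name/value facts under each Finding:<id> tag."""
    findings = defaultdict(dict)
    for ann in annotations:
        ids = [t.split(":", 1)[1] for t in ann.get("tags", [])
               if t.startswith("Finding:")]
        pairs = {}
        for line in ann.get("text", "").splitlines():
            if ":" in line:  # assumed "Name: value" body convention
                name, value = line.split(":", 1)
                pairs[name.strip()] = value.strip()
        for fid in ids:
            findings[fid].update(pairs)
    return findings

# Hypothetical annotations paraphrasing the quoted sentences.
anns = [
    {"tags": ["Finding:VBM1"], "text": "Region: substantia nigra"},
    {"tags": ["Finding:VBM1"], "text": "Significance: FWE, p < 0.05"},
    {"tags": ["Finding:VBM2"], "text": "Correlate: UDRS score"},
]
print(facts_by_finding(anns)["VBM1"])
# -> {'Region': 'substantia nigra', 'Significance': 'FWE, p < 0.05'}
```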

Given this protocol, a script can now produce output that matches the handwritten report:

 

[Image: kennedy_4, script output matching the handwritten report]
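A script along those lines might start from the Hypothesis search API. The sketch below reflects the public endpoint at https://api.hypothes.is/api/search; pagination, error handling, and the report-assembly details are omitted, and the code is illustrative rather than the actual script:

```python
import json
import urllib.request

def search_annotations(tag, limit=200):
    """Fetch annotations carrying a given tag from the Hypothesis API."""
    url = f"https://api.hypothes.is/api/search?tag={tag}&limit={limit}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["rows"]

# Pull the report's annotations, then list the acquisition methods found.
rows = search_annotations("MonthlyMorphometryReport")
methods = sorted({t.split(":", 1)[1]
                  for row in rows for t in row["tags"]
                  if t.startswith("AcquisitionMethod:")})
print("Analysis methods:", ", ".join(methods))
```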

 

Clearly this approach cries out for the ability to declare and use controlled vocabularies, and we aim to deliver that. But even in its current form it shows much promise. In domains like bioscience, where users like Dr. Kennedy are familiar with the principles of structured annotation and tolerant of the conventions required to enable it, Hypothesis can already be an effective tool for annotation-driven data extraction. Its native annotation and tagging capabilities enable users to create useful raw material for downstream processing, and its API delivers that raw material in an easily consumable way.
