Fund: On-Demand Web Archiving Completion

By bigbluehat | 13 July, 2015

Ilya Kreymer has completed the requirements for the On-Demand Web Archiving project funded via the Open Annotation Fund! You can read his write-up of the project below (re-posted from the webrecorder.io blog):

Introducing Browsertrix

The final result of the On-Demand Web Archiving project is a new tool called Browsertrix, designed to automate web archiving through the browser in a general way. It is available at: https://github.com/ikreymer/browsertrix

The first iteration of the tool supports archiving a single page by loading it in a headless Chrome or Firefox, driven by Selenium and containerized in Docker.
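To illustrate the basic mechanism, here is a minimal sketch of a worker driving one of the containerized Selenium browsers. The hub URL and port are placeholders for wherever the Selenium container is reachable, and the snippet uses the Selenium 2-style Python API current at the time of writing; it is not code from the Browsertrix repository.

```python
from selenium import webdriver

# Connect to a remote Selenium Chrome instance running in its own container.
# The host and port are placeholders; 4444 is the default for the standard
# Selenium Docker images.
driver = webdriver.Remote(
    command_executor="http://localhost:4444/wd/hub",
    desired_capabilities=webdriver.DesiredCapabilities.CHROME,
)

# Load the page to be archived in the remote headless browser.
driver.get("http://hypothes.is/")

driver.quit()
```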

This can be integrated with Hypothes.is so that any page is automatically archived whenever an annotation is made.

While initial plans suggested using PhantomJS as the headless browser, it was decided to automate real browsers (Chrome and Firefox) through Selenium instead. The reasons for this are two-fold. First, the availability of ready-made Docker images for Selenium made setting up headless Chrome and Firefox much simpler. Second, due to the complexity of many sites, using real browsers produces the most accurate archive of the user experience of a web page and avoids any subtle differences that may occur between PhantomJS and Chrome or Firefox. By using Selenium, support for additional browsers, including PhantomJS, can also be added as needed.

The tool uses Docker Compose to connect several Docker containers: Selenium Chrome and Firefox, Python workers that connect to either Chrome or Firefox, a Redis instance for storing shared state, and a web app for handling user requests. Using the Docker Compose scale feature, the number of Chrome and Firefox workers can be scaled dynamically as needed.
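As a rough sketch of how these containers could be wired together with Docker Compose (the service names and images here are illustrative, not the exact configuration from the Browsertrix repository):

```yaml
# docker-compose.yml (sketch; service names and images are illustrative)
redis:
  image: redis

chrome:
  image: selenium/standalone-chrome

firefox:
  image: selenium/standalone-firefox

chromeworker:
  build: .
  links:
    - redis
    - chrome

firefoxworker:
  build: .
  links:
    - redis
    - firefox

app:
  build: .
  links:
    - redis
  ports:
    - "8080:80"

# Workers can then be scaled dynamically, e.g.:
#   docker-compose scale chromeworker=4 firefoxworker=2
```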

Here is the front page of Browsertrix in its current version:

(For demo purposes, this site is also hosted at archivethis.website, but it is not meant for production use.)

The tool works by receiving requests at the /archivepage endpoint, which take the URL, the archive to use, and the browser. For example, a request to /archivepage?url=hypothes.is&archive=webrecorder&browser=chrome will result in http://hypothes.is/ being loaded in Chrome and archived using the webrecorder.io service.

The supported services are currently webrecorder.io and Internet Archive Save Page Now.
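Programmatically, the same request might look like the following sketch, which assumes a local deployment of the web app (the host and port are placeholders):

```python
import requests

# Placeholder address for a local Browsertrix deployment.
BASE_URL = "http://localhost:8080"

resp = requests.get(BASE_URL + "/archivepage", params={
    "url": "hypothes.is",      # page to archive
    "archive": "webrecorder",  # archiving service to use
    "browser": "chrome",       # browser to load the page in
})

print(resp.json())  # see the repository for the exact JSON response format
```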

The above request can be made through the UI by entering hypothes.is and clicking “Archive This Website!”:

When using webrecorder.io, the user can download the full web archive WARC file for their own storage. (Note that webrecorder.io does not permanently store the archive, although permanent storage is available in the new webrecorder beta, which will also be supported soon.)

When recording with Chrome, the response includes a full log of all the embedded URLs recorded (at this time, Firefox does not provide this functionality). These can be seen by looking at the raw response:

For more info on the API and the JSON response format, please look at: https://github.com/ikreymer/browsertrix

If the request does not complete within a timeout (30 seconds), a response indicating that the URL has been queued is returned instead. The user should then retry the request to see whether the archiving has completed.
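A client can therefore simply poll the endpoint until archiving finishes. The sketch below assumes a hypothetical "queued" field in the JSON response; see the repository for the actual format.

```python
import time

import requests


def archive_page(base_url, url, archive="webrecorder", browser="chrome",
                 retries=10, wait=30):
    """Request archiving of a page, retrying while it is still queued.

    The "queued" field checked below is an assumption for illustration;
    the real response schema is documented in the Browsertrix repository.
    """
    params = {"url": url, "archive": archive, "browser": browser}
    result = {}
    for _ in range(retries):
        result = requests.get(base_url + "/archivepage", params=params).json()
        if not result.get("queued"):
            return result    # archiving completed (or failed with an error)
        time.sleep(wait)     # still queued: wait and ask again
    return result
```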

By default, the response is cached for 30 seconds so that new requests to archive the same URL use the existing copy, although this can be changed in the settings.

Any errors will be displayed to the user as well.

For example, when saving to the Internet Archive Wayback Machine, archiving is subject to that service's policies, so sites blocked by robots.txt restrictions cannot be archived. This is reported as an error:

Additional archiving handlers can also be added as needed to Browsertrix.

The hope is that Browsertrix can be developed into a full-fledged browser-based crawling system.

For now, this first iteration provides a flexible mechanism for Hypothes.is (and others) to fully archive any number of pages, one page at a time.
