31 changes: 31 additions & 0 deletions writeup.md
@@ -0,0 +1,31 @@
British Library/UCL Open Books for e-Research
=============================================

Summary
-------

RITS worked with digital scholarship experts and historians to use Legion to analyse a corpus of over sixty thousand digitised public domain books. Provided by the British Library, the collection dates from the seventeenth to nineteenth centuries. Research questions such as "how often are different diseases mentioned" were answered. Analyses such as these, which would have taken over six days on a normal personal computer, could be completed in under an hour on the UCL high-performance computing cluster, Legion. We built a framework to enable researchers to express complex textual analyses as simple Python functions, and optimised the data layout to make best use of the parallel file system capabilities of Legion.
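
To give a flavour of what "simple Python functions" might look like for the disease question, here is a minimal sketch. It is not the project's framework: the names `DISEASES`, `count_mentions` and `merge` are hypothetical, the term list is an assumption, and the two "books" are toy strings. It only illustrates the general pattern of a per-book function whose results are then combined, which is the kind of work that parallelises naturally across a cluster.

```python
import re
from collections import Counter

# Hypothetical search terms; the project's actual list is not given in this write-up.
DISEASES = ("cholera", "smallpox", "typhus")


def count_mentions(text, terms=DISEASES):
    """Per-book step: count whole-word, case-insensitive mentions of each term."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return Counter({term: words[term] for term in terms if words[term]})


def merge(counts):
    """Combine step: sum the per-book counts into a single total."""
    total = Counter()
    for c in counts:
        total.update(c)
    return total


# Toy stand-ins for the text of two books.
books = [
    "Cholera spread through the town; the cholera hospital was full.",
    "Smallpox and typhus were feared, but cholera more so.",
]

totals = merge(count_mentions(book) for book in books)
print(dict(totals))
```

Because each call to `count_mentions` depends on one book only, the per-book step could in principle be spread across many processes, with `merge` run once at the end.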

![Frequency of mentions of diseases](https://github.com/UCL-dataspring/visualisations/blob/master/diseases/outputs/diseases%20(WEB).png?raw=true)

Background
----------

In February this year, [Professor Melissa Terras](http://www.ucl.ac.uk/dis/people/melissaterras) of [UCL Digital Humanities](http://www.ucl.ac.uk/dh) and Dr James Baker of the British Library [Digital Research Team](http://britishlibrary.typepad.co.uk/digital-scholarship/) pitched to JISC's [Research Data Spring programme](http://www.jisc.ac.uk/rd/projects/research-data-spring).

The Data Spring programme is providing funding to a variety of pilot projects in order to find “new technical tools, software and service solutions, which will improve researchers’ workflows and the use and management of their data”.

The idea that UCL and the British Library (BL) pitched is that the BL has numerous digital datasets, but lacks the processing power for users to analyse them or run advanced queries against them. Rapid, indexed full-text search is easy enough, through [], but many questions require more complex queries: looking for terms in proximity to each other or to illustrations, and cross-correlating words and publishing locations to build temporal and geospatial visualisations of change.
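
As an illustration of why proximity queries go beyond an indexed keyword search, the sketch below finds places where two terms occur within a given number of words of each other. The function name `cooccurrences`, the `window` parameter and the sample sentence are all assumptions made for this example, not part of any BL or UCL service.

```python
import re


def cooccurrences(text, term_a, term_b, window=5):
    """Return (i, j) word-index pairs where term_a and term_b are at most `window` words apart."""
    words = re.findall(r"[a-z]+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [j for j, w in enumerate(words) if w == term_b]
    return [(i, j) for i in pos_a for j in pos_b if abs(i - j) <= window]


# Toy sentence standing in for a page of digitised text.
sample = (
    "The cholera outbreak in London was severe, "
    "while cholera was rare in the far north of Scotland."
)

hits = cooccurrences(sample, "cholera", "london", window=3)
print(hits)
```

A query like this has to scan the words of every page rather than look up a single term in an index, which is where the extra processing power becomes relevant.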

We will use UCL’s world-leading Research Computing to open up this digital data, investigating the needs and requirements of a service that will allow researchers to undertake complex searching of the BL’s digital content.



Approach
--------

Outcomes
--------