To make books that are still under copyright available for computational text analysis, the HathiTrust Research Center (HTRC) has devised a machine-readable data format called Extracted Features. This repository contains data and Jupyter Notebooks for analyzing over three thousand works of speculative fiction in HathiTrust, all published between 1900 and 1999.
The tutorials here assume a working familiarity with Python and Jupyter Notebooks. For those new to both, The Programming Historian's "Introduction to Jupyter Notebooks" by Quinn Dombrowski, Tassie Gniady, and David Kloster is a great starting point. We'll also be working with Pandas, a Python library for working with tabular data. Melanie Walsh's Intro to Cultural Analytics course includes a fantastic overview of Pandas (as well as the conceptual and ethical challenges inherent to data work in the humanities).
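As a warm-up for the Pandas work in the notebooks, here is a minimal sketch of loading a tab-separated volume list like the one in this repository. The column names (`htid`, `title`, `year`) and the inline sample rows are assumptions for illustration; the real file may be structured differently.

```python
import io
import pandas as pd

# Hypothetical excerpt standing in for the repository's TSV of volumes;
# the actual column names and values may differ.
sample_tsv = io.StringIO(
    "htid\ttitle\tyear\n"
    "mdp.39015000000001\tExample Novel\t1950\n"
    "uc1.b000000002\tAnother Story\t1972\n"
)

# The volume list is tab-separated, so pass sep="\t" to read_csv.
volumes = pd.read_csv(sample_tsv, sep="\t")

# Filtering with a boolean mask is a core Pandas idiom
# you will see throughout the tutorials.
postwar = volumes[volumes["year"] > 1945]
```

To read the real file, you would swap the `StringIO` object for its path, e.g. `pd.read_csv("data/thompson-mimno-SF-final-matches.tsv", sep="\t")`.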
- 👉 `htrc_sf_experiments.ipynb` is the main tutorial. Start there.
- Instructions for running TF-IDF on a volume (TK)
- 📁 `/data/SF_Extracted_Features_Full` contains Extracted Features files for each volume (i.e., book)
- 📁 `/data/thompson-mimno-SF-final-matches.tsv` is a list of all SF volumes in HTRC identified by David Mimno and Laure Thompson
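The repository's own TF-IDF instructions are still forthcoming, but the underlying idea can be sketched in plain Python. This is an illustrative example only, not the notebook's method: the toy `Counter` objects stand in for the per-volume token counts you would actually read from the Extracted Features files, and the raw-frequency TF with log IDF is just one common variant.

```python
import math
from collections import Counter

# Toy token counts standing in for two Extracted Features volumes;
# in practice these would be aggregated from the JSON files in the
# data directory.
volumes = {
    "vol1": Counter({"robot": 5, "ship": 2, "the": 40}),
    "vol2": Counter({"ship": 3, "alien": 4, "the": 35}),
}

def tf_idf(term, counts, corpus):
    """Score a term in one volume: raw term frequency times log IDF."""
    tf = counts[term] / sum(counts.values())
    doc_freq = sum(1 for c in corpus.values() if term in c)
    idf = math.log(len(corpus) / doc_freq) if doc_freq else 0.0
    return tf * idf

# "robot" appears only in vol1, so it scores high there;
# "the" appears everywhere, so its IDF (and score) is zero.
robot_score = tf_idf("robot", volumes["vol1"], volumes)
the_score = tf_idf("the", volumes["vol1"], volumes)
```

Words shared by every volume get an IDF of `log(1) = 0`, which is exactly why TF-IDF surfaces distinctive vocabulary rather than common function words.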