Using HathiTrust

One of the biggest struggles throughout this whole process has been locating data for analysis. Due to American copyright law, a ton of classics written as far back as the 1930s are still in copyright and thus cannot be legally obtained through databases such as Project Gutenberg and the Internet Archive. To obtain these texts, I could OCR each individual work, which does work for some of them but is very labor-intensive and time-consuming overall.

One answer to this is HathiTrust and HTRC Analytics, the analysis environment linked with it. HathiTrust is an online library that grants a wide range of access to individuals associated with one of its many partner institutions. It contains many texts and segments of texts that are still in full copyright. Luckily for me, Pomona College happens to be a partner institution, and thus I can make use of the wide array of books found on HathiTrust.

However, since these texts are still in copyright, analyzing them is not as quick and simple as pulling the works from HathiTrust and downloading them to our computer. There are a couple of steps to go through, which can be slightly confusing and daunting. Since HathiTrust wants to make sure you aren’t pirating a bunch of texts straight out of their library, any analysis has to happen inside their environment: HTRC Analytics.

I will attempt to break down my general process of performing analysis on a collection of texts with HathiTrust and HTRC Analytics.

For this tutorial, I will be using a large corpus of science fiction works created by Alex Wermer-Colan and co. at Temple University. I will also be using Python.

1. Creating a collection

To start, we must first log into HathiTrust through our member organization. Once logged in, we create a collection by navigating to ‘My Collections’ on the left of the page, and clicking ‘New Collection’ in the top right of the page. Name your collection and set it to public.

Now we can begin choosing texts to add to our collection. You can search for texts you’d like to add with the search bar. Upon finding a text you’d like to add, simply tick the box next to it and click ‘Add Selected’ on the right of the screen. Once you have added all the texts you want to analyze, you are ready to move on to the next step.

2. HTRC Analytics

Now we navigate to ‘https://analytics.hathitrust.org/capsules’. This will be where we analyze the collection we created in HathiTrust. First, we must log in. There are several options for analysis in HTRC Analytics, but for this tutorial we will be creating a data capsule.

In the top right corner, we click the create data capsule button. For our purposes, we select the Research capsule type, since this gives us a larger body of texts and more room for analysis.

We fill out all the required fields and also tick the box marked ‘Request Capsule with computational access to the full HathiTrust Corpus’. This will add a couple of questions to our capsule creation application but will also allow us access to certain texts that are copyrighted. We select create capsule, and voila, we have our data capsule ready for use. Note that if you did request access to the full HathiTrust Corpus, it may take a little while for your request to be accepted.

3. Using the Data Capsule

After creating our data capsule, we should see it appear under our ‘Data Capsules’ page. We can click start capsule to start it up. It’ll take a little while to launch, but once it does you have the option of using the capsule through the terminal or remoting into it in the form of a virtual machine. The data capsule essentially exists as a separate desktop that those copyrighted texts are stored in.

Once our capsule is started, we can click on its ID number, which gives us the option to ‘Connect via Terminal’ or ‘Connect via Remote Desktop’. For this tutorial, we will be connecting through a remote desktop. We want to connect to our remote desktop in maintenance mode. This is important, as maintenance mode allows us to get our analysis script into Hathi.

Now, we want to have the script we want to run on our collection of texts ready. We need to get that script onto the Hathi data capsule virtual machine. There are a couple of ways to do it:

If you select the Advanced Features button at the bottom of the page, you can set up an SSH key with this Hathi virtual machine, which will allow you to pass in your scripts with something like rsync or pscp.

You can also package up your scripts in a virtual environment with all the necessary dependencies and executables in one folder. You can then email that to yourself or add it to some sort of cloud storage and download it to your Hathi virtual machine.

You can also package the script and all of its dependencies into a single executable file with something like PyInstaller. Send that to yourself somehow and then download it onto the Hathi virtual machine.
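
For the folder-based option, here is a minimal sketch of how the bundling step might look in Python on your own machine. The function name `package_scripts`, the folder name `analysis_project`, and the archive name `analysis_bundle` are all hypothetical; the only library call involved is the standard library's `shutil.make_archive`. The demo builds a throwaway folder so it can run anywhere without touching real files.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def package_scripts(source_dir, archive_base):
    """Zip everything under source_dir and return the path to the .zip file."""
    source = Path(source_dir)
    if not source.is_dir():
        raise NotADirectoryError(f"{source} is not a folder")
    # make_archive appends ".zip" to archive_base and returns the full path.
    archive = shutil.make_archive(str(archive_base), "zip", root_dir=str(source))
    return Path(archive)


def demo_package():
    """Bundle a throwaway project folder and list what ended up in the zip."""
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "analysis_project"  # hypothetical folder name
        project.mkdir()
        (project / "analyze.py").write_text("print('hello from the capsule')\n")
        # Keep the archive outside the project folder so it does not
        # try to include itself.
        archive = package_scripts(project, Path(tmp) / "analysis_bundle")
        with zipfile.ZipFile(archive) as bundle:
            return bundle.namelist()
```

In practice you would point `package_scripts` at your real project folder, upload the resulting zip to cloud storage, and download and unzip it inside the capsule while it is still in maintenance mode.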

You now have your script on the Hathi virtual desktop.

Now, we want to switch from maintenance mode to secure mode. Secure mode does not allow us access to most of the computer’s features, as it is intended to preserve the copyright of the texts.

Once we are in secure mode, we want to open up a terminal from the virtual desktop. We also want to have our collection URL ready. To get the collection URL, navigate to the collection you previously created and copy the link labeled ‘Link to this collection’ in the middle left of the page, NOT THE URL OF THE WEBPAGE.

Next, we run the command “htrc download ‘your collection url’ -c” (without the quotation marks). This will bring the texts in your collection into the workset folder on your virtual machine. You can now run your script on the resulting txt files.
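
To give a sense of what "running your script" can look like, here is a rough sketch of a word-frequency pass over a folder of txt files. It is not the exact script I used. The path in `WORKSET_DIR` is an assumption about where the download lands, so check your own capsule, and the function names and the simple regex tokenizer are just illustrative choices.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path

# Assumed location of the downloaded volumes; adjust this to wherever
# the txt files actually ended up in your capsule.
WORKSET_DIR = Path("/media/secure_volume/workset")

# Very simple tokenizer: runs of letters, optionally with one apostrophe.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def count_words(text):
    """Return a Counter of lowercase word tokens in one text."""
    return Counter(WORD_RE.findall(text.lower()))


def corpus_word_counts(workset_dir):
    """Map each .txt file name (without extension) to its word counts."""
    counts = {}
    for path in sorted(Path(workset_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        counts[path.stem] = count_words(text)
    return counts


def demo_word_counts():
    """Run the counter on two tiny made-up files in a temporary folder."""
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "vol1.txt").write_text(
            "The ship left. The crew slept.", encoding="utf-8")
        (Path(tmp) / "vol2.txt").write_text(
            "The robot didn't sleep.", encoding="utf-8")
        return corpus_word_counts(tmp)
```

Inside the capsule, the call would be something like `corpus_word_counts(WORKSET_DIR)`, which gives one `Counter` per volume.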

Now, we want the results our script produced. However, we can’t simply email or upload them. We need to export our results through Hathi. Only NON-CONSUMPTIVE results can be exported: things like sentiment analysis results and word frequencies, not actual ordered and readable text.
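
As an illustration of what a non-consumptive output can look like, the sketch below reduces each volume to a single aggregate number and writes one CSV row per volume, so none of the original text survives in the file. The word lists in `POSITIVE` and `NEGATIVE` are a tiny made-up lexicon purely for demonstration, and the function names are hypothetical; a real analysis would use an established sentiment resource.

```python
import csv
import io

# Tiny hypothetical lexicon, for illustration only.
POSITIVE = {"good", "bright", "hope"}
NEGATIVE = {"bad", "dark", "fear"}

PUNCTUATION = ".,;:!?\"'"


def polarity(text):
    """Score a text as (positive hits - negative hits) / total hits, or 0.0."""
    words = [w.strip(PUNCTUATION) for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return (pos - neg) / total if total else 0.0


def write_polarities(texts_by_volume, out_file):
    """Write one row per volume: its ID and a single aggregate score."""
    writer = csv.writer(out_file)
    writer.writerow(["volume_id", "polarity"])
    for volume_id in sorted(texts_by_volume):
        score = polarity(texts_by_volume[volume_id])
        writer.writerow([volume_id, f"{score:.3f}"])


def demo_polarities():
    """Score two made-up snippets and return the CSV text."""
    sample = {
        "vol_a": "A bright day full of hope.",
        "vol_b": "Dark halls and fear, but good friends.",
    }
    buffer = io.StringIO()
    write_polarities(sample, buffer)
    return buffer.getvalue()
```

To write a real file instead of an in-memory buffer, open it with `open(path, "w", newline="")` and pass that to `write_polarities`.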

We first need to open a terminal in our virtual desktop. We then type in “cd /media/secure_volume”. Now, let’s say my sentiment analysis results for the collection are in ‘/home/tai/results/sentiment_polarities.csv’: we type ‘releaseresults add /home/tai/results/sentiment_polarities.csv’ into the terminal. Once we have added all of the results we want, we type in the command ‘releaseresults done’ and our files will be delivered to the email address we registered with.
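
If you have several result files, it can help to generate the terminal lines rather than type each one by hand. The sketch below only builds the text of the commands in the order described above (change directory, one add per file, then done); it does not run anything. The helper name `release_commands` and the second file path in the usage note are hypothetical, and I am assuming the command is spelled `releaseresults`, as in the ‘done’ step.

```python
import shlex

SECURE_VOLUME = "/media/secure_volume"  # directory named in the step above


def release_commands(result_paths):
    """Return the terminal lines to type, in order, for the given result files."""
    if not result_paths:
        raise ValueError("no result files given")
    lines = [f"cd {shlex.quote(SECURE_VOLUME)}"]
    for path in result_paths:
        # shlex.quote adds quotes only when a path needs them (e.g. spaces).
        lines.append(f"releaseresults add {shlex.quote(str(path))}")
    lines.append("releaseresults done")
    return lines
```

For example, `print("\n".join(release_commands(["/home/tai/results/sentiment_polarities.csv", "/home/tai/results/word counts.csv"])))` prints four lines you can paste into the capsule terminal one at a time, with the second path quoted because it contains a space.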

And there you are: we have now analyzed a series of copyrighted texts in a manner that is neither illegal nor shady! Exciting stuff. Hopefully all of this wasn’t too convoluted.
