Yesterday, the Libraries had the pleasure of a visit by three representatives of the HathiTrust Research Center. The HTRC exists to facilitate scholarly work with the large corpus of digitized materials in the HathiTrust repository. Their goal is “to help meet the technical challenges of dealing with massive amounts of digital text that researchers face by developing cutting-edge software tools and cyber infrastructure.” Our guests – Robert McDonald, Miao Chen, and Zong Peng – provided a high-level overview of the Research Center, facilitated a hands-on session with some of the data and tools, and led us in a community discussion about local interest in text mining and the HathiTrust corpus.

Attendees at the hands-on workshop learn to access and mine data from the HathiTrust

There was a lot of information presented about how the HTRC came into being, its current architecture and projects, and plans for its future. I think the most useful thing for me, though, was getting a better sense of their current capabilities for – and attitude towards – working with researchers who want to use the data.

They want people to use the data and take advantage of the tool set they are building. It’s not only their primary reason for being; it’s also necessary to the development process to have users who can contribute to testing and requirements gathering. That said, the HTRC architecture and services are still very much a work in progress, and are not yet in a place where lots of unmediated, self-service research is possible. That means that most research projects will require an investment of HTRC staff time to facilitate, and staff time is, naturally, at a premium. So what do you do when you need people to use your services, but your services are still in development? You prioritize, of course. The HTRC is focusing on a small number of research projects where the researcher is (1) at a HathiTrust member institution (which OSU is!) and (2) willing to partner with them on the necessary development. This is not to say that they aren’t willing to support other projects, especially if they can do so in a simple way – such as a data dump that the researcher can manipulate locally.

The other really interesting thing for me was learning how carefully they are walking the line between facilitating research and protecting the copyrighted material in the HathiTrust from unsanctioned access and use. Legal action around the Google Books project and the HathiTrust has meant heightened scrutiny of security measures at all levels. As with other uses of copyrighted material, the focus seems to be on determining the smallest amount of access necessary to accomplish the research at hand, and on facilitating that access in a responsible way. It’s not an easy task, but I was impressed with how well they were handling it.

If you are interested in working with the data in the HathiTrust, I would encourage you to contact Miao Chen, the Assistant Director for Education and Outreach, at miaochen@indiana.edu.