Harvard Law School Library Innovation Lab Preserves Federal Data

by Lisa Peet
Mar 06, 2025

The Harvard Law School Library Innovation Lab (LIL), a team of librarians, technologists, and lawyers, has created a data vault to download, authenticate, and provide access to copies of public government data that may be in danger of disappearing. The project will collect major portions of the datasets tracked by data.gov, federal GitHub repositories, and PubMed: information of value to researchers, scholars, and policymakers. When the public-facing site launched on February 6, the data vault had collected 16 terabytes of metadata and primary contents for more than 300,000 datasets available on data.gov. It will be updated regularly as new datasets become available.

The project is funded partially by a grant from the Filecoin Foundation for the Decentralized Web, which supports the development of open-source software and open protocols for decentralized data storage and retrieval networks, and by a grant from the Rockefeller Brothers Fund. LIL will continue to seek funding as the initiative evolves.

NECESSARY WORK

In the first 10 days of Donald Trump’s current presidential term, more than 2,000 datasets disappeared from data.gov, the largest repository of U.S. government open data on the internet. And while this is not necessarily the result of deliberate takedowns by the new administration—some data loss can be the routine result of administration change, or casualties of link rot—the need to preserve government data in a discoverable format is critical, particularly in light of the large amounts of climate change data deleted during Trump’s first term and his recent spate of executive orders and demands to modify agencies’ language around environmental and diversity issues.

For years, several organizations have been capturing online government information ahead of presidential administration changeovers. The End of Term (EOT) Web Archive, for example, has been making records of government websites before and after presidential administration changes since 2008. The Internet Archive supports EOT archiving as part of its Wayback Machine collection, which has been active for more than 25 years. The Environmental Data & Governance Initiative (EDGI), a cross-disciplinary research collaborative formed in 2016, documents and analyzes changes to environmental data and governance practices. (For an extensive list of data preservation work, see the Data Rescue Project’s Current Efforts page.)

The data vault project continues the work of preserving government records that the Harvard Law School Library has been doing for centuries. The web preservation and citation tool Perma.cc preserves millions of links used by courts and law journals. The Caselaw Access Project has digitized 40,000 volumes of American case law and provides free and open access to them.

This effort focuses on datasets rather than web archives. “As I started to think about how the End of Term Archive works, I realized that there were some things that it might miss that would be important to me as an archivist,” LIL Director Jack Cushman told LJ. A web crawler works by following one link to the next, a bulk collection method that has enabled the End of Term Archive to amass some 600 terabytes of data in the run-up to the administration change. But it could miss datasets that require different types of searches, a form to download, or JavaScript to render the interface. “A generic crawler won’t be able to do any of those things, so it’ll end up with the description of the data set but not the data set itself in some cases. Which means, then, that if that part of the thing you’re archiving goes away, you can’t do that research anymore.”

Cushman consulted with Amanda Watson, Harvard Law School Library’s assistant dean for library and information resources, and Jonathan Zittrain, professor of international law and vice dean, library and information resources; all agreed that this was a necessary initiative. “It is so important that we be able to remember where we came from, and the data and records that have existed,” said Cushman. “That’s why libraries and archives exist. That’s why librarians and archivists do their work.”

MANY MOVING PARTS

Cushman has developed a tool to collect everything in data.gov, using custom software to go into data.gov’s API, collect all of its data sets, and then fetch the catalog of every link within them—“for each of those 300,000 data sets, if it says there’s this spreadsheet and there’s this XML file and there’s this map, just go grab each of those things and include those in an archive,” he explained. In addition to its capture, the data needs to be packaged with the appropriate metadata and provenance, stored securely, and provided with a discovery layer so that it can be found and accessed. “All of the things that we are used to in maintaining any collection turn out to apply to this problem of trying to preserve federal data.”
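The link-gathering step Cushman describes can be sketched roughly as follows. This is only an illustration, not LIL's software: it assumes each dataset record is shaped like a CKAN package entry (a "name" plus a "resources" list whose items carry "format" and "url"), the sample record and the function names are hypothetical, and the actual downloading is left out.

```python
from typing import Iterator


def iter_resource_urls(dataset: dict) -> Iterator[tuple[str, str]]:
    """Yield (format, url) for each resource in one dataset record that has a link."""
    for resource in dataset.get("resources", []):
        url = (resource.get("url") or "").strip()
        if url:
            yield (resource.get("format") or "unknown", url)


def build_download_plan(datasets: list[dict]) -> dict[str, list[tuple[str, str]]]:
    """Map each dataset name to the list of resource links an archiver would fetch."""
    return {d["name"]: list(iter_resource_urls(d)) for d in datasets}


# Hypothetical record; the field names are an assumption based on CKAN conventions.
sample = {
    "name": "example-dataset",
    "resources": [
        {"format": "CSV", "url": "https://example.gov/data.csv"},
        {"format": "XML", "url": "https://example.gov/data.xml"},
        {"format": "", "url": ""},  # entries without a link are skipped
    ],
}

plan = build_download_plan([sample])
```

Separating "list what to fetch" from the fetching itself keeps the plan easy to inspect and to store alongside the metadata, which matters when a record describes files that a generic crawler would never reach.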

The tool uses the BagIt standard, a set of file system digital content conventions. Files are stored in a folder and cryptographically signed to add proof of where they originated and that they were collected on a particular date. That way, Cushman said, “even if I was gone and everyone I worked with was gone and Harvard was gone, you could take your copy and convince anyone in the world that this was made by someone with this email address before this date.”
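As a rough illustration of the BagIt layout, the sketch below writes a minimal bag using only the Python standard library: a data/ payload folder, a bagit.txt declaration, and a manifest-sha256.txt listing a checksum for each payload file. The function names and the sample file are hypothetical, and the cryptographic signing Cushman describes is not shown, since the article does not say how the signatures are produced.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(bag_dir: Path, files: dict[str, bytes]) -> None:
    """Write payload files under data/ plus the bag declaration and manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    for name, content in files.items():  # flat file names only in this sketch
        (data_dir / name).write_bytes(content)
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    lines = [f"{sha256_of(p)}  data/{p.name}\n" for p in sorted(data_dir.iterdir())]
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split(maxsplit=1)
        if sha256_of(bag_dir / rel_path) != expected:
            return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp) / "example-bag"
    make_bag(bag, {"report.csv": b"year,value\n2024,1\n"})
    intact = verify_bag(bag)            # payload matches the manifest
    (bag / "data" / "report.csv").write_bytes(b"tampered")
    detected = not verify_bag(bag)      # altered payload no longer matches
```

The manifest is what lets anyone holding a copy confirm the files are unchanged. Signing the manifest and tag files would add the "who and when" that the checksums alone cannot prove.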

The discovery aspect is critical as well. Files will be hosted on Source Cooperative, a free platform for publishing and sharing public interest data. Making copies of the data will be a relatively simple process—“not quite one click, but pretty close”—and those copies will have easy-to-run discovery layers that don’t need a big budget to implement built on top of them. “The goal is to make it so that running a copy is cheap and reliable and has strong provenance, so that we move away from the life we’re living in now, where there really is only one copy of record and it’s easy for that to disappear,” Cushman said.

There has historically been little interest in preserving multiple copies of digital items, he noted. “When we were working with print materials, if something was popular it would be in 20,000 libraries. And then as it gets less popular, you have regional consortia to figure out how we’re going to keep one copy.” With digital materials, however, “there’s overwhelming incentive to invest in one copy and not duplicates, and that creates so much vulnerability.”

And even when the material exists in multiple locations, it still may not be readable down the line. Federal government websites operate thousands of data viewers that help users navigate a particular data set. The data itself may be more readily copied than the viewers, so “if a lot of viewers go down, there will be a period of time when, though the data still exists, it’s not as easy to answer questions from it as it used to be,” said Cushman. Emerging technology, including—but not limited to—AI, will be useful for building general-purpose tools to help users interact with the data being preserved.

Cushman sees this work as part of a continuum among volunteer organizers and hopes that LIL’s work will solve at least one class of problems involved in safeguarding data. “I don't think the current conversation about what is happening in the federal web changes how libraries and archives should be thinking, but I think it underscores how we should be thinking,” he added. “We need to find new strategies to undo that effect of only having one copy. The ice keeps getting thinner and thinner.”

To notify LIL of data you believe should be part of this collection, please contact them at lil@law.harvard.edu.

Lisa Peet

lpeet@mediasourceinc.com

Lisa Peet is Executive Editor for Library Journal.
