The Harvard Law School Library Innovation Lab (LIL), a team of librarians, technologists, and lawyers, has created a data vault to download, authenticate, and provide access to copies of public government data that may be in danger of disappearing. The project will collect major portions of the datasets tracked by data.gov, federal GitHub repositories, and PubMed, all of it information of value to researchers, scholars, and policymakers. When the public-facing site launched on February 6, the data vault had collected 16 terabytes of metadata and primary contents for more than 300,000 datasets available on data.gov. It will be updated regularly as new datasets become available.
The project is funded partially by a grant from the Filecoin Foundation for the Decentralized Web, which supports the development of open-source software and open protocols for decentralized data storage and retrieval networks, and by a grant from the Rockefeller Brothers Fund. LIL will continue to seek funding as the initiative evolves.
In the first 10 days of Donald Trump’s current presidential term, more than 2,000 datasets disappeared from data.gov, the largest repository of U.S. government open data on the internet. And while this is not necessarily the result of deliberate takedowns by the new administration—some data loss can be the routine result of administration change, or casualties of link rot—the need to preserve government data in a discoverable format is critical, particularly in light of the large amounts of climate change data deleted during Trump’s first term and his recent spate of executive orders and demands to modify agencies’ language around environmental and diversity issues.
Several organizations have been capturing online government information ahead of presidential administration changeovers for years. The End of Term (EOT) Web Archive, for example, has been making records of government websites before and after presidential administration changes since 2008. The Internet Archive supports EOT archiving as part of its Wayback Machine collection, which has been active for more than 25 years. The Environmental Data & Governance Initiative (EDGI), a cross-disciplinary research collaborative formed in 2016, documents and analyzes changes to environmental data and governance practices. (For an extensive list of data preservation work, see the Data Rescue Project’s Current Efforts page.)
The data vault project continues the work of preserving government records that the Harvard Law School Library has been doing for centuries. Its web preservation and citation tool, Perma.cc, preserves millions of links used by courts and law journals. The Caselaw Access Project has digitized 40,000 volumes of American case law and provides free and open access to them.
This effort focuses on datasets rather than web archives. “As I started to think about how the End of Term Archive works, I realized that there were some things that it might miss that would be important to me as an archivist,” LIL Director Jack Cushman told LJ. A web crawler works by clicking from link to link—a bulk collection method that has enabled the End of Term Archive to amass some 600 terabytes of data in the runup to the administration change. But it could miss data sets that might require different types of searches or a form to download, or JavaScript to render the interface. “A generic crawler won’t be able to do any of those things, so it’ll end up with the description of the data set but not the data set itself in some cases. Which means, then, that if that part of the thing you’re archiving goes away, you can’t do that research anymore.”
Cushman consulted with Amanda Watson, Harvard Law School Library’s assistant dean for library and information resources, and Jonathan Zittrain, professor of international law and vice dean, library and information resources; all agreed that this was a necessary initiative. “It is so important that we be able to remember where we came from, and the data and records that have existed,” said Cushman. “That’s why libraries and archives exist. That’s why librarians and archivists do their work.”
Cushman has developed a tool to collect everything in data.gov, using custom software to go into data.gov’s API, collect all of its data sets, and then fetch the catalog of every link within them—“for each of those 300,000 data sets, if it says there’s this spreadsheet and there’s this XML file and there’s this map, just go grab each of those things and include those in an archive,” he explained. In addition to its capture, the data needs to be packaged with the appropriate metadata and provenance, stored securely, and provided with a discovery layer so that it can be found and accessed. “All of the things that we are used to in maintaining any collection turn out to apply to this problem of trying to preserve federal data.”
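The collection step Cushman describes (page through the catalog, then gather every file each dataset lists) can be sketched roughly as follows. This is an illustration only, not LIL's software: it assumes the data.gov catalog answers CKAN-style `package_search` requests, and the endpoint constant, the function names, and the sample record with its example.gov URLs are all hypothetical. Network calls are left out so the sketch runs offline.

```python
from urllib.parse import urlencode

# Assumed endpoint: a CKAN-style search API on the data.gov catalog.
CKAN_SEARCH = "https://catalog.data.gov/api/3/action/package_search"


def page_url(start, rows=100):
    """Build the URL for one page of catalog results (CKAN `start`/`rows` paging)."""
    return f"{CKAN_SEARCH}?{urlencode({'start': start, 'rows': rows})}"


def resource_urls(response):
    """Yield (dataset_name, resource_url) pairs from one parsed search response."""
    for dataset in response.get("result", {}).get("results", []):
        for resource in dataset.get("resources", []):
            url = resource.get("url")
            if url:  # skip resources that list no downloadable link
                yield dataset.get("name", ""), url


# Hypothetical response fragment, shaped like a CKAN search result.
sample = {
    "result": {
        "results": [
            {
                "name": "example-dataset",
                "resources": [
                    {"format": "CSV", "url": "https://example.gov/data.csv"},
                    {"format": "XML", "url": "https://example.gov/data.xml"},
                    {"format": "HTML"},  # a description with no link to fetch
                ],
            }
        ]
    }
}

pairs = list(resource_urls(sample))
for name, url in pairs:
    print(name, url)
```

A real harvester would request each `page_url`, parse the JSON, and download every URL that `resource_urls` yields, which is the step a generic link-following crawler can miss.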
The tool uses the BagIt standard, a set of file system digital content conventions. Files are stored in a folder and cryptographically signed to add proof of where they originated and that they were collected on a particular date. That way, Cushman said, “even if I was gone and everyone I worked with was gone and Harvard was gone, you could take your copy and convince anyone in the world that this was made by someone with this email address before this date.”
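To make the BagIt idea concrete, here is a minimal sketch of a bag: payload files under `data/`, a `bagit.txt` declaration, and a `manifest-sha256.txt` listing a checksum for each payload file. It is not LIL's tool; the helper names and the sample file are hypothetical, and the cryptographic signing and timestamping the article mentions are not shown. Only the checksum manifest, which such signatures would cover, is illustrated.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def make_bag(bag_dir, files):
    """Write payload files under data/ plus the BagIt declaration and manifest."""
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)
    for name, content in files.items():
        (data / name).write_bytes(content)
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    # One manifest line per payload file: checksum, whitespace, relative path.
    lines = [
        f"{sha256_of(p)}  {p.relative_to(bag).as_posix()}"
        for p in sorted(data.rglob("*"))
        if p.is_file()
    ]
    (bag / "manifest-sha256.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return bag


def verify_bag(bag_dir):
    """Recompute each checksum in the manifest; False if any file has changed."""
    bag = Path(bag_dir)
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel = line.split(maxsplit=1)
        if sha256_of(bag / rel) != digest:
            return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    bag = make_bag(Path(tmp) / "example-bag", {"budget.csv": b"year,total\n2024,100\n"})
    intact = verify_bag(bag)
    (bag / "data" / "budget.csv").write_bytes(b"year,total\n2024,999\n")  # tamper
    tampered_ok = verify_bag(bag)
    print(intact, tampered_ok)
```

Because the manifest pins the exact bytes of every file, a signature over the manifest is enough to vouch for the whole bag, which is what lets a copy be checked without trusting whoever is hosting it.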
The discovery aspect is critical as well. Files will be hosted on Source Cooperative, a free platform for publishing and sharing public interest data. Making copies of the data will be a relatively simple process ("not quite one click, but pretty close"), and those copies will have easy-to-run discovery layers built on top of them that don’t need a big budget to implement. “The goal is to make it so that running a copy is cheap and reliable and has strong provenance, so that we move away from the life we’re living in now, where there really is only one copy of record and it’s easy for that to disappear,” Cushman said.
There has historically been little interest in preserving multiple copies of digital items, he noted. “When we were working with print materials, if something was popular it would be in 20,000 libraries. And then as it gets less popular, you have regional consortia to figure out how we’re going to keep one copy.” With digital materials, however, “there’s overwhelming incentive to invest in one copy and not duplicates, and that creates so much vulnerability.”
And even when the material exists in multiple locations, it still may not be readable down the line. Federal government websites operate thousands of data viewers that help users navigate a particular data set. The data itself may be more readily copied than the viewers, so “if a lot of viewers go down, there will be a period of time when, though the data still exists, it’s not as easy to answer questions from it as it used to be,” said Cushman. Emerging technology, including—but not limited to—AI, will be useful for building general-purpose tools to help users interact with the data being preserved.
Cushman sees this work as part of a continuum among volunteer organizers and hopes that LIL’s work will solve at least one class of problems involved in safeguarding data. “I don't think the current conversation about what is happening in the federal web changes how libraries and archives should be thinking, but I think it underscores how we should be thinking,” he added. “We need to find new strategies to undo that effect of only having one copy. The ice keeps getting thinner and thinner.”
To notify LIL of data you believe should be part of this collection, please contact them at lil@law.harvard.edu.