Talking into history
The social features of the Polar Bear site make it feel new and vital every time you look at it, not a usual feeling to encounter with archives. Many comments are left by people familiar with the subjects, and they provide additional facts or links to other online content with complementary information. One comment says, “Anthony Rataczak was my maternal uncle. He was born and raised in Poland and came to the United States with his father, John, and his mother, Frances, six sisters, Ann, Mary, Angela, Helen, Pauline, and Celia and one brother, Joseph, in 1905.” For archivists used to working with texts and documents of uncertain origin and content, this collaborative data discovery is thrilling.
Another feature of this archive is its “link paths,” a shaded list of links appearing at the bottom of each page that points to other files in the collection with the caption, “Researchers who viewed this page also viewed.” This allows newer visitors to the site to gain knowledge about related materials before even becoming familiar with the collection.
These collaborative archive 2.0 projects are very much the exception, even in digital archives. Matienzo says the shift toward decentralization of knowledge is coming up against an institution that is based, to a certain degree, on believing that knowledge could be held, preserved, and protected. “I think archivists are starting to recognize the importance of blogs and other electronic content,” he says. “They know that people are developing new uses for content, they know that technology is decentralizing knowledge. [How] archivists/repositories deal with that is a result of a lot of factors, particularly existing institutional culture.”
What librarians know
Archivists are facing the same problems librarians have been dealing with for years, perhaps even decades. As a larger percentage of library content becomes digital and licensing becomes a greater hurdle to maintaining “content” than it is to preservation, how do we keep information from disappearing into a digital black hole? How does the shift in formats translate into a shift in procedures and policies and expectations?
Librarians get it: the content we steward is shifting from print to digital. Our libraries require more hard drive space in addition to more shelf space. Patrons need to know how to click and type as well as how to read. And, yet, what of posterity? How will our paths and trackings through the digital realm be accumulated, organized, even archived?
This question becomes further complicated by the webby-ness of our online interactions and content production. Content is still being generated in static letter, essay, and book formats, but it's also arriving online, prelinked and connected. While the correspondence between Freud and Jung has been collected, trying to track and save the hyperlinkedness of blogs, comments, IMs, and emails is much more complex. As a blogger, I write and link to other things online, and it's become increasingly difficult to write essays without using hyperlinks.
At the 2006 Society of American Archivists conference, I was pleasantly surprised by what I heard, though I became concerned for the future of preserving digital information. As archivist Thomas Lannon said, “This 'unfixedness' of blogging in its electric form is what gives the technology the power of immediacy but also its weakness in impermanence.”
Scope notes
Peter Lyman, professor at the School of Information Management and Systems (SIMS) at University of California–Berkeley, describes four issues in archiving digital media: cultural, legal, economic, and technical. Legal and economic issues are what they've always been. Lyman explains the cultural problem of archiving digital content as one of scope and immediacy. “All documents follow a life cycle from valuable to outdated, but then, perhaps, some become historically important,” he says. “[T]he web is not stored in attics; it just disappears. For this reason, conscious efforts at preservation are urgent. The hard questions are how much to save, what to save, and how to save it.”
The Library of Congress (LC) comments in a Council on Library and Information Resources report about the ebb and flow of availability of web content: “The web is growing steadily and at the same time is continually disappearing. The average life of a web page is only 44 days; 44 percent of the web sites found in 1998 could not be found in 1999.”
This cultural problem of ephemeral content ties in to the technical problem. “First, information must be continuously collected, since it is so ephemeral,” Lyman explains. “Second, information on the web is not discrete; it is linked. Consequently, the boundaries of the object to be preserved are ambiguous.”
What is a blog, really?
That boundaries problem is the crux of the trouble with blogs. We still talk about them as if they were discrete, as if all blogs had similar purposes and presentation. People say they like or dislike blogs, which is like saying you like or dislike the mail, or magazines. A blog is just a method of communicating, online.
A blog is usually defined as a web page with rotating content: the newest material enters at the top of the page, and the older material shuttles off the bottom to the archives. Many blogs have multiple authors, and some group blogs have several thousand contributors. Most are created with a content management system, which is blogging software that provides an easy-to-use front-end interface to what is essentially a small relational database. What you are looking at when you read a blog is a query against that database, which may or may not generate static HTML pages.
History via conversation
For anyone who is concerned with our cultural history, much of it is taking place online, on blogs. Making sense of our collective past, especially our recent past, must take the digital narratives of events into account. Blogs are the diaries and letters of the present day. At the same time, they often refer to other blogs, news reports, digital images, videos, and sound recordings. Understanding a point in the past is made simpler with blogs because readers can easily find contemporaneous informal stories recounting the same events. Yet this creates a problem: figuring out how to store this linked set of web sites and media for later retrieval and/or reassembly.
The Cluetrain Manifesto: The End of Business as Usual (Perseus) by Rick Levine et al. asserts that markets are conversations. What is becoming clearer is that history is also a conversation. The truism that “history is written by the winners” is no longer quite as relevant in an age where barriers to publishing are low and benefits of participation are high.
I was first made aware of the value of blogs in reporting after an earthquake I experienced in Seattle in 2001. It was brief, and by the time it was over, I jumped online to see what had happened elsewhere in the city. The local news outlets had nothing on their web sites yet. This was a significant event that literally had not happened yet online. I checked MetaFilter.com, a large group blog that I participate in, and people were reporting what had happened from all over the Pacific Northwest. As news outlets began to get reports online, links were included in the thread nearly in the order they arrived online. Many of those links, though now defunct, still provide interesting insight into how news of a small-scale disaster propagates.
Blog proliferation
The past few years have seen media outlets start their own blogs, further blurring the line between the informality of individual blogs and the authoritativeness of major media outlets. In “TalkLeft, Boing Boing, and Scrappleface: The Phenomenon of Weblogs and Their Impact on Library Technical Services,” Paul Moller and Nathan Rupp write, “Mainstream media outlets such as the New York Times have begun to maintain blogs, which often link to stories in other newspapers.... Other media outlets, from the Boulder, Colorado, Daily Camera to CNN, have begun employing blogs as a means of further connecting with their audience.”
Community-blogged archives exist for other major events such as Hurricane Katrina and the Indian Ocean earthquake. The actual understanding of what happened on 9/11 changed minute to minute; reassembling the history of that day is impossible if you rely on traditionally published media that report facts after they have become understood and codified. This attempt to catch the sense of newsworthy events while they are happening is a boon for historians and a challenge for archivists. In the overview description of the MINERVA project's September 11th Web Archive, LC states, “With the growing role of the web as an influential medium, records of historic events could be considered incomplete without materials that were born digital and never printed on paper.”
While we have grown accustomed to news and media outlets maintaining their own recent archives, most blogs live on servers completely at the whim of the owner of the site. The content seems permanent until, in Lyman's words, “it just disappears,” as has happened many times to community blog and social software sites. For instance, the hard drive crash of diary-x.com obliterated 120,000 online diaries and blogs overnight. When Couchsurfing.com experienced multiple database crashes with insufficient backups, 90,000 people lost their personal profiles, email contact lists, and trip diaries. Both sites were able to reconstruct some of their content and structure from RSS feeds, Google caches, and less-recent backups, but much of that digital content was never replaced.
Where do the bytes go?
The first question is not what but how to archive. A blog looks different on different days, but it may even look different to different readers. People viewing a blog via its RSS feed will miss design changes, just as people who read a blog sorted by tag or category may not grok how posts relate to one another on a time line. Firefox even allows us to set our own stylesheets and page presentation via an add-on called Greasemonkey, which means that the blog version that I see may be different from everyone else's version, my own idioblog. As we become more comfortable with bibliographic citations that include indicators such as “Retrieved on November 13, 2006,” how exactly do we retrieve that? How do we guarantee that we see what that person saw?
The Internet Archive has been creating and maintaining an archive of web content since 1996. It has been trying to create “a three-dimensional index that allows browsing of web documents over multiple time periods.” The Wayback Machine is its archive, accessible online via a URL search. It has a massive server farm containing two petabytes of data, expanding at a rate of 20 terabytes per month. These bytes, if turned into printed text, would vastly exceed the contents of the Library of Congress.
Even this project has limitations. Since the data is obtained by crawling (following hyperlinks from page to page and grabbing the text and images on each page), content that is generated dynamically can be difficult or impossible to archive. If you have your blog set up to generate pages on the fly (the default for WordPress systems), the Internet Archive may not be able to capture your blog at all. If archiving blogs becomes a priority, what has to shift to make this equation work?
The PANDORA Project of the National Library of Australia also began archiving (and cataloging!) selected Australian web sites in 1996.
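For readers who want to see why crawling misses some content, here is a minimal sketch of the idea: fetch a page, save it, and follow its links to find more pages. It is an illustration under stated assumptions, not the Internet Archive's actual crawler, which is far more elaborate. The in-memory SITE, the LinkExtractor class, and the crawl function are all hypothetical names invented for this example, and the "blog" is a handful of made-up pages so the code runs without a network connection. It shows one common reason dynamic content goes uncaptured: a page that no hyperlink points to is never discovered.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical blog held in memory (an assumption for illustration only).
SITE = {
    "/": '<a href="/post-1">Post 1</a> <a href="/archive">Archive</a>',
    "/post-1": '<a href="/">Home</a>',
    "/archive": '<a href="/post-1">Post 1</a> <a href="/post-0">Post 0</a>',
    "/post-0": "An older post.",
    # Reachable only through a search form, so no hyperlink points here.
    "/search?q=earthquake": "Results generated on the fly.",
}


def crawl(site, start="/"):
    """Breadth-first crawl: fetch a page, save it, queue the links it contains."""
    captured = {}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        # Skip pages already saved and links that lead nowhere.
        if url in captured or url not in site:
            continue
        captured[url] = site[url]
        parser = LinkExtractor()
        parser.feed(captured[url])
        queue.extend(parser.links)
    return captured


snapshot = crawl(SITE)
missed = sorted(set(SITE) - set(snapshot))
print(sorted(snapshot))  # pages the crawler saved
print(missed)            # pages that exist but were never reached
```

In this toy site the crawler saves the home page, the archive page, and both posts, while the search results page is left out because nothing links to it.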
More recently, PANDORA enlisted the help of the Internet Archive to “harvest” every site in the .au domain over a six-week period during June and July 2005. This netted “185 million unique documents...from 811,523 hosts, and 6.69 terabytes of raw data.” However, owing to the uncertain legal status of those collected documents, they are not currently being made available online. Do archivists need to find a way to improve on this, or are we already looking at best practices? The Internet Archive's process for harvesting web content is clearly becoming the standard method as more state library systems contract with it to store and preserve digital records.
“This matters”
Institutional culture is the wall to scale to provide open access and inroads to our cultural history. Small projects, as well as larger initiatives like the Internet Archive, are making a serious, positive dent in how we look at what we know. This is true both in the literal sense of how the content appears and in the more metaphorical sense of how it feels to us. Librarians and archivists are already surrounded by history and constantly share what we know with those who are searching. We have new opportunities to open the doors to our collections wider and to share the work we do as well as the thrill of discovery. The more we say “this matters,” and the more we say “we can do this,” the more we find people to help us. Archiving and making digital content available is going to become an even larger part of our jobs. Let's start with blogs.
Author Information
Jessamyn West maintains the blog librarian.net and works as a community technologist in central Vermont.