MINERVA: Archiving Born-Digital Material at the Library of Congress

March 31, 2004

While a Digital Conversion Specialist at the Library of Congress, I became interested in the Library’s efforts to archive websites, and the MINERVA project in particular. I wrote an article about MINERVA and published it in the Internet column of the March 2004 issue of Slavic and Eastern European Information Resources, a Haworth Press journal.

INTRODUCTION

As the Internet grows, maturing as a source of reliable information even as the sheer volume of Web sites increases, librarians try to capture and archive the most valuable of its resources. The task is made all the more challenging by the fleeting nature of “born-digital” materials, such as Web pages and multimedia files, which are created in digital format and do not exist as physical objects.

MINERVA (Mapping the Internet Electronic Resources Virtual Archive) (http://www.loc.gov/minerva/) is the Library of Congress’ (LC) largest archive of born-digital materials. The Election 2000; September 11, 2001; and Election 2002 collections are publicly available. Winter Olympics 2002, September 11 Remembrance, the 107th Congress, and the War on Iraq have been collected and will be available online once production work is complete.

Although archiving born-digital content is a new practice with new techniques (e.g., “crawling,” collecting digital items via the Internet with software called a Web crawler), LC staff who work on MINERVA collections have found that many of the issues to consider when planning and executing such a project are not so new: selection, copyright, metadata, cataloging, and user interface. This article will discuss the major issues in each of these areas of digital archiving, using MINERVA as a model. Wherever possible, special considerations for archiving foreign materials will be mentioned.

SELECTION

Given the vast amount of born-digital information on the Internet, clear selection guidelines are the first thing a digital archivist needs. (Some organizations practice bulk collection of all publicly accessible Web sites, but this article deals only with selective archiving of designated sites related to a chosen theme.)

There are two general approaches to selective archiving: event-based and subject-based. As the names of MINERVA’s collections, listed above, suggest, selection of born-digital items to archive at LC has so far been event-based. The event driving each of these collections (except for September 11, 2001) was known in advance; each collection cycle had a beginning and an end; most selections were made before collection began, and the archives were processed afterward.

A separate initiative at LC is considering subject-based archiving. Angel Batiste of the African and Middle Eastern Division has proposed a pilot project to archive online government documents of South Africa, many of which seem to be available only in digital format. In this case, collection would be ongoing, with no defined end. This implies that, although selection guidelines should be clear and documented, actual selections for archiving may change over time as some publications disappear from the Web, new publications come online, etc.

Once an approach is settled on, the process is much like the acquisition of print materials: selecting officers choose the Web sites to collect. However, MINERVA staff assist selecting officers in ways that are not necessary with print acquisitions. For instance, a printed serial is issued by its publisher, usually on a set schedule, so the interval of acquisition is determined by the publisher. Digital publications, by contrast, may be updated weekly, daily, hourly, or on an ad hoc basis. The archivist, not the publisher, must determine an appropriate interval of archiving.

COPYRIGHT

Copyright is a major concern for would-be archivists of born-digital items. Copyright law varies from country to country and must be considered both in the country where the crawling and archiving are conducted and in the country where the digital item was published.

Even if copyright law is followed to the letter, and it is determined that permission from the copyright owner is not required, it is still strongly advisable to notify the owner. LC uses an opt-in policy for dissemination of U.S.-published content: a formal letter is sent to the copyright owner describing MINERVA, the digital archive, and LC’s intent to collect their digital content. The letter asks permission to display the content offsite. If permission is not granted, the site will still be crawled, but the content will only be accessible onsite at LC. For foreign and creative sites (cartoons, poetry, etc.), LC seeks permission for both crawling and display.

CRAWLING

The acquisition of printed items is initiated with a purchase, and completed later with the receipt of the item. By contrast, born-digital items selected for archiving can be crawled and stored on servers immediately.

For all collections since the prototype, LC has contracted the crawling to the nonprofit organization Internet Archive (IA) (http://www.archive.org/), which in turn subcontracts the actual crawling to the private company Alexa Internet (http://www.alexa.com/). The many URLs to be crawled are divided among multiple servers, which make a first light pass so as not to overburden Web sites, then follow up to ensure that all the necessary files were collected. The process is not perfect and results are mixed; for instance, supplemental files such as images are sometimes missed, causing display errors on the archived Web pages. Alexa’s crawlers also have difficulty with URLs that contain spaces and with dynamic menus.
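
Alexa’s software is proprietary, but the light-pass-and-follow-up pattern itself is simple. The Python sketch below is purely illustrative (it is not MINERVA’s or Alexa’s actual tooling, and politeness controls and error handling are pared down to a minimum): it fetches the selected pages first, then returns for embedded images and stylesheets, the supplemental files whose absence causes the display errors just described.

    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class ResourceFinder(HTMLParser):
        """Collect the URLs of embedded resources (images, stylesheets)."""
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.resources = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "img" and "src" in attrs:
                self.resources.append(urljoin(self.base_url, attrs["src"]))
            elif tag == "link" and attrs.get("rel") == "stylesheet" and "href" in attrs:
                self.resources.append(urljoin(self.base_url, attrs["href"]))

    def two_pass_crawl(seed_urls):
        # First, light pass: fetch only the selected pages themselves.
        pages = {}
        for url in seed_urls:
            with urllib.request.urlopen(url) as response:
                pages[url] = response.read()

        # Follow-up pass: fetch the supplemental files each page refers to.
        resources = {}
        for url, body in pages.items():
            finder = ResourceFinder(url)
            finder.feed(body.decode("utf-8", errors="replace"))
            for resource_url in finder.resources:
                try:
                    with urllib.request.urlopen(resource_url) as response:
                        resources[resource_url] = response.read()
                except OSError:
                    pass  # a miss here is what breaks images in the archive

        return pages, resources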

There are economies of scale in crawling for institutions the size of LC and projects with the scope of MINERVA. However, smaller organizations with smaller budgets and smaller projects can take advantage of free, open-source tools, like the “Web site copier” HTTrack (http://www.httrack.com/), which LC used for MINERVA’s prototype site.
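
A typical HTTrack invocation looks like the following (the site and output directory here are hypothetical):

    httrack "http://www.example.com/" -O "/archives/example" "+*.example.com/*" -v

The -O option names the local directory that will hold the mirror, and the "+*.example.com/*" filter keeps the crawler from wandering off-site; deciding where a selected site ends is one of the first judgments any selective archiving project must make.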

METADATA

Missing images and broken URLs are immediately obvious, but the hidden problem of unreliable and missing metadata can be serious: MINERVA team member Gina Jones estimates that less than half of the Web documents she checks have adequate metadata. Their creators may not understand the importance of metadata since the files “work”—i.e., display properly in current Web browsers—with little or none. But metadata are vital to the proper cataloging and display of born-digital material. Digital archivists look for two general types of metadata: structural/administrative and content.

Structural/administrative metadata state the document’s author, date created, etc. They also name the standard by which the document was encoded, in the form of a Document Type Definition (DTD), so that it can be interpreted and displayed properly. In the long term, the ability to view an archived Web site will depend upon the software “knowing” how the file is encoded, even when the standard has become obsolete. A DTD is no guarantee that a document will display properly in years to come, but it should be included at the very least.

Content metadata describe the contents of the document and aid effective indexing and searching. The benefits are seen not only when popular search engines (Google, Yahoo!, AltaVista, etc.) index live sites, but also when a site is archived and indexed for a collection such as one of MINERVA’s. Appropriate use of the “description” and “keywords” tags aids automated crawling and improves searching, as do good titles, headings, and textual content.
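
To make both types concrete, here is a small Python sketch (hypothetical, not a MINERVA tool) that checks a page for the items discussed above: a DTD declaration, a title, and the “description” and “keywords” tags. The sample page embedded in the script shows what adequate metadata might look like.

    from html.parser import HTMLParser

    class MetadataChecker(HTMLParser):
        """Report whether a page declares a DTD and carries content metadata."""
        def __init__(self):
            super().__init__()
            self.has_doctype = False
            self.meta = {}
            self.title = ""
            self._in_title = False

        def handle_decl(self, decl):
            # Called for declarations such as <!DOCTYPE HTML PUBLIC ...>.
            if decl.upper().startswith("DOCTYPE"):
                self.has_doctype = True

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta" and "name" in attrs:
                self.meta[attrs["name"].lower()] = attrs.get("content", "")
            elif tag == "title":
                self._in_title = True

        def handle_endtag(self, tag):
            if tag == "title":
                self._in_title = False

        def handle_data(self, data):
            if self._in_title:
                self.title += data

    # A hypothetical page with structural and content metadata in place.
    page = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
        "http://www.w3.org/TR/html4/strict.dtd">
    <html><head>
    <title>Election 2002: A Candidate's Home Page</title>
    <meta name="description" content="Campaign site for a hypothetical candidate.">
    <meta name="keywords" content="election, 2002, candidate, campaign">
    </head><body>...</body></html>"""

    checker = MetadataChecker()
    checker.feed(page)
    print("DTD declared:", checker.has_doctype)
    print("description: ", checker.meta.get("description"))
    print("keywords:    ", checker.meta.get("keywords"))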

CATALOGING

Because metadata in the pages themselves are most often unreliable or missing, cataloging is impossible to automate and must be done manually. For each MINERVA archive, LC staff create a collection-level AACR2/MARC catalog record for the LC Integrated Library System (ILS); at the item level, they are experimenting with title-level descriptive records for each Web site within the collection, using the Metadata Object Description Schema (MODS). The item-level cataloging for two of the collections, September 11, 2001 and Election 2002, is subcontracted by IA to WebArchivist.org (http://www.webarchivist.org/), an organization co-directed by faculty at the University of Washington and the State University of New York (SUNY) Institute of Technology.

Cataloging the digital archives is an enormous, labor-intensive task, and much of the publicly available archive will never be cataloged. Of the estimated 30,000 sites collected for the September 11, 2001 archive, LC plans to catalog less than 10%. (All the sites hand-picked by the Library of Congress were selected for cataloging, as were sites recommended by WebArchivist.org. A keyword search was run to find additional sites. The lists were consolidated and duplicates eliminated; the final list contains about 2,500 sites.)

For these reasons, many sites in the September 11, 2001 archive are only accessible by their exact URLs. LC is developing ways to help users search the entire archive more effectively, using the search engine Inktomi (http://www.inktomi.com) to index the homepages of the 30,000 archived sites. SUNY tested semi-automated cataloging for the Election 2002 archive, using forms in which students checked boxes to indicate the types of content a site offered (e.g., candidate biography, platform statement). Some of the data gathered during this process was used in the creation of item-level catalog records.
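
Inktomi’s engine is commercial, but the underlying idea, a searchable index built from homepage text alone, can be sketched with a toy inverted index (the URLs and texts below are hypothetical):

    from collections import defaultdict

    # Hypothetical homepage texts keyed by archived URL.
    homepages = {
        "http://example.org/memorial": "september 11 memorial tribute page",
        "http://example.net/news": "breaking news coverage september 2001",
    }

    # Build an inverted index: each word maps to the homepages containing it.
    index = defaultdict(set)
    for url, text in homepages.items():
        for word in text.split():
            index[word].add(url)

    def search(query):
        """Return the homepages containing every word of the query."""
        words = query.lower().split()
        if not words:
            return set()
        results = set(index.get(words[0], set()))
        for word in words[1:]:
            results &= index.get(word, set())
        return results

    print(search("september news"))

Indexing only the homepages keeps the index to a manageable size, at the cost of recall: a match buried deep within a site will not be found.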

USER INTERFACE

Those involved with digital archives at LC take it for granted that the archives are destined for public access. There is no point in collecting born-digital items without a good user interface that makes the content of the archive accessible.

MINERVA’s interface can be found via the LC Web site, but the close observer will note that none of MINERVA’s digital collections are served directly from LC. Rather, they reside on servers at IA or WebArchivist.org. LC does plan to serve MINERVA content eventually; in the meantime, LC and WebArchivist.org researchers are working on better, more consistent interfaces that offer more ways to access the data. For example, see the drill-down interface to the Election 2002 collection, created for the Library of Congress and currently hosted at WebArchivist.org (by September 2003 it will be hosted at LC). The interface (http://webarchivist.org/minerva/DrillSearch) offers four categories of choices: office, party, state, and candidate name. Clicking on any of the choices in these categories presents the user with a list of sites; the results list can be narrowed further by clicking on a choice in another category.
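
The logic behind such a drill-down interface amounts to conjunctive filtering over item-level records. Here is a minimal Python sketch, assuming hypothetical records with the four categories named above:

    # Hypothetical item-level records; the real data live at WebArchivist.org.
    sites = [
        {"url": "http://example.org/smith", "office": "Senate",
         "party": "Democratic", "state": "WA", "candidate": "Smith"},
        {"url": "http://example.org/jones", "office": "House",
         "party": "Republican", "state": "NY", "candidate": "Jones"},
        {"url": "http://example.org/brown", "office": "Senate",
         "party": "Republican", "state": "CA", "candidate": "Brown"},
    ]

    def drill_down(records, **choices):
        """Narrow the results list: each added choice is a further constraint."""
        return [r for r in records
                if all(r.get(category) == value
                       for category, value in choices.items())]

    print(drill_down(sites, office="Senate"))               # two sites
    print(drill_down(sites, office="Senate", state="WA"))   # narrowed to one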

CONCLUSION

The vast number of Web sites, the variety of approaches to selection, the technical obstacles, the storage space requirements, the cataloging, the interface development: creating a digital archive is no simple task. Given these challenges, there are two ways to succeed. In a collaborative approach like LC’s, institutions of different sizes partner and divide the work. Smaller institutions can instead take a decentralized approach, maintaining smaller, specialized archives of a manageable size.
