Archive Series: The Internet and the Future of Digital Preservation

“We don’t know where this Internet is going, and once we get there it will be very instructive to look back.”¹

—Donald Heath, president of the Internet Society in Reston, Virginia

By my door, along with a stack of items to be dropped off at charities and recycling centers, sits a bag of 3.5” floppy disks. They’re left over from college days, and we’ve moved them from house to house, thinking someday we’d get around to seeing what’s on them, printing off anything we want, and ditching the disks. It hasn’t happened, and in a frustrated cleaning spurt I decided to just toss the lot. We haven’t looked at them in twenty years; what would we ever miss?

Yet there they sit; and I can’t bring myself to throw them away because once they’re gone whatever might be on them is gone too.

That so much of our present history is recorded digitally (our lives lived digitally, in fact) presents a real issue for cultural anthropologists. Disks degrade with time, software changes, and eventually data becomes unrecoverable. From family memories to national records, information about decades of life vanishes.

How will we preserve evidence of the impact social media had on the Arab Spring, or the way the Internet has helped conspiracy theorists reach large audiences? We can print some of it out, and some of it will be recorded in news reports, books, and film, but the sheer volume of output requires someone to make choices.

And while final editions of creative works are often produced in physical form—a book, painting, celluloid film, photo—early drafts may have been produced on a computer. The loss of these would leave an untold story about process and revision.

THE INTERNET ARCHIVE

This is where the digital historians at the Internet Archive step in. If you’ve ever searched for a copy of an old or defunct website, you may have used the group’s Wayback Machine search, which stores copies of publicly accessible webpages unless their owners have opted out. As of 2015, there were 452 billion versions of web pages searchable through Wayback.²

The Internet Archive was founded in 1996 as a nonprofit offshoot of a for-profit venture, Alexa Internet, a data analysis company focused on web traffic.³ Alexa found that as its robots trawled the web collecting data, they could simultaneously snapshot websites, creating a record of the virtual world at any point in time. In 1998, Alexa donated some two terabytes of those records to the Library of Congress.⁴

Today, the Internet Archive has grown beyond just Alexa and partners with dozens of organizations worldwide to preserve digital history. A separate branch of the project, called Archive-It, was launched in 2006 to allow institutions to build their own digital archives. Copies of these files are stored on the Archive’s servers.

Once all those records are stored, labeled, and filed, what then? The ongoing mission of the Internet Archive, and ultimately one of its most important tasks, is preservation. Their website details the three areas of concern:

Accidents: Any medium or site used to store data is potentially vulnerable to accidents and natural disasters. Maintaining copies of the Archive’s collections at multiple sites can help alleviate this risk. Part of the collection is already handled this way, and we are proceeding as quickly as possible to do the same with the rest.

Migration: Over time, storage media can degrade to a point where the data becomes permanently irretrievable. Although DLT tape is rated to last 30 years, the industry rule of thumb is to migrate data every 10 years. We no longer use tapes for storage, however. Please take a look at our page on our Petabox system for more information on our storage systems.

Data Formats: As advances are made in software applications, many data formats become obsolete. We will be collecting software and emulators that will aid future researchers, historians, and scholars in their research.⁵

INFORMATION OVERLOAD

Web pages are essentially shells, filled with continually changing information. To catalog a cross section of the data flowing through the web each day, the Archive has created special initiatives. There is a Political TV Ad Archive (the topic I was researching when I became interested in the Archive); there is a repository of NASA images and historic film; there is Archive-It, a smaller, subscription-based preservation service used by businesses and nonprofits; and perhaps one of its best-known projects is Open Library, a kind of Wikipedia for books. The Archive has also created the Petabox, a one-million-gigabyte (one-petabyte) storage system being implemented in academic and government institutions for long-term storage.⁶

Alongside all those initiatives, the team at the Archive fights for net neutrality and open access. Its work has been censored by governments (India, China, and Russia so far), and it is regularly attacked by data hostage takers and hackers, which keeps engineers and staff scrambling on a nonprofit budget.

RECORDING EVERYTHING

At the Midwest Regional Digitization Center in Indiana, hundreds of thousands of books are being scanned, along with any ephemera found between the pages. Pressed flowers, scraps of paper, and bookmarks are all considered part of the history of the book and are recorded for posterity.⁷

By uploading books in the public domain, the Internet Archive has avoided legal problems over copyright, though concerns over author rights were raised when the full catalog of the out-of-print sci-fi magazine Omni was digitized and uploaded. A similar digitization project launched by Google in 2004 has been more comprehensive in its choices and allowed open access to texts, leading to class action lawsuits brought by authors and publishers over copyright infringement. As a member of the Open Book Alliance, the Archive was opposed to the terms of settlement that limit Google’s ability to offer books for free, but so far it has stayed away from offering similarly controversial content.⁸

With data-gathering well under way, the Archive has turned its efforts to figuring out how to make the data accessible. Large portions of storage are still masked behind digital language that requires coding knowledge to even begin to understand. Organizing the information for future researchers, historians, and laypeople is a massive effort. Indeed, the very record of their efforts will become part of the heritage of the Internet.

The ancient Library of Alexandria (for which Alexa was named) was once mankind’s largest repository of knowledge and cultural heritage. Though the circumstances are unclear, the library and most of its collection were destroyed, set afire either at once or in stages between 48 B.C. and 642 A.D. The Internet Archive also suffered a fire at its San Francisco headquarters, losing one of its scanning centers, all the attendant equipment, and about twenty boxes of books and films.⁹

But we’ve learned over the centuries. The Archive hosts data centers in San Francisco, Redwood City, and Richmond, CA, and the entire collection is mirrored in Amsterdam and at the new Egyptian library of Alexandria, opened in 2002. History: we seem doomed to repeat it, unless we learn from it.

Resources and oddities at the Internet Archive:

Speed runs. Videos of the fastest completions of video games.

FedFlix, the best training films, history, and promotional media made by the US Government.

Download audiobooks of classic titles as well as things like the 9/11 Commission Report.

The Grateful Dead archive. There are also archives for Cracker, moe., Blues Traveler, and more.

Feature films in the public domain, some hard to find anywhere else.

The Internet Arcade, copies of stand-alone arcade-machine games.

Works Cited

1. “In California, Creating a Web of the Past,” The Washington Post, July 28, 1996, accessed June 23, 2016, https://archive.org/stream/technicalarticle19unse/technicalarticle19unse_djvu.txt.

2. Kalev Leetaru, “How Much Of The Internet Does The Wayback Machine Really Archive?” Forbes Magazine, November 15, 2015, accessed June 23, 2016, http://www.forbes.com/sites/kalevleetaru/2015/11/16/how-much-of-the-internet-does-the-wayback-machine-really-archive/#4350751f88d4.

3. “About the Internet Archive,” Archive.org, accessed June 16, 2016, https://archive.org/about/.

4. “News from the Library of Congress,” The Library of Congress, October 13, 1998, accessed June 16, 2016, http://www.loc.gov/today/pr/1998/98-167.html.

5. “About the Internet Archive.”

6. “Internet Archive Projects,” Archive.org, accessed June 16, 2016, https://archive.org/projects/.

7. Wendy Hanamura, “Guess what we find in books? A Look Inside our Midwest Regional Digitization Center,” Internet Archive Blogs, March 11, 2016, https://blog.archive.org/2016/03/11/guess-what-we-find-in-books-a-look-inside-our-midwest-regional-digitization-center-by-jeff-sharpe/.

8. World Heritage Dictionary, s.v. “Archive-It,” accessed August 11, 2016, Project Gutenberg Self-Publishing Press (WHEBN0028741374) http://www.gutenberg.us/articles/archive-it

9. “Fire Update: Lost Many Cameras, 20 Boxes. No One Hurt,” Internet Archive Blogs, November 6, 2013, https://blog.archive.org/2013/11/06/scanning-center-fire-please-help-rebuild/.

Karen Veazey
Associate Editor