The Internet is the world’s living, ever-evolving database, certainly the largest depository of information ever assembled.

But as it is assembled and reassembled, changing by the second, researchers are concerned that too often, backing up is hard to do.

To some, nothing is older than a minutes-old Web page (to update the old saw about a day-old newspaper). But for others there is value in the Web sites of yester-minute, and archivists have been wrestling with ways of preserving today's Internet for tomorrow's scholars and researchers.

Without a working archive, these experts fear, future-generation Web surfers might never know who Client 9 was or what topics were generating the most interest this week.

Impossible task

By virtue of the Internet’s sheer enormousness and its warp-speed evolution, the task of archiving its content in its entirety is impossible, like trying to catalog every grain of sand on the world’s beaches.

But as it is easy to take a photograph of a beach, it also is possible to grab snapshots of the Internet, or specific portions of it, to preserve for future generations.

And that’s exactly what researchers at the Internet Archive, the Library of Congress, the National Archives and libraries worldwide are working on.

There have been some remarkable strides already, starting with the Mountain View, Calif.-based Internet Archive and its Wayback Machine, where its creator hopes to build a sort of second coming of the Library of Alexandria, the long-ago destroyed institution that housed much of the ancient world’s recorded knowledge.

The project has archived some 85 billion Web pages on computers whose storage is measured in petabytes.

100 days

“The average life span of a Web site is about 100 days, so you have to be proactive about getting and saving them,” said Brewster Kahle, who founded the nonprofit Internet Archive and began sculpting his vision of a working Internet library in 1996.

“We knew that this was coming. You could tell that there was going to be an online digital world, and we wanted to make sure there was a library built,” said Kahle, who is as dedicated to his gargantuan task as he is unassuming in discussing it.

So, on a regular basis, the Internet Archive releases a robot program called Heritrix, which conducts Web crawls, bounding about the Internet and collecting Web sites by the millions.

Each crawl collects about 4 billion sites, which are saved in the Wayback Machine. Anyone can access the collection online, type in a site name and view archived past versions of it.
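Archived snapshots are addressed by the Wayback Machine's public URL convention, which combines its host, a timestamp and the original site address. The sketch below illustrates that scheme; the example site and date are placeholders, not sites discussed in this story.

```python
def wayback_url(site: str, timestamp: str) -> str:
    """Build a Wayback Machine URL for a site at a given time.

    The timestamp is YYYYMMDDhhmmss, or any prefix of it; the Wayback
    Machine serves the snapshot closest to the requested moment.
    """
    return f"https://web.archive.org/web/{timestamp}/{site}"

# Placeholder example: the site's state around Jan. 1, 2008.
print(wayback_url("http://example.com", "20080101"))
# https://web.archive.org/web/20080101/http://example.com
```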

Initially funded by Kahle, the project has since received money from dozens of individuals and institutions — including the Mellon Foundation — and works worldwide with government agencies and libraries.

The robot crawlers collect only open public sites; those who don’t want their sites archived can add a bit of code to block the bot.
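The opt-out works through the Web's standard robots-exclusion convention: a site owner publishes a robots.txt file at the site's root, and well-behaved crawlers read it before collecting pages. A minimal sketch follows; "ia_archiver" is the user-agent string historically associated with the Internet Archive's crawler, though any crawler name can be listed.

```
# robots.txt at the site root -- a sketch of the opt-out described above.
# "ia_archiver" is the user agent historically honored by the Internet
# Archive; verify the current crawler name before relying on it.
User-agent: ia_archiver
Disallow: /
```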

Kahle, a middle-aged Internet pioneer who sold startup companies to buyers including AOL, said the Internet Archive has no endgame and its Web crawlers will continue indefinitely collecting petabytes of data as the Internet continues to expand.

It can be difficult to wrap your brain around the immensity of a petabyte, which is about 1,000,000,000,000,000 bytes.

Science Grid This Week, a publication of Fermilab, sums it up like this: If a byte is a single character on a keyboard and you typed one character per second, it would take more than 30 million years to create a petabyte-length document.

Another example: Say you had a fleet of personal computers and each one had a 50 gigabyte hard drive. You would need 20,000 of those PCs to hold a petabyte of data.
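Both back-of-the-envelope figures check out, assuming the decimal definition of a petabyte (10^15 bytes):

```python
# Verify the two petabyte illustrations above, using the decimal
# definition of a petabyte (10**15 bytes).

PETABYTE = 10**15        # bytes
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

# Typing one keyboard character (one byte) per second:
years_typing = PETABYTE / SECONDS_PER_YEAR
print(round(years_typing / 1e6, 1), "million years")   # about 31.7

# Number of 50-gigabyte hard drives needed to hold a petabyte:
GIGABYTE = 10**9
drives = PETABYTE // (50 * GIGABYTE)
print(drives, "PCs")                                   # 20000
```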

So when the Internet Archive says it has 2 petabytes' worth of data stored, that's one supersized library.

Only a fraction

Still, it’s just a fraction of the information stored on millions of Internet servers around the world. And what doesn’t get archived can end up disappearing forever into the digital ether.

Gregory S. Hunter, a professor at the Palmer School of Library and Information Science at Long Island University and one of the nation’s leading experts on electronic archiving, agreed that some Web sites are precious commodities that must be preserved.

Unlike Kahle’s all-inclusive approach, most other archives are consumed with the often vexing task of determining what data is worth saving and what belongs in the digital scrap heap.

Do we really want to preserve every teenager’s MySpace page? Well, Hunter says, we may want to save some of them so future researchers can understand the phenomenon of social networking.

“It’s very important that we preserve some Web sites as evidence of what has been created, or what was happening at a given point in the past. Newspaper Web sites are a good example. It’s a cultural question. We want to preserve things that reflect society in all its beauty and ugliness for future generations,” Hunter said.

Federal project

Hunter is the principal archivist for a project to build the federal government’s Electronic Records Archive (ERA), which would preserve or “appropriately dispose” of any government electronic record.

The ERA, a project of the National Archives, passed a milestone in December with the successful test of its software system developed by Lockheed Martin.

Now comes the hard part.

“As archivists, we think that by making appropriate judgments we can help sort out the wheat from the chaff,” Hunter said.

“If we save every bit of information, what good would it do us? If we keep nothing, that would do us no good either. Archivists are trying to find that middle point.”