
The Internet’s Keepers?

Wayback Machine Director Mark Graham outlines the scale of everyone's favorite archive.

by Nathan Mattise via Ars Technica on October 7, 2018

The Internet Archive headquarters in San Francisco.

flickr.com/blmurch

...can boot up in a browser-based emulator for research or leisure. Officially, that section involves 300,000-plus overall software titles, “so you can actually play Oregon Trail on an old Apple C computer through a browser right now—no advertising, no tracking users,” Graham says.

In total, Graham says the Internet Archive adds four petabytes of information per year (that's four million gigabytes, for context). The organization’s current data totals 22 petabytes—but the Internet Archive actually holds on to 44 petabytes' worth. “Because we’re paranoid,” Graham says. “Machines can go down, and we have a reputation.” That NASA-ish ethos helped the non-profit once survive nearly $600,000 worth of fire damage—all without any archived data loss.

Universal access to knowledge (and facts, so many facts)

The mission statement of the Internet Archive throughout its 22 years has been simple: “universal access to all knowledge.” Doing that in the Web era means deploying a small army of bots, of course, and Graham notes the Internet Archive constantly has software crawling for content. Roughly 7,000 simultaneous processes reach across the Web to snag 1.5 billion things per week. Some pages, like the Google or The New York Times home pages, may get captured many times in a day; others are visited less often.

“We try to get everything, but it’s challenging,” Graham notes. “Embeds, Javascripts, interactive apps—we can’t get some of this stuff, but we’re working on this.”

That working-on-it cache includes ephemeral media such as Snapchat or public Telegram groups, and the Wayback Machine maintains on-the-ground contacts in places where some media archives or servers may be at risk (Graham notes partners in Egypt recently, for instance).

The upshot of all this is that the Wayback Machine has evolved into something with far more utility than simply amusing trips to LiveJournals of yore. Ars has used it numerous times, for everything from catching changes in Comcast’s net neutrality pledge to seeing how Defense Distributed’s organizational description evolved. And Graham points to a 2018 controversy in which President Trump tweeted that Google didn’t promote the State of the Union on its homepage (as it had done in the past). Before Google responded, the company reached out to the Internet Archive with a simple question: have a copy?

“I love Google, but their job isn’t to make copies of the homepage every 10 minutes,” Graham says. “Ours is.”