Rumour has it that one of the candidates for Librarian of Congress is Brewster Kahle, the founder and director of the non-profit digital library Internet Archive.1 That he may be considered for the post is a testament to Kahle’s commitment to mass digitization, the cornerstone of modern librarianship.
A visionary of the digital preservation of knowledge and an outspoken advocate of the open access movement (the memorial for the Internet activist Aaron Swartz was held at the Internet Archive’s headquarters in San Francisco), Kahle has been part of the many ventures that have created our cyber age. At MIT, he was on the project team of Thinking Machines. In 1989 he created WAIS (Wide Area Information Server), the first electronic publishing system and a precursor of the World Wide Web, which was designed to search and make information available. He left Thinking Machines to focus on his newly founded company, WAIS, Inc., which was sold to AOL two years later for a reported $15 million. In 1996 he co-founded Alexa Internet, which was built on the principles of collecting Web traffic data and analysis.2 The company was named after the Library of Alexandria, the largest repository of knowledge in the ancient world, to highlight the potential of the Internet to become such a custodian. It was sold for c. $250 million in stock to Amazon, which uses it for data mining.
Alongside Alexa Internet, in 1996 Kahle founded the Internet Archive to archive Web culture (Fig. 1). The Wayback Machine, the engine that kicked off the Internet Archive, now holds 439 billion captures of ‘digital-born’ content: video, television, and, of course, websites (some issues of 19, for example, have been stored here). In 2001, the Archive turned to books. Unlike other digital ventures, most notably Google Books or the HathiTrust, the Internet Archive does not impose any restrictions on its collections. Its motto is ‘universal access to all knowledge’. More recently, it has also begun to store physical books.
Though the Internet Archive is more than a nineteenth-century research library, it is one of the largest repositories of digitized nineteenth-century materials in the world. It is difficult to imagine doing research in nineteenth-century studies without using the Internet Archive. Freely available, the Archive’s holdings, its multiple digitized copies of single books, its interface with its word search, uploading and downloading functions, sound application, and beta-feedback are invaluable to nineteenth-century scholars worldwide. And the impact of the Internet Archive on teaching is just as important, both inside and outside the classroom.
In this interview with Ana Parejo Vadillo, Kahle (Fig. 2) discusses his vision for digital libraries, the economy of digitization, and library deaccessioning. He also talks about scanning and the love for the book that makes it possible.
I have been reading your 1996 manifesto ‘Archiving the Internet’ which was submitted to Scientific American for the March 1997 issue. I would like to go back to that early piece, in which you argue that the aim of this new venture will be the ‘preservation of our digital history’. Is the aim of the Internet Archive still the same?
That is interesting. I have not read that in years. I include it below [see Appendix]. It is a bit spooky because this is exactly the course we have been on and are still on. We are achieving much of what was written then: archiving the Web, having end users and researchers use it. But we have gone further. We are striving for Universal Access to All Knowledge.
As a regular user of the Internet Archive, it seems to me that in addition to preserving digital history, the Archive’s mission is also to translate and preserve our hard culture into digital format. Is that the case and, if so, is it working? Are there pitfalls?
The Internet Archive started by being an archive of the Internet, and then moved to being an archive on the Internet. This has meant we have worked with libraries to digitize millions of books, music, and videos, and tried to bring these to the Internet. We are now moving towards building libraries with communities and libraries to share and make permanent the digital materials we are all generating. I would say it is working, but more slowly than I had hoped. I believe we have a massive project to do together, which is to put the best we have to offer within reach of our kids. And kids these days (as well as most of us) turn to the Internet: if it is not online then it is as if it did not exist. Therefore, we need to move all the best works online and then find mechanisms to serve these to anyone who wants them (Fig. 3). We need to do this now because every year that passes in which the twentieth century is not online is another year in which students graduate without having it in the library they use every day: the Internet.
The Internet Archive started collecting web pages in 1996. When did you turn to books? Why did you turn to books?
Universal Access to All Knowledge is the goal. We started with the Web because it was the most ephemeral. In 2000 we started collecting television: we now collect seventy channels from twenty-five countries, twenty-four hours a day. We also started collecting music and digitizing movies. In 2001 we started digitizing books with the Million Book project and in 2005 we started digitizing inside libraries. The reason is to bring our literary heritage to the world in an open way, and not just in closed databases only available to those in privileged institutions. The war over centralization is still going on, but so far the open world has won many of these battles.
Why has the nineteenth-century digital archive grown so strongly?
The nineteenth century is well represented on the Web because it is out of copyright. Copyright was twisted in a rewriting of the law in 1976 to put much of the twentieth century into a legal jail. As Michael Lesk, the father of digital libraries, said, we need to get it for real: ‘I fear for the twentieth century. The nineteenth century is out of copyright, and the twenty-first century is already digital. But the twentieth century is in danger because of copyright.’ You see this as well in books offered on Amazon.com: twentieth-century works are not well represented.
Can you talk about the materiality of scanning and the physical labour of those who scan? I have heard you say that those who love books are better at scanning them. Is there an inherent argument about loving and knowing what books are and the actual process of translating them into a new medium?
People who love books want to share them. We thought people would scan books for a few months and move to other jobs, but we were wrong. Many scanners have been working for us for over five years. Most of our scanners are college graduates — they just love books and want to see them live on (Figs. 4, 5).
Could scanning resolution become a class system of knowledge?
I am afraid that our digitization will be selective, that only dominant languages, dominant cultures, and dominant points of view will be represented in the digital future. If we are biased in our selection of what we bring to the next generation, then we are committing a crime that will never be forgiven. Fortunately, it only costs ten cents a page to digitize a book, so a 300-page book would cost thirty dollars. So far we have digitized only 2.5 million books, and we need to digitize 10 million to be the equivalent of a Yale or a Princeton or a Boston Public Library. So we need a few hundred million dollars to really complete the job with books. It can be our ‘Carnegie Moment’. It could be the legacy of a set of people who say this is the priority. It is a major opportunity for our generation.
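The arithmetic behind these figures can be checked with a short sketch. The per-page rate and the book counts are the ones quoted in the answer above; the assumption that every remaining book averages 300 pages is ours, borrowed from the worked example, and the function names are purely illustrative.

```python
# Figures quoted in the interview.
COST_PER_PAGE_CENTS = 10        # "ten cents a page"
BOOKS_DIGITIZED = 2_500_000     # "2.5 million books"
BOOKS_TARGET = 10_000_000       # "we need to digitize 10 million"

# Assumption (not stated in the interview): every remaining book
# averages 300 pages, the length used in the worked example.
ASSUMED_PAGES_PER_BOOK = 300


def book_cost_usd(pages: int, cents_per_page: int = COST_PER_PAGE_CENTS) -> float:
    """Dollar cost of scanning one book at a flat per-page rate."""
    return pages * cents_per_page / 100


def remaining_cost_usd(digitized: int, target: int, pages_per_book: int) -> float:
    """Dollar cost of scanning the books still needed to reach the target."""
    remaining_books = max(target - digitized, 0)
    return remaining_books * book_cost_usd(pages_per_book)


per_book = book_cost_usd(ASSUMED_PAGES_PER_BOOK)
total = remaining_cost_usd(BOOKS_DIGITIZED, BOOKS_TARGET, ASSUMED_PAGES_PER_BOOK)

print(f"One {ASSUMED_PAGES_PER_BOOK}-page book: ${per_book:,.2f}")
print(f"Remaining {BOOKS_TARGET - BOOKS_DIGITIZED:,} books: ${total:,.0f}")
```

Under that assumption the remaining 7.5 million books come to roughly $225 million, which is in line with the ‘few hundred million dollars’ mentioned in the answer. A different average page count would shift the total proportionally.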
Multiple digital copies in the Internet Archive constantly remind us of the ‘lost copies’ of our paper inheritance, a key issue for nineteenth-century scholarship, particularly in the context of library deaccessioning. Is this discussed when deciding which copies to scan? Do holding institutions have a say? In other words, is there a curatorial philosophy/practice behind book digitization, particularly in the nineteenth century?
We turn to librarians and collectors to direct, and fund, what is scanned. Personally, I made sure the books written by my grandfather were scanned. Currently, we are limited by funding. But given the funding (ten cents a page), I believe we can get access to the collections in the great libraries. But then we must also preserve the physical versions. Fortunately that is getting cheaper as well: if most of the access is to the digital versions, then the physical versions can be stored more densely and safely. We do not need to be deaccessioning as much as we have been because it is less expensive to save materials. The Internet Archive now has two large physical archives where we are storing over one million books, and tens of thousands of films, records, and microfilm reels. This can and should be done by everyone. But if others cannot store them, we hope they will think of us (Fig. 6).
19 is an open journal that believes in universal access to knowledge. How do you see the collision between the copyright system that allows people (writers/artists of any kind) to get paid and the free culture movement that is the Internet?
Unfortunately, the current technologies for charging are making for large central organizations that are dominating and massively limiting access. Comparing the open access journals and PLOS [Public Library of Science] with the commercial publishers makes me think that charging has the perverse effect of crippling access. We need new systems but, in the meantime, the best way to be read is to not charge each reader for access.
What economic models do you think will be needed or can be implemented to sustain nineteenth-century digital archives and the labour underpinning them? If it is important that such resources be open to all, who should pay?
Libraries are a $12 billion a year industry in the US. Digitizing their whole collections would cost about $160 million if done intelligently. If done well, then we would have a system that still reflects what is carved over the door of the Boston Public Library: ‘Free to All.’
From a UK perspective, one wonders about the economy of this knowledge. Why do you think the US is so far ahead in terms of the digital archiving of our hard culture? Is it because of technological innovation or because of its innovative approach to its economy (what you have elegantly termed ‘knowledge economy’)?
Europe does not have a strong tradition of independent non-profit charities like the US does. So many cultural entities are government entities in Europe, and governments are increasingly driven by corporate interests rather than by what could serve the general population. This is reflected in laws that favour ‘collecting societies’, in strong corporate copyright laws, in a lack of libraries posting their digitized public domain materials, and in a lack of spending priorities for digitization.
How do you envisage the unfolding of parallel digital and physical/material archives given that neither seems sufficient in isolation?
Access drives preservation. If materials are not digitized then they will be largely forgotten and therefore physical preservation will be underfunded.
You have described the Internet Archive project as the Library of Alexandria 2.0. You often note that ‘universal access to knowledge is now within our grasp’ thanks to the Internet. Is this really plausible? Is this really the ultimate aim of the Internet Archive, more so than the preservation of that knowledge?
‘Universal access to all knowledge’ is a goal for the broad Internet community, but for the pieces missing, the Internet Archive would like to play a role. As this unfolds, we would like to preserve this knowledge and make sure it is accessible for centuries to come.