An insight into Google's book-counting methods

Aug 6, 2010 08:43 GMT

The Google Books project faces a lot of hurdles. Some are technical, some are legal, and cataloguing, scanning and making all of the world’s books searchable is a daunting task, to say the least. While the legal issues are still being resolved, Google has offered an insight into its attempt to actually count the number of books in the world. It took many steps, a merger of systems and, frankly, a lot of educated guesswork, but Google’s best estimate, for the moment, is that there are 129,864,880 books in the world.

“When you are part of a company that is trying to digitize all the books in the world, the first question you often get is: ‘Just how many books are out there?’ Well, it all depends on what exactly you mean by a ‘book’,” Leonid Taycher, a software engineer at Google, explained.

“We’re not going to count what library scientists call ‘works,’ those elusive ‘distinct intellectual or artistic creations.’ It makes sense to consider all editions of ‘Hamlet’ separately, as we would like to distinguish between -- and scan -- books containing, for example, different forewords and commentaries,” he continued.

As Google notes, the first problem is how exactly to define what a book is. For Google, a book is a ‘tome,’ an edition of a particular work. This definition ignores how many copies of an edition exist, but different editions of the same work are counted separately.

With that settled, it’s time to tackle the real issue: how many are out there? The problem is that no single cataloguing system is ideal. ISBNs, International Standard Book Numbers, are a good start, but they have issues. For one, the ISBN is a relatively modern concept and has only been widely used for four decades or so. And even it fails sometimes: different books may share the same ISBN, and ISBNs have been assigned to things like CDs and T-shirts.

Google also taps into cataloguing systems used by libraries around the world, but these have their drawbacks as well. In total, Google uses more than 150 sources for cataloguing data. It aggregates all the sources and assigns different levels of relevancy to them. Then Google does what it’s best at: it applies its computing prowess to weed out duplicates. After several algorithms are applied, the number is narrowed down to about 210 million unique records.
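
Google has not published the algorithms it uses, so the following is only a rough illustration of the general idea: records from several sources are reduced to a normalized key per edition, and when two records collide, the one from the more trusted source is kept. The Record fields, the source names and the weights in SOURCE_WEIGHT are all hypothetical, chosen for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    source: str   # hypothetical name of the catalogue the record came from
    title: str
    author: str
    year: int


# Hypothetical trust weights; the article only says sources get
# "different levels of relevancy", not what those levels are.
SOURCE_WEIGHT = {"national_library": 0.9, "commercial_feed": 0.5}


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def edition_key(rec: Record) -> tuple:
    """Records with the same key are treated as the same edition."""
    return (normalize(rec.title), normalize(rec.author), rec.year)


def deduplicate(records: list) -> list:
    """Keep one record per edition, preferring the most trusted source."""
    best = {}
    for rec in records:
        key = edition_key(rec)
        current = best.get(key)
        if current is None or (
            SOURCE_WEIGHT.get(rec.source, 0.0)
            > SOURCE_WEIGHT.get(current.source, 0.0)
        ):
            best[key] = rec
    return list(best.values())
```

Because the year is part of the key, two editions of ‘Hamlet’ from different years stay separate, which matches the edition-based definition of a book described above. A real system would need far fuzzier matching than exact keys.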

But not all of those are books, so Google excludes microforms, audio recordings, videos, maps and so on. That leaves about 146 million printed books. From that figure, Google still has to exclude serials as well as government documents. Only then is the ‘real’ number of existing books determined. At the latest count, it stands at 129,864,880, but only until Google starts another count.
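
The two-stage exclusion can be sketched in a few lines. This is a toy illustration, not Google’s method: it assumes each unique record already carries a hypothetical kind label, and the label names below are invented for the example.

```python
from collections import Counter

# Hypothetical labels for the categories the article names.
NON_PRINT_KINDS = {"microform", "audio", "video", "map"}
EXCLUDED_PRINT_KINDS = {"serial", "government_document"}


def count_books(kinds):
    """Return (printed_total, book_total) for an iterable of kind labels.

    Stage 1 drops non-print material; stage 2 drops serials and
    government documents from what remains.
    """
    tally = Counter(kinds)
    printed = sum(n for kind, n in tally.items() if kind not in NON_PRINT_KINDS)
    books = printed - sum(tally[kind] for kind in EXCLUDED_PRINT_KINDS)
    return printed, books
```

Called on the labels of the deduplicated records, the first value corresponds to the “printed books” stage and the second to the final count.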