Excluding duplicate content locations

Jul 28, 2008 09:41 GMT

The number of unique Internet pages indexed by Google has reached 1 trillion, Google's team says on the official blog. Because current tools give ordinary users capabilities once reserved for webmasters, creating a webpage stopped being anything out of the ordinary a long time ago. This, in turn, makes the World Wide Web more intricate by the day.

The Web Search Infrastructure Team at Google explains that the number of webpages is practically infinite, since a page can keep changing its content without requiring any daily management, as is the case with a calendar. To estimate the amount of data that must be processed to return accurate results for people's queries, companies need to know the approximate number of pages on the Internet.

Google left out pages that display the same content at more than one URL, as well as auto-generated pages that are copies of each other and only add more addresses carrying the same data. Even after this Sisyphean task of removing the duplicates was completed, the Google team still counted more than one trillion pages. Jesse Alpert and Nissan Hajaj, software engineers at Google, explain once more how the search engine discovers each page. "We start at a set of well-connected initial pages and follow each of their links to new pages. Then we follow the links on those new pages to even more pages and so on, until we have a huge list of links," they say.
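The link-following and duplicate-removal steps described above can be illustrated with a small sketch. It is not Google's implementation: the `crawl` function, the toy `TOY_LINKS` and `TOY_CONTENT` tables and the `example` addresses are all hypothetical, and treating two pages as duplicates when their content hashes match is an assumption made only for illustration.

```python
from collections import deque
import hashlib


def crawl(seeds, links, content):
    """Follow links breadth-first from the seed pages.

    `links` maps a URL to the URLs it links to and `content` maps a URL
    to its page text. Both are in-memory stand-ins for real fetching
    (an assumption made for this sketch).
    Returns every URL discovered and the subset with distinct content.
    """
    seen_urls = set(seeds)
    queue = deque(dict.fromkeys(seeds))  # keep seed order, drop repeats
    seen_hashes = set()
    unique_pages = []

    while queue:
        url = queue.popleft()

        # Count the page only if no earlier URL had identical content.
        digest = hashlib.sha256(content.get(url, "").encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            unique_pages.append(url)

        # Queue every link that has not been discovered yet.
        for target in links.get(url, []):
            if target not in seen_urls:
                seen_urls.add(target)
                queue.append(target)

    return seen_urls, unique_pages


# Hypothetical four-page web: the "print" URL repeats the news story.
TOY_LINKS = {
    "a.example/": ["a.example/news", "b.example/"],
    "a.example/news": ["a.example/news?print=1", "b.example/"],
    "b.example/": ["a.example/"],
}
TOY_CONTENT = {
    "a.example/": "home",
    "a.example/news": "story",
    "a.example/news?print=1": "story",
    "b.example/": "other",
}

seen, unique = crawl(["a.example/"], TOY_LINKS, TOY_CONTENT)
print(len(seen), "URLs found,", len(unique), "with unique content")
```

In this toy web the crawl discovers four URLs but counts only three pages, because the print version of the news story carries the same content as the original.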

The team's statement about the number of pages Google is able to index came just a few days before the launch of a search engine that claims to be the biggest on the market. With 120 billion indexed pages, Cuil's database is roughly eight times smaller than Google's count, which hardly supports its claim to the top spot by number of indexed pages. If the data are correct, it certainly seems that someone was too hasty with that self-proclamation.