Happening to mention that it offers just such a technology from reCAPTCHA

Jan 28, 2010 07:19 GMT  ·  By

Spam is as big a problem as ever; recent studies found that as much as 95 percent of all email is spam. But it's not just email, as spammers are always finding new avenues. As any blogger who hosts their own blog will tell you, spam in the comments is a huge problem. One way of deterring it is by using a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) system, a technique that has been around for a few years. Google, which acquired reCAPTCHA, a company in the field, last year, is now taking some time to explain the system to users who may not be familiar with it.

"To level the playing field, you can take steps to make sure that only humans can interact with potentially spammable features of your website. One way to determine which of your visitors are human is by using a CAPTCHA," Michael Wyszomierski, from the Google Search Quality Team, wrote.

"You can easily take advantage of this technology on your own site by using reCAPTCHA, a free service owned by Google. One unique aspect of reCAPTCHA is that data collected from the service is used to improve the process of scanning text, such as from books or newspapers. By using reCAPTCHA, you're not only protecting your site from spammers; you're helping to digitize the world's books," he explained.

reCAPTCHA is one of the best-known companies in the field and certainly one of the most innovative. The interesting part isn't the technology itself, which others have implemented in different ways but generally with equally successful results, but the fact that the system is also used to decipher words that OCR (optical character recognition) software has trouble with.

The system serves two different words to the sites that have implemented the technology. One of them is a known control word; the other is a word the OCR algorithms failed to identify. The system then assumes that if the human reader gets the control word right, their reading of the unknown one must be correct too. This is the part that likely interested Google the most when it made the acquisition, as book digitization is a very important topic at the company. reCAPTCHA is currently used to digitize the New York Times archive and has already gone through 20 years' worth of newspapers.
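
The two-word check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not reCAPTCHA's actual code: the function names, the in-memory `votes` store, and the `min_votes` agreement threshold are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory store: for each unknown word, count the
# transcriptions submitted by visitors who passed the control word.
votes = defaultdict(Counter)


def normalize(text):
    """Compare answers case-insensitively and ignore surrounding spaces."""
    return text.strip().lower()


def check_response(control_answer, control_guess, unknown_id, unknown_guess):
    """Return True if the visitor typed the control word correctly.

    Only in that case is their reading of the unknown word recorded,
    on the assumption that it is probably right as well.
    """
    if normalize(control_guess) != normalize(control_answer):
        return False
    votes[unknown_id][normalize(unknown_guess)] += 1
    return True


def accepted_transcription(unknown_id, min_votes=2):
    """Return the leading reading once min_votes visitors agree, else None.

    The agreement threshold is an assumption added for this sketch.
    """
    if not votes[unknown_id]:
        return None
    guess, count = votes[unknown_id].most_common(1)[0]
    return guess if count >= min_votes else None
```

A site would call `check_response` once per submitted form and let the comment through only when it returns `True`; `accepted_transcription` shows how the readings of the unknown word could then be collected for digitization.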