The new audio implementation helps transcribe historical audio content

Dec 10, 2008 11:07 GMT

The audio implementation of Carnegie Mellon University’s reCAPTCHA service has been modified in a way that improves its security and also helps preserve historical media. The audio version now plays short sentences from old radio shows that speech recognition software failed to transcribe automatically.

CAPTCHA, or the Completely Automated Public Turing test to tell Computers and Humans Apart, is a widely deployed verification system that websites use to block automated registration scripts and spam. The popular reCAPTCHA service is a CAPTCHA implementation that shows users scanned text from the Internet Archive or The New York Times archive that optical character recognition (OCR) software could not digitize.

This makes reCAPTCHA, which is a project of the School of Computer Science at Carnegie Mellon University, an important cultural asset in addition to a security system. The transcriptions done with the help of this service are estimated to amount to 3,000 man-hours of labor per day.

In order to accommodate visually impaired individuals who surf the web using specialized screen-reading software, the reCAPTCHA service provides an audio alternative to the written challenge. This audio version recites digits and letters in different voices and over background noise, and users have to type back what they hear in order to prove that they are human and not automated scripts.

As Luis von Ahn, the project's executive producer, explains in a blog post, the original audio version lacked both the security and the usefulness of the written one, so it needed improvement. The change became more pressing because a PhD student would soon present a research paper demonstrating how the security of the audio reCAPTCHA could be compromised by an attack using machine-learning techniques.

According to Professor von Ahn, the new audio version cannot be subverted through this attack, and breaking it “would require major advancements in speech recognition technology.” He outlines the benefits of the new implementation, noting that “much like the visual reCAPTCHA has helped to digitize billions of printed words so far, we expect that the audio version will help transcribe large amounts of historical audio content.”

The algorithm used by the new audio reCAPTCHA features phoneme-based encoding and was designed to tolerate a small number of errors in order to compensate for spelling mistakes or homophones. The update will be rolled out to everyone using the service over the next few weeks, and webmasters who use custom themes to embed reCAPTCHA into their websites are advised to change the instructions for the audio version to “type what you hear,” or something similar.
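The article does not describe the actual algorithm, but the general idea of error-tolerant phonetic matching can be illustrated with a rough sketch. The Python example below is an assumption for illustration only, not reCAPTCHA's implementation: it swaps in a crude Soundex-style consonant encoding in place of true phonemes, and the function names (`phonetic_key`, `matches`) and the `max_errors` threshold are hypothetical.

```python
# Illustrative sketch only: NOT the reCAPTCHA algorithm.
# A crude Soundex-style consonant code stands in for real phoneme encoding.

# Letters that sound alike share a class; vowels, h, w and y are ignored.
_SOUND_CLASSES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def phonetic_key(word: str) -> str:
    """Reduce a word to a string of consonant-class codes."""
    key = []
    previous = None
    for ch in word.lower():
        code = _SOUND_CLASSES.get(ch)
        # Keep a code unless it repeats the code of the letter just before it.
        if code is not None and code != previous:
            key.append(code)
        previous = code
    return "".join(key)


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def phrase_key(phrase: str) -> str:
    """Encode each word and drop words whose key is empty (e.g. 'a', 'I')."""
    keys = (phonetic_key(word) for word in phrase.split())
    return " ".join(k for k in keys if k)


def matches(expected: str, typed: str, max_errors: int = 1) -> bool:
    """Accept the answer if the phonetic keys differ by at most max_errors edits."""
    return levenshtein(phrase_key(expected), phrase_key(typed)) <= max_errors


if __name__ == "__main__":
    expected = "their house is near the sea"
    print(matches(expected, "there hous is near the see"))
    print(matches(expected, "completely different words"))
```

Comparing encoded keys rather than raw text is what lets homophones such as "their" and "there" count as the same answer, while the edit-distance threshold absorbs the occasional typing slip. A production system would need a real pronunciation model and a threshold tuned against both human error rates and automated attacks.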