Google has officially announced the new feature

Jun 23, 2010 13:59 GMT

Google sometimes moves too fast for its own good. A new feature was recently spotted in Google Docs, optical character recognition (OCR) for uploaded PDF and image files, and the company has only now gotten around to actually announcing it. There’s nothing too surprising in the announcement: Google sees the technology as a way for people to convert their old scanned documents into fully editable ones. The company also confirmed that it is using the same OCR technology it uses in Google Books.

“[W]hat started as my 20% project is now ready for everyone to use -- Google Docs now officially supports importing scanned documents. What we launched as an experimental feature for the Documents List Data API last year is now available on the upload page: check the ‘Convert text from PDF or image files to Google Docs documents,’ upload your scanned images (JPEG, GIF, PNG) or PDFs, and Google Docs will extract text and formatting from the scans for you to edit away,” Jaron Schaeffer, software engineer at Google Docs, announced.

The technology works for JPEG, GIF and PNG images as well as PDFs. In theory, it should make it easier to digitize older documents and make them editable in Google Docs. In practice, results have been spotty. Good-quality files produce very good results, but lower-quality scans and harder-to-read documents, such as handwritten ones, don’t fare so well.

No OCR technology is perfect, but Google’s is one of the most advanced out there. The company has used it to make more than ten million books and publications searchable, though what Docs users get may not be on par with what Books gets, unlikely as that may be. For now, the technology works best with English, French, Italian, German and Spanish documents, but Google says other languages and character sets will be supported in the future.