While Google has been scanning PDF files for a decade, there are still misconceptions

Sep 2, 2011 15:40 GMT  ·  By

Google's mission, as far as the search engine goes, is to make all information online easily available to users when they need it. But there's a lot of info online that's not as easily accessible as it could be. PDF files are among the big offenders.

While the Google indexer has been able to look inside PDF files for a decade now, there are still questions about whether, and how, they are treated differently from regular text (HTML).

Google has put up a list of the most frequently asked questions that should, hopefully, put the most common concerns to rest.

"Our algorithms don’t let different filetypes slow them down; we work hard to extract the relevant content and to index it appropriately for our search results," Gary Illyes, Webmaster Trends Analyst, writes.

"But how do we actually index these filetypes, and—since they often differ so much from standard HTML—what guidelines apply to these files? What if a webmaster doesn’t want us to index them?" he asks.

"Google first started indexing PDF files in 2001 and currently has hundreds of millions of PDF files indexed. We’ve collected the most often-asked questions about PDF indexing," he adds.

The most obvious question is how good Google is at indexing PDF files.

The company says that it indexes the text inside PDF files, as long as the files are not password protected. When text is embedded as images, it sometimes uses OCR to extract it.

Images in PDF files, however, are not indexed, and Google advises webmasters to publish them on their sites' regular pages if they want them indexed. Links in PDF files, though, work pretty much like any other link on the web, and they do contribute to ranking.

Webmasters can choose to keep PDF files out of Google's search results by adding an "X-Robots-Tag: noindex" header to the HTTP response used to serve the PDF.
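For readers who want to see what that looks like in practice, here is a minimal sketch in Python. It uses only the standard library's http.server module to serve files from the current directory and attaches the header to anything ending in .pdf. The names extra_headers_for, NoIndexPDFHandler and serve, as well as the address and port, are made up for this illustration; a production site would more likely set the header in its web server or CDN configuration.

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def extra_headers_for(path):
    """Return extra response headers for a request path (hypothetical helper)."""
    # Drop any query string or fragment before looking at the extension.
    clean = path.split("?", 1)[0].split("#", 1)[0]
    if clean.lower().endswith(".pdf"):
        # Ask search engines not to show this file in their results.
        return {"X-Robots-Tag": "noindex"}
    return {}


class NoIndexPDFHandler(SimpleHTTPRequestHandler):
    """Serve files from the current directory, marking PDFs as noindex."""

    def end_headers(self):
        # Add our headers just before the header section is closed.
        for name, value in extra_headers_for(self.path).items():
            self.send_header(name, value)
        super().end_headers()


def serve(host="127.0.0.1", port=8000):
    """Start the demo server (assumed address and port; blocks until stopped)."""
    ThreadingHTTPServer((host, port), NoIndexPDFHandler).serve_forever()
```

Calling serve() and then requesting a PDF with a tool that prints response headers should show the X-Robots-Tag line, while HTML pages are served without it.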

While Google would prefer all content to be HTML, PDF files can and do rank well for certain searches, as long as they are deemed relevant to the query. In the blog post, Google also answers other common questions, such as how to deal with duplicate content issues for PDF files that mirror HTML content. You can also check out Matt Cutts' video below for some answers.