Common implementation

Jun 6, 2008 09:09 GMT

Out of all the aspects of the World Wide Web equation, search engines are without a doubt the component offering the least amount of control to either end users or content developers and designers. Sure, given sufficient white-hat or black-hat SEO techniques, search engines can be manipulated, but actual control is out of the reach of mere webmasters. There are exceptions, of course, in which not the search engines themselves, but the indexing process and tools can be controlled. This is done via the Robots Exclusion Protocol (REP). Earlier this week, Microsoft announced that, together with Google and Yahoo, it would offer insight into their respective ways of handling the protocol.

This means that webmasters will be able to reap the benefits of a common implementation of REP across Google, Yahoo and Live Search. REP can be used by webmasters to control the manner in which search engines crawl and display the content of their websites. Via the protocol, webmasters can actually communicate with the search engines, even if the conversation is one-sided rather than a dialog. According to Nathan Buggia, of the Live Search Webmaster team, this is the first time that Microsoft, Google and Yahoo have agreed on a common way for webmasters to use REP with their respective search engines.
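To make that mechanism concrete, here is a minimal, illustrative robots.txt sketch showing the kind of directives discussed in this article; the paths and the sitemap URL are placeholders, not examples taken from the announcement:

User-agent: *
Disallow: /private/
Allow: /private/annual-report.html
Sitemap: http://www.example.com/sitemap.xml

A crawler honoring REP would skip everything under /private/ except the explicitly allowed page, and would pick up the sitemap from the stated location.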

When it was initially introduced, the REP standard was almost exclusively focused on exclusion directives. This is no longer the case. The protocol has evolved and is now designed to permit control over how content gets indexed and displayed. This is the first time that the three major search engines worldwide have come together to offer a common implementation of REP. The list below, put together by Fabrice Canel and Nathan Buggia of the Live Search Webmaster Team, details the Robots Exclusion Protocol features that are shared across Google, Yahoo and Live Search (illustrative snippets for the wildcard rules and the META directives follow after the list):

"1.Robots.txt Directives

Directive: Disallow
Impact: Tells a crawler not to crawl your site or parts of your site -- your site's robots.txt still needs to be crawled to find this directive, but the disallowed pages will not be crawled.
Use Cases: 'No crawl' pages from a site. This directive in the default syntax prevents specific path(s) of a site from being crawled.

Directive: Allow
Impact: Tells a crawler the specific pages on your site you want indexed, so you can use this in combination with Disallow. If both Disallow and Allow clauses apply to a URL, the most specific rule - the longest rule - applies.
Use Cases: This is particularly useful in conjunction with Disallow clauses, where a large section of a site is disallowed except for a small section within it.

Directive: $ Wildcard Support
Impact: Tells a crawler to match everything from the end of a URL -- a large number of directories without specifying specific pages (available by end of June).
Use Cases: 'No Crawl' files with specific patterns, e.g., files of certain file types that always have a certain extension, say '.pdf'.

Directive: * Wildcard Support
Impact: Tells a crawler to match a sequence of characters (available by end of June).
Use Cases: 'No Crawl' URLs with certain patterns, e.g., disallow URLs with session IDs or other extraneous parameters.

Directive: Sitemaps Location
Impact: Tells a crawler where it can find your sitemaps.
Use Cases: Point to other locations where feeds exist to point the crawlers to the site's content.

2. HTML META Directives

The tags below can be present as META tags in the page HTML or as X-Robots-Tag directives in the HTTP header. This allows non-HTML resources to implement identical functionality. If both forms of tags are present for a page, the most restrictive version applies.

Directive: NOINDEX META Tag
Impact: Tells a crawler not to index a given page.
Use Cases: Don't index the page. This allows pages that are crawled to be kept out of the index.

Directive: NOFOLLOW META Tag
Impact: Tells a crawler not to follow a link to other content on a given page.
Use Cases: Prevent publicly writeable areas from being abused by spammers looking for link credit. With NOFOLLOW, you let the robot know that you are discounting all outgoing links from this page.

Directive: NOSNIPPET META Tag
Impact: Tells a crawler not to display snippets in the search results for a given page.
Use Cases: Present no abstract for the page in the search results.

Directive: NOARCHIVE / NOCACHE META Tag
Impact: Tells a search engine not to show a "cached" link for a given page.
Use Cases: Do not make a copy of the page available to users from the search engine cache.

Directive: NOODP META Tag
Impact: Tells a crawler not to use a title and snippet from the Open Directory Project for a given page.
Use Cases: Do not use the ODP (Open Directory Project) title and abstract for this page in Search."
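As a rough illustration of the two wildcard rules in the first part of the list, a robots.txt along the following lines would keep compliant crawlers away from all PDF files and from URLs carrying a session identifier; the patterns are hypothetical examples, not taken from the announcement:

User-agent: *
# '$' anchors the pattern to the end of the URL, so this matches any URL ending in .pdf
Disallow: /*.pdf$
# '*' matches any sequence of characters, so this matches any URL containing 'sessionid='
Disallow: /*sessionid=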
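The META directives in the second part of the list take the form of a standard robots META tag, and the same values can be delivered as an X-Robots-Tag HTTP header when the resource is not an HTML page; the combinations below are illustrative only:

In the page HTML:
<meta name="robots" content="noindex, nofollow">
<meta name="robots" content="nosnippet, noarchive, noodp">

In the HTTP response header (useful for non-HTML resources such as PDF files):
X-Robots-Tag: noindex, noarchive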