Of all the elements of the World Wide Web equation, search engines are without a doubt the ones that offer the least control to end users, content developers and designers alike. Sure, with enough white-hat or black-hat SEO techniques, search engines can be manipulated, but actual control is out of the reach of mere webmasters. There are exceptions, of course, in which it is not the search engines themselves but the indexing process and tools that can be controlled. This is done via the Robots Exclusion Protocol (REP). Earlier this week, Microsoft announced that, together with Google and Yahoo, it would offer insight into how each company implements the protocol.
This means that webmasters will be able to reap the benefits of a common implementation of REP across Google, Yahoo and Live Search. Webmasters can use REP to tell search engines how to crawl and display the content of their websites. Via the protocol, webmasters can actually communicate with the search engines, even if the conversation is one-sided rather than a dialog. According to Nathan Buggia of the Live Search Webmaster team, this is the first time that Microsoft, Google and Yahoo have agreed on a common way for webmasters to use REP with their respective search engines.
When it was initially introduced, the REP standard was almost exclusively focused on exclusion directives. This is no longer the case. The protocol has evolved and is now designed to permit control over how content gets indexed and displayed. This is the first time that the three major search engines worldwide have come together to offer a common implementation of REP. The list below, put together by Fabrice Canel and Nathan Buggia of the Live Search Webmaster Team, details the Robots Exclusion Protocol features that are shared across Google, Yahoo and Live Search:
1. Robots.txt Directives
Directive: Disallow
Impact Use: Tells a crawler not to crawl your site or parts of your site -- your site's robots.txt still needs to be crawled to find this directive, but the disallowed pages will not be crawled.
Use Cases: 'No crawl' pages from a site. This directive in the default syntax prevents specific path(s) of a site from being crawled.
Directive: Allow
Impact Use: Tells a crawler the specific pages on your site you want indexed so you can use this in combination with Disallow. If both Disallow and Allow clauses apply to a URL, the most specific rule - the longest rule - applies.
Use Cases: This is useful in particular in conjunction with Disallow clauses, where a large section of a site is disallowed, except a small section within it.
Directive: $ Wildcard Support
Impact Use: Tells a crawler to anchor a match at the end of a URL, so that a large number of directories can be covered without specifying specific pages (available by end of June)
Use Cases: 'No Crawl' files with specific patterns, e.g., files of a type that always has a certain extension, say '.pdf'.
Directive: * Wildcard Support
Impact Use: Tells a crawler to match a sequence of characters (available by end of June)
Use Cases: 'No Crawl' URLs with certain patterns, e.g., disallow URLs with session ids or other extraneous parameters.
Directive: Sitemaps Location
Impact Use: Tells a crawler where it can find your sitemaps.
Use Cases: Point to other locations where feeds exist that direct the crawlers to the site's content.
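To make the robots.txt rules above more concrete, here is a minimal Python sketch of how a crawler might evaluate them. It is an illustration under stated assumptions, not the code any of the three search engines actually runs: the function names (`pattern_to_regex`, `parse_rules`, `is_allowed`) and the example.com paths are hypothetical, User-agent groups are ignored for brevity, and "the longest rule" is interpreted as the pattern with the most characters.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any sequence of characters and a trailing '$' anchors the
    pattern to the end of the URL. Everything else is matched literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def parse_rules(robots_txt):
    """Collect (kind, pattern) rules and Sitemap URLs from robots.txt text.

    Simplification (assumption): User-agent groups are ignored, so every
    rule is treated as if it applied to every crawler.
    """
    rules, sitemaps = [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (s.strip() for s in line.split(":", 1))
        field = field.lower()
        if field in ("allow", "disallow") and value:
            rules.append((field, value))
        elif field == "sitemap":
            sitemaps.append(value)
    return rules, sitemaps


def is_allowed(path, rules):
    """Apply the longest matching rule; a path with no match is allowed."""
    best = None
    for kind, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            # On a tie in length, the rule listed first is kept.
            if best is None or len(pattern) > len(best[1]):
                best = (kind, pattern)
    return best is None or best[0] == "allow"


# Hypothetical robots.txt used only for illustration.
EXAMPLE = """\
User-agent: *
Disallow: /private/
Allow: /private/press/
Disallow: /*.pdf$
Disallow: /*sessionid=
Sitemap: http://www.example.com/sitemap.xml
"""

rules, sitemaps = parse_rules(EXAMPLE)
for path in ("/private/report.html", "/private/press/launch.html",
             "/docs/manual.pdf", "/shop/item?sessionid=abc"):
    print(path, "->", "crawl" if is_allowed(path, rules) else "skip")
print("sitemaps:", sitemaps)
```

In this sketch, `/private/press/launch.html` is crawled even though `/private/` is disallowed, because the Allow pattern is the longer of the two matching rules.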
2. HTML META Directives
The tags below can be present as META tags in the page HTML or as X-Robots-Tag directives in the HTTP header. This allows non-HTML resources to implement identical functionality. If both forms of tags are present for a page, the most restrictive version applies.
Directive: NOINDEX META Tag
Impact Use: Tells a crawler not to index a given page
Use Cases: Don't index the page. This allows pages that are crawled to be kept out of the index.
Directive: NOFOLLOW META Tag
Impact Use: Tells a crawler not to follow links to other content from a given page
Use Cases: Prevent publicly writeable areas from being abused by spammers looking for link credit. By using NOFOLLOW, you let the robot know that you are discounting all outgoing links from this page.
Directive: NOSNIPPET META Tag
Impact Use: Tells a crawler not to display snippets in the search results for a given page
Use Cases: Present no abstract for the page on Search Results.
Directive: NOARCHIVE / NOCACHE META Tag
Impact Use: Tells a search engine not to show a "cached" link for a given page
Use Cases: Do not make a copy of the page available to users from the Search Engine cache.
Directive: NOODP META Tag
Impact Use: Tells a crawler not to use a title and snippet from the Open Directory Project for a given page
Use Cases: Do not use the ODP (Open Directory Project) title and abstract for this page in Search.
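As a rough illustration of how the page-level directives above might be read, the following Python sketch collects them from both the robots META tag and the X-Robots-Tag HTTP header. Several things here are assumptions rather than documented behavior: the helper names (`RobotsMetaParser`, `split_directives`, `effective_directives`) are hypothetical, NOCACHE is treated as a synonym for NOARCHIVE, crawler-specific prefixes in the header are not handled, and "the most restrictive version applies" is modeled simply as the union of all restrictions found in either place.

```python
from html.parser import HTMLParser

# Assumption: NOCACHE and NOARCHIVE are treated as the same directive.
ALIASES = {"nocache": "noarchive"}


def split_directives(value):
    """Turn 'NOINDEX, nofollow' into {'noindex', 'nofollow'}."""
    names = (d.strip().lower() for d in value.split(","))
    return {ALIASES.get(d, d) for d in names if d}


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            self.directives |= split_directives(attr.get("content") or "")


def effective_directives(html, headers):
    """Merge directives from the page markup and the HTTP headers.

    Every directive listed here is a restriction, so the union of both
    sources is used as the "most restrictive" combination.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    return parser.directives | split_directives(lowered.get("x-robots-tag", ""))


# Hypothetical page and response headers used only for illustration.
page = '<html><head><meta name="robots" content="NOINDEX, NOODP"></head></html>'
response_headers = {"X-Robots-Tag": "nofollow, nocache"}
print(sorted(effective_directives(page, response_headers)))
```

For a non-HTML resource such as a PDF, the same function can be called with an empty string for the markup, so that only the header contributes directives.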