16 June 2009 - 22:35
robots.txt
Almost 90% of the websites we visit on a daily basis have a small text file in the root directory called robots.txt. In our experience, many website owners and designers use this file but have a lot of questions and misconceptions about its purpose. The most common question we are asked is what it does beyond giving bots (spiders) a gateway to crawl your website. Here is what we have learned about this little guy called robots.txt.
Web Robots (also known as Web Wanderers, Crawlers, or Spiders) are programs that traverse the Web automatically. Search engines such as Google, Yahoo and others use them to index web content, spammers use them to scan for email addresses, and they have many, many other uses. Web site owners use the “robots.txt” file to give instructions about their site to web robots; this is called the Robots Exclusion Protocol (REP).
This robots protocol is an agreement among people writing robots (mainly for search engines) and people publishing web sites, to give the site owner a way to communicate with them and have some control over how robots interact with their website.
Although it is voluntary, it is a success because most sites want to be indexed and searched. However, some sites, or portions of sites, would prefer not to have their content indexed. Some robots ignore the directives, so any private content should be protected by authentication and access control (user name and password) rather than by robots.txt alone.
Robots, including search indexing tools and intelligent agents, should check a special file in the root of each server called robots.txt, which is a plain text file (not HTML). Robots.txt implements the REP (Robots Exclusion Protocol), which allows the web site administrator to define what parts of the site are off-limits to specific robot user agent names. Web administrators can Allow access to their web content and Disallow access to cgi, private and temporary directories. For example, maybe you have a family blog that you want to keep private, minutes of meetings that are meant only for the board, or private spreadsheets that you are password protecting.
In June 2008, the search engine companies Yahoo, Google, and Microsoft agreed to extend the Robots Exclusion Protocol. They added elements to robots.txt: an Allow directive, wildcards in URLs, a link to a sitemap for ease of crawling, IP authentication to identify search engine indexing robots, the X-Robots-Tag header field for non-HTML documents, and some additional META robots tag attributes. We will expand on the Allow directive and the X-Robots-Tag in a later part of the article.
Now that we have been formally introduced to Mr. Robots, let’s see what he is made of. In short, robots.txt is a simple text file that consists of a few elements. Here is the basic structure of a robots.txt file.
Example of robots.txt
# address all other robots using the wild card *
User-agent: *
# address only certain robots using the bot name
User-agent: Googlebot
# to allow all folders to be indexed by robots, use either of these two lines
Allow: /
Disallow:
# ***the Allow directive is a newer syntax introduced in 2008.
# list of folders robots are allowed to index
Allow: /products/
# to disallow all folders to be indexed by robots
Disallow: /
# list of folders robots are not allowed to index
Disallow: /cgi/
Disallow: /purchase/orders/
Disallow: /customer/
Disallow: /logs/
Disallow: /database/
Disallow: /images/
# list of specific files to exclude from index
Disallow: /purchase/orders.html
Disallow: /customer/login.html
# sitemap path
Sitemap: http://www.mysite.com/sitemap.xml
# end of robots.txt file
Here are some important things to remember while writing a robots.txt.
- Write the directives in their usual mixed-case form, capitalizing Allow: and Disallow:. Major crawlers accept other capitalization, but matching the convention is the safest choice.
- Remember that there is a hyphen between the words User and agent: User-agent:
- An asterisk (*) after User-agent: means all robots. If you include a section for a specific robot, that robot may not check the general all-robots section, so repeat the general directives in its section.
- The user agent name can be a substring, such as “Googlebot” (or “googleb”), “Slurp”, and so on. It should not matter how the name itself is capitalized.
- Disallow tells robots not to crawl anything that matches the URL path that follows.
- Allow is a new directive: older robot crawlers may not recognize it.
- URL paths are often case sensitive, so be consistent with the site capitalization.
- The longest matching directive path (not including wildcard expansion) should be the one applied to any page URL.
- In the original REP directory paths start at the root for that web server host, generally with a leading slash (/). This path is treated as a right-truncated substring match, an implied right wildcard.
- One or more wildcard (*) characters can now be in a URL path, but may not be recognized by older robot crawlers
- Wildcards do not lengthen a path — if there’s a wildcard directive path that’s shorter, as written, than one without a wildcard, the one with the path spelled out will generally override the one with the wildcard.
- Sitemap is a new directive for the location of the Sitemap file
- A blank line indicates a new user agent section.
- A hash mark (#) indicates a comment
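The longest-match rule from the list above can be sketched in a few lines of Python. This is only a minimal illustration under our own assumptions, not any crawler's actual implementation: the is_allowed helper and the rule list are hypothetical, wildcards are ignored, and when two rules match with the same length the first one seen wins.

```python
def is_allowed(rules, path):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, prefix) pairs for one user agent.
    The longest prefix that matches the path wins. If nothing matches,
    the path is allowed, as in the original protocol.
    """
    best_directive, best_prefix = "allow", ""
    for directive, prefix in rules:
        # An empty prefix (as in "Disallow:") never restricts anything.
        if prefix and path.startswith(prefix) and len(prefix) > len(best_prefix):
            best_directive, best_prefix = directive.lower(), prefix
    return best_directive == "allow"


# Hypothetical rules, for illustration only.
rules = [
    ("Disallow", "/customer/"),
    ("Allow", "/customer/help/"),
]

print(is_allowed(rules, "/customer/orders.html"))    # matched by /customer/
print(is_allowed(rules, "/customer/help/faq.html"))  # the longer Allow path wins
print(is_allowed(rules, "/products/"))               # no rule matches
```

The prefix test with startswith is the "right-truncated substring match" mentioned in the list: a rule for /customer/ covers everything beneath that folder.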
Here is a bit about what each directive does. The information below is solely based on our experience over the years and what we have learned in practice.
User-agent
The User-agent is the name of the client (browser or robot), sent as part of an HTTP request. It will appear in web site log files with a name and version number, such as Internet Explorer, Opera, or Mozilla (browsers), or Slurp, Googlebot, or MSNBot (search engine robots).
In a robots.txt file, directives may be aimed at all robot clients (User-agent: *) or at a specific one (User-agent: Googlebot). In addition to browsers and the three biggest search engines namely Google, Yahoo and MSN, many other robots also request content from a site.
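To see how a robot picks its section, here is a small sketch that uses Python's standard urllib.robotparser module. The robots.txt content and the ExampleBot name are made up for illustration, and other parsers may behave differently in edge cases.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with one specific and one general section.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /logs/

User-agent: *
Disallow: /logs/
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot reads only its own section, so /images/ stays open to it.
print(parser.can_fetch("Googlebot", "/images/logo.png"))
# Capitalization of the robot name does not matter.
print(parser.can_fetch("GOOGLEBOT", "/logs/access.log"))
# Any other robot (the name is invented) falls back to the * section.
print(parser.can_fetch("ExampleBot", "/images/logo.png"))
```

This is why general directives need to be repeated inside a specific robot's section: once a robot finds a section addressed to it, the * section is not consulted.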
Disallow
This directive indicates that robots should not access the specified directory, subdirectory or file.
User-Agent: Googlebot
Disallow: /logs/
Disallow: /logs/accesslog.php
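A quick way to check what a Disallow path actually covers is to feed it to Python's standard urllib.robotparser. The sketch below assumes a single Disallow: /logs/ rule; the test paths are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: Googlebot",
    "Disallow: /logs/",
])

# The rule is a prefix match: everything under /logs/ is covered,
# but a path that merely begins with "/logs" is not.
for path in ("/logs/", "/logs/2009/june.txt", "/logs.html", "/products/"):
    verdict = "allowed" if parser.can_fetch("Googlebot", path) else "blocked"
    print(path, verdict)
```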
Allow
This newer directive, introduced in 2008, specifically allows robots to follow certain paths when crawling. This means you can Disallow a section of a site and still Allow a specific part of it. Here’s an example that combines these concepts, applied to all robot crawlers:
User-Agent: *
Allow: /products/sizechart/
Disallow: /products/printable/
Allow: /products/printable/sizechart.pdf
While the previous robot exclusion protocol assumed that anything not disallowed was allowed, the new rule makes it more explicit.
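The combination of Allow and Disallow can be tried out with Python's standard urllib.robotparser. One assumption to flag: that parser has historically applied the first rule that matches, in file order, rather than the longest one, so in this sketch the more specific Allow line is deliberately listed before the Disallow line. With that ordering, first-match and longest-match parsers reach the same verdicts. The pricelist.pdf path is invented for illustration.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse("""\
User-agent: *
Allow: /products/printable/sizechart.pdf
Disallow: /products/printable/
Allow: /products/sizechart/
""".splitlines())

# The single PDF is carved out of the otherwise disallowed folder.
print(parser.can_fetch("*", "/products/printable/sizechart.pdf"))
print(parser.can_fetch("*", "/products/printable/pricelist.pdf"))
print(parser.can_fetch("*", "/products/sizechart/"))
```

Listing the most specific rules first is a cheap way to get consistent results from parsers that disagree on rule precedence.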
Wildcards
Wildcards, typed as asterisks (*), are used to specify any number of characters (including zero), in the URL paths. They make it easy to use patterns to direct the robots to the parts of a site that should be indexed, but keep them away from areas that should not. The $ character specifies that the pattern matched must be at the end of the file path.
Disallow: *.pdf$
Disallow: /catalogs/*/previous_editions/*
Allow: /catalogs/current/*
Disallow: /catalogs/future_editions/*
Note that wildcards can be very tricky, and a lot depends on how a site is structured, mainly from a URL point of view.
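Because wildcard rules are easy to get wrong, it helps to test them before publishing. Python's standard robots.txt parser does not expand wildcards, so the sketch below translates a pattern into a regular expression by hand. The pattern_to_regex and matches helpers are our own hypothetical names, and the translation reflects our reading of the * and $ rules described above, not any search engine's code.

```python
import re


def pattern_to_regex(pattern):
    """Compile a robots.txt path pattern into a regular expression.

    "*" stands for any run of characters (including none) and a trailing
    "$" anchors the end of the path. Without "$" the pattern is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" where a "*" stood.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    # re.match anchors the pattern at the start of the path.
    return pattern_to_regex(pattern).match(path) is not None


print(matches("*.pdf$", "/catalogs/current/chart.pdf"))
print(matches("*.pdf$", "/catalogs/current/chart.pdf.html"))
print(matches("/catalogs/*/previous_editions/*", "/catalogs/2008/previous_editions/spring.html"))
print(matches("/catalogs/*/previous_editions/*", "/catalogs/current/index.html"))
```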
Sitemap location
Sitemaps are lists of URLs in XML format. They can contain information about specific pages, such as the last modified date, which is helpful if your web server doesn’t keep proper track of content change dates, as well as each page’s expected change frequency. A priority tag tells the crawler which pages you think are most important, although it’s unlikely that any of the larger search engines will use that for relevance ranking or even for determining the recrawl rate.
Sitemap: http://www.mysite.com/sitemap.xml
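A crawler written in Python can read the Sitemap line with the standard urllib.robotparser module. This sketch assumes Python 3.8 or later, where the site_maps() method is available; the robots.txt lines are the illustrative ones used in this article.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /logs/",
    "Sitemap: http://www.mysite.com/sitemap.xml",
])

# site_maps() returns a list of sitemap URLs,
# or None when the file has no Sitemap line.
print(parser.site_maps())
```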
If you serve content via both http and https, you’ll need a separate robots.txt file for each of these protocols. For example, to allow robots to index all http pages but no https pages, you’d use the robots.txt files as follows, for your http protocol:
User-agent: *
Disallow:
And for the https protocol:
User-agent: *
Disallow: /
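The reason two files are needed is that a robot looks for robots.txt at the root of the same protocol and host as the page it wants to fetch. A minimal sketch of that lookup, using a hypothetical robots_url helper and made-up page URLs:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL that governs `page_url`."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.mysite.com/products/list.html"))
print(robots_url("https://www.mysite.com/checkout/cart.html"))
```

The two calls produce different URLs, one under http and one under https, so each protocol needs its own file.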
Bots check for the robots.txt file when they come to a website. The rules in the robots.txt file take effect once the file is uploaded to the site’s root and the bot next comes to the site. How often it is accessed depends on how frequently the bots spider the site, which is based on popularity, authority, and how frequently content is updated. Some sites may be crawled several times a day while others may only be crawled a few times a week. Google Webmaster Central provides a way to see when Googlebot last accessed the robots.txt file. I recommend using a robots.txt analysis tool to check your robots.txt file.
Myths about robots.txt
Below are some of the questions we have been asked many times. In my experience, they all rest on pure myths.
Myth1: Does my site require a robots.txt file in order to be indexed?
The FACT is: No, your site will be indexed whether you create a robots.txt file or not. A robots.txt file will not draw robots to your site any faster than normal.
Myth2: Does my site need a robots.txt file in order to rank higher?
The FACT is: No, your robots.txt file will only tell the robots what pages and links can or cannot be indexed. However, having a robots.txt file can have a secondary effect on your site’s rankings: if you improve how easily your site can be crawled, your site rankings may also improve.
Myth3: Can I block pages completely by using “Disallow” statements?
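The FACT is: No, though the Disallow statement is powerful, it cannot guarantee a 100% invisible page. Listing a page or directory in your robots.txt file asks well-behaved robots not to crawl it, but that doesn’t mean it will never show up in search results. That’s an important distinction to remember: robots.txt limits crawling but does not by itself prevent indexing, because a search engine can still list a disallowed URL it discovers through links from other pages, and robots that ignore the protocol can fetch it anyway. If you want to keep a page out of the index, consider the Meta Robots tag with “noindex, nofollow”, and keep in mind that a crawler can only see that tag if robots.txt does not block the page.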
Myth4: Will bots access my site more, or better, if it includes a robots.txt file?
The FACT is: No, some search bots are simply out there to scour your site for e-mail addresses for spamming purposes. Knowing how to block them will aid in the ongoing spam war.
