16 June 2009 - 22:35 robots.txt

Almost 90% of the websites we visit on a daily basis have a small text file in the root directory called robots.txt. In our experience, many website owners and designers use this file but have a lot of questions and misconceptions about its purpose. The most common question we are asked is what else it does other than provide a gateway for bots (spiders) to crawl your website. Here is what we have learned about this little guy called robots.txt.

Web Robots (also known as Web Wanderers, Crawlers, or Spiders) are programs that traverse the Web automatically. Search engines such as Google and Yahoo use them to index web content, spammers use them to scan for email addresses, and they have many, many other uses. Web site owners use the “robots.txt” file to give instructions about their site to web robots; this is called the Robots Exclusion Protocol (REP).

This robots protocol is an agreement among people writing robots (mainly for search engines) and people publishing web sites, to give the site owner a way to communicate with them and have some control over how robots interact with their website.

The protocol is voluntary, and it succeeds because most sites want to be indexed and searched. However, some site owners would prefer that certain sites or portions of sites not be indexed. Some robots ignore the directives, so any truly private content should be protected by authentication and access control (user name and password) rather than by robots.txt alone.

Robots, including search indexing tools and intelligent agents, should check a special file in the root of each server called robots.txt, which is a plain text file (not HTML). Robots.txt implements the REP (Robots Exclusion Protocol), which allows the web site administrator to define what parts of the site are off-limits to specific robot user agent names. Web administrators can Allow access to their web content and Disallow access to cgi, private and temporary directories. For example, maybe you have a family blog that you want to keep private, minutes of meetings that are meant only for the board, or private spreadsheets that you are already password protecting.
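As a rough illustration of that check, here is a minimal Python sketch using the standard library module urllib.robotparser. The robots.txt content, the bot name ExampleBot, and the URLs are made up for the example; a real robot would download the file from the root of the site rather than parse an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi/
Disallow: /private/
"""


def build_parser(text):
    """Parse robots.txt text and return a parser that can answer can_fetch()."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A polite robot asks before requesting each URL.
public_ok = parser.can_fetch("ExampleBot", "http://www.mysite.com/products/")
private_ok = parser.can_fetch("ExampleBot", "http://www.mysite.com/private/minutes.html")

print(public_ok)   # True
print(private_ok)  # False
```

Nothing forces a robot to make this call, which is why password protection is still needed for private content.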

In June 2008, the search engine companies Yahoo, Google, and Microsoft agreed to extend the Robots Exclusion Protocol. The additions include an Allow directive, wildcards in URLs, a link to a sitemap for ease of crawling, IP authentication to identify search engine indexing robots, the X-Robots-Tag header field for non-HTML documents, and some additional META robots tag attributes. We will expand on the Allow directive and the X-Robots-Tag in a later part of the article.

Now that we have been formally introduced to Mr. Robots, let’s see what he/she is made of. In short, robots.txt is a simple text file made up of a few elements. Here is the basic structure of a robots.txt file.

Example of robots.txt

# address all robots using the wildcard *
User-agent: *
# or address only a certain robot using the bot name
User-agent: Googlebot
# to allow all folders to be crawled by robots, use either of these lines
Allow: /
Disallow:
# ***the Allow directive is a newer syntax introduced in 2008.
# list of folders robots are allowed to crawl
Allow: /products/
# to disallow all folders from being crawled by robots
Disallow: /
# list of folders robots are not allowed to crawl
Disallow: /cgi/
Disallow: /purchase/orders/
Disallow: /customer/
Disallow: /logs/
Disallow: /database/
Disallow: /images/
# list of specific files to exclude (no trailing slash after a file name)
Disallow: /purchase/orders.html
Disallow: /customer/login.html
# sitemap path
Sitemap: http://www.mysite.com/sitemap.xml
# end of robots.txt file

Here are some important things to remember while writing a robots.txt.

  • Write directives in their conventional mixed case, such as Allow: and Disallow:. Major crawlers generally accept any capitalization, but older parsers may not.
  • Remember the hyphen between the words User and agent: User-agent:
  • An asterisk (*) after User-agent: means all robots. If you include a section for a specific robot, that robot may not read the general all-robots section, so repeat the general directives in its section.
  • The user agent name can be a substring, such as “Googlebot” (or “googleb”), “Slurp”, and so on. It should not matter how the name itself is capitalized.
  • Disallow tells robots not to crawl any URL whose path matches the path that follows.
  • Allow is a newer directive; older robot crawlers will not recognize it.
  • URL paths are often case sensitive, so be consistent with the site’s capitalization.
  • The longest matching directive path (not including wildcard expansion) should be the one applied to any page URL.
  • In the original REP, directory paths start at the root for that web server host, generally with a leading slash (/). The path is treated as a right-truncated substring match, with an implied wildcard on the right.
  • One or more wildcard (*) characters can now appear in a URL path, but may not be recognized by older robot crawlers.
  • Wildcards do not lengthen a path: if a wildcard directive path is shorter, as written, than one without a wildcard, the one with the path spelled out will generally override the one with the wildcard.
  • Sitemap is a newer directive that gives the location of the Sitemap file.
  • A blank line indicates a new user agent section.
  • A hash mark (#) indicates a comment.
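The points about user agent sections can be tried out with the Python standard library module urllib.robotparser. This is only a sketch: the file content and the bot names are invented, and other parsers may behave differently. It shows a robot with its own section ignoring the general section, and the user agent name being matched as a case-insensitive substring.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file with a general section and a Googlebot-specific section.
ROBOTS_TXT = """\
User-agent: *
Disallow: /logs/
Disallow: /database/

User-agent: Googlebot
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot has its own section, so the general /logs/ rule is not applied to it.
googlebot_logs = parser.can_fetch("Googlebot", "/logs/access.html")
googlebot_images = parser.can_fetch("Googlebot", "/images/logo.png")

# The name is matched without regard to case, and a version suffix is ignored.
googlebot_versioned = parser.can_fetch("googlebot/2.1", "/images/logo.png")

# Any other robot falls back to the general (*) section.
other_logs = parser.can_fetch("OtherBot", "/logs/access.html")

print(googlebot_logs, googlebot_images, googlebot_versioned, other_logs)
# True False False False
```

To keep Googlebot out of /logs/ as well, the Disallow: /logs/ line would have to be repeated in the Googlebot section.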

Here is a bit about what each directive does. The information below is solely based on our experience over the years and what we have learned in practice.

User-agent

The User-agent is the name of the client (browser or robot), sent as part of an HTTP request. It appears in web site log files with a name and version number, such as Internet Explorer, Opera, or Mozilla (browsers), or Slurp, Googlebot, or MSNBot (search engine robots).

In a robots.txt file, directives may be aimed at all robot clients (User-agent: *) or at a specific one (User-agent: Googlebot). In addition to browsers and the three biggest search engines, namely Google, Yahoo and MSN, many other robots also request content from a site.

Disallow

The Disallow directive indicates that robots should not access the specified directory, subdirectory or file.

User-agent: Googlebot
Disallow: /logs/
Disallow: /logs/accesslog.php

Allow

This directive is newer (introduced in 2008) and specifically allows robots to follow certain paths when crawling. This means you can Allow a section of a site and Disallow a specific subsection, or the reverse. Here’s an example that combines these concepts, applied to all robot crawlers:

User-agent: *
Allow: /products/sizechart/
Disallow: /products/printable/
Allow: /products/printable/sizechart.pdf

While the previous robot exclusion protocol assumed that anything not disallowed was allowed, the new rule makes it more explicit.
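To make the precedence concrete, here is a small Python sketch of the longest-match rule mentioned in the list of things to remember. The is_allowed helper and the rule tuples are our own illustration, not code from any crawler. It ignores wildcards, and the way it breaks ties (the first rule of a given length wins) is an assumption.

```python
def is_allowed(rules, path):
    """Apply the longest matching rule to a URL path.

    rules is a list of (directive, path_prefix) tuples, for example
    ("Disallow", "/products/printable/"). Anything no rule matches is allowed.
    """
    best_length = -1
    allowed = True
    for directive, prefix in rules:
        # An empty prefix (as in "Disallow:") matches nothing and is skipped.
        if prefix and path.startswith(prefix) and len(prefix) > best_length:
            best_length = len(prefix)
            allowed = directive.lower() == "allow"
    return allowed


# Hypothetical rules in the same spirit as the example above.
RULES = [
    ("Allow", "/products/sizechart/"),
    ("Disallow", "/products/printable/"),
    ("Allow", "/products/printable/sizechart.pdf"),
]

print(is_allowed(RULES, "/products/printable/sizechart.pdf"))  # True
print(is_allowed(RULES, "/products/printable/catalog.pdf"))    # False
print(is_allowed(RULES, "/about.html"))                        # True
```

The size chart is allowed because its Allow path is longer than the Disallow path that also matches it.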

Wildcards

Wildcards, typed as asterisks (*), are used to specify any number of characters (including zero), in the URL paths. They make it easy to use patterns to direct the robots to the parts of a site that should be indexed, but keep them away from areas that should not. The $ character specifies that the pattern matched must be at the end of the file path.

Disallow: /*.pdf$
Disallow: /catalogs/*/previous_editions/*
Allow: /catalogs/current/*
Disallow: /catalogs/future_editions/*

Note that wildcards can be very tricky to use, and a lot depends on how a site is structured, mainly from a URL point of view.
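One way to check a wildcard pattern before publishing it is to translate it into a regular expression and try it against a few of your own URLs. The Python sketch below is our own approximation of the * and $ behavior described above, not the code of any search engine; the function names and test paths are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression.

    "*" matches any run of characters (including none). A trailing "$" means
    the pattern must match up to the end of the path. Without it, the pattern
    only has to match the beginning of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then join them with ".*" where each "*" was.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def matches(pattern, path):
    """Return True when the URL path is covered by the pattern."""
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/*.pdf$", "/docs/guide.pdf"))         # True
print(matches("/*.pdf$", "/docs/guide.pdf?page=2"))  # False
print(matches("/catalogs/*/previous_editions/*",
              "/catalogs/2008/previous_editions/spring.html"))  # True
```

Running a handful of real paths from the site through a check like this can catch a pattern that blocks more, or less, than intended.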

Sitemap location

Sitemaps are lists of URLs in XML format. They can contain information about specific pages, such as the last modified date, which is helpful if your web server doesn’t keep proper track of content change dates, as well as each page’s expected change frequency. A priority tag tells the crawler which pages you think are most important, although it’s unlikely that any of the larger search engines will use that for relevance ranking or even for determining the recrawl rate.

Sitemap: http://www.mysite.com/sitemap.xml
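For readers who have not seen one, the Python sketch below builds a tiny Sitemap with the standard library module xml.etree.ElementTree. The page list, dates, and values are invented for illustration, and the build_sitemap helper is our own. The element names (loc, lastmod, changefreq, priority) and the namespace come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return Sitemap XML text for a list of page dictionaries.

    Each dictionary needs a "loc" key and may carry the optional
    "lastmod", "changefreq", and "priority" keys.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url, tag).text = str(page[tag])
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages for the example site.
xml_text = build_sitemap([
    {"loc": "http://www.mysite.com/", "lastmod": "2009-06-16",
     "changefreq": "weekly", "priority": "0.8"},
    {"loc": "http://www.mysite.com/products/"},
])
print(xml_text)
```

The resulting file would be saved as sitemap.xml at the location named on the Sitemap line.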

If you serve content via both http and https, you’ll need a separate robots.txt file for each of these protocols. For example, to allow robots to index all http pages but no https pages, you’d use the robots.txt files as follows, for your http protocol:

User-agent: *
Disallow:

And for the https protocol:

User-agent: *
Disallow: /
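The reason two files are needed is that a robot looks for robots.txt at the root of the same protocol and host as the page it wants. A minimal Python sketch of that lookup, using a helper name of our own (robots_txt_url) and made-up page URLs:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.mysite.com/products/list.html?page=2"))
# http://www.mysite.com/robots.txt
print(robots_txt_url("https://www.mysite.com/customer/login.html"))
# https://www.mysite.com/robots.txt
```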

Bots generally check the robots.txt file when they come to a website, although some cache it for a while (often up to a day) rather than requesting it on every visit. New rules take effect once the file is uploaded to the site’s root and the bot next fetches it. How often that happens depends on how frequently the bot spiders the site, which is based on popularity, authority, and how frequently content is updated. Some sites may be crawled several times a day while others may only be crawled a few times a week. Google Webmaster Central provides a way to see when Googlebot last accessed the robots.txt file. I recommend using a robots.txt analysis tool to check your robots.txt file.

Myths about robots.txt

Below are some of the questions we have been asked many times. Based on my experience, the ideas behind them are all pure myths.

Myth 1: Does my site require a robots.txt file in order to be indexed?

The FACT is: No, your site will be indexed whether you create a robots.txt file or not. A robots.txt file will not draw robots to your site any faster than normal.

Myth 2: Does my site need a robots.txt file in order to rank higher?

The FACT is: No, your robots.txt file will only tell the robots what pages and links can or cannot be crawled. However, having a robots.txt file can have a secondary effect on your site’s rankings: if you make your site easier to crawl, your site rankings may also improve.

Myth 3: Can I block pages completely by using “Disallow” statements?

The FACT is: No, though the Disallow statement is powerful, you cannot guarantee a 100% invisible page. Just because a page or directory is listed in your robots.txt file doesn’t mean it will stay out of search results. That’s an important distinction to remember: robots.txt asks compliant robots not to crawl a page, but it does nothing to stop the page’s URL from being indexed if other sites link to it, and robots that ignore the protocol can still crawl it. If you want to keep a page out of the index, consider the Meta Robots tag employing “noindex/nofollow”. Note that a robot has to be able to crawl the page to see that tag, so do not also Disallow the page. Truly private pages should be protected with a password.
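A robot can only obey a Meta Robots tag after it has downloaded the page and read the tag. The Python sketch below shows roughly how a robot might pull those instructions out of a page, using the standard library module html.parser. The class name, helper name, and sample page are hypothetical, and real crawlers handle many more cases.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of Meta Robots directives in an HTML document."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page that asks not to be indexed or followed.
PAGE = ('<html><head><meta name="robots" content="noindex, nofollow">'
        '</head><body>Board minutes</body></html>')

print(sorted(robots_directives(PAGE)))  # ['nofollow', 'noindex']
```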

Myth 4: Will bots access my site better when it includes a robots.txt file?

The FACT is: No, some search bots are simply out there to scour your site for e-mail addresses for spamming purposes. Knowing how to block them will aid in the ongoing spam war.

No Comments | Tags: Home, SEO Search Engine Optimization

20 May 2009 - 6:37 Email Upgrade May 20, 2009

May 20, 2009 - We are in the process of upgrading all Signature Hosting Email Services. We anticipate that this upgrade will complete between 9-10 a.m. EST. More details about the upgrade are listed below - please note there is NO data loss.

Q1: What is happening to my email?
A1: We will be changing the way our system operates with your email. This process will take place during the normal maintenance window, so you will not be affected outside of this routine process. Your mail will be held during this time and will be delivered as soon as the process is complete.

Q2: Why will this upgrade help me?
A2: It will expedite tasks within your email system. You will spend less time waiting for mail to download, and you will spend less time waiting for searches to complete when you are looking for a particular message.

Q3: Is there anything I have to do to ensure these changes go smoothly?
A3: We recommend you reduce the size of your mailbox(es) if it is over 1 GB in size. Otherwise, everything will be handled automatically.

Q4: When is this upgrade going to occur?
A4: On May 20, 2009 between approximately 12 am and 6 am EST. At this time, your email account will automatically be upgraded.

Q5: How will I know that the change has happened?
A5: You will notice that access to your email is much faster and search tasks are much quicker.

Q6: What kind of problems could I experience with this upgrade?
A6: For a brief period of time, you will not have access to your email. We are doing everything we can to minimize this downtime for you. Most upgrades will take place overnight during our normal maintenance windows. If you try to access your mail using a third party mail client such as Outlook over POP, IMAP, or SMTP, you will receive an error message during the upgrade. Additionally, the Mail section of your control panel will be blank, and any FormMail mailings will be held until the conversion is complete.

Q7: Is there anything I can do to fix any problems I find?
A7: If you are not able to access your mail for an extended period of time, please contact technical support.

Q8: Will any of my other mail services be affected?
A8: No, there are no functional changes in Webmail. Features like SpamAssassin and ClamAV will continue to work as before.

No Comments | Tags: Home, Server Software Updates

8 May 2009 - 21:51 Server Software Update Notification: 05-10-2009

Important updates in this Notification:

  • Apache 2.x (PCI Compliance) for Linux
  • PHP 5.x (PCI Compliance) for Linux
  • OpenSSL (PCI Compliance) for Linux
  • OpenSSH (PCI Compliance) for Linux
  • mtop for Linux
  • Accrisoft Freedom for Linux
  • Access database for Linux
  • vroot for Linux
  • Enhanced Webmail for Linux
  • WordPress for Linux
  • Dovecot for Linux
  • ClamAV for Linux and v3
  • SpamAssassin for v3
  • Sqlite for v3

The following dist will be completed 5/12/2009 in U.S. datacenters and 5/13/09 in European datacenters:

http://support.alpineweb.com/hosting/updates/2009/05092009_server_update.html

No Comments | Tags: Server Software Updates

20 April 2009 - 13:28 Server Software Update Notification: 04-18-2009

Important updates in this Notification:

  • PHP 5.x (PCI Compliance) for v3
  • mtop for v3
  • Accrisoft Freedom for v3
  • Access database for v3
  • Enhanced Webmail (Beta) for v3 and v2

The following dist will be completed 4/21/2009 in U.S. datacenters and 4/22/09 in European datacenters:

http://support.alpineweb.com/hosting/updates/2009/04182009_server_update.html

No Comments | Tags: Server Software Updates

28 March 2009 - 22:25 Server Software Update Notification: 03-28-2009

Important updates in this Notification:

  • Dovecot for v3
  • SpamAssassin for v3
  • OpenSSL (PCI Compliance) for v3, v2, and v1
  • Python for v3
  • Lynx for v3
  • Wget for v3
  • Ruby for v3
  • GnuTLS for v3
  • Enhanced Webmail (beta) for v3 and v2
  • ClamAV for v3 and v2

The following dist will be completed 3/31/2009 in U.S. datacenters and 4/1/09 in European datacenters:

http://support.alpineweb.com/hosting/updates/2009/03282009_server_update.html

No Comments | Tags: Server Software Updates

19 March 2009 - 10:12 Server Software Update Notification: 03-19-2009

Important updates in this Notification:

  • Migration for v3 and v2
  • Enhanced Webmail (beta) for v3 and v2
  • PostgreSQL for v3
  • Python for v3

The following dist will be completed 3/17/2009 in U.S. datacenters and 3/18/09 in European datacenters:

http://support.alpineweb.com/hosting/updates/2009/03192009_server_update.html

No Comments | Tags: Server Software Updates

13 March 2009 - 21:06 Happy Birthday World Wide Web

Hey all you web surfers, happy 20th anniversary of the World Wide Web. Well, at least it’s one of its birthdays.

On this date twenty years ago, Tim Berners-Lee submitted what is widely considered the original World Wide Web proposal:

http://www.w3.org/History/1989/proposal.html

An interesting story written a few years ago by James Gillies, co-author of “How The Web Was Born”, about the several birthdays and early history of the Web can be found here:

http://news.bbc.co.uk/2/hi/technology/7375703.stm

For more trivia, here’s an article about some of the milestones of the WWW at mirror.co.uk:

http://www.mirror.co.uk/news/top-stories/2009/03/12/no-headline-115875-21190824/

Happy surfing.

No Comments | Tags: Home

2 March 2009 - 14:50 Server Software Update Notification: 03-02-2009

Important updates in this Notification:

  • Apache 2.x (PCI compliance) for v3
  • Migration for v3 and v2
  • MySQL 5.0.x for v3
  • WordPress for v3 and v2
  • cURL for v3

The following dist will be completed 3/3/2009 in U.S. datacenters and 3/4/09 in European datacenters:

http://support.alpineweb.com/hosting/updates/2009/03022009_server_update.html

No Comments | Tags: Server Software Updates

17 February 2009 - 19:50 Server Software Update Notification: 02-17-2009

Important updates in this Notification:

  • Apache 2.x (PCI compliance) for v3
  • PHP 5.x (PCI compliance) for Linux and v3
  • Enhanced Webmail (beta) for v3 and v2
  • Dovecot for Signature

The following dist will be completed 2/17/2009 in U.S. datacenters and 2/18/09 in European datacenters:

http://support.alpineweb.com/hosting/updates/2009/02172009_server_update.html

No Comments | Tags: Server Software Updates

3 February 2009 - 12:32 Verizon to Fairpoint Switchover

As many of you know, Verizon customers in the New Hampshire and Maine area have been advised that they need to switch over to Fairpoint. This has been confusing for some of our hosting clients. Clients that use their domain name for email, and not a Verizon/Fairpoint email address, will not be affected by this change.

For folks that do use your ISP’s email, I will include below a link to Fairpoint’s instructions, which you might find useful.

http://www.fairpoint.com/northern_ne/transition/transition_faq_email.htm

No Comments | Tags: Home, Hosting