Robots.txt and search engine optimization

What is robots.txt?

The robots.txt file is a text file, written in a specific format, that lets webmasters control which areas of a website a crawler is allowed to analyze. The file is available at a fixed URL for a given site, e.g. http://www.monsite.com/robots.txt

To understand what robots.txt is, you first need to understand how search engine crawlers work (also called Web spiders, Web crawlers or bots) - Google's crawlers, Yahoo's or Bing's. Here is what they do when analyzing a site such as www.monsite.com (a short sketch of this behaviour follows the list):

  • they start by downloading and parsing the file http://www.monsite.com/robots.txt
  • they analyze the rules defined in the file to see which URLs they are allowed to download
  • if robots.txt allows it, they download the website's root page, i.e. http://www.monsite.com/
  • they analyze the content of this page and extract, above all, its list of internal links
  • each of these internal links is downloaded in turn (if the rules in the robots.txt file do not disallow it), and its own internal links are extracted
  • this process of downloading pages and extracting links continues until there are no new links left to follow.
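
As an illustration, here is a minimal sketch of this 'well behaved' behaviour in Python, using the standard urllib.robotparser module. The site URL and the 'MyCrawler' user agent are placeholders, not part of the article:

from urllib import robotparser

# Download and parse the site's robots.txt (placeholder URL)
parser = robotparser.RobotFileParser()
parser.set_url("http://www.monsite.com/robots.txt")
parser.read()

# Before downloading any page, a polite crawler checks the rules first
for url in ["http://www.monsite.com/", "http://www.monsite.com/images/logo.png"]:
    if parser.can_fetch("MyCrawler", url):
        print("allowed:", url)      # the crawler may download and analyze this URL
    else:
        print("disallowed:", url)   # a well behaved crawler skips it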

It is important to understand that robots.txt is by no means a way to secure a website. A 'well behaved' robot will honour this file and will not download disallowed URLs. But a 'rude' robot - for example, a competitor who wants to analyze your site - has no technical obligation to follow these rules. Of course, the robots of all major search engines (Google, Yahoo, Bing) are well behaved. And so is Yakaferci!

Do I need robots.txt for my website?

It is not mandatory for a website to have a robots.txt file. If the file does not exist, crawlers will simply analyze every URL they can find.

To decide whether your site needs a robots.txt file, ask yourself one simple question: are there unsecured areas of your website that you do not want to appear in the search results of Google, Yahoo, Bing...? If the answer is yes, then you need a robots.txt file. Otherwise, it is not useful.

How to create a robots.txt file?

To create a robots.txt file, it is recommended to use a simple text editor such as Notepad or TextEdit, not a word processor.

A robots.txt file contains a set of rules. A rule is defined by three values:

  • the User-agent: which crawlers does the rule apply to? (all crawlers, only Google, only Bing...)
  • Allow / Disallow: does the rule allow or, on the contrary, block certain URLs?
  • the URL pattern (a simple regular expression): which of the website's URLs does the rule apply to?

To create your robots.txt you can either write the file by hand or use a tool that generates it automatically. If you need a robots.txt, Yakaferci recommends creating it manually.

Unless you have solid technical knowledge, you should not create a sophisticated robots.txt file, for two reasons:

  • the more complicated the robots.txt is, the higher the risk of making a mistake - and such a mistake can have catastrophic consequences: your public pages may no longer be indexed by Google!
  • if you use complicated regular expressions in the URL patterns, keep in mind that only certain crawlers (notably Google) interpret them correctly; you take the risk that other crawlers will not understand them.

Robots.txt file example

Here is an example of a robots.txt file:

# prevent crawlers from indexing images
User-agent: *
Disallow: /*.jpg$
Disallow: /*.png$
Disallow: /*.gif$
Disallow: /images/
Allow: /

In this example, crawlers are asked not to index the website's images (everything in the /images/ directory and every URL ending in .jpg, .png or .gif). Everything else can be indexed.
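
To see how such patterns behave, here is a minimal, unofficial sketch in Python that translates a robots.txt pattern (with * and $) into a regular expression and tests a few sample paths against the Disallow lines above. It is a simplification for illustration only, not a full robots.txt parser:

import re

def pattern_matches(pattern, path):
    # Translate a robots.txt pattern into a regular expression:
    # '*' matches any sequence of characters, a trailing '$' anchors the end
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.match(regex, path) is not None

disallowed = ["/*.jpg$", "/*.png$", "/*.gif$", "/images/"]
for path in ["/images/logo.png", "/photo.jpg", "/contact.html"]:
    blocked = any(pattern_matches(p, path) for p in disallowed)
    print(path, "blocked" if blocked else "allowed")

# /images/logo.png and /photo.jpg come out blocked, /contact.html allowed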

Google and robots.txt

There is no official specification for the robots.txt file. The format is the result of a discussion among computer specialists that took place in the 1990s and was never formally standardized.

In the original convention, the Disallow / Allow lines were read from top to bottom, and the first rule that matched was applied.

In practice, however, many webmasters wrote their robots.txt incorrectly, for example like this:

User-agent: *
Allow: /
Disallow: /images/

In theory, in this example, URLs starting with /images/ are allowed because the directive "Allow: /" appears above "Disallow: /images/". However, the webmaster's intention was clearly to block the indexing of /images/.

This is precisely why Google adapted its handling of robots.txt to apply the rule that matches the URL most specifically. For example, the URL /images/logo.png is a closer (longer) match to /images/ than to /, so Google applies the rule "Disallow: /images/".
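
As an illustration of this "most specific match wins" behaviour, here is a minimal sketch in Python, limited to simple prefix rules without wildcards (the rules and URLs are just the ones from the example above):

# Minimal sketch: apply the rule whose pattern matches the path most specifically
rules = [("Allow", "/"), ("Disallow", "/images/")]

def is_allowed(path):
    matching = [(kind, pattern) for kind, pattern in rules if path.startswith(pattern)]
    if not matching:
        return True  # no rule matches: the URL is allowed by default
    # Keep the rule with the longest (most specific) matching pattern
    kind, _ = max(matching, key=lambda rule: len(rule[1]))
    return kind == "Allow"

print(is_allowed("/images/logo.png"))   # False: "/images/" is a longer match than "/"
print(is_allowed("/contact.html"))      # True: only "/" matches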

The Yakaferci analysis tool takes these particularities into account.

Analyzing the robots.txt file

Yakaferci provides a tool that detects the URLs blocked by your robots.txt file.


Analyze your robots.txt file with our free SEO Page Analyzer


To start the Yakaferci analysis, simply enter the URL of the page in the area above and click the Analyze button. You will then have access to the different sections of the report, in particular the one about robots.txt.

More info

If this article has made you want to learn more about robots.txt, here are a few useful links: