The robots.txt file was invented in 1994, at a time when visits from search engine robots could bring down a website through server overload. Limiting the number of robots passing over a site was then a matter of server capacity.
Today, with server capacities having greatly increased, the file serves a rather different purpose. By default, Google crawls your entire site and then decides whether or not to index each page, depending on whether its robots consider the page useful to people searching on Google.co.uk.
Why set up a robots.txt file?
Why prevent Google’s robots from crawling certain pages of a website? Quite simply, to optimise the robots’ crawling time. Websites with a large number of URLs, for example, use this file to stop robots from spending crawling time on pages whose indexing is not a priority. A robots.txt file is not a way to keep a web page out of the SERPs; it only manages robot traffic on your website by signalling your priorities.
When robot crawlers arrive on a site, they start by downloading the robots.txt file: this way, they first analyse the rules associated with the site. Once they have read these rules and instructions, they start exploring the website.
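The sequence described above (read the rules first, then explore) can be sketched with Python's standard urllib.robotparser module. This is only an illustration: the rules, the crawler name "ExampleBot" and the page paths are hypothetical, and real crawlers such as Googlebot apply their own, more elaborate matching logic.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, standing in for a robots.txt file that a crawler
# would download before exploring the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /drafts/
"""


def crawlable(urls, robots_txt, user_agent="ExampleBot"):
    """Return the URLs that the given rules allow user_agent to crawl."""
    parser = RobotFileParser()
    # Step 1: analyse the rules associated with the site.
    parser.parse(robots_txt.splitlines())
    # Step 2: only then decide which pages may be explored.
    return [url for url in urls if parser.can_fetch(user_agent, url)]


urls = [
    "http://www.exemple.com/",
    "http://www.exemple.com/search/?q=shoes",
    "http://www.exemple.com/products/shoes",
]
allowed = crawlable(urls, ROBOTS_TXT)
# The internal search URL is filtered out; the other two remain.
```

Against a live site, the same parser could load the file with set_url() and read() instead of parse(); the inline string is used here so the sketch runs without network access.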
So how do you decide which pages should not be crawled? It can be very useful to keep robots away from pages containing duplicate content, or from the results pages of your website’s internal search engine. The same can apply to confidential content or internal resources such as specifications or a white paper, although, as noted above, blocking crawling does not by itself keep a page out of the SERPs.
The robots.txt file can prevent crawling of three types of content:
- A web page.
- A resource file.
- A multimedia file.
Applied to web pages, robots.txt rules let you manage robot traffic on your website. This helps you avoid being overwhelmed by robots and lets you prioritise the pages most worthy of indexing. Pages blocked by robots.txt can still appear on the SERPs, but without any description.
How do I create a robots.txt file?
The robots.txt file is located at the root of a website, in this form: www.exemple.com/robots.txt. It can contain several kinds of rules: Disallow, Allow and Sitemap. The “User-agent” directive is mandatory, as it specifies which robot the rules that follow are intended for. An asterisk (*) targets all crawlers.
A rule can target a specific page by giving its URL path, or it can simply state a general instruction. Each rule has its own syntax, and a rule that does not respect this syntax down to the last character is likely to be ignored.
Here’s an example of a file provided by Google. The first step is to name, with “User-agent”, the robot responsible for crawling the site, in this case “Googlebot”.
User-agent: Googlebot
Disallow: /nogooglebot/
User-agent: *
Allow: /
Sitemap: http://www.exemple.com/sitemap.xml
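To see what these rules mean in practice, here is a small Python sketch that feeds a copy of the example to the standard urllib.robotparser module. The copy uses the conventional spacing and assumes the sitemap URL ends in sitemap.xml; the page paths and the crawler name "OtherBot" are hypothetical. Python's parser is a simplified reading of the rules, not Google's own implementation.

```python
from urllib.robotparser import RobotFileParser

# Copy of the example file (sitemap URL assumed to end in sitemap.xml).
EXAMPLE = """\
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: http://www.exemple.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(EXAMPLE.splitlines())

# Hypothetical pages used to probe the rules.
blocked_page = "http://www.exemple.com/nogooglebot/page.html"
normal_page = "http://www.exemple.com/index.html"

googlebot_on_blocked = parser.can_fetch("Googlebot", blocked_page)  # False
googlebot_on_normal = parser.can_fetch("Googlebot", normal_page)    # True
otherbot_on_blocked = parser.can_fetch("OtherBot", blocked_page)    # True

# site_maps() requires Python 3.8 or later.
sitemaps = parser.site_maps()
```

Only Googlebot is kept out of /nogooglebot/; every other crawler falls under the asterisk group and may crawl the whole site.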
Please note that the file name robots.txt must always be written in lower case, and that the file must be located at the root of your website. A website can only contain one robots.txt file.
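Because the file always sits at the root under the same lower-case name, its address can be worked out from any page of the site. A minimal sketch in Python, where the function name robots_txt_url and the sample page URL are illustrative assumptions:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Build the robots.txt address for the site that page_url belongs to."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path with /robots.txt and
    # drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


location = robots_txt_url("http://www.exemple.com/blog/post?id=3")
# location is "http://www.exemple.com/robots.txt"
```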
Where should the robots.txt file be placed?
A robots.txt file must be located at the root of a website and must be called “robots.txt”. It can contain comments, introduced by the # character. It currently responds to four directives: Disallow, to block a page or a group of pages; Allow, to authorise a particular page (by default, Google authorises all pages); Sitemap, to declare your sitemap; and User-agent, to define which robots the rules apply to.
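As a rough illustration of how comments and the four directives fit together, the following Python sketch strips comments and keeps only the directives named above. The function read_directives and the sample file are hypothetical, and the logic is deliberately simpler than what a real crawler does.

```python
# The four directives described above, in lower case for comparison.
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap"}


def read_directives(text):
    """Return (directive, value) pairs, ignoring comments and unknown lines."""
    directives = []
    for raw_line in text.splitlines():
        # Everything after a # is a comment.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue  # blank line, or not a "name: value" pair
        name, value = line.split(":", 1)
        name = name.strip().lower()
        if name in KNOWN_DIRECTIVES:
            directives.append((name, value.strip()))
    return directives


# Hypothetical file content, for illustration only.
SAMPLE = """\
# Rules for every crawler
User-agent: *
Disallow: /private/   # keep crawlers out of this folder
Allow: /private/press-kit.html
Sitemap: http://www.exemple.com/sitemap.xml
"""

pairs = read_directives(SAMPLE)
```

Splitting on the first colon only is what keeps the colon inside the sitemap URL intact.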