The robots.txt file is a tool you can use to control how the search engines view your site. It basically tells search engines how to behave when crawling your content. And they can be extremely valuable for SEO and site management in general.
Some of the things I will talk about in this article include:
- What is a robots.txt file?
- Do I need a robots.txt file?
- How to create a robots.txt file
- Some examples of what to include in a robots.txt file
- Using robots.txt is no guarantee
- robots.txt and WordPress
- Robots.txt File Generators
A Brief History of Crawlers and Web Search
Humans have short, selective memories. For example, we take Google for granted. Many of us think of it as always having been an intelligent directory of (almost) everything ever available on the web.
But the early days of the web were dark and confusing times, brothers and sisters. There was no intelligent way to find anything.
Oh, we had search engines. WebCrawler was the first one most people heard of, and it was quickly joined by Lycos. They indexed everything they could find on the web, and it worked. In fact, they worked a little too well.
When you are looking for something specific but you have to search through everything in the world, the search results can be… less than useful. If you ever used WebCrawler, Lycos, or any of the other pre-Google search engines (hello, AltaVista!), you remember pages and pages of results that had nothing to do with what you were looking for.
Indexing everything was problematic
The problem with indexing everything is that it can yield (and often did yield) worthless search results. Searching for “The Grapes of Wrath” was likely to return dozens of pages of results relating to grapes (the fruit) and the Star Trek movie The Wrath of Khan, but nothing about John Steinbeck.
To make matters worse, spammers identified and exploited the lack of search engine sophistication very early on. They would load pages with words and phrases that had nothing to do with the shoddy products or Ponzi schemes they were trying to push on unsuspecting web surfers.
The technical barriers to making search results “smarter” were still years from being overcome. So instead, we got things like Yahoo!, which was not a search engine at all, but rather a curated list of websites. Rather than Yahoo! finding websites, website owners told Yahoo! where to find them.
If that sounds terribly unscientific and not very inclusive, that’s because it was. But it was the best solution anyone could come up with to the chaos of search engine results. Yahoo! became the de facto starting point for most people using the web simply because there was nothing better.
Progress of the Machines
The “robots” we are talking about are computer programs, not scary machines. The programs that index the web are known by many other names as well, including spiders, bots, and crawlers. The names all refer to the same technology.
A couple of Stanford Ph.D. students named Larry and Sergey would eventually figure out how to make search results more relevant. In the meantime, though, there were dozens of other search engines scouring the web. Robots were continually crawling the web, indexing what they found. But robots are not intelligent life forms; they are machines, and they created some problems.
Mostly, they indexed many things that website owners did not want indexed. That included private, sensitive, or proprietary information, admin pages, and other things that don’t belong in a public directory.
Also, as the number of robots increased, so did their negative impact on web server resources. The servers of those days were not as robust and powerful as they are now, and masses of spiders and bots all requesting pages from a site at once could slow down the site’s response time.
Web people needed a way to control the robots, and they found their tool in the humble but powerful robots.txt file.
What Is a robots.txt File?
The robots.txt file is a plain text file that contains instructions that web crawlers and robots are supposed to follow.
I say “supposed to” because nothing requires a crawler or bot to obey the instructions in the robots.txt file. The major players follow most (but not all) of the rules, but some bots out there will completely ignore your robots.txt directives.
The robots.txt file resides in the root directory of your website (e.g., http://ggexample.com/robots.txt).
If you are using subdomains, such as blog.ggexample.com or forum.ggexample.com, each subdomain should also have its own robots.txt file (e.g., http://blog.ggexample.com/robots.txt).
Crawlers do a simple match between the directives in your robots.txt file and the URLs on your site. If a directive in your robots.txt file matches a URL on your site, the crawler will obey the rule you have set.
Do I Need a robots.txt File?
When no robots.txt file is present, search engine crawlers assume they can crawl and index any page they find on your site. If that’s what you want them to do, you don’t need to create a robots.txt file.
But if there are pages or directories that you don’t want indexed, you need to create a robots.txt file. Those types of pages include what we talked about earlier: private, sensitive, proprietary, and administrative pages. They can also include things like “thank you” pages, or pages that contain duplicate content, such as printer-friendly versions or A/B testing pages.
How to Create a Robots.txt File
A robots.txt file is created the same way as any text file. Open your favorite text editor and save a document as robots.txt. You can then upload the file to your website’s root directory using FTP or the cPanel file manager.
Things to note:
- The file name must be robots.txt, all lower case. If any part of the name is capitalized, crawlers won’t read it.
- The entries in your robots.txt file are also case sensitive. For example, /Directory/ is not the same as /directory/.
- Use a text editor to create or edit the file. Word processors can add characters or formatting that prevent crawlers from reading the file.
- Depending on how your site was created, there may already be a robots.txt file in the root directory. Check before creating and uploading new robots.txt so that you do not inadvertently overwrite any existing directives.
Some Examples of What to Include
The robots.txt file accepts several directives and wildcards, so there are many possible combinations. We’ll go over some common and useful entries and show you how to add them.
Before we do that, let’s start with an overview of the available directives: “User-agent,” “Disallow,” “Allow,” “Crawl-delay,” and “Sitemap.” Most of your robots.txt entries will use “User-agent” and “Disallow.”
The User-agent directive targets a specific web crawler that we want to give instructions to. This will usually be Googlebot, Bingbot, Slurp (Yahoo), DuckDuckBot, Baiduspider (the Chinese search engine Baidu), or YandexBot (the Russian search engine Yandex). There is a long list of user agents you can include.
Disallow is probably the most commonly used directive. It’s the main command we use to tell a user agent not to crawl a URL.
Allow is another common component of the robots.txt file, and only Googlebot uses it. It tells Googlebot that it is OK to access a page or subfolder even though its parent page or folder is disallowed.
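For example, here’s a minimal sketch (the directory and file names are hypothetical):
User-agent: Googlebot
Disallow: /private/
Allow: /private/annual-report.html
This tells Googlebot to stay out of /private/ while still allowing it to crawl the single annual-report.html file inside that directory.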
The Crawl-delay directive specifies how many seconds a crawler should wait between page requests. Many crawlers ignore this directive (Googlebot in particular), but the crawl rate for Googlebot can be set in Google’s Search Console.
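As a sketch, this would ask Bingbot to wait ten seconds between requests (ten is just an arbitrary example value):
User-agent: Bingbot
Crawl-delay: 10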
Perhaps one of the more important aspects of the robots.txt file is “Sitemap.” This is used to identify the location of the XML sitemap(s) for your website, which greatly improves how your content is indexed by search engines.
If you want to be found by search engines like Google, Bing, or Yahoo, having a sitemap is almost a requirement.
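The Sitemap entry is simply the full URL of your sitemap file. Assuming your sitemap lives at a common default location, it would look something like this:
Sitemap: https://ggexample.com/sitemap.xml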
So the robots.txt file starts with:
User-agent: *
The asterisk (*) is a wildcard meaning “everything.” Whatever comes next will apply to all crawlers.
User-agent: *
Disallow: /private/
Now we added a “Disallow” for the /private/ directory. So robots.txt tells all crawlers not to crawl /private/ on the domain.
If we wanted to disallow only a particular crawler, we would use the name of that crawler in the User-agent line:
User-agent: Bingbot
Disallow: /private/
That tells Bing not to crawl anything in the /private/ directory.
A lone slash in the Disallow line tells Bing (or any user agent you list) that it is not allowed to crawl anything on the domain:
User-agent: Bingbot
Disallow: /
You can also tell the crawlers not to crawl a specific file:
User-agent: *
Disallow: /private.html
The dollar sign ($) is another wildcard; it indicates the end of a URL. So in the following example, any URL ending with .pdf would be blocked.
User-agent: *
Disallow: /*.pdf$
That would keep all crawlers from crawling any PDF, for example, https://ggexample.com/whitepapers/july.pdf.
Multiple Directives in a robots.txt File
So far we’ve made simple two-line robots.txt files, but you can have as many entries in the file as you like.
For example, if we wanted to allow Google to crawl everything but not let Bing, Baidu, or Yandex, we would use:
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow: /

User-agent: Baiduspider
Disallow: /

User-agent: YandexBot
Disallow: /
Note that the empty Disallow for Googlebot means nothing is disallowed, so Google can crawl everything. Also note that we used a new User-agent line for each crawler; a User-agent line can only name a single crawler.
But one user agent may have several Disallow directives:
User-agent: Baiduspider
Disallow: /planes/
Disallow: /trains/
Disallow: /automobiles/
Each Disallow URL must be on its own line.
Using robots.txt is not a Guarantee
Adding a Disallow directive to robots.txt does not guarantee that the search engine will not index the file or URL. While the “good” search engine crawlers will respect your robots.txt settings, some will not.
Just because a crawler isn’t crawling a URL on your domain doesn’t mean that URL won’t be indexed.
That’s because crawlers follow links. So if you disallow /whitepapers/july.pdf, the crawlers won’t crawl it on your site. But if someone else links to /whitepapers/july.pdf from their website, the file could still be found and indexed by the crawlers.
robots.txt and WordPress
WordPress creates a “virtual” robots.txt file by default. It’s a simple directive that prevents crawlers from trying to crawl your admin panel.
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
The file /wp-admin/admin-ajax.php is allowed because some WordPress themes use AJAX to add content to pages or posts.
If you want to customize the WordPress robots.txt file, create a robots.txt file as outlined above and upload it to the root of your site.
Note that your uploaded robots.txt will prevent the default WordPress virtual robots.txt from being generated. A site can only have one robots.txt file, so if your theme needs the AJAX Allow directive, you should add the above lines to your custom robots.txt.
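So a custom WordPress robots.txt might look something like this, reproducing the default directives and then adding your own (the /private/ directory here is just a hypothetical example):
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /private/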
Some WordPress SEO plugins will generate a robots.txt file for you.
Robots.txt File Generators
I was going to list some robots.txt file generators here, but in reality, most of them are unnecessary. Now that you know how to do it yourself, their usefulness is in doubt. But if you dig playing with code generators (and hey, who doesn’t?), here you go.
Robots.txt is Useful
Although not all search engine crawlers respect the robots.txt file, it is still extremely useful for SEO and website management in general. From keeping crawlers out of specific pages and directories to pointing them at your sitemap, a lot can be done with this simple file.