Robots.txt is an important file for the proper operation of a site. It tells search engine crawlers which pages of the web resource they may scan and which they should ignore entirely. The robots.txt file is used when you need to hide some parts of the site, or the entire website, from search engines: for example, a section with users' personal information or a mirror of the site.
What should you do if a site audit does not see this file? This article covers that and other issues related to the robots.txt file.
How does robots.txt work?
A robots.txt file is a plain text document in UTF-8 encoding. It works for the HTTP, HTTPS, and FTP protocols. The encoding is very important: if the robots.txt file is saved in a different encoding, the search engine may be unable to read its rules correctly and determine which pages should be crawled and which should not. Other requirements for the robots.txt file are as follows:
- all settings in the file are relevant only for the site where the robots.txt is located;
- the file location is the root directory; the URL should look like this: https://site.com.ua/robots.txt;
- the file size should not exceed 500 KiB.
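The requirements above can be sketched as a small Python check. This is a minimal illustration, not an official validator: the function name `check_robots_requirements` and the problem messages are hypothetical, and the size limit is taken here as 500 * 1024 bytes.

```python
from urllib.parse import urlsplit

# Assumption: the size limit from the list above, read as 500 * 1024 bytes.
MAX_ROBOTS_BYTES = 500 * 1024


def check_robots_requirements(url: str, body: bytes) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    # The file must sit in the root directory under the exact name robots.txt.
    if urlsplit(url).path != "/robots.txt":
        problems.append("file is not at /robots.txt in the root directory")

    # The file must be valid UTF-8.
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")

    # The file must stay within the size limit.
    if len(body) > MAX_ROBOTS_BYTES:
        problems.append("file is larger than the size limit")

    return problems
```

For example, calling the function with the URL https://site.com.ua/robots.txt and a short UTF-8 body returns an empty list, while a file served from a subfolder is reported as misplaced.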
When scanning the robots.txt file, search crawlers learn whether they are permitted to crawl all web pages, only some of them, or none at all.
You can read more about this here: https://developers.google.com/search/docs/advanced/robots/intro.
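To show how a crawler reads these permissions, here is a short sketch that uses Python's standard `urllib.robotparser` module. The robots.txt content and the `is_allowed` helper are made up for illustration; real search engines use their own parsers, which may interpret some rules differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())


def is_allowed(url: str, user_agent: str = "*") -> bool:
    """Return True if the sample rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)
```

With these sample rules, `is_allowed("https://site.com.ua/blog/")` is permitted, while any URL under /private/ is refused.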
Search engine response codes
When a web crawler requests the robots.txt file, the server returns one of the following response codes:
- 5XX – a temporary server error; the crawler stops scanning the site;
- 4XX – the crawler behaves as if there were no restrictions and may scan every page of the site;
- 3XX – a redirect, which the crawler follows until it gets another answer. After 5 attempts, a 404 error is recorded;
- 2XX – successful scanning; the rules in the file are read and applied.
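The list above can be expressed as a small lookup function. This is only a sketch of the behaviour described in this article: the function name, the returned labels, and the `redirect_hops` parameter are hypothetical, and individual search engines may treat some codes differently.

```python
# Assumption: the redirect limit of 5 attempts from the list above.
MAX_REDIRECT_HOPS = 5


def crawler_action(status_code: int, redirect_hops: int = 0) -> str:
    """Map the response code for robots.txt to the crawler behaviour listed above.

    redirect_hops is the number of redirects the crawler has already followed.
    """
    if 200 <= status_code < 300:
        return "read rules"
    if 300 <= status_code < 400:
        if redirect_hops >= MAX_REDIRECT_HOPS:
            # Too many redirects: handled like a 404, which is a 4XX response.
            return crawler_action(404)
        return "follow redirect"
    if 400 <= status_code < 500:
        return "crawl everything"
    if 500 <= status_code < 600:
        return "stop crawling"
    raise ValueError(f"unexpected status code: {status_code}")
```

For instance, `crawler_action(301)` tells the crawler to follow the redirect, while `crawler_action(301, redirect_hops=5)` falls back to the 404 behaviour.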
If the search engine does not find the file when it requests https://site.com.ua/robots.txt, the response will be “robots.txt not Found”.
Reasons for the “robots.txt not Found” response
The “robots.txt not Found” search crawler response may have the following causes:
- the text file is located at a different URL;
- the robots.txt file is not found on the website.
Please note! A robots.txt file belongs in the root directory of the main domain and in the root directory of each subdomain. If you have included subdomains in the site audit, the file must be available on each of them; otherwise, the crawler will report an error stating that robots.txt is not found.
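As an illustration of the note above, the following sketch derives the robots.txt address that should exist for every host included in an audit. The helper names and the blog.site.com.ua subdomain are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Build the root-level robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def robots_urls(page_urls: list[str]) -> list[str]:
    """Return one robots.txt URL per distinct host, sorted alphabetically."""
    return sorted({robots_url(url) for url in page_urls})
```

Passing pages from both site.com.ua and blog.site.com.ua yields two separate robots.txt URLs, one for each host, and both should respond successfully.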
Why is this important?
If the “robots.txt not Found” error is not fixed, search crawlers will not receive the instructions intended for them and may work with the site incorrectly. This, in turn, may lead to a drop in site ranking and to incorrect site traffic data. Also, if search engines do not see robots.txt, all pages of your site will be crawled, which is undesirable. As a result, you may run into the following problems:
- server overload;
- purposeless crawling of pages with the same content by search engines;
- longer time to process visitor requests.
The correct operation of the robots.txt file is crucial for the smooth operation of your web resource. Therefore, let’s examine how to fix errors in this text document.
How should robots.txt be corrected?
For search crawlers to respond properly to your robots.txt file, it must be properly debugged. Check the text document for the following errors:
- Directive values are mixed up. The crawler name belongs in User-agent, and the path belongs in the Disallow or Allow line that follows it.
- Several page URLs are listed in the same directive. Each Disallow or Allow line should contain one path.
- The robots.txt file name contains typos or uppercase letters.
- User-agent is not specified.
- A group of rules has no Disallow or Allow directive.
- The URL pattern is imprecise: use the $ and / symbols to mark exactly where the matched path ends.
You can check the robots.txt file using search engine validation tools, for example the robots.txt report in Google Search Console, which replaced the older Google robots.txt tester tool.
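Some of the errors listed above can also be caught with a simple script before you turn to a validation tool. The sketch below is a rough illustration, not a full parser: the function name `lint_robots` and its messages are hypothetical, and it covers only swapped values, several paths in one directive, a missing User-agent, and a group without Disallow or Allow.

```python
def lint_robots(text: str) -> list[str]:
    """Report a few common robots.txt mistakes; an empty list means none found."""
    problems = []
    seen_agent = False
    group_start = None  # line number of a User-agent group still waiting for a rule

    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue  # blank lines and anything without a field are skipped
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()

        if field == "user-agent":
            seen_agent = True
            if group_start is None:
                group_start = number
            if value.startswith("/"):
                problems.append(f"line {number}: User-agent value looks like a path")
        elif field in ("allow", "disallow"):
            if not seen_agent:
                problems.append(f"line {number}: rule appears before any User-agent")
            group_start = None
            if len(value.split()) > 1:
                problems.append(f"line {number}: several paths in one directive")
            elif value and not value.startswith(("/", "*")):
                problems.append(f"line {number}: path should start with '/' or '*'")

    if group_start is not None:
        problems.append(f"line {group_start}: User-agent group has no Allow or Disallow")
    return problems
```

A file that starts with "Disallow: /a/ /b/" and only then names a User-agent would be flagged three times: for the rule placed before User-agent, for two paths in one directive, and for the group left without rules.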
Detect whether your site has the “robots.txt not Found” issue, and then go on to analyse its other issues!
Do not stop at this one issue: run a full audit to find and fix your technical SEO problems.