Topic: SEO - Restrict crawling where it's not needed with the robots.txt file.




What is a robots.txt file?

Robots.txt is a file that we can add to our website to tell search engines which webpages or sections of the website are private, and to prevent search engine spiders from crawling those private parts of our site.

Webmasters create a robots.txt file and place it in the root directory of the website. During the crawling and indexing process, the robots.txt file tells search engine crawlers which parts of the site they may crawl and index and which parts are private.





Why do we need a robots.txt file?


Sometimes we have private pages on our website that we don't want to be indexed. We might have a user account page, a login page or an admin page, and we don't want these pages showing up in search results with random people landing on them while searching for something else. These pages need to exist on our website, but we need to protect them. In this case we use a robots.txt file to stop search engine crawlers from crawling and indexing these pages.
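For example, a minimal robots.txt along the following lines would keep crawlers away from such pages. The paths /admin/, /login.php and /user-account/ are only placeholders - replace them with the real paths of your own private pages.

User-agent: *            #applies to all crawlers.
Disallow: /admin/        #hypothetical admin area.
Disallow: /login.php     #hypothetical login page.
Disallow: /user-account/ #hypothetical account pages.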





What happens if we don't have a robots.txt file?


If a robots.txt file is missing from our website, search engine crawlers assume that all pages of the site are publicly available and can be crawled and added to their index.
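In other words, having no robots.txt file behaves much like a file that allows everything. A sketch of such an "allow everything" file:

User-agent: *     #applies to all crawlers.
Disallow:         #empty value - nothing is blocked.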




What happens if the robots.txt file is not well formatted?


If the robots.txt file is not well formatted and search engine crawlers cannot understand its content, they will ignore the robots.txt file and access all the content of the website.





What happens if I accidentally block search engines from accessing my website?


That's a big problem for beginners who don't know how to write a robots.txt file. If we block search engines from crawling and indexing our webpages with the robots.txt file, they will not crawl and index pages from our website, and gradually they will remove any pages that are already in their index.
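The mistake usually looks like the following two lines, which block every crawler from the entire site - double-check that you never publish this by accident:

User-agent: *     #applies to all crawlers.
Disallow: /       #blocks the whole website.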





How to create a robots.txt file?


Creating a robots.txt file is easy. We just need a simple text editor like Notepad, basic knowledge of how to write a robots file, and access to our website's files via FTP or a control panel. We use the following syntax to write the robots.txt file.


User-agent: [user-agent-name]
Disallow: [URL string not to be crawled]

The robots file has a very simple structure. There are a few predefined keywords we can use to create a robots file. The most common and useful keywords are:








1. User-agent:


User-agents are search engine crawlers. There are many search engines on the internet, and each search engine has its own crawler. We can name a specific crawler, or we can use the asterisk (*) sign to address all crawlers. Check the example below.


User-agent: * #includes all crawlers. 
User-agent: Googlebot #only for Google crawler.
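We can also combine several groups in one file. A crawler generally follows the most specific group that matches its name and ignores the rest, so in the sketch below (where /no-google/ is just a placeholder path) Googlebot obeys only its own group:

User-agent: Googlebot     #rules for Google's crawler only.
Disallow: /no-google/     #hypothetical section hidden from Google.

User-agent: *             #rules for every other crawler.
Disallow:                 #nothing is blocked for them.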





2. Disallow:


This directive tells a user-agent not to crawl the given part of the website. The value of Disallow can be a specific file, URL or directory. Look at the examples below.


Block                        Syntax                             Explanation
Entire website               Disallow: /                        Don't crawl the entire site.
A directory                  Disallow: /Directory/              Don't crawl the specified directory and its content.
A webpage                    Disallow: /private-page.php        Don't crawl the specified page.
A specific image (Google)    User-agent: Googlebot-Image        Only for the Google image crawler - don't crawl the specified image.
                             Disallow: /images/userimage.jpg
All images                   User-agent: Googlebot-Image        Don't crawl any image on the entire website.
                             Disallow: /
A specific file type         Disallow: /*.gif$                  Don't crawl any file on the site with the .gif extension.
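These rules can also be combined in a single group. For instance, a sketch that blocks a hypothetical /tmp/ directory, a hypothetical outdated page and every .pdf file for all crawlers:

User-agent: *
Disallow: /tmp/          #hypothetical temporary directory.
Disallow: /old-page.php  #hypothetical outdated page.
Disallow: /*.pdf$        #any file ending in .pdf.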




3. Allow:


This directive tells a user-agent that it may crawl the given part of the website. In practice we rarely need it, because anything we don't Disallow will be crawled anyway.

This command is useful for one reason: if we have disallowed a directory, it lets us give crawlers access to a specific part of that directory. Look at this example -


User-agent: * #includes all crawlers.
Disallow: /photos #disallow access to the "photos" directory.
Allow: /photos/seo-tutorials/ #but allow access to the seo-tutorials sub-folder inside the photos directory.




4. Crawl-delay:


If we have thousands of pages and we don't want to overload our server with continuous requests, we can use the Crawl-delay directive to make search engine crawlers wait a specific amount of time before crawling the next page from our website. The crawl-delay value is given in seconds. Consider the following example -


User-agent: * #includes all crawlers.
Crawl-delay: 120 #wait 120 seconds before crawling the next page.

Google does not support the Crawl-delay directive, but we can use Google Search Console to control the crawl rate for Google's crawler.



To limit the crawl rate in Google Search Console, follow these steps:

    On the Search Console home page, click the site you want.
    Click the gear icon, then click Settings.
    In the Crawl rate section, select the option you want and limit the crawl rate as desired.

The new crawl rate will be valid for 90 days.
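Since Google ignores the Crawl-delay directive, one option is to scope the delay to crawlers that are known to honour it; Bing's crawler is used below purely as an example:

User-agent: Bingbot     #only for Bing's crawler.
Crawl-delay: 10         #wait about 10 seconds between requests.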




5. Sitemap:


This directive is supported by the major search engines such as Google, Yahoo, Ask and Bing. We use it to specify the location of our XML sitemap. Consider the following example -


User-agent: *
Sitemap: http://infobrother.com/sitemap.xml

Even if we don't use this directive or don't specify the location of the XML sitemap in the robots.txt file, search engines are usually still able to find it.
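If a site has more than one sitemap, we can simply list several Sitemap lines; the directive is independent of the User-agent groups. The URLs below are placeholders:

Sitemap: https://yourdomain.com/sitemap-posts.xml
Sitemap: https://yourdomain.com/sitemap-pages.xml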



NOTE: Robots.txt is case-sensitive. This means that "file" and "File" are two different names. If the original file name is "file.php" and you write "Disallow: /File.php", this rule will not work.
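For example, assuming the real page on the server is /file.php:

Disallow: /File.php #does NOT block /file.php - the case doesn't match.
Disallow: /file.php #blocks /file.php - the case matches.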



Comments in the robots.txt file:


Comments are preceded by a # and can be placed either at the start of a line or after a directive on the same line. Everything after the # is ignored. These comments are meant for humans only.


User-agent: * #Comment in-line - includes all crawlers.

#Comment on a new line - wait 120 seconds before crawling the next page.
Crawl-delay: 120 




Do's when writing a robots.txt file:


    Write a robots.txt file to prevent search engine crawlers from crawling pages that are not useful.
    Place the robots.txt file in the root directory of the website.
    Use the Sitemap directive to point crawlers to the XML sitemap.
    The robots.txt file is case sensitive, so write the filename or path exactly as it is.
    Add only one robots.txt file to your website.





Don'ts when writing a robots.txt file:


    Don't allow useless pages to be crawled.
    Don't let URLs created by proxy services be crawled.
    Avoid the Crawl-delay directive wherever possible.
    Don't list private content in the robots.txt file, because this file is publicly available.
    Don't use commands other than the predefined directives.





How to create a robots.txt file perfectly?


Creating a robots.txt file is very simple now, because we have already learned enough about it. Let's walk through an example to see how to create a robots.txt file properly.

Before getting started, let's check whether we have already created a robots.txt file. Open your browser and navigate to https://www.yourdomain.com/robots.txt (enter your own domain name instead of "yourdomain.com").



If the browser opens a file, it means we already have one, so we can open that file and edit it instead of creating a new one.


NOTE: The robots.txt file is always located in the root directory - (www or public_html) - depending on our server.





How to edit the robots.txt file?


If we already have a robots.txt file on our website, we can easily edit it. We can use an FTP client to connect to our website's root directory, or we can use our web hosting control panel.

Download the file to your computer, open it with a text editor, make the necessary changes, and upload the file back to your server, replacing the old file.




How to create a new robots.txt file?


If we don't have a robots.txt file yet, we need to create a new one. Open a text editor, add your directives, save the file with the name "robots.txt" and upload it to the root directory of your website.


Important: The robots.txt file name is case-sensitive, so make sure the file is named "robots.txt", all in lowercase.





Robots.txt file examples:


Typically, our robots.txt file should have the following contents.


User-agent: *
Allow: / 
Sitemap: https://yourdomain.com/sitemap.xml 

The above code allows all bots to access our website without blocking any content. It also specifies the sitemap location to make it easier for search engines to locate it.
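A slightly more realistic variant keeps the same structure but also blocks a couple of hypothetical private areas - adjust the paths to match your own site:

User-agent: *
Disallow: /admin/    #hypothetical admin area.
Disallow: /cgi-bin/  #hypothetical scripts directory.
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml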





How to validate our robots.txt file?


We can use the robots.txt Tester tool to check whether our robots.txt file blocks web crawlers from specific URLs on our site.



    Open the Tester tool for your site.
    Check whether there are any syntax warnings or logic errors in your robots.txt file.
    If there are any errors in your robots.txt file, correct them.
    Type the URL of the page you want to test.
    Select the user-agent you want to simulate from the dropdown list.
    Click the TEST button to see whether access is "Allowed" or "Blocked".
    The tool also shows the live robots.txt file from your website.





NOTE: Changes we make in the tool's editor are not automatically saved to our web server. We need to copy and paste the content from the editor into the robots.txt file stored on our server.







I tried my best to provide you with complete information on this topic in an easy and conceptual way, but if you still have any problem understanding it, or if you have any questions, feel free to ask. I'll do my best to provide what you need.

Sardar Omar.
InfoBrother



