Introduction & Use of Robots.txt

Naushil Jain
Dec 8, 2020

A robots.txt is a text file that instructs crawlers (typically search engine robots) how to crawl pages on a website. It is used mainly to avoid overloading your site with requests. It is not a mechanism for keeping a web page out of search results: for that, you should use a noindex directive or password-protect the page. The robots.txt file is part of the Robots Exclusion Protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content, and serve that content up to users.
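
For example, a noindex directive lives on the page itself rather than in robots.txt. A minimal illustration, placed inside the page's HTML head:

<!-- Tells compliant search engine crawlers not to index this page -->
<meta name="robots" content="noindex">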


Location

A robots.txt file lives at the root of your website. So, for the site www.example.com, the robots.txt file lives at www.example.com/robots.txt. The filename is case sensitive: it must be named “robots.txt” (not Robots.txt, robots.TXT, or otherwise). The file contains one or more rules, and each rule blocks (or allows) access for a given crawler to a specified file path on that website.
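
Because the location is fixed, you can inspect any site's live robots.txt simply by requesting that path. A minimal Python sketch, using the placeholder domain from above:

import urllib.request

# robots.txt always lives at the site root, so the path is predictable
url = "https://www.example.com/robots.txt"

# Fetch and print the file's contents (raises HTTPError if the site has none)
with urllib.request.urlopen(url) as response:
    print(response.read().decode("utf-8"))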

Basic Formats of Robots.txt

User-agent: [user-agent name]
Disallow: [URL string not to be crawled]

Blocking all web crawlers from all content
User-agent: *
Disallow: /

Allowing all web crawlers access to all content
User-agent: *
Disallow:

Blocking a specific web crawler from a specific folder
User-agent: Googlebot
Disallow: /example-subfolder/

Blocking a specific web crawler from a specific web page
User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html
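
Groups of rules for different crawlers can live in a single file, separated by blank lines. An illustrative robots.txt combining the patterns above (the subfolder and page names are placeholders):

User-agent: Googlebot
Disallow: /example-subfolder/

User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html

User-agent: *
Disallow: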

Test your robots.txt with the robots.txt Tester

The robots.txt Tester tool shows you whether your robots.txt file blocks Google web crawlers from specific URLs on your site. For example, you can use this tool to test whether the Googlebot-Image crawler can crawl the URL of an image you wish to block from Google Image Search.
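
If you prefer to check programmatically, Python's standard library ships a parser for the same rules. A short sketch, assuming the placeholder site and rules shown earlier:

import urllib.robotparser

# Load and parse the site's robots.txt
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# can_fetch() returns True if the named crawler may crawl the given URL
print(rp.can_fetch("Googlebot", "https://www.example.com/example-subfolder/page.html"))
print(rp.can_fetch("*", "https://www.example.com/"))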
