What is Robots.txt?
Robots.txt is a plain text file located at the root of a website that tells web crawlers which URL paths they may and may not crawl. It follows the Robots Exclusion Protocol, a convention dating to 1994 (formalized as RFC 9309 in 2022) that defines the syntax for crawl directives. While major search engines generally respect robots.txt, it is advisory, not a security mechanism: crawlers can ignore it, and the file itself is publicly readable. For SaaS companies, robots.txt serves two primary purposes: keeping compliant crawlers out of areas that should not be crawled, such as dashboards and account pages, and optimizing crawl budget by preventing search engines from wasting crawl allocation on low-value URLs.
Key Robots.txt Directives
The three core directives are User-agent (specifies which crawler the rules apply to; use an asterisk, *, to match all crawlers), Disallow (tells the crawler not to crawl the specified path), and Allow (overrides a Disallow rule for a more specific path). A fourth directive, Sitemap, points crawlers to your XML sitemap. For SaaS sites, common Disallow rules include /wp-admin/, /wp-login.php, /dashboard/, /account/, and URLs carrying query parameters such as sort and filter.
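The directives above can be checked with a short sketch that uses Python's standard-library urllib.robotparser. The robots.txt content, the www.example.com domain, the /account/help/ path, and the sitemap URL are all illustrative assumptions, not rules taken from a real site. One caveat: Python's parser applies the first matching rule in file order, whereas Google and RFC 9309 apply the most specific (longest) matching rule, so the Allow line is placed before the broader Disallow line to make both interpretations agree.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example SaaS site (paths are illustrative).
ROBOTS_TXT = """\
User-agent: *
Allow: /account/help/
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /dashboard/
Disallow: /account/

Sitemap: https://www.example.com/sitemap.xml
"""

BASE = "https://www.example.com"  # assumed domain, for illustration only


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Report which sample paths a crawler identifying as Googlebot may fetch.
for path in ("/pricing", "/dashboard/settings", "/account/billing", "/account/help/reset"):
    verdict = "allowed" if parser.can_fetch("Googlebot", BASE + path) else "blocked"
    print(f"{path}: {verdict}")

# site_maps() (Python 3.8+) returns the URLs listed in Sitemap directives.
print(parser.site_maps())
```

Note that urllib.robotparser does simple prefix matching and does not interpret the * and $ path wildcards that some search engines support, so parameter-based rules are left out of this sketch.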
Robots.txt and AI Crawlers
As of 2024-2025, major AI companies and data providers publish dedicated user-agent tokens for crawling related to AI training: GPTBot from OpenAI, Google-Extended for Google's AI training, CCBot from Common Crawl, and Amazonbot from Amazon. SaaS companies must now decide whether to allow or block AI training crawlers, which is a separate consideration from SEO crawlers. Blocking GPTBot tells OpenAI not to crawl your content for training, but it does not prevent ChatGPT from citing your content in responses that use live search.
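As a rough illustration of treating AI training crawlers separately from SEO crawlers, the sketch below groups the four tokens named above under a single Disallow rule and leaves other crawlers on a lighter default policy. The policy text, the www.example.com URLs, and the is_allowed helper are hypothetical; the check uses Python's standard-library urllib.robotparser and only shows how a compliant crawler would read the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block the AI training tokens everywhere,
# keep every other crawler out of /dashboard/ only.
AI_POLICY = """\
User-agent: GPTBot
User-agent: Google-Extended
User-agent: CCBot
User-agent: Amazonbot
Disallow: /

User-agent: *
Disallow: /dashboard/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given user agent may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


post = "https://www.example.com/blog/post"  # assumed URL, for illustration
for agent in ("GPTBot", "CCBot", "Googlebot"):
    print(agent, is_allowed(AI_POLICY, agent, post))
```

Listing several User-agent lines above one rule set applies the same rules to each of them, which keeps the file short when the decision is the same for every AI crawler.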
Frequently Asked Questions
Does robots.txt prevent pages from being indexed?
No. Blocking a URL in robots.txt prevents it from being crawled, but Google can still index the URL, without its content, if other pages link to it. To reliably keep a page out of search results, use a noindex meta tag or an X-Robots-Tag: noindex HTTP response header. Be aware that Google cannot read the noindex directive if crawling is blocked by robots.txt, so the page must remain crawlable for noindex to take effect.
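The interaction between the two mechanisms can be sketched as a small audit helper: given a page's HTML and the site's robots.txt, it reports whether a noindex tag would actually be readable by a crawler. This is a minimal sketch using only Python's standard library; the noindex_status function, its status labels, and the sample URLs are hypothetical, and it inspects only the robots meta tag, not the X-Robots-Tag header.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def noindex_status(robots_txt, url, html, agent="Googlebot"):
    """Classify a page by whether its noindex tag can be seen by the crawler."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    has_noindex = "noindex" in finder.directives

    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    crawlable = rules.can_fetch(agent, url)

    if has_noindex and not crawlable:
        return "conflict"          # noindex exists but cannot be read
    if has_noindex:
        return "noindex-readable"  # crawler can fetch the page and see noindex
    if not crawlable:
        return "blocked"           # not crawled, but the URL may still be indexed
    return "indexable"


ROBOTS = "User-agent: *\nDisallow: /account/\n"
PAGE = '<html><head><meta name="robots" content="noindex, follow"></head></html>'

print(noindex_status(ROBOTS, "https://www.example.com/account/login", PAGE))
print(noindex_status(ROBOTS, "https://www.example.com/thanks", PAGE))
```

A "conflict" result is the case described above: the fix is usually to remove the Disallow rule for that path so the crawler can fetch the page and see the noindex directive.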
Can I use robots.txt to block specific AI crawlers?
Yes. Each AI company provides a specific user-agent string you can target with Disallow rules. However, this only affects training data collection: it does not prevent your content from appearing in ChatGPT responses that use Bing search for live web access.