What is robots.txt and why does it matter?
Robots.txt is a small plain-text file placed at the root of your domain (always at /robots.txt) that tells search engine crawlers and other bots which parts of your site they should and should not access. It uses a simple, decades-old protocol called the Robots Exclusion Standard, and it is the first file most well-behaved crawlers fetch when they visit your site.
Unlike noindex meta tags (which control whether a page appears in search results) or canonical tags (which manage duplicate content), robots.txt controls crawling behavior at the URL level. Disallowing a path tells compliant crawlers not to fetch it at all; it is a request rather than an enforcement mechanism, so badly behaved scrapers can ignore it. It is still useful for protecting your crawl budget on large sites, turning away well-behaved but unwanted bots, and keeping admin or staging URLs out of crawl pipelines.
Anatomy of a robots.txt file
A robots.txt file consists of one or more rule blocks. Each block targets specific crawlers and lists allow/disallow rules:
User-agent: *
Disallow: /admin/
Disallow: /cart/
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
Key directives:
- User-agent — specifies which crawler the rules apply to. * means all crawlers; Googlebot targets only Google's main bot.
- Disallow — paths that should not be crawled. /admin/ blocks the admin folder, / blocks everything, and an empty value disallows nothing. Major crawlers such as Googlebot and Bingbot also support * wildcards in paths, which the patterns further below rely on.
- Allow — explicit permission, used to override broader Disallow rules.
- Sitemap — full URL to your XML sitemap. You can list multiple Sitemap entries.
- Crawl-delay — how many seconds a crawler should wait between requests. Googlebot ignores this directive, and most other modern bots treat it as a hint at best.
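To see how a compliant crawler interprets the example block above, here is a minimal sketch using Python's standard-library urllib.robotparser. The domain and paths are placeholders, and note that Python's parser applies rules in file order while Googlebot uses longest-match, so treat the output as an approximation:

from urllib import robotparser

# The example block from above, fed to the parser line by line.
rules = [
    "User-agent: *",
    "Disallow: /admin/",
    "Disallow: /cart/",
    "Allow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://yoursite.com/blog/hello"))   # True (falls through to Allow: /)
print(rp.can_fetch("*", "https://yoursite.com/admin/users"))  # False (Disallow: /admin/)
print(rp.can_fetch("*", "https://yoursite.com/cart/"))        # False (Disallow: /cart/)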
Should you block AI bots like GPTBot and ClaudeBot?
Since 2023, a wave of AI-focused crawlers has appeared. The major ones include:
- GPTBot (OpenAI / ChatGPT)
- ClaudeBot (Anthropic / Claude)
- CCBot (Common Crawl, used to train many models)
- Google-Extended (controls whether Google trains Gemini and other AI products on your content)
- anthropic-ai (an older Anthropic bot token)
- PerplexityBot (Perplexity)
- Bytespider (ByteDance / TikTok)
- FacebookBot and Meta-ExternalAgent (Meta AI training)
Whether to block these is a strategic question, not a technical one. Reasons to block:
- You believe AI training without consent is unfair use of your content
- You publish original journalism or research and want to control how it is used
- You sell content (courses, books, premium articles) and AI summarization undermines your business
Reasons to allow:
- You want your content to appear in AI-powered search results (ChatGPT search, Perplexity, etc.)
- You believe the future of search is AI-mediated and visibility there matters
- Your content is informational and you want maximum reach
There is no right answer — it is a publisher choice. Our generator gives you a one-click toggle to block all major AI bots if you choose to.
Common robots.txt patterns
Here are robots.txt patterns we use on different types of sites:
Standard blog or content site
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /?s=
Disallow: /search/
Sitemap: https://yoursite.com/sitemap.xml
E-commerce store
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /*?orderby=
Disallow: /*?filter_
Disallow: /*?add-to-cart=
Sitemap: https://store.com/sitemap.xml
Maximum lockdown (block all AI crawlers)
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
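# Optionally also block the remaining AI crawlers listed earlier in this article
User-agent: anthropic-ai
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /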
User-agent: *
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
Where to put robots.txt
The file must be placed at the absolute root of your domain. For example:
- https://yoursite.com/robots.txt — correct
- https://yoursite.com/seo/robots.txt — wrong, ignored by crawlers
- https://www.yoursite.com/robots.txt — correct (different host — needs its own file)
If you have https://yoursite.com and https://www.yoursite.com serving different content (or even just redirecting to each other), each one needs its own robots.txt accessible at its root.
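One quick way to confirm both hosts serve the file is a small Python check; the hostnames below are the placeholder examples from above, so substitute your own:

import urllib.request

# Placeholder hosts from the examples above; replace with your own.
hosts = ["https://yoursite.com", "https://www.yoursite.com"]

for host in hosts:
    url = host + "/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            ctype = resp.headers.get("Content-Type", "")
            print(f"{url}: HTTP {resp.status} ({ctype})")
    except Exception as exc:  # covers 404s, DNS errors, timeouts
        print(f"{url}: failed ({exc})")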
Testing your robots.txt
After uploading, verify with these tools:
- Browser test — Visit https://yourdomain.com/robots.txt. The file should load as plain text.
- Google Search Console — Use the URL Inspection tool to verify Googlebot can fetch your robots.txt and understands the rules.
- Bing Webmaster Tools — Bing has its own robots.txt tester under SEO Reports.
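If you prefer a scriptable check, Python's standard-library urllib.robotparser can fetch the live file and evaluate sample URLs roughly the way a compliant crawler would; the domain, paths, and user agents below are only examples:

from urllib import robotparser

# Example domain and paths; swap in the URLs you actually care about.
rp = robotparser.RobotFileParser()
rp.set_url("https://yourdomain.com/robots.txt")
rp.read()  # fetch and parse the live file

checks = [
    ("Googlebot", "https://yourdomain.com/blog/some-post"),
    ("Googlebot", "https://yourdomain.com/wp-admin/options.php"),
    ("GPTBot", "https://yourdomain.com/blog/some-post"),
]
for agent, url in checks:
    verdict = "allowed" if rp.can_fetch(agent, url) else "blocked"
    print(f"{agent} -> {url}: {verdict}")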
Common robots.txt mistakes
- Blocking CSS or JS — Modern Google crawlers render pages like browsers. Blocking /css/ or /js/ can break rendering and hurt rankings. Allow these.
- Using Disallow to remove already-indexed pages — Disallow only prevents future crawling. To actually remove indexed pages, use noindex meta tags or the GSC removal tool.
- Trailing slash mismatches — Disallow: /admin blocks /admin AND /admin/anything, and because matching is by prefix it also catches paths like /administrator. Disallow: /admin/ only blocks paths starting with /admin/. Be precise (the sketch after this list demonstrates the difference).
- Accidentally blocking everything — Disallow: / on a production site is catastrophic. Always double-check before uploading.
- Forgetting the Sitemap line — adding Sitemap: https://yoursite.com/sitemap.xml tells crawlers where to find your sitemap automatically.
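Here is a minimal sketch of the trailing-slash behavior using Python's standard-library urllib.robotparser. The paths are hypothetical, and this parser matches non-wildcard rules by simple prefix, as most crawlers do:

from urllib import robotparser

def allowed(rule, url):
    # Return True if a site with just this one rule would allow the URL.
    rp = robotparser.RobotFileParser()
    rp.parse(["User-agent: *", rule])
    return rp.can_fetch("*", url)

# Without the trailing slash, /admin also matches /administrator.
print(allowed("Disallow: /admin",  "https://yoursite.com/administrator"))  # False (blocked)
print(allowed("Disallow: /admin/", "https://yoursite.com/administrator"))  # True (allowed)
print(allowed("Disallow: /admin/", "https://yoursite.com/admin/users"))    # False (blocked)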
Generate yours above
Use the form above to build a robots.txt file with your specific rules. The generator includes one-click options for blocking AI bots, common admin paths, search/cart URL patterns, and the sitemap reference. Copy the generated text into a file named exactly robots.txt and upload it to your site root.