Free Robots.txt Generator & Validator

Generate a robots.txt file with search engine and AI crawler controls, or validate any site's robots.txt for errors and best practices.



robots.txt is just the start. Optimize what crawlers find.

Content Raptor analyzes your pages against top-ranking competitors and shows you exactly what to improve for higher rankings.

Try Content Raptor Free

No credit card required

What is robots.txt?

A robots.txt file is a plain text file placed at the root of your website (e.g., example.com/robots.txt) that tells web crawlers which parts of your site they can and cannot access. It is the first file crawlers check before scanning your pages, making it a foundational piece of technical SEO.

It is important to understand that robots.txt controls crawling, not indexing. A page blocked by robots.txt can still appear in search results if other pages link to it. To prevent a page from appearing in search results entirely, use a noindex meta tag instead.

Every website should have a robots.txt file, even if it allows all crawlers. It signals to search engines that you are intentionally managing crawler access, and it provides a place to declare your sitemap location.

How robots.txt Works

The robots.txt file uses a simple directive format. Each section starts with a User-agent line specifying which crawler the rules apply to, followed by Allow and Disallow directives.

Directive Syntax

  • User-agent: Specifies which crawler the following rules apply to. Use * for all crawlers or a specific bot name like Googlebot.
  • Disallow: Blocks access to a path. Disallow: /admin/ blocks everything under /admin/. An empty Disallow: means nothing is blocked.
  • Allow: Explicitly permits access to a path, useful for overriding a broader Disallow rule. For example, Allow: /admin/public/ within a block that disallows /admin/.
  • Sitemap: Points crawlers to your XML sitemap. This directive is not tied to any User-agent block and can appear anywhere in the file.
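Putting the four directives together, a minimal file might look like this (the paths and sitemap URL are placeholders):

```
User-agent: *
Disallow: /admin/
Allow: /admin/public/

User-agent: Googlebot
Disallow: /search/

Sitemap: https://example.com/sitemap.xml
```

Blank lines separate rule groups: the first group applies to all crawlers, the second only to Googlebot, and the Sitemap line stands alone.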

Pattern Matching

  • Paths use prefix matching: Disallow: /private blocks /private, /private/, and /private-page.
  • The wildcard * matches any sequence of characters: Disallow: /*.pdf$ blocks all PDF files.
  • The $ anchors the match to the end of the URL.
  • When multiple rules match, the most specific (longest) rule wins; if an Allow and a Disallow rule of equal length both match, Google applies the less restrictive Allow.
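The matching rules above can be sketched in Python. This is a simplified model of Google's documented resolution order, not its actual parser; rule_matches and is_allowed are illustrative names:

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """True if a robots.txt path pattern (supporting * and $) prefix-matches the path."""
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        # A trailing $ anchors the match to the end of the URL
        regex = regex[:-2] + "$"
    return re.match(regex, path) is not None

def is_allowed(rules, path):
    """Resolve (directive, pattern) pairs: longest matching pattern wins,
    and on a tie the less restrictive 'allow' beats 'disallow'."""
    best_directive, best_len = "allow", -1  # no matching rule means allowed
    for directive, pattern in rules:
        if not rule_matches(pattern, path):
            continue
        if len(pattern) > best_len or (len(pattern) == best_len and directive == "allow"):
            best_directive, best_len = directive, len(pattern)
    return best_directive == "allow"
```

With rules [("disallow", "/admin/"), ("allow", "/admin/public/")], the path /admin/public/index.html is allowed because the Allow pattern is longer, while /admin/settings is blocked.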

Managing AI Crawlers with robots.txt

With the rise of AI-powered tools, a new category of web crawlers has emerged. These crawlers fall into two groups, and the distinction matters for your robots.txt strategy:

AI Training Crawlers

These crawlers collect content to train AI language models. Blocking them prevents your content from being used in model training but does not affect your visibility in AI-powered search results.

  • GPTBot (OpenAI)
  • anthropic-ai, ClaudeBot (Anthropic)
  • CCBot (Common Crawl)
  • Bytespider (ByteDance)
  • Meta-ExternalAgent (Meta)
  • Amazonbot (Amazon)

AI Search/Citation Crawlers

These crawlers fetch content to answer user queries in real time. Blocking them means your content will not appear in AI-generated answers, which could reduce your traffic from these platforms.

  • OAI-SearchBot (OpenAI Search)
  • ChatGPT-User (ChatGPT browsing)
  • PerplexityBot (Perplexity)
  • Google-Extended (Gemini; a control token read by Googlebot rather than a separate crawler, and blocking it does not affect your ranking in Google Search)

Decision Framework

  • Block AI training only: If you want to prevent your content from being used to train models but still appear in AI search results, block the training crawlers and allow the search/citation crawlers.
  • Block all AI crawlers: If you want to opt out of AI entirely, block both categories. Keep in mind this may reduce future traffic from AI-powered search tools.
  • Allow all: If you want maximum visibility across all platforms, including AI search results and potential citations.
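The "block AI training only" option above might look like the following file. The agent names match the lists in this article; check each vendor's current documentation before relying on them, since these tokens change over time:

```
# Block AI training crawlers
User-agent: GPTBot
User-agent: anthropic-ai
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Bytespider
User-agent: Meta-ExternalAgent
User-agent: Amazonbot
Disallow: /

# Allow AI search/citation crawlers
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: PerplexityBot
Allow: /

# Everyone else: full access
User-agent: *
Allow: /
```

Listing several User-agent lines above a single rule group is valid syntax: the group's rules apply to every agent named.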

Common robots.txt Mistakes

Blocking Googlebot entirely

Using User-agent: Googlebot with Disallow: / blocks Google from crawling your entire site. Rankings collapse and snippets disappear, even though stray URLs can still surface through external links. This is the single most damaging robots.txt mistake. If you need to block specific directories, be precise with your paths.
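For contrast, here is the damaging rule next to a precise one (the /admin/ path is a placeholder):

```
# Damaging: blocks Google from the entire site
User-agent: Googlebot
Disallow: /

# Precise: blocks only the admin area, for all crawlers
User-agent: *
Disallow: /admin/
```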

Blocking CSS and JavaScript files

Google needs to render your pages to understand their content. Blocking CSS or JS files through robots.txt prevents Google from seeing your page as users see it, which can hurt rankings.

Using robots.txt instead of noindex

If a page is blocked by robots.txt, Google cannot see the noindex tag on that page. The page could still appear in results (with limited information) if other sites link to it. To truly prevent indexing, use a meta robots noindex tag and make sure the page is not blocked by robots.txt.
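The tag itself goes in the page's head; for non-HTML resources such as PDFs, the X-Robots-Tag HTTP response header carries the same directive:

```html
<!-- In the page's <head>: allow crawling, forbid indexing -->
<meta name="robots" content="noindex">
```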

Conflicting rules

Having both Allow and Disallow for the same path creates ambiguity. While Google resolves this using the most specific (longest) rule, other crawlers may behave differently.

Missing Sitemap directive

While not required, including a Sitemap directive in your robots.txt is a simple way to help search engines discover all your pages. It is especially useful for new sites or sites with complex URL structures.

robots.txt vs. noindex

These two tools serve different purposes and are often confused. Here is when to use each:

  • Controls: robots.txt governs crawling (whether bots visit the page); a noindex meta tag governs indexing (whether the page appears in results).
  • Location: robots.txt lives at the site root (/robots.txt); noindex goes in the page's HTML head.
  • Best for: robots.txt suits conserving crawl budget and blocking non-public areas; noindex suits keeping specific pages out of search results.
  • Limitation: pages blocked by robots.txt can still be indexed via external links; bots must be able to crawl a page to see its noindex tag.

For the best results, use robots.txt to manage crawler access and noindex tags to control which pages appear in search results. If you want to prevent indexing, make sure the page is not blocked by robots.txt so search engines can see the noindex directive.

How to Deploy Your robots.txt

WordPress

WordPress generates a virtual robots.txt by default. To customize it, use a plugin like Yoast SEO or Rank Math (both have a robots.txt editor), or create a physical robots.txt file in your site's root directory to override the virtual one.

Shopify

Shopify auto-generates robots.txt. You can customize it by creating a robots.txt.liquid template in your theme. Go to Online Store, then Themes, then Edit Code, and add the file under the Templates folder.

Next.js

Place the file in your public/ directory as public/robots.txt. Next.js serves files from the public directory at the site root automatically.
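App Router projects on Next.js 13.3 or later can alternatively generate the file from code via an app/robots.ts metadata file; the rules below are placeholders:

```typescript
// app/robots.ts — Next.js serves the returned object as /robots.txt
import type { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      { userAgent: '*', allow: '/' },
      { userAgent: 'GPTBot', disallow: '/' }, // placeholder: block one AI training crawler
    ],
    sitemap: 'https://example.com/sitemap.xml',
  }
}
```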

Nginx / Apache / Vercel

For Nginx or Apache, place the file in your document root (the same directory as your index.html). For Vercel, place it in the public/ folder of your project. After deployment, verify it is accessible at yourdomain.com/robots.txt.
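If the file cannot live in the document root, Nginx can serve it from another path with an exact-match location block (the alias path is illustrative):

```nginx
location = /robots.txt {
    alias /srv/config/robots.txt;
}
```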

robots.txt Best Practices

Do:

  • Always include a Sitemap directive pointing to your XML sitemap
  • Test your robots.txt with the Validator tab before deploying changes
  • Block admin areas, API endpoints, and internal tool pages
  • Review your robots.txt quarterly to ensure it still matches your site structure
  • Use specific paths rather than broad blocking rules
  • Keep the file under 500KB (Google's size limit)
  • Separate AI training crawler controls from search engine crawler rules

Don't:

  • Block Googlebot or Bingbot from your entire site unless you intentionally want to deindex
  • Block CSS, JavaScript, or image files that search engines need to render your pages
  • Use robots.txt as a security measure (anyone can read your robots.txt file)
  • Block pages you want to prevent from being indexed (use noindex instead)
  • Forget to re-check after site migrations or URL structure changes
  • Set excessively high Crawl-delay values; Google ignores this directive, but Bing and Yandex respect it and will slow their crawling accordingly

Manage What Crawlers Find

Your robots.txt controls which pages get crawled. Content Raptor helps you optimize those pages so they rank higher. Analyze your content against competitors and get actionable recommendations.

Optimize Your Content with Content Raptor

Free 7-day trial available.

This tool generates standard robots.txt files and validates them against the Google robots.txt specification. Always test changes in a staging environment before deploying to production. For real-time crawl rate management, use Google Search Console.