Crawlability Checker

Check if search engines and AI crawlers can access your page. Analyzes robots.txt, meta robots tags, and HTTP headers.

What is Crawlability?

Crawlability refers to a search engine's ability to access and crawl your website's pages. If a page isn't crawlable, search engines like Google and Bing can't read its content, so it generally won't appear in search results. This tool checks multiple factors that affect crawlability.

What This Tool Checks

  • robots.txt Rules: Checks whether your robots.txt rules allow or block each crawler's access to the URL.
  • Meta Robots Tag: Analyzes the meta robots tag for directives such as noindex, nofollow, and noarchive.
  • X-Robots-Tag Header: Checks the HTTP response headers for robots directives (an alternative to meta tags).
  • Canonical Tag: Verifies whether a canonical URL is set and whether it points to the current page or elsewhere.
  • HTTP Status: Confirms the page returns a successful response (a 2xx status code). A code sketch of these checks follows this list.
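
Below is a minimal sketch of how the status- and header-based checks can be automated with Python's standard library. The URL and user-agent string are placeholders, and a real tool would also parse the HTML and robots.txt (both covered later on this page).

import urllib.request

# Hypothetical target URL; replace with the page you want to audit.
url = "https://example.com/some-page"

req = urllib.request.Request(url, headers={"User-Agent": "crawl-check/0.1"})
# Note: urlopen raises HTTPError for 4xx/5xx responses, which themselves
# indicate a crawlability problem.
with urllib.request.urlopen(req) as resp:
    # HTTP status: anything outside 2xx prevents indexing.
    print("HTTP status:", resp.status)
    # X-Robots-Tag header: may carry noindex/nofollow directives.
    print("X-Robots-Tag:", resp.headers.get("X-Robots-Tag", "(not set)"))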

Crawlers We Check

Search Engine Crawlers

  • Googlebot - Google's web crawler
  • Bingbot - Microsoft Bing's crawler
  • YandexBot - Yandex's crawler (Russian search engine)
  • Baiduspider - Baidu's crawler (Chinese search engine)
  • DuckDuckBot - DuckDuckGo's crawler (privacy-focused search)
  • Applebot - Apple's crawler for Siri & Spotlight

AI Crawlers

  • GPTBot - OpenAI's training crawler
  • ClaudeBot - Anthropic's crawler
  • CCBot - Common Crawl (open dataset)

Note: AI crawlers gather data for training. You can block them without affecting search rankings.
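
For example, these robots.txt rules block all three AI crawlers listed above while leaving search engine crawlers unaffected:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /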

Understanding robots.txt

The robots.txt file is a text file at the root of your website that tells crawlers which pages they can and cannot access. It uses simple directives:

User-agent: *
Disallow: /private/
Allow: /public/

User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
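
As a rough sketch, Python's standard urllib.robotparser module can evaluate rules like the ones above; the URLs below are placeholders:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# parse() takes the file's lines; set_url() plus read() would fetch a live file.
rp.parse("""User-agent: *
Disallow: /private/
Allow: /public/

User-agent: GPTBot
Disallow: /
""".splitlines())

# can_fetch(user_agent, url) uses the group whose User-agent matches,
# falling back to the * group.
print(rp.can_fetch("Googlebot", "https://example.com/public/page"))   # True
print(rp.can_fetch("Googlebot", "https://example.com/private/page"))  # False
print(rp.can_fetch("GPTBot", "https://example.com/public/page"))      # False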

Important: robots.txt blocks crawling, not indexing. Pages can still be indexed if linked from other sites. Use the noindex meta tag to prevent indexing entirely, and don't block that same page in robots.txt: a crawler must be able to fetch the page before it can see the noindex directive.

Meta Robots vs X-Robots-Tag

Meta Robots Tag

<meta name="robots" content="noindex, nofollow">

Placed in the HTML <head>. Works only for HTML pages. The most common method.

X-Robots-Tag Header

X-Robots-Tag: noindex, nofollow

An HTTP response header. Works for any file type (PDFs, images, etc.). Set via server config, as in the examples below.
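
As an illustration, the header can be set in common server configurations; the PDF-only scope here is just an example. In nginx:

add_header X-Robots-Tag "noindex, nofollow";

In Apache (requires mod_headers):

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>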

Common Robots Directives

  • noindex: Prevents the page from appearing in search results
  • nofollow: Tells crawlers not to follow links on this page
  • noarchive: Prevents cached versions in search results
  • nosnippet: Prevents text snippets in search results
  • none: Equivalent to noindex, nofollow (see the example below)
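
For example, the none shorthand collapses the two most common directives into a single value:

<meta name="robots" content="none">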

Common Crawlability Issues

Troubleshooting Guide

  • Page not indexable: Check for noindex in meta tags or the X-Robots-Tag header (see the sketch after this list).
  • Crawler blocked: Review your robots.txt rules. Remember that the wildcard group (User-agent: *) applies to any crawler without a more specific group of rules.
  • Canonical points elsewhere: This tells search engines to index a different URL instead. Verify this is intentional.
  • HTTP errors: 4xx and 5xx status codes prevent indexing. Fix server or page errors first.
  • AI crawlers blocked but search engines allowed: This is often intentional to allow search indexing while preventing AI training data collection.
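
As a sketch of how the noindex and canonical checks above can be automated, Python's built-in html.parser can pull both tags out of a fetched page; the HTML string here is a stand-in:

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    # Collects the meta robots directives and the canonical URL, if present.
    def __init__(self):
        super().__init__()
        self.robots = None
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.robots = a.get("content")
        if tag == "link" and a.get("rel", "").lower() == "canonical":
            self.canonical = a.get("href")

p = RobotsMetaParser()
p.feed('<head><meta name="robots" content="noindex">'
       '<link rel="canonical" href="https://example.com/page"></head>')
print("meta robots:", p.robots)   # noindex
print("canonical:", p.canonical)  # https://example.com/page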

Crawlability Best Practices

  • Always have a robots.txt file, even if it allows all crawlers.
  • Use noindex instead of robots.txt to block indexing (robots.txt only blocks crawling).
  • Set self-referencing canonical tags on all indexable pages (see the example after this list).
  • Test robots.txt changes in Google Search Console before deploying.
  • Consider blocking AI crawlers if you don't want your content used for training.
  • Regularly audit crawlability after site changes or migrations.
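
A self-referencing canonical tag is a single link element in the page's head; the URL here is a placeholder:

<link rel="canonical" href="https://example.com/current-page">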

Optimize Your Content for Search

Crawlability is just the first step. Once search engines can access your pages, Content Raptor helps you optimize your content to rank higher against competitors.

Optimize Your Content with Content Raptor

Free 7-day trial available.