Crawlability Checker

Check if search engines and AI crawlers can access your page. Analyzes robots.txt, meta robots tags, and HTTP headers.

What is Crawlability?

Crawlability refers to a search engine's ability to access and crawl your website's pages. If a page isn't crawlable, search engines like Google and Bing can't read its content, so it generally won't appear in search results. This tool checks multiple factors that affect crawlability.

What This Tool Checks

  • robots.txt Rules: Checks whether your robots.txt rules allow or block each crawler's access to the URL.
  • Meta Robots Tag: Analyzes the meta robots tag for directives such as noindex, nofollow, and noarchive.
  • X-Robots-Tag Header: Checks the HTTP response headers for robots directives (an alternative to meta tags).
  • Canonical Tag: Verifies whether a canonical URL is set and whether it points to the current page or elsewhere.
  • HTTP Status: Confirms the page returns a successful response (a 2xx status code). A code sketch of these checks follows this list.
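
Below is a minimal sketch of how the status- and header-based checks can be automated with Python's standard library. The URL and user-agent string are placeholders, and a real tool would also parse the HTML and robots.txt (both covered later on this page).

import urllib.request

# Hypothetical target URL; replace with the page you want to audit.
url = "https://example.com/some-page"

req = urllib.request.Request(url, headers={"User-Agent": "crawl-check/0.1"})
# Note: urlopen raises HTTPError for 4xx/5xx responses, which themselves
# indicate a crawlability problem.
with urllib.request.urlopen(req) as resp:
    # HTTP status: anything outside 2xx prevents indexing.
    print("HTTP status:", resp.status)
    # X-Robots-Tag header: may carry noindex/nofollow directives.
    print("X-Robots-Tag:", resp.headers.get("X-Robots-Tag", "(not set)"))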

Crawlers We Check

Search Engine Crawlers

  • Googlebot - Google's web crawler
  • Bingbot - Microsoft Bing's crawler
  • YandexBot - Yandex's crawler (Russian search engine)
  • Baiduspider - Baidu's crawler (Chinese search engine)
  • DuckDuckBot - DuckDuckGo's crawler (privacy-focused search)
  • Applebot - Apple's crawler for Siri & Spotlight

AI Crawlers

  • GPTBot - OpenAI's training crawler
  • ClaudeBot - Anthropic's crawler
  • CCBot - Common Crawl (open dataset)

Note: AI crawlers gather data for training. You can block them without affecting search rankings.
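
For example, these robots.txt rules block all three AI crawlers listed above while leaving search engine crawlers unaffected:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /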

Understanding robots.txt

The robots.txt file is a text file at the root of your website that tells crawlers which pages they can and cannot access. It uses simple directives:

User-agent: *
Disallow: /private/
Allow: /public/

User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
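
As a rough sketch, Python's standard urllib.robotparser module can evaluate rules like the ones above; the URLs below are placeholders:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# parse() takes the file's lines; set_url() plus read() would fetch a live file.
rp.parse("""User-agent: *
Disallow: /private/
Allow: /public/

User-agent: GPTBot
Disallow: /
""".splitlines())

# can_fetch(user_agent, url) uses the group whose User-agent matches,
# falling back to the * group.
print(rp.can_fetch("Googlebot", "https://example.com/public/page"))   # True
print(rp.can_fetch("Googlebot", "https://example.com/private/page"))  # False
print(rp.can_fetch("GPTBot", "https://example.com/public/page"))      # False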

Important: robots.txt blocks crawling, not indexing. Pages can still be indexed if linked from other sites. Use the noindex meta tag to prevent indexing entirely, and don't block that same page in robots.txt: a crawler must be able to fetch the page before it can see the noindex directive.

Meta Robots vs X-Robots-Tag

Meta Robots Tag

<meta name="robots" content="noindex, nofollow">

Placed in the HTML <head>. Works only for HTML pages. The most common method.

X-Robots-Tag Header

X-Robots-Tag: noindex, nofollow

An HTTP response header. Works for any file type (PDFs, images, etc.). Set via server config, as in the examples below.
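
As an illustration, the header can be set in common server configurations; the PDF-only scope here is just an example. In nginx:

add_header X-Robots-Tag "noindex, nofollow";

In Apache (requires mod_headers):

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>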

Common Robots Directives

  • noindex: Prevents the page from appearing in search results
  • nofollow: Tells crawlers not to follow links on this page
  • noarchive: Prevents cached versions in search results
  • nosnippet: Prevents text snippets in search results
  • none: Equivalent to noindex, nofollow (see the example below)
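
For example, the none shorthand collapses the two most common directives into a single value:

<meta name="robots" content="none">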

Common Crawlability Issues

Troubleshooting Guide

  • Page not indexable: Check for noindex in meta tags or the X-Robots-Tag header (see the sketch after this list).
  • Crawler blocked: Review your robots.txt rules. Remember that the wildcard group (User-agent: *) applies to any crawler without a more specific group of rules.
  • Canonical points elsewhere: This tells search engines to index a different URL instead. Verify this is intentional.
  • HTTP errors: 4xx and 5xx status codes prevent indexing. Fix server or page errors first.
  • AI crawlers blocked but search engines allowed: This is often intentional to allow search indexing while preventing AI training data collection.
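
As a sketch of how the noindex and canonical checks above can be automated, Python's built-in html.parser can pull both tags out of a fetched page; the HTML string here is a stand-in:

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    # Collects the meta robots directives and the canonical URL, if present.
    def __init__(self):
        super().__init__()
        self.robots = None
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.robots = a.get("content")
        if tag == "link" and a.get("rel", "").lower() == "canonical":
            self.canonical = a.get("href")

p = RobotsMetaParser()
p.feed('<head><meta name="robots" content="noindex">'
       '<link rel="canonical" href="https://example.com/page"></head>')
print("meta robots:", p.robots)   # noindex
print("canonical:", p.canonical)  # https://example.com/page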

Crawlability Best Practices

  • Always have a robots.txt file, even if it allows all crawlers.
  • Use noindex instead of robots.txt to block indexing (robots.txt only blocks crawling).
  • Set self-referencing canonical tags on all indexable pages (see the example after this list).
  • Test robots.txt changes in Google Search Console before deploying.
  • Consider blocking AI crawlers if you don't want your content used for training.
  • Regularly audit crawlability after site changes or migrations.
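
A self-referencing canonical tag is a single link element in the page's head; the URL here is a placeholder:

<link rel="canonical" href="https://example.com/current-page">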

Optimize Your Content for Search

Crawlability is just the first step. Once search engines can access your pages, Content Raptor helps you optimize your content to rank higher against competitors.

Optimize Your Content with Content Raptor

Free 7-day trial available.