Crawlability Checker
Check if search engines and AI crawlers can access your page. Analyzes robots.txt, meta robots tags, and HTTP headers.
What is Crawlability?
Crawlability refers to a search engine's ability to access and crawl your website's pages. If a page isn't crawlable, search engines like Google and Bing can't read its content, so it either won't appear in search results or will appear without a useful description. This tool checks the main factors that affect crawlability.
What This Tool Checks
- robots.txt Rules: Checks if your robots.txt file allows or blocks specific crawlers from accessing the URL.
- Meta Robots Tag: Analyzes the meta robots tag for directives like noindex, nofollow, noarchive, etc.
- X-Robots-Tag Header: Checks HTTP response headers for robots directives (an alternative to meta tags).
- Canonical Tag: Verifies if a canonical URL is set and whether it points to the current page or elsewhere.
- HTTP Status: Confirms the page returns a successful response (2xx status code).
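Once the status code and directive values have been extracted, the HTTP status and robots-directive checks above can be combined into a small pure function. This is a minimal sketch, not the tool's actual implementation; the function name and the shape of the `directives` set are assumptions:

```python
def page_is_indexable(status_code, directives):
    """Rough indexability verdict from two of the checks above.

    `directives` is assumed to be the merged, lowercased set of values
    found in the meta robots tag and the X-Robots-Tag header.
    """
    # A non-2xx response prevents indexing regardless of directives.
    if not 200 <= status_code < 300:
        return False
    # "none" is shorthand for "noindex, nofollow", so it also blocks indexing.
    return "noindex" not in directives and "none" not in directives
```

For example, `page_is_indexable(200, {"nofollow"})` is still `True`, because nofollow affects link discovery, not indexing of the page itself.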
Crawlers We Check
Search Engine Crawlers
- Googlebot - Google's web crawler
- Bingbot - Microsoft Bing's crawler
- YandexBot - Yandex's crawler (Russia's largest search engine)
- Baiduspider - Baidu's crawler (China's largest search engine)
- DuckDuckBot - DuckDuckGo's crawler (privacy-focused search)
- Applebot - Apple's crawler for Siri & Spotlight
AI Crawlers
- GPTBot - OpenAI's training crawler
- ClaudeBot - Anthropic's crawler
- CCBot - Common Crawl (open dataset)
Note: AI crawlers gather data for training. You can block them without affecting search rankings.
Understanding robots.txt
The robots.txt file is a text file at the root of your website that tells crawlers which pages they can and cannot access. It uses simple directives:
```
User-agent: *
Disallow: /private/
Allow: /public/

User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
```
Important: robots.txt blocks crawling, not indexing. Pages can still be indexed if linked from other sites. Use the noindex meta tag to prevent indexing entirely.
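Python's standard library can evaluate rules like these directly. A quick sketch using `urllib.robotparser` against the example file above:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /public/

User-agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Googlebot has no group of its own, so the * group applies.
print(rp.can_fetch("Googlebot", "https://example.com/public/page"))   # True
print(rp.can_fetch("Googlebot", "https://example.com/private/page"))  # False
# GPTBot matches its own group, which disallows everything.
print(rp.can_fetch("GPTBot", "https://example.com/public/page"))      # False
```

Note that `can_fetch` answers "may this agent crawl this URL?" only; it says nothing about indexing, for the reason given above.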
Meta Robots vs X-Robots-Tag
Meta Robots Tag
```
<meta name="robots" content="noindex, nofollow">
```
Placed in the HTML head. Works for HTML pages. Most common method.
X-Robots-Tag Header
```
X-Robots-Tag: noindex, nofollow
```
Sent as an HTTP response header. Works for any file type (PDFs, images, etc.). Set via server configuration.
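Pulling directives out of the HTML version can be done with the standard-library `html.parser`. This is a simplified sketch (the class name is made up, and it ignores crawler-specific tags like `<meta name="googlebot">`); header values can be split the same way:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directive values from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives += [d.strip().lower()
                                for d in a.get("content", "").split(",")
                                if d.strip()]

parser = RobotsMetaParser()
parser.feed('<head><meta name="robots" content="noindex, nofollow"></head>')
print(parser.directives)  # ['noindex', 'nofollow']
```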
Common Robots Directives
| Directive | Effect |
|---|---|
| noindex | Prevents the page from appearing in search results |
| nofollow | Tells crawlers not to follow links on this page |
| noarchive | Prevents cached versions in search results |
| nosnippet | Prevents text snippets in search results |
| none | Equivalent to noindex, nofollow |
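Because `none` is shorthand for two other directives, it helps to normalize a raw directive string before checking it. A small sketch of that normalization, assuming comma-separated input as shown in the examples above:

```python
def normalize_directives(raw):
    """Split a robots directive string (from a meta tag or an
    X-Robots-Tag header) and expand the `none` shorthand."""
    directives = {d.strip().lower() for d in raw.split(",") if d.strip()}
    if "none" in directives:
        directives.discard("none")
        directives |= {"noindex", "nofollow"}
    return directives
```

With this in place, `normalize_directives("none")` and `normalize_directives("noindex, nofollow")` produce the same set, so downstream checks only need to look for `noindex`.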
Common Crawlability Issues
Troubleshooting Guide
- Page not indexable: Check for noindex in meta tags or X-Robots-Tag header.
- Crawler blocked: Review your robots.txt rules. Remember that the User-agent: * group applies only to crawlers without a more specific group of their own; a crawler that matches a named group ignores the * rules entirely.
- Canonical points elsewhere: This tells search engines to index a different URL instead. Verify this is intentional.
- HTTP errors: 4xx and 5xx status codes prevent indexing. Fix server or page errors first.
- AI crawlers blocked but search engines allowed: This is often intentional to allow search indexing while preventing AI training data collection.
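To spot the "canonical points elsewhere" case programmatically, the page URL and the canonical URL can be compared after light normalization. This is a rough sketch only; it treats host case and a single trailing slash as equivalent, which may be stricter or looser than a given search engine's own canonicalization:

```python
from urllib.parse import urlsplit

def is_self_canonical(page_url, canonical_url):
    """Rough check that a canonical tag points back at the page itself."""
    def key(u):
        s = urlsplit(u)
        # Treat "/page" and "/page/" as the same resource for this check.
        path = s.path.rstrip("/") or "/"
        return (s.scheme.lower(), s.netloc.lower(), path, s.query)
    return key(page_url) == key(canonical_url)
```

For example, `is_self_canonical("https://Example.com/page/", "https://example.com/page")` returns `True`, while a canonical pointing at a different path returns `False` and deserves a manual look.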
Crawlability Best Practices
- Always have a robots.txt file, even if it allows all crawlers.
- Use noindex instead of robots.txt to block indexing; robots.txt only blocks crawling, and a crawler must be able to fetch the page to see its noindex directive.
- Set self-referencing canonical tags on all indexable pages.
- Test robots.txt changes in Google Search Console before deploying.
- Consider blocking AI crawlers if you don't want your content used for training.
- Regularly audit crawlability after site changes or migrations.
Optimize Your Content for Search
Crawlability is just the first step. Once search engines can access your pages, Content Raptor helps you optimize your content to rank higher against competitors.
Optimize Your Content with Content Raptor. A free 7-day trial is available.