What is robots.txt?
A robots.txt file is a plain text file at the root of your domain that tells web crawlers which pages or sections of your site they may and may not crawl. Googlebot fetches it before crawling your site. Used correctly, it preserves crawl budget by keeping crawlers away from low-value pages. Used incorrectly, it can block your entire site from being crawled and push it out of search results.
- robots.txt controls crawling, not indexing — Google may still index a blocked URL if links point to it.
- To prevent indexing, use a noindex meta tag, not robots.txt, and leave the page crawlable so Googlebot can see the tag.
- Always test your robots.txt rules in Google Search Console before deploying.
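- A misconfigured robots.txt can push your entire site out of search results. A missing file is far less dangerous: crawlers treat a 404 for robots.txt as permission to crawl everything.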
- Block: tag archives, filter pages, session IDs, admin URLs, and internal search results.
How robots.txt Works
robots.txt uses the Robots Exclusion Protocol. You specify user agents (crawlers) and the paths they're allowed or disallowed from accessing.
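As a rough illustration of how user-agent groups and path rules combine, the sketch below parses a made-up robots.txt with Python's standard `urllib.robotparser`. The file contents, the `ExampleBot` name, and the example.com URLs are assumptions for demonstration, not rules from a real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; paths and the "ExampleBot" name are illustrative.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /search

User-agent: ExampleBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text from a string instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Googlebot has no group of its own here, so it falls back to the "*" group.
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))
print(parser.can_fetch("Googlebot", "https://example.com/wp-admin/options.php"))

# ExampleBot has its own group, which disallows everything.
print(parser.can_fetch("ExampleBot", "https://example.com/blog/post"))
```

Note that `urllib.robotparser` matches rules by simple path prefix, so it is a reasonable model for plain directory rules but not for wildcard patterns.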
The most important rule: `Disallow: /` under `User-agent: *` blocks all compliant crawlers from the entire site. This is fine for staging environments but catastrophic in production. Always check your robots.txt after any deployment.
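One way to automate that post-deployment check is a small guard that fails when the homepage is not crawlable. This is a minimal sketch using the standard library; the function name `homepage_is_crawlable` and the sample file contents are hypothetical, and in practice you would feed it the robots.txt actually served by production.

```python
from urllib.robotparser import RobotFileParser


def homepage_is_crawlable(robots_txt: str, user_agent: str = "Googlebot") -> bool:
    """Return False when the rules for `user_agent` block the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, "/")


# Hypothetical file contents for a staging and a production environment.
STAGING = "User-agent: *\nDisallow: /\n"
PRODUCTION = "User-agent: *\nDisallow: /wp-admin/\n"

print(homepage_is_crawlable(STAGING))     # staging rules block the root
print(homepage_is_crawlable(PRODUCTION))  # production rules leave the root open
```

Running a check like this in a deployment pipeline catches the case where a staging robots.txt is pushed to production unchanged.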
Googlebot respects robots.txt by convention. It's not a security mechanism — a malicious bot will ignore it entirely.
What to Block in robots.txt
Good candidates for blocking: /wp-admin/, tag and category archive pages generating duplicate content, pagination beyond page 2-3, URL parameters generating duplicate content (?sort=, ?filter=, ?session=), internal search results (/search?q=), and print-friendly page versions.
Don't block: CSS and JavaScript files (Google needs these to render and understand pages), important landing pages, or any page you actually want indexed.
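Parameter rules such as `?sort=` usually rely on wildcard patterns (`*` for any run of characters, a trailing `$` for end of URL), which major crawlers such as Googlebot support. The sketch below is a simplified matcher for testing a rule set against sample paths before deploying it. The rule list, the sample paths, and the helper names are assumptions for illustration, and the precedence logic (longest matching pattern wins, ties go to allow) is a simplified reading of how Google describes rule resolution.

```python
import re
from typing import Sequence


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the start."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and let each "*" match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path: str, disallow: Sequence[str], allow: Sequence[str] = ()) -> bool:
    """Longest matching pattern wins; on a tie the allow rule wins."""
    def longest(patterns: Sequence[str]) -> int:
        return max(
            (len(p) for p in patterns if pattern_to_regex(p).match(path)),
            default=-1,
        )
    return longest(disallow) > longest(allow)


# Hypothetical rule set covering the kinds of pages listed above.
DISALLOW = ["/wp-admin/", "/tag/", "/search", "/print/",
            "/*?sort=", "/*?filter=", "/*?session="]

# Paths that must stay crawlable, including CSS and JavaScript assets.
MUST_CRAWL = ["/", "/blog/robots-txt-guide", "/assets/site.css", "/assets/app.js"]

for path in MUST_CRAWL:
    assert not is_blocked(path, DISALLOW), f"{path} is blocked by mistake"

print(is_blocked("/shoes?sort=price", DISALLOW))
print(is_blocked("/search?q=robots", DISALLOW))
```

A pattern like `/*?sort=` only matches when `sort` is the first query parameter; catching it in later positions would need an additional rule.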
`Disallow: /` in your production robots.txt blocks every crawler from your entire site. Pages already indexed will eventually drop out of search results as Googlebot can no longer recrawl them. This single line has caused major traffic losses for large sites, so always audit robots.txt immediately after any CMS migration, staging-to-production push, or site rebuild.
| robots.txt Disallow | Noindex Meta Tag |
|---|---|
| Prevents crawling of the page entirely | Allows crawling but removes page from search index |
| Page may still appear in search results if linked externally | Guarantees removal from search index once recrawled |
| Googlebot cannot see page content or follow its links | Googlebot reads the page and processes its links |
| Best for: admin pages, duplicate parameter URLs, staging sites | Best for: thin content, thank-you pages, internal utility pages |
| Takes effect once crawlers refetch robots.txt (Google generally caches it for up to 24 hours) | Takes effect after Googlebot recrawls and processes the tag |
| Does not require server rendering to be detected | Requires the page to be crawlable to detect the tag |
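The interaction in the table can be sketched in code: a noindex tag only works if the page is crawlable, so combining both on one URL defeats the tag. The example below detects a robots meta tag with the standard `html.parser` module and reports the likely outcome. The class and function names and the outcome labels are hypothetical, and the sketch ignores the `X-Robots-Tag` HTTP header, which can carry the same directive.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in {"robots", "googlebot"}:
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the HTML carries a noindex robots meta directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


def index_outcome(blocked_by_robots: bool, html: str) -> str:
    """Summarise what a compliant crawler can do with the page."""
    if blocked_by_robots:
        # The crawler never fetches the HTML, so any noindex tag goes unseen.
        return "not crawled; URL may still be indexed from links"
    if has_noindex(html):
        return "crawled; removed from the index after recrawl"
    return "crawled; eligible for indexing"


PAGE = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(index_outcome(blocked_by_robots=False, html=PAGE))
print(index_outcome(blocked_by_robots=True, html=PAGE))
```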
Frequently Asked Questions
robots.txt does not remove a page from Google's index, because it prevents crawling, not indexing. If a page has external links, Google may index the URL without ever crawling it. To remove a page from Google's index, add a noindex meta tag in the HTML head, or use the Removals tool in Google Search Console for urgent removals.
The robots.txt file must be at the root of the host: https://yourdomain.com/robots.txt. It cannot be in a subdirectory. Every domain and subdomain needs its own robots.txt.
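Because the file is tied to the host rather than to a page, the robots.txt location for any URL can be derived by keeping the scheme and host and replacing everything else. A minimal sketch, where the function name `robots_txt_url` and the `shop` subdomain are hypothetical:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host; drop the path, query string, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://yourdomain.com/blog/post?page=2"))
print(robots_txt_url("https://shop.yourdomain.com/cart"))
```

The two calls resolve to different files, which is why rules placed on the main domain do not apply to a subdomain.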