What is robots.txt?

A robots.txt file is a plain text file at the root of your domain that instructs web crawlers which pages or sections of your site they should and shouldn't access. It's the first thing Googlebot checks before crawling your site. Used correctly, it preserves crawl budget by blocking low-value pages. Used incorrectly, it can accidentally de-index your entire site.

Fact-checked against 2 sources · Last updated 8 June 2026

Key Takeaways

robots.txt controls crawling, not indexing — Google may still index a blocked URL if links point to it.
To prevent indexing, use a noindex meta tag — not robots.txt.
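Always validate your robots.txt rules before deploying, then confirm in Google Search Console's robots.txt report that Google fetched and parsed the live file as expected.
A misconfigured robots.txt can tank your entire site's search visibility quickly. A missing file (404) is read as "crawl everything", which is harmless in production but risky on staging.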
Block: tag archives, filter pages, session IDs, admin URLs, and internal search results.

In this article

01 How robots.txt Works
02 What to Block in robots.txt

How robots.txt Works

robots.txt uses the Robots Exclusion Protocol. You specify user agents (crawlers) and the paths they're allowed or disallowed from accessing.
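To make the user-agent and path model concrete, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The rules and the `example.com` URLs are made up for illustration, and the standard-library parser may not mirror Google's matching in every case (it does not handle wildcards, for example), so treat it as a rough check rather than a Googlebot simulator.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; example.com is a placeholder domain.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /tag/

User-agent: AhrefsBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Googlebot has no group of its own here, so it falls back to the "*" group.
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))  # True
print(parser.can_fetch("Googlebot", "https://example.com/tag/seo/"))   # False
# AhrefsBot matches its own group, which blocks everything.
print(parser.can_fetch("AhrefsBot", "https://example.com/blog/post"))  # False
```

Each group starts with one or more `User-agent` lines, and a crawler follows the group that names it, falling back to `*` when no group does.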

The most important rule: `Disallow: /` under `User-agent: *` blocks all compliant crawlers from the entire site. This is fine for staging environments but catastrophic in production. Always check your robots.txt after any deployment.
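A post-deployment check for this can be automated. The sketch below is one possible approach, not an official tool: `blocks_entire_site` is a hypothetical helper that scans robots.txt text for a bare `Disallow: /` in the group addressed to a given user agent. It deliberately ignores `Allow` overrides and wildcards, so a `True` result means "go and look", not "definitely broken".

```python
def blocks_entire_site(robots_txt: str, agent: str = "*") -> bool:
    """Return True if the group for `agent` contains a bare `Disallow: /`."""
    current_agents: list[str] = []
    in_rules = False  # True once the current group has started listing rules
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                current_agents, in_rules = [], False
            current_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and agent.lower() in current_agents:
                return True
    return False


# Hypothetical file contents for illustration.
STAGING = "User-agent: *\nDisallow: /\n"
PRODUCTION = "User-agent: *\nDisallow: /wp-admin/\n\nUser-agent: AhrefsBot\nDisallow: /\n"

print(blocks_entire_site(STAGING))     # True
print(blocks_entire_site(PRODUCTION))  # False
```

Running a check like this against the live file as the last step of a deployment can catch a staging robots.txt that was shipped to production by mistake.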

Googlebot respects robots.txt by convention. It's not a security mechanism — a malicious bot will ignore it entirely.

What to Block in robots.txt

Good candidates for blocking: /wp-admin/, tag and category archive pages generating duplicate content, pagination beyond page 2-3, URL parameters generating duplicate content (?sort=, ?filter=, ?session=), internal search results (/search?q=), and print-friendly page versions.

Don't block: CSS and JavaScript files (Google needs these to render and understand pages), important landing pages, or any page you actually want indexed.
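Blocking parameter URLs usually relies on wildcard patterns. The sketch below assumes the `*` (any run of characters) and trailing `$` (end of URL) syntax that Google documents and that the Robots Exclusion Protocol specification defines; not every crawler honours it. The `RULES` list and both helper functions are hypothetical, written only to show how such patterns match paths.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex.

    `*` matches any run of characters; a trailing `$` anchors the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path: str, disallow_patterns: list[str]) -> bool:
    """True if any Disallow pattern matches from the start of the path."""
    return any(pattern_to_regex(p).match(path) for p in disallow_patterns)


# Hypothetical Disallow patterns covering the candidates listed above.
# Note: "/*?sort=" only matches when sort is the first query parameter.
RULES = ["/wp-admin/", "/search?q=", "/*?sort=", "/*?filter=", "/*sessionid="]

print(is_blocked("/shoes?sort=price", RULES))       # True
print(is_blocked("/search?q=robots", RULES))        # True
print(is_blocked("/blog/robots-txt-guide", RULES))  # False
print(is_blocked("/assets/site.css", RULES))        # False
```

Testing patterns against a handful of real URLs from your own site, including the CSS and JavaScript paths you must keep open, is a cheap way to spot a rule that matches more than intended.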

The Most Costly robots.txt Mistake

Disallow: / in your production robots.txt blocks every crawler from your entire site. Pages already indexed will eventually drop out of search results as Googlebot can no longer recrawl them. This single line has caused major traffic losses for large sites — always audit robots.txt immediately after any CMS migration, staging-to-production push, or site rebuild.

✓ DO

✓ Block URL parameters that create duplicate content (e.g., ?sessionid=, ?sort=, ?filter=)
✓ Block internal search results pages (/search?q=)
✓ Block admin and login pages (/wp-admin/, /login/)
✓ Validate changes before deploying, then confirm the live file in Google Search Console's robots.txt report
✓ Use a separate robots.txt on staging environments with Disallow: / to prevent accidental indexing

✗ DON'T

✗ Block CSS or JavaScript files: Googlebot needs them to render and evaluate your pages
✗ Block pages you want indexed, even if you think they're low quality
✗ Use robots.txt as a security measure: malicious bots will ignore it entirely
✗ Disallow pages you've listed in your sitemap: this sends contradictory signals to Googlebot
✗ Forget to check robots.txt after every major deployment or CMS update

ROBOTS.TXT KEY TERMS

User-agent

The crawler being addressed by a rule. Use '*' to target all crawlers, or specify a name like 'Googlebot' to target Google's crawler exclusively.

Disallow

A directive that tells the specified user-agent not to crawl a given path. An empty Disallow value means the crawler is allowed everywhere.

Crawl budget

The number of URLs Googlebot will crawl on your site within a given timeframe. Blocking low-value pages preserves crawl budget for pages that matter.

Robots Exclusion Protocol (REP)

The standard convention that defines how robots.txt files are structured and interpreted by compliant web crawlers.

Sitemap directive

An optional line in robots.txt (e.g., Sitemap: https://example.com/sitemap.xml) that points crawlers to your XML sitemap to aid discovery.

REAL-WORLD EXAMPLE

A Clean robots.txt for a WordPress Site

    User-agent: *
    Disallow: /wp-admin/
    Disallow: /search?
    Disallow: /?s=
    Disallow: /tag/
    Disallow: /page/
    Allow: /wp-admin/admin-ajax.php

    User-agent: AhrefsBot
    Disallow: /

    Sitemap: https://example.com/sitemap.xml

This example blocks admin areas, internal search results, tag archives, and pagination from all crawlers while explicitly allowing admin-ajax.php (required for some front-end WordPress functionality). AhrefsBot is fully blocked to reduce server load from third-party crawlers. The Sitemap directive helps all compliant crawlers find indexable content.
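Why does admin-ajax.php stay crawlable when /wp-admin/ is disallowed? Google and the Robots Exclusion Protocol specification resolve conflicts by choosing the most specific (longest) matching rule, with Allow winning a tie. The sketch below illustrates that precedence with plain prefix matching; `is_allowed` and `WP_RULES` are hypothetical names, and wildcards are left out to keep it short.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Apply longest-match precedence to (directive, prefix) rules.

    Plain prefix matching only; wildcards are out of scope for this sketch.
    """
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # A longer prefix wins; on a tie, Allow wins over Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


# The path-only rules from the "User-agent: *" group in the example above.
WP_RULES = [
    ("Disallow", "/wp-admin/"),
    ("Disallow", "/tag/"),
    ("Disallow", "/page/"),
    ("Allow", "/wp-admin/admin-ajax.php"),
]

print(is_allowed("/wp-admin/admin-ajax.php", WP_RULES))  # True
print(is_allowed("/wp-admin/options.php", WP_RULES))     # False
print(is_allowed("/blog/hello-world/", WP_RULES))        # True
```

Because precedence depends on rule length rather than position, the Allow line works even though it appears after the Disallow lines. Crawlers that evaluate rules in file order may behave differently.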

ROBOTS.TXT PRE-LAUNCH AUDIT CHECKLIST

Confirm Disallow: / does NOT appear for production under User-agent: *

Verify staging and development environments have Disallow: / to block indexing

Check that no CSS, JS, or image directories are blocked

Confirm key landing pages, product pages, and blog posts are crawlable

Validate the Sitemap directive points to the correct, live sitemap URL

Check the file in Google Search Console's robots.txt report (the successor to the retired robots.txt Tester tool)

Ensure the file is accessible at https://yourdomain.com/robots.txt with a 200 status code

Cross-check that no URLs listed in your XML sitemap are blocked by robots.txt
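The last checklist item lends itself to a script. Here is a rough sketch using Python's standard-library `urllib.robotparser`; `blocked_sitemap_urls` is a hypothetical helper, the URLs are placeholders, and the list of sitemap URLs is assumed to have been extracted already. The standard-library parser does not handle wildcards, so treat the output as a first pass.

```python
from urllib.robotparser import RobotFileParser


def blocked_sitemap_urls(
    robots_txt: str, sitemap_urls: list[str], agent: str = "Googlebot"
) -> list[str]:
    """Return the sitemap URLs that robots.txt disallows for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in sitemap_urls if not parser.can_fetch(agent, url)]


# Hypothetical inputs for illustration.
ROBOTS = "User-agent: *\nDisallow: /tag/\n"
SITEMAP_URLS = [
    "https://example.com/",
    "https://example.com/tag/seo/",
    "https://example.com/blog/post/",
]

print(blocked_sitemap_urls(ROBOTS, SITEMAP_URLS))
# ['https://example.com/tag/seo/']
```

Any URL this returns should either come out of the sitemap or be unblocked in robots.txt.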

ROBOTS.TXT VS. NOINDEX: WHICH SHOULD YOU USE?

robots.txt Disallow	Noindex Meta Tag
Prevents crawling of the page entirely	Allows crawling but removes page from search index
URL may still appear in search results if other pages link to it	Removes the page from the search index once it is recrawled
Googlebot cannot see page content or follow its links	Googlebot reads the page and processes its links
Best for: admin pages, duplicate parameter URLs, staging sites	Best for: thin content, thank-you pages, internal utility pages
Takes effect once the crawler refetches robots.txt (Google generally caches it for up to 24 hours)	Takes effect after Googlebot recrawls and processes the tag
Does not require server rendering to be detected	Requires the page to be crawlable to detect the tag
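For the noindex column, the tag in question is a `<meta name="robots" content="noindex">` element in the page's head. As a small illustration, the sketch below uses Python's standard-library `html.parser` to detect it in an HTML string; `RobotsMetaParser` and `has_noindex` are hypothetical names, and the check ignores the equivalent X-Robots-Tag HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Record whether a robots (or googlebot) meta tag contains noindex."""

    def __init__(self) -> None:
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            if "noindex" in attr.get("content", "").lower():
                self.noindex = True


def has_noindex(html: str) -> bool:
    """True if the HTML contains a robots meta tag with a noindex value."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


# Hypothetical page snippets for illustration.
print(has_noindex('<head><meta name="robots" content="noindex, follow"></head>'))  # True
print(has_noindex('<head><meta name="description" content="noindex tips"></head>'))  # False
```

A crawler can only run this kind of check on pages it is allowed to fetch, which is why a noindex tag on a page blocked in robots.txt goes unseen.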

Frequently Asked Questions

robots.txt does not remove a page from Google's index: it prevents crawling, not indexing. If a page has external links, Google may index the URL without ever crawling it. To remove a page from Google's index, add a noindex meta tag in the HTML head and leave the page crawlable so the tag can be read, or use Google Search Console's Removals tool for urgent (temporary) removals.

robots.txt must sit at the root of the host: https://yourdomain.com/robots.txt. It cannot be in a subdirectory. Every domain and subdomain needs its own robots.txt.
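Because the file always lives at the root of the host, you can derive its location from any page URL. A minimal sketch with Python's standard-library `urllib.parse`, where `robots_txt_url` is a hypothetical helper name:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any subdomain or port); replace the rest.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://yourdomain.com/blog/post?utm_source=x"))
# https://yourdomain.com/robots.txt
print(robots_txt_url("https://shop.yourdomain.com/cart"))
# https://shop.yourdomain.com/robots.txt
```

The second call shows why a subdomain needs its own file: its pages map to a different robots.txt URL than the main domain's.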

Sources & Further Reading

1. Google Search Central: robots.txt documentation
2. Robots Exclusion Protocol specification
