SEO · Search Engine Optimisation · Beginner · 3 min read

What is robots.txt?

A robots.txt file is a plain text file at the root of your domain that tells web crawlers which pages or sections of your site they may and may not access. It's the first thing Googlebot checks before crawling your site. Used correctly, it preserves crawl budget by blocking low-value pages. Used incorrectly, it can block crawlers from your entire site and push your pages out of search results.

Fact-checked against 2 sources · Last updated 8 June 2026
Key Takeaways
  • robots.txt controls crawling, not indexing — Google may still index a blocked URL if links point to it.
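  • To prevent indexing, use a noindex meta tag, not robots.txt, and leave the page crawlable so Google can see the tag.
  • Always validate your robots.txt rules before deploying, then confirm in Google Search Console's robots.txt report that Google fetched the file you expect (the standalone robots.txt Tester has been retired).
  • A wrong robots.txt can tank your traffic quickly. A missing one is less dangerous: Google treats a 404 as permission to crawl everything.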
  • Block: tag archives, filter pages, session IDs, admin URLs, and internal search results.

How robots.txt Works

robots.txt uses the Robots Exclusion Protocol. You specify user agents (crawlers) and the paths they're allowed or disallowed from accessing.
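
As a rough illustration of how a compliant crawler reads these user-agent groups, the sketch below uses Python's standard-library `urllib.robotparser`. The rules and the example.com URLs are made up for illustration. The standard-library parser matches plain path prefixes in file order, which is simpler than Google's matching, so treat its answers as an approximation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one group for all crawlers, one for a single named crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /tag/

User-agent: AhrefsBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from text, no network request

# A crawler without its own group falls back to the '*' group.
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))             # True
print(parser.can_fetch("Googlebot", "https://example.com/wp-admin/options.php"))  # False

# A crawler with its own group follows that group instead.
print(parser.can_fetch("AhrefsBot", "https://example.com/blog/post"))             # False
```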

The most important rule: under `User-agent: *`, `Disallow: /` blocks all compliant crawlers from the entire site. This is fine for staging environments and catastrophic in production. Always check your robots.txt after any deployment.

Googlebot respects robots.txt by convention. It's not a security mechanism — a malicious bot will ignore it entirely.

What to Block in robots.txt

Good candidates for blocking: /wp-admin/, tag and category archive pages generating duplicate content, pagination beyond page 2-3, URL parameters generating duplicate content (?sort=, ?filter=, ?session=), internal search results (/search?q=), and print-friendly page versions.

Don't block: CSS and JavaScript files (Google needs these to render and understand pages), important landing pages, or any page you actually want indexed.

The Most Costly robots.txt Mistake

Disallow: / in your production robots.txt blocks every crawler from your entire site. Pages already indexed will eventually drop out of search results as Googlebot can no longer recrawl them. This single line has caused major traffic losses for large sites — always audit robots.txt immediately after any CMS migration, staging-to-production push, or site rebuild.
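
One way to catch this before it ships is a small check in the deployment pipeline. The sketch below is an assumption about how such a guard might look, not a standard tool: `blocks_homepage` is a hypothetical helper, and the three sample files are invented. It relies on the standard-library `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser


def blocks_homepage(robots_txt: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given crawler is not allowed to fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


# Invented sample files for illustration.
STAGING = "User-agent: *\nDisallow: /\n"
PRODUCTION = "User-agent: *\nDisallow: /wp-admin/\n"
EMPTY_DISALLOW = "User-agent: *\nDisallow:\n"  # empty value: nothing is blocked

print(blocks_homepage(STAGING))         # True
print(blocks_homepage(PRODUCTION))      # False
print(blocks_homepage(EMPTY_DISALLOW))  # False
```

A deploy script could read the robots.txt that is about to go live and abort when `blocks_homepage` returns True for the production host.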

✓ DO

Block URL parameters that create duplicate content (e.g., ?sessionid=, ?sort=, ?filter=)

Block internal search results pages (/search?q=)

Block admin and login pages (/wp-admin/, /login/)

Validate changes before deploying, then confirm them in Google Search Console's robots.txt report (the standalone robots.txt Tester has been retired)

Use a separate robots.txt on staging environments with Disallow: / to prevent accidental indexing

✗ DON'T

Block CSS or JavaScript files — Googlebot needs them to render and evaluate your pages

Block pages you want indexed, even if you think they're low quality

Use robots.txt as a security measure — malicious bots will ignore it entirely

Disallow pages you've listed in your sitemap, because this sends contradictory signals to Googlebot

Forget to check robots.txt after every major deployment or CMS update

ROBOTS.TXT KEY TERMS
User-agent

The crawler being addressed by a rule. Use '*' to target all crawlers, or specify a name like 'Googlebot' to target Google's crawler exclusively.

Disallow

A directive that tells the specified user-agent not to crawl a given path. An empty Disallow value means the crawler is allowed everywhere.

Crawl budget

The number of URLs Googlebot will crawl on your site within a given timeframe. Blocking low-value pages preserves crawl budget for pages that matter.

Robots Exclusion Protocol (REP)

The convention, standardised as RFC 9309 in 2022, that defines how robots.txt files are structured and interpreted by compliant web crawlers.

Sitemap directive

An optional line in robots.txt (e.g., Sitemap: https://example.com/sitemap.xml) that points crawlers to your XML sitemap to aid discovery.
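
To see how these terms fit together in a file, the sketch below reads the Sitemap directive with the standard-library parser. It assumes Python 3.8 or later, where `RobotFileParser.site_maps()` is available, and the file contents are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Invented file: one group of rules plus a Sitemap line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns None when the file has no Sitemap line, hence the fallback.
sitemaps = parser.site_maps() or []
print(sitemaps)  # ['https://example.com/sitemap.xml']
```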

REAL-WORLD EXAMPLE
A Clean robots.txt for a WordPress Site

    User-agent: *
    Disallow: /wp-admin/
    Disallow: /search?
    Disallow: /?s=
    Disallow: /tag/
    Disallow: /page/
    Allow: /wp-admin/admin-ajax.php

    User-agent: AhrefsBot
    Disallow: /

    Sitemap: https://example.com/sitemap.xml

This example blocks admin areas, internal search results, tag archives, and pagination from all crawlers while explicitly allowing admin-ajax.php (required for some front-end WordPress functionality). AhrefsBot is fully blocked to reduce server load from third-party crawlers. The Sitemap directive helps all compliant crawlers find indexable content.
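
The Allow line in this example only works because Google resolves conflicts by specificity: when an Allow and a Disallow both match, the longest matching rule wins, and Allow wins a tie. The sketch below is a hand-rolled approximation of that rule for a single user-agent group, with basic `*` and `$` wildcard support. The function names are hypothetical, and percent-encoding normalisation is left out. It avoids `urllib.robotparser` on purpose, since that parser has historically applied the first matching rule in file order and would report admin-ajax.php as blocked here.

```python
import re


def _rule_to_regex(path: str) -> re.Pattern:
    """Translate a robots.txt path into a regex anchored at the start of the URL path."""
    anchored = path.endswith("$")  # a trailing '$' pins the end of the URL
    if anchored:
        path = path[:-1]
    # '*' matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in path.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules: list[tuple[str, str]], url_path: str) -> bool:
    """rules: ("allow" | "disallow", path) pairs taken from one user-agent group."""
    best_len, best_allow = -1, True  # no matching rule means the URL is allowed
    for kind, path in rules:
        if not path:
            continue  # an empty value matches nothing
        if _rule_to_regex(path).match(url_path):
            allow = kind == "allow"
            # The longer rule wins; on equal length, allow wins.
            if len(path) > best_len or (len(path) == best_len and allow):
                best_len, best_allow = len(path), allow
    return best_allow


# The '*' group from the WordPress example.
WORDPRESS_RULES = [
    ("disallow", "/wp-admin/"),
    ("disallow", "/search?"),
    ("disallow", "/?s="),
    ("disallow", "/tag/"),
    ("disallow", "/page/"),
    ("allow", "/wp-admin/admin-ajax.php"),
]

print(is_allowed(WORDPRESS_RULES, "/wp-admin/admin-ajax.php"))  # True
print(is_allowed(WORDPRESS_RULES, "/wp-admin/options.php"))     # False
print(is_allowed(WORDPRESS_RULES, "/search?q=shoes"))           # False
print(is_allowed(WORDPRESS_RULES, "/blog/hello-world/"))        # True
```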

ROBOTS.TXT PRE-LAUNCH AUDIT CHECKLIST
  • Confirm Disallow: / does NOT appear for production under User-agent: *
  • Verify staging and development environments have Disallow: / to block crawling
  • Check that no CSS, JS, or image directories are blocked
  • Confirm key landing pages, product pages, and blog posts are crawlable
  • Validate the Sitemap directive points to the correct, live sitemap URL
  • Validate the rules with a robots.txt parser, then confirm the fetched file in Google Search Console's robots.txt report
  • Ensure the file is accessible at https://yourdomain.com/robots.txt with a 200 status code
  • Cross-check that no URLs listed in your XML sitemap are blocked by robots.txt
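
The last item on the checklist is easy to automate. The sketch below is one possible approach under stated assumptions: `blocked_sitemap_urls` is a hypothetical helper, the sitemap and rules are invented, and matching uses the standard-library parser's simple prefix rules, so wildcard patterns are not evaluated the way Google would evaluate them.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def blocked_sitemap_urls(
    robots_txt: str, sitemap_xml: str, user_agent: str = "Googlebot"
) -> list[str]:
    """Return sitemap URLs that the given crawler is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Invented inputs for illustration.
ROBOTS_TXT = "User-agent: *\nDisallow: /tag/\nDisallow: /wp-admin/\n"
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/tag/seo/</loc></url>
  <url><loc>https://example.com/blog/robots-txt/</loc></url>
</urlset>
"""

print(blocked_sitemap_urls(ROBOTS_TXT, SITEMAP_XML))
# ['https://example.com/tag/seo/']
```

Any URL this returns should either be removed from the sitemap or unblocked in robots.txt.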
ROBOTS.TXT VS. NOINDEX: WHICH SHOULD YOU USE?
| robots.txt Disallow | Noindex Meta Tag |
| --- | --- |
| Prevents crawling of the page entirely | Allows crawling but removes the page from the search index |
| Page may still appear in search results if linked externally | Page is dropped from the search index once recrawled |
| Googlebot cannot see page content or follow its links | Googlebot reads the page and processes its links |
| Best for: admin pages, duplicate parameter URLs, staging sites | Best for: thin content, thank-you pages, internal utility pages |
| Takes effect once crawlers refetch robots.txt (Google generally caches the file for up to 24 hours) | Takes effect after Googlebot recrawls the page and processes the tag |
| Does not require the page to be fetched or rendered | Requires the page to be crawlable so the tag can be detected |
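
The comparison implies one combination to avoid: a noindex tag on a page that robots.txt blocks, because the crawler never fetches the page and so never sees the tag. The sketch below illustrates that logic with the standard-library HTML parser. `RobotsMetaParser`, `has_noindex`, and `indexing_outcome` are hypothetical names, the sample page is invented, and the outcome strings summarise the table rather than any official Google output.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in {"robots", "googlebot"}:
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def has_noindex(html: str) -> bool:
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


def indexing_outcome(blocked_by_robots: bool, html: str) -> str:
    noindex = has_noindex(html)
    if blocked_by_robots and noindex:
        return "conflict: crawler never fetches the page, so the noindex tag is not seen"
    if blocked_by_robots:
        return "not crawled, but the URL can still be indexed from external links"
    if noindex:
        return "crawled, then dropped from the index"
    return "crawled and eligible for indexing"


# Invented thank-you page carrying a noindex directive.
PAGE = '<html><head><meta name="robots" content="noindex, follow"></head><body>Thanks!</body></html>'

print(indexing_outcome(blocked_by_robots=False, html=PAGE))
print(indexing_outcome(blocked_by_robots=True, html=PAGE))
```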

Frequently Asked Questions

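Does blocking a page in robots.txt remove it from Google's index?

No. robots.txt prevents crawling, not indexing. If a page has external links, Google may index the URL without ever crawling it. To remove a page from Google's index, add a noindex meta tag in the HTML head and make sure the page is not blocked in robots.txt, so Googlebot can see the tag. For urgent, temporary removals, use the Removals tool in Google Search Console.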

Where should the robots.txt file be located?

It must be at the root of the host: https://yourdomain.com/robots.txt. It cannot be in a subdirectory. Every domain and subdomain needs its own robots.txt.
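
Because the file always lives at the root of the host, its location can be derived from any page URL. A minimal sketch, assuming a hypothetical helper named `robots_txt_url` and invented example URLs:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); drop path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com/blog/post?utm_source=x"))
# https://example.com/robots.txt
print(robots_txt_url("https://shop.example.com/cart"))
# https://shop.example.com/robots.txt
```

The second call shows why a subdomain needs its own file: its rules are read from a different URL than those of the main domain.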

Sources & Further Reading
  • 1. Google Search Central: robots.txt documentation
  • 2. Robots Exclusion Protocol specification