GEO · Generative Engine Optimisation · advanced · 3 min read

What is LLM Training Data?

LLM training data is the corpus of text — web pages, books, academic papers, code repositories, and other sources — that large language models are trained on. Content that was crawled and included in training datasets becomes baked into the model's knowledge. Understanding which sources are included and prioritised in LLM training data is foundational to building a GEO strategy.

60%

of Common Crawl training data comes from just 15% of domains, by token volume

Source: Dolma dataset analysis, 2023

Fact-checked against 3 sources · Last updated 8 June 2026

Key Takeaways

High-quality, widely-linked content from before training cutoffs is most likely to appear in LLM training data.
Common Crawl, C4, OpenWebText, and The Pile are major public datasets (the original WebText was never released); understanding their crawl and filtering criteria is a GEO lever.
Academic and reference sources (Wikipedia, arXiv, government sites) are heavily over-represented in training data.
Content that reads as authoritative, factual, and well-structured is more likely to be retained after quality filtering.
Training data inclusion is a long game — publishing now affects models trained in future iterations.

In this article

01 How LLMs Are Trained and What They Learn
02 GEO Implications: Getting Into Training Data

How LLMs Are Trained and What They Learn

Language models are trained by exposing them to massive text datasets and teaching them to predict the next token (word fragment). The patterns, facts, and associations they learn come entirely from this training data.
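To make "predicting the next token" concrete, here is a deliberately tiny sketch. It is not how production LLMs work (they use neural networks over subword tokens); it is a count-based stand-in that assumes whitespace-separated words as tokens and a made-up three-sentence corpus. The function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for every token, which tokens follow it in the corpus."""
    follows = defaultdict(Counter)
    for doc in corpus:
        tokens = doc.lower().split()  # assumption: whitespace tokens, not subwords
        for prev, nxt in zip(tokens, tokens[1:]):
            follows[prev][nxt] += 1
    return follows

def predict_next(follows, token):
    """Return the most frequent follower of `token`, or None if it was never seen."""
    counts = follows.get(token.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Made-up corpus for illustration only.
corpus = [
    "common crawl is a web archive",
    "common crawl is a nonprofit",
    "common crawl publishes web snapshots",
]
model = train_bigram(corpus)

print(predict_next(model, "crawl"))  # the follower seen most often in the corpus
print(predict_next(model, "geo"))    # a token that never appeared in training
```

The second call illustrates the point of the paragraph above: a token the model never saw in training has no learned continuation at all.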

Major public datasets include: Common Crawl (periodic snapshots of the web, gathered by the nonprofit Common Crawl's own crawler), Wikipedia (typically included in full), books via BookCorpus and Books3, academic papers via Semantic Scholar, and code via public GitHub repositories.

Proprietary models (GPT-4, Claude, Gemini) use private training sets that likely include licensed content, filtered web data, and curated high-quality sources. The exact contents aren't public, but the same principle is presumed to hold: widely cited, authoritative, high-quality text is over-represented.

GEO Implications: Getting Into Training Data

For future model iterations: publish original research, create authoritative definitions and explanations, get widely cited and linked across the web, and earn mentions in sources that are known to be heavily weighted in training (Wikipedia, academic papers, major publications).

Quality filtering is aggressive. Most content scraped from the web is filtered out before training. Content that survives filtering tends to be: long-form, coherent, grammatically correct, factually consistent, and not spam or boilerplate.

The payoff is long-lasting: well-written content that gets into training data can shape answers in every product built on that model, without depending on ongoing SEO work for that model version.

570GB

Size of filtered Common Crawl text used in GPT-3 training

~45TB

Compressed Common Crawl plaintext (41 monthly shards, 2016 to 2019) that GPT-3's pipeline started from before filtering

~99%

Estimated share of raw crawled web content filtered out before LLM training

2T

Tokens in LLaMA 2's training dataset
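The ~99% figure can be sanity-checked with simple arithmetic. The sketch below assumes the ~45TB and 570GB figures describe the input and output of the same GPT-3 filtering pipeline, and that decimal units (1 TB = 1000 GB) are close enough for an estimate. Both sizes are for compressed text, so the result is approximate.

```python
# Figures quoted in the stats above; treated here as before/after sizes
# of the same filtering pipeline (an assumption of this sketch).
RAW_TB = 45
FILTERED_GB = 570
GB_PER_TB = 1000  # decimal units, assumed

kept = FILTERED_GB / (RAW_TB * GB_PER_TB)
removed = 1 - kept

print(f"kept {kept:.1%}, removed {removed:.1%}")
```

Roughly 1.3% of the starting text survives, which is where a "~99% filtered out" headline comes from.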

PUBLIC VS. PROPRIETARY LLM TRAINING DATA SOURCES

Characteristic	Public Datasets (e.g., Common Crawl, Wikipedia)	Proprietary Datasets (e.g., those behind GPT-4, Claude)
Transparency	Fully documented, downloadable	Largely undisclosed
Content scope	Broad web crawl, encyclopedic, academic	Licensed content, curated web, internal data
Quality filtering	Published filter rules (e.g., C4, MassiveText)	Proprietary filtering, presumed to be more aggressive
GEO auditability	Can verify if your domain was crawled	Cannot directly verify inclusion
Update frequency	Common Crawl runs ~monthly; models retrain irregularly	Undisclosed; newer models may use more recent cutoffs
Example weight	Wikipedia heavily over-sampled (~3x in GPT-3)	High-quality sources presumed to be over-weighted

✓ DO

✓ Publish original research with clear, citable statistics and findings
✓ Write long-form, coherent content with consistent factual accuracy
✓ Earn backlinks and citations from Wikipedia, academic papers, and major publications
✓ Use canonical, crawlable URLs so training crawlers can discover and index your content
✓ Define industry terms authoritatively; models learn definitions from repeated, consistent usage
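One way to check the "crawlable" point is to test your robots.txt rules against a crawler's user-agent token. The sketch below uses Python's standard `urllib.robotparser` and Common Crawl's `CCBot` token. The robots.txt content and the example.com URLs are made up for illustration, and `crawler_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical robots.txt: CCBot is kept out of /members/ only.
ROBOTS = """\
User-agent: CCBot
Disallow: /members/

User-agent: *
Disallow:
"""

public_page = "https://example.com/guides/llm-training-data"
gated_page = "https://example.com/members/report"

print(crawler_allowed(ROBOTS, "CCBot", public_page))
print(crawler_allowed(ROBOTS, "CCBot", gated_page))
```

This only checks robots.txt rules. Login walls, paywalls, and client-side rendering can still block a crawler that robots.txt allows.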

✗ DON'T

✗ Publish thin, boilerplate, or templated content; it is aggressively filtered out pre-training
✗ Rely solely on social media posts or gated paywalled content that crawlers cannot access
✗ Duplicate content across multiple pages; deduplication steps remove near-identical text
✗ Use excessive ads, popups, or low-signal page layouts that quality classifiers penalise
✗ Assume a one-time crawl guarantees permanent model inclusion; content must persist and remain authoritative

HOW WEB CONTENT BECOMES LLM KNOWLEDGE: FROM CRAWL TO MODEL

Web Crawling

Crawlers like Common Crawl's CCBot or proprietary spiders fetch billions of web pages, storing raw HTML snapshots. Crawl frequency and domain prioritisation influence which content is captured.

Extraction & Deduplication

HTML is stripped to plain text. Near-duplicate documents are removed using hashing techniques (e.g., MinHash). This step heavily reduces dataset size and eliminates low-value repetitive content.
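A minimal sketch of MinHash-style near-duplicate detection, using only the standard library. The shingle size, the number of hash functions, and the sample texts are assumptions chosen for illustration. Production pipelines typically add locality-sensitive hashing so they do not have to compare every pair of documents.

```python
import hashlib

def shingles(text: str, k: int = 3) -> set:
    """Split text into overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(items: set, num_hashes: int = 64) -> list:
    """For each seeded hash function, keep the smallest hash value over all shingles."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The share of matching signature slots approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Made-up documents: one near-copy (a single word changed) and one unrelated page.
original = "Language models are trained on large text datasets scraped from the public web"
near_copy = "Language models are trained on large text datasets scraped from the open web"
unrelated = "Our bakery sells fresh sourdough bread and pastries every single morning"

sig_original = minhash_signature(shingles(original))
sig_near = minhash_signature(shingles(near_copy))
sig_unrelated = minhash_signature(shingles(unrelated))

print(estimated_jaccard(sig_original, sig_near))       # high: most shingles are shared
print(estimated_jaccard(sig_original, sig_unrelated))  # near zero: no shared shingles
```

A pipeline would drop one document from any pair whose estimated similarity exceeds a chosen threshold. The threshold itself is a tuning decision not specified here.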

Quality Filtering

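Automated filters remove spam, adult content, incoherent text, and short low-information pages. Some pipelines rely on heuristic rules (e.g., Google's C4 keeps only lines that end in terminal punctuation and drops very short pages), while others score pages with a classifier trained to recognise reference-quality text (e.g., GPT-3's filter, trained on curated corpora such as WebText and Wikipedia).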
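A minimal sketch of the rule-based end of this step. The thresholds, the blocklist phrases, and the `clean_page` name are assumptions for illustration, not any specific pipeline's published settings.

```python
import re
from typing import Optional

MIN_WORDS_PER_LINE = 3   # assumed threshold
MIN_SENTENCES = 3        # assumed threshold
BLOCKLIST = ("lorem ipsum", "javascript must be enabled")  # hypothetical phrases
TERMINAL_PUNCTUATION = (".", "!", "?", '"')

def clean_page(text: str) -> Optional[str]:
    """Return the cleaned page text, or None if the page should be dropped."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Drop menu items, buttons, and other short fragments.
        if len(line.split()) < MIN_WORDS_PER_LINE:
            continue
        # Keep only lines that look like complete sentences.
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue
        kept.append(line)

    cleaned = "\n".join(kept)
    if any(phrase in cleaned.lower() for phrase in BLOCKLIST):
        return None
    # Crude sentence count: runs of text ending in ., ! or ?
    if len(re.findall(r"[^.!?]+[.!?]", cleaned)) < MIN_SENTENCES:
        return None
    return cleaned

article = (
    "Common Crawl publishes web snapshots.\n"
    "Menu\n"
    "Filters remove most raw pages before training.\n"
    "Subscribe now\n"
    "Only coherent text tends to survive."
)
thin_page = "Home\nAbout us\nBuy now!"

print(clean_page(article))    # navigation fragments stripped, prose kept
print(clean_page(thin_page))  # None: nothing substantive remains
```

Classifier-based filters replace these hand-written rules with a learned score, but the effect on thin or templated pages is similar.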

Domain & Source Weighting

High-authority sources are intentionally over-sampled. In GPT-3's training mix, Wikipedia was weighted ~3x and curated books datasets were included alongside filtered web data to boost factual grounding.
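Over-sampling is easiest to see as "epochs per source": how many times the training run passes over each dataset. The sketch below uses made-up round numbers (not any model's published mixture) to show how a small source with a modest share of the mix can still be seen several times, while a huge web crawl is seen less than once.

```python
# Illustrative placeholder values only; not any model's published mixture.
TRAINING_BUDGET_TOKENS = 300e9

SOURCE_SIZE_TOKENS = {
    "filtered_web": 400e9,
    "books": 50e9,
    "wikipedia": 3e9,
}

MIX_WEIGHT = {  # share of training tokens drawn from each source; sums to 1.0
    "filtered_web": 0.80,
    "books": 0.17,
    "wikipedia": 0.03,
}

def epochs_per_source(sizes: dict, weights: dict, budget: float) -> dict:
    """Tokens drawn from a source divided by the tokens it contains."""
    return {name: budget * weights[name] / sizes[name] for name in sizes}

epochs = epochs_per_source(SOURCE_SIZE_TOKENS, MIX_WEIGHT, TRAINING_BUDGET_TOKENS)
for name, value in epochs.items():
    print(f"{name}: {value:.2f} epochs")
```

With these placeholder numbers, the Wikipedia-sized source is read about three times while the web source is read about 0.6 times. That pattern is what "weighted ~3x" describes.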

Tokenisation & Training

Surviving text is tokenised into subword units and fed into the model during pretraining. Facts, associations, and writing styles present in this final corpus become encoded in the model's weights.
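To illustrate what "subword units" means, here is a simplified greedy longest-match tokeniser. Real tokenisers learn their vocabulary from data (for example with byte-pair encoding). The tiny vocabulary and the function name below are hypothetical.

```python
# Hypothetical vocabulary; real vocabularies hold tens of thousands of learned pieces.
VOCAB = {"token", "is", "ation", "train", "ing"}

def tokenize_word(word: str, vocab: set) -> list:
    """Split a word into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            end = start + 1  # no match: fall back to a single character
        pieces.append(word[start:end])
        start = end
    return pieces

print(tokenize_word("tokenisation", VOCAB))
print(tokenize_word("training", VOCAB))
print(tokenize_word("geo", VOCAB))
```

Common words map to a few familiar pieces, while unfamiliar strings break into many small fragments. This is one reason consistent, conventional terminology is easier for a model to learn.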

ESTIMATED RELATIVE WEIGHTING OF SOURCE TYPES IN LLM TRAINING MIXTURES

Filtered Web (Common Crawl): Largest by volume but lowest quality per token; heavily filtered

Wikipedia: Small by size but aggressively over-sampled for factual grounding

Books / Long-form Text: BookCorpus, Books3, and licensed content valued for coherent reasoning

Academic Papers: arXiv, Semantic Scholar; critical for scientific and technical knowledge

Code Repositories: GitHub data is often credited with improving logical reasoning beyond coding tasks

Curated Q&A / Forums: Stack Exchange, Reddit subsets; useful for conversational and instructional patterns

GEO READINESS CHECKLIST: OPTIMISING CONTENT FOR LLM TRAINING DATA INCLUSION

Content is publicly accessible via crawlable, canonical URLs with no login wall blocking key pages

Pages contain 600+ words of substantive, original analysis rather than thin summaries

Your brand, product, or concept is mentioned and linked from at least one Wikipedia article

You have published at least one piece of original research, survey data, or proprietary statistic that others cite

Content is grammatically correct, logically structured, and free of spam signals (excessive ads, keyword stuffing)

Your domain has earned backlinks from .edu, .gov, or high-authority media domains

Key definitions and explanations on your site are clear, consistent, and match how the industry uses those terms

Frequently Asked Questions

You can partially check whether your content is in LLM training data. For Common Crawl, you can search the URL index at index.commoncrawl.org to see if your domain was crawled. For proprietary datasets, there's no public access. Some researchers have released tools that try to detect training data inclusion using membership inference attacks, but these are not reliable for general use.
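As a starting point for the Common Crawl check, the sketch below builds a query URL for the public URL index. The crawl ID shown is only an example; the list of available crawls is published at index.commoncrawl.org/collinfo.json. The helper name is hypothetical, and the code constructs the URL without fetching it.

```python
from urllib.parse import urlencode

INDEX_HOST = "https://index.commoncrawl.org"

def cc_index_query_url(domain: str, crawl_id: str) -> str:
    """Build a URL-index query for every captured page under `domain`."""
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"

# Example crawl ID (an assumption); substitute one listed in collinfo.json.
query_url = cc_index_query_url("example.com", "CC-MAIN-2024-10")
print(query_url)
```

Fetching that URL (for example with `urllib.request.urlopen`) returns one JSON record per captured page. No records for a given crawl suggests the domain was not captured in that snapshot, though it says nothing about whether a particular model trained on it.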

Newly published content does not reach an existing model unless that model is retrained or has retrieval capabilities. A model trained in early 2024 won't know about content you published in late 2024. This is why GEO is a long-term investment: the benefit compounds across future model training cycles rather than arriving immediately.

Sources & Further Reading

1. Common Crawl: About
2. Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining, 2023
3. Together AI: RedPajama dataset documentation
