GEO · Generative Engine Optimisation · advanced · 3 min read

What is LLM Training Data?

LLM training data is the corpus of text — web pages, books, academic papers, code repositories, and other sources — that large language models are trained on. Content that was crawled and included in training datasets becomes baked into the model's knowledge. Understanding which sources are included and prioritised in LLM training data is foundational to building a GEO strategy.

60%
of Common Crawl training data comes from just 15% of domains, by token volume
Source: Dolma dataset analysis, 2023
Fact-checked against 3 sources
Last updated 8 June 2026
Key Takeaways
  • High-quality, widely-linked content from before training cutoffs is most likely to appear in LLM training data.
  • Common Crawl, C4, OpenWebText (the open reproduction of OpenAI's unreleased WebText), and The Pile are major public datasets. Understanding their crawl and filtering criteria is a GEO lever.
  • Academic and reference sources (Wikipedia, arXiv, government sites) are heavily over-represented in training data.
  • Content that reads as authoritative, factual, and well-structured is more likely to be retained after quality filtering.
  • Training data inclusion is a long game — publishing now affects models trained in future iterations.

How LLMs Are Trained and What They Learn

Language models are trained by exposing them to massive text datasets and teaching them to predict the next token (word fragment). The patterns, facts, and associations they learn come entirely from this training data.

Major public datasets include Common Crawl (periodic snapshots of the web collected by the nonprofit's own crawler, CCBot), Wikipedia (typically included in full), books via BookCorpus and Books3, academic papers via Semantic Scholar, and code via GitHub.

Proprietary models (GPT-4, Claude, Gemini) use private training sets that likely include licensed content, filtered web data, and curated high-quality sources. The exact contents aren't public, but the principle is the same: widely-cited, authoritative, high-quality text is over-represented.

GEO Implications: Getting Into Training Data

For future model iterations: publish original research, create authoritative definitions and explanations, get widely cited and linked across the web, and earn mentions in sources that are known to be heavily weighted in training (Wikipedia, academic papers, major publications).

Quality filtering is aggressive. Most content scraped from the web is filtered out before training. Content that survives filtering tends to be: long-form, coherent, grammatically correct, factually consistent, and not spam or boilerplate.

The payoff is long-lasting: well-written content that gets into training data can surface in every product built on that model, and it stays in the model's weights without ongoing SEO work.

570GB
Size of filtered Common Crawl text used in GPT-3 training
~45TB
Size of the compressed Common Crawl plaintext (41 monthly snapshots, 2016 to 2019) before filtering for GPT-3
~99%
Estimated share of raw crawled web content filtered out before LLM training
2T
Tokens in LLaMA 2's pretraining dataset
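
The ~99% figure can be sanity-checked against the two sizes quoted above. The sketch below simply divides one by the other; it assumes decimal units (1TB = 1,000GB) and that both numbers measure comparable text, which is a simplification.

```python
# Sizes quoted in this article; decimal units are an assumption.
raw_bytes = 45e12        # ~45TB of crawled text before filtering
filtered_bytes = 570e9   # 570GB left after filtering

kept = filtered_bytes / raw_bytes
print(f"kept: {kept:.2%}, filtered out: {1 - kept:.2%}")
# prints: kept: 1.27%, filtered out: 98.73%
```

On these numbers, a little over 1% of the raw text survives, which is where the rounded "~99% filtered out" comes from.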
PUBLIC VS. PROPRIETARY LLM TRAINING DATA SOURCES
Characteristic | Public Datasets (e.g., Common Crawl, Wikipedia) | Proprietary Datasets (e.g., those behind GPT-4, Claude)
Transparency | Fully documented, downloadable | Largely undisclosed
Content scope | Broad web crawl, encyclopedic, academic | Licensed content, curated web, internal data
Quality filtering | Documented filter pipelines (e.g., C4, MassiveText) | Proprietary filtering, typically more aggressive
GEO auditability | Can verify if your domain was crawled | Cannot directly verify inclusion
Update frequency | Common Crawl runs ~monthly; models retrain irregularly | Undisclosed; newer models may use more recent cutoffs
Example weight | Wikipedia heavily over-sampled (~3x in GPT-3) | High-quality sources presumed to be over-weighted
✓ DO

Publish original research with clear, citable statistics and findings

Write long-form, coherent content with consistent factual accuracy

Earn backlinks and citations from Wikipedia, academic papers, and major publications

Use canonical, crawlable URLs so training crawlers can discover and index your content

Define industry terms authoritatively — models learn definitions from repeated, consistent usage

✗ DON'T

Publish thin, boilerplate, or templated content — it is aggressively filtered out pre-training

Rely solely on social media posts or on gated or paywalled content that crawlers cannot access

Duplicate content across multiple pages — deduplication steps remove near-identical text

Use excessive ads, popups, or low-signal page layouts that quality classifiers penalise

Assume a one-time crawl guarantees permanent model inclusion — content must persist and remain authoritative

HOW WEB CONTENT BECOMES LLM KNOWLEDGE: FROM CRAWL TO MODEL
01
Web Crawling

Crawlers such as Common Crawl's CCBot or proprietary spiders fetch billions of web pages, storing raw HTML snapshots. Crawl frequency and domain prioritisation influence which content is captured.

02
Extraction & Deduplication

HTML is stripped to plain text. Near-duplicate documents are removed using hashing techniques (e.g., MinHash). This step heavily reduces dataset size and eliminates low-value repetitive content.
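
To make the MinHash idea concrete, here is a minimal sketch in plain Python. It is not the code of any particular pipeline: the shingle size, the number of hash functions, and the helper names (`shingles`, `minhash_signature`, `estimated_jaccard`) are all assumptions chosen for illustration. Each document is reduced to a short signature, and the fraction of matching signature slots estimates how much two documents overlap.

```python
from __future__ import annotations

import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles of a lowercased document."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed: int, shingle: str) -> int:
    """One member of a family of hash functions, selected by seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text: str, num_hashes: int = 128, k: int = 3) -> list[int]:
    """For each hash function, keep the minimum hash over all shingles."""
    doc_shingles = shingles(text, k)
    return [min(_seeded_hash(seed, s) for s in doc_shingles)
            for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Share of matching slots, an estimate of the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


original = " ".join(f"word{i}" for i in range(30))
near_copy = original.rsplit(" ", 1)[0] + " changed"   # last word swapped
unrelated = " ".join(f"other{i}" for i in range(30))

sig = minhash_signature(original)
print(estimated_jaccard(sig, minhash_signature(near_copy)))
print(estimated_jaccard(sig, minhash_signature(unrelated)))
```

A deduplication step would then drop one document from any pair whose estimated similarity exceeds a chosen threshold. Production systems usually add locality-sensitive hashing so they do not have to compare every pair of signatures.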

03
Quality Filtering

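Automated filters remove spam, adult content, incoherent text, and short low-information pages. Some pipelines (e.g., the one used for GPT-3) train a classifier to score how closely a page resembles curated reference text such as WebText, Wikipedia, and books; others (e.g., Google's C4) rely on heuristic rules such as keeping only lines that end in terminal punctuation.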
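
A rule-based filter is the simplest version of this step. The sketch below is illustrative only: the thresholds, the marker phrases, and the function name `clean_page` are assumptions, not the rules of any named dataset. It keeps lines that look like full sentences and rejects pages with too few of them.

```python
from __future__ import annotations

TERMINAL_PUNCTUATION = (".", "!", "?", '"')
# Hypothetical phrases that often signal boilerplate or placeholder text.
BOILERPLATE_MARKERS = ("lorem ipsum", "click here", "all rights reserved")


def clean_page(text: str, min_words_per_line: int = 5,
               min_lines: int = 3) -> str | None:
    """Return the filtered page text, or None if the page is rejected."""
    kept = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue  # navigation, buttons, headings without punctuation
        if len(line.split()) < min_words_per_line:
            continue  # too short to carry much information
        if any(marker in line.lower() for marker in BOILERPLATE_MARKERS):
            continue  # placeholder or boilerplate text
        kept.append(line)
    if len(kept) < min_lines:
        return None  # thin page: not enough substantive lines survive
    return "\n".join(kept)


page = "\n".join([
    "Home | About | Contact",
    "Language models learn patterns from very large text corpora.",
    "Subscribe free",
    "Quality filters remove pages that carry little information.",
    "Deduplication then removes documents that are nearly identical.",
    "lorem ipsum dolor sit amet, consectetur adipiscing elit.",
])
print(clean_page(page))
```

Here the navigation row, the button label, and the placeholder line are dropped, and the three full sentences survive. Real pipelines typically layer trained quality classifiers on top of rules like these.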

04
Domain & Source Weighting

High-authority sources are intentionally over-sampled. In GPT-3's training mix, Wikipedia was weighted ~3x and curated books datasets were included alongside filtered web data to boost factual grounding.
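
Over-sampling can be expressed as "effective epochs": how many times, on average, each source is seen during training. The sketch below uses made-up corpus sizes and sampling weights (they are not the figures of any real model), picked so the small encyclopedia source lands near the ~3x mentioned above.

```python
from __future__ import annotations


def effective_epochs(corpus_tokens: dict[str, float],
                     sampling_weight: dict[str, float],
                     training_tokens: float) -> dict[str, float]:
    """Average number of passes over each source during training."""
    total_weight = sum(sampling_weight.values())
    return {
        name: training_tokens * (sampling_weight[name] / total_weight)
        / corpus_tokens[name]
        for name in corpus_tokens
    }


# Hypothetical mixture: sizes in tokens, weights as shares of each batch.
corpus_tokens = {"web": 400e9, "books": 60e9, "encyclopedia": 3e9}
sampling_weight = {"web": 0.80, "books": 0.17, "encyclopedia": 0.03}

epochs = effective_epochs(corpus_tokens, sampling_weight, training_tokens=300e9)
for name, value in epochs.items():
    print(f"{name}: seen {value:.2f} times")
```

With these illustrative numbers the encyclopedia makes up only 3% of each batch yet is read about three times over, while the much larger web corpus is not even read once in full.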

05
Tokenisation & Training

Surviving text is tokenised into subword units and fed into the model during pretraining. Facts, associations, and writing styles present in this final corpus become encoded in the model's weights.
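
As a rough illustration of subword tokenisation, the toy function below splits a word by greedy longest match against a tiny hand-written vocabulary. Both the vocabulary and the matching rule are assumptions for demonstration; production tokenisers learn their vocabularies from data (for example with byte-pair encoding) and differ in detail.

```python
from __future__ import annotations


def tokenize(word: str, vocab: set[str]) -> list[str]:
    """Split a word into the longest vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary piece matches: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens


toy_vocab = {"token", "is", "ation", "train", "ing"}  # hypothetical vocabulary
print(tokenize("tokenisation", toy_vocab))  # ['token', 'is', 'ation']
print(tokenize("training", toy_vocab))      # ['train', 'ing']
```

The model never sees whole pages or even whole words, only sequences of pieces like these, and it is trained to predict each piece from the ones before it.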

ESTIMATED RELATIVE WEIGHTING OF SOURCE TYPES IN LLM TRAINING MIXTURES
Filtered Web (Common Crawl): Largest by volume but lowest quality per token; heavily filtered
Wikipedia: Small by size but aggressively over-sampled for factual grounding
Books / Long-form Text: BookCorpus, Books3, and licensed content valued for coherent reasoning
Academic Papers: arXiv, Semantic Scholar; critical for scientific and technical knowledge
Code Repositories: GitHub data is widely reported to improve reasoning beyond coding tasks
Curated Q&A / Forums: Stack Exchange, Reddit subsets; useful for conversational and instructional patterns
GEO READINESS CHECKLIST: OPTIMISING CONTENT FOR LLM TRAINING DATA INCLUSION
Content is publicly accessible via crawlable, canonical URLs with no login wall blocking key pages
Pages contain 600+ words of substantive, original analysis rather than thin summaries
Your brand, product, or concept is mentioned and linked from at least one Wikipedia article
You have published at least one piece of original research, survey data, or proprietary statistic that others cite
Content is grammatically correct, logically structured, and free of spam signals (excessive ads, keyword stuffing)
Your domain has earned backlinks from .edu, .gov, or high-authority media domains
Key definitions and explanations on your site are clear, consistent, and match how the industry uses those terms
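
The first checklist item can be partly automated by testing your robots.txt against a crawler's user agent. The sketch below uses Python's standard `urllib.robotparser`. The helper name `crawler_allowed`, the sample robots.txt, and the example.com URLs are illustrative; `CCBot` is the user agent Common Crawl's crawler identifies as.

```python
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether robots.txt rules let the given crawler fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative robots.txt: CCBot may crawl everything except /private/.
sample_robots = """\
User-agent: CCBot
Disallow: /private/

User-agent: *
Allow: /
"""

print(crawler_allowed(sample_robots, "CCBot", "https://example.com/guide"))
print(crawler_allowed(sample_robots, "CCBot", "https://example.com/private/report"))
```

This only checks robots.txt rules. Login walls, broken canonical tags, and content that needs JavaScript to render have to be verified separately.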

Frequently Asked Questions

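Can I check whether my content is in an LLM's training data?

Partially. For Common Crawl, you can search the URL index linked from commoncrawl.org to see whether your domain was crawled. Being crawled does not prove that a given model trained on your pages, since most crawled text is filtered out afterwards. For proprietary datasets, there's no public access. Some researchers have released tools that try to detect training data inclusion using membership inference attacks, but these are not reliable for general use.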
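
A Common Crawl lookup can be scripted. The sketch below only builds the query URL and parses a response, so it runs without network access. It assumes the public index server at index.commoncrawl.org and uses `CC-MAIN-2024-10` as an example crawl ID; the helper names and the sample response are hypothetical. Current crawl IDs are listed in the index server's `collinfo.json`.

```python
from __future__ import annotations

import json
from urllib.parse import urlencode

INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(domain: str, crawl_id: str) -> str:
    """Build a URL asking one crawl's index for all captures of a domain."""
    params = {"url": f"{domain}/*", "output": "json"}
    return f"{INDEX_HOST}/{crawl_id}-index?{urlencode(params)}"


def parse_index_response(body: str) -> list[dict]:
    """Parse the response body: one JSON record per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


query_url = build_index_query("example.com", "CC-MAIN-2024-10")
print(query_url)

# Hypothetical response body, for illustration only.
sample_body = (
    '{"url": "https://example.com/", "status": "200"}\n'
    '{"url": "https://example.com/guide", "status": "200"}\n'
)
captures = parse_index_response(sample_body)
print(len(captures), "captures found")
```

Fetching `query_url` with any HTTP client and passing the response text to `parse_index_response` gives the list of captured pages for that crawl. An empty result for every recent crawl suggests the domain is not being picked up.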

Will a model know about content I publish after its training cutoff?

No, unless the model is retrained or has retrieval capabilities. A model trained in early 2024 won't know about content you published in late 2024. This is why GEO is a long-term investment: the benefit compounds across future model training cycles rather than arriving immediately.

Sources & Further Reading
  1. Common Crawl: About
  2. Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining, 2023
  3. Together AI: RedPajama dataset documentation