What is LLM Training Data?
LLM training data is the corpus of text — web pages, books, academic papers, code repositories, and other sources — that large language models are trained on. Content that was crawled and included in training datasets becomes baked into the model's knowledge. Understanding which sources are included and prioritised in LLM training data is foundational to building a GEO strategy.
- High-quality, widely-linked content from before training cutoffs is most likely to appear in LLM training data.
- Common Crawl, C4, OpenWebText, and The Pile are major public datasets; understanding their crawl and filtering criteria is a GEO lever.
- Academic and reference sources (Wikipedia, arXiv, government sites) are heavily over-represented in training data.
- Content that reads as authoritative, factual, and well-structured is more likely to be retained after quality filtering.
- Training data inclusion is a long game — publishing now affects models trained in future iterations.
How LLMs Are Trained and What They Learn
Language models are trained by exposing them to massive text datasets and teaching them to predict the next token (word fragment). The patterns, facts, and associations they learn come entirely from this training data.
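As a rough illustration of next-token prediction, the sketch below counts which word follows which in a tiny made-up corpus and predicts the most frequent follower. It is a toy bigram counter, not how a neural language model works internally; the corpus and the function names `train_bigram` and `predict_next` are assumptions made up for this example.

```python
from collections import Counter, defaultdict
from typing import Optional


def train_bigram(corpus: str) -> dict:
    """Count, for every token, which tokens follow it in the corpus."""
    tokens = corpus.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows


def predict_next(follows: dict, token: str) -> Optional[str]:
    """Return the most frequent follower of `token`, or None if it was never seen."""
    counts = follows.get(token.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical three-sentence corpus, for illustration only.
corpus = (
    "common crawl is a web corpus . "
    "common crawl is public . "
    "wikipedia is a reference"
)
model = train_bigram(corpus)
print(predict_next(model, "crawl"))  # "is" follows "crawl" twice in this corpus
```

The point for GEO is visible even at this scale: the model can only reproduce associations that appear in the text it was trained on, and frequent associations win.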
Major public sources include: Common Crawl (an open archive of web crawl data, collected by a nonprofit's own crawler and updated regularly), Wikipedia, books via BookCorpus and Books3, academic papers via Semantic Scholar, and code from public GitHub repositories.
Proprietary models (GPT-4, Claude, Gemini) use private training sets that likely include licensed content, filtered web data, and curated high-quality sources. The exact contents aren't public, but the principle is the same: widely-cited, authoritative, high-quality text is over-represented.
GEO Implications: Getting Into Training Data
For future model iterations: publish original research, create authoritative definitions and explanations, get widely cited and linked across the web, and earn mentions in sources that are known to be heavily weighted in training (Wikipedia, academic papers, major publications).
Quality filtering is aggressive. Most content scraped from the web is filtered out before training. Content that survives filtering tends to be: long-form, coherent, grammatically correct, factually consistent, and not spam or boilerplate.
The payoff is long-lasting: well-written content that gets into training data can shape answers in every product built on that model, with no ongoing SEO work required for as long as that model is in use.
| Characteristic | Public Datasets (e.g., Common Crawl, Wikipedia) | Proprietary Datasets (e.g., GPT-4, Claude) |
|---|---|---|
| Transparency | Fully documented, downloadable | Largely undisclosed |
| Content scope | Broad web crawl, encyclopedic, academic | Licensed content, curated web, internal data |
| Quality filtering | Open filter pipelines (e.g., C4, Dolma) | Proprietary filtering, typically more aggressive |
| GEO auditability | Can verify if your domain was crawled | Cannot directly verify inclusion |
| Update frequency | Common Crawl runs ~monthly; models retrain irregularly | Undisclosed; newer models may use more recent cutoffs |
| Example weight | Wikipedia heavily over-sampled (seen ~3 times during GPT-3 training) | High-quality sources presumed to be over-weighted |
Crawlers such as Common Crawl's CCBot and proprietary spiders fetch billions of web pages and store raw HTML snapshots. Crawl frequency and domain prioritisation influence which content is captured.
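Whether a crawler captures a page at all depends first on the site's robots.txt. The sketch below uses Python's standard `urllib.robotparser` to check whether a given user agent may fetch a URL. The robots.txt content and the `example.com` URLs are made up for illustration, and `crawler_allowed` is a hypothetical helper name; `CCBot` is the user-agent token Common Crawl's crawler identifies itself with.

```python
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: CCBot is kept out of /private/, everyone else is unrestricted.
EXAMPLE_ROBOTS = """\
User-agent: CCBot
Disallow: /private/

User-agent: *
Disallow:
"""

print(crawler_allowed(EXAMPLE_ROBOTS, "CCBot", "https://example.com/blog/post"))
print(crawler_allowed(EXAMPLE_ROBOTS, "CCBot", "https://example.com/private/page"))
```

Running the same check against your own robots.txt is a quick way to confirm you are not blocking a crawler whose dataset you want to appear in.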
HTML is stripped to plain text. Near-duplicate documents are removed using hashing techniques (e.g., MinHash). This step heavily reduces dataset size and eliminates low-value repetitive content.
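A minimal sketch of MinHash-style near-duplicate detection follows, assuming word 3-grams ("shingles") as the unit of comparison and 64 salted hash functions. Both choices, along with the function names and sample sentences, are illustrative assumptions rather than the settings of any particular dataset pipeline.

```python
import hashlib
import re


def shingles(text: str, k: int = 3) -> set:
    """Split text into overlapping k-word shingles."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: set, num_hashes: int = 64) -> list:
    """For each salted hash function, keep the minimum hash over all items."""
    if not items:
        raise ValueError("cannot build a signature from an empty set")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "Common Crawl publishes a free and open archive of web crawl data that anyone can access"
doc_b = "Common Crawl publishes a free and open archive of web crawl data that anyone can use"
doc_c = "Tokenisers split text into subword units before pretraining begins on the final corpus"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print(estimated_jaccard(sig_a, sig_b), estimated_jaccard(sig_a, sig_c))
```

A pipeline would drop one document from any pair whose estimated similarity exceeds a chosen threshold, which is why lightly reworded copies of existing pages rarely survive this step.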
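Automated filters remove spam, adult content, incoherent text, and short low-information pages. Some pipelines train a classifier to score pages by their similarity to curated reference text such as Wikipedia and books (the GPT-3 paper describes this approach); others, such as Google's C4, rely on heuristic rules like keeping only lines that end in terminal punctuation.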
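Rule-based filtering of this kind can be sketched in a few lines. The rules below are loosely modelled on the style of heuristics published for open web datasets, but the specific thresholds (`min_words_per_line`, `min_lines`) and the function name `clean_page` are assumptions chosen for illustration, not the values of any real pipeline.

```python
from typing import Optional

TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def clean_page(text: str, min_words_per_line: int = 5, min_lines: int = 3) -> Optional[str]:
    """Return the filtered page text, or None if the whole page is rejected."""
    # Reject pages that look like placeholder text or leaked code/templates.
    if "lorem ipsum" in text.lower() or "{" in text:
        return None

    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Menus, buttons, and headings rarely end in sentence punctuation.
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue
        if len(line.split()) < min_words_per_line:
            continue
        kept.append(line)

    # Pages with too little surviving prose are dropped entirely.
    if len(kept) < min_lines:
        return None
    return "\n".join(kept)


page = """Home | About | Contact
Common Crawl maintains an open archive of web crawl data.
Dataset builders filter that archive before using it for training.
Subscribe free
Most pages are removed because they contain little usable prose.
"""
print(clean_page(page))
```

In this example the navigation line and the button label are stripped while the three full sentences survive, which mirrors why complete, well-punctuated prose tends to pass filtering.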
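High-authority sources are intentionally over-sampled. In GPT-3's training mix, Wikipedia was sampled often enough to be seen roughly three times during training, while the much larger filtered Common Crawl portion was seen less than once. Curated books datasets were included alongside the filtered web data to boost factual grounding.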
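The effect of over-sampling can be worked out with simple arithmetic: the number of times a source is seen equals its share of the training mix, multiplied by the total training tokens, divided by the size of the source. The sketch below uses round, illustrative numbers (in billions of tokens) that are assumptions for this example, not published figures for any model.

```python
def effective_epochs(dataset_tokens: dict, sampling_weights: dict, total_training_tokens: float) -> dict:
    """Return how many times each dataset is seen, on average, during training."""
    return {
        name: sampling_weights[name] * total_training_tokens / dataset_tokens[name]
        for name in dataset_tokens
    }


# Illustrative sizes in billions of tokens (assumed values).
dataset_tokens = {"filtered_web": 400, "reference": 3}
# Share of each training batch drawn from each source (assumed values).
sampling_weights = {"filtered_web": 0.60, "reference": 0.03}

epochs = effective_epochs(dataset_tokens, sampling_weights, total_training_tokens=300)
for name, value in epochs.items():
    print(f"{name}: seen about {value:.2f} times")
```

Under these assumed numbers the small reference source is seen about three times while the web portion is seen less than half a time, so a sentence in the reference source carries several times the weight of a sentence on an average web page.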
Surviving text is tokenised into subword units and fed into the model during pretraining. Facts, associations, and writing styles present in this final corpus become encoded in the model's weights.
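To make "subword units" concrete, here is a toy greedy longest-match tokeniser. Production tokenisers learn their vocabularies from data (for example with byte-pair encoding); the tiny hand-written vocabulary and the `tokenize` function below are assumptions for illustration and do not reproduce any model's actual tokeniser.

```python
def tokenize(word: str, vocab: set) -> list:
    """Split a word into the longest vocabulary pieces, scanning left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No vocabulary piece matches: fall back to a single character.
            end = start + 1
        tokens.append(word[start:end])
        start = end
    return tokens


# Hypothetical vocabulary, for illustration only.
VOCAB = {"token", "is", "ation", "pre", "train", "ing"}

print(tokenize("tokenisation", VOCAB))
print(tokenize("pretraining", VOCAB))
```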
Frequently Asked Questions
You can partially verify whether your content is in training data. For Common Crawl, you can search the URL index at index.commoncrawl.org to see if your domain was crawled. For proprietary datasets, there's no public access. Some researchers have released tools that try to detect training data inclusion using membership inference attacks, but these are not reliable for general use.
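A lookup against the Common Crawl URL index can be scripted. The sketch below builds a query for the index server's CDX-style endpoint and parses its line-delimited JSON response. The crawl ID `CC-MAIN-2024-10` and the domain `example.com` are placeholders (the list of available crawls is published at index.commoncrawl.org/collinfo.json), and the helper names are assumptions for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(domain: str, crawl_id: str, limit: int = 5) -> str:
    """Build a URL that asks one crawl's index for captures under a domain."""
    params = urlencode({"url": f"{domain}/*", "output": "json", "limit": limit})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def fetch_captures(query_url: str) -> list:
    """Fetch the query and parse one JSON record per line (requires network access)."""
    with urlopen(query_url, timeout=30) as response:
        body = response.read().decode("utf-8")
    return [json.loads(line) for line in body.splitlines() if line.strip()]


query = build_index_query("example.com", "CC-MAIN-2024-10")
print(query)
# captures = fetch_captures(query)  # uncomment to run the lookup over the network
```

Note that `fetch_captures` may raise an HTTP error when the index has no captures for the domain, so a real script would want to catch `urllib.error.HTTPError` and treat it as "not found". A capture in the index shows the page was crawled, not that any particular model trained on it.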
Content published after a model's training cutoff does not reach that model unless it is retrained or has retrieval capabilities. A model trained in early 2024 won't know about content you published in late 2024. This is why GEO is a long-term investment: the benefit compounds across future model training cycles rather than arriving immediately.