Common Crawl

CCBot: crawler from Common Crawl

Open public web archive used by many LLM training datasets.

CCBot is the crawler behind Common Crawl, the open web archive that underpins many public LLM training datasets (including parts of the training data for GPT-3, LLaMA, Mistral, Falcon and others). Disallowing CCBot keeps your pages out of future Common Crawl snapshots, and so out of the open-research LLM training pipelines built on them, in one step.

Vendor: Common Crawl
Category: Crawlers (training & indexing)
User-Agent: CCBot

robots.txt snippets

Allow

    User-agent: CCBot
    Allow: /

Disallow

    User-agent: CCBot
    Disallow: /
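
The two snippets above can be sanity-checked locally before deployment. The following is a minimal sketch using Python's standard `urllib.robotparser`; the helper name `ccbot_may_fetch` and the example.com URL are hypothetical, chosen only for illustration, and the standard-library parser may not interpret every edge case the way CCBot itself does.

```python
from urllib.robotparser import RobotFileParser

# The two snippets from this page, as they would appear in /robots.txt.
ALLOW_RULES = """\
User-agent: CCBot
Allow: /
"""

DISALLOW_RULES = """\
User-agent: CCBot
Disallow: /
"""


def ccbot_may_fetch(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets CCBot fetch url.

    Hypothetical helper: it parses the text in memory and does not
    contact any server.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"  # hypothetical URL
    print("allow snippet:   ", ccbot_may_fetch(ALLOW_RULES, page))
    print("disallow snippet:", ccbot_may_fetch(DISALLOW_RULES, page))
```

Pasting a site's full robots.txt into the helper instead of the two-line snippets shows whether another group, such as a `User-agent: *` block, changes the outcome for CCBot.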

FAQ

What is CCBot?
CCBot is the crawler behind Common Crawl, the open web archive that underpins many public LLM training datasets (including parts of the training data for GPT-3, LLaMA, Mistral, Falcon and others). Disallowing CCBot keeps your pages out of future Common Crawl snapshots, and so out of the open-research LLM training pipelines built on them, in one step.
What is the user-agent string for CCBot?
CCBot identifies itself with the user-agent token "CCBot". You can match it in robots.txt with "User-agent: CCBot" and key your nginx or log-analyzer rules on the same token.
How do I allow CCBot in robots.txt?
Add the following block to your /robots.txt to explicitly grant CCBot access (crawling is allowed by default, so this matters mainly when a broader rule would otherwise block it):
    User-agent: CCBot
    Allow: /
How do I block CCBot in robots.txt?
Add the following block to your /robots.txt (well-behaved bots honor it, but not every crawler does):
    User-agent: CCBot
    Disallow: /
How can I check whether my site is ready for CCBot?
Run a free check at https://agentics.page. It audits whether your robots.txt allows the right bots, whether you publish llms.txt and JSON-LD structured data, whether your content is server-rendered, and whether CCBot can actually consume your site.
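
The token matching mentioned in the user-agent answer can also be tried against your own access logs. The sketch below assumes logs in the common "combined" format, where the User-Agent is the last quoted field; the function names, the sample lines, and the full user-agent string shown are illustrative assumptions, not taken from this page. Since any client can send an arbitrary User-Agent header, a match indicates a claimed identity only.

```python
import re

# Match the product token "CCBot" case-insensitively on word boundaries,
# so "CCBot/2.0" matches but "NotCCBotX" does not.
CCBOT_TOKEN = re.compile(r"\bCCBot\b", re.IGNORECASE)

# In the combined log format the User-Agent is the last quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def is_ccbot(user_agent: str) -> bool:
    """Return True if the User-Agent header claims to be CCBot."""
    return bool(CCBOT_TOKEN.search(user_agent))


def ccbot_requests(log_lines):
    """Return the log lines whose trailing User-Agent field mentions CCBot."""
    hits = []
    for line in log_lines:
        match = UA_FIELD.search(line)
        if match and is_ccbot(match.group(1)):
            hits.append(line)
    return hits


# Illustrative sample lines; the addresses, paths and user-agent strings
# are made up for this example.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 512 '
    '"-" "CCBot/2.0 (https://commoncrawl.org/faq/)"',
    '198.51.100.7 - - [01/Jan/2025:00:00:02 +0000] "GET /about HTTP/1.1" 200 734 '
    '"-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    for hit in ccbot_requests(SAMPLE_LOG):
        print(hit)
```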
