Which AI Crawlers Should Local Businesses Allow?

A robots.txt and crawler-access guide for service brands that want AI visibility without opening every page to training crawlers.

Amadeus Peterson, CTO & Co-Founder, Cheers · 9 min read · May 24, 2026

[Figure: Crawler policy. Access split by job (search crawlers, user fetches, training bots): search retrieval, user-request fetches, and model training should not share one blanket rule. The robots.txt panel allows OAI-SearchBot and disallows GPTBot and Google-Extended. Caption: Keep public location pages reachable while documenting training and private-path limits.]

Multi-location service brands are starting to treat robots.txt like an AI visibility switch. That is too simple.

The better question is not whether to allow or block "AI crawlers." It is which crawler job you are making a decision about. Search retrieval, user-requested page fetches, model training, ad review, and random scraping are different activities. A crawler policy that treats them the same can quietly remove location pages from answer engines, or open content to uses the brand never intended.

For a home services rollup, franchise system, med spa group, or hospitality operator, the practical goal is narrower: keep public location and service pages accessible to legitimate search and answer surfaces, limit training use where the business chooses to, and verify that CDN or bot-management settings are not blocking the pages that need to be found.

Important

Do not copy a generic "block all AI bots" robots.txt template into a multi-location site. It can protect content from some training crawlers, but it can also make real locations harder for AI search systems to retrieve, cite, or summarize.

[Image: Restoration company operations manager working at a service counter with a laptop and handheld radio. Caption: Crawler access should separate search visibility from model-training permissions.]

Start with the crawler's job

OpenAI's crawler docs split access into separate user agents. OAI-SearchBot is for ChatGPT search features. GPTBot is for crawling content that may be used to train foundation models. ChatGPT-User is triggered by certain user actions and is not the control OpenAI tells site owners to use for Search opt-outs.

Anthropic's current crawler page makes a similar separation. It lists ClaudeBot for model development, Claude-SearchBot for search, and Claude-User for retrieving content at a user's direction. Anthropic says the bots honor industry-standard robots.txt directives and describes how site owners can block or slow specific bots.

Perplexity documents PerplexityBot and says it will not index full or partial text content from a site that disallows it in robots.txt, while still noting that it may index limited facts such as the domain, headline, and a brief summary.

Google is different because AI Overviews and AI Mode are Search features. Google says the same SEO fundamentals apply, and that pages must be indexed and eligible to appear in Google Search with a snippet to be eligible as supporting links in AI features. Google also says Googlebot, not Google-Extended, is the control for crawling in Search. Google-Extended is a product token for managing certain Gemini training and grounding uses outside regular Search crawling.

That separation matters for local service businesses. Blocking a training crawler may be a reasonable brand decision. Blocking a search crawler, a user-request fetcher, or Googlebot can reduce visibility in the exact answer surfaces the business wants to influence.
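The split described above can be written down as a small lookup table, so a policy discussion starts from the crawler's job rather than from the word "AI." This is a minimal sketch: the tokens and purposes are the ones summarized in this section, the `Job` categories and the `tokens_for` helper are hypothetical names, and the vendor documentation should be checked before any of it is relied on.

```python
from enum import Enum


class Job(Enum):
    SEARCH = "search retrieval"
    USER_FETCH = "user-requested fetch"
    TRAINING = "model training"


# Tokens and purposes as summarized in this article. Treat the mapping as an
# assumption to re-check against each vendor's current crawler documentation.
CRAWLER_JOBS = {
    "Googlebot": Job.SEARCH,
    "OAI-SearchBot": Job.SEARCH,
    "Claude-SearchBot": Job.SEARCH,
    "PerplexityBot": Job.SEARCH,       # assumption: treated here as a search/index crawler
    "ChatGPT-User": Job.USER_FETCH,
    "Claude-User": Job.USER_FETCH,
    "GPTBot": Job.TRAINING,
    "ClaudeBot": Job.TRAINING,
    "Google-Extended": Job.TRAINING,   # product token: certain Gemini training and grounding uses
}


def tokens_for(job):
    """Return the crawler tokens assigned to one job, sorted for stable output."""
    return sorted(token for token, j in CRAWLER_JOBS.items() if j is job)
```

Calling `tokens_for(Job.TRAINING)` lists the tokens a brand would review when it decides on model-training reuse, without touching the search or user-fetch groups.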

A selective policy is safer than a broad block

Most operators do not need a philosophical crawler policy. They need a practical one that a marketing lead, SEO lead, and web developer can maintain across hundreds of location URLs.

A reasonable starting point is to group crawlers by purpose:

Allow search and discovery crawlers for the surfaces where the business wants to appear, including Googlebot for Google Search AI features and OAI-SearchBot for ChatGPT search when ChatGPT visibility matters
Decide separately on training crawlers such as GPTBot, ClaudeBot, and Google-Extended, based on the brand's content-use policy, legal posture, and appetite for model-training reuse
Keep user-triggered fetches usable when appropriate, because assistants may fetch a page when a user asks about a specific business, booking page, or source
Block private, duplicate, parameter-heavy, and internal paths, such as staging pages, admin routes, search-result pages, cart flows, or campaign parameters that do not help a buyer choose a location
Verify server behavior beyond robots.txt, because CDN rules, WAF challenges, bot toggles, and IP allowlists can override a permissive robots file

This is not a recommendation to allow everything. It is a recommendation to stop treating "AI" as one crawler class.
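One way to keep that grouping maintainable is to generate the file from per-crawler decisions instead of editing it by hand. The sketch below is illustrative only: `build_robots_txt`, its parameters, and the example paths are hypothetical, and it assumes that a crawler matching a named group follows only that group, which is how the robots.txt standard describes group selection.

```python
def build_robots_txt(decisions, private_paths, sitemap_url):
    """Build robots.txt text from per-crawler decisions.

    decisions maps a crawler token to True (allow public pages) or
    False (block the whole site for that token).
    """
    lines = []
    for token, allowed in decisions.items():
        lines.append(f"User-agent: {token}")
        if allowed:
            # Repeat the private paths: a crawler that matches a named
            # group does not also read the "*" group.
            lines += [f"Disallow: {path}" for path in private_paths]
            lines.append("Allow: /")
        else:
            lines.append("Disallow: /")
        lines.append("")

    # Fallback group for every crawler without its own group.
    lines.append("User-agent: *")
    lines += [f"Disallow: {path}" for path in private_paths]
    lines.append("Allow: /")
    lines.append("")
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


example = build_robots_txt(
    {"OAI-SearchBot": True, "GPTBot": False},
    ["/api/", "/admin/"],
    "https://www.example.com/sitemap.xml",
)
```

Keeping the decisions in one dictionary makes the training-versus-search choice reviewable by the marketing, SEO, and web leads before anything is deployed.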

For many companies that fit the Cheers ideal customer profile (ICP), the highest-value public pages are location pages, service pages, reviews or proof pages, comparison pages, and helpful educational articles. Those pages should usually be crawlable if they support local demand. Internal dashboards, lead-routing APIs, private reports, and thin duplicate pages should not be.

What a local-service robots.txt might express

The exact file depends on the site's legal policy, CMS, CDN, and growth priorities. The point is to make search access, training access, and private-path access explicit.

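# Search crawler: public pages allowed, private paths repeated here
User-agent: OAI-SearchBot
Disallow: /api/
Disallow: /admin/
Disallow: /private/
Allow: /

# Training crawler
User-agent: GPTBot
Disallow: /

# Product token for certain Gemini training and grounding uses
User-agent: Google-Extended
Disallow: /

# All other crawlers
User-agent: *
Disallow: /api/
Disallow: /admin/
Disallow: /private/
Allow: /

Sitemap: https://www.example.com/sitemap.xml

That example allows ChatGPT's search crawler on public pages, blocks two training or non-Search-use controls, blocks private paths for every crawler, and keeps the public site crawlable. The private-path rules are repeated in the OAI-SearchBot group because a crawler that matches a named group follows only that group and does not also apply the User-agent: * rules. It is not a universal template. It does not cover Anthropic, Perplexity, Bing, Apple, Common Crawl, or every CDN-level bot rule. It also does not solve content quality, entity clarity, source coverage, or review strength.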

Robots.txt only answers one question: which crawlers are instructed not to fetch which paths. It does not prove that a page deserves to be recommended.
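That one question can be checked before a file ships. The sketch below uses Python's standard-library `urllib.robotparser` on a file shaped like the example in this section, with the private-path rules repeated in the named search group as an assumption. The `allowed` helper and the example.com URLs are hypothetical. Python's parser applies the first matching rule in file order, which can differ from a crawler that uses longest-match precedence, so treat the result as a sanity check rather than proof of how any specific crawler behaves.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Disallow: /api/
Disallow: /admin/
Disallow: /private/
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /api/
Disallow: /admin/
Disallow: /private/
Allow: /
"""


def allowed(robots_txt, token, url):
    """Return True if this robots.txt text lets `token` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


location_page = "https://www.example.com/locations/denver/"
for token in ("OAI-SearchBot", "GPTBot", "Google-Extended", "Googlebot"):
    print(token, allowed(ROBOTS_TXT, token, location_page))
```

A check like this only covers the robots file itself. It says nothing about CDN rules, status codes, or whether the page is worth recommending.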

For that broader work, read "How Local Businesses Can Show Up in Google AI Search" and "The Citation Stack for AI Search."

The multi-location failure mode is partial blocking

Large local-service sites rarely fail crawler access in one obvious way. They fail by market, subdomain, path, brand, or acquisition.

A PE-backed HVAC platform may have the corporate domain crawlable while acquired brands sit behind older robots rules. A franchise may allow crawlers on the marketing site while blocking franchisee microsites or appointment subdomains at the CDN. A med spa group may allow Googlebot but challenge every non-Google bot with JavaScript, which can stop answer engines from reading public service pages. A home services brand may accidentally block parameterized URLs that canonical location pages depend on.

Partial blocking is hard to spot because normal browser checks still work. The marketing team can load the page. Google Search Console may look clean for the main domain. The location page may be in the sitemap. But the AI crawler or user agent that matters sees a 403, a bot challenge, a nosnippet rule, or an empty rendered page.
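Those failure modes are easier to track when every fetch gets the same label. The function below is a rough sketch, not a detector: `classify_fetch`, the label names, the challenge marker strings, and the 500-character threshold are all assumptions chosen for illustration, and real challenge pages vary by vendor.

```python
# Hypothetical markers; real bot-challenge pages differ by CDN and configuration.
CHALLENGE_MARKERS = ("just a moment", "verify you are human", "captcha")
MIN_BODY_CHARS = 500  # assumption: anything shorter is treated as missing main content


def classify_fetch(status, headers, body):
    """Label one crawler-style fetch. `headers` is a dict with lowercase keys."""
    if status in (401, 403):
        return "blocked"
    if status >= 400:
        return "error"
    if 300 <= status < 400:
        return "redirect"

    text = body.lower()
    if any(marker in text for marker in CHALLENGE_MARKERS):
        return "challenge"

    # X-Robots-Tag can carry noindex or nosnippet without touching the HTML.
    robots_header = headers.get("x-robots-tag", "").lower()
    if "noindex" in robots_header or "nosnippet" in robots_header:
        return "restricted"

    if len(body.strip()) < MIN_BODY_CHARS:
        return "empty"
    return "ok"
```

Running the same function over every market, subdomain, and acquired domain turns "it loads in my browser" into a table where the partial blocks stand out.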

This is why crawler policy belongs in a recurring technical audit, not a one-time robots.txt edit. The audit should cover the apex domain, location directories, legacy acquisition domains, booking subdomains, blog paths, image paths, and proof pages. It should test robots.txt, HTTP status, renderability, canonical tags, snippets, structured data, and CDN bot settings.
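Several of those page-level checks can be read straight from the HTML. The sketch below uses Python's standard-library `html.parser` to pull the canonical URL, robots meta directives, and the presence of JSON-LD structured data from one page. The `PageSignals` class and `inspect_html` helper are hypothetical names, and the sketch does not render JavaScript, so it will miss anything injected client-side.

```python
from html.parser import HTMLParser


class PageSignals(HTMLParser):
    """Collect canonical, robots meta, and JSON-LD signals from raw HTML."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.robots_directives = set()
        self.has_json_ld = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)  # values may be None for bare attributes
        if tag == "link" and (attributes.get("rel") or "").lower() == "canonical":
            self.canonical = attributes.get("href")
        elif tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.robots_directives |= {
                part.strip().lower() for part in content.split(",") if part.strip()
            }
        elif tag == "script" and (attributes.get("type") or "").lower() == "application/ld+json":
            self.has_json_ld = True


def inspect_html(html):
    """Parse one HTML document and return the collected signals."""
    signals = PageSignals()
    signals.feed(html)
    return signals
```

A location page whose `robots_directives` contains `noindex` or `nosnippet`, or whose `canonical` points at a different market, is worth flagging even when robots.txt and the status code look clean.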

If the business needs a repeatable branch-level measurement layer, pair this crawler check with "How to Audit AI Search Visibility Across Locations."

If the same business also has inconsistent third-party listings, the crawler problem becomes an entity problem. "Why AI Treats Your 50 Locations Like 50 Strangers" explains that side of the work.

Do not confuse crawler access with permission to win

Allowing the right crawler is only the entry ticket. It means a search or AI system can try to fetch a page. It does not mean the page is useful, indexed, cited, or selected.

Google says there are no special technical requirements for appearing in AI Overviews or AI Mode beyond eligibility for Search and snippets, and it points site owners back to crawlability, internal links, page experience, textual content, images, structured data that matches visible text, and up-to-date Business Profile information.

For local-service brands, those fundamentals have a location-level interpretation. A crawlable page should identify the local branch, services, service area, contact path, business profile relationship, review proof, and the real-world problem the customer is trying to solve. A generic city page that exists only to target a keyword may be accessible, but still not useful enough to support an AI recommendation.

The mistake is to stop at "we allowed the bot." The better question is whether the page gives the system enough public evidence to say why this location is a credible provider for this job in this market.

Use llms.txt as context, not as the policy layer

An llms.txt file can help explain a site to AI tools that choose to read it. It can point crawlers or agents toward high-value pages, documentation, and business context. It should not replace robots.txt, noindex, authentication, or CDN-level controls.

For local-service brands, llms.txt is useful when it summarizes the site map, service categories, locations, proof pages, and preferred canonical resources. It is not useful when it becomes a dumping ground for every keyword page or a fake permission system.

The control layer is still robots.txt, page-level meta directives, HTTP headers, authentication, and edge rules. The context layer can include llms.txt, structured data, internal links, and clear page copy. Keep those jobs separate.
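As a context-layer illustration, the sketch below assembles an llms.txt from a short list of high-value pages. It assumes the layout from the community llms.txt proposal as commonly described (a title line, a short blockquote summary, then sections of links). No crawler is confirmed here to read or honor the file, and the `build_llms_txt` function, business name, and URLs are hypothetical.

```python
def build_llms_txt(name, summary, sections):
    """Build llms.txt text.

    sections maps a section heading to a list of (title, url, note) tuples.
    """
    lines = [f"# {name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines)


llms_txt = build_llms_txt(
    "Example Home Services",
    "Multi-location HVAC and plumbing provider. Location pages are the canonical source for branch details.",
    {
        "Locations": [
            ("Denver", "https://www.example.com/locations/denver/", "Branch page with services and service area"),
        ],
        "Proof": [
            ("Reviews", "https://www.example.com/reviews/", "Customer reviews by location"),
        ],
    },
)
```

Nothing in this file grants or denies access. A path listed here but disallowed in robots.txt, or blocked at the CDN, stays blocked.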

Read "What Is LLMs.txt, and Should Your Business Use One?" if your team is deciding whether to add one.

What to inspect this week

Before changing policy across hundreds of locations, pick ten revenue-critical URLs: three location pages, three service pages, two proof or review pages, one article, and one booking or contact path. Fetch each with Googlebot, OAI-SearchBot, GPTBot, ClaudeBot, Claude-SearchBot, PerplexityBot, and a normal browser user agent. Record whether each request returns a useful page, a robots block, a 403, a challenge, a redirect loop, a noindex or nosnippet directive, or a page with missing main content.
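The ten-by-seven check can be set up as a simple test matrix. The sketch below is an approximation under stated assumptions: it sends only the crawler token as the User-Agent header, while real crawlers send longer strings and some CDNs verify crawlers by IP range, so a spoofed request may be treated differently from the real bot. `build_matrix`, `fetch_status`, and the `Mozilla/5.0` browser stand-in are hypothetical choices.

```python
import urllib.error
import urllib.request
from itertools import product

# Crawler tokens named in this article, plus a stand-in for a normal browser.
USER_AGENT_TOKENS = [
    "Googlebot",
    "OAI-SearchBot",
    "GPTBot",
    "ClaudeBot",
    "Claude-SearchBot",
    "PerplexityBot",
    "Mozilla/5.0",
]


def build_matrix(urls, tokens=USER_AGENT_TOKENS):
    """Return one row per (URL, user agent) pair to fetch and record."""
    return [{"url": url, "user_agent": token} for url, token in product(urls, tokens)]


def fetch_status(url, user_agent, timeout=10):
    """Fetch one URL with the given User-Agent; return (status, body_length)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, len(response.read())
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses, and redirect loops, surface here with a status code.
        return err.code, 0
    except urllib.error.URLError:
        # DNS failure, refused connection, or similar network-level problem.
        return None, 0
```

Ten URLs produce seventy rows. Any row where the browser stand-in succeeds and a search crawler token does not is the first place to look for a CDN or WAF rule.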

Then compare the result to the business decision. If search visibility matters, public location and service pages should be accessible to the search crawlers that feed those surfaces. If the brand does not want training reuse, document that separately. If abusive bots are creating load, handle them with log-based rules instead of blocking every documented AI crawler by default.
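A log-based rule starts with knowing who is actually generating the load. The sketch below counts requests per crawler token from access-log lines, assuming a Combined Log Format where the user agent is the last quoted field. The `requests_by_token` helper is hypothetical, and the sample lines are invented and simplified for illustration, not real crawler traffic.

```python
import re
from collections import Counter

# Assumes Combined Log Format: the user agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def requests_by_token(log_lines, tokens):
    """Count requests whose user agent contains each token (case-insensitive)."""
    counts = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # skip lines that do not end with a quoted field
        user_agent = match.group(1).lower()
        for token in tokens:
            if token.lower() in user_agent:
                counts[token] += 1
                break
        else:
            counts["other"] += 1
    return counts


# Hypothetical, simplified sample lines (documentation IP range, shortened user agents).
SAMPLE_LOG = [
    '203.0.113.5 - - [24/May/2026:10:00:00 +0000] "GET /locations/denver/ HTTP/1.1" 200 5120 "-" "GPTBot"',
    '203.0.113.5 - - [24/May/2026:10:00:01 +0000] "GET /locations/boulder/ HTTP/1.1" 200 4980 "-" "GPTBot"',
    '198.51.100.7 - - [24/May/2026:10:00:02 +0000] "GET /services/ HTTP/1.1" 200 7300 "-" "Mozilla/5.0"',
]
sample_counts = requests_by_token(SAMPLE_LOG, ["GPTBot", "ClaudeBot", "OAI-SearchBot"])
```

Counts like these give the team evidence for a targeted rate limit or block on the agent causing the problem, instead of a blanket rule that also removes the search crawlers.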

The Cheers AI Visibility Grader can show how one business appears in AI search. The crawler audit explains whether the public pages that should support that visibility are even reachable.

Sources

Google Search Central: AI features and your website. Supports the point that AI Overviews and AI Mode use normal Search eligibility, crawling, snippets, internal links, structured data, and Business Profile freshness.
Google Crawling Infrastructure: Google-Extended. Explains that Google-Extended is a standalone product token, not a separate HTTP user agent.
OpenAI: overview of OpenAI crawlers. Defines OAI-SearchBot, GPTBot, and ChatGPT-User and their different purposes.
Anthropic Help Center: web crawling and blocking Anthropic bots. Documents ClaudeBot, Claude-SearchBot, Claude-User, and Anthropic's robots.txt guidance.
Perplexity Help Center: how Perplexity follows robots.txt. Documents PerplexityBot's stated behavior when a site disallows crawling.
Google Crawling Infrastructure: robots.txt specification. Supports the syntax and limitations of user-agent, allow, disallow, and sitemap rules.
Cloudflare bot solutions docs: managed robots.txt setting. Supports the point that CDN-managed robots settings and content signals can affect crawler instructions.

Amadeus Peterson is the CTO & Co-Founder of Cheers, the local search platform for multi-location service businesses.

Frequently Asked Questions

If the business wants a chance to appear in ChatGPT search answers, OpenAI's documentation recommends allowing OAI-SearchBot and its published IP ranges. That is different from allowing GPTBot for model training.

OpenAI separates GPTBot from OAI-SearchBot. GPTBot is used for training foundation models, while OAI-SearchBot is used for ChatGPT search features. Use the current OpenAI crawler docs before changing either directive.

Google-Extended does not control Google's AI features in Search. Google says Googlebot controls crawling for Search, including AI features in Search. Google-Extended is a separate product token for managing certain Gemini training and grounding uses outside normal Search crawling.

Robots.txt is a crawler instruction file, not an authentication system. Well-documented crawlers may honor it, but brands should also verify CDN, hosting, bot-management, and server-log behavior.

For public service and location pages, allowing search crawlers is usually the right default. Multi-location brands should keep those pages crawlable, block private or thin internal surfaces, and avoid accidental CDN rules that block only some markets or subdomains.
