Chapter 10 · Web Discovery

95.9% of hotel websites have no AI crawler rules — most are invisible by accident, not by choice

We parsed robots.txt from 104,214 hotel websites across seven countries. Almost none have intentional rules for AI crawlers — the bots that power ChatGPT, Perplexity, and Google AI Mode. Here’s what that means and the one configuration change that fixes it.

Apr 30, 2026 · 6 min read · Huxo Research

Why this matters to you

When ChatGPT performs a live web search for hotel recommendations, it sends a crawler to your website. So does Perplexity. So does Google AI Mode. Whether those crawlers can read your site — or are blocked from it — is controlled by a single file at the root of your domain: robots.txt.

Most hoteliers set up robots.txt years ago and never revisited it. That file was written for Google and Bing. It says nothing about the new generation of AI crawlers. And that silence has real consequences — in both directions.

We parsed robots.txt from 104,214 hotel websites to find out what hotels are actually telling AI crawlers — and whether any of it is intentional.

95.9% of hotels have no AI-specific crawler rules. Most are open to ChatGPT and Perplexity by accident, not by strategy.

Key findings at a glance

01

95.9% have zero AI-specific blocking rules

Only 4.1% of hotel websites have any AI crawler rule at all. The other 95.9% are open to every bot by default — including training crawlers that build LLM knowledge bases.

02

Training bots are blocked 2.5× more than search bots

Among hotels that do block AI crawlers, training bots (GPTBot, Google-Extended, CCBot) are blocked at 2.5× the rate of search bots. Hotels are more concerned about training data than visibility.

03

Only 2.4% have the optimal configuration

Block training bots, allow search bots. That’s the strategic approach — protecting content from model training while staying visible in ChatGPT and Perplexity answers. Only 2.4% of hotels do this.

04

France is an outlier at 8.1%

One hotel chain’s coordinated decision accounts for ~970 properties. Remove that chain and France drops to 2.3% — matching the US rate. Chain-level decisions dominate the data.


What this means for your hotel

There are two types of AI bots crawling hotel websites, and they have completely different implications:

Training crawlers vs. search crawlers: different bots, different consequences

| Bot type | Examples | What it does | Block it? |
| Training crawlers | GPTBot, Google-Extended, CCBot | Builds the model's knowledge base | Your choice |
| Search crawlers | OAI-SearchBot, PerplexityBot, Googlebot | Powers real-time AI search answers | Generally no |

Blocking a training crawler means your content won’t be used to train future model versions — a legitimate choice some hotels make for content ownership reasons. Blocking a search crawler means your hotel disappears from ChatGPT and Perplexity answers entirely.

The 0.2% of hotels blocking search bots while allowing training bots have it exactly backwards. They’re donating their content to train AI models while getting zero visibility in return.

The default is not neutral

A robots.txt with no AI rules doesn’t mean “do nothing.” It means “allow everything” — training crawlers included. If you have no AI rules today, you are actively opted in to training data collection. That may be fine. But it should be a decision, not an oversight.
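The default-allow behavior is easy to verify with Python's standard-library robots.txt parser. A minimal sketch, assuming a typical legacy file (the rules and URLs here are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# A legacy robots.txt written for traditional search engines.
# It never mentions any AI crawler. (Hypothetical example.)
legacy_rules = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(legacy_rules.splitlines())

# Every AI crawler falls through to the permissive wildcard record,
# so "no AI rules" means "all AI bots allowed".
for bot in ("GPTBot", "CCBot", "OAI-SearchBot", "PerplexityBot"):
    print(bot, parser.can_fetch(bot, "https://example.com/rooms/"))
```

All four bots are allowed: silence in the file is an opt-in, not a neutral state.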


What to do about it

1. Add explicit rules for AI search crawlers.

At minimum, explicitly allow the search crawlers that power AI answers. This isn’t about gaming anything — it’s confirming that these bots are welcome to read your site and include you in answers.

# Allow AI search crawlers — these power ChatGPT, Perplexity, Google AI Mode
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Googlebot
Allow: /

# Training crawlers — block if you prefer not to be in future training data
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

Sitemap: https://yourdomain.com/sitemap.xml

2. Make a conscious decision about training bots.

There is no universally right answer. If you want your content to influence how future AI models understand your property, allow training bots. If you prefer to keep control over your content, block them. Both are defensible positions — what isn’t defensible is not having a position.

3. Always include your Sitemap directive.

robots.txt is also where crawlers discover your sitemap. If the Sitemap directive is missing, some crawlers won’t find all your pages. This affects both traditional search engines and AI crawlers.
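You can confirm the directive is being picked up with Python's standard-library parser, which exposes declared sitemaps via `site_maps()` (Python 3.8+; the domain below is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Placeholder rules — substitute your real domain.
rules = """\
User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# site_maps() returns the declared sitemap URLs, or None when the
# Sitemap directive is missing entirely.
print(parser.site_maps())
```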


The evidence

Finding 1 — Overall blocking rates

Share of hotels with AI crawler rules

104,214 hotel websites parsed · April 2026

No AI-specific rules (open by default)
95.9%
Block any AI crawler
4.1%
Block training + allow search (optimal)
2.4%
Block all AI bots comprehensively
1.2%
Block search + allow training (backwards)
0.2%

The 95.9% bar is capped at full width for readability; all other bars are drawn proportional to their true values.

Finding 2 — Which bots are blocked most

Among hotels with any AI crawler rules, training bots dominate. The gap between training bot blocking rates and search bot rates is consistent: roughly 2.5× across all three training bot categories.

Bot-specific blocking rates — share of all 104,214 hotels

Training bots vs. search bots

GPTBot (OpenAI training)
3.6%
Google-Extended (Gemini training)
3.3%
CCBot (Common Crawl training)
3.1%
OAI-SearchBot / PerplexityBot (search)
~1.3%

OAI-SearchBot and PerplexityBot are grouped as their blocking rates were nearly identical (~1.3% each).

2.5×

How much more often training bots are blocked vs. search bots. Average training bot blocking rate: 3.33%. Search bot blocking rate: ~1.3%. Hotels are more protective of training data than they are concerned about AI search visibility.

Finding 3 — Country breakdown

Country-level variation is mostly explained by chain-level decisions, not individual hotel choices. France’s outlier status collapses entirely once a single chain is excluded.

Share of hotels with any AI crawler rule, by country

104,214 hotels across 7 countries

France
8.1%
Spain
4.4%
United States
2.3%
United Kingdom
2.2%
Germany
2.1%
Italy
1.9%
Netherlands
1.7%

France drops to 2.3% when a single chain (~970 properties) with a coordinated block policy is excluded, matching the US rate exactly.


Frequently asked questions

If I don’t block AI crawlers, does that mean they can scrape all my content?

Yes, by default. A robots.txt with no rules — or no robots.txt at all — is an open invitation to all crawlers. If you want to keep AI training bots from crawling your site, you need to explicitly add Disallow rules for GPTBot, Google-Extended, and CCBot.

Will blocking GPTBot hurt my visibility in ChatGPT?

It depends on which bot ChatGPT uses for a given query. GPTBot is the training crawler — blocking it affects what future model versions learn about you. OAI-SearchBot is the live search crawler — blocking it removes you from real-time ChatGPT answers. They are different bots with different functions.

Does Googlebot cover Google AI Mode?

Yes. Google AI Mode uses the same Googlebot infrastructure as traditional search. Allowing Googlebot covers both traditional Google Search results and Google AI Mode recommendations.

Should I block training crawlers?

There is no universal answer. Blocking training crawlers means future AI models won’t learn from your content directly — which could reduce how naturally AI engines describe your property over time. Allowing them means your content contributes to model training, which may improve how AI understands and recommends you. Both are legitimate positions.

How do I check what my current robots.txt says?

Visit yourdomain.com/robots.txt in any browser. If you get a 404, you have no robots.txt and all crawlers have full access by default. If you see content, check whether any of the AI crawler user-agents listed in this article appear.
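To go beyond eyeballing the file, a short script can report each bot's effective access. A sketch using Python's standard library; `check_ai_access` and the sample rules are illustrative, not part of the study:

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ("GPTBot", "Google-Extended", "CCBot",
           "OAI-SearchBot", "PerplexityBot", "Googlebot")

def check_ai_access(robots_txt: str) -> dict:
    """Map each AI crawler to whether it may fetch the homepage."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, "/") for bot in AI_BOTS}

# Sample file implementing the pattern recommended in this article:
# block the training bot, allow the search bot.
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""
for bot, allowed in check_ai_access(sample).items():
    print(f"{bot:16} {'allowed' if allowed else 'BLOCKED'}")
```

Paste in the contents of your own robots.txt to see which category your hotel falls into today.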


How we ran the study

104,214

Hotels parsed

7

Countries

14

AI crawlers checked

10s

Request timeout

We fetched and parsed robots.txt from 104,214 reachable hotel websites with a 10-second timeout per request. Hotel URLs were sourced across seven countries: United States, United Kingdom, France, Germany, Spain, Italy, and the Netherlands.

For each file, we checked for 14 known AI crawler user-agents (training and search) plus traditional search engine bots. We classified each hotel into one of: no AI-specific rules, blocks training bots only, blocks search bots only, or blocks all AI bots comprehensively.
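The classification step can be sketched as follows. This is a simplification of the actual pipeline: the bot lists are a subset of the 14 agents checked, and the string-matching heuristic for "AI-specific" is illustrative:

```python
from urllib.robotparser import RobotFileParser

# Illustrative subsets — the study checked 14 AI user-agents in total.
TRAINING_BOTS = ("GPTBot", "Google-Extended", "CCBot")
SEARCH_BOTS = ("OAI-SearchBot", "PerplexityBot")

def classify(robots_txt: str) -> str:
    """Bucket one robots.txt into the study's categories (simplified)."""
    # A hotel only counts as having AI rules if an AI user-agent is
    # named explicitly; wildcard-only files fall through to "no rules".
    lower = robots_txt.lower()
    if not any(b.lower() in lower for b in TRAINING_BOTS + SEARCH_BOTS):
        return "no AI-specific rules"
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    training = any(not parser.can_fetch(b, "/") for b in TRAINING_BOTS)
    search = any(not parser.can_fetch(b, "/") for b in SEARCH_BOTS)
    if training and search:
        return "blocks all AI bots"
    if training:
        return "blocks training only"
    if search:
        return "blocks search only"
    return "no AI-specific rules"  # AI bots mentioned but still allowed
```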

Limitations. robots.txt compliance is voluntary — well-behaved bots respect it, but not all crawlers do. Our analysis covers stated policy, not actual crawler behavior. Hotels with no robots.txt file were counted as “no AI-specific rules.”


Want to know if AI engines can actually find your hotel?

Huxo’s AI Visibility Report audits your hotel across ChatGPT, Perplexity, and Google AI Mode — including a full crawlability check and robots.txt analysis.
