Chapter 11 · Surface Comparison

Same AI, two tools — 11% of hotel picks from the developer API come from memory alone. On the ChatGPT website, we never saw it happen once.

We ran the same 20 hotel queries 30 times each through ChatGPT’s website and through the developer API that serves the same model. Both recommend the hotel about 1 in 5 times — but the mechanics underneath aren’t the same, and a test on one tool doesn’t predict what you’ll see in the other.

Apr 19, 2026 · 7 min read · Huxo Research

Why this matters to you

The same AI model reaches travellers in two very different places. One is the ChatGPT website at chatgpt.com, where a traveller types a question and reads the answer. The other is the same AI sitting behind apps — booking assistants, travel tools, concierge products, anything a developer builds on top of it. Same brain, different doorways.

We ran the same 20 hotel-search questions 30 times each through both doorways — about 1,200 questions in total — and checked whether one target independent hotel got recommended.

The headline rate is almost identical: about 1 in 5 times, on both. But the paths the AI takes to get there are different. On the website, it almost always looks up information on the live web before answering. Through the developer version, it sometimes answers straight from memory — and when it does, the hotel still gets recommended 11% of the time. Across roughly 130 no-lookup answers on the website in our sample, we never saw that happen once.

Same AI, same questions — on the developer version, 1 in 9 recommendations came from memory alone. On the ChatGPT website, across roughly 130 no-lookup answers, we never saw it happen once.

Key findings at a glance

01

About 1 in 5 recommendations, on both tools

The hotel shows up in 19.4% of answers on the ChatGPT website and 21.7% of answers from the developer version. The gap is within chance — statistically, the same. Both pick the hotel about one time in five.

02

Both look up the web often. The website looks up more

The ChatGPT website pulls in live web results on 78% of answers. The developer version does the same on 57%. Both lean on the live web most of the time; the website just leans harder. Either way, a hotel that isn’t findable online is in trouble on both tools.

03

Only the developer version ever recommended from memory in our sample

When the AI skipped the web lookup, the developer version still recommended the hotel 29 times out of 257 — about 11%. On the website side, across 134 no-lookup answers, this never happened once. It’s a side of AI visibility that has nothing to do with today’s search results.

04

They agree on which questions work

Some of our 20 questions produced more recommendations than others. When we rank questions by how often the hotel got picked, the two tools agree on that ranking (moderate positive correlation, statistically significant). So ‘which kinds of travellers are most likely to get your hotel recommended’ is something both tools tell you the same way.


What this means for your hotel

A “1 in 5” recommendation rate isn’t one number. It’s the sum of two different ways the AI can pick your hotel: by looking it up on the live web, or by remembering it from what it learned during training. Both tools use both paths — but they lean on them differently.

The bigger lesson: being findable on the live web carries you most of the way on both tools, because both pull in web results the majority of the time. If your hotel isn’t showing up in today’s search results, it’s in trouble everywhere. Think of this like the ground floor of AI visibility — without it, nothing else matters.

The subtler lesson: there’s a second floor. On the developer version of ChatGPT, the AI sometimes answers without looking anything up — and still recommends the hotel 11% of the time. On the website, we never saw that happen in our sample. If your competitor shows up in someone’s booking app even when the AI didn’t search the web, that competitor has worked its way into the AI’s long-term memory in a way you haven’t. That’s a slower, deeper kind of visibility — and a separate thing to work on.

The two tools are not interchangeable

Checking your hotel on chatgpt.com answers a different question than checking it inside a booking app or travel tool. A good result on one doesn’t predict a good result on the other. If you care about both places where AI is recommending hotels, you have to check both.


What to do about it

1. Don’t rely on a chatgpt.com check alone.

A strong result when you type your hotel into chatgpt.com tells you one thing: how you look on that website. It does not tell you how you look inside booking apps, concierge tools, travel assistants, or anything else built on top of the same AI. Only one of these two places ever recommends from memory, and the overall prominence patterns aren’t identical either. If any of those app-and-tool places matter to your bookings, you need to check those too — or have someone check them for you.

2. Win the live web first. Everything else is downstream.

Both tools pull in live web results most of the time — 78% on the website, 57% on the developer version. That means being visible in today’s search results is the single highest-leverage thing you can do, because it affects both places. Focus on what travellers and journalists write about you: coverage on major travel outlets, regional tourism boards, Wikipedia. Those are the pages AI pulls from when it decides to look something up.

3. Think long-term about how the AI remembers you.

The 11% of “no-lookup” recommendations we measured come from a different place: what the AI already knew before it ever searched. You can’t fix that this week with an SEO change. What builds it up is durable, long-lived web presence — the same hotel showing up under the same name on authoritative sources over years. This is a slower, deeper lever than live-search ranking, but it’s the one that determines whether the AI knows your hotel exists without having to look.


The evidence

Finding 1 — Same destination, same rate

Across 598 answers on the ChatGPT website and 600 answers from the developer version, the hotel was recommended at rates you can’t tell apart: 19.4% vs 21.7%. The gap is not statistically significant — it’s well within the noise you’d expect from running the same test twice. By this measure, they’re the same.

Recommendation rate — same AI, two tools

598 ChatGPT website answers · 600 developer-version answers · 20 questions × 30 repetitions

ChatGPT website · 19.4%
Developer version · 21.7%

Not statistically significant. The two tools recommend the hotel at the same rate overall. (See Methodology for exact test statistics.)
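The “not statistically significant” verdict can be sanity-checked from the published numbers alone. Below is a standard-library sketch of a two-sided Fisher’s exact test; the cell counts are our reconstruction from the reported rates (19.4% of 598 and 21.7% of 600), not the study’s raw data.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of every table with the same margins whose
    hypergeometric probability is no larger than the observed table's.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, row1)

    def p_table(x):  # probability that the top-left cell equals x
        return comb(col1, x) * comb(n - col1, row1 - x) / denom

    p_obs = p_table(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    probs = [p_table(x) for x in range(lo, hi + 1)]
    return sum(q for q in probs if q <= p_obs * (1 + 1e-9))

# Recommended vs. not recommended, per tool; counts reconstructed from
# the reported rates (19.4% of 598, 21.7% of 600):
p = fisher_exact_two_sided(116, 598 - 116, 130, 600 - 130)
```

With these counts the p-value lands well above 0.05, consistent with the p = 0.35 reported in the Methodology.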

Finding 2 — How the AI gets to the answer

The bottom-line rate is a sum of two parts: answers that came after the AI looked something up on the web, and answers where it didn’t. The two tools split that sum differently. And one of them uses a path that, in our sample, the other never took.

How the AI gets to each answer

Top two rows: how often the AI looked up the web. Bottom two: recommendations made with no lookup.

Website · looked up web · 77.6%
Developer · looked up web · 57.2%
Website · from memory · 0%
Developer · from memory · 11.3%

Both tools pull in live web results most of the time — the website 78%, the developer version 57%. But only the developer version ever recommended the hotel without a lookup in our sample (29 out of 257 no-lookup answers). Both the search-rate gap and the no-lookup gap are statistically significant. (See Methodology for exact test statistics.)
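A quick way to see why the no-lookup gap clears the significance bar: under a shared memory-recommendation rate, ask how likely it is that every one of the 29 memory-based picks falls in the developer arm. This is the one-sided hypergeometric tail for the most extreme table, a back-of-envelope version of the full Fisher test reported in the Methodology:

```python
from math import comb

# No-lookup answers: 134 on the website (0 recommendations) and 257 on
# the developer version (29 recommendations). If both tools shared one
# underlying memory-recommendation rate, the chance that all 29
# memory-based picks land in the developer arm is the hypergeometric
# probability of the observed, most extreme table:
p_tail = comb(257, 29) / comb(391, 29)

# p_tail is on the order of 1e-6, comfortably below the p < 0.001
# threshold reported in the Methodology.
```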

Spearman ρ = 0.48

How well the two tools agree on which questions surface the hotel. When we rank each of the 20 questions by how often it led to a recommendation, the two rankings line up with a moderate positive correlation. Questions about airport-plus-spa, for example, work well on both tools. Questions about remote cycling trips work poorly on both. That per-question alignment is statistically significant and averages out the random variation you get when you ask any AI the same question twice.
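The 0.48 is a Spearman rank correlation. The sketch below shows the mechanics with the standard library only; the per-question rates in it are made-up placeholders, not the study’s data.

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Tied values receive the average of the ranks they span.
    """
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        i = 0
        while i < len(vs):
            j = i
            while j + 1 < len(vs) and vs[order[j + 1]] == vs[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average rank for the tied run
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical per-question recommendation rates (NOT the study's data),
# one value per question and per tool, just to show the mechanics:
website_rates = [0.45, 0.30, 0.05, 0.20, 0.00, 0.35, 0.10, 0.25]
developer_rates = [0.50, 0.20, 0.10, 0.30, 0.05, 0.25, 0.00, 0.35]
rho = spearman_rho(website_rates, developer_rates)
```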

Finding 3 — When they do recommend, they rank it differently

We classified every answer into one of four tiers. When the hotel does appear, the ChatGPT website commits harder — it’s more likely to call the hotel the top pick. The developer version hedges more, slotting the hotel into a longer list of options. Imagine two travel agents: one hands you a single recommendation; the other hands you a ranked shortlist. Same client, same data, different style.

The four prominence tiers we classified every answer into

Tier · What it means
Top pick · The hotel is presented as the single best choice for the question.
Recommended · The hotel is affirmatively recommended, but not as the top pick.
In list · The hotel appears in a list of options without any special emphasis.
Not mentioned · The hotel’s name does not appear in the answer at all.

ChatGPT website

598 answers — share of each tier

  • Not mentioned · 80.6%
  • Top pick · 8.5%
  • Recommended · 7.5%
  • In list · 3.3%

Developer version

600 answers — share of each tier

  • Not mentioned · 78.3%
  • In list · 11.2%
  • Recommended · 6.2%
  • Top pick · 4.3%

The shapes of the two pies are significantly different. The website concentrates mentions at the top; the developer version spreads them across lists. (See Methodology for exact test statistics.)

One way to read this: the ChatGPT website almost always checks the live web before answering, so when it mentions your hotel it’s because it just saw evidence for it — and it commits. The developer version mixes live-web answers with from-memory answers, so it’s less sure on any one answer and plays it safer with longer lists. Those are different styles of confidence, not different opinions about your hotel.
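The “significantly different” claim rests on a chi-squared test of independence over the 2×4 table of tier counts. A standard-library sketch, with counts reconstructed from the reported shares (so close to, but not guaranteed identical to, the raw counts):

```python
# Tier counts reconstructed from the reported shares
# (order: top pick, recommended, in list, not mentioned):
website = [51, 45, 20, 482]    # sums to 598
developer = [26, 37, 67, 470]  # sums to 600

n_web, n_dev = sum(website), sum(developer)
total = n_web + n_dev

chi2 = 0.0
for w, d in zip(website, developer):
    col = w + d
    # Expected counts if tool and tier were independent:
    exp_w = col * n_web / total
    exp_d = col * n_dev / total
    chi2 += (w - exp_w) ** 2 / exp_w + (d - exp_d) ** 2 / exp_d

# df = (2 - 1) * (4 - 1) = 3; the chi-squared critical value at
# p = 0.001 with 3 degrees of freedom is about 16.27, so a statistic
# in the mid-30s is significant at p < 0.001.
```

With these reconstructed counts the statistic should come out near the χ² = 34.4 reported in the Methodology.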


Frequently asked questions

Does this apply to my hotel?

If travellers might ask an AI for a hotel like yours — on chatgpt.com, inside a booking app, through a concierge service, anywhere built on AI — yes. We tested one specific hotel, but the pattern we measured is about the two tools, not about that hotel. The exact percentages would shift for your property; the structural gap (the website almost always checks the live web, the developer version sometimes answers from memory) is a property of the tools themselves.

Can I do anything about this right now?

Yes. The single highest-leverage move is making sure your hotel is easy to find on the live web today, because both tools pull in web results on a majority of answers. That means coverage on travel outlets, tourism boards, and authoritative sources under your canonical name. For the “from memory” side, there’s no quick fix — it builds up over years — but the same durable web presence feeds both channels over time.

How is this different from regular SEO?

Regular SEO is about ranking on Google for search queries. This is about being recommended by an AI when someone asks it to suggest a hotel. There’s overlap — both care about clean websites and good citations — but the AI pulls from a wider mix of sources, reads your content semantically rather than matching keywords, and (as this study shows) sometimes answers from what it already knew before ever searching. Think of AI visibility as a separate layer that sits on top of your existing SEO.

Would Claude or Gemini behave the same way?

Out of scope here. This comparison is specifically between the two places OpenAI’s AI reaches travellers. Other AI models — Anthropic’s Claude, Google’s Gemini — may behave differently, and we haven’t measured that. The pattern we did see (“same AI model behaves differently in different tools”) is likely to apply in those ecosystems too, but that’s a hypothesis, not a finding.

If a competitor shows up in AI tools and I don’t, how do I catch up?

Check the live-web side first: search your city on major travel outlets and tourism boards, and see whether your hotel is cited under its canonical name. If not, that’s the fastest lever — a few well-placed citations typically move visibility in weeks, not months. The “AI memory” side is slower: it’s built by being mentioned, repeatedly and consistently, on authoritative sources over years. Start with the live-web layer; plan for the memory layer.


How we ran the study

~1,200 · Total answers analysed
20 · Questions
30 · Repetitions per tool
2 · Tools compared

Questions. We wrote 20 natural-language hotel search questions — things a real traveller might ask (“family-friendly hotel near Munich airport, thermal spa nearby, good restaurant, free parking”). Each question implicitly matched one target independent hotel’s features (airport proximity, thermal spa, rural, restaurant, parking) without naming the hotel. Questions varied across traveller type and trip purpose.

How we ran it. We asked every question 30 times on both the ChatGPT website and on the developer version of the same AI (with web lookup enabled, so the AI itself chose whether to search). Both ran on the same underlying GPT-5.3 model. That gave us about 1,200 total answers to analyse.
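In outline, the collection loop is simple. The sketch below is a schematic, not the study’s harness: `ask` abstracts over the two surfaces, and `fake_ask` is a stub with made-up probabilities standing in for real API calls.

```python
import random

def run_surface(ask, questions, reps=30):
    """Ask every question `reps` times through one surface and tally.

    `ask(question)` must return (recommended, used_web) booleans. In the
    real study this would wrap either the chatgpt.com surface or the
    developer API; here it stays abstract.
    """
    t = {"answers": 0, "recommended": 0, "web_lookups": 0, "memory_recs": 0}
    for q in questions:
        for _ in range(reps):
            recommended, used_web = ask(q)
            t["answers"] += 1
            t["recommended"] += recommended
            t["web_lookups"] += used_web
            t["memory_recs"] += recommended and not used_web
    return t

# Stand-in stub with made-up probabilities, for illustration only:
def fake_ask(question):
    used_web = random.random() < 0.57
    recommended = random.random() < (0.25 if used_web else 0.11)
    return recommended, used_web

tallies = run_surface(fake_ask, [f"question {i}" for i in range(20)])
```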

How we measured mentions. A smaller AI model read each answer and classified it into one of four categories: the hotel wasn’t mentioned at all, it appeared in a list, it was recommended, or it was the top pick. We double-checked with a simple name search before the classification step, so the basic “mentioned or not” call is very reliable. Prominence tiers are slightly fuzzier but consistent between the two tools (same classifier, same rules).
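The “simple name search” gate can be sketched as a deterministic pre-check that runs before the LLM classifier. Everything here (the normalization rules, the hotel name) is illustrative, not the study’s actual code:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, strip accents, and collapse punctuation/whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def hotel_mentioned(answer: str, hotel_name: str, aliases=()) -> bool:
    """Deterministic pre-check: is the hotel named anywhere in the answer?

    Only answers that pass this gate go on to the fuzzier LLM step that
    assigns a prominence tier (top pick / recommended / in list).
    """
    haystack = f" {normalize(answer)} "
    return any(f" {normalize(name)} " in haystack
               for name in (hotel_name, *aliases))

# Illustrative example (the hotel name is made up):
answer = "For an airport stay with a spa, consider Hotel Müllerhof."
hotel_mentioned(answer, "Hotel Müllerhof")
```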

Statistical tests. Fisher’s exact test for binary rate comparisons (recommendation rate, search-rate, no-lookup recommendation rate). A chi-squared test for the full 4-class distribution. Spearman rank correlation for per-question alignment. Exact statistics: recommendation rate p = 0.35 (not significant); search rate p < 0.001; no-lookup recommendation rate p < 0.001; 4-class distribution χ² = 34.4, p < 0.001; per-question Spearman ρ = 0.48, p < 0.05. All tests are two-sided.

Limitations. One target hotel, one model family, one region. The structural findings (the two tools reach a similar average through different paths; only the developer version ever recommends from memory; both agree on which questions work) should apply to other hotels. The exact percentages are a single-hotel measurement and will shift for your property. We do not compare trial-by-trial agreement across surfaces: LLM output is stochastic, so two runs of the same tool on the same question also disagree on individual answers, and we didn’t run the within-tool control needed to isolate how much of any trial-level difference comes from the surfaces vs. from that baseline noise. Two of the 600 expected answers on the website side failed during collection — 0.3% of one arm — too small to change any of the conclusions.


Want to know if your hotel shows up on chatgpt.com and in the apps built on top of it?

Huxo’s AI Visibility Report tests your property across both places — the ChatGPT website travellers use, and the AI that powers booking apps and travel tools.
