Why this matters to you
A traveler might ask ChatGPT one day, Gemini the next, and Claude after that. If all three models agreed on which hotel to recommend, you’d only need to be visible to one of them to be visible everywhere. They don’t agree — so being a favorite of one model is not the same as being a favorite of another.
For an independent hotel, the practical question is: which models treat you fairly? And where should you invest your AI visibility effort?
Models agree with each other more than chance — but less than you’d expect.
Key findings at a glance
1. Reasoning tokens cut position sensitivity roughly in half. GPT-5.4 with medium reasoning drops STSR from 0.37 to 0.24. Gemini (which reasons by default) is the most stable at 0.18.
2. Models agree on 53–72% of picks. Given 10 options, random agreement would be ~10%. Real agreement runs from 53% (Claude → GPT-reasoning) to 72% (GPT → GPT-reasoning): consistent, but not convergent.
3. Claude picks an independent path. It shows the highest Jensen-Shannon divergence and the lowest rank correlation against every other model. Claude considers a different competitive set, not just different winners.
4. Only Gemini and GPT-reasoning clear the reliability bar. Krippendorff α > 0.667 for both. Default GPT (α = 0.59) and Claude (α = 0.52) fall into 'moderate agreement': their single picks are statistically noisy.
What this means for your hotel
Cross-model AI visibility is not one market. It is four markets that partially overlap. Optimizing for GPT-5.4 buys you most of the GPT-reasoning market as a bonus (agreement 72%), a solid chunk of Gemini (62%), and roughly half of Claude (57%). The reverse also holds: a hotel that Claude loves may not be a hotel that ChatGPT loves.
Two useful implications: (1) visibility tests have to be multi-model, not single-model, to be honest; (2) Claude is the model where “differentiated data” pays off the most, because it’s already making more independent choices than the others.
What to do about it
1. Always test across at least GPT, Gemini, and Claude.
Any “AI visibility” metric measured on a single model is misleading — models agree on only 53–72% of trials. A measurement on GPT alone tells you about GPT; it doesn’t tell you about AI recommendations as a category.
2. When Claude’s not picking you, the problem is data coverage, not position.
Claude’s high JS divergence and low rank correlation (ρ = 0.21 vs Gemini) mean it considers different hotels as the competitive set. If Claude doesn’t surface you, it’s not that Claude’s “missing” — it’s that Claude is reading the set differently. Improving location context and distinctive attributes (things that shift the competitive set) is more effective than tweaking amenity lists.
3. If you had to pick one model to optimize for, pick Gemini.
Gemini is the most stable (STSR = 0.18, κ = 0.79) and agrees most with GPT-reasoning (72%). Optimizing for Gemini gives the best cross-model spillover per unit of effort. And because Gemini’s judgments are the most consistent across orderings, “winning” with Gemini is the most durable form of AI visibility.
The evidence
Finding 1 — Reasoning tokens are the single biggest stabilizer
STSR (Systematic Test-Set Reversal) measures how often the model’s pick flips when you reshuffle the list. Lower = more position-stable.
STSR by model — lower is more stable
976 trials per model · 61 query sets · 16 permutations
Gemini is the most position-stable (31/61 sets fully stable). GPT-reasoning is significantly more stable than default GPT (Mann-Whitney U p = 0.006). Claude is slightly less stable than default GPT but not significantly so (p = 0.215).
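The article doesn't print the STSR formula itself, but given the definition above (how often the pick flips under reshuffling), one natural formalization scores each query set as the share of permutations that deviate from the modal pick. A minimal sketch, with illustrative pick data:

```python
from collections import Counter

def stsr(picks_per_permutation):
    """Share of permutations whose pick differs from the modal pick.
    0.0 means the same hotel wins under every ordering; higher values
    mean the winner depends on where hotels sit in the list."""
    counts = Counter(picks_per_permutation)
    modal_freq = counts.most_common(1)[0][1]
    return 1 - modal_freq / len(picks_per_permutation)

# Illustrative query set: 12 of 16 shuffles pick hotel "A"
print(stsr(["A"] * 12 + ["B"] * 3 + ["C"]))  # → 0.25
```

Under this reading, a fully stable set (the same winner in all 16 permutations) scores 0.0, which matches the "31/61 sets fully stable" framing for Gemini.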
Finding 2 — Academic reliability only clears the bar for reasoning models
Academic consistency metrics — the Krippendorff α 0.667 threshold is the social-science bar for ‘tentative’ inter-rater reliability
| Model | Fleiss κ | Krippendorff α | Kendall W | Agreement band |
|---|---|---|---|---|
| Gemini 3.1 Pro | 0.792 | 0.793 | 0.810 | Substantial |
| GPT-5.4 (reasoning) | 0.735 | 0.736 | 0.482 | Substantial |
| GPT-5.4 (default) | 0.589 | 0.590 | 0.350 | Moderate — below 0.667 |
| Claude Opus 4.6 | 0.521 | 0.522 | 0.544 | Moderate — below 0.667 |
An interesting asymmetry
Claude’s Kendall W = 0.544 is higher than default GPT’s 0.350 despite Claude having worse STSR. That means Claude has a more stable ‘top tier’ (the same hotels reliably appear near the top) but noisier single-pick tie-breaking among that tier. Claude knows who’s in the running — it just disagrees with itself on who wins.
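For readers who want to check the reliability numbers, the standard Fleiss κ computation over a pick-count matrix is short. A self-contained sketch; the counts below are illustrative, not the study's data:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for nominal picks.
    counts[i][j] = number of the m permutations that picked hotel j
    for query set i (every row must sum to the same m)."""
    m = sum(counts[0])                              # raters per item
    total = m * len(counts)
    p = [sum(col) / total for col in zip(*counts)]  # hotel prevalence
    P_bar = sum((sum(c * c for c in row) - m) / (m * (m - 1))
                for row in counts) / len(counts)    # observed agreement
    P_e = sum(x * x for x in p)                     # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Two illustrative query sets, 16 permutations, 3 candidate hotels:
print(round(fleiss_kappa([[14, 2, 0], [1, 15, 0]]), 2))  # → 0.64
```

Krippendorff α and Kendall W are computed over the same pick matrix, which is why the κ and α columns in the table track each other so closely.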
Finding 3 — Model pairs agree 53–72% of the time
Per-trial agreement measures how often two models pick the exact same hotel on the same trial. Random agreement across 10 options would be ~10%.
Per-trial agreement between model pairs
Fraction of trials where both models picked the same hotel
The GPT-family agreement of 72% makes sense (same base model). The Gemini → GPT-reasoning agreement of 72% is the surprise — different families, different training, and still they land on the same hotel nearly 3 of every 4 times.
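The metric itself is a plain exact-match rate, and the ~10% random baseline is easy to sanity-check with a quick simulation (illustrative, not the study's pipeline):

```python
import random

def per_trial_agreement(picks_a, picks_b):
    """Fraction of trials on which two models chose the same hotel."""
    return sum(a == b for a, b in zip(picks_a, picks_b)) / len(picks_a)

# Simulate two models picking uniformly at random from 10 hotels:
random.seed(0)
hotels = list("ABCDEFGHIJ")
a = [random.choice(hotels) for _ in range(10_000)]
b = [random.choice(hotels) for _ in range(10_000)]
baseline = per_trial_agreement(a, b)   # lands close to 0.10
```

Against that ~0.10 floor, even the weakest observed pair (53%) is agreeing five times more often than chance.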
Finding 4 — Claude evaluates a different competitive set
Jensen-Shannon divergence compares the full selection distribution across models. Low JS means both models weight the same hotels similarly; high JS means they consider different hotels entirely. Spearman rank correlation ρ does a related thing for rankings.
Cross-model selection distance — JS divergence (lower = more similar) and Spearman ρ (higher = more aligned rankings)
| Pair | Jensen-Shannon | Spearman ρ | Reading |
|---|---|---|---|
| GPT → GPT-reasoning | 0.283 | 0.655 | Closest pair |
| GPT-reasoning → Gemini | 0.335 | 0.279 | Agree on picks, disagree on rankings |
| GPT → Claude | 0.348 | 0.317 | Moderate distance |
| GPT → Gemini | 0.377 | 0.407 | Moderate distance |
| Gemini → Claude | 0.427 | 0.212 | Far apart (29% of sets are negative ρ) |
| GPT-reasoning → Claude | 0.434 | 0.253 | Highest divergence |
About 29% of query sets produce a negative rank correlation between Gemini and Claude. That’s not “they disagree on the winner” — that’s “they rank the options in inverted order.” The models read the same list and come out with opposite preference orderings on almost a third of markets.
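The article doesn't state whether the table reports the JS divergence or its square root (the JS distance); a base-2 divergence sketch, which is bounded in [0, 1], looks like this:

```python
from math import log2

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two selection
    distributions over the same hotel list. 0 = identical weighting,
    1 = the models put weight on entirely disjoint hotels."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):  # Kullback-Leibler divergence, skipping zero terms
        return sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Same weighting → 0; fully disjoint weighting → 1:
print(js_divergence([0.5, 0.3, 0.2], [0.5, 0.3, 0.2]))  # → 0.0
print(js_divergence([1.0, 0.0], [0.0, 1.0]))            # → 1.0
```

Because the measure is symmetric and bounded, the 0.283 (GPT → GPT-reasoning) versus 0.434 (GPT-reasoning → Claude) gap is directly comparable across pairs.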
Finding 5 — Position-difficult markets are shared across models
We also checked whether the same markets that trip up one model also trip up the others. They do.
STSR correlation across model pairs — Spearman ρ per pair · are the same markets hard for everyone?
| Pair | Spearman ρ | p-value |
|---|---|---|
| GPT → GPT-reasoning | 0.54 | < 0.001 |
| Gemini → GPT-reasoning | 0.51 | < 0.001 |
| GPT → Claude | 0.45 | < 0.001 |
| Gemini → GPT | 0.39 | 0.002 |
| Claude → Gemini | 0.36 | 0.004 |
| Claude → GPT-reasoning | 0.30 | 0.019 |
All six pairs show significantly correlated STSR across markets — meaning the hotel sets that make one model flip its pick tend to make the other models flip theirs too. Positional instability is a property of the competitive set as much as a property of the model.
The same crowded markets are hard for every model — position only bites when no hotel clearly wins.
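The per-pair numbers above are plain Spearman correlations over per-market STSR values. A tie-free sketch of the classic rank formula (production code would use scipy.stats.spearmanr, which also handles ties):

```python
def spearman_rho(x, y):
    """Spearman rank correlation via 1 - 6Σd²/(n(n²-1)).
    Assumes no tied values (a simplification)."""
    def ranks(v):
        return {val: r for r, val in enumerate(sorted(v), start=1)}
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((rx[a] - ry[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Illustrative per-market STSR for two models, identically ordered:
print(spearman_rho([0.10, 0.20, 0.30, 0.40],
                   [0.05, 0.12, 0.31, 0.90]))  # → 1.0
```

A ρ of 0.54 (GPT → GPT-reasoning) means markets that destabilize one model tend to sit high in the other model's instability ranking too, even when the absolute STSR values differ.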
Frequently asked questions
Why do Gemini and GPT-reasoning agree so often?
Reasoning tokens and built-in chain-of-thought steer both models toward similar compensatory evaluation — each feature gets weighed, trade-offs get made, and the model arrives at a ranking. That process converges on similar answers even across model families. Default GPT (no reasoning) skips this step and lands elsewhere.
Why does Claude behave so differently?
Based on the data, Claude appears to consider a broader competitive set — it doesn’t converge on the same 2–3 hotels that GPT and Gemini home in on. Its Kendall W (0.544) is high despite moderate Fleiss κ, so Claude has stable preferences over the top tier but noisy single-pick behavior within it. The practical read: Claude’s ‘taste’ is more idiosyncratic, which can be good or bad for your specific hotel.
Which model is fairest to independent hotels?
Claude is the most independent-friendly by behavior (less convergent, more varied picks), but also the least reliable (κ = 0.52 — its picks are noisy). Gemini is the most reliable and fair to well-differentiated hotels, thanks to its stable ranking behavior. GPT-reasoning is the closest to ‘default GPT behavior but more stable.’ There’s no single winner.
If you can only optimize for one model, which should it be?
Based on agreement rates: optimizing for GPT-reasoning buys you ~72% of Gemini and ~72% of default GPT for free. Optimizing for Claude is the most siloed — only ~52–57% spillover to other models. The highest-leverage single choice is probably GPT-reasoning or Gemini.
Do the reasoning modes apply to what regular users see?
They’re exposed via API (and OpenRouter) but not always via consumer UI. Default ChatGPT/Claude.ai conversations may or may not use extended reasoning depending on the model tier. The practical implication for you: visibility tests that don’t specify reasoning mode will land somewhere in between the ‘default’ and ‘reasoning’ results in our table.
How we ran the experiment
3,904 total trials · 4 models tested · 61 query sets · 976 trials per model
Same inputs across four model configurations. 61 Google Hotels query sets, 10 hotels each, all 13 raw features. Each set shuffled into 16 random permutations. 976 trials per model, 3,904 total trials.
Models tested: GPT-5.4 (default, temperature 0), GPT-5.4 with medium reasoning (via OpenRouter reasoning tokens), Gemini 3.1 Pro, and Claude Opus 4.6. All received identical system and user prompts; only the model string changed.
Metrics: STSR (position instability), Fleiss κ / Krippendorff α / Kendall W (inter-permutation agreement), per-trial agreement (exact-match pick rate), Jensen-Shannon divergence (distance between selection distributions), and Spearman ρ (rank alignment).
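Putting the design together, the trial loop might look like the sketch below. The `ask()` call and the model labels are hypothetical stand-ins, not code or identifiers from the study:

```python
import random

def build_trials(query_sets, models, n_perms=16, seed=42):
    """One trial = (query set, shuffled hotel order, model). Every model
    sees the same shuffled lists, so only the model string varies."""
    rng = random.Random(seed)
    trials = []
    for qs in query_sets:
        orders = [rng.sample(qs["hotels"], k=len(qs["hotels"]))
                  for _ in range(n_perms)]
        for order in orders:
            for model in models:
                # pick = ask(model, order)  # hypothetical API wrapper
                trials.append({"set": qs["id"], "model": model,
                               "order": order})
    return trials

sets = [{"id": i, "hotels": [f"hotel_{j}" for j in range(10)]}
        for i in range(61)]
models = ["gpt-default", "gpt-reasoning", "gemini", "claude"]
print(len(build_trials(sets, models)))  # → 3904  (61 × 16 × 4)
```

Sharing the same shuffles across models is what makes the per-trial agreement numbers meaningful: each pair of models is compared on exactly the same ordering of exactly the same hotels.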
Limitations. All models tested at their current April 2026 production versions. API behavior can change; today’s comparison may not hold six months out. Google Hotels data only — other hotel data sources (OTAs with different feature mixes) weren’t tested. No tool use — all trials are “read the list, pick one” without web search or retrieval.
Which AI picks your hotel — and which one skips you entirely?
Huxo’s AI Visibility Report tests you across GPT-5.4, Claude Opus, and Gemini 3.1 Pro simultaneously. You see which models know you, which don’t, and what to fix first.