Chapter 04 · Cross-Model

Each AI has a different taste — same hotels, three different winners

We ran the same 976 hotel selections through GPT-5.4, GPT-5.4 with reasoning, Gemini 3.1 Pro, and Claude Opus 4.6. They agree on less than you’d think — and Claude, in particular, considers an entirely different competitive set.

Mar 11, 2026 · 10 min read · Huxo Research

Why this matters to you

A traveler might ask ChatGPT one day, Gemini the next, and Claude after that. If all three models agreed on which hotel to recommend, you’d only need to be visible to one of them to be visible everywhere. They don’t agree — so being a favorite of one model is not the same as being a favorite of another.

For an independent hotel, the practical question is: which models treat you fairly? And where should you invest your AI visibility effort?

Models agree with each other more than chance — but less than you’d expect.

Key findings at a glance

01

Reasoning tokens cut position sensitivity roughly in half

GPT-5.4 with medium reasoning drops STSR from 0.37 → 0.24. Gemini (which reasons by default) is most stable at 0.18.

02

Models agree on about 60–70% of picks

Given 10 options, random agreement would be ~10%. Real agreement runs 53% (Claude → GPT-reasoning) to 72% (GPT → GPT-reasoning) — consistent, but not convergent.

03

Claude picks an independent path

Highest Jensen-Shannon divergence and lowest rank correlation against every other model. Claude considers a different competitive set — not just different winners.

04

Only Gemini and GPT-reasoning clear the reliability bar

Krippendorff α > 0.667 for both. Default GPT (α = 0.59) and Claude (α = 0.52) fall into ‘moderate agreement’ — their single picks are statistically noisy.


What this means for your hotel

Cross-model AI visibility is not one market. It is four markets that partially overlap. Optimizing for GPT-5.4 buys you most of the GPT-reasoning market as a bonus (agreement 72%), a solid chunk of Gemini (62%), and roughly half of Claude (57%). The reverse also holds: a hotel that Claude loves may not be a hotel that ChatGPT loves.

Two useful implications: (1) visibility tests have to be multi-model, not single-model, to be honest; (2) Claude is the model where “differentiated data” pays off the most, because it’s already making more independent choices than the others.


What to do about it

1. Always test across at least GPT, Gemini, and Claude.

Any “AI visibility” metric measured on a single model is misleading — models agree on only 53–72% of trials. A measurement on GPT alone tells you about GPT; it doesn’t tell you about AI recommendations as a category.

2. When Claude’s not picking you, the problem is data coverage, not position.

Claude’s high JS divergence and low rank correlation (ρ = 0.21 vs Gemini) mean it considers different hotels as the competitive set. If Claude doesn’t surface you, it’s not that Claude’s “missing” — it’s that Claude is reading the set differently. Improving location context and distinctive attributes (things that shift the competitive set) is more effective than tweaking amenity lists.

3. If you had to pick one model to optimize for, pick Gemini.

Gemini is the most stable (STSR = 0.18, κ = 0.79) and agrees most with GPT-reasoning (72%). Optimizing for Gemini gives the best cross-model spillover per unit of effort. And because Gemini’s judgments are the most consistent across orderings, “winning” with Gemini is the most durable form of AI visibility.


The evidence

Finding 1 — Reasoning tokens are the single biggest stabilizer

STSR (Systematic Test-Set Reversal) measures how often the model’s pick flips when you reshuffle the list. Lower = more position-stable.

STSR by model — lower is more stable

976 trials per model · 61 query sets · 16 permutations

Gemini 3.1 Pro · 0.184
GPT-5.4 (reasoning) · 0.237
GPT-5.4 (default) · 0.368
Claude Opus 4.6 · 0.427

Gemini is the most position-stable (31/61 sets fully stable). GPT-reasoning is significantly more stable than default GPT (Mann-Whitney U p = 0.006). Claude is slightly less stable than default GPT but not significantly so (p = 0.215).
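One plausible operationalization of STSR (the article doesn’t spell out the exact formula, so treat this as an illustrative sketch): for each query set, score the fraction of permutation trials whose pick differs from that set’s modal pick, then average across sets.

```python
from collections import Counter

def stsr(picks_per_set):
    """Position-sensitivity sketch: for each query set, the fraction of
    permutation trials whose pick differs from that set's modal pick.
    Returns the mean across sets (0 = fully position-stable)."""
    rates = []
    for picks in picks_per_set:  # picks: one winning hotel per permutation
        modal_count = Counter(picks).most_common(1)[0][1]
        rates.append(1 - modal_count / len(picks))
    return sum(rates) / len(rates)

# A set that picks the same hotel in all 4 shuffles scores 0;
# a set that flips once in 4 shuffles scores 0.25. Mean = 0.125.
print(stsr([["A", "A", "A", "A"], ["A", "A", "A", "B"]]))  # 0.125
```

Under this reading, Gemini’s 0.184 means that on a typical query set fewer than one in five reshuffles changes the winner.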

Finding 2 — Academic reliability only clears the bar for reasoning models

Academic consistency metrics — the Krippendorff α 0.667 threshold is the social-science bar for ‘tentative’ inter-rater reliability

| Model | Fleiss κ | Krippendorff α | Kendall W | Agreement band |
| --- | --- | --- | --- | --- |
| Gemini 3.1 Pro | 0.792 | 0.793 | 0.810 | Substantial |
| GPT-5.4 (reasoning) | 0.735 | 0.736 | 0.482 | Substantial |
| GPT-5.4 (default) | 0.589 | 0.590 | 0.350 | Moderate — below 0.667 |
| Claude Opus 4.6 | 0.521 | 0.522 | 0.544 | Moderate — below 0.667 |

An interesting asymmetry

Claude’s Kendall W = 0.544 is higher than GPT’s 0.350 despite Claude having worse STSR. That means Claude has a more stable ‘top tier’ (the same hotels reliably appear near the top) but noisier single-pick tie-breaking among that tier. Claude knows who’s in the running — it just disagrees with itself on who wins.
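For readers who want to reproduce the agreement numbers: Fleiss κ treats the 16 permutations as raters and each query set as a subject. A minimal, self-contained implementation (not the study’s code) looks like this:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for nominal ratings.
    counts[i][j] = number of raters (permutations) that assigned
    unit i (query set) to category j (hotel). Every row must sum
    to the same number of raters n."""
    N = len(counts)                     # number of units
    n = sum(counts[0])                  # raters per unit
    k = len(counts[0])                  # number of categories
    total = N * n
    # Marginal proportion of all assignments falling in each category
    p = [sum(row[j] for row in counts) / total for j in range(k)]
    # Observed pairwise agreement within each unit
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N
    P_e = sum(pj * pj for pj in p)      # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Two query sets, four permutations each, perfect within-set agreement:
print(fleiss_kappa([[4, 0], [0, 4]]))  # 1.0
```

Krippendorff α generalizes the same idea to missing data and other level-of-measurement choices, which is why the two columns in the table track each other so closely here.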

Finding 3 — Model pairs agree 53–72% of the time

Per-trial agreement measures how often two models pick the exact same hotel on the same trial. Random agreement across 10 options would be ~10%.

Per-trial agreement between model pairs

Fraction of trials where both models picked the same hotel

GPT → GPT-reasoning · 72.03%
GPT-reasoning → Gemini · 72.13%
GPT → Gemini · 61.58%
GPT → Claude · 56.76%
Gemini → Claude · 53.59%
GPT-reasoning → Claude · 52.56%

The GPT-family agreement of 72% makes sense (same base model). The Gemini → GPT-reasoning agreement of 72% is the surprise — different families, different training, and still they land on the same hotel nearly 3 of every 4 times.
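The per-trial agreement metric itself is trivial: it is just the exact-match rate between two models’ picks over the same trials.

```python
def per_trial_agreement(picks_a, picks_b):
    """Fraction of trials on which both models picked the same hotel.
    picks_a and picks_b are aligned lists, one winner per trial."""
    assert len(picks_a) == len(picks_b), "trial lists must be aligned"
    same = sum(a == b for a, b in zip(picks_a, picks_b))
    return same / len(picks_a)

# Agree on trials 1 and 2, disagree on trials 3 and 4:
print(per_trial_agreement(["A", "B", "C", "A"], ["A", "B", "D", "B"]))  # 0.5
```

With 10 options per trial, two models picking uniformly at random would match about 10% of the time, which is the baseline the 53–72% figures should be read against.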

Finding 4 — Claude evaluates a different competitive set

Jensen-Shannon divergence compares the full selection distribution across models. Low JS means both models weight the same hotels similarly; high JS means they consider different hotels entirely. Spearman rank correlation ρ does a related thing for rankings.

Cross-model selection distance — JS divergence (lower = more similar) and Spearman ρ (higher = more aligned rankings)

| Pair | Jensen-Shannon | Spearman ρ | Reading |
| --- | --- | --- | --- |
| GPT → GPT-reasoning | 0.283 | 0.655 | Closest pair |
| GPT-reasoning → Gemini | 0.335 | 0.279 | Agree on picks, disagree on rankings |
| GPT → Claude | 0.348 | 0.317 | Moderate distance |
| GPT → Gemini | 0.377 | 0.407 | Moderate distance |
| Gemini → Claude | 0.427 | 0.212 | Far apart (29% of sets are negative ρ) |
| GPT-reasoning → Claude | 0.434 | 0.253 | Highest divergence |

About 29% of query sets produce a negative rank correlation between Gemini and Claude. That’s not “they disagree on the winner” — that’s “they rank the options in inverted order.” The models read the same list and come out with opposite preference orderings on almost a third of markets.
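Both distance measures are standard and easy to compute from the selection data. The sketch below uses base-2 JS divergence (so disjoint distributions score 1.0) and a tie-free Spearman ρ; the study’s exact implementation may differ in tie handling.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two selection
    distributions over the same hotel list. 0 = identical, 1 = disjoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def spearman_rho(x, y):
    """Spearman rank correlation between two score lists (assumes no ties)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

print(js_divergence([1.0, 0.0], [0.0, 1.0]))  # 1.0 — no overlap at all
print(spearman_rho([1, 2, 3], [3, 2, 1]))     # -1.0 — fully inverted ranking
```

A negative ρ, as in the 29% of Gemini–Claude sets, corresponds to the inverted-ranking case in the second example.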

Finding 5 — Position-difficult markets are shared across models

We also checked whether the same markets that trip up one model also trip up the others. They do.

STSR correlation across model pairs — Spearman ρ per pair · are the same markets hard for everyone?

| Pair | Spearman ρ | p-value |
| --- | --- | --- |
| GPT → GPT-reasoning | 0.54 | < 0.001 |
| Gemini → GPT-reasoning | 0.51 | < 0.001 |
| GPT → Claude | 0.45 | < 0.001 |
| Gemini → GPT | 0.39 | 0.002 |
| Claude → Gemini | 0.36 | 0.004 |
| Claude → GPT-reasoning | 0.30 | 0.019 |

All six pairs show significantly correlated STSR across markets — meaning the hotel sets that make one model flip its pick tend to make the other models flip theirs too. Positional instability is a property of the competitive set as much as a property of the model.

The same crowded markets are hard for every model — position only bites when no hotel clearly wins.

Frequently asked questions

Why do GPT-reasoning and Gemini agree 72% of the time — basically as much as GPT agrees with itself?

Reasoning tokens and built-in chain-of-thought steer both models toward similar compensatory evaluation — each feature gets weighed, trade-offs get made, and the model arrives at a ranking. That process converges on similar answers even across model families. Default GPT (no reasoning) skips this step and lands elsewhere.

Why is Claude so different?

Based on the data, Claude appears to consider a broader competitive set — it doesn’t converge on the same 2–3 hotels that GPT and Gemini home in on. Its Kendall W (0.544) is high despite moderate Fleiss κ, so Claude has stable preferences over the top tier but noisy single-pick behavior within it. The practical read: Claude’s ‘taste’ is more idiosyncratic, which can be good or bad for your specific hotel.

Which model is ‘best’ for independent hotels?

Claude is the most independent-friendly by behavior (less convergent, more varied picks), but also the least reliable (κ = 0.52 — its picks are noisy). Gemini is the most reliable and fair to well-differentiated hotels, thanks to its stable ranking behavior. GPT-reasoning is the closest to ‘default GPT behavior but more stable.’ There’s no single winner.

If I optimize for one model, how much spillover do I get to the others?

Based on agreement rates: optimizing for GPT-reasoning buys you ~72% of Gemini and ~72% of default GPT for free. Optimizing for Claude is the most siloed — only ~52–57% spillover to other models. The highest-leverage single choice is probably GPT-reasoning or Gemini.

Are reasoning tokens available to everyone?

They’re exposed via API (and OpenRouter) but not always via consumer UI. Default ChatGPT/Claude.ai conversations may or may not use extended reasoning depending on the model tier. The practical implication for you: visibility tests that don’t specify reasoning mode will land somewhere in between the ‘default’ and ‘reasoning’ results in our table.
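For anyone reproducing the ‘reasoning’ condition, a request body along these lines enables reasoning tokens through OpenRouter. The model slug is hypothetical (taken from this study’s naming), and the `reasoning` field should be checked against OpenRouter’s current documentation before use.

```python
# Hypothetical OpenRouter chat-completion request body for the
# "medium reasoning" condition. Verify field names against the
# current OpenRouter API docs; the model slug is an assumption.
payload = {
    "model": "openai/gpt-5.4",          # hypothetical slug from this study
    "messages": [
        {"role": "system", "content": "Pick exactly one hotel."},
        {"role": "user", "content": "<the shuffled 10-hotel list>"},
    ],
    "reasoning": {"effort": "medium"},  # drop this key for 'default' mode
    "temperature": 0,                   # matches the study's default config
}
```

The practical point: if your visibility test doesn’t pin this parameter explicitly, you don’t know which row of the table you’re measuring.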


How we ran the experiment

3,904

Total trials

4

Models tested

61

Query sets

976

Trials / model

Same inputs across four model configurations. 61 Google Hotels query sets, 10 hotels each, all 13 raw features. Each set shuffled into 16 random permutations. 976 trials per model, 3,904 total trials.

Models tested: GPT-5.4 (default, temperature 0), GPT-5.4 with medium reasoning (via OpenRouter reasoning tokens), Gemini 3.1 Pro, and Claude Opus 4.6. All received identical system and user prompts; only the model string changed.

Metrics: STSR (position instability), Fleiss κ / Krippendorff α / Kendall W (inter-permutation agreement), per-trial agreement (exact-match pick rate), Jensen-Shannon divergence (distance between selection distributions), and Spearman ρ (rank alignment).

Limitations. All models tested at their current April 2026 production versions. API behavior can change; today’s comparison may not hold six months out. Google Hotels data only — other hotel data sources (OTAs with different feature mixes) weren’t tested. No tool use — all trials are “read the list, pick one” without web search or retrieval.


Which AI picks your hotel — and which one skips you entirely?

Huxo’s AI Visibility Report tests you across GPT-5.4, Claude Opus, and Gemini 3.1 Pro simultaneously. You see which models know you, which don’t, and what to fix first.
