Why this matters to you
When a traveler asks an AI to pick a hotel, the AI reads some list of options and returns one winner. You might expect the winner to depend only on which hotels are on that list — not on what order they happen to be in.
That’s not what happens. The same 10 hotels in a different order yield a different winner 37% of the time. Some of that is real positional bias; some is temperature-zero API non-determinism. Separating the two is the only honest way to reason about this.
A hotel listed first has roughly double the selection probability of one listed last.
Key findings at a glance
01 · Earlier positions win
Positions 1–4 are picked ~13% of the time each; positions 7–10 average ~7%. Chi-squared rejects the null of uniform selection (χ² = 70.1, p < 0.001).
02 · The effect is real, but overstated without a control
Raw STSR (our order-stability metric, defined under Finding 2) is 0.368, but the determinism baseline (identity order) is 0.115. True positional bias is roughly 0.25, about 68% of the observed effect.
03 · It’s bimodal: some markets are fully stable, others are a mess
20% of query sets have STSR = 0 (no flips across any permutation). Others exceed 0.5. Position matters a lot — or not at all — depending on the competitive set.
04 · Clear-winner markets are immune
When one hotel is objectively superior, position stops mattering. Position only shows up when the AI can’t cleanly differentiate options.
What this means for your hotel
Position matters most when you’re competing in a similar-looking peer set — which is the norm in every midscale-to-upscale market. The upstream implication: positional instability makes all of your AI visibility metrics noisier. One AI recommendation test today and another next week can disagree on your hotel purely because the list was ordered differently.
Three operational consequences flow from that: (1) one-off AI visibility checks are unreliable — you need repeated measurement; (2) anything that makes you genuinely differentiated reduces positional noise against you; (3) when AI recommendation engines start converging on a stable answer (fewer flips across orderings), the hotels that benefit most are the ones with distinctive data — not the ones listed first by default.
What to do about it
1. Don’t over-read a single AI visibility test.
If you ran one ChatGPT query and it didn’t pick your hotel, that’s a single sample. Our data shows the model flips its pick on 37% of permutation pairs. Run the same query multiple times, ideally with different list orderings, before drawing conclusions.
2. Invest in differentiators that create a ‘clear winner’ in your segment.
When one hotel stands clearly above the peer set, position stops mattering. That’s a specific, useful mechanism: reduce the AI’s uncertainty between you and your neighbors, and the positional noise that lets other hotels win goes away. Location-context data and clear review-based quality signals are the levers here.
3. Treat positional bias as a floor, not a ceiling.
Even with reasoning tokens and careful prompts, models keep some positional bias. The right architecture for any production AI recommendation system is a shortlist + deterministic tiebreaker (we cover this in Chapter 06). For your purposes as a hotelier: the system will not be fair to you by default, and any assumption that ‘the AI will just rank us correctly’ is wrong.
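The shortlist-plus-deterministic-tiebreaker pattern can be sketched in a few lines. This is an illustrative sketch, not the system from Chapter 06: the hotel records, field names, and scoring are hypothetical, and the point is only that the final pick is broken on stable, order-independent keys rather than list position.

```python
def recommend(hotels, shortlist_size=3):
    """Shortlist by model-style score, then break ties deterministically.

    Sorting the shortlist on order-independent keys (rating descending,
    then name ascending) means the final pick cannot depend on input order.
    """
    shortlist = sorted(hotels, key=lambda h: h["score"], reverse=True)[:shortlist_size]
    shortlist.sort(key=lambda h: (-h["rating"], h["name"]))
    return shortlist[0]["name"]

# Hypothetical peer set: two hotels are near-identical on score and rating.
hotels = [
    {"name": "Hotel B", "score": 0.91, "rating": 4.5},
    {"name": "Hotel A", "score": 0.90, "rating": 4.5},
    {"name": "Hotel C", "score": 0.93, "rating": 4.3},
]
print(recommend(hotels))                   # same winner...
print(recommend(list(reversed(hotels))))   # ...even with the list reversed
```

The tiebreak keys here are arbitrary; any fixed, content-based ordering works. What matters is that none of them is "whoever appeared first."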
The evidence
Finding 1 — Earlier positions win disproportionately
Across 976 trials (61 query sets × 16 random orderings), we counted how often each position was picked. If the model were position-blind, each position should land at 10%. It doesn’t.
Position selection distribution — GPT-5.4, 976 trials
Expected uniform rate: 10% per position
Positions 1–4 average ~13% each; positions 7–10 average ~7%. Chi-squared uniformity test rejects null with χ² = 70.1, p < 0.001.
~2×
Position 1 vs. Position 7: a 13.22% selection rate vs. 6.45% — being listed first gives a hotel roughly double the selection probability of the seventh-listed hotel, holding everything else equal.
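The uniformity test behind Finding 1 is easy to reproduce. The per-position counts below are hypothetical, shaped to match the reported ~13% / ~7% rates (the study does not publish the raw counts); 27.88 is the standard chi-squared critical value at df = 9, p = 0.001.

```python
# Chi-squared goodness-of-fit: are picks uniform across the 10 list slots?
# Counts are HYPOTHETICAL, summing to 976 and matching the reported rates.
counts = [129, 126, 128, 127, 98, 96, 63, 66, 70, 73]

n = sum(counts)
expected = n / len(counts)  # 97.6 picks per position under the uniform null

chi2 = sum((obs - expected) ** 2 / expected for obs in counts)

CRITICAL_DF9_P001 = 27.88  # chi-squared critical value, df = 9, p = 0.001
print(f"chi2 = {chi2:.1f}, reject uniformity: {chi2 > CRITICAL_DF9_P001}")
```

With these illustrative counts the statistic lands far above the critical value, which is the same conclusion the study reports from its real data.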
Finding 2 — The real bias is smaller than the raw number suggests
STSR (Systematic Test-Set Reversal) measures how often the model’s pick disagrees with itself across permutation pairs. Raw STSR across permuted orderings is 0.368 — sounds bad. But GPT-5.4 at temperature 0 is not perfectly deterministic: if you run the same identity-ordered list twice, it still disagrees with itself sometimes.
Determinism control — identity order vs. permuted
Separating real positional bias from API noise · same GPT-5.4, temperature = 0
Identity Krippendorff α = 0.867 (good reliability, > 0.8). Permuted α = 0.597 (below the 0.667 threshold). ~32% of the raw permuted instability is baseline API noise; the remaining ~68% is genuine positional bias.
Why the baseline matters
Without the identity control, every positional-bias claim is inflated. Our raw 0.368 STSR looks like ‘the model disagrees with itself 37% of the time because of position.’ The real story: about 12 points of that 37% are API non-determinism, and the remaining 25 points are position. Both are real; only the 25 points are actually positional.
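The STSR computation and the baseline subtraction can be sketched directly. The picks below are toy data, not the experiment’s; in the study each query set has 16 orderings, so 120 pairs per set.

```python
from itertools import combinations

def stsr(picks):
    """Fraction of ordering pairs whose winners disagree.

    `picks` is the winner chosen under each ordering of one query set.
    """
    pairs = list(combinations(picks, 2))
    return sum(a != b for a, b in pairs) / len(pairs)

# Toy data: winners under 4 permuted orderings vs. 4 identity-order repeats.
permuted_picks = ["A", "A", "B", "A"]   # one flip -> 3 of 6 pairs disagree
identity_picks = ["A", "A", "A", "A"]   # fully deterministic in this toy case

raw = stsr(permuted_picks)        # 0.5
baseline = stsr(identity_picks)   # 0.0
net_positional = raw - baseline   # the part attributable to ordering
print(raw, baseline, net_positional)
```

This is exactly the subtraction in the text: raw permuted STSR minus identity-order STSR leaves the instability attributable to position.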
Finding 3 — Instability is bimodal across markets
STSR isn’t evenly distributed across query sets. A big chunk of markets are completely stable (STSR = 0). Another chunk are deeply unstable (STSR > 0.5). Very few sit in the middle.
STSR distribution across 61 query sets
How many query sets fall in each STSR band
The distribution is visibly bimodal. Some markets are ‘clear-winner’ cases — the same hotel wins every permutation. Others are genuine toss-ups where position is the deciding factor. Overall mean PCM = 0.65: the model agrees with its first-order pick only 65% of the time, with 80% of query sets (49/61) flipping at least once.
Finding 4 — Academic reliability falls below the standard threshold
For single-pick hotel recommendation, GPT-5.4 under the default configuration does not reach the reliability threshold that social science uses for ‘trustworthy annotator agreement.’
Academic consistency metrics — baseline 10-hotel selection
| Metric | Value | Interpretation |
|---|---|---|
| Fleiss κ | 0.589 | Moderate agreement |
| Krippendorff α | 0.590 | Below 0.667 threshold |
| Kendall W | 0.350 | Low concordance |
| Deterministic sets (permuted) | 12 / 61 | 20% fully stable |
| Deterministic sets (identity) | 38 / 61 | 62% fully stable |
What ‘below threshold’ means
Krippendorff’s α ≥ 0.667 is the standard bar for tentative conclusions; ≥ 0.8 for reliable ones. At α = 0.590, a single-pick recommendation from this configuration shouldn’t be treated as a stable judgment. It’s closer to a probabilistic vote — the model has a favorite, but not consistently the same one across orderings.
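For readers who want to reproduce the reliability numbers, nominal-data Krippendorff’s α can be computed from the per-ordering picks with the standard formula α = 1 − D_o/D_e. This is a from-scratch sketch with toy data, not the study’s; it assumes every unit has at least two pairable picks, as in the experiment.

```python
from collections import Counter

def krippendorff_alpha_nominal(units):
    """Nominal-data Krippendorff's alpha.

    `units` maps each unit (query set) to the list of values observed for it
    (here, the hotel picked under each ordering of that set).
    """
    ratings_lists = [r for r in units.values() if len(r) >= 2]  # pairable only
    values = [v for r in ratings_lists for v in r]
    n = len(values)
    totals = Counter(values)

    # Observed disagreement: ordered within-unit pairs with different values,
    # each unit weighted by 1 / (m_u - 1).
    d_o = 0.0
    for r in ratings_lists:
        m = len(r)
        counts = Counter(r)
        d_o += (m * m - sum(c * c for c in counts.values())) / (m - 1)
    d_o /= n

    # Expected disagreement from the pooled value distribution.
    d_e = (n * n - sum(c * c for c in totals.values())) / (n * (n - 1))
    return 1.0 - d_o / d_e if d_e else 1.0

# Perfect within-unit agreement -> alpha = 1; chance-level agreement -> ~0.
print(krippendorff_alpha_nominal({"u1": ["A"] * 4, "u2": ["B"] * 4}))
print(krippendorff_alpha_nominal({"u1": ["a", "a"], "u2": ["a", "b"]}))
```

Feeding the 61 × 16 pick matrix into a function like this is how an α of 0.590 versus the 0.667 bar would be established.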
Frequently asked questions
Is this just a quirk of how transformers work?
Partly. Transformer attention patterns favor early tokens, which translates to favoring early list items. But training data also reinforces this — ‘best hotels in [city]’ articles put their top pick first. Either way, it’s a property of how current models read lists, not a property of the hotel data itself.
Does the position effect actually matter in practice?
For a single query, no. For the visibility of a single hotel across a market segment, it’s meaningful — it means your expected selection probability depends on variables you don’t control, like what order the API happened to present your competitors in. And the effect compounds: every downstream bias (brand, keyword familiarity) interacts with this, not replaces it.
Do reasoning models fix this?
They help. GPT-5.4 with medium reasoning drops STSR to ~0.24 and pushes Fleiss κ above 0.7. Gemini 3.1 Pro (which reasons by default) is the most position-stable model we tested. See Chapter 04 and Chapter 06 for the full comparison and the mitigation experiments.
Does this mean AI recommendations are basically random?
No. The model’s top 3 shortlist is far more stable than its single pick — ~98% of single-pick winners appear in the top 3 across orderings (Chapter 06). The instability is mostly tie-breaking noise within a fairly stable competitive set. The model identifies *who’s in the running* reliably; it’s the final pick among closely-matched candidates that bounces around.
Can a hotel control where it appears in the list?
You usually can’t. But you can reduce the chance that you’re *in the ‘could go either way’ middle tier*, because that’s where position determines the outcome. Differentiate your listing so the model doesn’t see you as fungible with neighbors — strong location context, complete quality signals, distinctive amenities.
How we ran the experiment
976 total trials · 61 query sets · 10 hotels per set · 16 permutations per set
We tested GPT-5.4 at temperature 0 across 61 Google Hotels query sets. Each set contained 10 real hotels with 13 raw features (price, ratings, location, amenities). For each set, we generated 16 random orderings and asked the model to pick the best hotel.
We measured three things: (1) position distribution — what fraction of picks came from each slot; (2) STSR — fraction of permutation pairs within a set whose picks disagreed; (3) PCM — how often the model agreed with its first-ordering pick across all 16 permutations.
Critically, we also ran a determinism control: the same 61 query sets in their identity ordering, repeated 16 times. Any instability in this condition reflects API non-determinism at temperature 0, not positional effects. Identity STSR = 0.115 vs. permuted STSR = 0.368 — ~32% of the raw instability is API noise; the rest is position.
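The remaining metric, PCM, and the position distribution are simple to compute from a log of picks; a sketch with toy data (the function names are ours, not the study’s):

```python
from collections import Counter

def pcm(picks_by_ordering):
    """Fraction of orderings whose pick agrees with the first ordering's pick."""
    first = picks_by_ordering[0]
    return sum(p == first for p in picks_by_ordering) / len(picks_by_ordering)

def position_distribution(picked_positions, n_positions=10):
    """Fraction of picks landing in each list slot (1-indexed)."""
    counts = Counter(picked_positions)
    total = len(picked_positions)
    return {pos: counts.get(pos, 0) / total for pos in range(1, n_positions + 1)}

# Toy query set: 4 orderings, the pick flips once.
print(pcm(["A", "A", "B", "A"]))                       # 0.75
print(position_distribution([1, 1, 2, 1], n_positions=3))
```

In the study, PCM is averaged across the 61 query sets (mean 0.65) and the position distribution is tallied over all 976 trials.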
Limitations. Single model (GPT-5.4) at one configuration. Fixed list length of 10 — positional effects can look different at other lengths. Real Google Hotels data but only 61 markets — the bimodal distribution suggests market-specific effects we haven’t fully characterized.
Is AI picking your competitors just because they’re listed first?
Huxo’s AI Visibility Report runs multi-ordering tests across GPT-5.4, Claude, and Gemini — so you see your true visibility, not a single-sample snapshot.