Chapter 03 · Positional Bias

Does the order of results matter? Yes — and the AI’s pick flips 37% of the time when you reshuffle the list

We presented the same hotels in 16 different orders and asked GPT-5.4 to pick one. The top four positions split ~13% each; the bottom four split ~7%. Only part of that is real bias — the rest is API noise.

Mar 3, 2026 · 7 min read · Huxo Research

Why this matters to you

When a traveler asks an AI to pick a hotel, the AI reads some list of options and returns one winner. You might expect the winner to depend only on which hotels are on that list — not on what order they happen to be in.

That’s not what happens. The same 10 hotels in a different order yields a different winner 37% of the time. Some of that is real positional bias. Some of it is temperature-zero API non-determinism. Separating the two is the only honest way to reason about this.

A hotel listed near the top has roughly double the selection probability of one near the bottom.

Key findings at a glance

01

Earlier positions win

Positions 1–4 are picked ~13% of the time each; positions 7–10 average ~7%. Chi-squared rejects the null of uniform selection (χ² = 70.1, p < 0.001).

02

The effect is real — but overstated without a control

Overall STSR is 0.368, but the determinism baseline (identity order) is 0.115. True positional bias is roughly 0.25 — about 68% of the observed effect.

03

It’s bimodal: some markets are fully stable, others are a mess

20% of query sets have STSR = 0 (no flips across any permutation). Others exceed 0.5. Position matters a lot — or not at all — depending on the competitive set.

04

Clear-winner markets are immune

When one hotel is objectively superior, position stops mattering. Position only shows up when the AI can’t cleanly differentiate options.


What this means for your hotel

Position matters most when you’re competing in a similar-looking peer set — which is the norm in every midscale-to-upscale market. The upstream implication: positional instability makes all of your AI visibility metrics noisier. One AI recommendation test today and another next week can disagree on your hotel purely because the list was ordered differently.

Three operational consequences flow from that: (1) one-off AI visibility checks are unreliable — you need repeated measurement; (2) anything that makes you genuinely differentiated reduces positional noise against you; (3) as AI recommendation engines become more order-stable (fewer flips across orderings), the hotels that benefit most are the ones with distinctive data — not the ones listed first by default.


What to do about it

1. Don’t over-read a single AI visibility test.

If you ran one ChatGPT query and it didn’t pick your hotel, that’s a single sample. Our data shows the model flips its pick on 37% of permutation pairs. Run the same query multiple times, ideally with different list orderings, before drawing conclusions.
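In practice that looks like shuffle-and-aggregate. A minimal sketch, where `ask_model` stands in for whatever API call you actually use (it is a hypothetical placeholder, not a real client):

```python
import random
from collections import Counter

def multi_order_pick(hotels, ask_model, runs=16, seed=0):
    """Query the model `runs` times with a freshly shuffled list each
    time, and return each hotel's vote share instead of a single pick."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(runs):
        order = list(hotels)
        rng.shuffle(order)
        votes[ask_model(order)] += 1
    return {hotel: votes[hotel] / runs for hotel in hotels}

# Usage with a toy stand-in "model" that always picks whatever is listed
# first — the vote shares then directly expose the positional effect.
shares = multi_order_pick(["Hotel A", "Hotel B", "Hotel C"],
                          lambda order: order[0])
```

A vote share per hotel is a far more stable visibility signal than any single pick.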

2. Invest in differentiators that create a ‘clear winner’ in your segment.

When one hotel stands clearly above the peer set, position stops mattering. That’s a specific, useful mechanism: reduce the AI’s uncertainty between you and your neighbors, and the positional noise that lets other hotels win goes away. Location-context data and clear review-based quality signals are the levers here.

3. Treat positional bias as a floor, not a ceiling.

Even with reasoning tokens and careful prompts, models keep some positional bias. The right architecture for any production AI recommendation system is a shortlist + deterministic tiebreaker (we cover this in Chapter 06). For your purposes as a hotelier: the system will not be fair to you by default, and any assumption that ‘the AI will just rank us correctly’ is wrong.
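The shortlist-plus-tiebreaker pattern can be sketched in a few lines. `shortlist_model` is a hypothetical call returning the model’s top 3, and the rating-then-price tiebreak is an illustrative choice, not necessarily the one Chapter 06 settles on:

```python
def pick_with_tiebreaker(hotels, shortlist_model):
    """Let the model nominate its (order-stable) top 3, then break the
    tie deterministically so list order can no longer flip the winner."""
    top3 = shortlist_model(hotels)
    # Deterministic tiebreak: highest rating, then lowest price, then name.
    return min(top3, key=lambda h: (-h["rating"], h["price"], h["name"]))

hotels = [
    {"name": "A", "rating": 4.5, "price": 120},
    {"name": "B", "rating": 4.7, "price": 150},
    {"name": "C", "rating": 4.7, "price": 140},
]
# Stand-in shortlist function: here the "model" shortlists everything.
winner = pick_with_tiebreaker(hotels, lambda hs: hs)
```

Because the tiebreak is a total order over hotel attributes, the same shortlist always yields the same winner regardless of input ordering.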


The evidence

Finding 1 — Earlier positions win disproportionately

Across 976 trials (61 query sets × 16 random orderings), we counted how often each position was picked. If the model were position-blind, each position should land at 10%. It doesn’t.

Position selection distribution — GPT-5.4, 976 trials

Expected uniform rate: 10% per position

Position 1: 13.22%
Position 2: 13.52%
Position 3: 13.11%
Position 4: 12.30%
Position 5: 9.84%
Position 6: 9.32%
Position 7: 6.45%
Position 8: 7.99%
Position 9: 6.66%
Position 10: 7.58%

Positions 1–4 average ~13% each; positions 7–10 average ~7%. Chi-squared uniformity test rejects null with χ² = 70.1, p < 0.001.

2.0×

Position 1 vs. Position 7. 13.22% selection rate vs. 6.45% — being listed first gives a hotel roughly double the selection probability of the seventh-listed hotel, holding everything else equal.
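The uniformity test is easy to reproduce in pure Python from per-position counts. The integers below are reconstructed from the reported percentages of 976 trials (e.g. 13.22% of 976 ≈ 129), so the exact counts are our approximation, not the raw data:

```python
# Per-position pick counts, reconstructed from the reported percentages
# of 976 trials — an approximation, not the raw data.
counts = [129, 132, 128, 120, 96, 91, 63, 78, 65, 74]
n = sum(counts)                         # 976 trials
expected = n / len(counts)              # 97.6 per position if position-blind
chi2 = sum((c - expected) ** 2 / expected for c in counts)
# chi2 ≈ 70.1; with 9 degrees of freedom, the p < 0.001 critical value
# is 27.88, so uniform selection is firmly rejected.
```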

Finding 2 — The real bias is smaller than the raw number suggests

STSR (Systematic Test-Set Reversal) measures how often the model’s pick disagrees with itself across permutation pairs. Raw STSR across permuted orderings is 0.368 — sounds bad. But GPT-5.4 at temperature 0 is not perfectly deterministic: if you run the same identity-ordered list twice, it still disagrees with itself sometimes.

Determinism control — identity order vs. permuted

Separating real positional bias from API noise · same GPT-5.4, temperature = 0

Identity order: STSR 0.115
Permuted orders: STSR 0.368
True positional bias (difference): ~0.25

Identity Krippendorff α = 0.867 (good reliability, > 0.8). Permuted α = 0.597 (below the 0.667 threshold). ~32% of the raw permuted instability is baseline API noise; the remaining ~68% is genuine positional bias.

Why the baseline matters

Without the identity control, every positional bias claim is inflated. Our raw 0.368 STSR looks like ‘the model disagrees with itself 37% of the time because of position.’ The real story: about 12 points of that 37% are API non-determinism, and the remaining 25 points are position. Both are real; only the 25 points are actually positional.
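The decomposition is just pairwise disagreement counting. A minimal sketch, with invented per-run winners chosen to roughly mirror the reported rates:

```python
from itertools import combinations

def stsr(picks):
    """Fraction of run pairs whose winners disagree."""
    pairs = list(combinations(picks, 2))
    return sum(a != b for a, b in pairs) / len(pairs)

# Hypothetical winners for one query set across 16 runs (hotel IDs invented).
identity_picks = ["H2"] * 15 + ["H5"]                 # same order every run
permuted_picks = ["H2"] * 13 + ["H5"] * 2 + ["H7"]    # order shuffled each run

noise_floor = stsr(identity_picks)    # 0.125: API non-determinism alone
raw = stsr(permuted_picks)            # ≈ 0.342: noise + position
positional = raw - noise_floor        # ≈ 0.217: attributable to ordering
```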

Finding 3 — Instability is bimodal across markets

STSR isn’t evenly distributed across query sets. A big chunk of markets are completely stable (STSR = 0). Another chunk are deeply unstable (STSR > 0.5). Very few sit in the middle.

STSR distribution across 61 query sets

How many query sets fall in each STSR band

0.00 – 0.05: 12 sets
0.10 – 0.15: 7
0.20 – 0.25: 4
0.30 – 0.35: 4
0.40 – 0.45: 4
0.45 – 0.50: 3
0.50 – 0.55: 9
0.55 – 0.60: 6
0.60 – 0.65: 5
0.65 – 0.70: 6
0.75 – 0.80: 1

The distribution is visibly bimodal. Some markets are ‘clear-winner’ cases — the same hotel wins every permutation. Others are genuine toss-ups where position is the deciding factor. Overall mean PCM = 0.65: the model agrees with its first-order pick only 65% of the time, with 80% of query sets (49/61) flipping at least once.

Finding 4 — Academic reliability falls below the standard threshold

For single-pick hotel recommendation, GPT-5.4 under the default configuration does not reach the reliability threshold that social science uses for ‘trustworthy annotator agreement.’

Academic consistency metrics — baseline 10-hotel selection

Metric · Value · Interpretation
Fleiss κ · 0.589 · Moderate agreement
Krippendorff α · 0.590 · Below 0.667 threshold
Kendall W · 0.350 · Low concordance
Deterministic sets (permuted) · 12 / 61 · 20% fully stable
Deterministic sets (identity) · 38 / 61 · 62% fully stable

What ‘below threshold’ means

Krippendorff’s α ≥ 0.667 is the standard bar for tentative conclusions; ≥ 0.8 for reliable ones. At α = 0.590, a single-pick recommendation from this configuration shouldn’t be treated as a stable judgment. It’s closer to a probabilistic vote — the model has a favorite, but not consistently the same one across orderings.
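For readers who want to reproduce the agreement numbers, Fleiss’ κ has a compact textbook form. This is a generic implementation (not Huxo’s analysis code), and the toy data at the bottom is invented:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for nominal labels.

    `ratings` is a list of items; each item is the list of labels it
    received, one per rater (here: one winning hotel per ordering)."""
    categories = sorted({label for item in ratings for label in item})
    n = len(ratings[0])                  # raters (orderings) per item
    N = len(ratings)                     # items (query sets)
    totals = dict.fromkeys(categories, 0)
    agreement = []
    for item in ratings:
        counts = {c: item.count(c) for c in categories}
        for c, k in counts.items():
            totals[c] += k
        # Observed agreement for this item.
        agreement.append(
            (sum(k * k for k in counts.values()) - n) / (n * (n - 1)))
    p_bar = sum(agreement) / N           # mean observed agreement
    p_exp = sum((totals[c] / (N * n)) ** 2 for c in categories)  # chance
    return (p_bar - p_exp) / (1 - p_exp)

# Toy data: 3 orderings per query set, mostly but not fully stable winners.
kappa = fleiss_kappa([["A", "A", "A"], ["B", "B", "B"], ["A", "A", "B"]])
# kappa = 0.55: one flipped pick out of nine drags agreement well below 1.
```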


Frequently asked questions

Isn’t primacy bias just a training-data quirk?

Partly. Transformer attention patterns favor early tokens, which translates to favoring early list items. But training data also reinforces this — ‘best hotels in [city]’ articles put their top pick first. Either way, it’s a property of how current models read lists, not a property of the hotel data itself.

If position 1 gets picked 13% and position 10 gets 7.6%, that’s only a 5-point spread. Is it really a big deal?

For a single query, no. For the visibility of a single hotel across a market segment, it’s meaningful — it means your expected selection probability depends on variables you don’t control, like what order the API happened to present your competitors in. And the effect compounds: every downstream bias (brand, keyword familiarity) interacts with this, not replaces it.

Do reasoning tokens fix this?

They help. GPT-5.4 with medium reasoning drops STSR to ~0.24 and pushes Fleiss κ above 0.7. Gemini 3.1 Pro (which reasons by default) is the most position-stable model we tested. See Chapter 04 and Chapter 06 for the full comparison and the mitigation experiments.

Does this mean AI hotel recommendations are essentially random?

No. The model’s top 3 shortlist is far more stable than its single pick — ~98% of single-pick winners appear in the top 3 across orderings (Chapter 06). The instability is mostly tie-breaking noise within a fairly stable competitive set. The model identifies *who’s in the running* reliably; it’s the final pick among closely-matched candidates that bounces around.

What can I do if I can’t change where I’m placed in the list?

You usually can’t. But you can reduce the chance that you’re *in the ‘could go either way’ middle tier*, because that’s where position determines the outcome. Differentiate your listing so the model doesn’t see you as fungible with neighbors — strong location context, complete quality signals, distinctive amenities.


How we ran the experiment

976 total trials · 61 query sets · 10 hotels per set · 16 orderings per set

We tested GPT-5.4 at temperature 0 across 61 Google Hotels query sets. Each set contained 10 real hotels with 13 raw features (price, ratings, location, amenities). For each set, we generated 16 random orderings and asked the model to pick the best hotel.

We measured three things: (1) position distribution — what fraction of picks came from each slot; (2) STSR — fraction of permutation pairs within a set whose picks disagreed; (3) PCM — how often the model agreed with its first-ordering pick across all 16 permutations.
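PCM as defined in (3) reduces to a one-liner; the per-run winners below are invented for illustration:

```python
def pcm(picks):
    """Fraction of runs whose winner matches the first-ordering pick."""
    return sum(p == picks[0] for p in picks) / len(picks)

# Invented winners across 16 orderings of one query set.
picks = ["H2"] * 10 + ["H5"] * 4 + ["H7"] * 2
score = pcm(picks)   # 0.625 — near the reported mean of 0.65
```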

Critically, we also ran a determinism control: the same 61 query sets in their identity ordering, repeated 16 times. Any instability in this condition reflects API non-determinism at temperature 0, not positional effects. Identity STSR = 0.115 vs. permuted STSR = 0.368 — ~32% of the raw instability is API noise; the rest is position.

Limitations. Single model (GPT-5.4) at one configuration. Fixed list length of 10 — positional effects can look different at other lengths. Real Google Hotels data but only 61 markets — the bimodal distribution suggests market-specific effects we haven’t fully characterized.


Is AI picking your competitors just because they’re listed first?

Huxo’s AI Visibility Report runs multi-ordering tests across GPT-5.4, Claude, and Gemini — so you see your true visibility, not a single-sample snapshot.
