Why this matters to you
Chapter 03 showed that GPT-5.4 flips its hotel pick roughly one out of every three times you reshuffle the list. That’s a real problem — and a tempting one to “fix” with a prompt tweak. A smarter format. A two-step workflow. A ranked table. Some enrichment.
We tried all of them. Fifteen approaches across four categories: reasoning controls, format changes, multi-step workflows, and information-theoretic augmentations. Most of them look like improvements. Almost none of them are.
This chapter is the “we ran the experiment so you don’t have to” version — and it leads to a very specific architectural recommendation for anyone building a hotel selection agent.
Positional instability is an attention-mechanism problem, not a prompt problem.
Key findings at a glance
01
Only reasoning tokens significantly reduce STSR
Standard GPT with medium reasoning: STSR 0.24 (p=0.0105). Pivoted + reasoning: 0.25 (p=0.0212). No other approach clears p < 0.05 vs. the standard baseline (Mann-Whitney U).
02
Triple presentation makes bias significantly worse
Showing the same hotels in 3 random orderings bloats STSR to 0.48 (p=0.0100). The model averages the wrong way.
03
Ranked tables look good but are just anchoring
Summary + Ranked trends to 0.28 with Kendall W of 0.70 — the highest we measured. But agreement with reasoning is only 65–66%. It’s not debiasing; it’s steering on a pre-computed ranking.
04
The right architecture is shortlist + deterministic tiebreaker
Asking for the top 3 instead of one pick: 98.3% of single-winners appear in the top-3, Jaccard 0.74, Overlap@3 0.82. The model picks the set reliably — and a deterministic rule picks the winner.
What this means for your hotel
Two practical takeaways for anyone thinking about AI visibility.
First, “which prompt trick beats position?” is the wrong question. Our data shows position sensitivity is not a surface behavior — it’s baked into how LLMs attend to ordered lists. Every cosmetic fix we tried (bullet format, YAML, tables, multi-step prompts) left the underlying bias intact. If a vendor tells you they’ve “solved” LLM positional bias with a clever prompt, the evidence says otherwise.
Second, the model is reliable at identifying the shortlist, not the winner. In 93.2% of permutation pairs the top-3 lists overlap by 2 or 3 hotels. In 98.3% of trials, the single-pick winner was already in that stable top-3. So if you’re benchmarking “which hotels does AI recommend?”, measuring top-3 presence is far more informative than measuring win rate.
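To make the measurement difference concrete, here is a minimal sketch of win rate vs. top-3 presence. The run data and hotel names are hypothetical:

```python
# Hypothetical run data: the model's single pick and top-3 shortlist
# for three shuffles of the same query.
runs = [
    {"winner": "Grand Hotel", "top3": ["Grand Hotel", "Hotel Rex", "The Pier"]},
    {"winner": "Hotel Rex",   "top3": ["Hotel Rex", "Grand Hotel", "The Pier"]},
    {"winner": "Grand Hotel", "top3": ["Grand Hotel", "The Pier", "Hotel Rex"]},
]

def win_rate(hotel, runs):
    # Fraction of runs where this hotel was the single pick.
    return sum(r["winner"] == hotel for r in runs) / len(runs)

def top3_presence(hotel, runs):
    # Fraction of runs where this hotel made the shortlist.
    return sum(hotel in r["top3"] for r in runs) / len(runs)

print(win_rate("Hotel Rex", runs))       # 1 of 3 runs
print(top3_presence("Hotel Rex", runs))  # 3 of 3 runs
```

On this toy data Hotel Rex “wins” only a third of the time, yet it sits in the competitive set on every shuffle — and that second number is the stable signal worth reporting.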
What to do about it
1. When you benchmark AI visibility, measure top-3 presence, not single-pick wins.
Single-pick STSR = 0.36. Top-3 Jaccard = 0.74. Your position in the model’s competitive set is a far more stable, repeatable signal than whether you happen to be the winner on a given shuffle. A “won 4 of 10 runs” measurement is almost meaningless. “Appeared in top-3 in 9 of 10 runs” is a real number.
2. If you’re building an agent, ask for 3, not 1.
The right architecture is: prompt the LLM for a top-3 shortlist, then apply a deterministic or user-defined rule to break the tie (price, availability, guest preference, hotel brand relationship). This separates what LLMs are good at (competitive set identification) from what they’re bad at (single-winner selection from comparable options).
3. Don’t trust prompt-engineering claims that skip statistical tests.
Many of the formatting changes we tried looked like improvements on a handful of queries. The per-query noise is large enough that anything that moves STSR from 0.36 to 0.33 can be an illusion. Mann-Whitney U against the standard baseline is the honest test — and 13 of the 15 interventions fail to show a significant improvement on it.
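The shortlist-plus-tiebreaker architecture from point 2 can be sketched in a few lines. `ask_llm_for_top3` is a hypothetical stand-in for the model call, stubbed here so the tiebreak logic runs on its own:

```python
def ask_llm_for_top3(hotels):
    # Placeholder: in production this prompts the model for a
    # top-3 shortlist instead of a single winner.
    return hotels[:3]

def pick_winner(hotels, tiebreak_key=lambda h: h["price"]):
    """LLM chooses the competitive set; a deterministic rule breaks the tie."""
    shortlist = ask_llm_for_top3(hotels)
    return min(shortlist, key=tiebreak_key)

hotels = [
    {"name": "A", "price": 210},
    {"name": "B", "price": 185},
    {"name": "C", "price": 240},
    {"name": "D", "price": 150},
]
winner = pick_winner(hotels)  # cheapest of the stubbed shortlist {A, B, C}
```

The tiebreak rule (price here) is the swappable part: availability, guest preference, or a brand relationship would slot into `tiebreak_key` the same way.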
The evidence
Finding 1 — Only reasoning tokens pass the significance bar
We ranked all 16 conditions (15 interventions + the standard baseline) by mean STSR. Lower is more position-stable. p-values are Mann-Whitney U against the standard baseline.
Mean STSR per intervention — sorted best to worst
976 trials per condition · significance threshold p < 0.05 vs. baseline
Two interventions are statistically significantly *better* than baseline (Standard + reasoning p=0.0105, Pivoted + reasoning p=0.0212). One is significantly *worse* (Triple presentation p=0.0100). The other 12 are noise.
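The significance check itself takes only a few lines. The per-query STSR arrays below are illustrative stand-ins, not our experimental data:

```python
# Sketch of the pairwise test: per-query STSR for an intervention vs. the
# standard baseline, compared with a two-sided Mann-Whitney U test.
from scipy.stats import mannwhitneyu

baseline_stsr  = [0.40, 0.35, 0.38, 0.33, 0.41, 0.36, 0.39, 0.34]
reasoning_stsr = [0.25, 0.22, 0.27, 0.21, 0.26, 0.24, 0.23, 0.20]

stat, p = mannwhitneyu(baseline_stsr, reasoning_stsr, alternative="two-sided")
print(f"U={stat}, p={p:.4f}")  # p < 0.05 -> the shift is unlikely to be noise
```

Because the test is rank-based, it doesn’t assume the per-query STSR values are normally distributed — which they aren’t.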
Finding 2 — Academic reliability metrics agree with STSR
We cross-checked our custom STSR metric against three peer-reviewed inter-rater agreement measures: Fleiss κ, Krippendorff α, and Kendall W. They rank the conditions identically — our metric wasn’t misleading.
Academic consistency per condition — Krippendorff α 0.667 is the social-science threshold for ‘tentative agreement’
| Condition | Fleiss κ | Krippendorff α | Kendall W | Band |
|---|---|---|---|---|
| Standard + reasoning | 0.731 | 0.731 | 0.481 | Substantial |
| Pivoted + reasoning | 0.718 | 0.718 | 0.476 | Substantial |
| Summary + Ranked | 0.690 | 0.691 | 0.697 | Substantial |
| Ranked tables | 0.684 | 0.684 | 0.639 | Substantial |
| Branded | 0.639 | 0.639 | 0.510 | Substantial |
| Open two-step | 0.632 | 0.632 | 0.500 | Substantial |
| Summary header | 0.626 | 0.627 | 0.531 | Substantial |
| RISE | 0.625 | 0.626 | 0.457 | Substantial |
| Two-stage | 0.623 | 0.623 | 0.550 | Substantial |
| Enriched YAML | 0.621 | 0.621 | 0.478 | Substantial |
| Standard (baseline) | 0.596 | 0.597 | 0.413 | Moderate — below 0.667 |
| Pivoted + two-stage | 0.591 | 0.592 | 0.484 | Moderate |
| Factoids | 0.590 | 0.590 | 0.379 | Moderate |
| Grouped factoids | 0.536 | 0.536 | 0.357 | Moderate |
| Pivoted (no reasoning) | 0.532 | 0.532 | 0.398 | Moderate |
| Triple presentation | 0.461 | 0.461 | 0.411 | Moderate (worst) |
The baseline fails the academic bar
Standard GPT-5.4 without reasoning scores Krippendorff α = 0.597 — below the 0.667 threshold that academic research uses to say ‘this rater’s outputs are reliable enough to draw tentative conclusions from.’ By peer-reviewed standards, single-pick hotel recommendations from default GPT should not be treated as deterministic outputs. The industry does anyway.
Finding 3 — Ranked tables improve STSR but via anchoring, not debiasing
Summary + Ranked shows Kendall W = 0.697 — the highest of any condition. But its per-trial agreement with the reasoning version is only 65–66%, meaning on a third of trials it picks a different hotel than the reasoning model would.
The reason: pre-sorted tables contain the answer in the preamble. The model reads “best-rated: Hotel X” at the top and anchors to that. It’s stable because it’s not actually computing a ranking — it’s reading one. Substitute anchoring for positional bias, and the output looks cleaner on paper but isn’t genuinely better.
Finding 4 — Compression and elaboration both fail
We tested three information-theoretic variants that compress or enrich the feature table:
Information-theoretic approaches bracket the baseline without improving it
STSR — all three variants statistically indistinguishable from the standard baseline (0.362)
Giving the model more context (Enriched YAML with min/max/median statistics per feature) didn’t help. Compressing to atomic facts didn’t help. Grouping those facts by feature actively hurt. The model’s positional sensitivity is independent of input format or token count.
Finding 5 — The shortlist experiment: ask for 3, get a reliable answer
Instead of asking “which hotel is best?” we asked “which are the top 3?” — same 10 hotels, same permutations. The consistency jumps dramatically.
98.3%
of single-pick winners appear in the top-3 shortlist — so the model’s ‘winner’ is almost always already inside the stable competitive set
Top-3 shortlist consistency — across permutation pairs
| Metric | Value | Reading |
|---|---|---|
| Mean Jaccard similarity | 0.736 | Very high overlap between permutation pairs |
| Overlap@3 | 0.815 | ~2.4 of 3 picks match between shuffled runs |
| RBO (p=0.9, top-weighted) | 0.793 | Ranks within the top-3 also largely agree |
| Exact 3/3 overlap (permutation pairs) | 51.4% | Half the time both shuffles return the same three hotels |
| 2/3 overlap | 41.8% | Another ~42% of pairs differ by only one hotel |
| 1/3 overlap | 6.6% | Meaningfully different shortlists |
| 0/3 overlap | 0.2% | Virtually never happens |
| Pairs with 2+ hotel overlap | 93.2% | The ‘competitive set’ is reliable |
The single-pick instability (STSR = 0.362) is, almost entirely, tie-breaking noise within this stable shortlist. The LLM knows who’s in the running. It just doesn’t have a reliable way to pick a winner from 3 genuinely comparable candidates.
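The overlap metrics in the table can be computed per permutation pair like this (hotel names are hypothetical):

```python
def jaccard(a, b):
    # Shared hotels over distinct hotels across both shortlists.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def overlap_at_3(a, b):
    # Fraction of the 3 shortlist slots shared between two runs.
    return len(set(a) & set(b)) / 3

run1 = ["Hotel Rex", "Grand Hotel", "The Pier"]
run2 = ["Grand Hotel", "Hotel Rex", "Harbor Inn"]

print(jaccard(run1, run2))       # 2 shared of 4 distinct -> 0.5
print(overlap_at_3(run1, run2))  # 2 of 3 slots shared
```

Averaging these over every pair of permutations of the same query set yields the table’s Mean Jaccard and Overlap@3 values.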
Finding 6 — Open two-step agrees most with the baseline
Pairwise agreement on the same trials:
Per-trial agreement with standard baseline
Fraction of trials where both conditions picked the same hotel
Open two-step has the highest agreement with baseline (80%) and still 75% agreement with reasoning — so it makes broadly similar decisions with slightly better stability. Ranked tables look stable on paper but disagree with reasoning on 34–35% of trials — they’re not improving GPT’s judgment, they’re replacing it with the pre-computed ranking.
The model reliably identifies the competitive set. It does not reliably identify the winner.
Frequently asked questions
Why didn’t any of the multi-step workflows help?
Because they all still require the model to pick a winner from a list at the end. Moving elimination earlier or restructuring the prompt doesn’t change the fundamental attention behavior — position still influences the final choice. The only interventions that help are the ones that change how the model thinks about the input (reasoning tokens) or bypass single-winner selection entirely (top-3 shortlist).
Why does showing the hotels three times make the bias worse?
Showing the same 10 hotels in 3 different orderings creates three times as many attention targets. Every position gets primacy-weighted across all three copies, the model tries to aggregate, and the aggregation amplifies noise rather than averaging it out. STSR jumped to 0.482 — by far the worst condition we measured.
Are reasoning tokens the same as chain-of-thought prompting?
Close, but not identical. OpenRouter / OpenAI ‘reasoning’ tokens are a dedicated reasoning mode with its own token budget. Chain-of-thought prompting (‘think step by step’) in a regular completion may help a little but wasn’t tested in isolation here. Our reasoning conditions use explicit reasoning tokens at medium effort.
If ranked tables score so well on consistency, why not just use them?
Because ‘looks better’ and ‘is better’ diverge here. Ranked tables improve Kendall W (consistency) but agreement with the reasoning baseline drops to 65–66%. You’re buying stability by telling the model which answer to pick, not by helping it decide. If your pre-computed ranking is wrong, the model now reliably picks the wrong hotel. Genuine debiasing has to leave the model room to evaluate.
So what should you actually do?
For user-facing agents: (1) use reasoning-capable models whenever possible, (2) ask for top-3 rather than a single winner, (3) apply deterministic tiebreakers for the final pick. For visibility measurement: track top-3 presence, not win rate.
How we ran the experiment
15,616 total trials · 16 conditions · 61 query sets · 976 trials per condition
15 interventions plus the standard baseline, all on the same inputs: 61 Google Hotels query sets, 10 hotels each, 13 raw features per hotel, 16 random permutations per set. GPT-5.4 at temperature 0 for non-reasoning conditions; medium reasoning tokens enabled for the two reasoning conditions.
Four intervention categories. (A) Reasoning controls: standard and pivoted formats with medium reasoning tokens. (B) Formatting: pivoted YAML, summary header, ranked tables, summary + ranked, triple presentation, branded (real names). (C) Multi-step: two-stage PASS/FAIL, open two-step, RISE iterative elimination, pivoted + two-stage. (D) Information-theoretic: factoids, grouped factoids, enriched YAML (TAP4LLM statistical augmentation).
Metrics. STSR (primary — position-instability), Fleiss κ, Krippendorff α, Kendall W (inter-permutation agreement), Mann-Whitney U for pairwise significance vs. baseline, Top-1 concentration, and for the shortlist experiment: Jaccard, Overlap@3, RBO.
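STSR is our custom metric and its formula isn’t spelled out above; a minimal sketch under the assumption that it is the fraction of permutation pairs whose single pick differs (which matches the ‘flips one in three reshuffles’ framing, lower = more stable):

```python
# Assumed STSR reading: pairwise disagreement rate across shuffled runs.
from itertools import combinations

def stsr(picks):
    pairs = list(combinations(picks, 2))
    return sum(a != b for a, b in pairs) / len(pairs)

# 16 shuffled runs of one query set: a mostly-stable model repeats one hotel.
picks = ["A"] * 12 + ["B"] * 4
print(round(stsr(picks), 3))  # 48 of 120 pairs differ -> 0.4
```

A perfectly position-stable condition would return 0.0; the 0.362 baseline means more than a third of shuffle pairs disagree on the winner.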
Limitations. Single-model study — GPT-5.4 only. Interventions that fail on GPT-5.4 might succeed on other architectures, though Chapter 04’s cross-model comparison suggests the same positional bias is present in Claude and default GPT, and Gemini’s better stability appears intrinsic rather than prompt-induced. Google Hotels data only. 13-feature schema — different feature mixes may behave differently.
Want to see which AI picks your hotel — and which skips you?
Huxo’s AI Visibility Report tests your property across GPT-5.4, Claude, and Gemini. We report top-3 presence, not just ‘the winner,’ because our research shows that’s the only stable signal.