Why this matters to you
Chapter 03 showed that GPT-5.4 flips its hotel pick roughly one out of every three times you reshuffle the list. That’s a real problem — and a tempting one to “fix” with a prompt tweak. A smarter format. A two-step workflow. A ranked table. Some enrichment.
We tried all of them. Fifteen approaches across four categories: reasoning controls, format changes, multi-step workflows, and information-theoretic augmentations. Most of them look like improvements. Almost none of them are.
This chapter is the “we ran the experiment so you don’t have to” version — and it leads to a very specific architectural recommendation for anyone building a hotel selection agent.
Positional instability is an attention-mechanism problem, not a prompt problem.
Key findings at a glance
01
Only reasoning tokens significantly reduce STSR
Standard GPT with medium reasoning: STSR 0.24 (p=0.0105). Pivoted + reasoning: 0.25 (p=0.0212). No other approach clears p < 0.05 vs. the standard baseline (Mann-Whitney U).
02
Triple presentation makes bias significantly worse
Showing the same hotels in 3 random orderings bloats STSR to 0.48 (p=0.0100). The model averages the wrong way.
03
Ranked tables look good but are just anchoring
Summary + Ranked trends to 0.28 with Kendall W of 0.70 — the highest we measured. But agreement with reasoning is only 65–66%. It’s not debiasing; it’s steering on a pre-computed ranking.
04
The right architecture is shortlist + deterministic tiebreaker
Asking for the top 3 instead of one pick: 98.3% of single-winners appear in the top-3, Jaccard 0.74, Overlap@3 0.82. The model picks the set reliably — and a deterministic rule picks the winner.
What this means for your hotel
Two practical takeaways for anyone thinking about AI visibility.
First, “which prompt trick beats position?” is the wrong question. Our data shows position sensitivity is not a surface behavior — it’s baked into how LLMs attend to ordered lists. Every cosmetic fix we tried (bullet format, YAML, tables, multi-step prompts) left the underlying bias intact. If a vendor tells you they’ve “solved” LLM positional bias with a clever prompt, the evidence says otherwise.
Second, the model is reliable at identifying the shortlist, not the winner. In 93.2% of permutation pairs the top-3 lists overlap by 2 or 3 hotels. In 98.3% of trials, the single-pick winner was already in that stable top-3. So if you’re benchmarking “which hotels does AI recommend?”, measuring top-3 presence is far more informative than measuring win rate.
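To make the measurement difference concrete, here is a minimal sketch of win rate vs. top-3 presence. The run data and hotel names are hypothetical:

```python
# Hypothetical run data: the model's single pick and top-3 shortlist
# for three shuffles of the same query.
runs = [
    {"winner": "Grand Hotel", "top3": ["Grand Hotel", "Hotel Rex", "The Pier"]},
    {"winner": "Hotel Rex",   "top3": ["Hotel Rex", "Grand Hotel", "The Pier"]},
    {"winner": "Grand Hotel", "top3": ["Grand Hotel", "The Pier", "Hotel Rex"]},
]

def win_rate(hotel, runs):
    # Fraction of runs where this hotel was the single pick.
    return sum(r["winner"] == hotel for r in runs) / len(runs)

def top3_presence(hotel, runs):
    # Fraction of runs where this hotel made the shortlist.
    return sum(hotel in r["top3"] for r in runs) / len(runs)

print(win_rate("Hotel Rex", runs))       # 1 of 3 runs
print(top3_presence("Hotel Rex", runs))  # 3 of 3 runs
```

On this toy data Hotel Rex “wins” only a third of the time, yet it sits in the competitive set on every shuffle — and that second number is the stable signal worth reporting.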
What to do about it
1. When you benchmark AI visibility, measure top-3 presence, not single-pick wins.
Single-pick STSR = 0.36. Top-3 Jaccard = 0.74. Your position in the model’s competitive set is a far more stable, repeatable signal than whether you happen to be the winner on a given shuffle. A “won 4 of 10 runs” measurement is almost meaningless. “Appeared in top-3 in 9 of 10 runs” is a real number.
2. If you’re building an agent, ask for 3, not 1.
The right architecture is: prompt the LLM for a top-3 shortlist, then apply a deterministic or user-defined rule to break the tie (price, availability, guest preference, hotel brand relationship). This separates what LLMs are good at (competitive set identification) from what they’re bad at (single-winner selection from comparable options).
3. Don’t trust prompt-engineering claims that skip statistical tests.
Many of the formatting changes we tried looked like improvements on a handful of queries. The per-query noise is large enough that anything that moves STSR from 0.36 to 0.33 can be an illusion. Mann-Whitney U against the standard baseline is the honest test — and 13 of the 15 interventions fail to show a significant improvement on it.
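The shortlist-plus-tiebreaker architecture from point 2 can be sketched in a few lines. `ask_llm_for_top3` is a hypothetical stand-in for the model call, stubbed here so the tiebreak logic runs on its own:

```python
def ask_llm_for_top3(hotels):
    # Placeholder: in production this prompts the model for a
    # top-3 shortlist instead of a single winner.
    return hotels[:3]

def pick_winner(hotels, tiebreak_key=lambda h: h["price"]):
    """LLM chooses the competitive set; a deterministic rule breaks the tie."""
    shortlist = ask_llm_for_top3(hotels)
    return min(shortlist, key=tiebreak_key)

hotels = [
    {"name": "A", "price": 210},
    {"name": "B", "price": 185},
    {"name": "C", "price": 240},
    {"name": "D", "price": 150},
]
winner = pick_winner(hotels)  # cheapest of the stubbed shortlist {A, B, C}
```

The tiebreak rule (price here) is the swappable part: availability, guest preference, or a brand relationship would slot into `tiebreak_key` the same way.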
The evidence
Finding 1 — Only reasoning tokens pass the significance bar
We ranked all 16 conditions (15 interventions + the standard baseline) by mean STSR. Lower is more position-stable. p-values are Mann-Whitney U against the standard baseline.
Mean STSR per intervention — sorted best to worst
976 trials per condition · significance threshold p < 0.05 vs. baseline
Two interventions are statistically significantly *better* than baseline (Standard + reasoning p=0.0105, Pivoted + reasoning p=0.0212). One is significantly *worse* (Triple presentation p=0.0100). The other 12 are noise.
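The significance check itself takes only a few lines. The per-query STSR arrays below are illustrative stand-ins, not our experimental data:

```python
# Sketch of the pairwise test: per-query STSR for an intervention vs. the
# standard baseline, compared with a two-sided Mann-Whitney U test.
from scipy.stats import mannwhitneyu

baseline_stsr  = [0.40, 0.35, 0.38, 0.33, 0.41, 0.36, 0.39, 0.34]
reasoning_stsr = [0.25, 0.22, 0.27, 0.21, 0.26, 0.24, 0.23, 0.20]

stat, p = mannwhitneyu(baseline_stsr, reasoning_stsr, alternative="two-sided")
print(f"U={stat}, p={p:.4f}")  # p < 0.05 -> the shift is unlikely to be noise
```

Because the test is rank-based, it doesn’t assume the per-query STSR values are normally distributed — which they aren’t.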
Finding 2 — Academic reliability metrics agree with STSR
We cross-checked our custom STSR metric against three peer-reviewed inter-rater agreement measures: Fleiss κ, Krippendorff α, and Kendall W. They rank the conditions identically — our metric wasn’t misleading.
Academic consistency per condition — Krippendorff α 0.667 is the social-science threshold for ‘tentative agreement’
| Condition | Fleiss κ | Krippendorff α | Kendall W | Band |
|---|---|---|---|---|
| Standard + reasoning | 0.731 | 0.731 | 0.481 | Substantial |
| Pivoted + reasoning | 0.718 | 0.718 | 0.476 | Substantial |
| Summary + Ranked | 0.690 | 0.691 | 0.697 | Substantial |
| Ranked tables | 0.684 | 0.684 | 0.639 | Substantial |
| Branded | 0.639 | 0.639 | 0.510 | Substantial |
| Open two-step | 0.632 | 0.632 | 0.500 | Substantial |
| Summary header | 0.626 | 0.627 | 0.531 | Substantial |
| RISE | 0.625 | 0.626 | 0.457 | Substantial |
| Two-stage | 0.623 | 0.623 | 0.550 | Substantial |
| Enriched YAML | 0.621 | 0.621 | 0.478 | Substantial |
| Standard (baseline) | 0.596 | 0.597 | 0.413 | Moderate — below 0.667 |
| Pivoted + two-stage | 0.591 | 0.592 | 0.484 | Moderate |
| Factoids | 0.590 | 0.590 | 0.379 | Moderate |
| Grouped factoids | 0.536 | 0.536 | 0.357 | Moderate |
| Pivoted (no reasoning) | 0.532 | 0.532 | 0.398 | Moderate |
| Triple presentation | 0.461 | 0.461 | 0.411 | Moderate (worst) |
The baseline fails the academic bar
Standard GPT-5.4 without reasoning scores Krippendorff α = 0.597 — below the 0.667 threshold that academic research uses to say ‘this rater’s outputs are reliable enough to draw tentative conclusions from.’ By peer-reviewed standards, single-pick hotel recommendations from default GPT should not be treated as deterministic outputs. The industry does anyway.
Finding 3 — Ranked tables improve STSR but via anchoring, not debiasing
Summary + Ranked shows Kendall W = 0.697 — the highest of any condition. But its per-trial agreement with the reasoning version is only 65–66%, meaning on a third of trials it picks a different hotel than the reasoning model would.
The reason: pre-sorted tables contain the answer in the preamble. The model reads “best-rated: Hotel X” at the top and anchors to that. It’s stable because it’s not actually computing a ranking — it’s reading one. Substitute anchoring for positional bias, and the output looks cleaner on paper but isn’t genuinely better.
Finding 4 — Compression and elaboration both fail
We tested three information-theoretic variants that compress or enrich the feature table:
Information-theoretic approaches bracket the baseline without improving it
STSR — all three variants statistically indistinguishable from the standard baseline (0.362)
Giving the model more context (Enriched YAML with min/max/median statistics per feature) didn’t help. Compressing to atomic facts didn’t help. Grouping those facts by feature actively hurt. The model’s positional sensitivity is independent of input format or token count.
Finding 5 — The shortlist experiment: ask for 3, get a reliable answer
Instead of asking “which hotel is best?” we asked “which are the top 3?” — same 10 hotels, same permutations. The consistency jumps dramatically.
98.3%
of single-pick winners appear in the top-3 shortlist — so the model’s ‘winner’ is almost always already inside the stable competitive set
Top-3 shortlist consistency — across permutation pairs
| Metric | Value | Reading |
|---|---|---|
| Mean Jaccard similarity | 0.736 | Very high overlap between permutation pairs |
| Overlap@3 | 0.815 | ~2.4 of 3 picks match between shuffled runs |
| RBO (p=0.9, top-weighted) | 0.793 | Ranks within the top-3 also largely agree |
| Exact 3/3 overlap (permutation pairs) | 51.4% | Half the time both shuffles return the same three hotels |
| 2/3 overlap | 41.8% | Another ~42% of pairs differ by only one hotel |
| 1/3 overlap | 6.6% | Meaningfully different shortlists |
| 0/3 overlap | 0.2% | Virtually never happens |
| Pairs with 2+ hotel overlap | 93.2% | The ‘competitive set’ is reliable |
The single-pick instability (STSR = 0.362) is, almost entirely, tie-breaking noise within this stable shortlist. The LLM knows who’s in the running. It just doesn’t have a reliable way to pick a winner from 3 genuinely comparable candidates.
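The overlap metrics in the table can be computed per permutation pair like this (hotel names are hypothetical):

```python
def jaccard(a, b):
    # Shared hotels over distinct hotels across both shortlists.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def overlap_at_3(a, b):
    # Fraction of the 3 shortlist slots shared between two runs.
    return len(set(a) & set(b)) / 3

run1 = ["Hotel Rex", "Grand Hotel", "The Pier"]
run2 = ["Grand Hotel", "Hotel Rex", "Harbor Inn"]

print(jaccard(run1, run2))       # 2 shared of 4 distinct -> 0.5
print(overlap_at_3(run1, run2))  # 2 of 3 slots shared
```

Averaging these over every pair of permutations of the same query set yields the table’s Mean Jaccard and Overlap@3 values.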
Finding 6 — Open two-step agrees most with the baseline
Pairwise agreement on the same trials:
Per-trial agreement with standard baseline
Fraction of trials where both conditions picked the same hotel
Open two-step has the highest agreement with baseline (80%) and still 75% agreement with reasoning — so it makes broadly similar decisions with slightly better stability. Ranked tables look stable on paper but disagree with reasoning on 34–35% of trials — they’re not improving GPT’s judgment, they’re replacing it with the pre-computed ranking.
The model reliably identifies the competitive set. It does not reliably identify the winner.
Frequently asked questions
Why didn’t any of the multi-step workflows help?
Because they all still require the model to pick a winner from a list at the end. Moving elimination earlier or restructuring the prompt doesn’t change the fundamental attention behavior — position still influences the final choice. The only interventions that help are the ones that change how the model thinks about the input (reasoning tokens) or bypass single-winner selection entirely (top-3 shortlist).
Why does showing the hotels three times make the bias worse?
Showing the same 10 hotels in 3 different orderings creates three times as many attention targets. Every position gets primacy-weighted across all three copies, the model tries to aggregate, and the aggregation amplifies noise rather than averaging it out. STSR jumped to 0.482 — by far the worst condition we measured.
Are reasoning tokens the same as chain-of-thought prompting?
Close, but not identical. OpenRouter / OpenAI ‘reasoning’ tokens are a dedicated reasoning mode with its own token budget. Chain-of-thought prompting (‘think step by step’) in a regular completion may help a little but wasn’t tested in isolation here. Our reasoning conditions use explicit reasoning tokens at medium effort.
If ranked tables score so well on consistency, why not just use them?
Because ‘looks better’ and ‘is better’ diverge here. Ranked tables improve Kendall W (consistency) but agreement with the reasoning baseline drops to 65–66%. You’re buying stability by telling the model which answer to pick, not by helping it decide. If your pre-computed ranking is wrong, the model now reliably picks the wrong hotel. Genuine debiasing has to leave the model room to evaluate.
So what should you actually do?
For user-facing agents: (1) use reasoning-capable models whenever possible, (2) ask for top-3 rather than a single winner, (3) apply deterministic tiebreakers for the final pick. For visibility measurement: track top-3 presence, not win rate.
How we ran the experiment
15,616 total trials · 16 conditions · 61 query sets · 976 trials per condition
15 interventions plus the standard baseline, all on the same inputs: 61 Google Hotels query sets, 10 hotels each, 13 raw features per hotel, 16 random permutations per set. GPT-5.4 at temperature 0 for non-reasoning conditions; medium reasoning tokens enabled for the two reasoning conditions.
Four intervention categories. (A) Reasoning controls: standard and pivoted formats with medium reasoning tokens. (B) Formatting: pivoted YAML, summary header, ranked tables, summary + ranked, triple presentation, branded (real names). (C) Multi-step: two-stage PASS/FAIL, open two-step, RISE iterative elimination, pivoted + two-stage. (D) Information-theoretic: factoids, grouped factoids, enriched YAML (TAP4LLM statistical augmentation).
Metrics. STSR (primary — position-instability), Fleiss κ, Krippendorff α, Kendall W (inter-permutation agreement), Mann-Whitney U for pairwise significance vs. baseline, Top-1 concentration, and for the shortlist experiment: Jaccard, Overlap@3, RBO.
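STSR is our custom metric and its formula isn’t spelled out above; a minimal sketch under the assumption that it is the fraction of permutation pairs whose single pick differs (which matches the ‘flips one in three reshuffles’ framing, lower = more stable):

```python
# Assumed STSR reading: pairwise disagreement rate across shuffled runs.
from itertools import combinations

def stsr(picks):
    pairs = list(combinations(picks, 2))
    return sum(a != b for a, b in pairs) / len(pairs)

# 16 shuffled runs of one query set: a mostly-stable model repeats one hotel.
picks = ["A"] * 12 + ["B"] * 4
print(round(stsr(picks), 3))  # 48 of 120 pairs differ -> 0.4
```

A perfectly position-stable condition would return 0.0; the 0.362 baseline means more than a third of shuffle pairs disagree on the winner.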
Limitations. Single-model study — GPT-5.4 only. Interventions that fail on GPT-5.4 might succeed on other architectures, though Chapter 04’s cross-model comparison suggests the same positional bias is present in Claude and default GPT, and Gemini’s better stability appears intrinsic rather than prompt-induced. Google Hotels data only. 13-feature schema — different feature mixes may behave differently.
Want to see which AI picks your hotel — and which skips you?
Huxo’s AI Visibility Report tests your property across GPT-5.4, Claude, and Gemini. We report top-3 presence, not just ‘the winner,’ because our research shows that’s the only stable signal.