Why this matters to you
Chapter 03 measured positional bias on direct API calls — a clean, controlled setup. But travelers don’t talk to APIs. They talk to ChatGPT, Gemini, or agent-style tools that can reason, search the web, and explain their reasoning. You’d hope the extra context would wash out position effects.
It doesn’t. The tools-and-reasoning layer is where your customers actually meet AI, and at that layer the same “first 4 positions get the attention” pattern we saw in Chapter 03 is still there — just now measured in a harder, messier setting.
Positions 1–4 average 13% each. Positions 7–10 average 7.5% each. Across 6 different agents.
Key findings at a glance
01
Combined agent data is highly significant
Combined chi-squared across the three OpenCode agents = 116.0, p < 0.001. Positional bias is real in agentic settings — not a quirk of direct API calls.
02
Claude has the strongest primacy skew — position 2 at 22%
The Claude agent picks position 2 at 21.96% (2.2× expected). Despite full tool access, it never invoked a single tool. Pure reasoning over a shuffled list still produces bias.
03
Gemini is the only model close to uniform
Gemini chi-squared p = 0.34 (not significant), STSR = 0.199, Krippendorff α = 0.775. It’s also the only model that used websearch (14.96% of trials). External information appears to anchor decisions away from list position.
04
Browser ChatGPT (unauthenticated) has the worst primacy bias
Free-tier browser ChatGPT picks position 1 17.1% of the time (1.7× expected). STSR = 0.48. The consumer-facing free experience is less position-invariant than the direct API.
What this means for your hotel
Three practical implications.
First, agentic AI inherits every bias of the underlying model. Adding tools, reasoning tokens, or an agent loop doesn’t reset the positional problem. Claude as an agent is biased. GPT as an agent is biased. Both, in our test, declined to use tools even when available — they reasoned over the list and the first four positions won disproportionately.
Second, the model that retrieves is the model that’s fair. Only Gemini used websearch (in 15% of trials) — and Gemini is the only model that reached the Krippendorff α ≥ 0.667 academic reliability threshold in this experiment. External information appears to anchor decisions in a way the list alone doesn’t.
Third, unauthenticated browser ChatGPT is a different beast. Free-tier browser users (no login) get an undisclosed older model. Our data shows it has substantially worse primacy bias than direct-API ChatGPT — position 1 picked 17.1% of the time. If you’re measuring “what ChatGPT shows my guests,” you need to test both the logged-in and logged-out paths.
What to do about it
1. Don’t assume agents fix bias. They don’t.
If a vendor claims their agent-based recommendation pipeline is “unbiased because it reasons,” ask for the data. In our tests, Claude and GPT agents — both with full tool access — never invoked a single tool and produced the same (or worse) primacy bias we saw in Chapter 03’s direct API calls.
2. Test the unauthenticated/free path separately.
Free-tier ChatGPT and free-tier Gemini serve different models than the logged-in experience, and behave differently. Our browser ChatGPT runs showed STSR = 0.48 — noticeably worse than direct API GPT-5.4’s 0.368. If travelers find you through a logged-out ChatGPT session, that’s the pipeline that needs measuring, not the paid API.
3. Expect Gemini to be the fairest agent in most markets.
In both Chapter 04 (direct API) and this agent study, Gemini produced the most uniform, most position-invariant results, and it was the only model to clear the academic reliability bar in both studies. If you can only afford to optimize for one agentic channel, Gemini gives you the most signal per unit of effort.
The evidence
Finding 1 — Position distributions show clear primacy bias in Claude and GPT, near-uniform behavior in Gemini
Expected rate if position didn’t matter: 10% per position.
Position selection rate per agent (OpenCode · 8 permutations × 61 query sets)
| Position | Claude (n=428) | Gemini (n=482) | GPT (n=488) |
|---|---|---|---|
| 1 | 8.88% | 11.41% | 11.07% |
| 2 | 21.96% | 9.34% | 14.55% |
| 3 | 13.79% | 11.41% | 12.50% |
| 4 | 16.59% | 12.45% | 13.93% |
| 5 | 10.05% | 10.79% | 8.61% |
| 6 | 7.94% | 8.51% | 8.40% |
| 7 | 4.67% | 7.88% | 6.35% |
| 8 | 7.71% | 8.92% | 8.81% |
| 9 | 5.14% | 10.58% | 8.20% |
| 10 | 3.27% | 8.71% | 7.58% |
| Chi-squared p | < 0.001 | 0.34 | < 0.001 |
Claude’s distribution is striking: position 2 gets 22%, position 10 gets 3.3%. Claude picks hotels in the top half of the list about 2.5× more often than hotels in the bottom half (71% vs 29% of selections). GPT shows a softer version of the same pattern. Gemini alone passes a chi-squared uniformity test.
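The chi-squared test behind the table’s last row is straightforward to reproduce. A minimal sketch, using Claude’s per-position counts reconstructed from the percentages above (n = 428), so the statistic is approximate; the critical value for df = 9 at p = 0.001 is 27.88:

```python
# Claude's per-position picks, reconstructed from the table's
# percentages (n = 428) -- approximate, for illustration only.
claude_counts = [38, 94, 59, 71, 43, 34, 20, 33, 22, 14]

def chi_squared_uniform(counts):
    """Chi-squared statistic against a uniform null distribution."""
    expected = sum(counts) / len(counts)  # 42.8 picks per position
    return sum((o - expected) ** 2 / expected for o in counts)

stat = chi_squared_uniform(claude_counts)
print(round(stat, 1))  # → 132.2, far beyond the 27.88 critical value
```

Anything past 27.88 is significant at p < 0.001 with 10 positions, which is why Claude’s row reads “< 0.001” while Gemini’s near-uniform counts do not clear the bar.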
Finding 2 — Only Gemini reaches the academic reliability threshold
Consistency metrics per agent — academic threshold is Krippendorff α ≥ 0.667 (tentative agreement)
| Agent | STSR | Fleiss κ | Krippendorff α | Kendall W | Band |
|---|---|---|---|---|---|
| Gemini 3.1 Pro (OpenCode) | 0.199 | 0.787 | 0.775 | 0.735 | Substantial |
| GPT-5.4 (OpenCode) | 0.413 | 0.537 | 0.538 | 0.538 | Moderate — below 0.667 |
| Claude Opus 4.6 (OpenCode) | 0.411 | 0.555 | 0.527 | 0.434 | Moderate — below 0.667 |
Agent GPT is *slightly worse* than direct API GPT on the same data
On the same 61 query sets, direct API GPT-5.4 had STSR = 0.368 (Chapter 03). As an OpenCode agent it’s 0.413. The extra overhead of agent reasoning — variable paths, non-zero temperature, multi-turn context — adds noise rather than reducing it. ‘Agentify the problem’ is not automatically better.
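Fleiss κ, one of the consistency metrics in the table, is a standard multi-rater agreement statistic. A minimal sketch of the computation, treating each permutation run as a “rater” assigning each query set to one of the 10 hotels — the run-to-rater mapping is our assumption, not a detail the chapter specifies:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from an items x categories count matrix.
    Each row holds, for one item, how many raters chose each
    category; every row must sum to the same rater count m."""
    n = len(counts)        # items (query sets)
    m = sum(counts[0])     # raters per item (permutation runs)
    k = len(counts[0])     # categories (hotel positions)

    # Mean per-item agreement: P_i = (sum n_ij^2 - m) / (m (m - 1))
    p_bar = sum(
        (sum(c * c for c in row) - m) / (m * (m - 1)) for row in counts
    ) / n

    # Chance agreement P_e from the marginal category proportions
    p_j = [sum(row[j] for row in counts) / (n * m) for j in range(k)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Perfect agreement: all 4 raters pick the same category per item.
print(fleiss_kappa([[4, 0, 0], [0, 4, 0], [4, 0, 0]]))  # → 1.0
```

Values of 1.0 mean perfect agreement and 0 means chance-level; Gemini’s 0.787 lands in the “substantial” band, while GPT and Claude sit in “moderate.”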
Finding 3 — Tool use is what actually moves the needle
Tool usage rates per agent — 1,398 total OpenCode trials
| Agent | Used any tool | Websearch | File/Code tools |
|---|---|---|---|
| Gemini 3.1 Pro | 14.96% | 14.96% | 1.02% (Write/Read/Codesearch) |
| GPT-5.4 | 0% | 0% | 0% |
| Claude Opus 4.6 | 0% | 0% | 0% |
Claude and GPT, both configured with full tool access, never used a single tool. They treated the hotel-selection task as a pure reasoning problem over the list they were given. Gemini is the only model that went and looked at the web in a meaningful fraction of trials — and Gemini is the only model that produced position-invariant results. Correlation is not causation, but the pattern is consistent with “external information anchors decisions away from list position.”
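Measuring the tool-usage rates above reduces to a per-agent tally over trial logs. A minimal sketch under an invented record schema (the `agent` and `tools` field names are our illustration, not the study’s actual log format):

```python
from collections import defaultdict

# Hypothetical trial records -- field names are illustrative only.
trials = [
    {"agent": "gemini", "tools": ["websearch"]},
    {"agent": "gemini", "tools": []},
    {"agent": "claude", "tools": []},
    {"agent": "gpt",    "tools": []},
]

def tool_usage_rates(trials):
    """Fraction of trials, per agent, that invoked at least one tool."""
    total = defaultdict(int)
    used = defaultdict(int)
    for t in trials:
        total[t["agent"]] += 1
        if t["tools"]:
            used[t["agent"]] += 1
    return {agent: used[agent] / total[agent] for agent in total}

print(tool_usage_rates(trials))
# {'gemini': 0.5, 'claude': 0.0, 'gpt': 0.0}
```

Run over the full 1,398-trial log, this style of tally is what produces the 14.96% / 0% / 0% split in the table.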
Finding 4 — Browser UIs (unauthenticated) show different biases again
Unauthenticated browser tests via BrightData Scraping Browser · US residential IPs · fresh sessions (no login, no cookies, no history)
| Browser pipeline | n (valid) | STSR | Fleiss κ | Notable |
|---|---|---|---|---|
| ChatGPT browser (free tier, undisclosed model) | 409 | 0.48 | 0.45 | Strongest primacy — pos 1 at 17.1% (1.7×) |
| Gemini browser (unauthenticated — 2.0 Flash) | 229 | 0.46 | 0.47 | No primacy, but odd peak at pos 7 (14.8%) |
Logged-out ChatGPT is not logged-in ChatGPT
As of February 2026, OpenAI retired GPT-4o, GPT-4.1, o4-mini, and GPT-5 from ChatGPT. Logged-in users get GPT-5.3 as default. Logged-out users get an undisclosed, older, lighter model. Gemini’s unauthenticated path serves 2.0 Flash (Google opened anonymous access in March 2025). The free/unauthenticated browser paths do not behave like the current API models — so visibility measurements on those paths are measuring a different product, with materially different biases.
Claude and GPT agents never used their tools. Gemini did — 15% of the time — and Gemini was the only model that didn’t fail the uniformity test.
Frequently asked questions
Why didn’t the Claude and GPT agents use their tools?
We don’t have a definitive answer. Both were configured with full tool access inside OpenCode. Our working hypothesis is that the hotel-selection prompt reads as a ‘decide among these 10 items’ task, and the models decide it’s a reasoning task rather than a retrieval task. Gemini is the only model that routinely decided to look something up. That decision — not the tools themselves — is what reduced its positional bias.
Why is free-tier browser ChatGPT so much more biased than the API?
Most likely because the browser free tier serves a different (older, lighter) model — not GPT-5.4. OpenAI doesn’t publish which model runs the logged-out tier, so we can’t test the exact same weights. The behavioral difference is consistent with a less reasoning-capable model being more susceptible to surface cues like position.
Will these results generalize beyond this snapshot?
No — and that’s one of the study’s caveats. Agent behavior depends on the model version, the agent framework (OpenCode here), and the tool registry. Gemini’s tool-use rate might look very different in six months if Google changes the default reasoning mode or restricts websearch access. We’re publishing the data as a snapshot of April 2026 behavior, not a universal law.
Why does unauthenticated browser Gemini peak at position 7?
We don’t fully know. The browser pipeline includes a search step, HTML rendering, and post-processing that the API doesn’t. Any of those layers could introduce ordering effects that aren’t simple primacy. The key takeaway for hotels: unauthenticated Gemini is not the same surface as direct API Gemini, and the biases don’t overlap cleanly.
What should hoteliers take away from this chapter?
Two things. First, ‘agent’ is not a magic word — if a partner tells you their AI agent solves hotel discovery fairly, ask whether it actually invokes tools, and measure the output. Second, the path travelers take to reach an AI recommendation matters. Logged-in ChatGPT, logged-out ChatGPT, Gemini app, and Gemini web are four different products with four different biases. A complete AI visibility picture tests all four.
How we ran the experiment
1,398
Agent trials (OpenCode)
638
Browser trials (ChatGPT + Gemini)
61
Query sets
8
Permutations / set (agent)
Three OpenCode agents tested on the same 61 Google Hotels query sets used in Chapters 03–06, with 8 random permutations per set (vs. 16 in the direct-API studies — agents are slower/more expensive). Plus two browser UIs tested via Playwright + BrightData Scraping Browser with US residential IPs. Each browser trial ran in a fresh, unauthenticated session.
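The permutation setup above can be sketched as follows. The seeding scheme is our assumption — the chapter says 8 random permutations per query set but not how they were generated; seeding makes the orderings replayable across all three agents:

```python
import random

def permutations_for_query_set(hotels, n_perms=8, base_seed=0):
    """Seeded shuffles: every agent replays the same n_perms orderings."""
    perms = []
    for i in range(n_perms):
        rng = random.Random(base_seed * 1000 + i)  # one seed per permutation
        order = list(hotels)  # copy so the source list is untouched
        rng.shuffle(order)
        perms.append(order)
    return perms

hotels = [f"hotel_{k}" for k in range(10)]
runs = permutations_for_query_set(hotels, n_perms=8, base_seed=42)
# Reproducible: the same seeds give the same orderings on replay.
assert runs == permutations_for_query_set(hotels, 8, 42)
```

Fixing the orderings per query set means any per-position skew in the results reflects the model’s behavior, not sampling noise in the shuffles.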
Agents tested: anthropic/claude-opus-4.6, google/gemini-3.1-pro-preview-customtools, openai/gpt-5.4 — all via OpenCode with full tool access (websearch, read, write, codesearch). Agents were free to reason, call tools, or decline tools.
Browser pipelines: chatgpt-instant-free-2026-04 and gemini-browser-free-2026-04 — both unauthenticated, both served the free-tier model. Playwright automation through BrightData Scraping Browser, US residential IPs, fresh browser sessions per trial.
Metrics. Position selection distribution per model, chi-squared test of uniformity, STSR, Fleiss κ, Krippendorff α, Kendall W, and tool-usage rates.
Limitations. Snapshot data — model versions, free-tier configurations, and agent frameworks all change. The browser pipeline is inherently noisier than the API. We can’t control which model the ChatGPT free tier serves because OpenAI doesn’t disclose it. Tool invocation rates may depend on prompt phrasing.
Which AI pipeline picks up your hotel — and which quietly skips you?
Huxo’s AI Visibility Report tests your property across logged-in and logged-out AI pipelines — so you see the biases that matter for real travelers, not just for a clean API call.