Background · Optional reading

How we ran it — shared methodology across eight chapters

You don’t have to read this to understand the findings. But if you want to know where the numbers come from — what a query set is, what a permutation is, why we use Fleiss κ instead of plain agreement — this is the page.

Feb 14, 2026 · 5 min read · Huxo Research

Why this research exists

Hotel search is being rewritten in front of us. Instead of scrolling ten blue links, a growing share of travelers type a question into ChatGPT, Gemini, or Claude and read the answer. If an AI doesn’t pick your hotel, the traveler doesn’t see it.

We wanted to understand how these models pick. Not with anecdotes — with controlled experiments, real hotel data, and statistics. Over eight chapters we measured which features move the decision, which don’t, how much position alone biases the pick, how different models disagree with each other, and whether any of it can be fixed.

The goal: give hotel operators a clear picture of what they’re being judged on, and give researchers reproducible numbers to build on.

Every chapter reports effect sizes with confidence intervals, not “AI does X.”

Query sets & data

The core dataset is 61 query sets drawn from real Google Hotels results. Each set is a snapshot of the 10 hotels returned for a real travel query in a real market — names, star ratings, review scores, prices, amenities, nearby-places descriptions, and the rest of the 13 fields Google Hotels exposes on a listing.

We use real data because synthetic hotels can’t test brand recognition, location priors, or any signal that depends on the model’s training data. Real hotels expose real signals.

Why 61? Enough sets to separate signal from market-specific noise, few enough that we can sanity-check every chart by hand. Chapters 02 (Brand), 08 (Geographic), and 05 (Human vs. AI) use additional purpose-built datasets on top of the 61 core sets.


Permutation design

A single ask (“here are 10 hotels, pick one”) gives you one answer. That’s not an experiment, it’s an anecdote. To get statistics, we shuffle.

For each query set we build 16 random permutations of the same 10 hotels and run every permutation through the model. Same hotels, same attributes, different orderings. Any change in the pick has to be caused by order alone.

That’s the mechanic behind almost every number in this research:

  • 976 trials per condition = 61 query sets × 16 permutations
  • 3,904 trials in Chapter 04 = 976 × 4 models
  • 13,664 trials in Chapter 01 = 976 baseline + 976 × 13 feature knockouts
  • 15,616 trials in Chapter 06 = 976 × 16 intervention conditions
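As a concrete sketch of that mechanic (the hotel names and the seed are placeholders, not the real pipeline), the permutation loop and the trial arithmetic look roughly like this:

```python
import random

def permutations_for_set(hotels, n_perm=16, seed=0):
    """Generate n_perm random orderings of the same 10 hotels.

    Each ordering is one trial: the hotels and their attributes
    never change, only their position in the list.
    """
    rng = random.Random(seed)
    perms = []
    for _ in range(n_perm):
        order = hotels[:]   # copy, then shuffle in place
        rng.shuffle(order)
        perms.append(order)
    return perms

hotels = [f"hotel_{i}" for i in range(10)]  # placeholder names
perms = permutations_for_set(hotels)

# Trial accounting from the bullet list above:
trials_per_condition = 61 * 16    # 976
ch01_trials = 976 + 976 * 13      # baseline + 13 feature knockouts = 13,664
```

Because only the ordering changes between trials, any shift in the model's pick across the 16 permutations can be attributed to position.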

Chapter 08 (Geographic) uses a different design (30 reps × 20 queries × 2 conditions = 1,200 trials) because ChatGPT browser sessions are much slower than API calls, so we traded permutation depth for longer queries.


The statistics we use

We report more than one metric per finding because any single metric can be gamed or misread. Here’s the vocabulary you’ll see across chapters.

STSR (Set-level Top-1 Shuffle Rate). Fraction of permutation pairs within a query set that disagree on the #1 pick. STSR = 0 means the model always picks the same hotel regardless of order; STSR = 1 means every shuffle yields a different winner. This is our headline positional-bias number (Chapter 03).

PCM (Permutation Consistency to Mode). Fraction of permutations within a set that agree with the most common pick. Complements STSR: tells you “when the model does have a favorite, how often does it stick to it?”
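Both set-level metrics are simple to state in code. A minimal sketch, where the picks list is a made-up example (12 permutations picking hotel A, 3 picking B, 1 picking C), not real data:

```python
from collections import Counter
from itertools import combinations

def stsr(picks):
    """Set-level Top-1 Shuffle Rate: fraction of permutation pairs
    whose #1 picks disagree. 0 = fully stable, 1 = every pair differs."""
    pairs = list(combinations(picks, 2))
    return sum(a != b for a, b in pairs) / len(pairs)

def pcm(picks):
    """Permutation Consistency to Mode: fraction of permutations
    that agree with the most common pick."""
    _, mode_count = Counter(picks).most_common(1)[0]
    return mode_count / len(picks)

# 16 permutations of one query set; each entry is the hotel picked.
picks = ["A"] * 12 + ["B"] * 3 + ["C"]
```

On this example, pcm(picks) is 12/16 = 0.75 (the model "sticks with" hotel A three times out of four), while stsr(picks) counts the 51 of 120 permutation pairs that disagree.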

Fleiss κ / Krippendorff α / Kendall W. Three inter-rater reliability metrics. We treat the 16 permutations as 16 “raters” and ask whether they agree on the ranking. By the usual academic convention, κ or α ≥ 0.667 counts as substantial agreement; below that, the model’s picks are closer to random than to consistent.
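For reference, Fleiss κ can be computed directly from a counts table (rows = query sets, columns = hotels, cells = how many of the 16 permutations picked that hotel). This hand-rolled sketch follows the textbook formula rather than any particular library, and the two example tables are toy data:

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a counts table: rows = subjects (query sets),
    columns = categories (hotels), cell = number of raters
    (permutations) that chose that category."""
    n = sum(table[0])   # raters per subject (here, 16 permutations)
    N = len(table)      # subjects (query sets)
    k = len(table[0])   # categories (hotels)
    # Per-subject observed agreement.
    P = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in table]
    P_bar = sum(P) / N
    # Chance agreement from marginal category proportions.
    p = [sum(row[j] for row in table) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)

perfect = [[16, 0], [0, 16]]  # every permutation agrees within each set
split = [[8, 8], [8, 8]]      # permutations split 50/50 on every set
```

Perfect within-set agreement yields κ = 1 even though the two sets favor different hotels; a 50/50 split drops κ to chance level or below.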

Jensen-Shannon divergence. Distance between two probability distributions. We use it to measure how differently two models distribute their picks across the 10 positions (Chapter 04).
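A minimal base-2 implementation (so the value runs from 0 for identical distributions to 1 for fully disjoint ones); the example distributions below are illustrative, not measured:

```python
from math import log2

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two pick
    distributions over the 10 list positions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms.
        return sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

model_a = [0.55] + [0.05] * 9  # hypothetical position-1-heavy model
model_b = [0.1] * 10           # hypothetical uniform model
```

Two models with identical pick distributions score 0; models that never overlap score 1, with real model pairs landing somewhere in between.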

Spearman ρ. Rank correlation. If the model’s full ranking (not just the #1 pick) is stable under shuffling, ρ stays near 1. If the ranking flips, ρ drops.
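With full rankings of the same 10 hotels and no ties, Spearman ρ reduces to the classic sum-of-squared-rank-differences formula. A sketch over two hypothetical orderings:

```python
def ranks(ordering):
    """Map each hotel to its 1-based rank in one model ranking."""
    return {h: i + 1 for i, h in enumerate(ordering)}

def spearman_rho(order_a, order_b):
    """Spearman rank correlation between two tie-free rankings,
    via rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    ra, rb = ranks(order_a), ranks(order_b)
    n = len(order_a)
    d2 = sum((ra[h] - rb[h]) ** 2 for h in order_a)
    return 1 - 6 * d2 / (n * (n * n - 1))

base = list("ABCDEFGHIJ")  # a hypothetical ranking of 10 hotels
```

An unchanged ranking scores ρ = 1; a fully reversed one scores ρ = −1, so stability under shuffling shows up as ρ staying near 1 across permutations.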

McNemar / Fisher / Mann-Whitney U / χ² / binomial. Standard significance tests. McNemar for paired before/after (Chapter 02 brand-visible vs. brand-masked); Fisher exact for 2×2 tables with small counts (Chapter 08); Mann-Whitney U for ordinal outcomes; χ² for the position-uniformity null; binomial against a known baseline rate.
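As one example of the family, the exact one-sided binomial test against a known baseline rate is short enough to write out; the observed counts below are illustrative only, and the chapters themselves use standard stats tooling:

```python
from math import comb

def binomial_p_upper(k, n, p0):
    """One-sided exact binomial p-value: probability of seeing k or
    more successes in n trials if the true rate were the baseline p0."""
    return sum(comb(n, i) * p0 ** i * (1 - p0) ** (n - i)
               for i in range(k, n + 1))

# Hypothetical example: a model picks position 1 in 30 of 100 trials;
# under a uniform baseline over 10 positions, p0 = 0.1.
p_value = binomial_p_upper(30, 100, 0.1)
```

A p-value this small would reject the uniform-position null decisively; the same function with k near n · p0 returns a large p-value, as expected.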

Every significance claim in an article is tied to one of these tests with the test statistic and p-value reported inline, not hidden in a footnote.


Models tested

Across the eight chapters we used different model configurations depending on the question being asked.

| Model | Configuration | Used in |
| --- | --- | --- |
| GPT-5.4 | OpenAI API, temperature 0, no reasoning tokens | Ch 01, 02, 03, 04, 05, 06, 07 |
| GPT-5.4 with reasoning | OpenRouter reasoning tokens, medium effort | Ch 04, 06 |
| Claude Opus 4.6 | Anthropic API, default config | Ch 04, 07 |
| Gemini 3.1 Pro | Google API, reasoning on by default | Ch 04, 07 |
| ChatGPT browser (free) | Playwright + BrightData, unauthenticated, US IPs | Ch 07, 08 |
| Gemini browser (free) | Playwright + BrightData, unauthenticated, US IPs | Ch 07 |

All API configurations are fixed at temperature 0 unless stated otherwise, so that every source of variation comes from the manipulation being tested (ordering, feature knockout, model swap, prompt change) rather than from sampling randomness.


Shared limitations

Several caveats apply to almost every chapter. They’re noted individually in each article’s Methodology section, but collected here:

  • Google Hotels is the data surface. The 13 features we knock out in Chapter 01 are the 13 that Google Hotels exposes. A hotel’s own website may expose more or different signals that we don’t test.
  • Temperature 0 and no tool use, unless explicitly stated. Chapters 01–06 test the model as a pure reasoning task over a given list. Chapter 07 relaxes this (agents with tools) and Chapter 08 tests the full browser surface (websearch on).
  • Snapshot data. All numbers come from the production models available in April 2026. API behavior changes, so the specific numbers may drift. The patterns — position matters, brand matters, models disagree — have been stable across our pilot runs going back months.
  • English-language queries, US / EU markets. Our query sets skew toward English-language travel queries in US and Western European markets. Chinese-language models or non-Latin-script markets aren’t represented.
  • Hotel recommendation only. We test hotel selection specifically. Generalizing to other recommendation domains (restaurants, flights, tours) requires replication.

If you find a finding that looks off — wrong number, unclear statistic, missing limitation — email research@huxo.ai. Criticism that sharpens the research is welcome.

Back to the research

Start from the research index, or jump directly to a chapter:
