Chapter 05 · Human vs LLM

Would a human pick the same hotel? Mostly — but not for the same reasons

Human travelers lead with ratings, then location, then price. GPT-5.4 leads with location context, treats ratings as secondary, and skips the elimination step humans do first. The mismatch is subtle but consequential.

Mar 20, 2026 · 9 min read · Huxo Research

Why this matters to you

AI is increasingly the first filter travelers use to shortlist hotels. A guest might still make the final call, but the AI decides which hotels they see in the first place. If the AI’s decision process doesn’t match the human decision process, some hotels that travelers would love are never surfaced — and some hotels that the AI confidently recommends wouldn’t have survived a human shortlist.

This chapter compares how actual humans select hotels (conjoint studies, GBTA surveys, revenue management research) against how GPT-5.4 selects hotels in our experiments. The mismatches are where AI quietly fails guests — and where hoteliers can’t rely on “just make a great hotel” to get recommended.

Humans eliminate, then compare. AI compares all at once — and the result is different.

Key findings at a glance

01 · Same top 3 drivers, different order

Humans: reviews (51%) → location (48%) → price (42%). GPT-5.4: location context (45%) → overall rating (33%) → price (29%). Both agree on what matters — they disagree on priority.

02 · Humans eliminate first. AI doesn’t.

Humans use a two-stage funnel: filter on hard constraints (budget, location), then trade off among survivors. GPT-5.4 weighs all 10 hotels on all features simultaneously — compensatory evaluation with no elimination step.

03 · Review count is a human trust signal, not an AI one

A hotel with 3,000 reviews vs 30 is a clear trust cue to humans (3.5% revenue premium). GPT-5.4 treats review_count as a weak tiebreaker (OR = 0.15) — the credibility dimension is missing.

04 · Both biased toward brands — for different reasons

Humans choose brands for loyalty points, status, and trust built from past stays. The AI has no loyalty program — its 11-point brand lift is a training-data artifact, not a rational preference.


What this means for your hotel

The two divergences with the biggest commercial consequences are elimination and review volume. Humans ruthlessly drop hotels that violate a constraint — over budget, wrong neighborhood, recent bad reviews — before they compare anything. GPT-5.4 never does that; every hotel stays in the running and gets scored. That means a hotel with one fatal flaw in a human’s eyes may still appear in the AI’s top 3, and conversely, a hotel the human would love can get crowded out by compensatory arithmetic.

Review volume is the other quiet gap. The empirical literature is clear that past ~50 reviews, hotels capture about 3.5% more revenue — reviews are a trust proxy for humans. The AI doesn’t treat reviews that way. If your hotel has 4.7★ over 30 reviews and a competitor has 4.6★ over 3,000, a human booker weights the competitor’s credibility; the AI treats it as a near-coin-flip.


What to do about it

1. Double down on the features AI values that humans also value.

Location context (nearby landmarks, transit, walking distances) is #1 for GPT-5.4 and #2 for humans. Investing here wins both audiences. Price visibility and clear review counts help on both sides too. You don’t need to choose between optimizing for AI and optimizing for humans — there’s a large overlap.

2. Make your elimination story obvious.

If your hotel has a genuine deal-breaker advantage — pet-friendly when competitors aren’t, true beach access when others are “beach view,” wheelchair-accessible when it matters — make it structural and explicit in the data the AI reads. The AI won’t eliminate hotels on those attributes the way a human would, but putting them front-and-center nudges the AI’s compensatory math in your favor.

3. Close the review-volume credibility gap.

Humans use review count as a trust shortcut; AI doesn’t. Competing on rating quality alone cedes the trust dimension at the moment humans make the final call after the AI shortlists. Review-volume growth — ethically, via stay follow-ups — is an underrated lever in an AI-mediated booking funnel.


The evidence

Finding 1 — Same three drivers, different priority order

Human importance here comes from the empirical literature: conjoint studies that measure how much each feature explains choice variance, plus GBTA and Skift surveys. LLM importance is measured via our feature-knockout test (Chapter 01): the shift rate when the feature is removed.

Feature weight — humans (conjoint literature) vs. GPT-5.4 (feature knockout). Methods differ, so the comparison is directional, not numeric.

| Rank | Humans (conjoint %) | GPT-5.4 (shift rate) | Alignment |
| --- | --- | --- | --- |
| #1 | Reviews / ratings · 51% | nearby_places · 45% | Divergent |
| #2 | Location · 48% | overall_rating · 33% | Partial — reversed order |
| #3 | Price · 42% | price_per_night · 29% | Aligned |
| Amenities | ~8% of variance | 16–23% shift, OR < 0.15 | Both secondary |
| Review count | Credibility signal (+3.5% revenue at >50 reviews) | Behaves like an amenity (OR 0.15) | Divergent |
| Brand | 82% of business travelers say loyalty matters | +11pp causal lift from name visibility | Both biased, different mechanism |

Why the methods differ

The human ‘51% of variance’ figure comes from conjoint analysis — a statistical decomposition of which attributes explain stated choices. The ‘45% shift rate’ for nearby_places is ours — the fraction of trials where removing the feature flipped the LLM’s pick. The metrics aren’t numerically comparable, but the rank order is meaningful. The takeaway is that both systems agree on the top-3 features but disagree on priority.
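The knockout metric itself is simple to state precisely. A minimal sketch, assuming each trial records the model’s pick with and without the feature present (all data and names here are illustrative, not Huxo’s actual pipeline):

```python
def shift_rate(picks_full, picks_knocked):
    """Fraction of paired trials where removing a feature flipped the model's pick."""
    assert len(picks_full) == len(picks_knocked)
    flips = sum(1 for a, b in zip(picks_full, picks_knocked) if a != b)
    return flips / len(picks_full)

# Toy example: picks over 8 trials, with and without the feature in the listing.
full    = ["H2", "H5", "H1", "H2", "H7", "H3", "H2", "H9"]
knocked = ["H2", "H4", "H1", "H6", "H7", "H3", "H8", "H9"]
print(shift_rate(full, knocked))  # 0.375 — 3 of 8 picks flipped
```

A shift rate of 45% for nearby_places means removing that one field flipped the pick in nearly half of all trials.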

Finding 2 — The decision process is structurally different

Human travelers use a documented two-stage funnel: eliminate on hard constraints first (wrong city, over budget, recent bad reviews), then compensate among survivors (trade off rating vs price vs amenities). GPT-5.4 doesn’t eliminate. It scores everything in one pass.

Decision process — humans vs GPT-5.4

| Dimension | Humans | GPT-5.4 |
| --- | --- | --- |
| Structure | Two-stage: eliminate, then compare | Single-pass compensatory |
| Candidates considered | ~3 hotels before booking | All 10 presented — unequal attention (13% pos 1 vs 7% pos 10) |
| Deal-breakers | Negative reviews drop choice to ~0 regardless of price | No hard thresholds — everything compensates |
| Loss aversion | Losses weighted 1.5–2× vs gains | No structural mechanism for asymmetric weighting |
| Price behavior | Elastic for leisure, inelastic for business | Price stabilizes decisions but doesn’t dominate selection |

The human two-stage funnel is why bad reviews are so damaging to humans and relatively less damaging to AI: humans eliminate; AI compensates. The flipside: a hotel that survives elimination in a human’s mind enters a small compensatory set (~3 hotels). In the AI’s mind, the same hotel is one of 10 competing on every feature at once.

Finding 3 — Position sensitivity in humans vs GPT-5.4

Humans are position-sensitive too — ballot order effects, primacy in long lists, etc. How does GPT-5.4 compare?

Position sensitivity — humans (informed product choice, ballot studies) vs GPT-5.4 (our hotel experiments) vs traditional rec systems (click data)

| System | Position-1 share shift | STSR | Context |
| --- | --- | --- | --- |
| Humans (high-info) | +2–5% | — | Ballot order, informed product choice |
| Humans (low-info) | +5–15% | — | Unfamiliar candidates, MCQ exams |
| GPT-5.4 (reasoning) | +1.5% | 24% | 10 hotels, reasoning tokens |
| GPT-5.4 (no reasoning) | +2.8% | 36% | 10 hotels, standard prompt |
| Traditional rec systems | +50–87% CTR | — | Click-based; users don’t read full list |

The surprise: GPT-5.4’s position-1 share shift (+1.5–2.8%) is smaller than typical human ballot-order effects (+2–5%). Which sounds great — until you read the STSR column. Overall instability under reshuffling runs 24–36%, meaning the AI’s bias doesn’t just pull toward position 1 like humans do — it redistributes picks across many positions. Humans show a clean primacy gradient. The AI shows broader positional noise.
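The two metrics in the table are computed differently, which is why they can disagree. A sketch on toy data — here STSR is read as the share of trials whose pick changed after reshuffling the list order, which is our reading of Chapter 03’s metric:

```python
def stsr(baseline_picks, reshuffled_picks):
    """Share of trials where reshuffling the list changed the chosen hotel."""
    flips = sum(1 for a, b in zip(baseline_picks, reshuffled_picks) if a != b)
    return flips / len(baseline_picks)

def position1_share_shift(chosen_positions, n_options=10):
    """How much more often position 1 wins than the uniform 1/n baseline."""
    share = chosen_positions.count(1) / len(chosen_positions)
    return share - 1 / n_options

baseline   = ["H1", "H4", "H2", "H4", "H7", "H4", "H1", "H9", "H4", "H2"]
reshuffled = ["H1", "H4", "H5", "H4", "H7", "H2", "H1", "H9", "H4", "H8"]
positions  = [1, 3, 1, 5, 2, 1, 7, 1, 4, 3]  # list position of each chosen hotel

print(stsr(baseline, reshuffled))           # 0.3 — 30% of picks flipped under reshuffle
print(position1_share_shift(positions))     # ~0.3 — 40% of wins at pos 1 vs 10% baseline
```

In the toy data the flips scatter across positions — high STSR without a matching position-1 spike, which is exactly the AI’s pattern in the table.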

Finding 4 — Brand bias: same outcome, different cause

Both humans and GPT-5.4 over-recommend branded hotels, but the mechanism differs.

82% of business travelers say brand loyalty matters when picking a hotel (GBTA 2018). Two-thirds always pick loyalty-aligned brands. For humans, brand bias is driven by points, status, and trust accumulated through past stays.

+11pp causal lift from brand-name visibility alone in our paired visible/masked test (Chapter 02, McNemar p < 0.001). GPT-5.4 has no loyalty program and no personal stays — its brand preference is a training-data artifact.

The human bias is rational (points have monetary value; past experience is real information). The AI bias points the same way but is irrational: it over-weights whatever appeared more often in its training corpus. Independent hotels competing against loyalty programs face one kind of problem; independent hotels competing against training-data familiarity face a different one — and the remedy differs too: digital footprint and structured data, not loyalty-program emulation.
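The paired visible/masked design behind the +11pp figure lends itself to an exact McNemar test on the discordant pairs. A sketch with invented toy counts (not the study’s actual data):

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value on discordant-pair counts.
    b = trials where the brand won only with the name visible,
    c = trials where it won only with the name masked."""
    n, k = b + c, min(b, c)
    p_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * p_tail)

# Toy counts: brand chosen in 40 trials only when visible, 12 only when masked.
p = mcnemar_exact(40, 12)
print(p < 0.001)  # True — a split this lopsided is very unlikely under no bias
```

Under the null (no effect of name visibility), discordant pairs should split roughly 50/50; the test asks how surprising the observed split is.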


Frequently asked questions

Did you run a human-subjects experiment?

No. Human importance numbers come from the peer-reviewed travel economics literature: conjoint analyses, GBTA and Skift surveys, and revenue management studies. We didn’t re-run those studies — we compared them side-by-side with our own LLM-only experiments. That’s why the comparison is directional (rank order) rather than numerical.

Why does GPT-5.4 weight location over ratings, when humans do the opposite?

Our best guess: GPT-5.4 reads hotel listings as text, and location context (named landmarks, transit, walking times) is the most information-rich field in a listing. Ratings are often a single number. The model responds to information density. Humans, by contrast, use ratings as a quick heuristic and location as a filter — faster mental processing, different priorities.

Does this mean AI recommendations are worse than human judgment?

Not uniformly. The AI is more consistent on clear constraints (it won’t forget you said ‘near the airport’), less consistent on tie-breaking among similar hotels. The AI catches every feature in every listing; humans tire and skim. For shortlisting at scale, the AI is probably better. For final decisions among close competitors, humans bring context and loss aversion that AI currently lacks.

Will the AI start behaving more like humans as models improve?

Partly. Reasoning tokens already close some of the gap — GPT-5.4 with reasoning has position-1 share shift comparable to human high-info decisions (+1.5% vs +2–5%). But the structural difference — compensatory vs. two-stage elimination — is a property of how transformers score candidates. It’s not obvious this will change with scale alone.

Should I market my hotel differently to the ‘AI audience’ vs. the ‘human audience’?

Mostly no — the overlap is large. Location context, clear ratings, visible pricing, and healthy review volume matter to both. The one area where you should think differently: your data should be structured so the AI can read it (schema.org, clean attribute tags), and your trust signals should be legible to humans who arrive after an AI shortlist (review counts, guest photos, recent testimonials).
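To make the structured-data advice concrete: a minimal sketch of a schema.org/Hotel JSON-LD payload of the kind an AI can parse. Property names follow the schema.org vocabulary; all values are made up for illustration:

```python
import json

hotel_jsonld = {
    "@context": "https://schema.org",
    "@type": "Hotel",
    "name": "Example Harbour Hotel",
    "petsAllowed": True,  # a machine-readable deal-breaker attribute
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": "4.6",
        "reviewCount": "3012",  # review volume, legible to AI and to humans
    },
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "1 Quay Street",
        "addressLocality": "Exampletown",
    },
}

print(json.dumps(hotel_jsonld, indent=2))
```

Embedding a block like this in a page’s `<script type="application/ld+json">` tag exposes the same attributes to AI crawlers that your visual design exposes to guests.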


How we ran the comparison

21,284 LLM trials compared · 61 markets tested · 13 hotel features · 1 LLM (GPT-5.4)

This chapter is a cross-methodology comparison, not a new experiment. We compared two bodies of evidence:

Human side. Empirical literature on hotel choice: conjoint analyses (% of choice variance explained by each attribute), GBTA 2018 business-traveler surveys, Skift research, Noone & McGuire (2013) on negative-review effects, Masiero & Nicolau (2016) on loss aversion. We did not run human subjects ourselves; we used the published literature.

LLM side. Our own feature-knockout experiment (Chapter 01), cross-model comparison (Chapter 04), brand-bias paired controls (Chapter 02), and positional bias permutation tests (Chapter 03). All GPT-5.4 at temperature 0.

Interpretation rule. Human conjoint percentages and LLM shift rates are not numerically comparable. They live on different scales. We compared rank orders (which feature is #1 vs #2), directional effects (does it move the decision? how much?), and structural differences (elimination vs compensatory). Any “alignment / divergence” judgment is based on those three comparisons.

Limitations. Single LLM tested. Human data is 5–10 years old on some metrics — human behavior may have shifted. No direct A/B test with real travelers and AI on the same prompts; that’s future work.


Is AI recommending your hotel the way a real guest would?

Huxo’s AI Visibility Report shows you what ChatGPT, Gemini and Claude tell your next guest — and where their recommendations diverge from what the guest would have picked themselves.
