My experience using Claude to book an Airbnb.
Why browser agents can navigate but struggle to browse.
I tried using Claude’s browser copilot to book an Airbnb for the summer. I gave it my dates, budget, area, and all the usual criteria. It pulled options, filtered, summarized tradeoffs. Then, just to be sure of the results, I went back to Airbnb and did it myself. The apartments I found were noticeably better, and what took nearly an hour with Claude took me half the time on my own.
The agent’s problem wasn’t reasoning or understanding my prompts. It was that it used Airbnb badly. It treated the search like a structured query: enter constraints, get results, pick the best one. What I actually did looked nothing like that. I was moving around the map, zooming in and out, opening listings, closing them. I’d look at three places on the same street and decide the street felt off, shift the map slightly, start again. Photos did most of the work. You can tell fast if a place will feel cramped, if the light is bad, if it’s one of those apartments that looks good in the first shot and gets worse after. None of that is in a filter.
There’s also a constant re-ranking happening that you’re not conscious of. You see something slightly better and everything before it drops. You revisit a listing you’d dismissed and it moves up because the alternatives got worse. The agent had no access to any of that.
To know whether the agent did a bad job, you have to do the task yourself, at which point you’ve already done the work. You could give it better instructions and send it back. But after 45 minutes of watching it search, who wants to evaluate its results, write new prompts, and send it through the whole process again? You either trust what it came back with, or it’s faster to do it yourself.
My session replay (every click, every listing opened and closed, every time I nudged the map) is a complete record of how the preference formed. It’s not noise but a decision taking shape in real time, and it’s a better benchmark for evaluating the agent than checking whether it completed the booking. The question is whether it converges on the same kind of place and explores the same parts of the map. The ground truth is the process, not the outcome.
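To make that idea testable, here’s a rough sketch of what “compare the process” could mean. Everything here is invented for illustration (the event format, the scoring), not any real agent’s API: score the agent by how much of the territory the human explored it also covered.

```python
# Sketch: score an agent run against a human session replay.
# The (action, listing_id) event format is a made-up stand-in
# for whatever a real session recorder would emit.

def trajectory_score(human_events, agent_events):
    """Jaccard overlap between the listings each party opened.

    1.0 means the agent explored exactly the same listings as the
    human; 0.0 means their searches never touched.
    """
    human_opened = {lid for action, lid in human_events if action == "open"}
    agent_opened = {lid for action, lid in agent_events if action == "open"}
    if not human_opened and not agent_opened:
        return 1.0  # both did nothing: trivially identical trajectories
    union = human_opened | agent_opened
    return len(human_opened & agent_opened) / len(union)

human = [("open", "a"), ("close", "a"), ("open", "b"), ("open", "c")]
agent = [("open", "b"), ("open", "d")]
print(trajectory_score(human, agent))  # 1 shared listing of 4 explored -> 0.25
```

A real version would also need to compare map regions visited and the order of moves, but even this crude overlap asks a different question than “did it book something.”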
Most agent evaluation misses this. We keep asking “did it finish the task” when the more useful question is “did it make the same moves a good human would make.”
The Airbnb UI is built for humans. Visual, spatial, optimized for comparison. It evolved to match how people actually process and filter information. We moved from databases to spreadsheets to SaaS interfaces because each step was more legible to the way humans think.
For an agent, all of that is overhead. It doesn’t need a map. It doesn’t parse photos the way we do. What it actually wants is closer to a spreadsheet: 800 rows, clean columns, every listing queryable. The agent could reason over it directly without navigating a UI built for someone else.
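Here’s roughly what that spreadsheet world looks like, as a toy sketch. The columns and numbers are made up; the point is the shape of the interaction:

```python
# Toy version of the "spreadsheet interface": filter and rank listings
# directly, no UI in the loop. All data is invented for illustration.

listings = [
    {"id": "a", "price": 120, "bedrooms": 2, "rating": 4.8},
    {"id": "b", "price": 90,  "bedrooms": 1, "rating": 4.9},
    {"id": "c", "price": 150, "bedrooms": 2, "rating": 4.2},
]

def structured_search(rows, max_price, min_bedrooms):
    """Enter constraints, get ranked results: the whole task as one query."""
    matches = [r for r in rows
               if r["price"] <= max_price and r["bedrooms"] >= min_bedrooms]
    return sorted(matches, key=lambda r: -r["rating"])

result = structured_search(listings, max_price=130, min_bedrooms=2)
print([r["id"] for r in result])  # ['a']
```

Fast, systematic, and query-shaped. Which is exactly the problem: nothing in this path lets the goal change while the search runs.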
Are we regressing? Is the right interface for an agentic world just a database with an API? Run a filter algorithm, return ranked results, done. That’s not a new idea; it’s basically what SaaS was before anyone put a visual layer on top of it.
The UI isn’t just a display layer for humans. It’s where intent forms. I didn’t know exactly what I wanted when I started; browsing was how I figured it out. The visual, spatial, comparative experience of Airbnb isn’t a nice-to-have on top of the data. It’s the process by which a vague preference becomes a specific one. Strip it out and you haven’t made the problem simpler, you’ve just removed the mechanism by which the goal gets defined.
The agent working from a spreadsheet would be faster and more systematic and would still pick worse apartments, because it would be optimizing against a goal that was never properly formed in the first place.
So where does that leave browser agents? The question isn’t whether they’ll replace humans doing this kind of task. It’s whether they can participate in the loop that makes the task work. The one where preferences shift in real time based on what you’re seeing, where the UI is doing cognitive work, not just displaying results.
What changes this isn’t better reasoning or better prompts. It’s agents that read UI behavior as signal: which listings got attention, which got skipped, what got revisited, and what got dropped the moment something better appeared, and use that to infer intent in real time instead of waiting to be told what to optimize for. The UI doesn’t become redundant. It becomes the data source. The session becomes the instruction set.
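At its most basic, “session as signal” could look something like this. The event types and weights are assumptions I’m making for the sketch, not how any shipping agent works:

```python
# Sketch: turn a stream of UI events into a soft preference signal.
# Event vocabulary and weights are invented; a real system would
# learn these rather than hardcode them.

ATTENTION_WEIGHTS = {"open": 1.0, "revisit": 2.0, "skip": -0.5, "drop": -1.0}

def infer_interest(events):
    """Aggregate per-listing interest scores from (action, listing_id) events."""
    scores = {}
    for action, listing_id in events:
        scores[listing_id] = scores.get(listing_id, 0.0) + ATTENTION_WEIGHTS.get(action, 0.0)
    return scores

session = [("open", "a"), ("skip", "b"), ("open", "c"),
           ("revisit", "a"), ("drop", "c")]
print(infer_interest(session))  # {'a': 3.0, 'b': -0.5, 'c': 0.0}
```

The revisit of “a” outweighs its single open; the drop of “c” cancels its open. That’s the re-ranking I was doing unconsciously, made explicit enough for an agent to act on mid-session.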
That’s a different architecture than most browser agents are built on. And it’s why an Airbnb test is more interesting than it looks.

