The Autoplay Roadmap: How Our Thinking Evolved
When we started Autoplay, the dream was simple: real-time automation in software.
A chatbot that could take a command - “Schedule this report for me” - and then go do it, directly in the UI.
But we ran into a hard truth: copilots don’t fail because the AI can’t click the right buttons. They fail because people don’t know what to ask.
Users don’t know what they don’t know. They don’t know which features exist, what’s possible, or how others in their org are using the product. And even if they did, they only want answers that are relevant to them in that moment. These were the core hypotheses Autoplay was built on:
Users don’t know what they don’t know
Users only want to be shown or told what is relevant to them and what they don’t already understand
Users want to understand how others in their organization/team use the same software
Users will enjoy using the product more if it’s presented through a gamified experience
That’s why we shifted from “do-it-for-me” copilots → “show-me-why” intent detection.
Step One: Recording the Right Data
At first, we tried the brute force route: screenshots and computer vision.
It was slow and expensive, and it missed crucial context like hover time, mouse drift, and hesitation.
Then Monday.com gave us a better idea:
“Why not plug into our session replays instead? See if AI can do more there.”
That unlocked everything.
Atlas: Why Session Replays Became Our Wedge
We realized something obvious in hindsight:
Product teams hate watching session replays, but they have to, because that’s where the “why” lives.
So we built Autoplay Atlas.
It takes raw replays and turns them into high-impact insights:
Where users hesitate
What they know vs. don’t know
What goals they’re trying to achieve
Instead of hours of replay watching, you get immediate clarity.
Building the Models
We started by converting video into frames and training models to detect hesitation, knowledge, and intent.
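For illustration only, here’s a minimal sketch of that kind of frame pipeline: it samples frames from a replay video with OpenCV and hands each one to a placeholder scorer standing in for the hesitation, knowledge, and intent models. The file name, sampling rate, and score_frame stub are assumptions for this post, not our production code.

```python
# Minimal sketch (not Autoplay's actual pipeline): sample frames from a replay
# video and score each one. score_frame is a stand-in for the real models.
import cv2  # pip install opencv-python


def score_frame(frame) -> dict:
    """Placeholder for the hesitation / knowledge / intent models."""
    return {"hesitation": 0.0, "knowledge": 0.0, "intent": "unknown"}


def frames_from_replay(path: str, every_n: int = 30):
    """Yield (timestamp_seconds, frame) pairs, keeping one frame per `every_n`."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            yield idx / fps, frame
        idx += 1
    cap.release()


if __name__ == "__main__":
    # "session.mp4" is a hypothetical exported replay, used here for illustration.
    for ts, frame in frames_from_replay("session.mp4"):
        print(ts, score_frame(frame))
```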
Early versions meant uploading FullStory or Sentry links and getting back a summary.
From there, we layered on conversational intelligence into individual sessions: a chatbot you could ask,
“What can you tell me about what the user is trying to do?”
By November, our models could reliably detect intent. Genie (our UI assistant) was starting to adapt in real time. We began asking customers whether they wanted us to integrate with existing replay tools (PostHog, FullStory, Hotjar) or replace them outright.
Scaling the Pipeline
February
Shipped smarter NLP - questions like “Show me hesitation during onboarding filters” just worked.
Knowledge and hesitation models got sharper.
Added search across sessions + hypothesis testing.
April
Redesigned the timeline for clarity.
Launched collection summaries.
Search was upgraded with unsupervised clustering - surfacing emergent goals and golden path deviations (a toy sketch of the idea follows this list).
PostHog came on board to help us scale infrastructure and co-sell.
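For the curious, here’s a toy version of what “unsupervised clustering over sessions” means in practice. The random embeddings and the KMeans choice below are stand-ins for explanation, not our actual models or pipeline.

```python
# Toy illustration: cluster per-session embeddings so sessions that share a goal
# land together, then flag the sessions furthest from any cluster center as
# candidate "golden path deviations" worth a human look.
import numpy as np
from sklearn.cluster import KMeans  # pip install scikit-learn

# Pretend embeddings: one vector per session, produced by an upstream model.
rng = np.random.default_rng(0)
session_embeddings = rng.normal(size=(200, 32))

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(session_embeddings)

# Distance from each session to its own cluster center.
distances = np.linalg.norm(
    session_embeddings - kmeans.cluster_centers_[labels], axis=1
)
outliers = np.argsort(distances)[-10:]
print("cluster sizes:", np.bincount(labels), "review first:", outliers)
```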
May
Golden Path shipped: define your ideal workflow, then instantly see where users deviate.
We refined our ICP: PLG martech tools with clear adoption paths (campaigns, integrations, automations). Complex enough for AI to matter, narrow enough to measure impact.
Outbound focused here → first commit came in from Sendlane.
Thinking in First Principles: TERRA
As we mapped intent at scale, we developed TERRA:
Task, Event, Representation, Reasoning Architecture.
It’s the most efficient way we’ve found to connect raw clicks → meaningful goals.
And it powers everything from hypothesis testing to real-time copilots.
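To make the layering concrete, here’s an illustrative sketch of how TERRA-style objects could be modeled in code. The class and field names are shorthand for this post, not the production schema.

```python
# Illustrative-only sketch of TERRA-style layers: raw Events are compressed into
# Representations, which support an inferred Task, tied together by Reasoning.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """A raw interaction captured from a session replay (click, hover, input)."""
    timestamp: float
    kind: str    # e.g. "click", "hover", "input"
    target: str  # e.g. a selector or element label


@dataclass
class Representation:
    """A compressed, model-friendly view of a run of events."""
    events: List[Event]
    summary: str


@dataclass
class Task:
    """The goal the user appears to be pursuing."""
    name: str
    confidence: float


@dataclass
class Reasoning:
    """The explanation linking events -> representation -> task."""
    task: Task
    evidence: List[Representation] = field(default_factory=list)
    narrative: str = ""
```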
Hypothesis Testing: One Flow, Not Ten Tools
The current adoption problem:
Time-to-value is too slow.
Teams stitch together multiple tools to ask simple questions: Is this drop-off caused by a bug, a bad UX pattern, or a user decision?
Our answer: turn everything into tags.
An Issue = tag:issue_onboarding.
A Golden Path = tag:golden_path_activation.
A Hypothesis = tag:hypothesis_filters_confusion.
Instead of bouncing between views, you stay in one unified search canvas. Every step is traceable, shareable, repeatable. The agent can run the flow end-to-end: check for bugs, compare hesitation before/after, classify outcomes as decision vs. UX.
It’s hypothesis testing, but agentic.
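A toy example of the tag idea, with made-up session IDs and an in-memory inverted index. The real product obviously isn’t a Python dict, but the query shape is the point: one search over one tag space.

```python
# Toy sketch: issues, golden paths, and hypotheses all live in one tag space,
# so any question becomes a single tag query instead of a tour of ten tools.
from collections import defaultdict

# session_id -> tags attached by models or by the team (illustrative data)
sessions = {
    "s1": {"tag:issue_onboarding", "tag:hypothesis_filters_confusion"},
    "s2": {"tag:golden_path_activation"},
    "s3": {"tag:issue_onboarding", "tag:golden_path_activation"},
}

# Inverted index: tag -> sessions carrying it.
index = defaultdict(set)
for session_id, tags in sessions.items():
    for tag in tags:
        index[tag].add(session_id)


def search(*tags: str) -> set:
    """Sessions carrying every requested tag."""
    results = [index[t] for t in tags]
    return set.intersection(*results) if results else set()


# "Which onboarding issues also touch the activation golden path?"
print(search("tag:issue_onboarding", "tag:golden_path_activation"))  # {'s3'}
```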
Where We’re Headed
We’re pushing into agent-driven workflows, where Autoplay doesn’t just surface friction but tests it automatically.
Future R&D:
Labeling data via chatbot interactions → supervised learning + RLHF.
Session replay–powered copilots that intervene in real time.
Tight integrations with PostHog error tracking, cohorts, and feature flags.
The long-term vision loops back to where we started: real-time copilots. But this time, grounded in user intent, context, and data.