What I Learned Pointing a Ralph Loop at My Product for a Week

With Amplitude data for feedback, taking the human out of the loop is not “unpossible.”
Insights

May 13, 2026

12 min read

Eric Carlson

Chief AI Architect, Amplitude

Ralph Loops Product Improvements

Amplitude recently hosted AI Week, a time dedicated to upending our normal work process to focus on a fully different AI-native model. As a data scientist by background, I wanted to run one experiment: can I give a coding agent a clear objective function, scaffold it with an orchestration system, then run a Ralph Wiggum loop to autonomously build a product?

Kick it off, walk away, see what happened. My app was a backcountry route planning app I had been toying with. The agent was Claude Code with browser use enabled. The orchestrator was Amplitude’s Opportunity Finder, a new experimental feature that’s supposed to identify product signals, surface improvement opportunities, then draft specs and PRs that coding agents pick up. The constraint I set for myself was that I would not intervene. No prompts, no nudges, just up-front goal definition.

By the end of the week, my app had 102 shipped features. Slope-angle overlays on a 3D Mount Hood. A physics-based avalanche runout simulation that modeled how snow would flow down a canyon before a skier ever dropped in. A mushroom foraging prediction model grounded in species-specific fruiting weather and micro-climate estimates from elevation and aspect. Weather tabs, resort data, mountain biking routes, kayaking maps. Every one of them shipped with a browser-recorded GIF validating that the feature worked end-to-end.

The features were impressive, but what really surprised me was the Ralph loop itself and the techniques that helped it run effectively. Here’s what I learned.

The Ralph loop, as I ran it

The Ralph loop is named after the Simpsons character Ralph Wiggum, who optimistically and persistently proceeds through life, even when incompetent. For the programmer, it is just a while loop over Claude Code. When one session completes, it fires up another one, always in full auto-mode. For the computer/data scientist, it is an intelligent global optimizer with a rigorous objective function that stops only after convergence, if ever.

The version I ran had three cycles: build the next opportunity, verify it in a browser, and generate more opportunities from the new product state. I pointed it at the app, gave it very high-level goals and constraints (examine competitors, optimize for fun planning), and let it run.
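Stripped of the agent specifics, the shape of that three-cycle loop can be sketched in a few lines. All the names here are illustrative stand-ins, not Amplitude or Claude Code APIs; each callable represents one full agent session.

```python
def ralph_loop(build, verify, find_opportunities, queue, max_cycles=None):
    """Minimal sketch of the three-cycle Ralph loop: build the next
    opportunity, verify it in a browser, and generate more opportunities
    from the new product state. Runs until the queue empties (convergence)
    or the optional cycle cap is hit."""
    cycles = 0
    while queue and (max_cycles is None or cycles < max_cycles):
        opportunity = queue.pop(0)          # highest-ranked item first
        feature = build(opportunity)        # one full auto-mode agent session
        if verify(feature):                 # browser click-through, not just CI
            queue.extend(find_opportunities(feature))  # refresh the queue
        cycles += 1
    return cycles
```

The important property is that the queue is refreshed from the product's new state each cycle, which is what separates this from a bare `while True` over an agent.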

Ralph loops are easy to write, but without precise definitions of how they should hill-climb, you're just burning tokens (like Ralph's famous line, "It tastes like ... burning."). My success depended on defining how to generate opportunities, how verification was measured, and how the outcomes of one cycle became the inputs to the next.

Going in, I frankly underestimated how effective this system would be for building my ski app.

Where the opportunities came from

Once it has a goal, a Ralph loop needs an input queue. You can just ask the agent to pick things on its own, but I wanted to try giving it a little more guidance, so I connected it to Amplitude’s experimental Opportunity Finder.

The Opportunity Finder is supposed to work as an "AI PM," identifying high-value tasks from signals in your analytics, session replays, customer feedback, agent traces, and competitive gaps, then drafting specs for a coding agent to pick up. For my Ralph loop, each opportunity arrived as a structured object: a one-line problem statement, a few-sentence proposed solution, and behavioral evidence for why it was worth picking up.
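As a rough sketch, each opportunity behaved like a small structured record. The field names below are my own shorthand for illustration, not the Opportunity Finder's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Opportunity:
    """Illustrative shape of one item in the opportunity queue."""
    problem: str             # one-line problem statement
    proposed_solution: str   # a few sentences on what to build
    evidence: list = field(default_factory=list)  # behavioral signals behind the pick
    rank: float = 0.0        # priority assigned by the dispatcher

def next_opportunity(queue):
    """The loop always works the highest-ranked opportunity first."""
    return max(queue, key=lambda o: o.rank)
```

The `rank` field is what makes the queue an input the loop can trust: it encodes the dispatcher's judgment so the agent doesn't have to improvise priorities from its priors.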

This was load-bearing in a way I did not fully appreciate before the week started. Without a structured opportunity queue, a forever-loop just asks the model “What should I build next?” and the answer drifts into whatever the model’s priors happen to be about ski apps. With the queue, the loop had a ranked input that was tied to the product’s actual goals and refreshed every cycle with whatever the previous cycle had produced.

My agent instrumented itself

I think this is the part that mattered most, and it is the part I almost did not do.

My agent did not just ship features. It wired up its own telemetry. Amplitude events, full session replay, metric definitions tied back to the opportunities that had proposed the feature in the first place. Every feature the agent built started reporting back on itself the moment it shipped. It really moved at the speed of feedback.
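In spirit, the pattern looks like the sketch below (not the actual Amplitude SDK calls): every event a feature emits carries the id of the opportunity that proposed it, so the next cycle can rank against real usage rather than guesswork.

```python
def make_tracker(sink):
    """Illustrative self-instrumentation helper. `sink` is wherever events
    land; in the real setup this would be an analytics client rather than a
    Python list. Each event is tagged with the opportunity that proposed
    the feature, tying usage data back to the spec that created it."""
    def track(event_type, opportunity_id, **properties):
        sink.append({
            "event_type": event_type,
            "opportunity_id": opportunity_id,  # links usage back to the spec
            **properties,
        })
    return track
```

With that tag in place, "what got used, what got ignored, where sessions got stuck" becomes a per-opportunity query instead of an open-ended question.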

Anyone who has gotten stuck with a single model trying to fix a bug over and over knows how that can fail. Amplitude provided a feedback machine that let Claude mine for the next thing to do rather than spinning its wheels. These external signals seemed to knock the agents out of mode collapse. Maybe the product features would be buggy at first, but they would heal themselves as the bugs were observed.

When the agent instruments its own output, the next cycle has behavioral evidence (albeit mostly synthetic behaviors) to work from, including what got used, what got ignored, where sessions got stuck. The Opportunity Finder ranked against that evidence.

The second cycle got smarter than the first. The tenth got smarter than the second. That is what compounding looks like in this setup, and it almost did not happen without the telemetry being a first-class part of what the agent ships. Not a thing I bolted on afterward.

Browser verification

I ran Claude Code with browser use enabled, so the agent had to click through the app it had just changed. Every cycle, the agent opened the browser, used the feature it just built, and recorded a GIF of the click-through.

My ski planning app didn't have a lot of real users, and I initially thought this would be a barrier to the machine working. But it turns out that the agent's browser-based verification drove a lot of synthetic traffic through parts of the site, which allowed Amplitude to pick up on flow oddities, bugs, and improvements. Session replays of these agent runs provided visual feedback that Amplitude's Opportunity Finder agents could use to understand the user journeys and features holistically, at a higher level than code alone.

This approach caught a class of bug that unit tests miss. Let's say the feature compiles, the tests pass, and the button renders, but the dropdown is wired to the wrong handler, or the form submits the wrong field, or a state update never propagates. An agent actually clicking through in a browser finds that in seconds. A green CI run does not.

The GIF was the artifact. Every PR had one attached. It was documentation, verification, and evidence in the same file. When I came back after a stretch of agent time and wanted to know whether feature #73 actually worked, I watched the GIF. I did not read the diff.
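The gate itself is simple to state; the key design choice is that the pass/fail signal and the artifact come out of the same recorded click-through. An illustrative sketch, not the actual harness:

```python
def verify_feature(run_recorded_session):
    """Browser verification sketch. `run_recorded_session` drives the
    just-built feature in a real browser while recording, and returns
    (passed, gif_path). The GIF rides along on the PR as documentation,
    verification, and evidence in one file; a green unit-test run alone
    never clears this gate."""
    passed, gif_path = run_recorded_session()
    return {"merge": passed, "artifact": gif_path}
```

Because the artifact is a recording of the verification itself, checking whether feature #73 worked is a matter of watching the GIF, not reading the diff.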

The features I didn’t know I needed

Going in, I expected the loop to mostly generate fixes and refinements, iterating on things I had already scoped. Instead, it proposed entire feature categories I had not asked for.

The mushroom foraging model is the one I keep coming back to. I never told the agent to build it. The Opportunity Finder, doing its own competitive research against other outdoor planning apps, decided that a foraging feature was a gap worth closing. Then it researched when different species fruit, pulled historical weather data, and built a prediction methodology in the style of a field guide, grounded in species-specific micro-climate estimates from elevation and aspect. I watched the GIF of the feature being used to find morels near a trailhead and thought: I would not have built this. But I'm delighted that the agent did.

The avalanche runout simulator is similar. I had scoped "slope angle visualization and route hazard highlighting" as a goal. The loop noticed that slope angle alone is not how backcountry skiers actually evaluate risk; instead, they evaluate terrain traps, runout paths, and trigger points. So it built a physics simulation that models how snow would flow down a specific canyon before a skier drops in.

The Ralph loop did overnight what I would have spent a week on as a PM, along with a handful of physics-savvy engineers. It looked at what competitors shipped this quarter, read session replays, synthesized the top three opportunities. Not always as well as I would have done it, but consistently, and across a wider surface than I would have covered alone. “Define goals and let the system converge” was the name of the game.

Risking auto-merge

The last manual step in the loop, the one I still had my hand on, was merge. By the end of the week, I was letting narrow categories of work (like small UI additions with clear success metrics) auto-merge without me looking. Anything touching user data I left alone. The judgment about which work was safe to auto-merge turned out to be a per-opportunity-type decision, not a global switch.
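That per-opportunity-type judgment fits in a few lines. The category names below are mine for illustration; the point is that the policy is keyed by opportunity type, not a single global flag:

```python
AUTO_MERGE_TYPES = {"small-ui-addition"}  # clear success metrics, low blast radius
ALWAYS_MANUAL = {"user-data"}             # anything touching user data stays human-gated

def merge_decision(opportunity_type, verification_passed):
    """Per-opportunity-type merge gate rather than a global auto-merge switch."""
    if not verification_passed:
        return "reject"
    if opportunity_type in ALWAYS_MANUAL:
        return "human-review"
    if opportunity_type in AUTO_MERGE_TYPES:
        return "auto-merge"
    return "human-review"   # default to a human when in doubt
```

Defaulting unknown categories to human review keeps the blast radius of a bad auto-merge decision bounded while the list of trusted types grows.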

Going forward, I want to get the deploy step, the last handoff left in the chain, into the loop too. If verification is strong enough for auto-merge, it is strong enough for auto-deploy. Same idea, one more step.

My case was pretty low-risk, but the more risk I was willing to take, the higher the return. It's an interesting case study in deeply understanding your own product's risks and getting systematic about where and how to gate them. When you hand over the reins, you can really move quickly.

What I took away

Three things I am carrying into the next project:

  1. The loop is not the interesting part. The dispatcher and the verification gate are. Any while-loop will spin. Only a loop with honest outcome signals and a clearly defined objective function will progress toward value.
  2. Self-instrumentation is what makes the loop compound. Without it, the agent has no idea whether the thing it shipped worked outside of the code, and the next cycle is running on the same priors as the first. With it, the loop starts telling you which of its own outputs actually worked, and you can start trusting specific opportunity shapes more over time.
  3. The bottleneck moves. When one agent can ship 102 verified features in a week, execution stops being the scarce resource. Taste moves to the front. Prioritization moves to the front. Knowing which opportunities are worth pursuing, and which of the ones that shipped actually moved a metric, became what I spent my attention on. The loop does the building. I do the judging.

The parts that make this worth anything are the parts that are easy to skip: the verification and the feedback loop. The Ralph loop does not work without them. Nothing does.

About the author

Eric Carlson

Chief AI Architect, Amplitude

Eric Carlson is a Principal AI Engineer helping to shape and build Amplitude's next-generation vision of agentic and data-driven product development. His background is in physics: he received a PhD from UC Santa Cruz, where he worked to detect dark matter at the center of the galaxy, before transitioning to healthcare data science. When not working, Eric enjoys playing guitar, cooking, and exploring the outdoors through skiing, mountain biking, and rafting.
Topics: Agents, Amplitude AI, Engineering, Product Analytics

