May 31, 2026

Vibes gettin worse

Fourteen months of commits on one side project, read as a record of how building with AI agents actually changed. Meltdown, discipline, and a slow escalation in what the tools could carry.

On March 22, 2025, at some point on a Saturday morning, I committed 90 files to Bewks under the message “vibes gettin worse, more garbage”. The diff was +24,342 / -10,659. An hour later I committed again: “the app is back at least”. That one was 18 files, +458 / -260, and most of the work was deleting things, including an api/auth/session-bypass route that should never have existed.

Those two commit messages are the most honest documentation in the whole repository. I’ve now got 483 commits across fourteen months on this project, and read in order they’re less a changelog than a record of how building with AI agents changed over a year. The tools changed, and so did what I was willing to ask them to do.

The garbage era

The first chaos commit is what AI pair-programming looked like in early 2025, before I knew how to hold it. I’d asked for test infrastructure. What I got was eleven hand-rolled mock files, two competing test setups (setup.js and setup.ts), and duplicated test trees where the same suite lived at two paths. None of it ran. The model generated scaffolding faster than I could read it, and I kept accepting, because each individual piece looked plausible and the sum looked like progress.

None of it was progress, just surface area. The recovery commit an hour later was the cleanup: delete the cleverness, get a bootable app back. That’s the lesson from that morning, the one I keep relearning. An agent that’s confidently wrong produces more code than one that’s right, and after you’ve been burned by that a few times the sheer volume of a change starts reading as the warning sign rather than the reassurance.

I didn’t write any of that down at the time. I just wrote “vibes gettin worse”, which turns out to be the same observation with less dignity.

The quiet, and then the flood

There’s a hole in the history. After August 2025 the commits stop for three months. September, October, November, nothing. Then December opens with 79 commits and February 2026 lands 153. The gap wasn’t a vacation. It lines up almost exactly with the months when agent tooling got good enough to run several agents at once, and the project’s whole rhythm changed on the far side of it.

The before-and-after is stark in the diffs. Early Bewks was me typing with an assistant. Late Bewks is me reviewing work from a swarm: branches I’d dispatch, come back to, and either merge or kill. The bottleneck moved off my fingers and onto my judgment, which is a much better place for it to live but a worse one to be lazy in. The failure mode in March was accepting bad code because writing it had been cheap. By February it was accepting bad decisions, because reviewing them had gotten just as cheap. Same trap, one layer up.

The boring work got cheap

The most visible thing the agents bought me was the boring middle, done at a volume I’d never have paid for by hand.

I extracted a BaseController with a single withAction() wrapper and deleted about a thousand net lines of copy-pasted try/catch/logging across fourteen controllers in one pass. goodreads.controller.ts alone lost 512 lines. I split a 1,097-line BookTable.tsx down to 403, and a 1,287-line MainLayout.tsx down to 372, pulling the guts into hooks. I migrated the entire test suite off Jest onto Vitest: 277 files, +12,005 / -16,720, mechanical and tedious and exactly the kind of thing that used to die in a backlog because no human wanted to spend a weekend on it.
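For a sense of what a wrapper like withAction() replaces, here is a minimal sketch. It is not the Bewks code: only the names BaseController and withAction come from the post, while the ActionResult shape, the Logger interface, and the BookController subclass are assumptions made up for illustration.

```typescript
// Hypothetical result shape; the post doesn't show what the real controllers return.
type ActionResult<T> =
  | { ok: true; data: T }
  | { ok: false; error: string };

// Minimal logger interface so the sketch doesn't depend on a logging library.
interface Logger {
  info(message: string): void;
  error(message: string): void;
}

class BaseController {
  protected readonly logger: Logger;

  constructor(logger: Logger) {
    this.logger = logger;
  }

  // Runs one action with shared logging and error handling, so subclasses
  // don't repeat the same try/catch block in every method.
  protected async withAction<T>(
    name: string,
    action: () => Promise<T>,
  ): Promise<ActionResult<T>> {
    this.logger.info(`${name}: start`);
    try {
      const data = await action();
      this.logger.info(`${name}: ok`);
      return { ok: true, data };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      this.logger.error(`${name}: failed: ${message}`);
      return { ok: false, error: message };
    }
  }
}

// Hypothetical subclass: the action body is the only part left to write.
class BookController extends BaseController {
  getTitle(id: string): Promise<ActionResult<string>> {
    return this.withAction("getTitle", async () => {
      if (id === "") throw new Error("missing id");
      return `Book ${id}`;
    });
  }
}
```

The saving comes from the shape: each controller method shrinks to its action body, and a change to logging or error formatting happens in one place instead of fourteen.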

The test-coverage push is my favorite one, because it came with a lie baked in. I drove coverage from 45% to 72%, about 4,400 tests, and somewhere in there discovered the number had been wrong the whole time. A misconfigured Jest multi-project config was silently overwriting the .tsx coverage with zeros. The agents were good at writing the tests. They were no help at all noticing that the scoreboard was unplugged. That division of labor has held up better than almost any other generalization I’ve made this year.
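One way to notice an unplugged scoreboard is to treat "almost every file of one type sits at exactly zero" as a reporting bug rather than a testing gap. The sketch below assumes a per-file summary shaped like the one Istanbul's json-summary reporter writes (file paths mapped to line totals). The function names and the 0.9 threshold are hypothetical, and nothing here is taken from the Bewks repository.

```typescript
// Subset of the per-file shape in an Istanbul json-summary report.
interface MetricSummary {
  total: number;
  covered: number;
}
interface FileSummary {
  lines: MetricSummary;
}
type CoverageSummary = Record<string, FileSummary>;

// Paths in the report with the given extension, skipping the aggregate "total" key.
function filesWithExtension(summary: CoverageSummary, extension: string): string[] {
  return Object.keys(summary).filter(
    (path) => path !== "total" && path.endsWith(extension),
  );
}

// Files that have lines to cover but report none of them as covered.
function findZeroedFiles(summary: CoverageSummary, extension: string): string[] {
  return filesWithExtension(summary, extension)
    .filter((path) => {
      const lines = summary[path].lines;
      return lines.total > 0 && lines.covered === 0;
    })
    .sort();
}

// A few untested files are normal. Nearly all files of one type at zero
// suggests the report itself is broken, not the tests.
function looksUnplugged(
  summary: CoverageSummary,
  extension: string,
  maxZeroShare = 0.9, // assumed threshold, tune to taste
): boolean {
  const all = filesWithExtension(summary, extension);
  if (all.length === 0) return false;
  return findZeroedFiles(summary, extension).length / all.length > maxZeroShare;
}
```

A check like this could run in CI right after the coverage step, failing the build when looksUnplugged(report, ".tsx") is true, so a silently zeroed report becomes a loud error instead of a slightly low number.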

Where it actually got clever

If you want to see the tools, and my nerve, escalate over time, look at how Bewks talks to the outside world for metadata. It’s a small cat-and-mouse with the modern bot-protection internet, and it climbs.

It started naive: axios, fetch the page, parse the HTML. That held up for a long time. Then I added film metadata, and IMDb started answering my scraper with an HTTP 202 challenge page instead of data. The same afternoon, instead of fighting the challenge, I pivoted to graphql.imdb.com, the internal endpoint IMDb’s own site calls, and replaced forty lines of brittle regex with one typed query. The site was already handing me a clean API. I’d just been politely reading its HTML like it was 2010.

Goodreads was meaner. Its search endpoint started returning HTTP 202 with a zero-byte body, a success status that’s actually a silent block, the worst kind, because nothing throws. So I routed only the search path through a headless browser, and added a MIN_HTML_SIZE = 2048 guard: a real browser always renders something, so byte-length became the signal for “you’ve been blocked” when the status code lied. That browser engine now runs on a Raspberry Pi, with code that hunts for a Chromium binary across five different OS layouts because I genuinely don’t control where it’ll be installed.
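The two tricks in that paragraph can be sketched roughly as follows. MIN_HTML_SIZE = 2048 is the value from the post. Everything else is an assumption: the function names are hypothetical, and the candidate paths are illustrative guesses at common Chromium install locations, not the five layouts the real code checks.

```typescript
// Value from the post: rendered pages smaller than this are treated as blocked.
const MIN_HTML_SIZE = 2048;

type FetchVerdict = "ok" | "http-error" | "silent-block";

// Classifies a response by status first, then by size. A 2xx status with a
// tiny body (such as a 202 with zero bytes) counts as a silent block.
function classifyResponse(status: number, html: string): FetchVerdict {
  if (status < 200 || status >= 300) return "http-error";
  const byteLength = new TextEncoder().encode(html).length;
  return byteLength < MIN_HTML_SIZE ? "silent-block" : "ok";
}

// Illustrative install locations only; the post does not list the real ones.
const CHROMIUM_CANDIDATES: readonly string[] = [
  "/usr/bin/chromium-browser",
  "/usr/bin/chromium",
  "/snap/bin/chromium",
  "/Applications/Chromium.app/Contents/MacOS/Chromium",
  "C:\\Program Files\\Chromium\\Application\\chrome.exe",
];

// Returns the first candidate path that exists, or undefined if none do.
// The existence check is injected so the lookup can be tested without a disk.
function findChromium(
  exists: (path: string) => boolean,
  candidates: readonly string[] = CHROMIUM_CANDIDATES,
): string | undefined {
  return candidates.find((path) => exists(path));
}
```

In a Node process the lookup could be called as findChromium(existsSync) with existsSync imported from node:fs. Measuring bytes rather than string length keeps the threshold meaningful for pages with multi-byte characters.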

axios, to a GraphQL endpoint the site didn’t advertise, to a headless Chromium on a Pi detecting silent blocks by counting bytes. I didn’t design that ladder. Each rung was forced by a specific failure, and I only climbed it because, by the time each problem showed up, asking an agent to build the next rung had gotten cheap enough that it stopped feeling like a project and started feeling like a reply.

What “over time” actually means

There’s a tidy version of this story where the AI got smarter and so did I, hand in hand, and it all bends toward the light. That’s not quite it.

What changed wasn’t one capability. It was the cost of attempts collapsing, over and over, in a way that kept moving the bottleneck. When writing code got cheap, the scarce thing became reading it. Once reading single diffs got cheap too, the scarce thing became deciding which of six parallel branches even deserved to exist. The “vibes gettin worse” commit and the byte-counting WAF detector are the same project fourteen months apart, and the difference between them isn’t that the model got good. It’s that I slowly figured out which half of the work was still mine.

It’s still mine. The agents never once noticed the coverage number was fake.