EXP-TRE-002 Apr 24, 2026 → ongoing

Calibrating a writing voice with AI as the mirror

The plan: AI drafts → I get frustrated with the quality and take over → compare to the published version → recalibrate
Result: outgrew the experiment — now a standing tool I run on real drafts
Notes: created skill that incorporates stop-slop, inclusive language, and no-new-invented-hyphenated-terminology
Blog post: How I'm teaching an AI my writing voice

Plain language summary

Can an AI learn not just how I write, but why I write that way? Most approaches to voice calibration feed the AI samples and let it copy the surface. I wanted something different: to capture the reasoning behind my choices, so it learns how I think, not just how I sound.

Since April 2026 I've run a simple loop. The AI drafts something, we iterate and capture live feedback, and if we don't come to a resolution together, I take over fully. Then we use AI to review what I changed and why. That reason is the durable part, usually a value or a rule about how I want to come across. AI catalogs it and we test it against my real published writing. A pattern has to show up more than one way before I trust it. Once I do, it becomes part of my voice calibration baseline.

It outgrew the experiment. The loop is now a standing tool that reads every draft before I do, stripping the patterns that give AI away and keeping the language about people plain and concrete. The rules it has built up are mine, and I keep them private.

What this does not show: this is one person's voice, judged by that same person. There's no counterfactual, no way to measure how my writing would read without it. Whether the method works for anyone else is untested.

The question

When I started using AI to create some drafts for me, I naturally wanted to calibrate against my writing voice. So I fed Claude samples of my writing to give it a target. That seemed like it was missing the point. I wanted AI to be able to not just imitate my writing. I wanted to expose the reasoning behind the choices I made when writing.

The experiment: have AI create a draft, then capture my corrections systematically, extract and review the principle for each correction, and then test the principles against real published work until they're confirmed or rejected.

This is the second of a pair of experiments that point AI at my own working life rather than at code; the companion experiment is Designing the day for flow (coming soon).

The loop

AI drafts a post, an email, or something else written → iterate over corrections as AI redrafts based on my feedback → if we don't come to a resolution together, I take over fully → AI compares the draft against what I actually published → extracts the pattern behind the difference → proposes it as a principle → I confirm or reject it into a base calibration file that future drafting reads first. A watch protocol flags correction moments in live writing sessions so they feed the loop instead of just disappearing.

One rule guards the whole thing: triangulation before lock-in. I don't lock a pattern into my voice calibration file until I've seen the principle show up more than one way: not just in my edits to AI drafts, but in my own past writing, or in a quick side-by-side of two phrasings. One way is a hint. Two ways that agree make it likely. Three or more, and it becomes a rule and it gets added to the file.

The goal of that loop is to explain HOW I write, not WHAT I write.

An early finding: the calibration techniques aren't interchangeable, and they map to different contexts. A/B comparisons and anti-voice exercises (literally "How would you NOT want this to sound?") work very well in a voice session, while historical comparison can easily run async. That mapping makes voice work possible in time that was previously dead (driving, walking, waiting, or even sleeping) instead of demanding desk focus.

What the record looks like

I'm not going to publish the rules themselves. They're the point. They're the part that's mine, and they do the actual work. A list of "say it this way, not that way" would read as a style sheet anyone could lift, and it would miss what the rules actually encode. What's worth saying is the shape of what the method turns up.

Most entries aren't style fixes. They're moments where the AI made a choice I'd never make. When we work out why I was rejecting it, the reason is almost never grammar. It was usually about who I'm writing for, or how I want to come across. Those reasons are the durable part; the wording was just where they surfaced.

The file has a texture worth naming. Some are hard mechanical rules. I write Canadian English, and I reach for "--" instead of an em-dash. Others are positioning rules that reverse a default the AI reaches for from its training. A few are tensions the method surfaced and deliberately didn't resolve: two things I care about that pull against each other, logged as open questions instead of forced into a rule. And plenty of reviews end with nothing promoted at all, because the pattern was already captured. A confirmation that changes nothing is the guard working.

The counts, as of now: seven core principles locked in, plus a set of hard mechanical rules. About six more candidates sit in a holding pattern, each seen once or twice but not yet enough to lock. A couple more are parked until I have real-world engagement data to judge them by. A few have been watched for weeks without recurring. That waiting is its own kind of answer: not every correction is a rule, and refusing to lock one in early is what keeps the file honest.

What happened

It outgrew the experiment. The plan was supposed to run in stages: a month building the baseline, then triangulation, then everyday use. Instead it collapsed into the thing it was designing. The loop now runs on real drafts (web copy, newsletters, posts), with the baseline file as its memory. The dated record: baseline established April 24; proposed-additions reviews late April, late May, early June, and this week. Some proposals lock in, some are dismissed as already captured, and the file changes slowly.

Some rules that started as voice corrections hardened into standing policy, applied to every draft before I see it: strip the predictable AI patterns, keep language about people concrete, never coin manufactured jargon. The corrections I make today sit higher up the stack than the ones I made in April.

Honest caveats

A sample of one person's voice, judged by that same person. "This sounds like me" is the outcome measure, and it's subjective by construction. Triangulation manages the overfitting risk but doesn't remove it: every input still arrives by way of correcting one particular AI's habits. Whether the process transfers to anyone else's voice is untested. There's no counterfactual either. I can't measure what my drafts would read like today without the calibration, and the correction-rate drop the design aims for isn't measured yet.

What's next

The loop keeps running on real work. The next formal checkpoint is a quarterly consolidation pass, to prune and check for drift. What's still open is whether this transfers: the principles are mine, but nothing about the method is.