Calibrating a writing voice with AI as the mirror
- The plan
- AI drafts → I get frustrated with the quality and take over → compare to the published version → recalibrate
- Result
- outgrew the experiment — now a standing tool I run on real drafts
- Notes
- created skill that incorporates stop-slop, inclusive language, and no-new-invented-hyphenated-terminology
Plain language summary
Can an AI learn not just how I write, but why I write that way? Most voice tools just feed the AI samples and let it copy the surface. I wanted something different: to capture the reasoning behind my choices, so it learns how I think, not just how I sound.
Since April 2026 I've run a simple loop. The AI drafts something. I take over when the writing frustrates me. Then I look at what I changed and ask why. That reason is the durable part, usually a value or a rule about how I want to come across. I write it down and test it against my real published writing. A pattern has to show up more than one way before I trust it. That keeps me from just learning the quirks of one AI.
It outgrew the experiment. The loop is now a standing tool that reads every draft before I do, stripping the patterns that give AI away and keeping the language about people plain and concrete. The rules it has built up are mine, and I keep them private.
What this does not show: this is one person's voice, judged by that same person. There's no counterfactual, no way to measure how my writing would read without it. Whether the method works for anyone else is untested.
The question
When I started using AI to create some drafts for me, I naturally wanted to calibrate against my writing voice. So I fed Claude samples of my writing to give it a target. That seemed like it was missing the point. I wanted AI to be able to not just imitate my writing. I wanted to expose the reasoning behind the choices I made when writing.
The experiment: have AI create a draft, then capture my corrections systematically, extract and review the principle for each correction, and then test the principles against real published work until they're confirmed or rejected.
This is the second of a pair of experiments that point AI at my own working life rather than at code; the companion experiment is Designing the day for flow (coming soon).
The loop
AI drafts a post, an email, or something else written → iterate over corrections as AI redrafts based on my feedback → I take over completely when the quality frustrates me → compare drafts against what I actually published → extract the pattern behind the difference → propose it as a principle → confirm or reject it into a base calibration file that future drafting reads first. A watch protocol flags correction moments in live writing sessions so they feed the loop instead of just disappearing.
One rule guards the whole thing: triangulation before lock-in. I don't lock a pattern into my voice calibration file until I've seen the principle show up more than one way: not just in my edits to AI drafts, but in my own past writing, or in a quick side-by-side of two phrasings. One callout is a hint. Two callouts that agree make it likely. Three or more callouts, and it becomes a rule and it gets added to the file.
The goal of that loop is to explain HOW I write, not WHAT I write.
An early finding: the calibration techniques aren't interchangeable, and they map to different contexts. A/B comparisons and anti-voice exercises (literally "How would you NOT want this to sound?") work very well in a voice session, while historical comparison can easily run async. That mapping makes voice work possible in time that was previously dead (driving, walking, waiting, or even sleeping) instead of demanding desk focus.
What the record looks like
I'm not going to publish the rules themselves. They're the point. They're the part that's mine, and they do the actual work. A list of "say it this way, not that way" would read as a style sheet anyone could lift, and it would miss what the rules actually encode. What's worth saying is the shape of what the method turns up.
Most entries aren't style fixes. They're moments where the AI made a choice I'd never make. When I worked out why I was rejecting it, the reason was almost never grammar. It was usually about who I'm writing for, or how I want to come across. Those reasons are the durable part; the wording was just where they surfaced.
The file has a texture worth naming. Some are hard mechanical rules. I write Canadian English, and I reach for "--" instead of an em-dash. Others are positioning rules that reverse a default the AI reaches for from its training. A few are tensions the method surfaced and deliberately didn't resolve: two things I care about that pull against each other, logged as open questions instead of forced into a rule. And plenty of reviews end with nothing promoted at all, because the pattern was already captured. A confirmation that changes nothing is the guard working.
The counts, as of now: seven core principles locked in, plus a set of hard mechanical rules. About six more candidates sit in a holding pattern, each seen once or twice but not yet enough to lock. A couple more are parked until I have real-world engagement data to judge them by. A few have been watched for weeks without recurring. That waiting is its own kind of answer: not every correction is a rule, and refusing to lock one in early is what keeps the file honest.
What happened
It outgrew the experiment. The plan was supposed to run in stages: a month building the baseline, then triangulation, then everyday use. Instead it collapsed into the thing it was designing. The loop now runs on real drafts (web copy, newsletters, posts), with the baseline file as its memory. The dated record: baseline established April 24; proposed-additions reviews late April, late May, early June, and this week. Some proposals lock in, some are dismissed as already captured, and the file changes slowly.
Some rules that started as voice corrections hardened into standing policy, applied to every draft before I see it: strip the predictable AI patterns, keep language about people concrete, never coin manufactured jargon. The corrections I make today sit higher up the stack than the ones I made in April.
Honest caveats
A sample of one person's voice, judged by that same person. "This sounds like me" is the outcome measure, and it's subjective by construction. Triangulation manages the overfitting risk but doesn't remove it: every input still arrives by way of correcting one particular AI's habits. Whether the process transfers to anyone else's voice is untested. There's no counterfactual either. I can't measure what my drafts would read like today without the calibration, and the correction-rate drop the design aims for is tracked, not yet measured.
What's next
The loop keeps running on real work. The next formal checkpoint is a quarterly consolidation pass, to prune and check for drift. What's still open is whether this transfers: the principles are mine, but nothing about the method is.