Essay · 4 min read

A wake word, not a mode

Tonecast's voice feature does two very different things. Hold Fn and talk, and it either transcribes and polishes what you said -- dictation -- or it treats what you said as an instruction and goes and does it: "answer this email," "make this shorter," "polite no, I'm traveling."

The obvious design is two hotkeys, or a toggle: dictation mode and command mode. We shipped neither. There's one button, one transcript, and one routing decision:

switch VoiceRouter.route(transcript) {
case .intent(let instruction):
    break  // opened with the wake word → do it (placeholder body)
case .dictation(let polish):
    break  // everything else → polish and paste (placeholder body)
case .empty:
    break  // silence → nothing
}

If the transcript opens with "Tonecast," it's an instruction. Otherwise it's dictation, polished at whatever level you've picked -- Raw, Polished, Full Tonecast, or YOLO. The router strips the wake word and any trailing comma or colon, and what's left is the command.
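To make that rule concrete, here is a minimal sketch of the simple version: prefix check, strip, done. It is not the shipping router. The `Route` payload types, the free function `route`, and the choice to treat a bare wake word as `.empty` are all assumptions for illustration, and it knows nothing about polish levels or mis-heard wake words.

```swift
import Foundation

// Hypothetical stand-in for the router's result type; payloads are assumed to be plain strings.
enum Route: Equatable {
    case intent(String)      // instruction with the wake word stripped
    case dictation(String)   // text to polish and paste
    case empty               // silence
}

func route(_ transcript: String, wakeWord: String = "tonecast") -> Route {
    let text = transcript.trimmingCharacters(in: .whitespacesAndNewlines)
    guard !text.isEmpty else { return .empty }

    // Case-insensitive prefix check against the wake word.
    guard text.lowercased().hasPrefix(wakeWord) else { return .dictation(text) }

    // Require a word boundary, so "Tonecasting ..." still counts as dictation.
    let rest = text.dropFirst(wakeWord.count)
    if let next = rest.first, next.isLetter || next.isNumber {
        return .dictation(text)
    }

    // Drop the comma, colon, or whitespace that followed the wake word.
    let instruction = String(rest.drop(while: { $0 == "," || $0 == ":" || $0.isWhitespace }))

    // Assumption: a wake word with nothing after it is treated like silence.
    return instruction.isEmpty ? .empty : .intent(instruction)
}
```

With this sketch, `route("Tonecast, make this shorter")` yields `.intent("make this shorter")`, and anything that doesn't open with the wake word comes back as `.dictation` with the text untouched.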

[Figure: flowchart of the routing decision. Hold Fn and speak, producing one transcript. If it opens with "Tonecast," it becomes an instruction: the wake word is stripped and Tonecast does the thing. Otherwise it is dictation, polished into your voice. Both paths paste at the cursor. Caption: one transcript, one decision.]

Why not a second hotkey

Because modes are where voice interfaces go to die. A mode switch means that before you speak, you have to remember which state a menu-bar app is in. Get it wrong in one direction and your dictated paragraph gets "executed" as a nonsense command. Get it wrong in the other and "answer this email" gets pasted into the email, verbatim, as your reply. Both failures are embarrassing, and both are your fault for holding the wrong key -- which is the worst kind of product blame.

Speech already has a disambiguation mechanism, and every human knows it: you say a name when you're addressing someone. Nobody flips a switch before delegating to a colleague; they say "Alex, can you handle this?" The wake word makes talking to the tool work exactly like talking to a person, which means there's nothing to remember. The mode lives in the sentence, not in the app.

It also survives the transcript being the only input. By the time routing happens, the audio is gone -- Whisper has already turned it into text. A prefix check on a string is something you can unit test with a table of transcripts. No acoustic model, no confidence thresholds, no "was the button half-pressed" ambiguity.
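A table of transcripts really can be the whole test. The sketch below shows the shape such a table might take. `opensWithWakeWord` is a hypothetical stand-in for the router's prefix check, and the rows are examples I made up for illustration, not the project's actual test suite.

```swift
// Hypothetical stand-in for the router's prefix check:
// true when the first word of the transcript is the wake word.
func opensWithWakeWord(_ transcript: String) -> Bool {
    let words = transcript.lowercased().split(whereSeparator: { !$0.isLetter })
    return words.first == "tonecast"
}

// The table: each row is a transcript and the decision we expect.
let cases: [(transcript: String, expected: Bool)] = [
    ("Tonecast, answer this email", true),
    ("tonecast: make this shorter", true),
    ("Polite no, I'm traveling", false),
    ("I told them about Tonecast", false),
    ("", false),
]

// Returns the transcripts whose routing decision did not match the table.
func failingCases() -> [String] {
    cases
        .filter { opensWithWakeWord($0.transcript) != $0.expected }
        .map { $0.transcript }
}
```

In a real project the same rows would presumably live in an XCTest case, but the idea is unchanged: every bug report becomes one more row.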

The part I didn't plan: turncast

Here's the charming engineering bit. Speech models are trained on real words, and "Tonecast" isn't one yet. So Whisper regularly hears the brand name as something else, and the router has learned to accept the ways it gets it wrong:

private static let wakeWordAliases = [
    "tonecast",
    "tone cast",
    "turncast",
    "turn cast",
    "tunecast",
    "tune cast"
]

Every one of those came from a real mis-transcription. You say "Tonecast, reply to this," Whisper writes "Turncast, reply to this," and the command should still just work -- the user said the wake word; the ASR is the one that flubbed it. The router matches aliases longest-first at the start of the transcript, and a small regex canonicalizes whichever variant appeared back to "Tonecast" so the rest of the pipeline sees one consistent name.
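Roughly, the matching and canonicalizing steps could look like the sketch below. The alias list is copied from above, but `matchedAlias(in:)`, `canonicalizeWakeWord(_:)`, and the regex pattern are assumptions for illustration; the real router's regex may be shaped differently.

```swift
import Foundation

let wakeWordAliases = [
    "tonecast", "tone cast",
    "turncast", "turn cast",
    "tunecast", "tune cast"
]

// Returns the alias the transcript opens with, trying longer aliases first.
func matchedAlias(in transcript: String) -> String? {
    let lowered = transcript.lowercased()
    return wakeWordAliases
        .sorted { $0.count > $1.count }
        .first { lowered.hasPrefix($0) }
}

// Rewrites a leading alias to "Tonecast"; text elsewhere in the transcript is left alone.
// The pattern is a guess that happens to cover the six aliases above.
func canonicalizeWakeWord(_ transcript: String) -> String {
    transcript.replacingOccurrences(
        of: #"^(?:tone|turn|tune)\s?cast\b"#,
        with: "Tonecast",
        options: [.regularExpression, .caseInsensitive]
    )
}
```

Here `canonicalizeWakeWord("Turncast, reply to this")` gives back `"Tonecast, reply to this"`, while a "turncast" in the middle of a sentence is not rewritten because the pattern is anchored to the start.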

Fighting the speech model is a losing game. Absorbing its mistakes is a list literal.

Talking about Tonecast

One real failure shaped the router more than anything else. Early on, the rule was simply "starts with the wake word → command." Then I dictated a sentence that began with the product's name -- "Tonecast uses a text polishing tool..." -- and instead of transcribing it, the app went off and answered it. The generated reply landed in my document looking like a spectacularly broken polish.

The fix leans on how English actually works. A comma or colon after the wake word ("Tonecast, reply...") is always an explicit address -- Whisper's punctuation is doing the work. But when the wake word is followed by a bare space, the router looks at the very next word. If it's a copula, an auxiliary, or a declarative lead-in -- is, has, can't, uses, works, just, basically, and a few dozen more -- you're talking about Tonecast, and the whole sentence routes to dictation. "Tonecast, answer this" is a command; "Tonecast is a menu bar app" is a sentence.
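As a sketch, the check might look something like this. The word set holds only the examples named above, not the full list, and `Address` and `classify(afterWakeWord:)` are hypothetical names. I am also assuming that a bare wake word with nothing after it falls through to the command path.

```swift
// Illustrative subset only: the real list is described as a few dozen words longer.
let declarativeLeadIns: Set<String> = [
    "is", "has", "can't", "uses", "works", "just", "basically"
]

enum Address {
    case command         // the user is talking to Tonecast
    case aboutTonecast   // the user is talking about Tonecast; route to dictation
}

// `rest` is the transcript text immediately after the wake word.
func classify(afterWakeWord rest: String) -> Address {
    // A comma or colon directly after the wake word is an explicit address.
    if let first = rest.first, first == "," || first == ":" {
        return .command
    }

    // Otherwise look at the very next word, case-insensitively.
    let nextWord = rest.split(separator: " ").first.map { $0.lowercased() } ?? ""
    return declarativeLeadIns.contains(nextWord) ? .aboutTonecast : .command
}
```

So `classify(afterWakeWord: ", answer this")` is a command, and `classify(afterWakeWord: " is a menu bar app")` is a sentence about the product. A `Set` lookup keeps the check constant-time no matter how long the list grows.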

It's a denylist of words that can't begin an instruction, sitting in front of an LLM-powered feature. Deeply unfashionable. Also: correct, instant, and testable.

Small surface, few states

The entire router is about a hundred lines of pure functions -- no state, no audio, string in, decision out. That smallness is the point. Every mode you add to an interface is a bet that users will track it; every state you remove is a class of error nobody can make anymore.

One button. Say what you want typed, or say its name and what you want done. That's the whole interface.

If that sounds like how you'd rather talk to your Mac, join the early-access list.


Tonecast is built by Codefox AI. Questions, feedback, or just want to say hi? Email us at support@tonecast.ai.