Essay · 4 min read

Say what you want done

Transcription is a commodity. Whisper has been open source since 2022, the hosted versions are fast and near-free, and every dictation tool on the market now turns speech into accurate text. That race is over, and everyone won.

Which is why the interesting question about voice isn't "how well does it hear me?" anymore. It's: what happens after the words?

Where the field is

To be fair about it, the good dictation tools have been climbing past raw transcription for a while. Wispr Flow does automatic cleanup as you speak and has an experimental Command Mode: highlight text, speak an instruction, get it rewritten -- more concise, friendlier, translated. Their docs are upfront that it works best for local edits. SuperWhisper lets you build custom modes that run your dictation through an LLM with your own prompt, optionally fed the active input field, your selection, or your clipboard -- and it can run its speech models fully on-device, which deserves respect. Apple's built-in dictation is transcription plus spoken punctuation; the rewriting lives in Writing Tools, which you invoke with a click, and the real voice-editing commands ("Replace cat with dog") live in Voice Control, an accessibility feature.

All of this is editing. The text is yours to produce; the tool operates on text that already exists. Meanwhile Superhuman generates entire replies in your voice -- genuinely finished work -- but you type the prompt, and it only works inside their email client.

So the landscape splits cleanly: voice tools that edit text, and AI tools that draft work. The empty square is voice that drafts work.

The missing rung

That's the square Tonecast's instruction mode sits in. Hold Fn and say: "Tonecast, answer this email -- polite no, I'm traveling."

Here's what actually happens, because I think the mechanics are the argument. The wake word routes the transcript to intent instead of dictation. Tonecast then resolves the conversation you're looking at -- the Gmail thread, the Apple Mail message, the WhatsApp or Slack or iMessage conversation, even a thread inside Superhuman -- through per-app integrations. It loads your voice profile for that channel and that contact, hands the model three things -- the conversation, your profile, your spoken instruction -- and pastes the drafted reply at your cursor. One utterance in, finished work out.
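The first step in that chain, deciding whether an utterance is an instruction or plain dictation, can be sketched in a few lines. This is an illustration rather than Tonecast's actual code: the names `route`, `Routed`, and `InstructionRequest` are hypothetical, and the matching rule (wake word at the start, followed by a word boundary) is an assumption.

```python
from dataclasses import dataclass

WAKE_WORD = "tonecast"  # from the example utterance; matching rules are assumed


@dataclass
class Routed:
    mode: str  # "instruction" or "dictation"
    text: str  # transcript, with the wake word removed for instructions


@dataclass
class InstructionRequest:
    """The three things handed to the model (hypothetical container)."""
    conversation: str  # the thread resolved from the app in front of you
    profile: str       # voice profile for this channel and contact
    instruction: str   # what you said after the wake word


def route(transcript: str) -> Routed:
    """Send wake-word utterances to intent handling, everything else to dictation."""
    text = transcript.strip()
    if text.lower().startswith(WAKE_WORD):
        tail = text[len(WAKE_WORD):]
        # Require a word boundary so "Tonecasting ..." stays ordinary dictation.
        if tail and not tail[0].isalnum():
            instruction = tail.lstrip(" ,.:;!-")
            if instruction:
                return Routed("instruction", instruction)
    return Routed("dictation", text)
```

Under these assumptions, `route("Tonecast, answer this email")` comes back as an instruction with the wake word stripped, while anything that doesn't open with the wake word passes through untouched as dictation.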

[Figure: two pipelines compared. Dictation tools: speech becomes text, then you still read, think, write, edit, and send. Tonecast instruction: speech becomes intent plus the conversation on your screen, which becomes drafted work in your voice. Caption: the same ten seconds of speech, two very different outputs.]

Two details worth knowing. First, not every instruction even needs a model: the router tries cheap parsers before the LLM, so "Tonecast, switch to Slack" or "Tonecast, screenshot this window" are recognized as commands and executed instantly, no tokens spent. The expensive path is reserved for the instructions that deserve it. Second, when there's no conversation to read, it doesn't improvise: if context can't be resolved, the instruction fails with a plain message -- "No conversation context found" -- rather than letting the model draft a reply to a thread it never saw.
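Both details fit in one small function. The sketch below assumes a list of regex-based parsers and two injected callables, `resolve_context` and `draft_with_llm`; all of these names are hypothetical, and the patterns cover only the two example commands quoted above. Only the error message comes from the essay itself.

```python
import re
from typing import Callable, Optional, Tuple

Command = Tuple[str, str]


def parse_switch_app(text: str) -> Optional[Command]:
    """Cheap parser (assumed pattern): 'switch to <app>'."""
    match = re.fullmatch(r"switch to (\w[\w ]*)", text.strip(), re.IGNORECASE)
    return ("switch_app", match.group(1)) if match else None


def parse_screenshot(text: str) -> Optional[Command]:
    """Cheap parser (assumed pattern): 'screenshot this window'."""
    if re.fullmatch(r"screenshot this window", text.strip(), re.IGNORECASE):
        return ("screenshot", "window")
    return None


CHEAP_PARSERS = [parse_switch_app, parse_screenshot]


class NoContextError(Exception):
    """Raised when there is no conversation to ground a drafted reply."""


def handle_instruction(
    text: str,
    resolve_context: Callable[[], Optional[str]],
    draft_with_llm: Callable[[str, str], str],
) -> Command:
    # 1. Try the cheap parsers first; a hit returns without any model call.
    for parser in CHEAP_PARSERS:
        command = parser(text)
        if command is not None:
            return command
    # 2. The model path needs a real conversation; fail plainly if there is none.
    conversation = resolve_context()
    if not conversation:
        raise NoContextError("No conversation context found")
    # 3. Only now spend tokens on a draft.
    return ("draft", draft_with_llm(conversation, text))
```

The ordering is the design choice: the model call sits last, behind both the command check and the context check, so it can neither run for a simple command nor draft against a thread it never saw.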

Why intent is structurally worth more

It collapses the loop. Dictation, however polished, hands you back text and leaves you the actual job: read the thread, figure out what it needs, compose, fix, send. Five steps, and transcription only ever removed the "type" part of one of them. An instruction collapses transcribe-read-think-write-edit into a single spoken sentence. That's not an incremental accuracy win; it's a different denominator.

It composes with voice profiles. An editing command produces roughly the same output for everyone -- "make this shorter" is "make this shorter." An instruction executes in your voice: the same markdown profile that shapes your dictation is injected into the instruction prompt, per channel, per contact. "Answer this email" from me and "answer this email" from you produce two different emails, each sounding like its own person. The instruction feature isn't sitting next to the voice learning; it's multiplied by it.
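To make "per channel, per contact" concrete, here is one way profile selection and injection could look. It is a sketch under assumptions: the lookup keys, the fallback order (contact, then channel, then a global default), and the prompt layout are all hypothetical, and `select_profile` and `instruction_prompt` are illustrative names rather than Tonecast's API.

```python
from typing import Dict, Optional, Tuple

# Key is (channel, contact); None acts as a wildcard. This layout is an assumption.
ProfileKey = Tuple[Optional[str], Optional[str]]


def select_profile(profiles: Dict[ProfileKey, str], channel: str, contact: str) -> str:
    """Pick the most specific markdown profile available (assumed fallback order)."""
    for key in ((channel, contact), (channel, None), (None, None)):
        if key in profiles:
            return profiles[key]
    return ""  # no profile learned yet


def instruction_prompt(profile_md: str, conversation: str, instruction: str) -> str:
    """Inject the voice profile alongside the conversation and the spoken instruction."""
    return (
        "## Voice profile\n" + profile_md + "\n\n"
        "## Conversation\n" + conversation + "\n\n"
        "## Instruction\n" + instruction + "\n"
    )


# Two people, same thread, same spoken instruction.
mine = {("email", None): "Short sentences. Signs off with first name only."}
yours = {("email", None): "Warm and detailed. Opens with a greeting."}
thread = "Can you join the offsite on the 14th?"
spoken = "answer this email -- polite no, I'm traveling"

my_prompt = instruction_prompt(select_profile(mine, "email", "dana@example.com"), thread, spoken)
your_prompt = instruction_prompt(select_profile(yours, "email", "dana@example.com"), thread, spoken)
```

The instruction and the conversation are identical in both prompts; only the profile section differs, which is where the two different emails come from.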

It travels across apps. Because the context comes from what's in front of you -- not from a plugin you installed into one program -- the same sentence works in Mail, in Gmail in a browser tab, in Slack, in WhatsApp. The integration surface is the screen you're already looking at, which is the only integration surface that generalizes.

The commodity and the moat

None of this makes transcription unimportant -- Tonecast still needs to hear "polite no, I'm traveling" correctly, and it happily rides the same Whisper-class models as everyone else. That's the point: the hearing is shared infrastructure now. The compounding asset is everything after -- knowing what's on your screen, knowing how you'd say it, and being trusted to go from utterance to artifact.

Dictation tools will keep getting better at giving you your words back. We're betting the future belongs to tools you can hand the words to.

Try the difference yourself -- join the early-access list.


Tonecast is built by Codefox AI. Questions, feedback, or just want to say hi? Email us at support@tonecast.ai.