Real-Time Speech-to-Text: How It Works + Best Tools and APIs (2026)

How streaming STT works, what latency really means, and the best real-time speech-to-text tools and APIs in 2026 — for both end users and developers.

Real-time speech-to-text is the difference between watching a transcript appear word by word as someone speaks — and waiting until the meeting is over to see what was said. The technology has gotten dramatically better in the past two years, with leading APIs now hitting 150–300 ms of latency on streaming audio. This guide covers what real-time STT actually is, how the streaming pipeline works under the hood, the latency benchmarks that matter for different use cases, and the best tools and APIs for both end users and developers in 2026.

Key takeaways

  • Real-time speech-to-text streams transcripts within hundreds of milliseconds of speech — fast enough for voice agents, live captions, and natural conversation.

  • Below 300 ms feels imperceptible; 500 ms is good for live captions; over 2 seconds starts to feel like batch transcription rather than real-time.

  • For consumers, Otter, Notta, Tactiq, and Web Speech API cover most needs. For developers, Deepgram, AssemblyAI, Speechmatics, and ElevenLabs Scribe v2 lead the API category.

  • Free options exist on both ends — Web Speech API runs entirely in the browser, and self-hosted Whisper streaming works without per-minute fees if you have a capable GPU.

What is real-time speech-to-text?

[Image: live captions appearing on screen as a person speaks into a microphone]

Real-time speech-to-text — sometimes called streaming STT or live transcription — produces written output as the speaker is talking, not after they finish. Words appear on screen within hundreds of milliseconds of being spoken. This is what powers live captions on Zoom, voice typing on your phone, voice agents on a customer service line, and the cursor that types as you speak in Google Docs.

Real-time vs batch transcription

Two different jobs, two different optimization goals:

| Mode | Real-time / streaming | Batch / file upload |
|---|---|---|
| Output timing | Words appear within hundreds of ms | Full transcript after the entire file is processed |
| Optimized for | Latency | Accuracy |
| Typical accuracy | Slightly lower (no future context) | Slightly higher (full context) |
| Best for | Live captions, voice agents, live transcripts | Recordings, podcasts, post-meeting summaries |

Real-time and batch are both built on the same underlying STT models — see our pillar guide on speech-to-text for the technical foundations. Real-time adds streaming infrastructure on top.

How real-time speech-to-text works

The streaming pipeline

Audio is broken into small chunks (typically 100–250 ms each) and sent continuously to the STT engine. The engine processes each chunk as it arrives, emitting partial results that get refined as more context becomes available. The chunks travel over a persistent connection — usually WebSocket or gRPC — that stays open for the whole session.
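To make the chunking concrete, here's a minimal browser-side sketch in TypeScript. The endpoint URL is a hypothetical placeholder — real vendors each define their own auth and framing — so treat this as the shape of the pipeline, not a drop-in client:

```typescript
// Capture mic audio and stream ~256 ms PCM chunks over a WebSocket.
// "wss://stt.example.com/v1/stream" is a hypothetical endpoint.
const socket = new WebSocket("wss://stt.example.com/v1/stream");
socket.binaryType = "arraybuffer";

async function startStreaming(): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext({ sampleRate: 16_000 }); // 16 kHz mono, the common input format
  const source = ctx.createMediaStreamSource(stream);

  // 4096 samples at 16 kHz ≈ 256 ms per chunk — in the typical 100–250 ms range.
  // ScriptProcessorNode is deprecated in favor of AudioWorklet, but keeps this sketch short.
  const processor = ctx.createScriptProcessor(4096, 1, 1);
  processor.onaudioprocess = (event) => {
    const float32 = event.inputBuffer.getChannelData(0);
    // Convert float32 [-1, 1] samples to 16-bit PCM, which most engines expect.
    const pcm = new Int16Array(float32.length);
    for (let i = 0; i < float32.length; i++) {
      pcm[i] = Math.max(-32768, Math.min(32767, Math.round(float32[i] * 32767)));
    }
    if (socket.readyState === WebSocket.OPEN) socket.send(pcm.buffer);
  };
  source.connect(processor);
  processor.connect(ctx.destination); // ScriptProcessor needs a sink to fire events
}
```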

Partial vs final transcripts

One of the trickiest parts of real-time STT is that early predictions are uncertain. Engines typically emit two kinds of results:

  • Partial transcripts — the model's best guess so far, updated continuously as more audio arrives. Words may change as context fills in.

  • Final transcripts — locked-in segments after the model is confident (usually triggered by a pause or punctuation cue).

Good UI design renders partial results in a lighter color or with a special indicator, then "commits" them when the final version arrives. If you've watched live captions on YouTube or Google Meet flicker mid-sentence and then settle, that's the partial-to-final transition in action.
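A sketch of that commit pattern, assuming a hypothetical `{ text, is_final }` message shape (real APIs use their own field names for the same idea):

```typescript
declare const socket: WebSocket; // the streaming socket from the previous sketch

interface TranscriptMessage {
  text: string;
  is_final: boolean;
}

let committed = ""; // finalized text, never rewritten
let tentative = ""; // current partial, may still change

function render(finalText: string, partialText: string): void {
  // Committed text in normal style; the partial in a lighter "pending" style.
  document.getElementById("caption")!.innerHTML =
    `${finalText}<span class="pending">${partialText}</span>`;
}

socket.onmessage = (event: MessageEvent<string>) => {
  const msg: TranscriptMessage = JSON.parse(event.data);
  if (msg.is_final) {
    committed += msg.text + " "; // commit: this segment won't change again
    tentative = "";
  } else {
    tentative = msg.text; // overwrite the previous guess
  }
  render(committed, tentative);
};
```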

Endpointing (voice activity detection)

The engine has to decide when an utterance ends — when the speaker has stopped talking, not just paused. Endpointing uses voice activity detection (VAD) and pause-length thresholds. Set the threshold too aggressively and the engine cuts off mid-sentence; set it too leniently and final transcripts arrive slowly. Most engines expose this as a tunable parameter (200–800 ms is typical).
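To illustrate the idea, here's a toy energy-based endpointer. Production engines use learned VAD models, and the thresholds below are illustrative assumptions, not recommended values:

```typescript
// Declare end-of-utterance after ENDPOINT_MS of low-energy audio.
const SILENCE_RMS = 0.01; // below this RMS, treat the chunk as silence
const ENDPOINT_MS = 500;  // tunable pause threshold (200–800 ms is typical)

let silentMs = 0;

function onChunk(float32: Float32Array, chunkMs: number): boolean {
  let sumSquares = 0;
  for (const s of float32) sumSquares += s * s;
  const rms = Math.sqrt(sumSquares / float32.length);

  silentMs = rms < SILENCE_RMS ? silentMs + chunkMs : 0;
  return silentMs >= ENDPOINT_MS; // true => utterance ended, finalize the transcript
}
```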

Streaming protocols

  • WebSocket — the most common protocol for browser and mobile clients. Persistent connection, low overhead, well-supported.

  • gRPC — preferred for server-side and microservices integrations. Bidirectional streaming, strongly typed, lower per-message overhead than WebSocket.

  • HTTP long-polling — fallback for environments where WebSocket isn't available. Simpler but higher overhead.

  • SDK abstractions — most vendors provide SDKs that hide the underlying protocol. Useful for getting started, but understand what's under the hood before deploying to production.

Latency budget

End-to-end latency in real-time STT is the sum of several stages:

  • Audio capture buffering (~50–100 ms)

  • Network transit to the STT server (~30–150 ms depending on geography)

  • Model inference (~50–200 ms)

  • Network return to the client (~30–150 ms)

  • Client rendering (~10–50 ms)

Total: typically 200–700 ms for cloud-based real-time STT. On-device models cut the network legs entirely but trade off model quality.
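You can sanity-check your own budget with a crude client-side probe: record when the last audio chunk went out, and log the gap when the next transcript message comes back. This measures the whole budget (buffering + network + inference + return), not just the model:

```typescript
declare const socket: WebSocket; // streaming connection from the earlier sketch

let lastChunkSentAt = 0;

function sendChunk(pcm: ArrayBuffer): void {
  lastChunkSentAt = performance.now();
  socket.send(pcm);
}

socket.addEventListener("message", () => {
  if (lastChunkSentAt > 0) {
    const ms = performance.now() - lastChunkSentAt;
    console.log(`transcript arrived ~${Math.round(ms)} ms after the last chunk`);
  }
});
```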

Latency benchmarks: what's actually "real-time"?

"Real-time" is a marketing term that means different things depending on the use case. Here's what the latency budget needs to be for each:

| Latency | Feels like | Good for |
|---|---|---|
| <200 ms | Imperceptible | Voice agents, conversational AI, gaming |
| 200–500 ms | Live | Live captions, simultaneous interpretation |
| 500 ms–1 s | Slight delay | Meeting transcripts, voice typing |
| 1–2 s | Noticeable delay | Most note-taking use cases (still fine) |
| >2 s | Feels like batch | Post-meeting summaries (use batch instead) |

Vendor benchmarks (typical)

  • ElevenLabs Scribe v2 Realtime — ~150 ms (industry-leading at launch)

  • Deepgram Nova-3 — ~250 ms

  • AssemblyAI Universal-Streaming — ~300 ms

  • Google Cloud Speech-to-Text Streaming — ~400 ms

  • Microsoft Azure Speech (Real-time) — ~400 ms

  • Speechmatics Real-Time — ~500 ms

  • OpenAI Realtime API — ~500 ms (includes voice agent loop)

  • Web Speech API (browser) — varies, ~500 ms–1 s

These numbers depend on geographic region (closer servers are faster), audio quality, and load — measure on your own workload before committing.

When to use real-time speech-to-text

  • Live captions for video conferencing. Zoom, Google Meet, Microsoft Teams — captions appear as participants speak, with no perceptible lag.

  • Meeting transcripts during the call. Note-takers (Otter, Fireflies, Tactiq) show the transcript live so attendees can fact-check, search, and add comments while the meeting happens.

  • Voice agents and conversational AI. Customer-service voicebots, AI tutors, and assistants need sub-300 ms STT to feel responsive in dialogue.

  • Live broadcasting and webinars. Real-time captioning makes streamed content accessible to deaf/hard-of-hearing viewers and to international audiences (with translation).

  • Voice typing. Dictation in any text field. The cursor types as you speak.

  • Call-center QA and compliance. Live transcripts let supervisors monitor calls in real time, flagging issues before they escalate.

  • Accessibility for live events. Conferences, classrooms, and town halls add live captions for inclusion.

Benefits of real-time over batch

  • Immediacy. The transcript exists the moment the conversation does. No waiting for processing.

  • Interactivity. Users can search, comment, or take action on the transcript while the meeting is still happening.

  • Accessibility. Deaf and hard-of-hearing users participate in real time, not after-the-fact.

  • Live monitoring. Supervisors, hosts, or content moderators can intervene as things happen, not after.

  • Voice-driven UX. Voice agents, hands-free apps, and dictation only work because real-time STT exists.

Best real-time speech-to-text tools (consumer)

[Image: real-time transcript appearing during a video meeting on a laptop]

Otter.ai

The most polished consumer real-time STT tool. Lives in a browser tab or as a Chrome extension; transcripts appear inline as people speak. Excellent on Zoom, Google Meet, and Microsoft Teams. Free tier covers 300 minutes/month; paid from ~$10/month.

Notta

Multilingual real-time transcription with strong support for Asian languages. The Notta extension or app captures live audio and produces an in-tab transcript. Free tier with 120 min/month; paid from ~$9/month.

Fireflies.ai

Bot-based real-time transcription that joins meetings as a virtual participant. Best for sales teams who want CRM integration on top of real-time transcripts. Free tier limited; paid from ~$10/user/month.

Tactiq

Chrome extension that displays live transcripts inside Google Meet, Zoom, and Microsoft Teams. No bot, no install on the desktop side. Free tier with 10 transcripts/month; paid from ~$8/month.

Web Speech API (browser-built-in)

Chrome, Edge, and Safari include the Web Speech API — free streaming STT directly in the browser, no account or API key required. Quality varies by browser and language; Chrome's implementation is the most capable. Used by tools like Speechnotes and many free dictation web apps.
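Getting started takes a few lines. This is the real browser API (note the `webkit` prefix Chrome still requires), so no key or server is involved:

```typescript
// Free streaming STT in the browser via the Web Speech API.
// TypeScript's DOM lib doesn't declare it, hence the `any` casts.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionImpl();
recognition.continuous = true;     // keep listening across pauses
recognition.interimResults = true; // emit partial (interim) transcripts
recognition.lang = "en-US";

recognition.onresult = (event: any) => {
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    const label = result.isFinal ? "final" : "partial";
    console.log(`[${label}] ${result[0].transcript}`);
  }
};

recognition.start();
```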

Speechnotes

Free, browser-based dictation tool built on the Web Speech API. No login, runs in any modern browser. Limited to a single browser tab at a time; perfect for ad-hoc voice typing.

Best real-time speech-to-text APIs (developer)

Deepgram Nova-3

A longtime leader in low-latency streaming STT. Sub-300 ms typical latency, strong English accuracy, custom vocabulary support, and fair pricing (~$0.0044/min on the streaming tier). Used heavily in voice-agent and call-center products.

AssemblyAI Universal-Streaming

The most feature-rich streaming STT. Beyond just transcripts, the API returns sentiment, topic detection, content moderation, and speaker labels — all in real time. ~$0.0045/min for streaming.

Google Cloud Speech-to-Text

Mature, broadly supported (~125 languages), good integration with the rest of Google Cloud. Streaming mode is solid but not as low-latency as the specialized providers. Pricing tiered, ~$0.024/min on standard streaming.

Microsoft Azure Speech

Strong choice for Microsoft 365 / Azure-native shops. Real-time mode supports custom vocabularies, custom acoustic models, and HIPAA + FedRAMP compliance for regulated industries. Pricing similar to Google.

Speechmatics Real-Time

Best-in-class accuracy on diverse accents (Speechmatics specifically focuses on accent robustness). Higher latency than Deepgram or AssemblyAI but a lower word error rate on global English.

ElevenLabs Scribe v2 Realtime

Newest entrant (launched 2025), benchmarked at ~150 ms — currently the lowest-latency cloud STT. Designed primarily for voice-agent and conversational AI use cases. Pricing in line with the category.

OpenAI Realtime API

End-to-end voice agent API: ASR + LLM + TTS in a single bidirectional stream. Useful for building conversational AI products without integrating three vendors. Latency includes the LLM round-trip, so total ~500 ms typical.

Whisper streaming (open-source)

OpenAI's Whisper model isn't a streaming model out of the box, but several open-source projects (whisper-streaming, RealtimeSTT, faster-whisper) wrap it for streaming use. Self-host on a GPU and the per-minute cost drops to compute. Latency depends on hardware (~300–800 ms on a consumer GPU).

Comparison table

| Tool / API | Latency | Languages | Pricing | Free tier |
|---|---|---|---|---|
| Otter.ai | ~600 ms | 3+ | ~$10/mo | 300 min/mo |
| Notta | ~700 ms | 50+ | ~$9/mo | 120 min/mo |
| Tactiq | ~600 ms | 40+ | ~$8/mo | 10 transcripts/mo |
| Web Speech API | ~500 ms–1 s | Browser-dependent | Free | Yes |
| Deepgram Nova-3 | ~250 ms | 30+ | ~$0.0044/min | $200 credit |
| AssemblyAI | ~300 ms | 15+ streaming | ~$0.0045/min | $50 credit |
| Google Streaming | ~400 ms | 125+ | ~$0.024/min | 60 min/mo |
| Microsoft Azure | ~400 ms | 100+ | ~$0.024/min | 5 hr/mo |
| ElevenLabs Scribe v2 | ~150 ms | 30+ | ~$0.005/min | Trial |
| Whisper streaming | ~300–800 ms | 99 | Self-host | Free (compute) |

How to choose the right tool

  1. What latency does your use case really need? Voice agents demand sub-300 ms. Live captions are fine at 500 ms. Meeting transcripts work at 1 s. Don't pay a premium for latency you don't need.

  2. What's the accuracy floor? The metric is Word Error Rate (WER) — the standard formula is sketched after this list. For consumer use, anything under 10% feels good; for compliance or legal use, you may need under 5%. Our guide on WER explains how to interpret these numbers.

  3. What languages do you need? Google leads on breadth (125+); Notta and Microsoft are strong on Asian languages; Speechmatics on accent robustness; Whisper is open-source multilingual.

  4. Where will it run? Mobile app, web, server-side, or embedded? SDK availability and protocol support vary across vendors.

  5. Do you need custom vocabulary? Industry jargon, product names, and proper nouns benefit from a custom dictionary. Most paid APIs support this; consumer tools rarely do.

  6. What's your privacy requirement? Cloud is faster and more accurate; on-device or self-hosted (Web Speech API on-device, Whisper self-host) avoids sending audio to a third party.
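For reference, WER is computed from the minimum number of word-level edits between the engine's output (the hypothesis) and a human-verified reference transcript:

```latex
% S = substitutions, D = deletions, I = insertions needed to turn the
% hypothesis into the reference; N = number of words in the reference.
\mathrm{WER} = \frac{S + D + I}{N}
```

A WER of 0.10 means roughly one error per ten reference words; because insertions count, WER can exceed 100% on very noisy audio.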

Implementation considerations (for developers)

  • Audio format. 16 kHz mono PCM is the de facto standard input. Some APIs accept 8 kHz (telephony) or 48 kHz (high-fidelity); MP3 and Opus require server-side decoding, which adds latency.

  • Connection management. Implement reconnect logic with exponential backoff — a minimal sketch follows this list. Real-time STT connections can drop on flaky networks, and your code should resume gracefully.

  • Partial result handling. Decide how to render partial transcripts in your UI. Common pattern: render partials in italic / lighter color, replace with finalized text when the API marks them final.

  • End-of-utterance detection. Tune the endpointing threshold for your use case. Voice agents prefer aggressive endpointing (~200 ms pause); meeting transcripts prefer lenient (~700 ms) to avoid cutting off thinking pauses.

  • Self-hosted vs cloud. Self-hosting Whisper or a similar open-source model takes a GPU (a single A100 or RTX 4090 handles 5–10 concurrent streams). Cloud APIs are simpler operationally but cost more per minute.

  • Error handling. Network blips, model uncertainty, and silent audio all need graceful handling. Your UX should communicate status to users (recording, processing, error).
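A minimal reconnect sketch, as promised above. The delays and cap are illustrative assumptions — tune them for your network profile:

```typescript
// Reconnect with exponential backoff plus jitter.
function connectWithBackoff(url: string, attempt = 0): void {
  const socket = new WebSocket(url);

  socket.onopen = () => {
    attempt = 0; // healthy connection: reset the backoff counter
    // (re)start sending audio; optionally replay chunks buffered while offline
  };

  socket.onclose = () => {
    const base = Math.min(500 * 2 ** attempt, 15_000); // 0.5 s, 1 s, 2 s … capped at 15 s
    const delay = base + Math.random() * 250;          // jitter avoids thundering herds
    setTimeout(() => connectWithBackoff(url, attempt + 1), delay);
  };
}

connectWithBackoff("wss://stt.example.com/v1/stream"); // hypothetical endpoint
```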

Privacy and cost

Privacy considerations

  • Cloud means third-party processing. All major cloud STT vendors process audio on their servers. For sensitive content, demand SOC 2, HIPAA (if applicable), and explicit non-training-data agreements.

  • On-device options. The Web Speech API uses on-device models in Chrome and Safari for some languages. Apple's iOS Live Captions feature runs on-device. For sensitive use cases, these are meaningful.

  • Self-hosted. Whisper streaming on your own infrastructure means audio never leaves your network — the highest privacy bar.

Cost benchmarks

  • Consumer apps — typically $9–18/month for unlimited streaming use.

  • Specialized APIs (Deepgram, AssemblyAI, ElevenLabs) — ~$0.004–0.005/min on streaming. A 60-minute meeting costs about $0.27.

  • Hyperscaler APIs (Google, Microsoft) — ~$0.024/min on the standard tier. The same 60-minute meeting: ~$1.44 (a quick cost comparison follows this list).

  • Free options — Web Speech API (browser), Whisper streaming (self-host on a GPU you already have).
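The gap compounds at scale. Here's the back-of-envelope arithmetic using the per-minute rates above, with a hypothetical 10,000-minute monthly workload:

```typescript
// Monthly streaming cost at the approximate rates quoted above.
const rates = { specialized: 0.0045, hyperscaler: 0.024 }; // $/min
const minutesPerMonth = 10_000; // e.g. a small call-center workload

for (const [tier, rate] of Object.entries(rates)) {
  console.log(`${tier}: $${(rate * minutesPerMonth).toFixed(2)}/month`);
}
// specialized: $45.00/month
// hyperscaler: $240.00/month
```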

Frequently asked questions

Real-time vs batch — when should I use which?

Real-time when the user needs the transcript while the meeting is happening (live captions, voice agents, in-meeting note-taking). Batch when accuracy matters more than speed and you only need the transcript afterward (recordings, podcasts, post-meeting summaries). Many products use both: batch for the canonical post-meeting record, real-time for the in-call experience.

How low can real-time STT latency go?

The current state of the art on cloud APIs is ~150 ms (ElevenLabs Scribe v2). On-device models can hit similar numbers but at lower accuracy. Below ~100 ms is essentially the network and audio buffering floor — the model itself can be faster, but the round-trip is bound by physics.

What's a good Word Error Rate for real-time STT?

Under 10% on clean English audio is good for general use; under 5% is excellent. Real-time WER is typically 2–4 percentage points worse than batch on the same audio because the model can't use future context. See our deep dive on WER for benchmarking specifics.

Can real-time STT run offline?

Yes, with caveats. The Web Speech API uses on-device models for some languages (Chrome implements local streaming for English in particular). Apple's Live Captions runs on-device. Self-hosted Whisper streaming is offline by design. The trade-off is generally lower accuracy and language support than cloud-based options, though the gap is closing.

Do I need a GPU to run real-time STT?

For cloud APIs, no — the GPU is the vendor's problem. For self-hosting (Whisper streaming), yes — a consumer GPU like an RTX 4080/4090 or a single cloud A100 will handle real-time inference for 5–10 concurrent streams. CPU-only inference works for tiny models but won't keep up with real-time on full-quality STT.

What are the free real-time STT options?

The Web Speech API (built into Chrome, Edge, Safari) is fully free, runs in the browser, and requires no API key. Otter, Notta, and Tactiq have free tiers with monthly minute caps. For developers, Whisper streaming self-hosted on hardware you already own has no per-minute cost, just compute. Most cloud APIs (Deepgram, AssemblyAI) offer a generous free credit on signup.

Conclusion

Real-time speech-to-text has crossed the threshold where it just works for most use cases — sub-300 ms latency on the leading APIs, accuracy that holds up in real meetings, and prices low enough that streaming entire workflows is no longer a budget question. Pick the tool that matches your latency budget and accuracy requirement; build the right partial-vs-final UI; and let users see their words on screen the moment they speak them.