
How real-time speech translation works — from microphone to 225 languages

Real-time speech translation converts a speaker's voice into translated audio in 225 languages through a pipeline of speech recognition, machine translation, and voice synthesis. Here is how each stage works.

Last updated · May 27, 2026 · 9 min read

Real-time speech translation seems like magic: one person speaks, and moments later hundreds of listeners hear the same words in their own language. Behind that experience is a software pipeline running three AI models in sequence, completing the journey from spoken word to translated audio in under a second.

This article walks through each stage of that pipeline — speech recognition, machine translation, and voice synthesis — and explains how they combine to deliver 225 languages to a live audience.

Stage 1: Speech-to-text — capturing what the speaker says

How streaming STT works

The pipeline begins the moment the speaker opens their mouth. The browser captures audio from the microphone and sends it over WebRTC — the same protocol used for video calls — to a LiveKit SFU (Selective Forwarding Unit). The SFU routes the audio track to the translation agent running on the server.

The agent does not wait for a complete sentence. Instead, it streams audio in small chunks to Deepgram Nova-3, a neural speech recognition model. Deepgram returns partial transcripts that get refined as more audio arrives. A sentence like “good morning everyone and welcome to the conference” might arrive as three partial results: “good morning,” then “good morning everyone and,” then the complete sentence. Each refinement updates downstream translation in near-real-time.
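The refinement of partial transcripts can be sketched in a few lines. This is a minimal illustration, not Deepgram's API: the `new_words` helper and the hard-coded list of partials are hypothetical, and a real client would receive transcript events over a streaming connection.

```python
def new_words(previous: str, current: str) -> list[str]:
    """Return the words in `current` that were not already in `previous`.

    If the recognizer revised earlier words instead of only appending,
    return the whole new hypothesis so downstream stages can replace
    what they had.
    """
    prev_words = previous.split()
    cur_words = current.split()
    if cur_words[: len(prev_words)] == prev_words:
        return cur_words[len(prev_words):]
    return cur_words


# The three partial results from the example sentence above.
partials = [
    "good morning",
    "good morning everyone and",
    "good morning everyone and welcome to the conference",
]

previous = ""
for partial in partials:
    print(new_words(previous, partial))
    previous = partial
```

Forwarding only the new words keeps downstream work small when a partial simply extends the previous one, while a revised hypothesis is resent in full.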

This streaming approach is what keeps latency low. The system does not buffer an entire utterance before acting — it starts processing within tens of milliseconds of receiving audio. By the time the speaker finishes a sentence, the translation pipeline is already well underway.
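To make "small chunks" concrete, here is a rough sketch of splitting audio into fixed-size frames. The numbers are assumptions for illustration only: the article does not state the audio format, so this assumes the agent forwards 16 kHz, 16-bit mono PCM in 20 ms frames. `frame_size_bytes` and `iter_frames` are hypothetical helper names.

```python
from typing import Iterator


def frame_size_bytes(sample_rate_hz: int, frame_ms: int,
                     bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Bytes needed to hold `frame_ms` milliseconds of PCM audio."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples * bytes_per_sample * channels


def iter_frames(pcm: bytes, frame_bytes: int) -> Iterator[bytes]:
    """Yield consecutive frames; the last one may be shorter."""
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


# Assumed format: 16 kHz, 16-bit mono, 20 ms per frame.
FRAME_BYTES = frame_size_bytes(16_000, 20)
one_second = bytes(16_000 * 2)  # one second of silence
frames = list(iter_frames(one_second, FRAME_BYTES))
print(FRAME_BYTES, len(frames))
```

Under these assumptions each frame carries 20 ms of speech, so the recognizer never has to wait long for new audio.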

Speaker language detection

Deepgram Nova-3 supports 49 speaker language codes, which are language-region variants such as American English (en-US), Brazilian Portuguese (pt-BR), and Simplified Chinese (zh-CN). The speaker selects their language when starting the session. This matters because accurate speech recognition requires knowing the input language. “Auto-detect” models exist, but they add latency and reduce accuracy for less common languages, an unacceptable trade-off in a live setting.
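A session setup step might validate the speaker's chosen code before any audio flows. The sketch below is hypothetical: `SUPPORTED_SPEAKER_CODES` lists only the three codes named above rather than all 49, and the helpers handle only the simple language-region form.

```python
# Illustrative subset: only the codes named in this article, not the full 49.
SUPPORTED_SPEAKER_CODES = frozenset({"en-US", "pt-BR", "zh-CN"})


def normalize_language_code(code: str) -> str:
    """Normalize 'pt_br' or 'PT-br' to the 'pt-BR' form."""
    language, _, region = code.strip().replace("_", "-").partition("-")
    if not language:
        raise ValueError("empty language code")
    if region:
        return f"{language.lower()}-{region.upper()}"
    return language.lower()


def require_supported(code: str) -> str:
    """Return the normalized code, or raise if it is not supported."""
    normalized = normalize_language_code(code)
    if normalized not in SUPPORTED_SPEAKER_CODES:
        raise ValueError(f"unsupported speaker language: {normalized}")
    return normalized


print(require_supported("pt_br"))
```

Failing fast here is cheaper than discovering a bad code after the speaker has started talking.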

For practical tips on getting the cleanest audio into the pipeline — microphone choice, placement, and room acoustics — see our guide on choosing the right microphone.

Stage 2: Machine translation — converting meaning across languages

The translation engine

Once the speech-to-text stage produces a transcript, the text passes into machine translation. The engine depends on the speaker’s plan:

  • Free tier: Google Cloud NMT (Neural Machine Translation) — fast and reliable for major language pairs. NMT is a production-proven model trained on billions of parallel sentences, and it handles straightforward translation with low latency.
  • Paid tiers (Starter, Pro, Max): DualModelTranslator, which uses Google Cloud Translation LLM for roughly 100 languages where large language models produce more natural, context-aware output, and falls back to NMT for the remaining pairs. The LLM advantage is real: it handles idioms, register shifts, domain-specific terminology, and long-range context better than conventional NMT. For simpler pairs (Spanish to Portuguese, for example) NMT is faster and equally accurate, so the system routes accordingly.
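The routing described in the two bullets above could look roughly like this. It is a sketch under stated assumptions, not the actual DualModelTranslator: the language sets are placeholders (the article gives a count of roughly 100, not a list), and `choose_engine` is a hypothetical name.

```python
from enum import Enum


class Plan(Enum):
    FREE = "free"
    STARTER = "starter"
    PRO = "pro"
    MAX = "max"


# Placeholder sets for illustration; the real lists are not published here.
LLM_LANGUAGES = frozenset({"ja", "de", "ar", "pt"})
NMT_PREFERRED_PAIRS = frozenset({("es", "pt")})


def choose_engine(plan: Plan, source: str, target: str) -> str:
    """Pick 'nmt' or 'llm' for one source-target pair."""
    if plan is Plan.FREE:
        return "nmt"  # free tier always uses NMT
    if (source, target) in NMT_PREFERRED_PAIRS:
        return "nmt"  # simple pair: NMT is faster and equally accurate
    return "llm" if target in LLM_LANGUAGES else "nmt"


print(choose_engine(Plan.PRO, "en", "ja"))
print(choose_engine(Plan.PRO, "es", "pt"))
```

Keeping the decision in one pure function makes the fallback behavior easy to test independently of either translation service.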

Handling 225 output languages

The system supports 225 output languages, split into two tiers:

  • 51 languages receive full audio. Translated text is synthesized into speech via Google Cloud TTS and delivered as a live audio stream.
  • 174 additional languages receive live text captions. The captions are genuine translations, not transcripts of the source speech, but they are delivered as scrolling text instead of audio.

Languages are activated on demand. When a listener joins a session and picks their language, the pipeline creates a translation stream for that specific source-target pair. If nobody selects Finnish, no Finnish translation is generated — and no language-hours are consumed for it. See the full list of supported languages for audio and caption coverage.
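On-demand activation can be modeled as a reference count per target language. The `StreamRegistry` class below is a hypothetical sketch. It also assumes that a stream is torn down when its last listener leaves, which the article does not state.

```python
from collections import Counter


class StreamRegistry:
    """Tracks which target languages currently have listeners."""

    def __init__(self, source_language: str) -> None:
        self.source_language = source_language
        self._listeners: Counter[str] = Counter()

    def join(self, target_language: str) -> bool:
        """Register a listener. Returns True if a new stream must start."""
        self._listeners[target_language] += 1
        return self._listeners[target_language] == 1

    def leave(self, target_language: str) -> bool:
        """Remove a listener. Returns True if the stream can stop."""
        if self._listeners[target_language] == 0:
            raise ValueError(f"no listeners for {target_language}")
        self._listeners[target_language] -= 1
        if self._listeners[target_language] == 0:
            del self._listeners[target_language]
            return True
        return False

    def active_pairs(self) -> list[tuple[str, str]]:
        """Source-target pairs that currently need translation."""
        return [(self.source_language, t) for t in sorted(self._listeners)]


registry = StreamRegistry("en-US")
print(registry.join("fr"))   # first French listener starts the stream
print(registry.join("fr"))   # second listener reuses it
print(registry.active_pairs())
```

Because Finnish never appears in the registry unless someone joins it, no work is done for that language.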

Latency in the translation step

Machine translation is typically the fastest of the three processing stages:

  • NMT: typically 50–150 ms per sentence fragment
  • LLM: typically 100–300 ms per fragment — higher quality for complex text, marginally slower

Because the streaming architecture feeds partial transcripts into translation as they arrive, the system does not wait for a complete sentence before translating. Partial results are refined as more context becomes available, which means the listener receives a steady stream of translated content rather than a series of discrete bursts.

Stage 3: Text-to-speech — giving the translation a voice

How TTS synthesis works

For the 51 audio languages, the translated text passes to Google Cloud TTS. The model generates a natural-sounding audio waveform in the target language. Each language has its own voice model tuned for that language’s phonology — the rhythm, intonation, and consonant-vowel patterns that make speech sound natural rather than robotic.

The synthesized audio is published as a new audio track on the LiveKit SFU. Each language gets its own track, independent of the others.

Audio delivery to listeners

Delivery also runs over WebRTC, which is optimized for low-latency real-time media. Each listener subscribes to the audio track matching their chosen language. There is no mixing and no switching: the listener hears one continuous stream in their language from start to finish.
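The one-track-per-language model reduces subscription to a lookup. The sketch below does not use the LiveKit SDK. The `translation-<code>` naming scheme, the `AUDIO_LANGUAGES` subset, and both helper functions are assumptions made for illustration.

```python
from typing import Optional

# Illustrative subset; the article says 51 languages receive audio.
AUDIO_LANGUAGES = frozenset({"fr", "es", "de"})


def track_name(language: str) -> str:
    """Hypothetical naming scheme for a language's audio track."""
    return f"translation-{language}"


def track_to_subscribe(language: str, published_tracks: set[str]) -> Optional[str]:
    """Return the track a listener should subscribe to, if any.

    Caption-only languages and tracks that are not yet published
    both return None.
    """
    if language not in AUDIO_LANGUAGES:
        return None
    name = track_name(language)
    return name if name in published_tracks else None


published = {"translation-fr", "translation-es"}
print(track_to_subscribe("fr", published))
print(track_to_subscribe("fi", published))
```

Each listener resolves exactly one track, so adding a language never changes what existing listeners hear.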

Listeners can join from a phone, tablet, or laptop. For the full audience experience — how a listener scans a QR code, picks a language, and connects — see how QR code translation works.

The full pipeline in numbers

| Pipeline stage | Technology | Latency | Cost per language-hour |
| --- | --- | --- | --- |
| Speech-to-text | Deepgram Nova-3 (streaming) | 200–400 ms | ~$0.46 |
| Translation | Google Cloud NMT / Translation LLM | 50–300 ms | ~$0.02–0.08 |
| Text-to-speech | Google Cloud TTS | 100–200 ms | ~$0.79 |
| Audio delivery | WebRTC via LiveKit SFU | <100 ms | $0 (self-hosted) |
| End-to-end | | 350 ms–1 s | ~$1.27–$1.33 |
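The end-to-end row is simply the sum of the rows above it, which a short script can confirm. The figures are the approximate ones from the table. Treating "<100 ms" as a range of 0 to 100 ms is an assumption, and `STAGES` and `total_range` are hypothetical names.

```python
# (low, high) values taken from the table; "<100 ms" is treated as 0-100 ms.
STAGES = {
    "speech_to_text": {"latency_ms": (200, 400), "cost_usd": (0.46, 0.46)},
    "translation": {"latency_ms": (50, 300), "cost_usd": (0.02, 0.08)},
    "text_to_speech": {"latency_ms": (100, 200), "cost_usd": (0.79, 0.79)},
    "audio_delivery": {"latency_ms": (0, 100), "cost_usd": (0.0, 0.0)},
}


def total_range(key: str) -> tuple[float, float]:
    """Sum the low and high ends of one column across all stages."""
    low = sum(stage[key][0] for stage in STAGES.values())
    high = sum(stage[key][1] for stage in STAGES.values())
    return low, high


latency = total_range("latency_ms")
cost = tuple(round(value, 2) for value in total_range("cost_usd"))
print(latency, cost)
```

The totals come out to 350–1000 ms and $1.27–$1.33 per language-hour, matching the last row.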

Where latency accumulates

End-to-end latency has three sources:

  1. Network ingress — the time for audio to travel from the speaker’s browser, through the LiveKit SFU, to the translation agent. This depends on the speaker’s internet connection but is typically under 100 ms on a stable connection.
  2. Processing — STT + translation + TTS. This is the bulk of the delay: roughly 350–900 ms depending on the language pair and whether the system uses NMT or LLM translation.
  3. Network egress — the time for the translated audio track to travel from the SFU to each listener’s device. Again, typically under 100 ms.

Total end-to-end latency for audio languages typically falls between 0.5 and 1.0 seconds. Text caption languages skip the TTS step entirely, so they arrive faster — but without synthesized audio. For a deeper comparison of AI-driven translation against traditional human interpretation, see real-time translation vs simultaneous interpretation.
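The effect of skipping TTS can be estimated from the per-stage figures quoted earlier. This is a back-of-the-envelope sketch: the caption-path number is derived here, not stated in the article, and the helper names are hypothetical.

```python
# Per-stage processing ranges in milliseconds, from the pipeline table.
STT_MS = (200, 400)
TRANSLATION_MS = (50, 300)
TTS_MS = (100, 200)


def add_ranges(*ranges: tuple[int, int]) -> tuple[int, int]:
    """Add (low, high) ranges component-wise."""
    return sum(r[0] for r in ranges), sum(r[1] for r in ranges)


def processing_ms(with_tts: bool) -> tuple[int, int]:
    """Processing delay for the audio path (with TTS) or caption path."""
    stages = [STT_MS, TRANSLATION_MS]
    if with_tts:
        stages.append(TTS_MS)
    return add_ranges(*stages)


print("audio:", processing_ms(with_tts=True))
print("captions:", processing_ms(with_tts=False))
```

The audio path reproduces the 350–900 ms processing figure above, while the caption path comes out 100–200 ms shorter.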

Why this matters for event organizers

Sub-second latency means listeners can follow along naturally. They are not waiting awkwardly for translation to catch up — they hear the translated version close enough to the original that the rhythm of the talk is preserved. In practice, most audiences report that a consistent 0.5–1.0 second delay feels like a natural pause rather than a technical lag.

225 languages means no audience member is excluded. Whether the event serves a dozen languages or two hundred, the same pipeline handles all of them without additional hardware, personnel, or setup time.

The pipeline runs continuously for hours without fatigue, unlike human interpreters, who typically rotate every 20 to 30 minutes to maintain accuracy. A four-hour conference translated into eight languages runs the same pipeline from start to finish, with consistent quality throughout.

Cost is driven by language tracks, not audience size. Whether 5 or 350 people listen in French, the cost is one language-hour per hour. For a full breakdown of the billing model, see the language-hour pricing model.
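The language-hour arithmetic is easy to check. In this sketch the default rate is the article's approximate cost of $1.30 per language-hour, used only as an illustrative figure. The actual billed price may differ, and the function names are hypothetical.

```python
def language_hours(listeners_by_language: dict[str, int],
                   duration_hours: float) -> float:
    """Language-hours consumed: active languages times duration.

    A language counts once if it has at least one listener,
    no matter how many people are listening.
    """
    active = sum(1 for count in listeners_by_language.values() if count > 0)
    return active * duration_hours


def estimated_cost(hours: float, rate_usd: float = 1.30) -> float:
    """Approximate cost at an assumed per-language-hour rate."""
    return round(hours * rate_usd, 2)


# 5 or 350 French listeners: still one language-hour per hour.
print(language_hours({"fr": 5}, 1.0), language_hours({"fr": 350}, 1.0))

# A four-hour conference with listeners in eight languages.
audience = {lang: 40 for lang in ["fr", "es", "de", "it", "pt", "ja", "ko", "ar"]}
hours = language_hours(audience, 4.0)
print(hours, estimated_cost(hours))
```

Under these assumptions, a four-hour event in eight languages consumes 32 language-hours regardless of audience size.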

The bottom line

Real-time speech translation is a three-stage pipeline (recognize, translate, synthesize) that converts one speaker’s voice into hundreds of listeners’ languages in under a second. Each stage runs on a production-proven component: Deepgram for speech recognition and Google Cloud for translation and voice synthesis, with WebRTC handling delivery. The components are not experimental. They run at scale in production environments every day.

The technology is mature enough for conferences, town halls, classrooms, and broadcasts. It is running at events today, delivering 225 languages with sub-second latency at a cost of roughly $1.30 per language-hour.


Want to see real-time speech translation in action? Start a free session — speak in any of 49 languages, your audience hears in 225. No setup, no credit card.