What Is Word Error Rate (WER)? Formula, Calculation + Benchmarks (2026)

WER is the standard accuracy metric for speech recognition. Here is the formula, worked calculation, Python implementation, vendor benchmarks, and what counts as good.

Word Error Rate (WER) is the standard measurement for speech recognition accuracy — the metric every STT vendor publishes, every research paper reports, and every team uses to decide which transcription engine to buy. The formula is simple, the worked examples are short, but interpreting WER correctly is where most users go wrong. This guide covers exactly what WER measures, how to calculate it (manually and in Python), what numbers count as "good," and where WER quietly misleads even careful evaluators.

Key takeaways

  • WER (Word Error Rate) measures speech recognition accuracy as the percentage of words a model gets wrong — the lower, the better.

  • Formula: WER = (S + D + I) / N — substitutions plus deletions plus insertions, divided by the number of words in the reference.

  • Below 5% WER is excellent (medical/legal grade); 5–10% is solid for consumer products; above 25% usually means the system needs cleanup before you trust the output.

  • WER doesn't measure meaning — a transcript can have a low WER and still be useless if the few errors landed on the most important words.

What is Word Error Rate?

[Figure: Word Error Rate visualization with reference and predicted text comparison]

Word Error Rate is a metric that compares two transcripts — a reference (the ground-truth, human-verified version) and a prediction (what your STT system produced) — and quantifies how different they are. WER counts three kinds of mistakes the prediction can make: substitutions (wrong word), deletions (missing word), and insertions (extra word). Total mistakes divided by the number of words in the reference gives you WER, expressed as a percentage.

Lower WER means fewer mistakes. A WER of 0% means the prediction matches the reference exactly. A WER of 25% means roughly one in four words is wrong. Anything above 100% is technically possible (covered later) but indicates a transcript that's worse than useless.

Why WER matters

WER is the universal currency for STT comparison. Three reasons it dominates:

  • Universal benchmark. Every STT vendor publishes WER on standard datasets (LibriSpeech, CommonVoice, AMI Meeting Corpus). You can compare Deepgram, Google, Microsoft, and OpenAI on the same number.

  • Easy to compute. The formula is simple, the algorithms are open-source, and you can verify any vendor claim by running their model on your own labeled data.

  • Decision-making input. Procurement teams and engineers use WER to pick between vendors. "Vendor A is 1.5 percentage points better on our data" is a defensible reason to choose them.

WER shows up in our pillar guide on speech-to-text and our deep dive on real-time speech-to-text — every accuracy claim ultimately reduces to a WER number.

The WER formula

The math is straightforward:

WER = (S + D + I) / N

Where:
  S = number of Substitutions
  D = number of Deletions
  I = number of Insertions
  N = total words in the Reference text

To get a percentage, multiply by 100.

Connection to Levenshtein distance

WER is essentially edit distance applied at the word level instead of the character level. The Levenshtein distance algorithm — the same one used in spell checkers — computes the minimum number of insertions, deletions, and substitutions needed to transform one string into another. WER takes that count and divides by the length of the reference. Run Levenshtein on words instead of characters, and you have WER.
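
To make the connection concrete, here is a minimal word-level Levenshtein sketch in plain Python (a textbook dynamic-programming implementation, not any particular library's method), with WER falling out as the distance divided by the reference length:

def word_edit_distance(reference: str, prediction: str) -> int:
    """Minimum insertions, deletions, and substitutions at the word level."""
    ref, hyp = reference.split(), prediction.split()
    # dp[i][j] = edit distance between the first i reference words and first j prediction words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                          # delete every remaining reference word
    for j in range(len(hyp) + 1):
        dp[0][j] = j                          # insert every remaining prediction word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)]

reference = "the quarterly report is due on Friday"
prediction = "the quarterly report due on Tuesday"
distance = word_edit_distance(reference, prediction)
wer_score = distance / len(reference.split())
print(f"WER: {wer_score:.3f}")   # WER: 0.286  (2 errors / 7 reference words)

Libraries like jiwer (covered below) perform the same alignment, just faster and with the per-operation breakdown included.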

The 3 error types

Every WER discrepancy boils down to one of three operations:

Substitution (S)

The prediction has a word in the position where the reference had a different word.

Reference:  "send the report to Aisha"
Prediction: "send the report to Asia"
                              ^^^^
                              S = 1 substitution

Deletion (D)

The prediction is missing a word that the reference has.

Reference:  "the meeting starts at three"
Prediction: "meeting starts at three"
            ^^^
            D = 1 deletion

Insertion (I)

The prediction has an extra word that the reference doesn't.

Reference:  "we need to ship today"
Prediction: "we really need to ship today"
                ^^^^^^
                I = 1 insertion

WER weights all three error types equally, even though in practice deletions can be much worse for understanding ("the meeting was cancelled" → "meeting cancelled" still parses; "the meeting was cancelled" → "" doesn't).

How to calculate WER (worked example)

[Figure: Step-by-step calculation of Word Error Rate with formula breakdown]

Let's run through a concrete example.

Step 1 — Reference and prediction

Reference:  "the quarterly report is due on Friday"     (7 words)
Prediction: "the quarterly report due on Tuesday"        (6 words)

Step 2 — Align the two transcripts

Position:    1     2           3        4      5     6    7
Reference:   the   quarterly   report   is     due   on   Friday
Prediction:  the   quarterly   report   (gap)  due   on   Tuesday
Op:          OK    OK          OK       D      OK    OK   S

Step 3 — Count the errors

  • Substitutions (S): 1 ("Friday" → "Tuesday")

  • Deletions (D): 1 ("is" missing)

  • Insertions (I): 0

  • Reference length (N): 7

Step 4 — Apply the formula

WER = (S + D + I) / N
    = (1 + 1 + 0) / 7
    = 2 / 7
    ≈ 0.286
    = 28.6%

A WER of 28.6% on this 7-word sentence means roughly 29% of the reference content was mis-transcribed. Note that this is calculated on a single sentence — real WER reports typically average over many minutes or hours of audio.

WER vs accuracy

People often try to express WER as accuracy: "97% accurate" sounds friendlier than "3% WER." This works only when WER is below 100%. The relationship:

Accuracy ≈ 1 - WER       (when WER ≤ 100%)

Why WER can exceed 100%

Insertions don't replace anything in the reference — they add to the prediction. If a model hallucinates extra words, the count of S + D + I can exceed N (the reference length). Imagine a 10-word reference where the prediction has 50 words, half of them wrong: the WER could easily be 200%+. In the real world, this happens when an STT engine processes a noisy recording and produces "ghost" words, driving WER over 100% on that segment.

This is why accuracy = 1 - WER can produce a negative number. The relationship is a useful approximation when WER is reasonable; it breaks down at high error rates.
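
A contrived but illustrative sketch with jiwer: hallucinate far more words than the reference contains and the score sails past 1.0, i.e. past 100%.

from jiwer import wer

reference  = "stop the recording"                                   # 3 words
prediction = "please stop um the the live recording session now"   # 9 words, 6 of them extra

# All three reference words survive, but six insertions pile up:
# WER = (S + D + I) / N = (0 + 0 + 6) / 3 = 2.0, i.e. 200%
print(wer(reference, prediction))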

How to calculate WER programmatically

Manual calculation is fine for one example. For a real evaluation, use a library.

Python: the jiwer library

The de facto standard Python library for WER is jiwer. Install with pip install jiwer and use:

from jiwer import wer, compute_measures

reference = "the quarterly report is due on Friday"
prediction = "the quarterly report due on Tuesday"

# Single number
error_rate = wer(reference, prediction)
print(f"WER: {error_rate:.3f}")          # WER: 0.286

# Detailed breakdown
m = compute_measures(reference, prediction)
print(m)
# {'wer': 0.2857, 'mer': 0.2857, 'wil': 0.4048,
#  'wip': 0.5952, 'hits': 5,
#  'substitutions': 1, 'deletions': 1, 'insertions': 0}

jiwer also exposes related metrics: MER (Match Error Rate), WIL (Word Information Lost), and WIP (Word Information Preserved). For most practical use, plain WER is what you report.

Other libraries and tools

  • HuggingFace evaluate — evaluate.load("wer") works with any audio dataset; useful when benchmarking against models from the HuggingFace Hub (see the sketch after this list).

  • SCLITE (NIST) — the official tool used in NIST evaluations. Higher setup cost, used when you need official-quality numbers.

  • SpeechBrain — full ASR research framework that includes WER computation alongside training and evaluation.

  • jiwer + pandas — for per-utterance WER in a dataset (run jiwer on each sample, aggregate in pandas).
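
As a sketch of the HuggingFace route, assuming the evaluate package (and its jiwer dependency) is installed:

import evaluate

wer_metric = evaluate.load("wer")   # fetches the metric script on first use

references  = ["the quarterly report is due on Friday",
               "send the report to Aisha"]
predictions = ["the quarterly report due on Tuesday",
               "send the report to Asia"]

# Corpus-level WER: total errors across all pairs divided by total reference words,
# here (2 + 1) / (7 + 5) = 0.25
score = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {score:.3f}")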

What counts as a "good" WER? Benchmarks by use case

"Good" depends heavily on the application. Here's the lay of the land:

WER       Quality      Suitable for
<5%       Excellent    Medical, legal, court transcripts (with human review on top)
5–10%     Very good    Most consumer products: meeting transcripts, voice assistants, captions
10–15%    Acceptable   Noisy environments, accented English, draft-quality output
15–25%    Borderline   Recordings with heavy background noise, strong accents, technical jargon
>25%      Poor         Output needs significant human cleanup; not safe for downstream use

Vendor benchmarks (2026)

Approximate published WER on the LibriSpeech test-clean dataset (well-mic'd, well-enunciated English audiobook recordings — the easy benchmark):

  • OpenAI Whisper Large v3 — ~2.5% WER

  • Deepgram Nova-3 — ~3% WER

  • AssemblyAI Universal-2 — ~3% WER

  • Google Cloud STT (latest) — ~4% WER

  • Microsoft Azure Speech — ~4% WER

  • Speechmatics — ~3.5% WER (strong on accents)

On harder benchmarks (LibriSpeech test-other, Common Voice, AMI Meeting Corpus), the same models post WER 8–15% — closer to real-world conditions. Always run on your own data when evaluating: vendor benchmarks reflect their best dataset, not yours.

How to read benchmark claims

If a vendor claims "under 4% WER" without naming the dataset, treat the number with skepticism. The same model can post a WER 3× to 5× lower on LibriSpeech test-clean than on phone-call audio with three speakers. Always look for:

  • The dataset name (LibriSpeech, AMI, Switchboard, Common Voice, etc.).

  • The specific test split (test-clean vs test-other for LibriSpeech).

  • Whether normalization rules are stated.

Factors that affect WER

  • Audio quality. Sample rate, microphone, noise floor, and distance from the speaker all matter. 16 kHz clean studio audio gives 3% WER on the same model that produces 15% WER on a noisy phone call.

  • Speaker characteristics. Accent, speaking rate, and age. Children and elderly speakers tend to have higher WER on most models. Heavy accents can add 5–10 percentage points.

  • Content difficulty. Common conversational English is easy; medical jargon, code-switching between languages, and rare proper nouns are hard.

  • Model and training data. Models trained on more data, more languages, and more speaker diversity perform better. Whisper's 680k-hour training set is part of why it's so accurate.

  • Normalization rules. Two evaluators can score the same prediction differently depending on whether they treat punctuation as significant, lowercase the text, or expand contractions.

Transcript normalization (often overlooked)

Before computing WER, both reference and prediction need to be normalized — otherwise small formatting differences inflate WER artificially. Common normalization steps:

  • Lowercase everything (so "Hello" and "hello" are equal).

  • Strip punctuation (or treat it as a word, but apply consistently to both sides).

  • Expand contractions ("don't" → "do not"), or apply consistently.

  • Normalize numbers ("five" vs "5" — pick one).

  • Strip filler tokens like "um," "uh," "you know" — only if your reference does the same.

Most STT vendors and benchmarks use a standard normalization (often based on the Whisper text normalizer, which is open-source). When in doubt, use the same normalization the dataset's official evaluator uses.
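
A minimal sketch of what that looks like in practice, using only the standard library for the cleanup so it stays independent of any one evaluator's transform API, then handing the normalized strings to jiwer:

import re
import string
from jiwer import wer

def normalize(text: str) -> str:
    """Minimal normalization: lowercase, strip punctuation, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

reference  = "The meeting starts at 3 PM, right?"
prediction = "the meeting starts at 3 pm right"

print(wer(reference, prediction))                        # case and punctuation counted as errors
print(wer(normalize(reference), normalize(prediction)))  # 0.0 once both sides use the same rules

jiwer also ships composable transforms for exactly this; whichever route you take, the non-negotiable part is applying identical rules to both sides.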

Limitations of WER

  • Doesn't measure semantic correctness. A WER of 10% can be a transcript that's perfectly understandable, or one where the few errors landed on the most important words ("$10K" → "$10M"). WER doesn't distinguish.

  • Treats all errors as equal. Substituting "a" for "the" counts the same as substituting "Friday" for "Monday." For most downstream uses, those are very different errors.

  • Sensitive to small differences. Whether the punctuation is in or out of the calculation can shift WER by 1–2 percentage points. Always check normalization before comparing scores.

  • Doesn't capture readability. A 5% WER transcript with no punctuation, no capitalization, and run-on sentences can be harder to read than a 10% WER transcript with clean formatting.

  • Doesn't capture downstream usefulness. For a meeting summary, a slightly higher WER on filler words doesn't matter if the action items are captured correctly. WER doesn't know what's important.

Speechmatics published a well-known critique titled "The Problem with Word Error Rate" that argues for complementary metrics — semantic similarity, named-entity recall, downstream-task performance — alongside WER. Worth reading if you're making a serious vendor decision.

WER vs CER vs BLEU

Metric   What it measures                Best for
WER      Word-level edit distance        English and similar languages, conversational speech
CER      Character-level edit distance   Asian languages without spaces (Chinese, Japanese), OCR, fine-grained errors
BLEU     n-gram overlap with reference   Machine translation, where multiple correct outputs exist

For STT in English, French, German, Spanish, etc., WER is the right choice. For STT in Chinese or Japanese (where word boundaries are ambiguous), CER is more reliable. For machine translation, BLEU (or its successors like chrF and BERTScore) is appropriate. WER applied to a translation task wouldn't make sense — there's no single "correct" translation to compare against word-for-word.
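
To see the two metrics diverge on the same string pair, recent jiwer versions expose a cer function alongside wer (a sketch; exact numbers shift with normalization):

from jiwer import wer, cer

reference  = "the meeting starts at three"
prediction = "the meetings start at three"

print(f"WER: {wer(reference, prediction):.2f}")   # 2 of 5 words wrong -> 0.40
print(f"CER: {cer(reference, prediction):.2f}")   # only 2 characters differ, so far lower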

WER calculator tools

  • jiwer (Python) — the standard, open-source, accurate, runs on any text input. Best choice for engineers.

  • HuggingFace evaluate — Python, integrates with HuggingFace datasets and models. Best when benchmarking against published models.

  • SCLITE (NIST) — the gold-standard tool used in NIST evaluations. Authoritative numbers, higher setup cost.

  • MetricGate — free online WER calculator. Paste reference and prediction, get the WER. Useful for quick checks.

  • Amberscript WER calculator — another free online option, with detailed error breakdown.

  • SpeechBrain — full ASR research toolkit that includes WER alongside training utilities.

Frequently asked questions

Can WER exceed 100%?

Yes. Insertions add to the error count without reducing the reference length, so a transcript with many extra words can produce WER > 100%. In practice, this happens when an STT engine hallucinates words on noisy or silent audio. A WER over 100% is a strong signal that something is broken in your pipeline.

Is lower WER always better?

Lower is usually better, but not always meaningfully. A model that scores 4.2% vs 4.5% on the same dataset is statistically tied — pick on other factors (cost, latency, language support). Also: a low WER on the wrong dataset (LibriSpeech test-clean) doesn't mean a low WER on your data (noisy team meetings). Always test on your real audio.

What's the difference between WER and accuracy?

WER measures error; accuracy measures correctness. The simple relationship is Accuracy ≈ 1 - WER, but this only holds when WER is below 100%. At higher error rates, "accuracy" stops being meaningful. Engineers use WER; marketing teams sometimes use accuracy. The same data point.

Why does my WER vary on the same audio?

Likely normalization differences. Are punctuation and capitalization counted the same way both times? Are filler words ("um," "uh") in or out? Are numbers expanded ("five" vs "5")? Standardize your normalization rules and pin them in your evaluation pipeline.

What WER do medical and legal transcription require?

For unaided AI output, under 5% WER is the floor. In practice, medical and legal transcription rely on AI as a draft and a human as the final reviewer — the human catches the few errors that matter most. AI alone is rarely accepted as the certified record in regulated fields.

Can I calculate WER in Excel?

It's painful but possible. Excel doesn't have built-in edit distance, so you'd implement Levenshtein with VBA or Power Query. Realistically, copy your data into Python and use jiwer — five minutes of setup, accurate results, repeatable.

What's the best free WER calculator?

For one-off checks: MetricGate or Amberscript online calculators. For Python developers: jiwer (free, open-source). For HuggingFace integrations: the evaluate library. All produce identical numbers if normalization is the same — pick on convenience.

Conclusion

Word Error Rate is the right tool for the right job: a clean, comparable number that tells you how often a speech recognition system gets words wrong. The math is simple, the libraries are mature, and the benchmarks are public. Where most teams go wrong isn't the formula — it's interpretation: chasing a 0.5-percentage-point WER improvement that doesn't matter for their use case, or trusting a vendor's marketing benchmark instead of testing on their own data. Run WER on your own audio, hold normalization constant, and pair it with a downstream-task evaluation when the stakes are high. That's how the metric actually earns its place in your decision.