Our customers translate entire sites: marketing pages, product documentation, CMS content, and structured data that includes HTML markup, URLs, glossary terms, and custom instructions. A translation model might perform perfectly on simple text but fail when these additional constraints are involved. To make sure translations behave correctly in production, we run a continuous evaluation pipeline for new models.
This article looks into existing research, how we evaluate translation quality, what “quality” means in our context, and how we decide which models ship in Framer.
Evaluating translation quality
Before choosing a model, the first question is straightforward: how do you measure translation quality?
This turns out to be a more complex problem than it sounds. Over the years, researchers have proposed many evaluation methods, each with its own trade-offs. Those trade-offs matter when evaluating modern LLM-based translators.
BLEU: the classic metric
BLEU (Bilingual Evaluation Understudy) is one of the most commonly used metrics. It evaluates how much a model’s translation overlaps with a reference translation, assigning higher scores when the wording more closely matches the reference.
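The overlap idea behind BLEU can be sketched as modified n-gram precision. This is a simplified, single-reference version without the brevity penalty or multi-order averaging of full BLEU:

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Modified n-gram precision: the fraction of candidate n-grams
    that also appear in the reference, with counts clipped."""
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    cand_ngrams = Counter(tuple(cand_tokens[i:i + n])
                          for i in range(len(cand_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n])
                         for i in range(len(ref_tokens) - n + 1))
    overlap = sum(min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return overlap / total if total else 0.0

reference = "the cat sat on the mat"
# An exact match scores perfectly...
print(ngram_precision("the cat sat on the mat", reference))       # 1.0
# ...while an equally valid paraphrase shares no bigrams and scores 0.0,
# which is exactly the weakness discussed below.
print(ngram_precision("a cat was sitting on a mat", reference))   # 0.0
```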
For many years BLEU was the standard metric in machine translation research. However, it has clear limitations for modern language models. LLMs often produce translations that are perfectly correct while using different phrasing than the reference. Because BLEU rewards strict overlap, it can undervalue translations that express the same meaning using different wording. In practice, BLEU works best when there is a single expected translation, but natural language rarely behaves that way.
LLMs as evaluators
A more recent approach is to use another language model as the evaluator. Instead of comparing outputs to a reference translation, we ask an LLM to analyze a translation and score its quality directly.
Even relatively old models align well with human expert judgement. For example, Kocmi & Federmann (2023) found that GPT-3.5 (released over three years ago), used as a reference-free evaluator (GEMBA-DA), achieved 86% system-level pairwise accuracy against human MQM (Multidimensional Quality Metrics) labels across multiple language pairs.
Several evaluation methods follow this idea:
| Method | Description | Notes |
|---|---|---|
| COMET | A neural model designed to estimate translation quality. Some variants require reference translations, while others evaluate outputs directly. | COMET-QE evaluates translations without requiring reference translations. |
| G-Eval | A prompting-based method that uses a general-purpose LLM to evaluate outputs through structured reasoning prompts. | Flexible, but typically requires multiple model calls. |
| GEMBA | An LLM-based method designed specifically for machine translation evaluation. | GEMBA-MQM produces detailed quality metrics; GEMBA-DA outputs a score from 1–100. |
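As a rough sketch of what a reference-free judge sees, a GEMBA-DA-style prompt can be assembled like this. The wording below is illustrative, not the exact published prompt:

```python
def build_da_prompt(source, translation, src_lang, tgt_lang):
    """Build a reference-free direct-assessment prompt in the spirit of
    GEMBA-DA: the judge sees only the source and the candidate translation,
    and is asked for a single numeric score."""
    return (
        f"Score the following translation from {src_lang} to {tgt_lang} "
        f"on a scale from 1 to 100, where 1 means no meaning is preserved "
        f"and 100 means perfect meaning and grammar.\n\n"
        f"{src_lang} source: {source}\n"
        f"{tgt_lang} translation: {translation}\n"
        f"Score:"
    )

prompt = build_da_prompt("Guten Morgen", "Good morning", "German", "English")
print(prompt)
```

Note that no reference translation appears anywhere in the prompt, which is what makes the approach usable on content that has never been translated before.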
Translation quality is not just “good translation”
Evaluating translation quality might sound straightforward: check whether the translation is correct, and you are done. In practice, inputs are more complex.
As mentioned at the beginning, translations in Framer are not just text-to-text. Because of this diversity, we cannot benchmark every model against every tone or domain. Instead, we focus on what we can validate consistently: whether a model behaves correctly under the constraints that appear in real Framer projects.
These constraints often expose weaknesses that traditional translation benchmarks miss. A model may translate plain text perfectly but behave unpredictably once it needs to preserve glossary terms, follow URL formatting rules, or keep HTML structure intact.
For example, glossary configurations may require exact replacements while preserving case sensitivity:
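For illustration, a hypothetical glossary configuration and a matching check might look like the following. The schema is an assumption for this sketch, not Framer's actual format:

```python
import re

# Hypothetical glossary rules (illustrative schema): each entry pins the
# exact target form of a term, and case_sensitive controls whether
# "framer" is allowed to stand in for "Framer".
GLOSSARY = [
    {"source": "Framer", "target": "Framer", "case_sensitive": True},
    {"source": "canvas", "target": "canvas", "case_sensitive": False},
]

def glossary_violations(source_text, translated_text):
    """Return the glossary terms present in the source but missing,
    in their required form, from the translation."""
    missing = []
    for rule in GLOSSARY:
        flags = 0 if rule["case_sensitive"] else re.IGNORECASE
        if re.search(re.escape(rule["source"]), source_text, flags) and \
           not re.search(re.escape(rule["target"]), translated_text, flags):
            missing.append(rule["source"])
    return missing

print(glossary_violations("Open Framer and edit the canvas.",
                          "Öffne framer und bearbeite die Leinwand."))
# → ['Framer', 'canvas']: lowercase "framer" fails the case-sensitive
#   rule, and "canvas" was translated away entirely
```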
Many models struggle to apply rules like this consistently, especially when the prompt also contains other instructions or structured content. These failures may look small, but in real projects they quickly break user expectations.
Beyond translation quality, we also consider two product constraints: cost, since token pricing varies between models, and speed, since translations must return quickly in Framer’s localization UI. Waiting two minutes for a single translation that offers only a marginal quality gain is unacceptable, especially when apps like Google Translate are instant.
Continuous Evaluation: GEMBA-DMA
The challenges above show that a single assessment does not cut it. We need continuous evaluation to monitor improvements (and regressions) in quality and speed. We present a novel technique based on GEMBA-DA as our scoring method: GEMBA-DMA (GPT Estimation Metric Based Assessment, Direct Multi-Assessment).
The most significant change is that we use multiple models as judges. Second, we draw judges from a broad range of providers, not just OpenAI. Third, we let each round’s winners score the next round of candidates. This ensures our quality judgement evolves as frontier LLMs advance.
Our evaluation pipeline works roughly as follows: we generate translations using several candidate models, run them across a representative set of languages, and then ask multiple judge models to score the outputs. This produces stable rankings without requiring reference translations.
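The loop above can be sketched as follows, with `translate()` and `judge_score()` as stand-ins for the real model API calls:

```python
from statistics import mean

def translate(model, text, lang):
    """Stub for a real translation API call."""
    return f"[{model}:{lang}] {text}"

def judge_score(judge, source, translation):
    """Stub for a real judge call; a real judge returns a
    GEMBA-style direct-assessment score for the translation."""
    return 90

def evaluate(candidates, judges, inputs, languages):
    """Score every candidate model across all inputs and languages with
    every judge, then average into one ranking value per candidate."""
    rankings = {}
    for model in candidates:
        scores = []
        for text in inputs:
            for lang in languages:
                translation = translate(model, text, lang)
                # Judges never see which model produced the translation.
                scores.extend(judge_score(j, text, translation) for j in judges)
        rankings[model] = mean(scores)
    # Highest average score first.
    return dict(sorted(rankings.items(), key=lambda kv: -kv[1]))

rankings = evaluate(["model-a", "model-b"], ["judge-1", "judge-2"],
                    ["Good morning"], ["nl", "de"])
```

With stubbed judges every model trivially scores 90; in production each cell of this loop is a real API call, which is why cost and latency are tracked alongside quality.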
Worth mentioning: judges never know which model produced what they score, so seeing Opus 4.6 as a judge rate a GPT 5.4 translation higher than its own is not unusual.
Evaluation Results
The tables below show a sample of our evaluation results across several frontier models and popular Open-Source models. We translate a shared set of inputs across multiple languages and score the outputs using our GEMBA-DMA judging setup. Alongside the quality scores, we also record runtime and token usage so we can evaluate the cost and latency trade-offs of each model.
Model | Reasoning | Duration | Cost | Avg |
|---|---|---|---|---|
GPT 5.2 | Low | 18.42s | $0.0172 | 95 |
GPT-5.3 Codex | High | 52.26s | $0.0472 | 95 |
GPT 5.4 | Low | 20.92s | $0.0195 | 95 |
Opus 4.6 | None | 23.45s | $0.0471 | 94 |
Gemini 3 Flash | High | 26.35s | $0.0120 | 94 |
GPT 5.4 Mini | Medium | 12.34s | $0.0083 | 94 |
Gemini 3.1 Flash Lite | None | 8.59s | $0.0021 | 93 |
Sonnet 4.6 | Low | 48.36s | $0.0294 | 92 |
GPT 5.4 Nano | Low | 6.79s | $0.0016 | 92 |
Grok 4.20 | None | 7.57s | $0.0097 | 91 |
Kimi K2.5 | None | 135.61s | $0.0169 | 90 |
Gemini 3.1 Pro | Minimal | 26.84s | $0.0412 | 90 |
GPT OSS 120B | High on Groq | 18.78s | $0.0051 | 90 |
GLM 5 | Medium on Baseten | 29.20s | $0.0046 | 90 |
Qwen3.5 397B A17B | Low | 91.02s | $0.0300 | 85 |
Haiku 4.5 | Medium | 45.07s | $0.0156 | 85 |
DeepSeek V3.2 | High | 112.71s | $0.0011 | 83 |
MiMo V2 Pro | Minimal | 95.42s | $0.0184 | 82 |
MiniMax M2.7 | High | 63.20s | $0.0059 | 79 |
Full results here. Note: duration can fluctuate between runs, so the results do not necessarily mean that, e.g., GPT 5.2 at low reasoning is slower than at medium.
The frontier models score close to each other in quality, but their cost and speed differ drastically, and scores per language also vary slightly. Surprisingly, smaller models perform on par with frontier models; e.g. GPT 5.4 Nano is very fast and cheap while maintaining quality close to our winner, GPT 5.2.
Also worth highlighting: giving models more thinking time, or increasing the reasoning level, did not always improve the translation. From our investigation, it seems models can start overthinking translations, leading to worse quality.
Structural Tests
Quality alone is not enough for a production translation system. A model might score highly on natural language quality but still fail structural constraints that matter in real websites. For that reason we run a second suite of deterministic tests that validate HTML handling, glossary preservation, and slug formatting.
These checks make it easy to spot models that look strong in translation quality but behave unpredictably when additional rules are involved. In practice, the combination of LLM-based scoring and structural validation gives us a much clearer signal of which models are safe to deploy in Framer.
Model | Reasoning | HTML | Glossary | Slug |
|---|---|---|---|---|
GPT 5.2 | Low | 100% | 67% | Pass |
GPT-5.3 Codex | High | 100% | 100% | Pass |
Gemini 3 Flash | Low | 100% | 83% | Pass |
Gemini 3 Flash | Medium | 100% | 100% | Pass |
GPT 5.4 | Medium | 100% | 100% | Pass |
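A minimal version of the HTML structure check can be written by comparing tag sequences with Python’s standard `html.parser`. This is a sketch; the production checks validate more than tag order:

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect the sequence of opening and closing tags, ignoring text."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(("open", tag))

    def handle_endtag(self, tag):
        self.tags.append(("close", tag))

def same_structure(source_html, translated_html):
    """True when the translation preserves the exact tag sequence
    of the source markup."""
    a, b = TagCollector(), TagCollector()
    a.feed(source_html)
    b.feed(translated_html)
    return a.tags == b.tags

print(same_structure("<p>Hello <strong>world</strong></p>",
                     "<p>Hallo <strong>Welt</strong></p>"))   # True
print(same_structure("<p>Hello <strong>world</strong></p>",
                     "<p>Hallo Welt</p>"))                    # False: tag dropped
```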
Testing the hard cases
One key lesson from machine translation research is that overly simple test sets can give a false sense of progress. Many benchmarks consist of clean, well-formed sentences that modern models handle almost flawlessly, making it hard to meaningfully differentiate between systems.
The WMT25 General MT shared task (Kocmi et al., 2025) addressed this problem by introducing difficulty sampling, which focuses evaluation on harder examples. Their results showed that even frontier LLMs still struggle with robustness on non-standard input and domain-specific content.
We follow a similar philosophy. Instead of idealized examples, we evaluate models on our own internal content and targeted cases that resemble real Framer projects. This also helps avoid train-test contamination, since the evaluation data is not part of public datasets.
Structural validation tests
Alongside LLM-based scoring, we maintain a set of deterministic validation tests designed to catch structural failures that could break real websites.
Check | Description |
|---|---|
HTML structure checks | Translate inputs containing HTML and verify that the markup structure remains valid. |
Glossary preservation checks | Translate content with a configured glossary and verify that required terms are preserved exactly according to the configured rules. |
URL slug checks | Translate page slugs and verify that the output matches our allowed slug format. |
Slug errors are usually binary: the output either follows the format or breaks the URL. These tests complement LLM scoring by giving us fast signals when structural regressions appear.
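A minimal slug check can be a single regular expression. The allowed format here (lowercase ASCII letters, digits, and single hyphens) is an assumption for the sketch, not necessarily the exact production rules:

```python
import re

# Assumed slug format: lowercase ASCII words joined by single hyphens,
# with no leading or trailing hyphen.
SLUG_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")

def is_valid_slug(slug):
    """True when the translated slug matches the allowed URL format."""
    return bool(SLUG_RE.fullmatch(slug))

print(is_valid_slug("pricing-and-plans"))   # True
print(is_valid_slug("Préstamos rápidos"))   # False: spaces, accents, uppercase
```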
Summary
With GEMBA-DMA, whenever a new frontier model becomes available, we include it in our evaluation set and run the full benchmark suite. If it shows improvements in translation quality, structural accuracy, speed, and cost, we consider promoting it to production.
Our multi-assessments also keep our evaluations practical, consistent, and grounded in real product constraints. Most importantly, they ensure translations remain reliable across both the structured and the unpredictable content found on real websites.
With all that, we are confident our evaluation pipeline allows fast, practical, and repeatable evals today and tomorrow.
Found this post interesting? We are hiring. Join us at framer.com/careers.