Our customers translate entire sites: marketing pages, product documentation, CMS content, and structured data that includes HTML markup, URLs, glossary terms, and custom instructions. A translation model might perform perfectly on simple text but fail when these additional constraints are involved. To make sure translations behave correctly in production, we run a continuous evaluation pipeline for new models.
This article looks into existing research, how we evaluate translation quality, what “quality” means in our context, and how we decide which models ship in Framer.
Evaluating translation quality
Before choosing a model, the first question is straightforward: how do you measure translation quality?
This turns out to be a more complex problem than it sounds. Over the years, researchers have proposed many evaluation methods, each with its own trade-offs. Those trade-offs matter when evaluating modern LLM-based translators.
BLEU: the classic metric
BLEU (Bilingual Evaluation Understudy) is one of the most commonly used metrics. It evaluates how much a model’s translation overlaps with a reference translation, assigning higher scores when the wording more closely matches the reference.
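The overlap idea behind BLEU can be sketched as modified n-gram precision. This is a simplified, single-reference version without the brevity penalty or multi-order averaging of full BLEU:

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Modified n-gram precision: the fraction of candidate n-grams
    that also appear in the reference, with counts clipped."""
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    cand_ngrams = Counter(tuple(cand_tokens[i:i + n])
                          for i in range(len(cand_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n])
                         for i in range(len(ref_tokens) - n + 1))
    overlap = sum(min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return overlap / total if total else 0.0

reference = "the cat sat on the mat"
# An exact match scores perfectly...
print(ngram_precision("the cat sat on the mat", reference))       # 1.0
# ...while an equally valid paraphrase shares no bigrams and scores 0.0,
# which is exactly the weakness discussed below.
print(ngram_precision("a cat was sitting on a mat", reference))   # 0.0
```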
For many years BLEU was the standard metric in machine translation research. However, it has clear limitations for modern language models. LLMs often produce translations that are perfectly correct while using different phrasing than the reference. Because BLEU rewards strict overlap, it can undervalue translations that express the same meaning using different wording. In practice, BLEU works best when there is a single expected translation, but natural language rarely behaves that way.
LLMs as evaluators
A more recent approach is to use another language model as the evaluator. Instead of comparing outputs to a reference translation, we ask an LLM to analyze a translation and score its quality directly.
Even relatively old models align well with human expert judgement. For example, Kocmi & Federmann (2023) found that GPT-3.5 (released over three years ago), used as a reference-free evaluator (GEMBA-DA), achieved 86% system-level pairwise accuracy against human MQM (Multidimensional Quality Metrics) labels across multiple language pairs.
Several evaluation methods follow this idea:
| Method | Description | Notes |
|---|---|---|
| COMET | A neural model designed to estimate translation quality. Some variants require reference translations, while others evaluate outputs directly. | COMET-QE evaluates translations without requiring reference translations. |
| G-Eval | A prompting-based method that uses a general-purpose LLM to evaluate outputs through structured reasoning prompts. | Flexible, but typically requires multiple model calls. |
| GEMBA | An LLM-based method designed specifically for machine translation evaluation. | GEMBA-MQM produces detailed quality metrics; GEMBA-DA outputs a score from 1–100. |
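As a rough sketch of what a reference-free judge sees, a GEMBA-DA-style prompt can be assembled like this. The wording below is illustrative, not the exact published prompt:

```python
def build_da_prompt(source, translation, src_lang, tgt_lang):
    """Build a reference-free direct-assessment prompt in the spirit of
    GEMBA-DA: the judge sees only the source and the candidate translation,
    and is asked for a single numeric score."""
    return (
        f"Score the following translation from {src_lang} to {tgt_lang} "
        f"on a scale from 1 to 100, where 1 means no meaning is preserved "
        f"and 100 means perfect meaning and grammar.\n\n"
        f"{src_lang} source: {source}\n"
        f"{tgt_lang} translation: {translation}\n"
        f"Score:"
    )

prompt = build_da_prompt("Guten Morgen", "Good morning", "German", "English")
print(prompt)
```

Note that no reference translation appears anywhere in the prompt, which is what makes the approach usable on content that has never been translated before.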
Translation quality is not just “good translation”
Evaluating translation quality might sound straightforward: check whether the translation is correct, and you are done. In practice, inputs are more complex.
As mentioned at the beginning, translations in Framer are not just text-to-text. Because of this diversity, we cannot benchmark every model against every tone or domain. Instead, we focus on what we can validate consistently: whether a model behaves correctly under the constraints that appear in real Framer projects.
These constraints often expose weaknesses that traditional translation benchmarks miss. A model may translate plain text perfectly but behave unpredictably once it needs to preserve glossary terms, follow URL formatting rules, or keep HTML structure intact.
For example, glossary configurations may require exact replacements while preserving case sensitivity:
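For illustration, a hypothetical glossary configuration and a matching check might look like the following. The schema is an assumption for this sketch, not Framer's actual format:

```python
import re

# Hypothetical glossary rules (illustrative schema): each entry pins the
# exact target form of a term, and case_sensitive controls whether
# "framer" is allowed to stand in for "Framer".
GLOSSARY = [
    {"source": "Framer", "target": "Framer", "case_sensitive": True},
    {"source": "canvas", "target": "canvas", "case_sensitive": False},
]

def glossary_violations(source_text, translated_text):
    """Return the glossary terms present in the source but missing,
    in their required form, from the translation."""
    missing = []
    for rule in GLOSSARY:
        flags = 0 if rule["case_sensitive"] else re.IGNORECASE
        if re.search(re.escape(rule["source"]), source_text, flags) and \
           not re.search(re.escape(rule["target"]), translated_text, flags):
            missing.append(rule["source"])
    return missing

print(glossary_violations("Open Framer and edit the canvas.",
                          "Öffne framer und bearbeite die Leinwand."))
# → ['Framer', 'canvas']: lowercase "framer" fails the case-sensitive
#   rule, and "canvas" was translated away entirely
```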
Many models struggle to apply rules like this consistently, especially when the prompt also contains other instructions or structured content. These failures may look small, but in real projects they quickly break user expectations.
Beyond translation quality, we also consider two product constraints: cost, since token pricing varies between models, and speed, since translations must return quickly in Framer’s localization UI. Waiting two minutes for a single translation that offers only a marginal quality gain is unacceptable, especially when apps like Google Translate are instant.
Continuous Evaluation: GEMBA-DMA
The challenges above show that a single assessment does not cut it. We need continuous evaluation to monitor improvements (and regressions) in quality and speed. We present a novel technique based on GEMBA-DA as our scoring method: GEMBA-DMA (GPT Estimation Metric Based Assessment, Direct Multi-Assessment).
The most significant change is that we use multiple models as judges. Second, we draw judges from a broad range of providers, not just OpenAI. Third, we let each round’s winners score the next round of candidates. This ensures our quality judgement evolves as frontier LLMs advance.
Our evaluation pipeline works roughly as follows: we generate translations using several candidate models, run them across a representative set of languages, and then ask multiple judge models to score the outputs. This produces stable rankings without requiring reference translations.
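The loop above can be sketched as follows, with `translate()` and `judge_score()` as stand-ins for the real model API calls:

```python
from statistics import mean

def translate(model, text, lang):
    """Stub for a real translation API call."""
    return f"[{model}:{lang}] {text}"

def judge_score(judge, source, translation):
    """Stub for a real judge call; a real judge returns a
    GEMBA-style direct-assessment score for the translation."""
    return 90

def evaluate(candidates, judges, inputs, languages):
    """Score every candidate model across all inputs and languages with
    every judge, then average into one ranking value per candidate."""
    rankings = {}
    for model in candidates:
        scores = []
        for text in inputs:
            for lang in languages:
                translation = translate(model, text, lang)
                # Judges never see which model produced the translation.
                scores.extend(judge_score(j, text, translation) for j in judges)
        rankings[model] = mean(scores)
    # Highest average score first.
    return dict(sorted(rankings.items(), key=lambda kv: -kv[1]))

rankings = evaluate(["model-a", "model-b"], ["judge-1", "judge-2"],
                    ["Good morning"], ["nl", "de"])
```

With stubbed judges every model trivially scores 90; in production each cell of this loop is a real API call, which is why cost and latency are tracked alongside quality.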
Worth mentioning: judges never know which model produced what they score, so seeing Opus 4.6 as a judge rate a GPT 5.4 translation higher than its own is not unusual.
Evaluation Results
The tables below show a sample of our evaluation results across several frontier models and popular Open-Source models. We translate a shared set of inputs across multiple languages and score the outputs using our GEMBA-DMA judging setup. Alongside the quality scores, we also record runtime and token usage so we can evaluate the cost and latency trade-offs of each model.
Model | Reasoning | Duration | Cost | Avg |
|---|---|---|---|---|
GPT 5.2 | Low | 18.42s | $0.0172 | 95 |
GPT-5.3 Codex | High | 52.26s | $0.0472 | 95 |
GPT 5.4 | Low | 20.92s | $0.0195 | 95 |
Opus 4.6 | None | 23.45s | $0.0471 | 94 |
Gemini 3 Flash | High | 26.35s | $0.0120 | 94 |
GPT 5.4 Mini | Medium | 12.34s | $0.0083 | 94 |
Gemini 3.1 Flash Lite | None | 8.59s | $0.0021 | 93 |
Sonnet 4.6 | Low | 48.36s | $0.0294 | 92 |
GPT 5.4 Nano | Low | 6.79s | $0.0016 | 92 |
Grok 4.20 | None | 7.57s | $0.0097 | 91 |
Kimi K2.5 | None | 135.61s | $0.0169 | 90 |
Gemini 3.1 Pro | Minimal | 26.84s | $0.0412 | 90 |
GPT OSS 120B | High on Groq | 18.78s | $0.0051 | 90 |
GLM 5 | Medium on Baseten | 29.20s | $0.0046 | 90 |
Qwen3.5 397B A17B | Low | 91.02s | $0.0300 | 85 |
Haiku 4.5 | Medium | 45.07s | $0.0156 | 85 |
DeepSeek V3.2 | High | 112.71s | $0.0011 | 83 |
MiMo V2 Pro | Minimal | 95.42s | $0.0184 | 82 |
MiniMax M2.7 | High | 63.20s | $0.0059 | 79 |
Full results here. Note: duration can fluctuate between runs, so the results do not necessarily mean that, e.g., GPT 5.2 at low reasoning is slower than at medium.
The frontier models score close to each other in quality, but their cost and speed differ drastically, and scores per language also vary slightly. Surprisingly, smaller models perform on par with frontier models; e.g. GPT 5.4 Nano is very fast and cheap while maintaining quality close to our winner, GPT 5.2.
Also worth highlighting: giving models more thinking time, or increasing the reasoning level, did not always improve the translation. From our investigation, it seems models can start overthinking translations, leading to worse quality.
Structural Tests
Quality alone is not enough for a production translation system. A model might score highly on natural language quality but still fail structural constraints that matter in real websites. For that reason we run a second suite of deterministic tests that validate HTML handling, glossary preservation, and slug formatting.
These checks make it easy to spot models that look strong in translation quality but behave unpredictably when additional rules are involved. In practice, the combination of LLM-based scoring and structural validation gives us a much clearer signal of which models are safe to deploy in Framer.
Model | Reasoning | HTML | Glossary | Slug |
|---|---|---|---|---|
GPT 5.2 | Low | 100% | 67% | Pass |
GPT-5.3 Codex | High | 100% | 100% | Pass |
Gemini 3 Flash | Low | 100% | 83% | Pass |
Gemini 3 Flash | Medium | 100% | 100% | Pass |
GPT 5.4 | Medium | 100% | 100% | Pass |
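A minimal version of the HTML structure check can be written by comparing tag sequences with Python’s standard `html.parser`. This is a sketch; the production checks validate more than tag order:

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect the sequence of opening and closing tags, ignoring text."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(("open", tag))

    def handle_endtag(self, tag):
        self.tags.append(("close", tag))

def same_structure(source_html, translated_html):
    """True when the translation preserves the exact tag sequence
    of the source markup."""
    a, b = TagCollector(), TagCollector()
    a.feed(source_html)
    b.feed(translated_html)
    return a.tags == b.tags

print(same_structure("<p>Hello <strong>world</strong></p>",
                     "<p>Hallo <strong>Welt</strong></p>"))   # True
print(same_structure("<p>Hello <strong>world</strong></p>",
                     "<p>Hallo Welt</p>"))                    # False: tag dropped
```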
Testing the hard cases
One key lesson from machine translation research is that overly simple test sets can give a false sense of progress. Many benchmarks consist of clean, well-formed sentences that modern models handle almost flawlessly, making it hard to meaningfully differentiate between systems.
The WMT25 General MT shared task (Kocmi et al., 2025) addressed this problem by introducing difficulty sampling, which focuses evaluation on harder examples. Their results showed that even frontier LLMs still struggle with robustness on non-standard input and domain-specific content.
We follow a similar philosophy. Instead of idealized examples, we evaluate models on our own internal content and targeted cases that resemble real Framer projects. This also helps avoid train-test contamination, since the evaluation data is not part of public datasets.
Structural validation tests
Alongside LLM-based scoring, we maintain a set of deterministic validation tests designed to catch structural failures that could break real websites.
Check | Description |
|---|---|
HTML structure checks | Translate inputs containing HTML and verify that the markup structure remains valid. |
Glossary preservation checks | Translate content with a configured glossary and verify that required terms are preserved exactly according to the configured rules. |
URL slug checks | Translate page slugs and verify that the output matches our allowed slug format. |
Slug errors are usually binary: the output either follows the format or breaks the URL. These tests complement LLM scoring by giving us fast signals when structural regressions appear.
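A minimal slug check can be a single regular expression. The allowed format here (lowercase ASCII letters, digits, and single hyphens) is an assumption for the sketch, not necessarily the exact production rules:

```python
import re

# Assumed slug format: lowercase ASCII words joined by single hyphens,
# with no leading or trailing hyphen.
SLUG_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")

def is_valid_slug(slug):
    """True when the translated slug matches the allowed URL format."""
    return bool(SLUG_RE.fullmatch(slug))

print(is_valid_slug("pricing-and-plans"))   # True
print(is_valid_slug("Préstamos rápidos"))   # False: spaces, accents, uppercase
```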
Summary
With GEMBA-DMA, whenever a new frontier model becomes available, we include it in our evaluation set and run the full benchmark suite. If it shows improvements in translation quality, structural accuracy, speed, and cost, we consider promoting it to production.
Our multi-assessments also keep our evaluations practical, consistent, and grounded in real product constraints. Most importantly, they ensure translations remain reliable across both the structured and the unpredictable content found on real websites.
With all that, we are confident our evaluation pipeline allows fast, practical, and repeatable evals today and tomorrow.
Found this post interesting? We are hiring. Join us at framer.com/careers.