How we pick translation models for Framer


Framer translates user-generated content into more than 200 languages. Doing so reliably is a challenge when translations must work across the many types of content that appear on real websites.

Our customers translate entire sites: marketing pages, product documentation, CMS content, and structured data that includes HTML markup, URLs, glossary terms, and custom instructions. A translation model might perform perfectly on simple text but fail when these additional constraints are involved. To make sure translations behave correctly in production, we run a continuous evaluation pipeline for new models.

This article covers the existing research, how we evaluate translation quality, what “quality” means in our context, and how we decide which models ship in Framer.

Evaluating translation quality

Before choosing a model, the first question is straightforward: how do you measure translation quality?

This turns out to be a more complex problem than it sounds. Over the years, researchers have proposed many evaluation methods, each with its own trade-offs. Those trade-offs matter when evaluating modern LLM-based translators.

BLEU: the classic metric

BLEU (Bilingual Evaluation Understudy) is one of the most commonly used metrics. It evaluates how much a model’s translation overlaps with a reference translation, assigning higher scores when the wording more closely matches the reference.

For many years BLEU was the standard metric in machine translation research. However, it has clear limitations for modern language models. LLMs often produce translations that are perfectly correct while using different phrasing than the reference. Because BLEU rewards strict overlap, it can undervalue translations that express the same meaning using different wording. In practice, BLEU works best when there is a single expected translation, but natural language rarely behaves that way.
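To make BLEU’s strict-overlap behavior concrete, here is a minimal sentence-level BLEU sketch (single reference, no smoothing, whitespace tokenization), simplified relative to standard implementations like sacreBLEU:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU against a single reference."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # any zero precision collapses the geometric mean
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty discourages overly short hypotheses.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)
```

An exact match scores 1.0, while a perfectly valid paraphrase with little word overlap can score 0.0, which is exactly the limitation described above.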

LLMs as evaluators

A more recent approach is to use another language model as the evaluator. Instead of comparing outputs to a reference translation, we ask an LLM to analyze a translation and score its quality directly.

Even relatively old models align well with human expert judgement. For example, Kocmi & Federmann (2023) found that GPT-3.5 (released over three years ago), used as a reference-free evaluator (GEMBA-DA), achieved 86% system-level pairwise accuracy against human MQM (Multidimensional Quality Metrics) labels across multiple language pairs.
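In practice, such a reference-free judge is a single prompt call. The sketch below follows the GEMBA-DA direct-assessment style; `call_llm` is a hypothetical helper standing in for any chat-completion API:

```python
# Sketch of a GEMBA-DA-style, reference-free quality prompt.
GEMBA_DA_PROMPT = """Score the following translation from {src_lang} to {tgt_lang} \
on a continuous scale from 0 to 100, where 0 means "no meaning preserved" and \
100 means "perfect meaning and grammar". Respond with the score only.

{src_lang} source: "{source}"
{tgt_lang} translation: "{translation}"
Score:"""

def gemba_da_score(call_llm, source, translation, src_lang, tgt_lang):
    """Ask one judge model for a direct 0-100 assessment (no reference needed)."""
    reply = call_llm(GEMBA_DA_PROMPT.format(
        src_lang=src_lang, tgt_lang=tgt_lang,
        source=source, translation=translation))
    return float(reply.strip())
```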

Several evaluation methods follow this idea:

| Method | Description | Notes |
| --- | --- | --- |
| COMET | A neural model designed to estimate translation quality. Some variants require reference translations, while others evaluate outputs directly. | COMET-QE evaluates translations without requiring reference translations. |
| G-Eval | A prompting-based method that uses a general-purpose LLM to evaluate outputs using structured reasoning prompts. | Flexible but typically requires multiple model calls. |
| GEMBA | An LLM-based method designed specifically for machine translation evaluation. | GEMBA-MQM produces detailed quality metrics; GEMBA-DA outputs a score from 1–100. |

Translation quality is not just “good translation”

Evaluating translation quality might sound straightforward: check whether the translation is correct, and you are done. In practice, the inputs are more complex.

As mentioned at the beginning, Framer translations are not just text-to-text. Because of this diversity, we cannot benchmark every model against every tone or domain. Instead, we focus on what we can validate consistently: whether a model behaves correctly under the constraints that appear in real Framer projects.

These constraints often expose weaknesses that traditional translation benchmarks miss. A model may translate plain text perfectly but behave unpredictably once it needs to preserve glossary terms, follow URL formatting rules, or keep HTML structure intact.

For example, glossary configurations may require exact replacements while preserving case sensitivity:

Original: "My glossary term"
Translation: "Mijn woordenlijst term"

Original: "my glossary term"
Translation: "mijn woordenlijst term"

Original: "MY GLOSSARY TERM"
Translation: "MIJN WOORDENLIJST TERM"

Many models struggle to apply rules like this consistently, especially when the prompt also contains other instructions or structured content. These failures may look small, but in real projects they quickly break user expectations.
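A deterministic check for this behavior is straightforward to sketch. The helper below is illustrative, not our production check: it derives the expected target casing from the source term’s casing (all-upper, all-lower, or leading capital, as an assumed simplification) and verifies the translation contains it.

```python
import re

def apply_casing(pattern: str, term: str) -> str:
    """Recase `term` to mirror the casing style of `pattern`."""
    if pattern.isupper():
        return term.upper()
    if pattern[:1].isupper():
        return term[:1].upper() + term[1:].lower()
    return term.lower()

def check_glossary(original: str, translation: str,
                   source_term: str, target_term: str) -> bool:
    """Verify each cased occurrence of the source term yields a matching-cased target term."""
    for match in re.finditer(re.escape(source_term), original, re.IGNORECASE):
        if apply_casing(match.group(0), target_term) not in translation:
            return False
    return True
```

Running it on the Dutch examples above: the all-caps source requires `"MIJN WOORDENLIJST TERM"` in the output, so a lowercase rendering fails the check.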

Beyond translation quality, we also consider two product constraints: cost, since token pricing varies between models, and speed, since translations must return quickly in Framer’s localization UI. Waiting two minutes for a single translation in exchange for a marginal quality gain is unacceptable, especially when tools like Google Translate are nearly instant.

Continuous Evaluation: GEMBA-DMA

The challenges above show that a single assessment does not cut it. We need continuous evaluation to monitor improvements (and regressions) in quality and speed. We present a novel technique based on GEMBA-DA as our scoring method: GEMBA-DMA (GPT Estimation Metric Based Assessment, Direct Multi-Assessment).

The most significant change is that we use multiple models as judges. Second, we draw those judges from a broad range of providers, not just OpenAI. Third, we let each round’s winners score the next round’s candidates, which ensures our quality judgement evolves as frontier LLMs advance.

Our evaluation pipeline works roughly as follows: we generate translations using several candidate models, run them across a representative set of languages, and then ask multiple judge models to score the outputs. This produces stable rankings without requiring reference translations.
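The multi-judge aggregation step is essentially an average of per-judge scores. This sketch assumes each judge is a callable wrapping one judge model; the helper name is ours, not a library API:

```python
from statistics import mean

def gemba_dma_rank(translations: dict[str, str], judges: list) -> list:
    """Rank candidate models by their mean score across all judge models.

    `translations` maps candidate model name -> its translation.
    Each judge is a callable returning a 0-100 score for one translation;
    judges are never told which model produced the text.
    """
    scores = {
        candidate: mean(judge(text) for judge in judges)
        for candidate, text in translations.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```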

Worth mentioning: judges never know which model produced the translations they score, so seeing Opus 4.6 as a judge score GPT 5.4 higher than its own translation is not unusual.

Evaluation Results

The table below shows a sample of our evaluation results across several frontier models and popular open-source models. We translate a shared set of inputs across multiple languages and score the outputs using our GEMBA-DMA judging setup. Alongside the quality scores, we also record runtime and token usage so we can evaluate the cost and latency trade-offs of each model.

| Model | Reasoning | Duration | Cost | Avg |
| --- | --- | --- | --- | --- |
| GPT 5.2 | Low | 18.42s | $0.0172 | 95 |
| GPT-5.3 Codex | High | 52.26s | $0.0472 | 95 |
| GPT 5.4 | Low | 20.92s | $0.0195 | 95 |
| Opus 4.6 | None | 23.45s | $0.0471 | 94 |
| Gemini 3 Flash | High | 26.35s | $0.0120 | 94 |
| GPT 5.4 Mini | Medium | 12.34s | $0.0083 | 94 |
| Gemini 3.1 Flash Lite | None | 8.59s | $0.0021 | 93 |
| Sonnet 4.6 | Low | 48.36s | $0.0294 | 92 |
| GPT 5.4 Nano | Low | 6.79s | $0.0016 | 92 |
| Grok 4.20 | None | 7.57s | $0.0097 | 91 |
| Kimi K2.5 | None | 135.61s | $0.0169 | 90 |
| Gemini 3.1 Pro | Minimal | 26.84s | $0.0412 | 90 |
| GPT OSS 120B | High on Groq | 18.78s | $0.0051 | 90 |
| GLM 5 | Medium on Baseten | 29.20s | $0.0046 | 90 |
| Qwen3.5 397B A17B | Low | 91.02s | $0.0300 | 85 |
| Haiku 4.5 | Medium | 45.07s | $0.0156 | 85 |
| DeepSeek V3.2 | High | 112.71s | $0.0011 | 83 |
| MiMo V2 Pro | Minimal | 95.42s | $0.0184 | 82 |
| MiniMax M2.7 | High | 63.20s | $0.0059 | 79 |

Full results here. Note: duration can fluctuate between runs, so the results do not necessarily mean that, for example, GPT 5.2 at low reasoning is slower than at medium.

The frontier models score close to each other in quality. Cost and speed, however, differ drastically between them, and scores per language also vary slightly. Surprisingly, smaller models perform on par with frontier models: GPT 5.4 Nano, for example, is very fast and cheap while maintaining quality in a similar range to our winner, GPT 5.2.

Also worth highlighting: giving models more thinking time, or increasing the reasoning level, did not always improve translations. From our investigation, models can start overthinking translations, which leads to worse quality.

Structural Tests

Quality alone is not enough for a production translation system. A model might score highly on natural-language quality but still fail structural constraints that matter on real websites. For that reason we run a second suite of deterministic tests that validate HTML handling, glossary preservation, and slug formatting.

These checks make it easy to spot models that look strong in translation quality but behave unpredictably when additional rules are involved. In practice, the combination of LLM-based scoring and structural validation gives us a much clearer signal of which models are safe to deploy in Framer.

| Model | Reasoning | HTML | Glossary | Slug |
| --- | --- | --- | --- | --- |
| GPT 5.2 | Low | 100% | 67% | Pass |
| GPT-5.3 Codex | High | 100% | 100% | Pass |
| Gemini 3 Flash | Low | 100% | 83% | Pass |
| Gemini 3 Flash | Medium | 100% | 100% | Pass |
| GPT 5.4 | Medium | 100% | 100% | Pass |

Testing the hard cases

One key lesson from machine translation research is that overly simple test sets can give a false sense of progress. Many benchmarks consist of clean, well-formed sentences that modern models handle almost flawlessly, making it hard to meaningfully differentiate between systems.

The WMT25 General MT shared task (Kocmi et al., 2025) addressed this problem by introducing difficulty sampling, which focuses evaluation on harder examples. Their results showed that even frontier LLMs still struggle with robustness on non-standard input and domain-specific content.

We follow a similar philosophy. Instead of idealized examples, we evaluate models on our own internal content and targeted cases that resemble real Framer projects. This also helps avoid train-test contamination, since the evaluation data is not part of public datasets.

Structural validation tests

Alongside LLM-based scoring, we maintain a set of deterministic validation tests designed to catch structural failures that could break real websites.

| Check | Description |
| --- | --- |
| HTML structure checks | Translate inputs containing HTML and verify that the markup structure remains valid. |
| Glossary preservation checks | Translate content with a configured glossary and verify that required terms are preserved exactly according to the configured rules. |
| URL slug checks | Translate page slugs and verify that the output matches our allowed slug format. |

Slug errors are usually binary: the output either follows the format or breaks the URL. These tests complement LLM scoring by giving us fast signals when structural regressions appear.
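The slug check, for example, reduces to a single regular expression. The pattern below is illustrative; the exact allowed format in Framer may differ:

```python
import re

# Illustrative slug rule: lowercase alphanumeric segments joined by single hyphens.
SLUG_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def is_valid_slug(slug: str) -> bool:
    """True if the translated slug matches the allowed URL format."""
    return SLUG_PATTERN.fullmatch(slug) is not None
```

Under this rule, a translated slug like `over-ons` passes, while `Over Ons` or `over--ons` would break the URL and fail immediately.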

Summary

With GEMBA-DMA, whenever a new frontier model becomes available, we include it in our evaluation set and run the full benchmark suite. If it shows improvements in translation quality, structural accuracy, speed, and cost, we consider promoting it to production.

Our multi-assessment approach also keeps our evaluations practical, consistent, and grounded in real product constraints. Most importantly, it ensures translations remain reliable across both the structured and unpredictable content found on real websites.

With all that, we are confident our evaluation pipeline allows fast, practical, and repeatable evals today and tomorrow.


Found this post interesting? We are hiring. Join us at framer.com/careers.

