QIMMA: The Arabic LLM Leaderboard That Actually Checks Its Homework

If you’ve been following Arabic LLM evaluation, you’ve probably noticed the same thing I have: there are more benchmarks and leaderboards popping up every month, but nobody seems to be asking the obvious question — are we actually measuring what we think we’re measuring?

A team from TII (Technology Innovation Institute) built QIMMA (قمّة, Arabic for “summit”) to answer that. And what they found is honestly a bit embarrassing for the field: even the most popular Arabic benchmarks have systematic quality issues that quietly corrupt evaluation results.

The Problem: Arabic NLP Evaluation Is Kind of a Mess

Arabic is spoken by over 400 million people across a huge range of dialects and cultural contexts. You’d think the evaluation infrastructure would reflect that. It doesn’t.

Translation issues are everywhere. A lot of Arabic benchmarks are just translated from English, which introduces distributional shifts. Questions that make perfect sense in English come out awkward or culturally weird in Arabic. You end up measuring how well a model can parse machine-translated gibberish, not how well it understands Arabic.

Native Arabic benchmarks aren’t much better. They get released without rigorous quality checks. Annotation inconsistencies, wrong gold answers, encoding errors, cultural bias in ground-truth labels — all documented in established resources.

Reproducibility is a joke. Evaluation scripts and per-sample outputs rarely see the light of day. Good luck auditing results or building on someone else’s work.

Coverage is fragmented. Existing leaderboards cover isolated tasks and narrow domains. Want to evaluate a model holistically? You’re stitching together half a dozen different platforms.

Here’s where QIMMA sits relative to the existing landscape:

| Leaderboard | Open Source | Native Arabic | Quality Validation | Coding Eval | Public Outputs |
|---|---|---|---|---|---|
| OALL v1 | ✅ | Mixed | ❌ | ❌ | ✅ |
| OALL v2 | ✅ | Mostly | ❌ | ❌ | ✅ |
| BALSAM | Partial | 50% | ❌ | ❌ | ❌ |
| AraGen | ✅ | 100% | ✅ | ❌ | ❌ |
| SILMA ABL | ✅ | 100% | ✅ | ❌ | ✅ |
| ILMAAM | Partial | 100% | ✅ | ❌ | ❌ |
| HELM Arabic | ✅ | Mixed | ❌ | ❌ | ✅ |
| ⛰ QIMMA | ✅ | 99% | ✅ | ✅ | ✅ |

QIMMA is the only platform that checks all five boxes. That alone makes it worth paying attention to.

What’s Actually in QIMMA

The team consolidated 109 subsets from 14 source benchmarks into a unified evaluation suite of over 52,000 samples, spanning 7 domains:

Cultural: AraDiCE-Culture, ArabCulture, PalmX (MCQ)
STEM: ArabicMMLU, GAT, 3LM STEM (MCQ)
Legal: ArabLegalQA, MizanQA (MCQ, QA)
Medical: MedArabiQ, MedAraBench (MCQ, QA)
Safety: AraTrust (MCQ)
Poetry & Literature: FannOrFlop (QA)
Coding: 3LM HumanEval+, 3LM MBPP+ (Code)
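
If you want that breakdown in a form you can script against, here is a minimal sketch of the domain map as a plain Python dictionary. The benchmark names and task formats come straight from the list above; the variable and function names (`QIMMA_DOMAINS`, `QIMMA_FORMATS`, `count_benchmarks`) are my own and are not part of any QIMMA code.

```python
# Domain -> source benchmarks, copied from the list in this post.
# The names of these variables are illustrative, not QIMMA's own.
QIMMA_DOMAINS = {
    "Cultural": ["AraDiCE-Culture", "ArabCulture", "PalmX"],
    "STEM": ["ArabicMMLU", "GAT", "3LM STEM"],
    "Legal": ["ArabLegalQA", "MizanQA"],
    "Medical": ["MedArabiQ", "MedAraBench"],
    "Safety": ["AraTrust"],
    "Poetry & Literature": ["FannOrFlop"],
    "Coding": ["3LM HumanEval+", "3LM MBPP+"],
}

# Domain -> task formats, also as listed in this post.
QIMMA_FORMATS = {
    "Cultural": ["MCQ"],
    "STEM": ["MCQ"],
    "Legal": ["MCQ", "QA"],
    "Medical": ["MCQ", "QA"],
    "Safety": ["MCQ"],
    "Poetry & Literature": ["QA"],
    "Coding": ["Code"],
}


def count_benchmarks(domains: dict) -> int:
    """Total number of source benchmarks across all domains."""
    return sum(len(benchmarks) for benchmarks in domains.values())


if __name__ == "__main__":
    print(len(QIMMA_DOMAINS), "domains,", count_benchmarks(QIMMA_DOMAINS), "benchmarks")
```

Counting the entries is a quick sanity check that the list really does add up to 7 domains and 14 source benchmarks.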

A few things worth highlighting:

99% native Arabic content. The only exception is code evaluation, which is language-agnostic by nature.
First Arabic leaderboard with code evaluation. They adapted HumanEval+ and MBPP+ with Arabic-language problem statements. This is long overdue.
Real diversity in domains. Education, governance, healthcare, creative expression, software development — not just the usual MMLU-and-done approach.

The Quality Validation Pipeline: This Is Where It Gets Interesting

Before running a single model, QIMMA applies a multi-stage validation pipeline to every sample in every benchmark. This is the methodological core of the whole thing.

Stage 1: Multi-Model Automated Assessment

Each sample gets independently evaluated by two LLMs with strong Arabic capability but different training data compositions:

Qwen3-235B-A22B-Instruct
DeepSeek-V3-671B

The idea is that their combined judgment is more robust than either alone. Each model scores a sample against a 10-point rubric with binary scores per criterion, so a sample’s score is the number of criteria it passes. A model flags a sample when it scores it below 7/10. A sample flagged by both models is eliminated; if only one model flags it, it goes to human review.
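
To make the decision rule concrete, here is a rough sketch of the triage step. It assumes one reading of the rule: a sample flagged by both judges is dropped, and a sample flagged by exactly one goes to human review. The names (`Verdict`, `rubric_score`, `triage`) are hypothetical, and the rubric criteria themselves are not modeled, since the post does not list them. This is not the team's actual code.

```python
from enum import Enum
from typing import List

NUM_CRITERIA = 10    # 10-point rubric, one binary score per criterion
PASS_THRESHOLD = 7   # a judge flags the sample when its score is below 7/10


class Verdict(Enum):
    KEEP = "keep"
    HUMAN_REVIEW = "human_review"
    ELIMINATE = "eliminate"


def rubric_score(criteria: List[bool]) -> int:
    """Sum the binary per-criterion scores from one judge model."""
    if len(criteria) != NUM_CRITERIA:
        raise ValueError(f"expected {NUM_CRITERIA} criteria, got {len(criteria)}")
    return sum(criteria)


def triage(judge_a: List[bool], judge_b: List[bool]) -> Verdict:
    """Combine two judges' rubric results into a single verdict.

    Assumed rule: both judges flag -> eliminate; exactly one flags ->
    human review; neither flags -> keep.
    """
    flags = [rubric_score(c) < PASS_THRESHOLD for c in (judge_a, judge_b)]
    if all(flags):
        return Verdict.ELIMINATE
    if any(flags):
        return Verdict.HUMAN_REVIEW
    return Verdict.KEEP


if __name__ == "__main__":
    clean = [True] * 10
    shaky = [True] * 6 + [False] * 4   # scores 6/10, so it is flagged
    print(triage(clean, clean))
    print(triage(shaky, clean))
    print(triage(shaky, shaky))
```

The useful property of this setup is that disagreement between the judges is treated as a signal in its own right: it routes the sample to a person rather than silently keeping or dropping it.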

Stage 2: Human Annotation and Review

Flagged samples get reviewed by native Arabic speakers with cultural and dialectal familiarity. Human annotators make final calls on cultural context, regional variation, dialectal nuance, and subjective interpretation. For culturally sensitive content, multiple perspectives are considered — because “correctness” genuinely varies across Arab regions.

What They Found: Systematic Quality Problems

The pipeline revealed recurring quality issues across benchmarks. Not isolated errors — systematic problems. Translation artifacts that made questions nonsensical in Arabic. Culturally inappropriate examples that would confuse any native speaker. Encoding issues that broke text rendering. Annotation errors in ground-truth labels.

The team reports that a significant percentage of samples across multiple benchmarks failed validation, which is more than I expected. I’ve been critical of benchmark quality in NLP for years, but seeing the numbers laid out like this is sobering. The exact figures are in the paper, but the pattern is clear: if you don’t validate, part of what you’re measuring is noise.

What the Rankings Look Like After Cleanup

Once you filter out the garbage samples, the model rankings shift. Some models that looked good on uncleaned benchmarks drop significantly. Others that were underestimated rise. The full leaderboard is available on Hugging Face, and I’d encourage anyone working on Arabic LLMs to check it out.

A few observations:

Models trained on diverse Arabic data (not just MSA) tend to perform better after cleanup. Makes sense — if you’re removing translation artifacts, models that actually understand natural Arabic benefit.
Smaller models with focused Arabic training can outperform larger general-purpose models on specific domains like poetry and law. This isn’t surprising, but it’s good to see it quantified.
Code evaluation reveals a different capability axis entirely. Some models that excel at Arabic text tasks struggle with Arabic-prompted coding, and vice versa.
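
Why would rankings move at all? Here is a toy illustration of the mechanism. The data is entirely made up (two fictional models, five fictional samples) and the function names are mine; none of it reflects real QIMMA results. It only shows how scoring against a filtered sample set can reorder models.

```python
from typing import Dict, List, Optional, Set


def accuracy(per_sample: Dict[str, bool], keep: Optional[Set[str]] = None) -> float:
    """Fraction of correct answers, optionally restricted to validated sample ids."""
    ids = [i for i in per_sample if keep is None or i in keep]
    if not ids:
        raise ValueError("no samples left to score")
    return sum(per_sample[i] for i in ids) / len(ids)


def rank(models: Dict[str, Dict[str, bool]], keep: Optional[Set[str]] = None) -> List[str]:
    """Model names sorted by accuracy, best first (ties broken by name)."""
    scores = {name: accuracy(results, keep) for name, results in models.items()}
    return sorted(scores, key=lambda name: (-scores[name], name))


# Invented per-sample correctness. Suppose s4 and s5 have wrong gold
# labels, so "correct" on them really means agreeing with a bad label.
TOY_RESULTS = {
    "model_a": {"s1": True, "s2": True, "s3": True, "s4": False, "s5": False},
    "model_b": {"s1": True, "s2": False, "s3": True, "s4": True, "s5": True},
}
VALIDATED = {"s1", "s2", "s3"}  # samples that survived validation in this toy setup

if __name__ == "__main__":
    print("raw:    ", rank(TOY_RESULTS))
    print("cleaned:", rank(TOY_RESULTS, keep=VALIDATED))
```

In this toy case `model_b` leads on the raw set only because it matches the two bad labels; once those samples are excluded, `model_a` comes out ahead.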

Why This Matters

QIMMA isn’t just another leaderboard. It’s a methodological intervention. By forcing quality validation to happen before evaluation, it raises the bar for what “Arabic LLM evaluation” means.

The team has open-sourced everything — the pipeline, the validation scripts, the per-sample outputs. That means anyone can audit their results, reproduce their findings, or build on their work. This is how evaluation should work.

If you’re building or deploying Arabic LLMs, QIMMA gives you a clearer picture of actual capability. If you’re creating benchmarks, it gives you a template for quality control. And if you’re just following the field, it’s a reminder that not all leaderboards are created equal.

Check it out on Hugging Face: QIMMA Leaderboard