Training mRNA Language Models Across 25 Species for $165

You ever wonder how much it costs to train a language model that speaks mRNA? Turns out, about $165 in GPU time. That’s what OpenMed spent to build a multi-species codon optimization system, and they documented the whole messy process.

This is Part II of their protein AI pipeline series. Part I was the survey: AlphaFold, ESMFold, ProteinMPNN, the usual suspects. This time they actually built something: a pipeline that takes a protein concept and spits out a synthesis-ready DNA sequence. Three stages: predict the 3D structure, design the amino acid sequence, optimize the codons for expression. The first two use existing tools (ESMFold and ProteinMPNN). The third is where they went deep.

Why Codon Optimization Matters

The genetic code is degenerate: most amino acids are encoded by multiple codons, but organisms have strong preferences among them. The Pfizer-BioNTech COVID vaccine was codon-optimized for human cells. Get the codons wrong and your protein can express at as little as 1% of what it could. Put the other way, getting them right can mean up to a 100x improvement.

Traditional approaches use hand-crafted frequency tables. OpenMed wanted to train a model that learns these preferences directly from natural coding sequences. That means treating codons as tokens and training a masked language model, just like BERT or RoBERTa, but on DNA instead of English.
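To make "codons as tokens" concrete, here is a minimal sketch of the tokenization and masking step. The vocabulary layout, the special tokens, and the 15% mask rate are assumptions borrowed from common BERT practice, not details confirmed by OpenMed, and the masking is simplified (real BERT-style masking also swaps in random tokens some of the time).

```python
import itertools
import random

# Hypothetical vocabulary: four special tokens followed by all 64 codons.
SPECIAL_TOKENS = ["[PAD]", "[CLS]", "[SEP]", "[MASK]"]
CODONS = ["".join(c) for c in itertools.product("ACGT", repeat=3)]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + CODONS)}


def tokenize_cds(cds):
    """Split a coding sequence into codons and map each one to a token id."""
    cds = cds.upper().replace("U", "T")  # accept mRNA or DNA spelling
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return [VOCAB["[CLS]"]] + [VOCAB[c] for c in codons] + [VOCAB["[SEP]"]]


def mask_tokens(ids, rate=0.15, seed=0):
    """Replace a random subset of codon tokens with [MASK].

    Returns (masked_ids, labels). Labels are -100 at positions the model
    is not asked to predict, following the usual MLM convention.
    """
    rng = random.Random(seed)
    masked, labels = list(ids), [-100] * len(ids)
    for i in range(1, len(ids) - 1):  # leave [CLS] and [SEP] alone
        if rng.random() < rate:
            labels[i] = ids[i]
            masked[i] = VOCAB["[MASK]"]
    return masked, labels


ids = tokenize_cds("AUGGCUCUGUAA")
masked_ids, labels = mask_tokens(ids)
```

The model is then trained to recover the original codon at each masked position, which is how it picks up species-typical codon preferences from natural sequences.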

The Architecture Shootout

They compared five architectures:

  • CodonBERT (6M params): A tiny BERT baseline based on Sanofi’s published work.
  • ModernBERT-base (90M): Latest efficiency innovations from NLP, with RoPE and long context support.
  • CodonRoBERTa-base (92M): Same architecture family as ESM-2, 12 layers.
  • CodonRoBERTa-large (312M): 24 layers, to test whether more parameters help.
  • CodonRoBERTa-large-v2 (312M): Same architecture, better hyperparameters.

The hypothesis was that RoBERTa, which already works well for amino acid sequences, would also work for codons. Turns out that was right.

Results That Surprised Me

CodonRoBERTa-large-v2 hit a perplexity of 4.10 and a Spearman CAI correlation of 0.40. That is a stronger showing than I expected from a 312M-parameter model trained on a relatively small dataset (250k coding sequences). ModernBERT-base, despite all its modern innovations, scored worse.
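For intuition on the perplexity figure, here is a small sketch. It assumes the reported number is the usual exponential of the mean negative log-likelihood on masked codons; the post does not spell out the exact definition.

```python
import math


def perplexity(true_token_probs):
    """Perplexity = exp(mean negative log-probability of the correct tokens)."""
    nll = [-math.log(p) for p in true_token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that guesses uniformly over all 64 codons scores a perplexity of 64.
uniform_baseline = perplexity([1 / 64] * 100)

# A perplexity of 4.10 corresponds to giving the correct codon an average
# (geometric-mean) probability of about 1 / 4.10, roughly 0.24.
avg_prob_at_4_10 = 1 / 4.10
```

Under that reading, 4.10 means the model is on average about as uncertain as a uniform choice among four codons, which is in the range of a typical synonymous-codon family.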

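The CAI (Codon Adaptation Index) correlation is the important one. CAI scores how closely a sequence's codon usage matches that of a reference gene set for the organism, typically highly expressed genes, so it is a proxy for expression rather than a direct measurement. The correlation measures how well the model's predicted codon preferences track that score. 0.40 is decent for a first attempt, but not production-ready. They acknowledge this honestly.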
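For reference, CAI itself is simple to compute. The sketch below follows the standard definition: each codon gets a weight equal to its count divided by the count of the most-used synonymous codon in a reference set, and the CAI of a sequence is the geometric mean of those weights. The reference counts are made-up toy numbers, and this is not OpenMed's evaluation code.

```python
import math
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def relative_adaptiveness(ref_counts):
    """w(codon) = count / highest count among its synonymous codons."""
    best = defaultdict(int)
    for codon, n in ref_counts.items():
        aa = CODON_TO_AA[codon]
        best[aa] = max(best[aa], n)
    return {c: n / best[CODON_TO_AA[c]] for c, n in ref_counts.items() if n > 0}


def cai(cds, weights):
    """Geometric mean of codon weights, skipping Met, Trp, and stop codons."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    logs = [
        math.log(weights[c])
        for c in codons
        if c in weights and CODON_TO_AA[c] not in "MW*"
    ]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Toy reference counts (hypothetical): two Leu codons and two Ala codons.
toy_weights = relative_adaptiveness({"CTG": 40, "CTT": 10, "GCC": 30, "GCT": 15})
preferred = cai("CTGGCC", toy_weights)  # uses the top codon for both residues
rare = cai("CTTGCT", toy_weights)       # uses the less common codon for both
```

A sequence built entirely from each amino acid's most frequent codon scores 1.0, and the score drops as rarer codons are substituted in.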

What surprised me most was that ModernBERT underperformed. I’ve been following ModernBERT since its release, and its efficiency gains are real. But for codon modeling, the simpler RoBERTa architecture won. Maybe the codon token space (64 tokens) doesn’t benefit from ModernBERT’s attention innovations. Or maybe the training data is too small for ModernBERT to shine.

Scaling to 25 Species for $165

This is the part that got my attention. They trained four production models across 25 organisms in 55 GPU-hours. At current cloud pricing, that’s about $165. For a multi-species mRNA language model suite.

They built a species-conditioned system where you specify the target organism and the model adjusts its codon preferences accordingly. As far as I know, no other open-source project offers this. The code is available, the weights are available, and you can run it yourself.
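The post does not describe how the species conditioning is implemented. One common approach, shown here purely as an assumption, is to prepend a species tag token to the codon sequence so a single model can learn per-organism preferences. The species names and the tag format below are hypothetical.

```python
# Hypothetical subset of supported organisms; the real list is not given here.
KNOWN_SPECIES = {"homo_sapiens", "escherichia_coli", "saccharomyces_cerevisiae"}


def build_conditioned_input(species, cds):
    """Return a token list: one species tag followed by the codons of `cds`."""
    if species not in KNOWN_SPECIES:
        raise ValueError(f"unknown species: {species}")
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return [f"<{species}>"] + codons


tokens = build_conditioned_input("escherichia_coli", "ATGGCCCTG")
# tokens == ["<escherichia_coli>", "ATG", "GCC", "CTG"]
```

With this kind of setup, the same masked codon can receive different predictions depending on which tag leads the sequence.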

The dataset grew from 250k coding sequences in the initial experiments to 381k across 25 species. That’s still small by NLP standards, but for codon modeling it’s substantial.
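The headline numbers are easy to sanity-check. The snippet below only divides the totals quoted above; splitting the cost evenly across models or species is my simplification, since the post gives totals rather than a breakdown.

```python
GPU_HOURS = 55
TOTAL_COST_USD = 165
MODELS = 4
SPECIES = 25
SEQUENCES = 381_000

cost_per_gpu_hour = TOTAL_COST_USD / GPU_HOURS  # 3.0 USD per GPU-hour
cost_per_model = TOTAL_COST_USD / MODELS        # 41.25 USD if split evenly
cost_per_species = TOTAL_COST_USD / SPECIES     # 6.6 USD if split evenly
seqs_per_species = SEQUENCES / SPECIES          # 15240.0 sequences on average

print(f"${cost_per_gpu_hour:.2f} per GPU-hour")
print(f"${cost_per_model:.2f} per model, ${cost_per_species:.2f} per species")
print(f"{seqs_per_species:,.0f} sequences per species on average")
```

So the implied rate is $3 per GPU-hour, and the dataset averages roughly 15k sequences per organism.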

The Pipeline in Practice

Here’s how it works end-to-end:

  1. Protein Folding: ESMFold predicts the 3D structure from an amino acid sequence. Average pTM score was 0.79 across 30 test chains. That’s not AlphaFold territory, but it’s fast and free.
  2. Sequence Design: ProteinMPNN takes the predicted structure and designs amino acid sequences that should fold into it. They got 42% sequence recovery on the scaffold 7K00, which means 42% of the designed amino acids matched the original. That’s reasonable for a design tool.
  3. mRNA Optimization: The trained CodonRoBERTa model takes the amino acid sequence and outputs codon-optimized DNA for the target species. This is where the new models come in.
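The three stages above chain together roughly as follows. This is only a structural sketch: `fold`, `design`, and `optimize` are hypothetical callables standing in for ESMFold, ProteinMPNN, and the codon model, and none of the names reflect those tools' real APIs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    structure: str         # e.g. PDB-format text from the folding stage
    designed_protein: str  # amino acid sequence from the design stage
    dna: str               # codon-optimized coding sequence


def run_pipeline(
    protein: str,
    species: str,
    fold: Callable[[str], str],
    design: Callable[[str], str],
    optimize: Callable[[str, str], str],
) -> PipelineResult:
    """Run fold -> design -> codon optimization with pluggable stage functions."""
    structure = fold(protein)            # stage 1: sequence to structure
    designed = design(structure)         # stage 2: structure to new sequence
    dna = optimize(designed, species)    # stage 3: sequence to DNA for a species
    if len(dna) % 3 != 0:
        raise ValueError("optimizer returned a sequence that is not whole codons")
    return PipelineResult(structure, designed, dna)


# Stub stages, just to show the data flow.
result = run_pipeline(
    "MA",
    "homo_sapiens",
    fold=lambda seq: f"STRUCTURE({seq})",
    design=lambda structure: "MA",
    optimize=lambda seq, sp: "ATGGCC",
)
```

Keeping the stages as plain functions makes it easy to swap any one of them, for example replacing the codon model with a frequency-table baseline.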

What I Would Do Differently

OpenMed is honest about the limitations. The CAI correlation of 0.40 is okay but not great. They trained on coding sequences from natural genomes, which means the model learns what organisms do use, not necessarily what optimizes expression. There’s a gap between natural codon bias and optimal expression.

I would have liked to see experiments with larger datasets. 381k sequences across 25 species averages to about 15k per species. That’s thin. Human codon usage is well-studied, but for something like zebrafish or yeast, 15k coding sequences might not capture the full distribution.

Also, they didn’t test against the obvious baseline: simple frequency tables. I’d want to know how much better the language model is than just using the most common codon for each amino acid in the target species. If the improvement is marginal, the complexity might not be worth it.
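The baseline I have in mind is only a few lines of code. This sketch counts codons in a set of reference coding sequences, picks the most frequent codon for each amino acid, and back-translates a protein with that table. The reference sequences are toy examples, not real genome data.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def best_codon_table(reference_cds):
    """Map each amino acid to its most frequent codon in the reference set."""
    counts = Counter()
    for cds in reference_cds:
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TO_AA:  # ignore codons with ambiguous bases
                counts[codon] += 1
    table = {}
    for codon, _ in counts.most_common():  # highest count first
        table.setdefault(CODON_TO_AA[codon], codon)
    return table


def back_translate(protein, table):
    """Encode a protein using the single most frequent codon per residue."""
    return "".join(table[aa] for aa in protein)


toy_reference = ["ATGCTGCTGCTTGCC", "ATGCTGGCCGCT"]
table = best_codon_table(toy_reference)
dna = back_translate("MLA", table)  # "ATGCTGGCC"
```

If the language model cannot clearly beat this on a held-out metric, the extra training cost is hard to justify.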

Where This Stands

This is a solid proof of concept. The pipeline works, the models are open-source, and the cost is negligible. For anyone doing protein engineering in academia or small biotech, this is immediately useful.

But it’s not a replacement for experimental validation. Codon optimization is still an empirical science. You can predict preferences, but you still need to test expression in the lab. The model gives you a starting point, not a guarantee.

What OpenMed has done is lower the barrier to entry. You don’t need a $100k GPU cluster to train mRNA language models. You don’t need to be at a big pharma company. You can do it with a few dozen hours of rented GPU time, for the cost of a nice dinner.

That’s the kind of progress that actually matters.
