Running Transformers.js in a Chrome Extension: What I Learned Building With Gemma 4

I recently built a Chrome extension that runs Gemma 4 E2B locally using Transformers.js. It’s a side-panel assistant that can extract page content, answer questions, and highlight elements without phoning home to any server.

The full source is on GitHub if you want to poke around. But the interesting part isn’t the extension itself — it’s the architectural decisions you have to make when running AI models inside a Manifest V3 service worker.

The Three-Runtime Split

Manifest V3 forces you into a specific architecture: background service worker, side panel (or popup), and content scripts. You can fight it, or you can lean in.

I leaned in. The background service worker becomes your control plane. It holds the model instances, manages the agent lifecycle, and coordinates everything. The side panel is just a thin chat UI — it sends events, receives updates, and renders. The content script is even simpler: it extracts DOM content and applies highlights on command.

This split avoids duplicate model loads. If you loaded a 2GB model in the side panel and another in a popup, you’d hit memory limits fast. Keeping everything in the background means one set of pipelines, one cache, one inference engine shared across all tabs and sessions.

The Messaging Dance

Once you separate runtimes, messaging becomes your backbone. Every interaction is an event flowing through Chrome’s runtime messaging API.

The pattern is straightforward: the side panel sends a typed action (like AGENT_GENERATE_TEXT), the background processes it, runs inference, and pushes updates back. The content script only responds to direct requests from the background — it never talks to the side panel directly.

I defined all message types in a shared types file. This is worth the overhead. Without typed messages, a misspelled action string fails quietly: the message is sent, no handler matches it, and you spend hours working out why nothing happened.
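
Here is a minimal sketch of what a shared types file along these lines might look like. It is not the extension's actual code: only AGENT_GENERATE_TEXT is a name from this post, and every other message name, field, and function below is made up for illustration.

```typescript
// Hypothetical message shapes. Only AGENT_GENERATE_TEXT is named in the post;
// the remaining variants and all field names are illustrative assumptions.
type PanelToBackground =
  | { type: "AGENT_GENERATE_TEXT"; prompt: string }
  | { type: "AGENT_CANCEL" };

type BackgroundToPanel =
  | { type: "AGENT_TOKEN"; text: string }
  | { type: "AGENT_DONE" }
  | { type: "AGENT_ERROR"; message: string };

// Exhaustive router: if a new variant is added to PanelToBackground and not
// handled here, the `never` assignment below stops compiling.
function routePanelMessage(msg: PanelToBackground): BackgroundToPanel {
  switch (msg.type) {
    case "AGENT_GENERATE_TEXT":
      // Real code would run inference here; this stub just echoes the prompt.
      return { type: "AGENT_TOKEN", text: `echo: ${msg.prompt}` };
    case "AGENT_CANCEL":
      return { type: "AGENT_DONE" };
    default: {
      const unhandled: never = msg;
      throw new Error(`Unhandled message: ${JSON.stringify(unhandled)}`);
    }
  }
}

// Runtime guard for values arriving over the wire, where TypeScript types
// have been erased and anything could show up.
function isPanelMessage(value: unknown): value is PanelToBackground {
  if (typeof value !== "object" || value === null) return false;
  const v = value as { type?: unknown; prompt?: unknown };
  if (v.type === "AGENT_CANCEL") return true;
  return v.type === "AGENT_GENERATE_TEXT" && typeof v.prompt === "string";
}
```

In the background script, a listener registered with chrome.runtime.onMessage could run each incoming value through isPanelMessage first and hand it to routePanelMessage only if it passes. A misspelled type then gets rejected at the guard, or caught at compile time, instead of disappearing.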

Model Loading and the Service Worker Problem

Here’s where things get tricky. Service workers can be suspended at any time. Chrome terminates an extension service worker after roughly 30 seconds of inactivity, and when it wakes back up, everything it held in memory is gone, including your loaded models.

This means you can’t treat model instances as long-lived singletons. You need to check what’s cached, reinitialize if needed, and design your state as recoverable. The extension I built checks for cached models on startup, estimates remaining download size, and emits progress events back to the UI.
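
One way to make model state recoverable is to hide every model behind a lazy getter instead of a top-level singleton. The sketch below is a generic helper, not the extension's code, and lazyResource is a name I made up. The loader you pass in is whatever actually builds the model, for example the call that creates a Transformers.js pipeline.

```typescript
// Hypothetical helper: wraps any async loader so that concurrent callers share
// one in-flight load, and a failed load is forgotten so the next call retries.
function lazyResource<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | null = null;
  return () => {
    if (pending === null) {
      pending = load().catch((err) => {
        pending = null; // drop the failed attempt so a later call can retry
        throw err;
      });
    }
    return pending;
  };
}
```

Because pending lives in module scope, a service worker restart resets it to null along with everything else, and the first request after wake-up triggers a fresh load. Every handler calls the getter and never assumes the model is already there.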

Model caching under Manifest V3 actually works in your favor here. Artifacts are stored under the extension origin (chrome-extension://) rather than per-website, so one download serves all tabs. But the initialization code needs to handle the case where the service worker restarts mid-session.

Two Models, One Purpose

I’m running two models: Gemma 4 for text generation and reasoning, and all-MiniLM-L6-v2 for embeddings. The split is intentional — you don’t need a 2B parameter model to compute vector similarity for “find pages about machine learning.”
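
To show how little work the similarity side involves, here is a sketch of cosine similarity and a simple ranking over precomputed embeddings. The function names and the item shape are assumptions for illustration. In practice the vectors would come from the embedding model (all-MiniLM-L6-v2 produces 384-dimensional vectors), but short ones are enough to show the idea.

```typescript
// Cosine similarity between two equal-length vectors.
// Returns 0 when either vector has zero length, to avoid dividing by zero.
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Hypothetical item shape: an id plus its precomputed embedding.
interface EmbeddedItem {
  id: string;
  vector: readonly number[];
}

// Scores every item against the query and returns them best match first.
function rankBySimilarity(
  query: readonly number[],
  items: readonly EmbeddedItem[],
): { id: string; score: number }[] {
  return items
    .map((item) => ({ id: item.id, score: cosineSimilarity(query, item.vector) }))
    .sort((x, y) => y.score - x.score);
}
```

A linear scan like this is fine for the number of pages one person's browsing session is likely to produce.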

Both models run in the background service worker. The embedding pipeline uses ONNX with fp32 precision, while Gemma runs in q4f16 quantized format to keep memory reasonable. Even so, you’re looking at several hundred MB of model data loaded into memory. This isn’t something you want running on a Chromebook with 4GB RAM.

What I’d Do Differently

If I were starting over, I’d spend more time on the download UX. The initial model download can take minutes on a decent connection, and the service worker can’t show a progress bar on its own. You have to pipe progress events through messaging to the side panel, which then renders a progress indicator. It works, but it’s fragile.
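
Part of the fragility is that a model download is really several file downloads, each reporting its own progress. A sketch of an aggregator that folds those into a single percentage for the side panel might look like this. The event shape (file, loaded, total) is an assumption for illustration, not a documented contract, so check what your loader actually emits.

```typescript
// Hypothetical per-file progress event; the field names are assumptions.
interface FileProgress {
  file: string;
  loaded: number; // bytes received so far for this file
  total: number; // expected size of this file in bytes
}

// Tracks the latest state of each file and reports one overall percentage.
class DownloadProgress {
  private readonly files = new Map<string, { loaded: number; total: number }>();

  // Records the event and returns overall progress as an integer from 0 to 100.
  update(event: FileProgress): number {
    this.files.set(event.file, { loaded: event.loaded, total: event.total });
    let loaded = 0;
    let total = 0;
    this.files.forEach((f) => {
      loaded += f.loaded;
      total += f.total;
    });
    return total === 0 ? 0 : Math.round((loaded / total) * 100);
  }
}
```

One caveat: the percentage can drop when a new file shows up, because the known total grows. If that looks bad in the UI, the side panel can display the highest value it has seen so far.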

I’d also reconsider the side panel vs popup decision. Side panels are persistent, which is nice for chat, but they eat screen real estate. A popup that opens on demand might be better for simpler use cases.

The Bottom Line

Running Transformers.js in a Chrome extension is absolutely possible, but Manifest V3 adds constraints you can’t ignore. The architecture matters more than the model choice. Get the runtime split right, type your messages strictly, and plan for service worker restarts. Do that, and you’ve got a local AI assistant that works offline once the models are cached, respects privacy, and doesn’t need a server.

I’m curious to see what other people build with this pattern. The pieces are all there — it’s just a matter of wiring them together correctly.