Google just dropped Gemini 3.1 Flash TTS, and I’ve been poking at it for a couple of days. This is the latest text-to-speech model from the Gemini team, and it’s a bigger deal than the usual incremental update. The headline feature is something they call “audio tags” — basically, you can embed natural language commands directly into your text to control how the AI speaks. Want a whisper? A dramatic pause? A slightly sarcastic tone? You just write it in.
I’ve been testing this in Google AI Studio, and it’s surprisingly intuitive. You type something like [whisper] This is the secret part [/whisper] and the model actually delivers. It’s not perfect — sometimes the pacing feels a bit off, and it can overdo the drama if you’re not careful — but it’s a huge step up from the usual flat, robotic delivery. The model also supports native multi-speaker dialogue, so you can have two characters in a conversation without swapping APIs or stitching audio files together.
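To make the tag-plus-dialogue idea concrete, here's a minimal sketch of how a tagged, two-speaker script might be assembled before sending it to the model. The bracketed `[tag] ... [/tag]` syntax follows the style described above; the helper functions and speaker-name prefix convention are my own illustration, not part of any Google SDK.

```python
# Sketch: assembling an audio-tagged, multi-speaker script.
# The helpers below are hypothetical utilities, not Google API calls.

def tag(style: str, text: str) -> str:
    """Wrap text in a paired audio tag, e.g. [whisper] ... [/whisper]."""
    return f"[{style}] {text} [/{style}]"

def line(speaker: str, text: str) -> str:
    """Prefix a dialogue line with its speaker name for multi-speaker TTS."""
    return f"{speaker}: {text}"

script = "\n".join([
    line("Ana", "Did you hear the news?"),
    line("Ben", tag("whisper", "Keep your voice down.")),
    line("Ana", tag("dramatic", "They shipped it. Today.")),
])

print(script)
```

Keeping the script as plain text like this means you can version it, diff it, and review it like any other copy, which is part of what makes the audio-tag approach appealing.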
Quality that actually competes
Google claims this is their most natural and expressive model yet, and the numbers back it up. On the Artificial Analysis TTS leaderboard, which runs blind human preference tests, Gemini 3.1 Flash TTS scored an Elo of 1,211. That’s higher than I expected — it puts it in the same league as some of the best proprietary models out there. Artificial Analysis also placed it in their “most attractive quadrant” for balancing high-quality speech with low cost, which is where things get interesting for developers.
The model supports over 70 languages, which is more than most competitors, and it uses SynthID watermarking on every generated audio clip. That’s Google’s invisible digital watermark that helps identify AI-generated content. It’s not foolproof — no watermarking system is — but it’s a solid deterrent against misuse, especially for enterprise deployments where compliance matters.
Where you can use it
As of today, Gemini 3.1 Flash TTS is rolling out in preview on three fronts:
- Google AI Studio and the Gemini API for developers who want to experiment and build custom apps.
- Vertex AI for enterprise customers who need scale, security, and integration with their existing pipelines.
- Google Vids for Workspace users, which is Google’s video creation tool. This is the most consumer-facing use case — you can generate voiceovers for explainer videos or presentations without hiring a voice actor.
I spent most of my time in AI Studio, and the workflow is straightforward: you write or paste your script, add audio tags where you want emphasis or style changes, and hit generate. You can also fine-tune voices and export settings for consistent use across projects. It’s not a full voice cloning system — you can’t train it on a specific person’s voice — but for generic narration or character dialogue, it’s more than capable.
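For developers going the API route rather than AI Studio, a request might look roughly like the sketch below. This builds a plain JSON body in the shape of the existing Gemini `generateContent` TTS schema (`responseModalities`, `speechConfig`, `prebuiltVoiceConfig`); the model ID and voice name are placeholders I've assumed, so check the current docs before relying on any of these field names for this particular model.

```python
import json

# Sketch of a Gemini-style TTS request body. The field names mirror the
# existing Gemini TTS preview schema; MODEL and the voice name are
# placeholder assumptions, not confirmed values for this release.
MODEL = "gemini-3.1-flash-tts"  # placeholder model ID

def build_tts_request(script: str, voice: str = "Kore") -> dict:
    """Assemble a generateContent-style body asking for audio output."""
    return {
        "contents": [{"parts": [{"text": script}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {
                    "prebuiltVoiceConfig": {"voiceName": voice}
                }
            },
        },
    }

body = build_tts_request("[whisper] This is the secret part [/whisper]")
print(json.dumps(body, indent=2))
```

The audio tags travel inside the ordinary text field, which is why no separate "emotion" parameter is needed: the script itself carries the delivery instructions.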
The competition and the catch
Google isn’t the only player in this space. ElevenLabs has been the gold standard for expressive TTS for a while, and OpenAI’s TTS models are solid too. What sets Gemini 3.1 Flash TTS apart is the combination of granular control via audio tags, multi-speaker support, and the sheer language coverage. ElevenLabs supports 29 languages; Google does 70+. That’s a meaningful difference if you’re building a global product.
That said, there are downsides. The model is still in preview, which means it’s not production-ready for everyone. I hit a few latency issues during peak hours, and the audio tag syntax can be a bit finicky — if you forget a closing tag, the model sometimes ignores the entire command. Google’s documentation is decent but not exhaustive, so expect some trial and error.
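Since an unclosed tag can silently kill the whole command, it's worth linting scripts before sending them. Here's a small checker, my own utility rather than anything Google ships, that flags tags opened but never closed:

```python
import re

# Flag audio tags that are opened but never closed, e.g. a missing
# [/whisper]. This is a homegrown sanity check, not part of any SDK.
TAG_RE = re.compile(r"\[(/?)([a-z]+)\]")

def unclosed_tags(script: str) -> list[str]:
    """Return tag names that were opened but not closed, in order."""
    open_tags = []
    for closing, name in TAG_RE.findall(script):
        if not closing:
            open_tags.append(name)
        elif open_tags and open_tags[-1] == name:
            open_tags.pop()
    return open_tags

print(unclosed_tags("[whisper] secret [/whisper] [dramatic] pause"))
# → ['dramatic']
```

Running this over a script before each generate call is cheap insurance against the "model ignored my tag" class of surprises.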
Also, the cost. Google hasn’t published final pricing yet, but if it follows the pattern of other Gemini models, it won’t be cheap at scale. The “low cost” quadrant placement is relative to other high-quality TTS models; it doesn’t mean free. Small developers might find themselves priced out if they’re generating a lot of audio.
Should you care?
If you’re building any kind of voice-based application — chatbots, audiobooks, voiceovers, accessibility tools — this is worth a serious look. The audio tags alone save you from having to hack together multiple TTS models or post-process audio to add emotion. And the multi-speaker support is a genuine time-saver for dialogue-heavy content.
If you’re just a curious user, try it in Google Vids or AI Studio. It’s free for now in preview, and you can see for yourself whether the expressiveness lives up to the hype. My take: it’s not perfect, but it’s the first Google TTS model that actually feels like it was built by people who understand how humans speak, not just how text sounds when read aloud.