Google has started rolling out Gemini 3.1 Flash TTS, a new preview text-to-speech model that adds more granular control over how generated speech sounds. In an April 15 post on X, Google AI Studio team member Logan Kilpatrick said the model supports scene direction, speaker-level instructions, inline audio tags, more natural voices, and 70 languages. A demo video attached to the post shows rapid style shifts inside a single clip.

What launched

Google's product post says 3.1 Flash TTS is available in preview through the Gemini API and Google AI Studio, with enterprise preview access on Vertex AI and availability in Workspace through Google Vids. Google Cloud documentation lists the model ID as gemini-3.1-flash-tts-preview and says it supports both single-speaker output and multi-speaker dialogue.
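
For developers wondering what a request might look like, here is a minimal sketch of a single-speaker request body. Only the model ID comes from Google's documentation as reported above. The field names (responseModalities, speechConfig, prebuiltVoiceConfig) are an assumption modeled on earlier Gemini TTS preview models and may differ for 3.1 Flash TTS, the helper build_tts_request is hypothetical, and "Kore" is a placeholder voice name. The sketch only builds the payload and does not call any API.

```python
import json

# Model ID as listed in Google Cloud documentation, per the article.
MODEL_ID = "gemini-3.1-flash-tts-preview"


def build_tts_request(text, voice_name):
    """Build a JSON-serializable body for a single-speaker speech request.

    ASSUMPTION: the field names mirror earlier Gemini TTS previews and are
    not confirmed for this model. Check the current API reference.
    """
    return {
        "contents": [{"parts": [{"text": text}]}],
        "generationConfig": {
            # Ask for audio output instead of text.
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {
                    # voice_name is a placeholder for one of the prebuilt voices.
                    "prebuiltVoiceConfig": {"voiceName": voice_name}
                }
            },
        },
    }


body = build_tts_request("Welcome back to the show.", "Kore")
print(MODEL_ID)
print(json.dumps(body, indent=2))
```

Keeping the payload as a plain dictionary makes it easy to adjust once the actual request schema is confirmed.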

The company is positioning audio tags as the main control surface. Google Cloud says developers can steer pacing, tone, pauses, accents, and non-verbal sounds through 200-plus inline natural-language tags such as [whispers], [laughs], and [short pause], while also choosing from 30 prebuilt voices.
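
To make the inline-tag idea concrete, the following sketch composes a script with bracketed tags and lists the tags it contains. It is plain string handling, not part of any Google SDK: the helpers tag and find_tags are hypothetical, and the only tags used are the three examples Google Cloud cites. How the model interprets any given tag is not something this code can verify.

```python
import re

# Matches a bracketed tag such as [whispers] or [short pause].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def tag(name):
    """Wrap a tag name in square brackets, rejecting empty or nested names."""
    name = name.strip()
    if not name or "[" in name or "]" in name:
        raise ValueError(f"invalid tag name: {name!r}")
    return f"[{name}]"


def find_tags(script):
    """Return the tag names found in a script, in order of appearance."""
    return TAG_PATTERN.findall(script)


script = (
    f"{tag('whispers')} I have a secret. "
    f"{tag('short pause')} {tag('laughs')} Just kidding."
)

print(script)
print(find_tags(script))
```

A check like find_tags can be useful before sending text to the model, for example to log which style cues a script relies on.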

Why it matters

The conservative takeaway is that Google is pushing TTS closer to promptable performance rather than plain narration. That matters for audiobook tooling, customer support, accessibility software, and character-driven apps that need speech to stay expressive without a separate editing pass.

Google also says all audio generated by Gemini 3.1 Flash TTS is watermarked with SynthID, a notable safeguard as expressive synthetic voices become easier to produce at scale.