Verify critical details (pricing, licensing, availability) with the model's source before making business decisions.

Catalog entry last reviewed 92 days ago.

Voxtral TTS

Model family: voxtral

Size

small (4.1B params)

Context

8,192 tokens

Released

2026-03-25

Openness

open-weight

License

Creative Commons Attribution-NonCommercial 4.0 International · commercial: no

Cost tier

mixed

Rating

4.0 ★. Genuinely strong TTS quality: it beats ElevenLabs Flash v2.5 in Mistral's own zero-shot voice cloning preference tests at roughly a quarter of the per-character cost via API. The half-star deduction reflects the non-commercial open-weight license (unusual for Mistral and meaningful for self-hosted commercial use) and the relative recency of the release (independent third-party benchmarking was still catching up as of April 2026).

Modalities

audio-output, text

Capabilities

multilingual, text-to-speech

Access

api-first-party, weights-download-direct, weights-download-hf

tts
text-to-speech
audio
voice-cloning
multilingual
open-weight
non-commercial-license
edge
eu-based

Quick Take

Mistral's first text-to-speech model — 9 languages, zero-shot voice cloning from 3 seconds of audio, and roughly 27% of ElevenLabs' per-character cost through Mistral's API. Non-commercial license on the open weights.

Plain-English Description

Voxtral TTS is Mistral's March 2026 entry into the text-to-speech market, and a notable one. It departs from the Apache 2.0 license common across Mistral's open-weight releases: the open weights on Hugging Face ship under Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0), which means research and personal use are free but commercial use requires either paying for the hosted API or negotiating a separate commercial license with Mistral. For most commercial users this isn't a big deal, since the API at $0.016 per 1,000 characters is the commercial license and is already cheaper than ElevenLabs, but it's a licensing departure worth understanding before you build on the self-hosted weights.

The model itself is a 4.1-billion-parameter autoregressive flow-matching system built on top of Ministral 3B as its language model backbone. The architecture has three stages: a 3.4B transformer decoder that predicts semantic tokens from input text and a voice reference, a 390M flow-matching acoustic transformer that converts those tokens into audio representations, and a 300M neural audio codec Mistral built from scratch to operate at 12.5 Hz with 80-millisecond frames. That three-stage structure is what enables voice cloning — the codec captures speaker characteristics in a compressed latent space, and the transformer components generate new speech in that captured voice.
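The figures in this paragraph can be cross-checked with a little arithmetic. The sketch below uses only the numbers stated in this entry; the function and constant names are illustrative and do not come from Mistral's code.

```python
import math

# Component sizes as stated in this entry, in parameters.
COMPONENTS = {
    "transformer_decoder": 3.4e9,
    "flow_matching_acoustic_transformer": 390e6,
    "neural_audio_codec": 300e6,
}

CODEC_FRAME_RATE_HZ = 12.5  # codec frame rate stated in this entry


def total_params_billions() -> float:
    """Sum the three stages and express the result in billions."""
    return sum(COMPONENTS.values()) / 1e9


def frame_duration_ms(frame_rate_hz: float = CODEC_FRAME_RATE_HZ) -> float:
    """Duration of one codec frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def frames_for_audio(seconds: float, frame_rate_hz: float = CODEC_FRAME_RATE_HZ) -> int:
    """Number of whole codec frames needed to cover `seconds` of audio."""
    return math.ceil(seconds * frame_rate_hz)


if __name__ == "__main__":
    print(f"total parameters: {total_params_billions():.2f}B")
    print(f"frame duration: {frame_duration_ms():.0f} ms")
    print(f"frames for 10 s of audio: {frames_for_audio(10.0)}")
```

The three stages sum to 4.09B, which rounds to the advertised 4.1B, and a 12.5 Hz codec implies exactly the 80-millisecond frames quoted above.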

Voxtral TTS generates speech in nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. It ships with 20 built-in preset voices, and can clone a new voice from as little as 3 seconds of reference audio (5–25 seconds recommended for quality). On latency, Mistral reports 70 milliseconds of model latency on H200 GPUs, with community reports of around 90ms real-world time-to-first-audio on high-end hardware. In Mistral's own human preference evaluations (blind native-speaker listening tests), Voxtral TTS beat ElevenLabs Flash v2.5 68.4% of the time in zero-shot voice cloning and 58.3% in flagship-voice comparisons. Those numbers are from Mistral's own testing; independent third-party benchmarking has been slower to materialize as of April 2026, and the Artificial Analysis Speech Arena Leaderboard hadn't added Voxtral TTS to its rankings at last check.
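The language list and reference-audio guidance above lend themselves to a pre-flight check before a request is sent. This is a minimal sketch built only from the limits stated in this entry; `check_request` and its status strings are hypothetical names, not part of any Mistral SDK.

```python
# Languages and reference-audio limits as stated in this entry.
SUPPORTED_LANGUAGES = frozenset({
    "english", "french", "german", "spanish", "dutch",
    "portuguese", "italian", "hindi", "arabic",
})
MIN_REFERENCE_SECONDS = 3.0
RECOMMENDED_RANGE_SECONDS = (5.0, 25.0)


def check_request(language: str, reference_seconds: float) -> str:
    """Classify a voice-cloning request against the limits in this entry.

    Returns one of: "unsupported_language", "reference_too_short",
    "outside_recommended_range", or "ok". (Hypothetical status names.)
    """
    if language.strip().lower() not in SUPPORTED_LANGUAGES:
        return "unsupported_language"
    if reference_seconds < MIN_REFERENCE_SECONDS:
        return "reference_too_short"
    low, high = RECOMMENDED_RANGE_SECONDS
    if not (low <= reference_seconds <= high):
        return "outside_recommended_range"
    return "ok"


if __name__ == "__main__":
    for lang, secs in [("French", 10.0), ("Japanese", 10.0), ("Hindi", 2.0), ("Arabic", 3.5)]:
        print(lang, secs, "->", check_request(lang, secs))
```

A clip between 3 and 5 seconds is accepted but flagged, reflecting the distinction the entry draws between the minimum and the recommended length.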

Best For

Enterprise voice agents where ElevenLabs pricing is a problem. The API is roughly 27% of ElevenLabs' per-character cost. For high-volume voice deployments (customer service bots, IVR systems, live translation), this pricing delta compounds quickly.
Multilingual TTS applications. The 9-language support covers most Western European markets plus Hindi and Arabic. Voice cloning works cross-lingually: supply a French reference, generate English output in that voice.
Workflows where low latency matters. Roughly 70ms of model latency (about 90ms time-to-first-audio in community reports) enables real-time voice agents, spoken translation of live text (the reverse of live subtitling), and other streaming applications.
Research and evaluation workloads. The open weights under CC BY-NC 4.0 are fine for research, internal evaluation, non-commercial projects, and personal use without any commercial fee.
Teams who want European-jurisdiction TTS. ElevenLabs is US-based; Mistral is a French company. For GDPR-sensitive voice deployments, that jurisdiction is procurement-relevant.

Not For

Self-hosted commercial deployments without a commercial license. The open weights are non-commercial-use only. If you want to deploy the weights in a revenue-generating product, you need either to contact Mistral for a commercial license or to use the hosted API instead.
Applications requiring languages outside the nine supported (English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic). If you need Korean, Japanese, Chinese, Russian, or other languages, look elsewhere (ElevenLabs, OpenAI TTS, Google Cloud Text-to-Speech, Azure Neural TTS).
Use cases requiring voice customization beyond fixed presets when self-hosting. Mistral's HF materials indicate the open-weight version is limited to fixed voices; deep voice customization is available only through the hosted API platform.
Ultra-low-latency real-time applications that need less than 70ms. Some voice-agent workloads need sub-50ms TTS latency. Voxtral TTS doesn't reach that floor; alternative streaming TTS models may be required.

License — Plain-English Summary

Voxtral TTS is the unusual one in Mistral's lineup. The open weights on Hugging Face are Creative Commons Attribution-NonCommercial 4.0: free to download, modify, and share for non-commercial use only. Research, evaluation, personal use, and non-commercial academic work are all fine. Commercial use requires either paying for Mistral's hosted API (which is itself the commercial license) or negotiating a separate commercial agreement with Mistral. For most teams, the hosted API at $0.016/1K characters is the practical commercial path: you pay per use and you're fully licensed for commercial output. This breaks from the Apache 2.0 pattern of most Mistral open-weight releases, and the posture is intentional: Mistral wants the open-weights route for research and community adoption while reserving commercial revenue for the API. Plan accordingly.
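The rule described in this summary reduces to two questions: is the use commercial, and are you self-hosting? The sketch below encodes that rule as this entry states it; `licensing_path` is a hypothetical helper, and the returned strings are paraphrases of this entry rather than license text.

```python
def licensing_path(commercial_use: bool, self_hosted: bool) -> str:
    """Return the licensing route this entry describes for a deployment.

    Hypothetical helper: it mirrors the summary above, not the license itself.
    """
    if not self_hosted:
        # The hosted API covers both commercial and non-commercial use.
        return "hosted API (pay per character)"
    if commercial_use:
        # CC BY-NC 4.0 weights cannot back a revenue-generating product.
        return "separate commercial agreement with Mistral"
    return "open weights under CC BY-NC 4.0"


if __name__ == "__main__":
    for commercial in (False, True):
        for hosted_locally in (False, True):
            print(commercial, hosted_locally, "->", licensing_path(commercial, hosted_locally))
```

Only one of the four combinations, commercial use on self-hosted weights, requires contacting Mistral.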

How It Compares

vs. ElevenLabs Flash v2.5 — Mistral's own testing shows Voxtral TTS winning 68.4% of zero-shot voice cloning comparisons and 58.3% of flagship-voice comparisons in blind human preference tests. API pricing is roughly 27% of ElevenLabs'. ElevenLabs has a larger voice library, longer ecosystem track record, and more third-party integrations.
vs. OpenAI TTS-1 / TTS-1-HD — OpenAI's TTS is simpler to integrate if you're already in the OpenAI ecosystem but lacks zero-shot voice cloning and has a smaller language footprint. Voxtral TTS is more capable at voice cloning; OpenAI is simpler to adopt.
vs. Google Cloud Neural2 — Google is the established enterprise TTS incumbent with broader language support (40+) and deeper integration with other Google Cloud services. Voxtral TTS is cheaper per character and has better voice cloning; Google has more languages and more platform integration.

Under the Hood

The three-component architecture of Voxtral TTS is worth understanding for teams considering self-hosting. The 3.4B decoder backbone is based on Ministral 3B (Mistral's edge-class language model), giving the TTS system native text understanding capabilities that simpler TTS models lack. The 390M flow-matching acoustic transformer converts semantic tokens from the backbone into audio-space representations using a flow-matching objective rather than diffusion or autoregressive audio generation. The 300M neural codec is Mistral's own architecture, operating at 12.5 Hz with 80ms frames — efficient enough that the full 4.1B model fits in ~8GB BF16 or ~3GB quantized.

The flow-matching design allows tunable compute-quality tradeoffs at inference time. The community Rust implementation voxtral-mini-realtime-rs demonstrates real-time streaming inference with fewer than 10 Euler steps, with Q4 GGUF quantization fitting entirely in a browser tab via WASM + WebGPU.
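To make the step-count tradeoff concrete, here is a toy illustration of Euler integration. It is not Voxtral's sampler: the `velocity` function is a hand-written stand-in for the learned acoustic transformer, chosen because its exact endpoint is known, and every name here is an assumption for illustration only.

```python
import math

RATE = 2.0  # toy constant controlling how strongly x is pulled toward the target


def velocity(x: float, t: float, target: float) -> float:
    """Toy velocity field. A real flow-matching model would evaluate a
    neural network here; `t` is kept only to mirror that signature."""
    return RATE * (target - x)


def euler_integrate(x0: float, target: float, num_steps: int) -> float:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    dt = 1.0 / num_steps
    x = x0
    for k in range(num_steps):
        x = x + dt * velocity(x, k * dt, target)
    return x


def exact_endpoint(x0: float, target: float) -> float:
    """Closed-form solution of the toy ODE at t=1, used as ground truth."""
    return target + (x0 - target) * math.exp(-RATE)


if __name__ == "__main__":
    truth = exact_endpoint(0.0, 1.0)
    for steps in (2, 4, 8, 32):
        err = abs(euler_integrate(0.0, 1.0, steps) - truth)
        print(f"{steps:>2} steps: error {err:.4f}")
```

Each step costs one evaluation of the velocity function, so fewer steps means less compute and a larger integration error. In a real model each evaluation is a forward pass, which is why a low step count matters for streaming.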

On benchmark specifics, Mistral used pairwise human preference evaluations rather than automated metrics such as model-predicted Mean Opinion Score (MOS), arguing in the research paper that automated scores don't reliably capture naturalness across languages. The blind listening-test methodology is defensible, but it produces win rates that aren't directly comparable to the MOS figures other TTS models report, a point worth noting when comparing to published ElevenLabs or Google TTS quality numbers.

Cost

Self-hosted cost: $0.00 beyond compute
API providers: mistral
Notes: Mistral's hosted API is priced at $0.016 per 1,000 characters of input text. Self-hosting the open weights is free for non-commercial use only (see license). For commercial self-hosting, contact Mistral for a commercial agreement; the hosted API is the standard commercial license path.

Pricing data is 92 days old. Verify with the source before relying on it.
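For budgeting, the per-character price converts directly into a volume estimate. The sketch below uses the $0.016 per 1,000 characters figure and the "roughly 27%" ratio from this entry. The ElevenLabs rate it derives is implied by that ratio, not a quoted ElevenLabs price, and the function names are illustrative.

```python
MISTRAL_USD_PER_1K_CHARS = 0.016   # API price stated in this entry
COST_RATIO_VS_ELEVENLABS = 0.27    # "roughly 27%" as stated in this entry


def api_cost_usd(num_chars: int, usd_per_1k: float = MISTRAL_USD_PER_1K_CHARS) -> float:
    """Estimated API cost in USD for synthesizing `num_chars` characters."""
    return num_chars / 1000 * usd_per_1k


def implied_elevenlabs_usd_per_1k() -> float:
    """Per-1K-character rate implied by the 27% ratio (an inference, not a quote)."""
    return MISTRAL_USD_PER_1K_CHARS / COST_RATIO_VS_ELEVENLABS


if __name__ == "__main__":
    monthly_chars = 50_000_000  # hypothetical monthly volume
    ours = api_cost_usd(monthly_chars)
    theirs = api_cost_usd(monthly_chars, implied_elevenlabs_usd_per_1k())
    print(f"Voxtral TTS API: ${ours:,.2f}")
    print(f"Implied ElevenLabs: ${theirs:,.2f}")
```

At one million characters the Voxtral API figure works out to $16, and the implied comparison rate is about $0.059 per 1,000 characters.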

Hardware requirements

Min VRAM: 8 GB
Recommended VRAM: 16 GB
Runs on laptop: Yes
Notes: BF16 weights (~8GB) run on a single 16GB GPU, which leaves headroom for inference overhead. Quantized (Q4 GGUF) versions drop to ~3GB and can run on edge devices and Apple Silicon Macs. Mistral claims smartphone deployment is possible at aggressive quantization, though this has not been independently verified.
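The VRAM figures follow from parameter count times bytes per parameter. This sketch assumes the 4.1B total from this entry and counts weights only; activations, caches, and runtime overhead are excluded, and `weight_memory_gb` is an illustrative name.

```python
TOTAL_PARAMS = 4.1e9  # total parameter count stated in this entry


def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    print(f"BF16 (16-bit): {weight_memory_gb(TOTAL_PARAMS, 16):.2f} GB")
    print(f"nominal 4-bit: {weight_memory_gb(TOTAL_PARAMS, 4):.2f} GB")
```

BF16 comes to about 8.2 GB, matching the ~8GB figure. A nominal 4 bits per weight gives about 2.05 GB, below the ~3GB quoted here; the gap is likely because Q4 GGUF formats store per-block scale data and often keep some tensors at higher precision.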

Commercial-use conditions

Open weights are released for research and non-commercial use only. Commercial use requires a separate commercial agreement with Mistral. For most commercial deployments, Mistral's hosted API at $0.016 per 1,000 characters is the commercial license: you pay the API fee and you're licensed for commercial use of the outputs.