Verify critical details (pricing, licensing, availability) with the model's source before business decisions.

Models · Mistral AI

Voxtral TTS

Model family: voxtral

Size
small (4.1B params)
Context
8,192 tokens
Released
2026-03-25
Openness
open-weight
License
CC BY-NC 4.0 (open weights)
Cost tier
mixed
Rating
4.0 — Genuinely strong TTS quality — beats ElevenLabs in zero-shot voice cloning preference tests at roughly a quarter of the per-character cost via API. Half-star haircut reflects the non-commercial open-weight license (unusual for Mistral and meaningful for self-hosted commercial use) and the relative recency of the release (independent third-party benchmarking still catching up as of April 2026).
Modalities
audio-output, text
Capabilities
multilingual, text-to-speech
Access
api-first-party, weights-download-direct, weights-download-hf

Quick Take

Mistral's first text-to-speech model: 9 languages, zero-shot voice cloning from 3 seconds of audio, and roughly 27% of ElevenLabs' per-character cost through Mistral's API. Non-commercial license on the open weights.

Plain-English Description

Voxtral TTS is Mistral's March 2026 entry into the text-to-speech market, and a notable one. It breaks from the Apache 2.0 license Mistral typically uses for open weights: the weights on Hugging Face ship under Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0), which means research and personal use are free but commercial use requires either paying for the hosted API or negotiating a separate commercial license with Mistral. For most commercial users this isn't a big deal, since the API at $0.016 per 1,000 characters is the commercial license and is already cheaper than ElevenLabs, but it's a licensing departure worth understanding before you build on the self-hosted weights.

The model itself is a 4.1-billion-parameter autoregressive flow-matching system built on top of Ministral 3B as its language model backbone. The architecture has three stages: a 3.4B transformer decoder that predicts semantic tokens from input text and a voice reference, a 390M flow-matching acoustic transformer that converts those tokens into audio representations, and a 300M neural audio codec Mistral built from scratch to operate at 12.5 Hz with 80-millisecond frames. That three-stage structure is what enables voice cloning: the codec captures speaker characteristics in a compressed latent space, and the transformer components generate new speech in that captured voice.
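The figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the names are invented here, and the numbers are the ones quoted above.

```python
# Component sizes quoted in the description, in billions of parameters.
COMPONENTS_B = {
    "transformer_decoder": 3.4,
    "flow_matching_acoustic": 0.39,
    "neural_codec": 0.30,
}

CODEC_FRAME_RATE_HZ = 12.5  # codec frames per second, as quoted


def total_params_b() -> float:
    """Sum of the three stages, in billions of parameters."""
    return sum(COMPONENTS_B.values())


def frame_duration_ms(rate_hz: float = CODEC_FRAME_RATE_HZ) -> float:
    """Duration of one codec frame in milliseconds."""
    return 1000.0 / rate_hz


def frames_for(seconds: float, rate_hz: float = CODEC_FRAME_RATE_HZ) -> int:
    """Number of whole codec frames covering `seconds` of audio."""
    return int(seconds * rate_hz)


if __name__ == "__main__":
    print(f"total: {total_params_b():.2f}B parameters")
    print(f"frame: {frame_duration_ms():.0f} ms")
    print(f"3 s reference clip: {frames_for(3)} frames")
```

The three stages sum to 4.09B, which rounds to the headline 4.1B, and a 12.5 Hz frame rate corresponds exactly to 80 ms per frame. A 3-second reference clip therefore spans about 37 codec frames.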

Voxtral TTS generates speech in nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. It ships with 20 built-in preset voices, and can clone a new voice from as little as 3 seconds of reference audio (5–25 seconds recommended for quality). On latency, Mistral reports 70 milliseconds of model latency on H200 GPUs, with community reports of around 90ms real-world time-to-first-audio on high-end hardware. In Mistral's own human preference evaluations (blind native-speaker listening tests), Voxtral TTS beat ElevenLabs Flash v2.5 68.4% of the time in zero-shot voice cloning and 58.3% in flagship-voice comparisons. Those numbers are from Mistral's own testing; independent third-party benchmarking has been slower to materialize as of April 2026, and the Artificial Analysis Speech Arena Leaderboard hadn't added Voxtral TTS to its rankings at last check.

Best For

  • Enterprise voice agents where ElevenLabs pricing is a problem. The API is roughly 27% of ElevenLabs' per-character cost. For high-volume voice deployments (customer service bots, IVR systems, live translation), this pricing delta compounds quickly.
  • Multilingual TTS applications. The nine supported languages cover most of Western Europe plus Hindi and Arabic. Voice cloning works cross-lingually: supply a French reference, generate English output in that voice.
  • Workflows where low latency matters. A time-to-first-audio of 70–90ms enables real-time voice agents, spoken output of live translated text, and other streaming applications.
  • Research and evaluation workloads. The open weights under CC BY-NC 4.0 are fine for research, internal evaluation, non-commercial projects, and personal use without any commercial fee.
  • Teams who want European-jurisdiction TTS. ElevenLabs is US-based; Mistral is a French company. For GDPR-sensitive voice deployments, this difference is relevant to procurement.

Not For

  • Self-hosted commercial deployments without a commercial license. The open weights are non-commercial-use only. If you want to deploy the weights in a revenue-generating product, you need either to contact Mistral for a commercial license or to use the hosted API instead.
  • Applications requiring languages outside the nine supported. The nine are English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. If you need Korean, Japanese, Chinese, Russian, or other languages, look elsewhere (ElevenLabs, OpenAI TTS, Google Cloud Text-to-Speech, Azure Neural TTS).
  • Use cases requiring voice customization beyond fixed presets when self-hosting. Mistral's HF materials indicate the open-weight version is limited to fixed voices; deep voice customization is available only through the hosted API platform.
  • Real-time applications with latency budgets below 70ms. Some voice-agent workloads need sub-50ms TTS latency. Voxtral TTS doesn't reach that floor; alternative streaming TTS models may be required.
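The exclusions above can be expressed as a small screening check. This is a sketch under stated assumptions: the `Requirements` class and the `blockers` function are hypothetical names introduced here for illustration, and the thresholds are the ones quoted on this page.

```python
from dataclasses import dataclass

SUPPORTED_LANGUAGES = {
    "English", "French", "German", "Spanish", "Dutch",
    "Portuguese", "Italian", "Hindi", "Arabic",
}
MODEL_LATENCY_FLOOR_MS = 70  # Mistral's reported model latency on H200 GPUs


@dataclass
class Requirements:  # hypothetical shape, for illustration only
    language: str
    latency_budget_ms: int
    commercial: bool
    self_hosted: bool
    custom_voices: bool
    has_commercial_license: bool = False


def blockers(req: Requirements) -> list[str]:
    """Return the 'Not For' conditions that apply to a set of requirements."""
    found = []
    if req.self_hosted and req.commercial and not req.has_commercial_license:
        found.append("open weights are non-commercial; self-hosting commercially needs a license")
    if req.language not in SUPPORTED_LANGUAGES:
        found.append(f"{req.language} is not one of the nine supported languages")
    if req.self_hosted and req.custom_voices:
        found.append("open-weight release is limited to fixed preset voices")
    if req.latency_budget_ms < MODEL_LATENCY_FLOOR_MS:
        found.append("latency budget is below the reported 70 ms floor")
    return found


if __name__ == "__main__":
    req = Requirements("French", 100, commercial=True, self_hosted=False, custom_voices=True)
    print(blockers(req) or "no blockers from the list above")
```

An empty result only means none of the four listed exclusions applies; it says nothing about voice quality or pricing fit.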

License — Plain-English Summary

Voxtral TTS is the unusual one in Mistral's lineup. The open weights on Hugging Face are Creative Commons Attribution-NonCommercial 4.0: free to download, modify, and share for non-commercial use only. Research, evaluation, personal use, and non-commercial academic work are all fine. Commercial use requires either paying for Mistral's hosted API (which is itself the commercial license) or negotiating a separate commercial agreement with Mistral. For most teams, the hosted API at $0.016/1K characters is the practical commercial path: you pay per use and you're fully licensed for commercial output. The break from Mistral's usual Apache 2.0 pattern is intentional: Mistral wants the open-weights route for research and community adoption while reserving commercial revenue for the API. Plan accordingly.

How It Compares

  • vs. ElevenLabs Flash v2.5: Mistral's own testing shows Voxtral TTS winning 68.4% of zero-shot voice cloning comparisons and 58.3% of flagship-voice comparisons in blind human preference tests. API pricing is roughly 27% of ElevenLabs'. ElevenLabs has a larger voice library, longer ecosystem track record, and more third-party integrations.
  • vs. OpenAI TTS-1 / TTS-1-HD: OpenAI's TTS is simpler to integrate if you're already in the OpenAI ecosystem but lacks zero-shot voice cloning and has a smaller language footprint. Voxtral TTS is more capable at voice cloning; OpenAI is simpler to adopt.
  • vs. Google Cloud Neural2: Google is the established enterprise TTS incumbent with broader language support (40+) and deeper integration with other Google Cloud services. Voxtral TTS is cheaper per character and has better voice cloning; Google has more languages and more platform integration.

Under the Hood

The three-component architecture of Voxtral TTS is worth understanding for teams considering self-hosting. The 3.4B decoder backbone is based on Ministral 3B (Mistral's edge-class language model), giving the TTS system native text understanding capabilities that simpler TTS models lack. The 390M flow-matching acoustic transformer converts semantic tokens from the backbone into audio-space representations using a flow-matching objective rather than diffusion or autoregressive audio generation. The 300M neural codec is Mistral's own architecture, operating at 12.5 Hz with 80ms frames, efficient enough that the full 4.1B model fits in ~8GB BF16 or ~3GB quantized.

The flow-matching design allows tunable compute-quality tradeoffs at inference time. The community Rust implementation voxtral-mini-realtime-rs demonstrates real-time streaming inference in fewer than 10 Euler steps, with a Q4 GGUF quantization that fits entirely in a browser tab via WASM + WebGPU.
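To make the step-count tradeoff concrete, here is a toy Euler sampler. It is a sketch only: the `velocity` function is a hand-written stand-in for a learned flow-matching network (it is not Voxtral's model or the Rust project's code), chosen because its exact solution is known, so the error at each step count can be measured.

```python
import math

TARGET = 1.0  # toy "clean" value the flow should reach
RATE = 2.0    # toy constant controlling how fast the flow converges


def velocity(x: float, t: float) -> float:
    """Toy velocity field standing in for a learned network (assumption)."""
    return RATE * (TARGET - x)


def euler_sample(x0: float, steps: int) -> float:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        x += dt * velocity(x, k * dt)
    return x


def exact(x0: float) -> float:
    """Closed-form solution of the toy ODE at t=1."""
    return TARGET + (x0 - TARGET) * math.exp(-RATE)


if __name__ == "__main__":
    x0 = -1.0  # stands in for a noise sample
    for steps in (2, 4, 8, 32):
        err = abs(euler_sample(x0, steps) - exact(x0))
        print(f"{steps:>2} steps: error {err:.4f}")
```

Each Euler step costs one evaluation of the velocity function, which in a real flow-matching system would be one forward pass of the network. Fewer steps mean less compute and a coarser approximation of the same trajectory, which is the knob the paragraph above describes.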

On benchmark specifics, Mistral used pairwise human preference evaluations rather than Mean Opinion Score (MOS) ratings or automated MOS predictors, arguing in the research paper that automated scores don't reliably capture naturalness across languages. The blind listening-test methodology is defensible but produces results that aren't directly comparable to other TTS models benchmarked on MOS, a point worth noting when comparing to published ElevenLabs or Google TTS quality numbers.

Cost

Self-hosted cost
$0.00 beyond compute
API providers
mistral
Notes
Mistral's hosted API is priced at $0.016 per 1,000 characters of input text. Self-hosting the open weights is free for non-commercial use only (see license). For commercial self-hosting, contact Mistral for a commercial agreement; the hosted API is the standard commercial license path.
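For budgeting, the per-character price converts to a monthly figure as sketched below. The function names are illustrative, and the comparison figure is derived from the "roughly 27%" ratio stated on this page; it is not a quoted ElevenLabs price.

```python
MISTRAL_USD_PER_1K_CHARS = 0.016  # quoted API price
COST_RATIO_VS_ELEVENLABS = 0.27   # "roughly 27%", as stated on this page


def api_cost_usd(characters: int, usd_per_1k: float = MISTRAL_USD_PER_1K_CHARS) -> float:
    """API cost in USD for a given number of characters."""
    return characters / 1000 * usd_per_1k


def implied_comparison_cost_usd(characters: int) -> float:
    """Cost implied by the 27% ratio; an estimate, not a quoted price."""
    return api_cost_usd(characters) / COST_RATIO_VS_ELEVENLABS


if __name__ == "__main__":
    monthly_chars = 10_000_000  # example volume, chosen for illustration
    print(f"Voxtral TTS API: ${api_cost_usd(monthly_chars):,.2f}")
    print(f"Implied comparison: ${implied_comparison_cost_usd(monthly_chars):,.2f}")
```

At 10 million characters, this works out to $160 at the quoted rate, against roughly $593 implied by the ratio. Check current price lists from both vendors before relying on either number.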

Hardware requirements

Min VRAM
8 GB
Recommended VRAM
16 GB
Runs on laptop
Yes
Notes
BF16 weights (~8GB) run on a single 16GB GPU with room for inference overhead. Quantized (Q4 GGUF) versions drop to ~3GB and can run on edge devices and Apple Silicon Macs. Mistral claims smartphone deployment is possible at aggressive quantization, though this has not been independently verified.
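The weight sizes follow from the parameter count and the bytes stored per parameter. The sketch below is a back-of-envelope estimate, not a measurement: it counts raw weights only and ignores quantization metadata, activations, and runtime buffers.

```python
PARAMS = 4.1e9  # total parameter count quoted for the model

BYTES_PER_PARAM = {
    "bf16": 2.0,  # 16-bit floats
    "q4": 0.5,    # 4 bits per weight, ignoring scales and other metadata
}


def weight_gb(precision: str, params: float = PARAMS) -> float:
    """Raw weight storage in GB (10^9 bytes), excluding runtime overhead."""
    return params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: {weight_gb(precision):.2f} GB")
```

BF16 comes to about 8.2 GB, in line with the ~8GB figure. Raw 4-bit weights come to about 2.05 GB; the gap to the quoted ~3GB is presumably quantization scales and any layers kept at higher precision, though the page does not break this down.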

Commercial-use conditions

Open weights are released for research and non-commercial use only. Commercial use requires a separate commercial agreement with Mistral. For most commercial deployments, Mistral's hosted API at $0.016 per 1,000 characters is the commercial license: you pay the API fee and you're licensed for commercial use of the outputs.
