What is Voxtral TTS and why can't I clone voices?

Voxtral-4B-TTS generates speech from text and preset voice tokens, but Mistral omitted the encoder, so there is no way to feed in custom audio for cloning.

How can I hack voice cloning onto Voxtral?

Approximate the missing encoder with an open codec such as EnCodec, project semantic features via Whisper, and fine-tune against the preset voices. Expect roughly 50-70% fidelity.
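The "project" step can be pictured as learning a map from one embedding space into another. The sketch below is a toy illustration only: it fits a linear projection by gradient descent on squared error, using synthetic vectors as stand-ins for proxy-encoder features and preset voice embeddings. It does not load Whisper, EnCodec, or Voxtral, and the names `fit_projection`, `apply_projection`, and `W_TRUE` are hypothetical. Whether a linear map is adequate for real voice features is an assumption, not something the article establishes.

```python
import random


def apply_projection(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def fit_projection(pairs, d_in, d_out, lr=0.1, epochs=2000):
    """Fit W (d_out x d_in) minimizing mean ||W x - y||^2 by full-batch gradient descent."""
    W = [[0.0] * d_in for _ in range(d_out)]
    n = len(pairs)
    for _ in range(epochs):
        grad = [[0.0] * d_in for _ in range(d_out)]
        for x, y in pairs:
            pred = apply_projection(W, x)
            for i in range(d_out):
                err = pred[i] - y[i]
                for j in range(d_in):
                    grad[i][j] += 2.0 * err * x[j] / n
        for i in range(d_out):
            for j in range(d_in):
                W[i][j] -= lr * grad[i][j]
    return W


# Synthetic stand-ins (assumption): 3-dim "proxy encoder" features mapped to
# 2-dim "voice" embeddings by a known matrix, so the fit can be checked.
random.seed(0)
W_TRUE = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.75]]
xs = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(20)]
pairs = [(x, apply_projection(W_TRUE, x)) for x in xs]

W_fit = fit_projection(pairs, d_in=3, d_out=2)
```

On this noiseless toy data the fitted matrix recovers `W_TRUE` closely. Real features would be far higher-dimensional and noisy, so any actual attempt would need held-out evaluation rather than a check against a known answer.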

Will Mistral release the full Voxtral encoder?

Likely, given community pressure; watch Hugging Face Spaces for updates.


Voxtral TTS: Mistral's Audio Breakthrough Hamstrung by Missing Encoder

Mistral's Voxtral-4B-TTS dazzles with its token-based audio generation, but a gutted encoder means no custom voice cloning. Here's why that's a massive miss—and how to work around it.

theAIcatchup Apr 10, 2026 4 min read

Figure: Diagram of the Voxtral TTS architecture showing the autoregressive model and the missing encoder.

⚡ Key Takeaways

Voxtral's token-based architecture suits streaming TTS, but the missing encoder blocks custom voice cloning.
Proxy hacks exist via Whisper and open codecs, but fidelity lags without the official encoder weights.
Mistral's truncation echoes past AI gating tactics and risks fragmenting open-source efforts.


Originally reported by Towards Data Science
