    A Guide to Voice Cloning on Voxtral with a Missing Encoder

By Editor Times Featured | April 10, 2026 | 14 Mins Read


Voxtral TTS has recently been released by Mistral. It is a powerful text-to-speech model that beats ElevenLabs v2.5 Flash, according to Mistral's tests. Beyond state-of-the-art performance on text-to-speech tasks (among models of comparable size), Mistral announced voice cloning capabilities and published the weights of the model. That attracted huge interest, because a high-quality text-to-speech (TTS) model that is small enough for local inference and supports voice cloning is something in demand across both business and the community.

The issue, however, is that Mistral removed the encoder weights of the audio autoencoder, so users cannot clone arbitrary voices; we can only use the voices Mistral prepared for us. That is a big limitation compared to the paper and the initial announcement.

Here I provide (1) an overview of the Voxtral TTS architecture with some technical details and comparisons, (2) my research on the audio autoencoder and how it actually encodes audio, and (3) a study of how we can still obtain a representation for any audio to potentially use voice cloning, even though the published weights are truncated at the encoder.

Why I'm sure Voxtral TTS is a technology worth understanding (short personal story)

Years ago, I was working at Skyeng, where we were building our automatic speech recognition (ASR) system. It was 2021, Whisper had not yet been released by OpenAI, and ASR was a hot topic, especially for unusual speech: our ASR was for non-native speakers.

At the time I was reading many papers about audio understanding and encoding (mostly using transformer encoders). When I read the Wav2Vec2 paper I got a strong feeling that we should adopt this technology, understand it in tiny detail, and use it, because I was sure the approaches and the state of the mentioned technologies were general enough to persist as the new classics in the audio understanding and signal processing domain. I was right. And with the Voxtral TTS release I got the same feeling.

    Voxtral TTS overview

Voxtral-4B-TTS is a 4-billion-parameter model that uses an autoregressive large language model (LLM-)based 3B backbone (the Ministral 3B model). Simply put, the model takes as input audio tokens that represent the voice to clone and text tokens to be voiced. Similar to LLMs, the model autoregressively generates tokens, with the difference that these tokens are voice tokens. The paper describing the model contains the following illustration of this part:

Autoregressive audio generator (based on an LLM), source: https://arxiv.org/pdf/2603.25551

Important things to note: both the reference audio and the generated audio can be split into non-overlapping, independent tokens, each representing 80 ms of audio; and there is a nontrivial head consisting of a linear head and a flow-matching transformer that creates the semantic and acoustic components (also called tokens) of an individual voice token. These two details make this model special. Independent audio (voice) tokens make it capable of native audio streaming. And this head is a clever combination of two existing approaches to audio generation: discrete token prediction (the linear head) and complex-distribution approximation through a diffusion process, using diffusion models or flow-matching transformers (the same approach we usually use for image and video generation).
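As a quick sanity check of the streaming granularity: 80 ms per token means 12.5 voice tokens per second (the helper name below is mine, for illustration only):

```python
FRAME_MS = 80  # each non-overlapping voice token covers 80 ms of audio

def num_tokens(duration_s: float) -> int:
    """Number of voice tokens needed for a given audio duration,
    i.e. 12.5 tokens per second at 80 ms per token."""
    return int(duration_s * 1000 / FRAME_MS)
```

So an 8-second clip, like the one used later in this post, corresponds to 100 voice tokens.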

That looks conceptually elegant and beautiful. But there is one more required component: the audio autoencoder. The audio autoencoder is responsible for producing the acoustic and semantic tokens that are fundamental to the model. The architecture overview of the 350M autoencoder from the original paper:

Audio autoencoder, source: https://arxiv.org/pdf/2603.25551

The audio autoencoder, Voxtral Codec, is the model that produces 37 discrete tokens for each 80 ms audio frame (in the middle of the architecture, inside the quantization block) and can reconstruct the audio back from these discrete tokens. The classic description of an autoencoder is encoder -> bottleneck -> decoder. Mistral has not released the encoder, however. This means we can still generate audio after the autoregressive model produces audio tokens, by applying the decoder of the Voxtral Codec, but we cannot natively feed some audio in to get audio tokens that we could use as the voice condition in the autoregressive model, since Mistral has not released the weights of the encoder and the decoder is not invertible (like most deep learning models).

Another interesting detail: the bottleneck mentions two kinds of tokens, semantic and acoustic. The implementation here is also worth understanding. The encoder produces a 292-dim latent that is split into a 256-dim vector and 36 single-dimension scalars. The 256-dim vector is mapped to codebook embeddings and is then represented by a code, the ID of the closest codebook entry (that is vector quantization happening). Each of the 36 single-dimension scalars is mapped to a range from 0 to 21 using a scaled tanh activation. These elements are already a simplification over some of the earlier approaches that used residual vector quantization.
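A minimal sketch of that bottleneck, assuming the 292 = 256 + 36 split, an 8192-entry codebook, and 22 FSQ levels as described in the paper; the function and tensor names are mine:

```python
import torch

def quantize_latent(latent: torch.Tensor, codebook: torch.Tensor, levels: int = 22):
    """Split a per-frame 292-dim latent into one VQ code and 36 FSQ codes.

    latent:   [T, 292] frame latents (what the missing encoder would produce)
    codebook: [8192, 256] semantic codebook embeddings
    """
    semantic, acoustic = latent[:, :256], latent[:, 256:]  # [T, 256], [T, 36]
    # Vector quantization: the ID of the closest codebook entry per frame
    semantic_codes = torch.cdist(semantic, codebook).argmin(dim=-1)  # [T]
    # Finite scalar quantization: scaled tanh into [0, levels - 1], then round
    scaled = (torch.tanh(acoustic) + 1) / 2 * (levels - 1)
    acoustic_codes = scaled.round().long()  # [T, 36], integer values in 0..21
    return semantic_codes, acoustic_codes
```

Together that gives the 1 + 36 = 37 discrete tokens per 80 ms frame mentioned above.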

A few words about semantic tokens. The embeddings of the semantic tokens are linearly projected to match the logits of the Whisper decoder when it performs speech-to-text on the same audio. In other words, a constraint is applied that links the semantic tokens to the latent state the Whisper (ASR) model uses right before producing text from speech. That is why these tokens are called semantic: they are supposed to be related to the text-related latents, and text is related to semantics.

The question I got here (and tested later): do these semantic tokens really represent the meaning, the words that should be voiced, while the acoustic ones represent the voice itself?

A copy of some technical notes I took for my conspectus after reading the Voxtral paper (with some cross-references)

Voxtral-TTS (TTS from Mistral), https://arxiv.org/pdf/2603.25551. Similar to Qwen-3 TTS (the 12Hz variant) to some extent. Mistral trained their own codec (encoder-decoder architecture). Here the encoder produces a 292-dimension latent state vector that is split into 256 dimensions for VQ and 36 for finite scalar quantization (FSQ). In comparison to Mimi and Qwen-3, Mistral uses FSQ instead of RVQ and concatenates instead of summing the vectors in the decoder backbone. Another difference: VQ vectors are linearly projected to match Whisper decoder hidden states (previously the self-supervised WavLM was used), which controls semantics.

These vectors are used in the decoder-only 3B model to do voice cloning and text-to-speech. During training, noise replacement and quantization skipping are applied in around 50% of cases to make the model more robust. For each of the states there is an embedding; they are summed and used as the input to the transformer. This transformer autoregressively predicts hidden states for audio tokens and (end of audio) tokens. These hidden states are used with a linear head to predict the VQ-based semantic token and to condition a 3-layer flow matching model (note: diffusion with an improved objective) that generates continuous acoustic tokens from noise (later quantized with FSQ); classifier-free guidance (CFG) can be applied. Audio is split into short non-interleaving parts. During training, both the semantic cross-entropy loss and the flow matching loss for a single sampled timestep are computed (the loss is reweighted, for example lowering the weight for silences). Post-training DPO helps to further improve the results.

Finite scalar quantization (FSQ), https://arxiv.org/pdf/2309.15505. A simple approach that improves on RVQ and VQ (VQ is plain vector quantization, while RVQ is residual vector quantization). RVQ is based on refining the VQ codebook with additional smaller codebooks, so to reconstruct we need to sum a set of codebook embeddings. FSQ is simple (and produces independent codes compared to RVQ): we apply a (scaled) tanh activation to each dimension over finite values, round the result to the nearest integer, and use it as the code. It helps utilize codebooks better and trains in a simpler way (without trainable codebook embeddings).

Do semantic tokens really represent semantics?

Some tests I was about to conduct:

1. Do semantic tokens represent the words to be voiced? If they really represent words, then we can manipulate these tokens to change the meaning of the speech while preserving the same voice. A more realistic outcome would be just broken words, or nonsense in the generated speech, while the voice stays the same.
2. How robust is Voxtral Codec's decoder to noise in the codes? If we randomly replace some of the codes and the audio remains similar, it means there may be a way to approximate the audio through some gradient-based code value selection. Otherwise, if a small change in the codes destroys the audio, it is nearly impossible to reconstruct the codes from the audio without the actual encoder.
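The second test can be sketched as randomly replacing a fraction of the discrete codes with random valid ones before decoding (the function name and the corruption rate are my choices):

```python
import torch

def corrupt_codes(codes: torch.Tensor, vocab_size: int,
                  frac: float = 0.3, seed: int = 0) -> torch.Tensor:
    """Replace a random fraction of discrete codes with uniformly sampled codes."""
    g = torch.Generator().manual_seed(seed)
    mask = torch.rand(codes.shape, generator=g) < frac  # which positions to corrupt
    random_codes = torch.randint(0, vocab_size, codes.shape, generator=g)
    return torch.where(mask, random_codes, codes)
```

Decoding the corrupted codes and listening to the result tells us how tolerant the decoder is.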

We know that we do not have weights for the encoder part of the Voxtral Codec, but we have the decoder, we have the autoregressive backbone weights, and we have some voices Mistral provided as references (we can generate speech from text using these voices).

    The pipeline:

• Having the reference voice embeddings, we can apply the coordinate descent algorithm to extract the codes for the voice. Here I have a script that does that.
• Decoder weights and codes are not sufficient for the Voxtral Codec's decoder to work; we also need the architecture implemented in code. The vllm-omni project has a Voxtral implementation under a permissive Apache 2.0 license. I used Claude Code to extract the Voxtral Codec architecture code from vllm-omni. The standalone Voxtral Codec architecture extracted from vllm-omni is in my GitHub repository.
• I prepared a Jupyter Notebook that takes voice embeddings (provided by Mistral), reconstructs their codes, optionally destroys some of the codes (semantic and some acoustic), and reconstructs the audio from them using the Voxtral Codec's decoder with the real weights.

Here you can download and listen to the audio reconstructed from the embeddings: audio

Here is the audio reconstructed from the same embeddings, but with semantic and some acoustic tokens randomized (following the script): audio

From these audio files, it is clear that semantic tokens, despite their name, do not really represent the actual words or meanings to be voiced. And what is more important, the decoder is robust to some changes in the codes, which means we can try to apply gradient descent to directly train codes for the actual audio.

A gradient descent approach to reconstruct codes when the encoder is missing

By training the codes themselves I mean we initialize a single layer of the form nn.Parameter(torch.randn(num_frames, num_codes_per_frame)), where for each frame we have 37 codes, and train it on a real-audio reconstruction loss. That would work if we were operating in continuous space and the target object to reconstruct were some simple signal, not a high-frequency audio waveform.

Problems due to discrete tokens

Each token is discrete, similar to tokens in an LLM; if there are two discrete tokens A and B, we cannot gradually optimize a transition from A to B. In Voxtral we have separate semantic and acoustic tokens, and the acoustic ones are easier to model because of the finite scalar quantization (FSQ) they use: they are obtained from continuous space through a rounding operation. But still, both semantic and acoustic tokens require the straight-through estimator (STE) on the forward step and a differentiable transition on the backward step.

For acoustic tokens we can simply apply STE directly:

...
# trained acoustic codes for the selected audio
# initialization
self.acoustic_values = nn.Parameter(
    torch.randn(num_frames, 36)
)
...

# number of levels for a single acoustic token
# according to the paper and the vllm-omni implementation
acoustic_levels = 22

acoustic_normalized = torch.tanh(self.acoustic_values)

# Quantize to discrete levels
acoustic_scaled = ((acoustic_normalized + 1) / 2) * (acoustic_levels - 1)
acoustic_quantized = acoustic_scaled.round()

# STE: forward uses the quantized values, backward uses the continuous ones
acoustic_codes = acoustic_scaled + (acoustic_quantized - acoustic_scaled).detach()

For semantic tokens it is more complicated. We cannot apply a tanh activation, because each semantic code is associated with a multi-dimensional embedding, so we are actually training "which embedding to select out of 8192 options", not "which value/scalar to use":

...
# trained semantic codes for the selected audio
# initialization
semantic_vocab = 8192
self.semantic_logits = nn.Parameter(
    torch.randn(num_frames, semantic_vocab)
)
...

# Soft probabilities (for gradients)
probs = F.softmax(self.semantic_logits, dim=-1)  # [T, 8192]

# Hard selection (for the forward pass)
hard_codes = probs.argmax(dim=-1)  # [T] integer indices in range 0-8191

# Get the semantic embedding table
# uses the real codebook, because each of the 8192 semantic codes is mapped
# to a 256-dim embedding
sem_embedding = tokenizer.quantizer.semantic_codebook.embedding  # [8192, 256]

# Soft embedding (weighted sum for gradients)
soft_emb = torch.matmul(probs, sem_embedding)  # [T, 256]

# Hard embedding (discrete lookup for forward)
hard_emb = F.embedding(hard_codes, sem_embedding)  # [T, 256]

# STE: forward uses hard, backward uses soft
semantic_emb = soft_emb + (hard_emb - soft_emb).detach()  # [T, 256]

A full implementation of the training, with the STE illustrated and explained, is here.
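To see that gradients really flow through the STE trick, here is a toy end-to-end loop where a frozen random linear map stands in for the real (frozen) Voxtral Codec decoder. Everything here, the toy decoder, the shapes, and the hyperparameters, is illustrative only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
T, levels = 10, 22  # frames and FSQ levels

# Frozen stand-in for the real decoder: we only optimize the codes
decoder = nn.Linear(36, 64)
for p in decoder.parameters():
    p.requires_grad_(False)
target = torch.randn(T, 64)  # stand-in for the target audio features

acoustic_values = nn.Parameter(torch.randn(T, 36))
opt = torch.optim.Adam([acoustic_values], lr=0.1)

losses = []
for step in range(200):
    opt.zero_grad()
    scaled = (torch.tanh(acoustic_values) + 1) / 2 * (levels - 1)
    # STE: forward pass uses the rounded codes, backward the continuous values
    codes = scaled + (scaled.round() - scaled).detach()
    dequant = codes / (levels - 1) * 2 - 1  # back to [-1, 1] for the decoder
    loss = F.l1_loss(decoder(dequant), target)
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

Despite the non-differentiable rounding in the forward pass, the reconstruction loss goes down, which is exactly what the real training relies on.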

The issues arise because the signal to reconstruct is complicated

In every machine learning modeling task, the higher-quality the training signal you can provide, the better training goes. That is especially important when modeling high-frequency, high-dimensional data. In these experiments, using L1 loss as the reconstruction loss alone led to poor results: the model hit local optima and could not converge further with only the signal from a high-frequency data-reconstruction loss. In the audio processing domain there is a list of common techniques to provide additional training signals: the short-time Fourier transform (STFT) and Mel spectrograms. Both techniques transform the signal from the time domain into a time-frequency representation through frequency bins and filterbanks. Following the description in the Mistral paper, I applied a similar STFT as an additional loss on top of my L1 reconstruction loss: "…A multi-resolution discriminator with 8 STFT sizes (2296, 1418, 876, 542, 334, 206, 126, 76) is trained together with the codec…"
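A sketch of such an auxiliary loss, under my simplification: instead of the paper's multi-resolution discriminator, the same eight FFT sizes are used for a plain L1 distance between STFT magnitudes:

```python
import torch

def multi_res_stft_loss(pred: torch.Tensor, target: torch.Tensor,
                        fft_sizes=(2296, 1418, 876, 542, 334, 206, 126, 76)):
    """L1 distance between STFT magnitudes at several resolutions.

    pred, target: 1-D waveforms of equal length. The FFT sizes follow the
    discriminator sizes quoted from the paper; using them in a magnitude
    loss rather than a discriminator is a simplification.
    """
    loss = 0.0
    for n_fft in fft_sizes:
        if pred.shape[-1] < n_fft:
            continue  # skip resolutions longer than the signal
        window = torch.hann_window(n_fft)
        s_pred = torch.stft(pred, n_fft, hop_length=n_fft // 4,
                            window=window, return_complex=True).abs()
        s_tgt = torch.stft(target, n_fft, hop_length=n_fft // 4,
                           window=window, return_complex=True).abs()
        loss = loss + (s_pred - s_tgt).abs().mean()
    return loss
```

Summing over several resolutions gives the optimizer gradient signal at multiple time-frequency scales at once, which is what the single-waveform L1 loss was missing.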

There are also voice-cloning-specific losses we can apply. There are models able to create a speaker embedding, which is used for speaker identification and diarization, for example SpeechBrain's models. These models can produce embeddings for a voice. If two embeddings are close to each other, it means the two embedded audios are very likely from the same speaker; otherwise, from different ones. We can apply a speaker loss as an additional loss component that should push the model to create codes that the Voxtral Codec's decoder turns into a voice similar to the target one.
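Such a speaker loss can be as simple as one minus the cosine similarity between the two speaker embeddings. Here the embeddings are plain tensors; in practice they would come from a pretrained speaker-verification model (for example SpeechBrain's ECAPA-TDNN):

```python
import torch
import torch.nn.functional as F

def speaker_loss(pred_emb: torch.Tensor, target_emb: torch.Tensor) -> torch.Tensor:
    """1 - cosine similarity: near 0 when the embeddings point the same way
    (same speaker), approaching 2 when they point in opposite directions."""
    return 1 - F.cosine_similarity(pred_emb, target_emb, dim=-1).mean()
```

Because it is differentiable, it can simply be added to the L1 and STFT reconstruction terms with some weight.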

Results of training

I trained the described model for 5000 epochs (with a single sample). It took me around an hour on a Mac with an M-series processor to recreate the codes for 8 seconds of audio. The speaker diarization loss component made training slower, but led to a better final loss.

The audio I reconstructed from the trained codes is here. You can hear that it is very similar to the first 8-second fragment of the target audio.

In this training setup we don't do any evaluation epochs, because we have just one sample and our task is to overfit the trained parameters so that they reconstruct the final audio. It may sound a little unusual, but this is exactly the case where overfitting is the objective.

AI usage disclaimer

I was using LLM tools to help me in this research and to accelerate experimentation. It was very helpful. However, there were several episodes when I had to modify the LLM's decisions to make them work, around some ML and DL details.

Please use AI tools responsibly!

    Contacts

My LinkedIn, if anybody wants to connect: Roman Smirnov


