    A Guide to Voice Cloning on Voxtral with a Missing Encoder

By Editor Times Featured | April 10, 2026 | 14 Mins Read


Voxtral TTS has recently been released by Mistral. It is a powerful text-to-speech model that beats ElevenLabs v2.5 Flash, according to Mistral's tests. Beyond state-of-the-art performance on text-to-speech tasks (among models of comparable size), Mistral announced voice cloning capabilities and published the weights of the model. That attracted huge interest, because a high-quality text-to-speech (TTS) model that is small enough for local inference and supports voice cloning is something in demand across both business and the community.

The issue, however, is that Mistral removed the encoder weights of the audio autoencoder, so users cannot clone arbitrary voices; we can only use the voices Mistral prepared for us. That is a big limitation compared to the paper and the initial announcement.

Here I provide (1) an overview of the Voxtral TTS architecture with some technical details and comparisons, (2) my research on the audio autoencoder and how it actually encodes audio, and (3) a study of how we can still obtain a representation for any audio to potentially use voice cloning, even though the published weights are truncated at the encoder.

Why I'm sure Voxtral TTS is a technology worth understanding (short personal story)

Years ago, I was working at Skyeng, where we were building our automatic speech recognition (ASR) system. It was 2021, Whisper had not yet been released by OpenAI, and ASR was a hot topic, especially for unusual speech: our ASR was for non-native speakers.

At the time I was reading many papers about audio understanding and encoding (mostly using transformer encoders). When I read the Wav2Vec2 paper I got a strong feeling that we should adopt this technology, understand it in tiny detail, and use it, because I was sure the approaches and the state of the mentioned technologies were general enough to persist as the new classics in the audio understanding and signal processing domain. I was right. And with the Voxtral TTS release I got the same feeling.

    Voxtral TTS overview

Voxtral-4B-TTS is a 4-billion-parameter model that uses an autoregressive large language model (LLM-)based 3B backbone (the Ministral 3B model). Simply put, the model takes as input audio tokens that represent the voice to clone and text tokens to be voiced. Similar to LLMs, the model autoregressively generates tokens, with the difference that these tokens are voice tokens. The paper describing the model contains the following illustration of this part:

Autoregressive audio generator (based on an LLM), source: https://arxiv.org/pdf/2603.25551

Important things to note: both the reference audio and the generated audio can be split into non-overlapping, independent tokens, each representing 80 ms of audio; and there is a nontrivial head consisting of a linear head and a flow-matching transformer that creates the semantic and acoustic components (also called tokens) of an individual voice token. These two details make this model special. Independent audio (voice) tokens make it capable of native audio streaming. And this head is a clever combination of two existing approaches to audio generation: discrete token prediction (the linear head) and complex-distribution approximation through a diffusion process, using diffusion models or flow-matching transformers (the same approach we usually use for image and video generation).
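As a quick sanity check of the streaming granularity: 80 ms per token means 12.5 voice tokens per second (the helper name below is mine, for illustration only):

```python
FRAME_MS = 80  # each non-overlapping voice token covers 80 ms of audio

def num_tokens(duration_s: float) -> int:
    """Number of voice tokens needed for a given audio duration,
    i.e. 12.5 tokens per second at 80 ms per token."""
    return int(duration_s * 1000 / FRAME_MS)
```

So an 8-second clip, like the one used later in this post, corresponds to 100 voice tokens.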

That looks conceptually elegant and beautiful. But there is one more required component: the audio autoencoder. The audio autoencoder is responsible for producing the acoustic and semantic tokens that are fundamental to the model. The architecture overview of the 350M autoencoder from the original paper:

Audio autoencoder, source: https://arxiv.org/pdf/2603.25551

The audio autoencoder, Voxtral Codec, is the model that produces 37 discrete tokens for each 80 ms audio frame (in the middle of the architecture, inside the quantization block) and can reconstruct the audio back from these discrete tokens. The classic description of an autoencoder is encoder -> bottleneck -> decoder. Mistral has not released the encoder, however. This means we can still generate audio after the autoregressive model produces audio tokens, by applying the decoder of the Voxtral Codec, but we cannot natively feed some audio in to get audio tokens that we could use as the voice condition in the autoregressive model, since Mistral has not released the weights of the encoder and the decoder is not invertible (like most deep learning models).

Another interesting detail: the bottleneck mentions two kinds of tokens, semantic and acoustic. The implementation here is also worth understanding. The encoder produces a 292-dim latent that is split into a 256-dim vector and 36 single-dimension scalars. The 256-dim vector is mapped to codebook embeddings and is then represented by a code, the ID of the closest codebook entry (that is vector quantization happening). Each of the 36 single-dimension scalars is mapped to a range from 0 to 21 using a scaled tanh activation. These elements are already a simplification over some of the earlier approaches that used residual vector quantization.
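A minimal sketch of that bottleneck, assuming the 292 = 256 + 36 split, an 8192-entry codebook, and 22 FSQ levels as described in the paper; the function and tensor names are mine:

```python
import torch

def quantize_latent(latent: torch.Tensor, codebook: torch.Tensor, levels: int = 22):
    """Split a per-frame 292-dim latent into one VQ code and 36 FSQ codes.

    latent:   [T, 292] frame latents (what the missing encoder would produce)
    codebook: [8192, 256] semantic codebook embeddings
    """
    semantic, acoustic = latent[:, :256], latent[:, 256:]  # [T, 256], [T, 36]
    # Vector quantization: the ID of the closest codebook entry per frame
    semantic_codes = torch.cdist(semantic, codebook).argmin(dim=-1)  # [T]
    # Finite scalar quantization: scaled tanh into [0, levels - 1], then round
    scaled = (torch.tanh(acoustic) + 1) / 2 * (levels - 1)
    acoustic_codes = scaled.round().long()  # [T, 36], integer values in 0..21
    return semantic_codes, acoustic_codes
```

Together that gives the 1 + 36 = 37 discrete tokens per 80 ms frame mentioned above.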

A few words about semantic tokens. The embeddings of the semantic tokens are linearly projected to match the logits of the Whisper decoder when it performs speech-to-text on the same audio. In other words, a constraint is applied that links the semantic tokens to the latent state the Whisper (ASR) model uses right before producing text from speech. That is why these tokens are called semantic: they are supposed to be related to the text-related latents, and text is related to semantics.

The question I got here (and tested later): do these semantic tokens really represent the meaning, the words that should be voiced, while the acoustic ones represent the voice itself?

A copy of some technical notes I took for my conspectus after reading the Voxtral paper (with some cross-references)

Voxtral-TTS (TTS from Mistral), https://arxiv.org/pdf/2603.25551. Similar to Qwen-3 TTS (the 12Hz variant) to some extent. Mistral trained their own codec (encoder-decoder architecture). Here the encoder produces a 292-dimension latent state vector that is split into 256 dimensions for VQ and 36 for finite scalar quantization (FSQ). In comparison to Mimi and Qwen-3, Mistral uses FSQ instead of RVQ and concatenates instead of summing the vectors in the decoder backbone. Another difference: VQ vectors are linearly projected to match Whisper decoder hidden states (previously the self-supervised WavLM was used), which controls semantics.

These vectors are used in the decoder-only 3B model to do voice cloning and text-to-speech. During training, noise replacement and quantization skipping are applied in around 50% of cases to make the model more robust. For each of the states there is an embedding; they are summed and used as the input to the transformer. This transformer autoregressively predicts hidden states for audio tokens and (end of audio) tokens. These hidden states are used with a linear head to predict the VQ-based semantic token and to condition a 3-layer flow matching model (note: diffusion with an improved objective) that generates continuous acoustic tokens from noise (later quantized with FSQ); classifier-free guidance (CFG) can be applied. Audio is split into short non-interleaving parts. During training, both the semantic cross-entropy loss and the flow matching loss for a single sampled timestep are computed (the loss is reweighted, for example lowering the weight for silences). Post-training DPO helps to further improve the results.

Finite scalar quantization (FSQ), https://arxiv.org/pdf/2309.15505. A simple approach that improves on RVQ and VQ (VQ is plain vector quantization, while RVQ is residual vector quantization). RVQ is based on refining the VQ codebook with additional smaller codebooks, so to reconstruct we need to sum a set of codebook embeddings. FSQ is simple (and produces independent codes compared to RVQ): we apply a (scaled) tanh activation to each dimension over finite values, round the result to the nearest integer, and use it as the code. It helps utilize codebooks better and trains in a simpler way (without trainable codebook embeddings).

Do semantic tokens really represent semantics?

Some tests I was about to conduct:

1. Do semantic tokens represent the words to be voiced? If they really represent words, then we can manipulate these tokens to change the meaning of the speech while preserving the same voice. A more realistic outcome would be just broken words, or nonsense in the generated speech, while the voice stays the same.
2. How robust is Voxtral Codec's decoder to noise in the codes? If we randomly replace some of the codes and the audio remains similar, it means there may be a way to approximate the audio through some gradient-based code value selection. Otherwise, if a small change in the codes destroys the audio, it is nearly impossible to reconstruct the codes from the audio without the actual encoder.
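The second test can be sketched as randomly replacing a fraction of the discrete codes with random valid ones before decoding (the function name and the corruption rate are my choices):

```python
import torch

def corrupt_codes(codes: torch.Tensor, vocab_size: int,
                  frac: float = 0.3, seed: int = 0) -> torch.Tensor:
    """Replace a random fraction of discrete codes with uniformly sampled codes."""
    g = torch.Generator().manual_seed(seed)
    mask = torch.rand(codes.shape, generator=g) < frac  # which positions to corrupt
    random_codes = torch.randint(0, vocab_size, codes.shape, generator=g)
    return torch.where(mask, random_codes, codes)
```

Decoding the corrupted codes and listening to the result tells us how tolerant the decoder is.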

We know that we do not have weights for the encoder part of the Voxtral Codec, but we have the decoder, we have the autoregressive backbone weights, and we have some voices Mistral provided as references (we can generate speech from text using these voices).

    The pipeline:

• Having the reference voice embeddings, we can apply the coordinate descent algorithm to extract the codes for the voice. Here I have a script that does that.
• Decoder weights and codes are not sufficient for the Voxtral Codec's decoder to work; we also need the architecture implemented in code. The vllm-omni project has a Voxtral implementation under a permissive Apache 2.0 license. I used Claude Code to extract the Voxtral Codec architecture code from vllm-omni. The standalone Voxtral Codec architecture extracted from vllm-omni is in my GitHub repository.
• I prepared a Jupyter Notebook that takes voice embeddings (provided by Mistral), reconstructs their codes, optionally destroys some of the codes (semantic and some acoustic), and reconstructs the audio from them using the Voxtral Codec's decoder with the real weights.

Here you can download and listen to the audio reconstructed from the embeddings: audio

Here is the audio reconstructed from the same embeddings, but with semantic and some acoustic tokens randomized (following the script): audio

From these audio files, it is clear that semantic tokens, despite their name, do not really represent the actual words or meanings to be voiced. And what is more important, the decoder is robust to some changes in the codes, which means we can try to apply gradient descent to directly train codes for the actual audio.

A gradient descent approach to reconstruct codes when the encoder is missing

By training the codes themselves I mean we initialize a single layer of the form nn.Parameter(torch.randn(num_frames, num_codes_per_frame)), where for each frame we have 37 codes, and train it on a real-audio reconstruction loss. That would work if we were operating in continuous space and the target object to reconstruct were some simple signal, not a high-frequency audio waveform.

Problems due to discrete tokens

Each token is discrete, similar to tokens in an LLM; if there are two discrete tokens A and B, we cannot gradually optimize a transition from A to B. In Voxtral we have separate semantic and acoustic tokens, and the acoustic ones are easier to model because of the finite scalar quantization (FSQ) they use: they are obtained from continuous space through a rounding operation. But still, both semantic and acoustic tokens require the straight-through estimator (STE) on the forward step and a differentiable transition on the backward step.

For acoustic tokens we can simply apply STE directly:

...
# trained acoustic codes for the selected audio
# initialization
self.acoustic_values = nn.Parameter(
    torch.randn(num_frames, 36)
)
...

# number of levels for a single acoustic token
# according to the paper and the vllm-omni implementation
acoustic_levels = 22

acoustic_normalized = torch.tanh(self.acoustic_values)

# Quantize to discrete levels
acoustic_scaled = ((acoustic_normalized + 1) / 2) * (acoustic_levels - 1)
acoustic_quantized = acoustic_scaled.round()

# STE: forward uses the quantized values, backward uses the continuous ones
acoustic_codes = acoustic_scaled + (acoustic_quantized - acoustic_scaled).detach()

For semantic tokens it is more complicated. We cannot apply a tanh activation, because each semantic code is associated with a multi-dimensional embedding, so we are actually training "which embedding to select out of 8192 options", not "which value/scalar to use":

...
# trained semantic codes for the selected audio
# initialization
semantic_vocab = 8192
self.semantic_logits = nn.Parameter(
    torch.randn(num_frames, semantic_vocab)
)
...

# Soft probabilities (for gradients)
probs = F.softmax(self.semantic_logits, dim=-1)  # [T, 8192]

# Hard selection (for the forward pass)
hard_codes = probs.argmax(dim=-1)  # [T] integer indices in range 0-8191

# Get the semantic embedding table
# uses the real codebook, because each of the 8192 semantic codes is mapped
# to a 256-dim embedding
sem_embedding = tokenizer.quantizer.semantic_codebook.embedding  # [8192, 256]

# Soft embedding (weighted sum for gradients)
soft_emb = torch.matmul(probs, sem_embedding)  # [T, 256]

# Hard embedding (discrete lookup for forward)
hard_emb = F.embedding(hard_codes, sem_embedding)  # [T, 256]

# STE: forward uses hard, backward uses soft
semantic_emb = soft_emb + (hard_emb - soft_emb).detach()  # [T, 256]

A full implementation of the training, with the STE illustrated and explained, is here.
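To see that gradients really flow through the STE trick, here is a toy end-to-end loop where a frozen random linear map stands in for the real (frozen) Voxtral Codec decoder. Everything here, the toy decoder, the shapes, and the hyperparameters, is illustrative only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
T, levels = 10, 22  # frames and FSQ levels

# Frozen stand-in for the real decoder: we only optimize the codes
decoder = nn.Linear(36, 64)
for p in decoder.parameters():
    p.requires_grad_(False)
target = torch.randn(T, 64)  # stand-in for the target audio features

acoustic_values = nn.Parameter(torch.randn(T, 36))
opt = torch.optim.Adam([acoustic_values], lr=0.1)

losses = []
for step in range(200):
    opt.zero_grad()
    scaled = (torch.tanh(acoustic_values) + 1) / 2 * (levels - 1)
    # STE: forward pass uses the rounded codes, backward the continuous values
    codes = scaled + (scaled.round() - scaled).detach()
    dequant = codes / (levels - 1) * 2 - 1  # back to [-1, 1] for the decoder
    loss = F.l1_loss(decoder(dequant), target)
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

Despite the non-differentiable rounding in the forward pass, the reconstruction loss goes down, which is exactly what the real training relies on.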

The issues arise because the signal to reconstruct is complicated

In every machine learning modeling task, the higher-quality the training signal you can provide, the better training goes. That is especially important when modeling high-frequency, high-dimensional data. In these experiments, using L1 loss as the reconstruction loss alone led to poor results: the model hit local optima and could not converge further with only the signal from a high-frequency data-reconstruction loss. In the audio processing domain there is a list of common techniques to provide additional training signals: the short-time Fourier transform (STFT) and Mel spectrograms. Both techniques transform the signal from the time domain into a time-frequency representation through frequency bins and filterbanks. Following the description in the Mistral paper, I applied a similar STFT as an additional loss on top of my L1 reconstruction loss: "…A multi-resolution discriminator with 8 STFT sizes (2296, 1418, 876, 542, 334, 206, 126, 76) is trained together with the codec…"
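A sketch of such an auxiliary loss, under my simplification: instead of the paper's multi-resolution discriminator, the same eight FFT sizes are used for a plain L1 distance between STFT magnitudes:

```python
import torch

def multi_res_stft_loss(pred: torch.Tensor, target: torch.Tensor,
                        fft_sizes=(2296, 1418, 876, 542, 334, 206, 126, 76)):
    """L1 distance between STFT magnitudes at several resolutions.

    pred, target: 1-D waveforms of equal length. The FFT sizes follow the
    discriminator sizes quoted from the paper; using them in a magnitude
    loss rather than a discriminator is a simplification.
    """
    loss = 0.0
    for n_fft in fft_sizes:
        if pred.shape[-1] < n_fft:
            continue  # skip resolutions longer than the signal
        window = torch.hann_window(n_fft)
        s_pred = torch.stft(pred, n_fft, hop_length=n_fft // 4,
                            window=window, return_complex=True).abs()
        s_tgt = torch.stft(target, n_fft, hop_length=n_fft // 4,
                           window=window, return_complex=True).abs()
        loss = loss + (s_pred - s_tgt).abs().mean()
    return loss
```

Summing over several resolutions gives the optimizer gradient signal at multiple time-frequency scales at once, which is what the single-waveform L1 loss was missing.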

There are also voice-cloning-specific losses we can apply. There are models able to create a speaker embedding, which is used for speaker identification and diarization, for example SpeechBrain's models. These models can produce embeddings for a voice. If two embeddings are close to each other, it means the two embedded audios are very likely from the same speaker; otherwise, from different ones. We can apply a speaker loss as an additional loss component that should push the model to create codes that the Voxtral Codec's decoder turns into a voice similar to the target one.
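Such a speaker loss can be as simple as one minus the cosine similarity between the two speaker embeddings. Here the embeddings are plain tensors; in practice they would come from a pretrained speaker-verification model (for example SpeechBrain's ECAPA-TDNN):

```python
import torch
import torch.nn.functional as F

def speaker_loss(pred_emb: torch.Tensor, target_emb: torch.Tensor) -> torch.Tensor:
    """1 - cosine similarity: near 0 when the embeddings point the same way
    (same speaker), approaching 2 when they point in opposite directions."""
    return 1 - F.cosine_similarity(pred_emb, target_emb, dim=-1).mean()
```

Because it is differentiable, it can simply be added to the L1 and STFT reconstruction terms with some weight.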

Results of training

I trained the described model for 5000 epochs (with a single sample). It took me around an hour on a Mac with an M-series processor to recreate the codes for 8 seconds of audio. The speaker diarization loss component made training slower, but led to a better final loss.

The audio I reconstructed from the trained codes is here. You can hear that it is very similar to the first 8-second fragment of the target audio.

In this training setup we don't do any evaluation epochs, because we have just one sample and our task is to overfit the trained parameters so that they reconstruct the final audio. It may sound a little unusual, but this is exactly the case where overfitting is the objective.

AI usage disclaimer

I was using LLM tools to help me in this research and to accelerate experimentation. It was very helpful. However, there were several episodes when I had to modify the LLM's decisions to make them work, around some ML and DL details.

Please use AI tools responsibly!

    Contacts

My LinkedIn, if anybody wants to connect: Roman Smirnov


