
Sesame Speech Model: How This Viral AI Model Generates Human-Like Speech

By Editor Times Featured, April 19, 2025


Sesame revealed a demo of their newest speech-to-speech model: a conversational AI agent that is genuinely good at speaking. It gives relevant answers, speaks with expression, and is honestly just very fun and interactive to play with.

Note that a technical paper isn't out yet, but they do have a short blog post that gives a lot of information about the techniques they used and the previous algorithms they built upon.

Luckily, they provided enough information for me to write this article and make a YouTube video out of it. Read on!

Training a Conversational Speech Model

Sesame is a Conversational Speech Model, or CSM. It takes both text and audio as input and generates speech as audio. While they haven't revealed their training data sources in the articles, we can still make an educated guess. The blog post heavily cites another CSM, 2024's Moshi, and fortunately, the creators of Moshi did reveal their data sources in their paper. Moshi uses 7 million hours of unsupervised speech data, 170 hours of natural and scripted conversations (for multi-stream training), and 2,000 more hours of telephone conversations (the Fisher dataset).


Sesame builds upon the Moshi paper (2024)

But what does it really take to generate audio?

In raw form, audio is just a long sequence of amplitude values: a waveform. For example, if you're sampling audio at 24 kHz, you are capturing 24,000 float values every second.

There are 24,000 values here to represent 1 second of speech! (Image generated by author)

Of course, it is quite resource-intensive to process 24,000 float values for just one second of data, especially because transformer computations scale quadratically with sequence length. It would be great if we could compress this signal and reduce the number of samples required to process the audio.
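To get a feel for why sequence length matters so much, here is a rough back-of-the-envelope sketch in Python. It assumes one token per audio sample and counts only the pairwise scores in a single full self-attention layer. The `attention_pairs` helper and the 12-frame comparison point (roughly the rate the Mimi encoder discussed in this article reaches) are illustrative choices, not numbers from the Sesame blog post.

```python
# Rough comparison of self-attention cost at two sequence lengths.
# A full self-attention layer scores every query against every key,
# so the number of scores grows with the square of the sequence length.

def attention_pairs(seq_len: int) -> int:
    """Number of query-key pairs scored by one full self-attention layer."""
    return seq_len * seq_len

raw_samples_per_second = 24_000      # 24 kHz waveform, one token per sample
compressed_frames_per_second = 12    # illustrative compressed rate (assumption)

raw_cost = attention_pairs(raw_samples_per_second)
compressed_cost = attention_pairs(compressed_frames_per_second)

print(raw_cost)                     # 576000000
print(raw_cost // compressed_cost)  # 4000000
```

Under these assumptions, one second of raw audio needs about four million times more attention scores than one second of compressed frames.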

We will take a deep dive into the Mimi encoder and specifically Residual Vector Quantizers (RVQ), which are the backbone of audio and speech modeling in deep learning today. We will end the article by learning how Sesame generates audio using its special dual-transformer architecture.

Preprocessing audio

Compression and feature extraction are where convolution helps us. Sesame uses the Mimi speech encoder to process audio. Mimi was introduced in the aforementioned Moshi paper as well. Mimi is a self-supervised audio encoder-decoder model that first converts audio waveforms into discrete "latent" tokens and then reconstructs the original signal. Sesame only uses the encoder section of Mimi to tokenize the input audio. Let's learn how.

Mimi takes the raw speech waveform at 24 kHz and passes it through several strided convolution layers to downsample the signal, with stride factors of 4, 5, 6, 8, and 2. This means that the first CNN block downsamples the audio by 4x, the next by 5x, then 6x, and so on. In the end, it downsamples by a factor of 1920 (4 × 5 × 6 × 8 × 2), reducing the signal to just 12.5 frames per second.
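The downsampling arithmetic above can be checked with a few lines of Python. This is only a sketch of the rate calculation using the stride factors quoted in the text; it does not run any convolutions, and the variable names are mine.

```python
from math import prod

sample_rate_hz = 24_000
strides = [4, 5, 6, 8, 2]  # stride factors quoted in the text

# Track the frame rate after each strided block.
rates_after_each_block = []
rate = float(sample_rate_hz)
for stride in strides:
    rate /= stride
    rates_after_each_block.append(rate)

total_downsampling = prod(strides)
frame_rate_hz = sample_rate_hz / total_downsampling

print(rates_after_each_block)  # [6000.0, 1200.0, 200.0, 25.0, 12.5]
print(total_downsampling)      # 1920
print(frame_rate_hz)           # 12.5
```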

The convolution blocks also project the original float values to an embedding dimension of 512. Each embedding aggregates the local features of the original 1D waveform. One second of audio is now represented as around 12 vectors of size 512. This way, Mimi reduces the sequence length from 24,000 to about 12 and converts the samples into dense continuous vectors.

Before applying any quantization, the Mimi encoder downsamples the input 24 kHz audio by 1920 times and embeds it into 512 dimensions. In other words, you get 12.5 frames per second, with each frame a 512-dimensional vector. (Image from author's video)
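To make the shapes concrete, here is a toy NumPy stand-in for the encoder. It is not Mimi: instead of a stack of learned strided convolutions, it assumes a single random linear map applied to non-overlapping windows of 1920 samples, which is enough to show how 24,000 samples turn into roughly 12 vectors of size 512. The `toy_encode` function and the random `projection` matrix are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

SAMPLE_RATE = 24_000
HOP = 1920        # total downsampling factor (4 * 5 * 6 * 8 * 2)
EMBED_DIM = 512   # embedding size quoted in the text

def toy_encode(waveform: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Cut the waveform into non-overlapping windows and project each one."""
    n_frames = len(waveform) // HOP                  # incomplete last window is dropped
    frames = waveform[: n_frames * HOP].reshape(n_frames, HOP)
    return frames @ projection                       # (n_frames, EMBED_DIM)

# One second of fake audio and a random (untrained) projection.
waveform = rng.standard_normal(SAMPLE_RATE).astype(np.float32)
projection = rng.standard_normal((HOP, EMBED_DIM)).astype(np.float32) / np.sqrt(HOP)

embeddings = toy_encode(waveform, projection)
print(embeddings.shape)  # (12, 512)
```

The 12.5 frames per second become 12 whole frames here because the half-filled final window is discarded in this simplified version.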

What is Audio Quantization?

Given the continuous embeddings obtained after the convolution layers, we want to tokenize the input speech. If we can represent speech as a sequence of tokens, we can apply standard language-modeling transformers to train generative models.

Mimi uses a Residual Vector Quantizer, or RVQ tokenizer, to achieve this. We will talk about the residual part soon, but first, let's look at what a simple vanilla vector quantizer does.

Vector Quantization

The idea behind Vector Quantization is simple: you train a codebook, which is a collection of, say, 1000 vector codes (randomly initialized and then trained), all of dimension 512 (the same as your embedding dimension).

A vanilla vector quantizer. A codebook of embeddings is trained. Given an input embedding, we map (quantize) it to the nearest codebook entry. (Screenshot from author's video)

Then, given the input vector, we map it to the nearest vector in our codebook, basically snapping a point to its nearest cluster center. This means we have effectively created a fixed vocabulary of tokens to represent each audio frame, because whatever the input frame embedding may be, we will represent it with the nearest cluster centroid. If you want to learn more about Vector Quantization, check out my video on this topic, where I go much deeper into it.

More about Vector Quantization! (Video by author)
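The "snap to the nearest codebook entry" step can be sketched in NumPy as below. This is a minimal illustration that assumes squared Euclidean distance and a random, untrained codebook. The `quantize` helper is my own name, and the sizes (1000 codes of dimension 512) simply mirror the example in the text.

```python
import numpy as np

def quantize(vectors: np.ndarray, codebook: np.ndarray):
    """Map each input vector to the index and value of its nearest code."""
    # Squared distances via ||v||^2 - 2 v.c + ||c||^2, shape (n_vectors, n_codes).
    dists = (
        (vectors ** 2).sum(axis=1)[:, None]
        - 2.0 * vectors @ codebook.T
        + (codebook ** 2).sum(axis=1)[None, :]
    )
    indices = dists.argmin(axis=1)       # one discrete token per input vector
    return indices, codebook[indices]    # tokens and their code vectors

rng = np.random.default_rng(0)
codebook = rng.standard_normal((1000, 512))   # untrained codebook (assumption)
frames = rng.standard_normal((12, 512))       # e.g. one second of frame embeddings

tokens, quantized = quantize(frames, codebook)
print(tokens.shape, quantized.shape)  # (12,) (12, 512)
```

Each audio frame is now a single integer in the range 0 to 999, which is exactly the kind of token a transformer can model.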

Residual Vector Quantization

The problem with simple vector quantization is that the loss of information may be too high, because we are mapping each vector to its cluster's centroid. This "snap" is rarely perfect, so there is usually an error between the original embedding and the nearest codebook entry.

The big idea of Residual Vector Quantization is that it doesn't stop at having just one codebook. Instead, it uses multiple codebooks to represent the input vector.

1. First, you quantize the original vector using the first codebook.
2. Then, you subtract that centroid from your original vector. What you're left with is the residual: the error that wasn't captured in the first quantization.
3. Now take this residual and quantize it again, using a second codebook full of brand-new code vectors, again by snapping it to the nearest centroid.
4. Subtract that too, and you get a smaller residual. Quantize again with a third codebook, and you can keep doing this for as many codebooks as you want.

Residual Vector Quantizers (RVQ) hierarchically encode the input embeddings by using a new codebook and VQ layer to represent the previous codebook's error. (Illustration by the author)

Each step hierarchically captures a little more detail that was missed in the previous round. If you repeat this for, say, N codebooks, you get a collection of N discrete tokens, one from each stage of quantization, to represent one audio frame.
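The quantize-subtract-repeat loop described above can be sketched in NumPy as follows. This only illustrates the mechanics with random, untrained codebooks, so the residual is not guaranteed to shrink at every stage the way it would with trained ones. The function names and the choice of 4 codebooks are assumptions made for the example.

```python
import numpy as np

def nearest_code(vector: np.ndarray, codebook: np.ndarray) -> int:
    """Index of the codebook entry closest to `vector` (squared Euclidean)."""
    return int(((codebook - vector) ** 2).sum(axis=1).argmin())

def rvq_encode(vector: np.ndarray, codebooks: list):
    """Quantize a vector with a chain of codebooks, one token per codebook."""
    residual = vector.astype(np.float64).copy()
    tokens = []
    for codebook in codebooks:
        index = nearest_code(residual, codebook)   # quantize what is left
        tokens.append(index)
        residual = residual - codebook[index]      # keep only the uncaptured error
    return tokens, residual

rng = np.random.default_rng(0)
num_codebooks, codebook_size, dim = 4, 1000, 512   # illustrative sizes
codebooks = [rng.standard_normal((codebook_size, dim)) for _ in range(num_codebooks)]
frame = rng.standard_normal(dim)

tokens, residual = rvq_encode(frame, codebooks)

# The selected codes plus the final residual add back up to the input frame.
approximation = sum(cb[t] for cb, t in zip(codebooks, tokens))
print(len(tokens))                                    # 4
print(np.allclose(approximation + residual, frame))  # True
```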

The coolest thing about RVQs is that they are designed to have a high inductive bias towards capturing the most essential content in the very first quantizer. In the subsequent quantizers, they learn more and more fine-grained features.

If you're familiar with PCA, you can think of the first codebook as containing the first principal components, capturing the most critical information. The subsequent codebooks represent higher-order components, containing information that adds more details.

Residual Vector Quantizers (RVQ) use multiple codebooks to encode the input vector, one entry from each codebook. (Screenshot from author's video)

Acoustic vs Semantic Codebooks

Since Mimi is trained on the task of audio reconstruction, the encoder compresses the signal to the discretized latent space, and the decoder reconstructs it back from the latent space. When optimizing for this task, the RVQ codebooks learn to capture the essential acoustic content of the input audio inside the compressed latent space.

Mimi also separately trains a single codebook (vanilla VQ) that focuses only on embedding the semantic content of the audio. This is why Mimi is called a split-RVQ tokenizer: it divides the quantization process into two independent parallel paths, one for semantic information and another for acoustic information.

The Mimi Architecture (Source: Moshi paper) License: Free

To train semantic representations, Mimi uses knowledge distillation with an existing speech model called WavLM as a semantic teacher. Basically, Mimi introduces an additional loss function that decreases the cosine distance between the semantic code and the WavLM-generated embedding.
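A cosine-distance loss of this kind can be sketched in a few lines of NumPy. This is a simplified illustration, not the loss from the Moshi paper: it assumes the student (semantic code) and teacher (WavLM) embeddings have already been brought to the same shape, and the `cosine_distance` and `distillation_loss` names are hypothetical.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """1 - cosine similarity, computed row by row."""
    a_unit = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_unit = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return 1.0 - (a_unit * b_unit).sum(axis=-1)

def distillation_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Mean cosine distance between student and teacher frame embeddings."""
    return float(cosine_distance(student, teacher).mean())

rng = np.random.default_rng(0)
teacher = rng.standard_normal((12, 512))   # stand-in for teacher embeddings

print(round(distillation_loss(teacher, teacher), 6))    # 0.0 (identical directions)
print(round(distillation_loss(-teacher, teacher), 6))   # 2.0 (opposite directions)
```

Minimizing this value pushes the student embedding to point in the same direction as the teacher's, regardless of magnitude.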


Audio Decoder

Given a conversation containing text and audio, we first convert it into a sequence of token embeddings using the text and audio tokenizers. This token sequence is then fed into a transformer model as a time series. In the blog post, this model is referred to as the Autoregressive Backbone Transformer. Its task is to process this time series and output the "zeroth" codebook token.

A lighter-weight transformer called the audio decoder then reconstructs the remaining codebook tokens conditioned on this zeroth code generated by the backbone transformer. Note that the zeroth code already contains a lot of information about the history of the conversation, since the backbone transformer has visibility of the entire past sequence. The lightweight audio decoder operates only on the zeroth token and generates the other N-1 codes. These codes are generated using N-1 distinct linear layers that output the probability of choosing each code from their corresponding codebooks.

You can think of this process as predicting a text token from the vocabulary in a text-only LLM. The difference is that a text-based LLM has a single vocabulary, whereas the RVQ tokenizer has multiple vocabularies in the form of the N codebooks, so you need to train a separate linear layer to model the codes for each.

The Sesame Architecture (Illustration by the author)
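The idea of one linear layer per codebook can be illustrated with the NumPy sketch below. It is heavily simplified and makes several assumptions: the hidden size, the number of codebooks, and the codebook size are made up, the weights are random, and a single `decoder_state` vector stands in for whatever hidden states the real audio decoder produces. Codes are picked greedily here, whereas a real system may sample from the probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN_DIM = 64        # hypothetical decoder hidden size
CODEBOOK_SIZE = 1000   # hypothetical number of codes per codebook
NUM_CODEBOOKS = 8      # hypothetical N; the heads cover codebooks 1..N-1

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1D vector of logits."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# One independent linear layer (weight matrix) per remaining codebook.
head_weights = [
    rng.standard_normal((HIDDEN_DIM, CODEBOOK_SIZE)) * 0.02
    for _ in range(NUM_CODEBOOKS - 1)
]

# Stand-in for the decoder's hidden state after seeing the zeroth code.
decoder_state = rng.standard_normal(HIDDEN_DIM)

# Each head turns the hidden state into a distribution over its own codebook.
probabilities = [softmax(decoder_state @ w) for w in head_weights]
predicted_codes = [int(p.argmax()) for p in probabilities]

print(len(predicted_codes))  # 7
```

This mirrors the text-LLM analogy: each head plays the role of an output layer over its own vocabulary of 1000 codes.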

Finally, after the codewords are all generated, we aggregate them to form the combined continuous audio embedding. The final job is to convert this embedding back to a waveform. For this, we apply transposed convolutional layers to upsample the embedding from 12.5 Hz back to 24 kHz waveform audio, basically reversing the transforms we applied originally during audio preprocessing.
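Here is a small NumPy sketch of the aggregation step for a whole sequence of frames. It assumes the codewords are combined by summing the selected code vectors, which is the usual way RVQ codes are turned back into an embedding; the article itself only says they are aggregated. The upsampling is not implemented, only the number of output samples it would produce at a factor of 1920. All sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_CODEBOOKS, CODEBOOK_SIZE, EMBED_DIM = 8, 1000, 512   # hypothetical sizes
HOP = 1920                                               # upsampling factor back to 24 kHz

# Random stand-ins for trained codebooks and for generated tokens.
codebooks = rng.standard_normal((NUM_CODEBOOKS, CODEBOOK_SIZE, EMBED_DIM))
tokens = rng.integers(0, CODEBOOK_SIZE, size=(25, NUM_CODEBOOKS))  # 25 frames = 2 s at 12.5 Hz

def codes_to_embeddings(tokens: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Look up one code vector per codebook level and sum them per frame."""
    levels = np.arange(codebooks.shape[0])
    selected = codebooks[levels[None, :], tokens]   # (frames, levels, dim)
    return selected.sum(axis=1)                     # (frames, dim)

embeddings = codes_to_embeddings(tokens, codebooks)
num_output_samples = embeddings.shape[0] * HOP

print(embeddings.shape)     # (25, 512)
print(num_output_samples)   # 48000
```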

In Summary

Check out the accompanying video on this article! (Video by author)

So, here is the overall summary of the Sesame model in a few bullet points.

1. Sesame is built on a multimodal Conversational Speech Model, or CSM.
2. Text and audio are tokenized together to form a sequence of tokens, which is fed into the backbone transformer that autoregressively processes the sequence.
3. While the text is processed as in any other text-based LLM, the audio is processed directly from its waveform representation. They use the Mimi encoder to convert the waveform into latent codes using a split-RVQ tokenizer.
4. The multimodal backbone transformer consumes a sequence of tokens and predicts the next zeroth codeword.
5. Another lightweight transformer called the Audio Decoder predicts the remaining codewords from the zeroth codeword.
6. The final audio frame representation is generated by combining all the generated codewords and is upsampled back to the waveform representation.

Thanks for reading!

References and Must-read papers

Check out my ML YouTube Channel

Sesame Blogpost and Demo

Related papers:
Moshi: https://arxiv.org/abs/2410.00037
SoundStream: https://arxiv.org/abs/2107.03312
HuBERT: https://arxiv.org/abs/2106.07447
SpeechTokenizer: https://arxiv.org/abs/2308.16692



