Eerily realistic AI voice demo sparks amazement and discomfort online

An example argument with Sesame’s CSM created by Gavin Purcell.

Gavin Purcell, co-host of the AI for Humans podcast, posted an example video on Reddit in which the human pretends to be an embezzler and argues with a boss. It’s so dynamic that it’s difficult to tell which speaker is the human and which is the AI model. Judging by our own demo, the model is entirely capable of what you see in the video.

“Near-human quality”

Under the hood, Sesame’s CSM achieves its realism by using two AI models working together (a backbone and a decoder) based on Meta’s Llama architecture, which process interleaved text and audio. Sesame trained three AI model sizes, with the largest using 8.3 billion parameters (an 8 billion parameter backbone model plus a 300 million parameter decoder) on approximately 1 million hours of primarily English audio.

Sesame’s CSM doesn’t follow the traditional two-stage approach used by many earlier text-to-speech systems. Instead of generating semantic tokens (high-level speech representations) and acoustic details (fine-grained audio features) in two separate stages, Sesame’s CSM integrates them into a single-stage, multimodal transformer-based model, jointly processing interleaved text and audio tokens to produce speech. OpenAI’s voice model uses a similar multimodal approach.
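To make the idea of “interleaved text and audio tokens” concrete, here is a minimal Python sketch of how a conversation might be flattened into one sequence for a single model to process. It is an illustration only: the `Turn` class, the `interleave` function, the modality tags, and the token IDs are all hypothetical, and the turn-by-turn layout (text first, then audio) is an assumption rather than Sesame’s documented format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    """One conversational turn (hypothetical structure for illustration)."""
    text_tokens: List[int]   # token IDs for the transcript of the turn
    audio_tokens: List[int]  # token IDs for the audio of the turn


def interleave(turns: List[Turn]) -> List[Tuple[str, int]]:
    """Flatten turns into a single sequence of (modality, token) pairs.

    Assumed layout: for each turn, all text tokens followed by all
    audio tokens, so one model sees both modalities in one stream.
    """
    sequence: List[Tuple[str, int]] = []
    for turn in turns:
        sequence.extend(("text", t) for t in turn.text_tokens)
        sequence.extend(("audio", a) for a in turn.audio_tokens)
    return sequence


if __name__ == "__main__":
    conversation = [
        Turn(text_tokens=[1, 2], audio_tokens=[10, 11, 12]),
        Turn(text_tokens=[3], audio_tokens=[13]),
    ]
    for modality, token in interleave(conversation):
        print(modality, token)
```

The point of the sketch is the contrast with a two-stage pipeline: rather than passing the output of one model to another, a single sequence carries both the words and the sound of earlier turns, so the model can condition new speech on both.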

In blind tests without conversational context, human evaluators showed no clear preference between CSM-generated speech and real human recordings, suggesting the model achieves near-human quality for isolated speech samples. However, when provided with conversational context, evaluators still consistently preferred real human speech, indicating that a gap remains in fully contextual speech generation.

Sesame co-founder Brendan Iribe acknowledged current limitations in a comment on Hacker News, noting that the system is “still too eager and often inappropriate in its tone, prosody and pacing” and has issues with interruptions, timing, and conversation flow. “Today, we’re firmly in the valley, but we’re optimistic we can climb out,” he wrote.
