Audio embeddings for music recommendation?
Streaming platforms (Spotify, Apple Music, etc.) need to be able to recommend new songs to their users. The better the recommendations, the better the listening experience.
There are many ways these platforms can build their recommendation systems. Modern systems combine different recommendation methods into a hybrid structure.
Think about when you first joined Spotify: you will have been asked which genres you like. Based on the genres you select, Spotify recommends some songs. Recommendations based on song metadata like this are known as content-based filtering. Collaborative filtering is also used, which groups together users who behave similarly and then transfers recommendations between them.
(Diagram generated by the author using OpenAI’s image generation tools)
The two methods above lean heavily on user behaviour. Another method, which is increasingly being used by large streaming services, is using deep learning to represent songs in learned embedding spaces. This allows songs to be represented in a high-dimensional embedding space which captures rhythm, timbre, texture, and production style. Similarity between songs can then be computed easily, which scales better than classical collaborative filtering approaches when dealing with hundreds of millions of users and tens of millions of tracks.
Through the rise of LLMs, word and phrase embeddings have become mainstream and are relatively well understood. But how do embeddings work for songs, and what problem are they solving? The remainder of this post focuses on how audio becomes a model input, which architectural choices encode musical features, how contrastive learning shapes the geometry of the embedding space, and how a song recommender system using an embedding might work in practice.
How does audio become an input to a neural network?
Raw audio files like MP3s are fundamentally a waveform: a rapidly varying time series. Learning from these files directly is possible, but is typically data-hungry and computationally expensive. We can convert .mp3 files into mel-spectrograms, which are much better suited as inputs to a neural network.
A mel-spectrogram is a way of representing an audio file’s frequency content over time, adapted to how humans perceive sound. It’s a 2D representation where the x-axis corresponds to time, the y-axis corresponds to mel-scaled frequency bands, and each value represents the log-scaled energy in that band at that time.

(Diagram generated by the author using OpenAI’s image generation tools)
The colours and shapes we see on a mel-spectrogram can tell us meaningful musical information. Brighter colours indicate higher energy at that frequency and time, and darker colours indicate lower energy. Thin horizontal bands indicate sustained pitches and often correspond to sustained notes (vocals, strings, synth pads). Tall, vertical streaks indicate energy across many frequencies at once, concentrated in time. These can represent drum snares and claps.
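The conversion from a waveform to a log-mel spectrogram can be sketched in plain NumPy. This is a minimal illustration, not the article’s actual preprocessing code: the sample rate, FFT size, hop length, and all function names are assumptions chosen for the example, and in practice a library routine such as `librosa.feature.melspectrogram` would normally be used instead.

```python
import numpy as np

def hz_to_mel(hz):
    """Convert frequency in Hz to the mel scale (HTK-style formula)."""
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters evenly spaced on the mel scale: (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        left, centre, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel_spectrogram(wave, sr=22050, n_fft=2048, hop=512, n_mels=128):
    """Return a (n_mels, n_frames) log-mel spectrogram in dB, clipped to [-80, 0]."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2          # (n_frames, n_fft // 2 + 1)
    mel_power = mel_filterbank(sr, n_fft, n_mels) @ power.T   # (n_mels, n_frames)
    db = 10.0 * np.log10(np.maximum(mel_power, 1e-10))
    db -= db.max()                                            # 0 dB is the loudest bin
    return np.clip(db, -80.0, 0.0)

# One second of a 440 Hz sine wave as stand-in audio.
sr = 22050
t = np.arange(sr) / sr
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr=sr)
```

For a pure tone like this, the energy collects in a thin horizontal band around the mel bins closest to 440 Hz, which is the “sustained pitch” pattern described above.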
Now we can start to think about how convolutional neural networks can learn to recognise features of these audio representations. At this point, the key challenge becomes: how do we train a model to recognise that two short audio excerpts belong to the same song without labels?
Chunking and Contrastive Learning
Before we jump into the architecture of the CNN that we’ve used, we’ll take some time to talk about how we load the spectrogram data into the network, and how we set up the loss function of the network without labels.
At a very high level, we feed the spectrograms into the CNN, lots of matrix multiplication happens inside, and then we’re left with a 128-dimensional vector which is a latent representation of the physical features of that audio file. But how do we set up the batching and loss so that the network is able to evaluate similar songs?
Let’s start with the batching. We have a dataset of songs (from the FMA small dataset) that we’ve converted into spectrograms. We make use of the tensorflow.keras.utils.Sequence class to randomly select 8 songs from the dataset. We then randomly “chunk” each spectrogram to select a 128 x 129 rectangle which represents a small portion of each song, as depicted below.

(Diagram generated by the author using OpenAI’s image generation tools)
This means that every batch we feed into the network has the shape (8, 128, 129, 1) (batch size, mel frequency dimension, time chunk, channel dimension). By feeding chunks of songs instead of whole songs, the model will see different parts of the same songs across training epochs. This prevents the model from overfitting to a specific moment in each track. Using short samples from each song encourages the network to learn local musical texture (timbre, rhythmic density) rather than long-range structure.
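The sampling logic can be sketched without TensorFlow. The article’s loader is built on tensorflow.keras.utils.Sequence; the NumPy version below only illustrates the random song selection and chunking, and the function names and stand-in data are hypothetical.

```python
import numpy as np

def random_chunk(spec, chunk_frames=129, rng=None):
    """Cut a random (n_mels, chunk_frames) window out of a (n_mels, n_frames) spectrogram."""
    rng = np.random.default_rng() if rng is None else rng
    n_frames = spec.shape[1]
    if n_frames < chunk_frames:
        raise ValueError("spectrogram is shorter than one chunk")
    start = rng.integers(0, n_frames - chunk_frames + 1)
    return spec[:, start : start + chunk_frames]

def make_batch(spectrograms, batch_size=8, chunk_frames=129, rng=None):
    """Pick batch_size distinct songs at random and take one random chunk from each."""
    rng = np.random.default_rng() if rng is None else rng
    song_ids = rng.choice(len(spectrograms), size=batch_size, replace=False)
    chunks = [random_chunk(spectrograms[i], chunk_frames, rng) for i in song_ids]
    return np.stack(chunks)[..., np.newaxis].astype(np.float32)  # add the channel axis

# Stand-in data: 20 fake "songs" with 128 mel bands and varying lengths.
rng = np.random.default_rng(0)
songs = [rng.uniform(-80.0, 0.0, size=(128, int(n))) for n in rng.integers(300, 1300, size=20)]
batch = make_batch(songs, rng=rng)
print(batch.shape)
```

Because the start position is redrawn every time a batch is built, the same song contributes a different excerpt on each epoch.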
Next, we make use of a contrastive learning objective. Contrastive loss was introduced in 2005 by Chopra et al. to learn an embedding space where similar pairs (positive pairs) have a low Euclidean distance, and dissimilar pairs (negative pairs) are separated by at least a certain margin. We’re using a similar concept by applying the InfoNCE loss.
We create two stochastic “views” of each batch. What this really means is that we create two augmentations of the batch, each with random, normally distributed noise added. This is done simply, with the following function:
import tensorflow as tf

@tf.function
def augment(x):
    """Tiny time-frequency noise."""
    noise = tf.random.normal(shape=tf.shape(x), mean=0.0, stddev=0.05)
    # Clip back to the usual mel dB range of -80 to 0.
    return tf.clip_by_value(x + noise, -80.0, 0.0)
Embeddings of the same audio sample should be more similar to each other than to embeddings of any other sample in the batch.
So for a batch of size 8, we compute the similarity between every embedding from the first view and every embedding from the second view, resulting in an 8×8 similarity matrix.
We define the embeddings of the two augmented batches as
\[ z_i, z_j \in \mathbb{R}^{N \times d} \]
Each row (a 128-D embedding, in our case) of the two batches is L2-normalised, that is,
\[ \Vert z_i^{(k)} \Vert_2 = 1 \]
We can then compute the similarity between every embedding from the first view and every embedding from the second view, resulting in an N×N similarity matrix. This matrix is defined as:
\[ S = \frac{1}{\tau} z_i z_j^T \]
where every element of S is the similarity between the embedding of song k in the first view and the embedding of song l in the second view. This can be defined element-wise as:
\[
S_{kl} = \frac{1}{\tau} \langle z_i^{(k)}, z_j^{(l)} \rangle
= \frac{1}{\tau} \cos(z_i^{(k)}, z_j^{(l)})
\]
where τ is a temperature parameter. This means that the diagonal entries (the similarity between the two views of the same chunk) are the positive pairs, and the off-diagonal entries are the negative pairs.
Then for each row k of the similarity matrix, we compute:
\[
\ell_k = -\log \frac{\exp(S_{kk})}{\sum_{l=1}^{N} \exp(S_{kl})}
\]
This is a softmax cross-entropy loss where the numerator is the exponentiated similarity of the positive pair, and the denominator is the sum of the exponentiated similarities across the row. The minus sign makes the loss small when the positive pair dominates its row.
Finally we average the loss over the batch, giving us the full loss objective:
\[
L = -\frac{1}{N} \sum_{k=1}^{N} \log
\frac{
\exp\left( \frac{1}{\tau} \langle z_i^{(k)}, z_j^{(k)} \rangle \right)
}{
\sum_{l=1}^{N} \exp\left( \frac{1}{\tau} \langle z_i^{(k)}, z_j^{(l)} \rangle \right)
}
\]
Minimising the contrastive loss encourages the model to assign the highest similarity to matching augmented views while suppressing similarity to all other samples in the batch. This simultaneously pulls representations of the same audio closer together and pushes representations of different audio further apart, shaping a structured embedding space without requiring explicit labels.
This loss function is neatly described by the following Python function:
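import tensorflow as tf

def contrastive_loss(z_i, z_j, temperature=0.1):
    """
    Compute InfoNCE loss between two batches of embeddings.
    z_i, z_j: (batch_size, embedding_dim)
    """
    # L2-normalise so the dot product below is a cosine similarity.
    z_i = tf.math.l2_normalize(z_i, axis=1)
    z_j = tf.math.l2_normalize(z_j, axis=1)
    # (N, N) similarity matrix, scaled by the temperature.
    logits = tf.matmul(z_i, z_j, transpose_b=True) / temperature
    # The positive for row k sits in column k, so the target class is the row index.
    labels = tf.range(tf.shape(logits)[0])
    loss = tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
    return tf.reduce_mean(loss)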
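As a sanity check on the loss, the same computation can be reproduced in NumPy and evaluated on three hand-made cases. This is an illustrative sketch rather than part of the training code; `info_nce_numpy` is a hypothetical helper, and the random embeddings are stand-ins for real model outputs.

```python
import numpy as np

def info_nce_numpy(z_i, z_j, temperature=0.1):
    """NumPy version of the InfoNCE loss, for checking values by hand."""
    z_i = z_i / np.linalg.norm(z_i, axis=1, keepdims=True)
    z_j = z_j / np.linalg.norm(z_j, axis=1, keepdims=True)
    logits = z_i @ z_j.T / temperature                    # (N, N) similarity matrix S
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))          # positives sit on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 128))
aligned = info_nce_numpy(z, z + 0.05 * rng.normal(size=z.shape))  # matching views
shuffled = info_nce_numpy(z, np.roll(z, 1, axis=0))               # mismatched views
collapsed = info_nce_numpy(np.ones((8, 128)), np.ones((8, 128)))  # all embeddings identical
print(aligned, collapsed, shuffled)
```

Matching views give a loss close to zero, mismatched views give a large loss, and a collapsed model that maps every input to the same vector scores exactly log N (about 2.08 for N = 8), since every row of the softmax is uniform. That last value is a useful reference point when reading a training curve.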
Now that we’ve built some intuition for how we load batches into the model and how minimising our loss function clusters similar sounds together, we can dive into the structure of the CNN.
A simple CNN architecture
We’ve chosen a fairly simple convolutional neural network architecture for this task. CNNs originated with Yann LeCun and team when they created LeNet for handwritten digit recognition. CNNs are great at learning to understand images, and we’ve converted each song into an image-like format that works with CNNs.
The first convolution layer applies 32 small filters across the spectrogram. At this point, the network is mostly learning very local patterns: things like short bursts of energy, harmonic lines, or sudden changes that often correspond to note onsets or percussion. Batch normalization keeps the activations well-behaved during training, and max pooling reduces the resolution slightly so the model doesn’t overreact to tiny shifts in time or frequency.
The second block increases the number of filters to 64 and starts combining these low-level patterns into more meaningful structures. Here, the network begins to pick up on broader textures, repeating rhythmic patterns, and consistent timbral features. Pooling again compresses the representation while keeping the most important activations.
By the third convolution layer, the model is working with 128 channels. These feature maps tend to reflect higher-level aspects of the sound, such as overall spectral balance or instrument-like textures. At this stage, the exact position of a feature matters less than whether it appears at all.

(Diagram generated by the author using OpenAI’s image generation tools)
Global average pooling removes the remaining time–frequency structure by averaging each feature map down to a single value. This forces the network to summarise what patterns are present in the chunk, rather than where they occur, and produces a fixed-size vector regardless of input length.
A dense layer then maps this summary into a 128-dimensional embedding. This is the space where similarity is learned: chunks that sound alike should end up close together, while dissimilar sounds are pushed apart.
Finally, the embedding is L2-normalised so that all vectors lie on the unit sphere. This makes cosine similarity simple to compute and keeps distances in the embedding space consistent during contrastive training.
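The last three steps (global average pooling, the dense projection, and L2 normalisation) can be sketched in NumPy. The feature-map sizes and random weights below are stand-ins chosen for illustration, not values from the trained model, and `embed_from_feature_maps` is a hypothetical name.

```python
import numpy as np

def embed_from_feature_maps(feature_maps, dense_w, dense_b):
    """Global average pooling -> dense projection -> L2 normalisation."""
    pooled = feature_maps.mean(axis=(0, 1))   # (freq, time, channels) -> (channels,)
    z = pooled @ dense_w + dense_b            # (channels,) -> (embedding_dim,)
    return z / np.linalg.norm(z)

rng = np.random.default_rng(0)
dense_w = rng.normal(size=(128, 128))   # stand-in for trained dense-layer weights
dense_b = np.zeros(128)

# Two inputs with different time lengths still produce embeddings of the same size.
short = embed_from_feature_maps(rng.normal(size=(16, 16, 128)), dense_w, dense_b)
long = embed_from_feature_maps(rng.normal(size=(16, 40, 128)), dense_w, dense_b)
cosine = float(short @ long)   # for unit vectors the dot product is the cosine similarity
print(short.shape, long.shape)
```

Averaging over the frequency and time axes is what removes the dependence on input length, and normalising to unit length is what lets a plain dot product stand in for cosine similarity.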
At a high level, this model learns about music in much the same way that a convolutional neural network learns about images. Instead of pixels organised by height and width, the input here is a mel-spectrogram organised by frequency and time.
How do we know the model is any good?
Everything we’ve talked about so far has been fairly abstract. How do we actually know that the mel-spectrogram representations, the model architecture and the contrastive learning have done a decent job of creating meaningful embeddings?
One common way of understanding the embedding space we’ve created is to project it into a lower-dimensional space, one that humans can actually visualise. This technique is called dimensionality reduction, and is useful when trying to understand high-dimensional data.

(Image by author)
Two techniques we can use are PCA (Principal Component Analysis) and t-SNE (t-distributed Stochastic Neighbor Embedding). PCA is a linear method that preserves global structure, making it useful for understanding the overall shape and main directions of variation in an embedding space. t-SNE is a non-linear method that prioritises local neighbourhood relationships, which makes it better at revealing small clusters of similar points but less reliable for interpreting global distances. As a result, PCA is better for assessing whether an embedding space is coherent overall, while t-SNE is better for checking whether similar items tend to group together locally.
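Because PCA is linear, the projection can be written in a few lines of NumPy using the singular value decomposition. The sketch below uses random unit vectors as stand-in embeddings and a hypothetical `pca_project` helper; it is not the plotting code behind the figures. t-SNE has no comparably short closed form and is usually run through a library such as scikit-learn’s `sklearn.manifold.TSNE`.

```python
import numpy as np

def pca_project(embeddings, n_components=2):
    """Project (n_samples, dim) embeddings onto their top principal components."""
    centred = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return centred @ vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 128))
fake_embeddings /= np.linalg.norm(fake_embeddings, axis=1, keepdims=True)
points_2d, explained = pca_project(fake_embeddings)
print(points_2d.shape, explained)
```

The `explained` values report the fraction of total variance each plotted axis carries, which is worth checking before reading too much into a 2D scatter of a 128-dimensional space.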
As mentioned above, I trained this CNN using the FMA small dataset, which contains genre labels for each song. When we visualise the embedding space, we can colour the points by genre, which helps us make some statements about the quality of the embedding space.
The two-dimensional projections give different but complementary views of the learned embedding space. Neither plot shows perfectly separated genre clusters, which is expected and actually desirable for a music similarity model.
In the PCA projection, genres are heavily mixed and form a smooth, continuous shape rather than distinct groups. This suggests that the embeddings capture gradual variations in musical characteristics such as timbre and rhythm, rather than memorising genre labels. Because PCA preserves global structure, this indicates that the embedding space is coherent and organised in a meaningful way.
The t-SNE projection focuses on local relationships. Here, tracks from the same genre are more likely to appear near each other, forming small, loose clusters. At the same time, there is still significant overlap between genres, reflecting the fact that many songs share characteristics across genre boundaries.

(Image by author)
Overall, these visualisations suggest that the embeddings work well for similarity-based tasks. PCA suggests that the space is globally well-structured, while t-SNE shows that locally similar songs tend to group together; both are important properties for a music recommendation system. To further evaluate the quality of the embeddings we could also look at recommendation-related evaluation metrics, like NDCG and recall@k.
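For reference, both metrics are short to compute once a notion of relevance has been fixed. The sketch below uses binary relevance and made-up track names; what counts as “relevant” (same genre, a track the user actually played, and so on) is an assumption that would have to be chosen for a real evaluation.

```python
import math

def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: discounted gain of the ranking divided by the ideal gain."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0

recommended = ["song_a", "song_b", "song_c", "song_d"]   # ranked model output
relevant = {"song_b", "song_d", "song_x"}                # hypothetical ground truth
print(recall_at_k(recommended, relevant, k=4))
print(ndcg_at_k(recommended, relevant, k=4))
```

Recall@k only asks whether relevant tracks made it into the list, while NDCG also rewards placing them near the top.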
Turning the project into a usable music recommendation app
Finally we’ll spend some time talking about how we can actually turn this trained model into something usable. To illustrate how a CNN like this might be used in practice, I’ve created a very simple song recommender web app. This app takes an uploaded MP3 file, computes its embedding and returns a list of the most similar tracks based on cosine similarity. Rather than treating the model in isolation, I designed the pipeline end-to-end: audio preprocessing, spectrogram generation, embedding inference, similarity search, and result presentation. This mirrors how such a system would be used in practice, where models must operate reliably on unseen inputs rather than curated datasets.
The embeddings from the FMA small dataset are precomputed and stored offline, allowing recommendations to be generated quickly using cosine similarity rather than running the model repeatedly. Chunk-level embeddings are aggregated into a single song-level representation, ensuring consistent behaviour for tracks of different lengths.
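The lookup step can be sketched as follows. The article does not say how chunk embeddings are aggregated, so mean pooling followed by re-normalisation is an assumption here, as are the function names, the track IDs, and the random stand-in library.

```python
import numpy as np

def song_embedding(chunk_embeddings):
    """Average chunk embeddings into one song vector, then re-normalise to unit length."""
    mean = np.asarray(chunk_embeddings).mean(axis=0)
    return mean / np.linalg.norm(mean)

def top_k_similar(query, library, track_ids, k=5):
    """Return the k (track_id, cosine similarity) pairs most similar to the query."""
    scores = library @ query                  # unit vectors: dot product = cosine
    best = np.argsort(scores)[::-1][:k]
    return [(track_ids[i], float(scores[i])) for i in best]

rng = np.random.default_rng(0)
# Stand-in for precomputed song-level embeddings, one unit-length row per track.
library = rng.normal(size=(100, 128))
library /= np.linalg.norm(library, axis=1, keepdims=True)
track_ids = [f"track_{i:03d}" for i in range(100)]

# Pretend the uploaded file yielded 6 chunk embeddings that sit close to track 42.
chunks = library[42] + 0.05 * rng.normal(size=(6, 128))
results = top_k_similar(song_embedding(chunks), library, track_ids, k=3)
print(results)
```

With the library stored as one matrix of unit vectors, a recommendation is a single matrix-vector product plus a sort, which is why precomputing the embeddings keeps the app responsive.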
The final result is a lightweight web application that demonstrates how a learned representation can be integrated into a real recommendation workflow.
This is a very simple illustration of how embeddings could be used in an actual recommendation system, but it doesn’t capture the whole picture. Modern recommendation systems combine both audio embeddings and collaborative filtering, as mentioned at the start of this article.
Audio embeddings capture what things sound like, and collaborative filtering captures who likes what. A combination of the two, along with additional ranking models, can create a hybrid system that balances acoustic similarity and personal taste.
Data sources and Images
This project uses the FMA Small dataset, a publicly available subset of the Free Music Archive (FMA) dataset released by Defferrard et al. The dataset consists of short music clips released under Creative Commons licenses and is widely used for academic research in music information retrieval.
All schematic diagrams in this article were generated by the author using AI-assisted image generation tools and are used in accordance with the tool’s terms, which permit commercial use. The images were created from original prompts and do not reference copyrighted works, fictional characters, or real individuals.

