I’m currently working as a Machine Learning researcher at the University of Iowa, and the specific modality I’m working with is audio. Since I’m just starting the project, I’ve been reading current state-of-the-art papers and other related work to understand the landscape. This paper is indeed about audio and is an extension of the Masked Autoencoders work by Facebook Research.
Audio-MAE first masks the audio spectrogram with a high masking ratio and then feeds only the non-masked tokens into the ViT encoder. Then, a decoder composed of standard transformer blocks re-orders the tokens, inserts mask tokens, and tries to reconstruct the input spectrogram.
- Invention of the Transformer and Self-Attention.
- Masked autoencoders with BERT.
- Vision Transformer for computer vision tasks.
- MAE for self-supervised learning.
- Transformer-based architectures for audio tasks: AST and MBT. Common techniques when working with audio are:
- Deflating the patch embedding.
- Interpolating the positional embeddings (sketched below).
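Interpolating positional embeddings is needed because a spectrogram's patch grid rarely matches the square image grid a ViT was pre-trained on. Below is a minimal sketch of such an interpolation in PyTorch; the function name, grid sizes, and embedding dimension are my own assumptions for illustration, not values from the paper.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid, new_grid):
    """Resize a (1, H*W, D) positional-embedding table from one patch grid
    to another via bicubic interpolation (hypothetical helper)."""
    D = pos_embed.shape[-1]
    old_h, old_w = old_grid
    new_h, new_w = new_grid
    # (1, H*W, D) -> (1, D, H, W) so F.interpolate treats it like an image
    grid = pos_embed.reshape(1, old_h, old_w, D).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_h, new_w), mode="bicubic", align_corners=False)
    # back to token layout: (1, new_H*new_W, D)
    return grid.permute(0, 2, 3, 1).reshape(1, new_h * new_w, D)

# Example: adapt a 14x14 image grid to a 64x8 (time x frequency) spectrogram grid.
pos = torch.randn(1, 14 * 14, 768)
new_pos = interpolate_pos_embed(pos, (14, 14), (64, 8))
```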
Problem:
Large-scale training of a transformer is very expensive since self-attention has quadratic time complexity. Solutions in the field prior to Audio-MAE:
- Swin Transformer: Local attention within windows that shift across layers.
- MViT: Employs pooling to construct a hierarchy of Transformers.
- MAE: Encodes only the 25% of patches that remain visible.
Transform the audio recordings into Mel-spectrograms and divide them into non-overlapping regular grid patches. Patches are flattened and embedded by a linear projection. Then, sinusoidal positional embeddings are added.
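A minimal sketch of that preprocessing and embedding pipeline, assuming torchaudio for the Mel-spectrogram and typical values (16 kHz audio, 128 mel bins, 16×16 patches, 768-dim embeddings); these settings and names are assumptions for illustration rather than quotes from the paper.

```python
import torch
import torch.nn as nn
import torchaudio

# Waveform -> log-Mel spectrogram (assumed: 16 kHz audio, 128 mel bins).
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128)
waveform = torch.randn(1, 16000 * 10)                # 10 s of dummy audio
spec = torch.log(mel(waveform) + 1e-6).unsqueeze(1)  # (1, 1, 128, time)

# Non-overlapping 16x16 patches, flattened and linearly projected,
# implemented as a strided convolution (standard ViT-style patch embedding).
patch_embed = nn.Conv2d(1, 768, kernel_size=16, stride=16)
spec = spec[..., : spec.shape[-1] // 16 * 16]        # crop time to a multiple of 16
tokens = patch_embed(spec).flatten(2).transpose(1, 2)  # (1, num_patches, 768)

# Fixed sinusoidal positional embeddings, one per patch index.
def sinusoidal_pos_embed(n, d):
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(d // 2, dtype=torch.float32)
    angles = pos / (10000 ** (2 * i / d))
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

tokens = tokens + sinusoidal_pos_embed(tokens.shape[1], 768)
```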
Two forms of masking are presented in the paper (see the sketch after this list):
- Unstructured Masking: Random masking without any prior.
- Structured Masking: Masking along time, frequency, or time + frequency.
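A small sketch contrasting the two kinds of masking on the (time × frequency) patch grid; the helper name, grid shape, and ratios are assumptions for illustration.

```python
import torch

def make_mask(grid_t, grid_f, ratio, mode):
    """Boolean (grid_t, grid_f) mask where True = masked patch (hypothetical helper).
    'unstructured' masks random individual patches; 'time' / 'frequency' mask
    whole time steps or frequency bands of the patch grid."""
    if mode == "unstructured":
        mask = torch.zeros(grid_t * grid_f, dtype=torch.bool)
        mask[torch.randperm(grid_t * grid_f)[: int(grid_t * grid_f * ratio)]] = True
        return mask.view(grid_t, grid_f)
    if mode == "time":
        rows = torch.zeros(grid_t, dtype=torch.bool)
        rows[torch.randperm(grid_t)[: int(grid_t * ratio)]] = True
        return rows.unsqueeze(1).expand(grid_t, grid_f)
    if mode == "frequency":
        cols = torch.zeros(grid_f, dtype=torch.bool)
        cols[torch.randperm(grid_f)[: int(grid_f * ratio)]] = True
        return cols.unsqueeze(0).expand(grid_t, grid_f)
    raise ValueError(mode)

# Unstructured masking at a high ratio (pre-training) vs. structured
# time + frequency masking at lower ratios (fine-tuning).
pretrain_mask = make_mask(50, 8, 0.8, "unstructured")
finetune_mask = make_mask(50, 8, 0.3, "time") | make_mask(50, 8, 0.3, "frequency")
```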
Through ablation studies, the authors concluded that a large masking ratio (80% for spectrogram patches, compared with 75% for images in MAE) is effective for self-supervised learning of audio. A higher unstructured masking ratio for pre-training and a lower structured masking ratio for fine-tuning lead to the best accuracy.
Encoder: A 12-layer ViT-B that receives only the 20% of patches that are not masked, which matches the masking ratio mentioned above.
Decoder: Trainable mask tokens are added to the encoded patches, and the original time-frequency order is restored. Then positional embeddings are added and the restored sequence is fed into the decoder.
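A compressed sketch of that encode/decode flow, including the shuffle bookkeeping used to restore the original time-frequency order; the function and variable names are mine, and `nn.Identity()` stands in for the real 12-layer ViT-B encoder and 16-layer decoder.

```python
import torch
import torch.nn as nn

def masked_autoencode(patches, encoder, decoder, decoder_pos, mask_token, mask_ratio=0.8):
    """patches: (B, N, D) embedded spectrogram patches (already position-encoded)."""
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))

    # Random shuffle; the first `num_keep` positions become the visible patches.
    shuffle = torch.rand(B, N, device=patches.device).argsort(dim=1)
    restore = shuffle.argsort(dim=1)                      # inverse permutation
    keep = shuffle[:, :num_keep].unsqueeze(-1).expand(-1, -1, D)
    visible = torch.gather(patches, 1, keep)

    latent = encoder(visible)                             # encoder sees ~20% of patches

    # Append trainable mask tokens, then undo the shuffle so every token sits
    # at its original time-frequency position before decoding.
    full = torch.cat([latent, mask_token.expand(B, N - num_keep, D)], dim=1)
    full = torch.gather(full, 1, restore.unsqueeze(-1).expand(-1, -1, D))
    full = full + decoder_pos[:, :N]                      # decoder positional embeddings
    return decoder(full)                                  # per-patch spectrogram prediction

# Toy usage with identity stand-ins for the real transformer stacks.
B, N, D = 2, 400, 768
out = masked_autoencode(
    torch.randn(B, N, D),
    encoder=nn.Identity(), decoder=nn.Identity(),
    decoder_pos=torch.zeros(1, N, D),
    mask_token=nn.Parameter(torch.zeros(1, 1, D)),
)
```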
For images, it is commonly understood that global self-attention works well since images are largely invariant under translation and scaling. But global attention is not ideal for audio, since information in time and frequency is local. Spectrograms are more similar to NLP when it comes to the relevance of order and position. Thus, the decoder uses local attention mechanisms, separating patches into local windows (a sketch follows this list):
- Shifted window attention: Shift the local windows by 50% between decoder layers.
- Hybrid window attention: Computes local attention within windows in all layers but the last few top ones.
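A rough sketch of grouping patches into local windows before self-attention; the window size and the use of PyTorch's built-in multi-head attention are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

def local_window_attention(x, grid_t, grid_f, win_t, win_f, attn):
    """x: (B, grid_t*grid_f, D). Attention is computed independently inside
    each (win_t x win_f) window of the patch grid instead of globally."""
    B, N, D = x.shape
    x = x.view(B, grid_t // win_t, win_t, grid_f // win_f, win_f, D)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win_t * win_f, D)  # (B*windows, tokens, D)
    out, _ = attn(x, x, x)                                         # self-attention per window
    out = out.reshape(B, grid_t // win_t, grid_f // win_f, win_t, win_f, D)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, N, D)

# Toy usage: a 64x8 patch grid split into 4x4 windows.
attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)
y = local_window_attention(torch.randn(2, 64 * 8, 768), 64, 8, 4, 4, attn)
```

The shifted variant would roll the patch grid by half a window between layers before applying the same operation, and the hybrid variant would replace this per-window attention with ordinary global attention in the top layers.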
The objective is the mean-squared error between the prediction and the input spectrogram. During fine-tuning, only the encoder is used, and masking is still applied to reduce computation. A linear layer is added on top for fine-tuning.
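A minimal sketch of the reconstruction objective; whether the loss is averaged over all patches or only the masked ones isn't stated in these notes, so restricting it to masked patches (as in the original MAE) is shown as an assumption.

```python
import torch

def reconstruction_loss(pred, target, mask=None):
    """Mean-squared error between predicted and ground-truth spectrogram patches,
    both shaped (B, N, patch_dim). If `mask` (B, N, 1 = masked) is given, average
    only over masked patches (assumed, following the original MAE)."""
    per_patch = ((pred - target) ** 2).mean(dim=-1)   # (B, N)
    if mask is None:
        return per_patch.mean()
    return (per_patch * mask).sum() / mask.sum()

loss = reconstruction_loss(torch.randn(2, 400, 256), torch.randn(2, 400, 256),
                           mask=(torch.rand(2, 400) > 0.2).float())
```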
Experiments were carried out on:
- AudioSet (audio classification)
- ESC-50 (environmental sound classification)
- Speech Commands (SPC-1 and SPC-2) and VoxCeleb.
Implementation:
- Encoder: 12-layer ViT-B
- Decoder: 16-layer transformer with shifted local attention.
- Masking ratio of 0.8 for pre-training. Ratios of 0.3 in time and 0.3 in frequency for fine-tuning.
Ablations were run on masking strategies, patch size, stride, and the encoder and decoder designs: