The audio (and speech) domain is going through a massive shift in end-user performance. It is at the same tipping point NLP was at in 2017, just before the Transformer revolution took over. We’ve gone from needing copious amounts of data to build Spoken Language Understanding systems to needing just a 10-minute snippet.
This tutorial will help you build strong code-first and scientific foundations for working with audio data, and create real-world applications like Automatic Speech Recognition (ASR), Audio Classification, and Speaker Verification using backbone models like Wav2Vec2.0 and HuBERT.
Unlike general machine learning problems, where we either classify (i.e. assign a data point to a pre-defined class) or regress a continuous variable, audio-related problems can be slightly more complex: we may map an audio representation to a text representation (ASR), separate out who spoke when in a recording (Diarization), and so on. This tutorial will not only help you build applications like these but also unpack the science behind them using a code-first approach.
Every step of the way, we’ll first write and run some code, then take a step back and unpack it all until it makes sense. We’ll make science *fun* again :)
The tutorial will be divided into 3 key sections:
1. Read, Manipulate & Visualize Audio data
2. Build your very own ASR system (using pre-trained models like Wav2Vec2.0) & deploy it
3. Create an Audio Classification pipeline & run inference for other downstream audio tasks
By the end of the tutorial, you’ll have developed a strong intuition for audio data and learned how to leverage large pre-trained backbone models for downstream tasks. You’ll also learn how to create quick demos to test and share your models.
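For step 2, leveraging a pre-trained Wav2Vec2.0 backbone for ASR can be as short as a few lines with the HuggingFace `pipeline` API. A minimal sketch, assuming the publicly available `facebook/wav2vec2-base-960h` checkpoint (any Wav2Vec2 CTC model from the Hub works), with one second of silence standing in for a real 16 kHz recording:

```python
import numpy as np
from transformers import pipeline

# Load a pre-trained Wav2Vec2 ASR model from the HuggingFace Hub.
# The checkpoint name here is one example; swap in your own.
asr = pipeline("automatic-speech-recognition",
               model="facebook/wav2vec2-base-960h")

# A real app would pass a file path or a waveform loaded at 16 kHz;
# silence is used here only so the sketch runs without an audio asset.
audio = np.zeros(16000, dtype=np.float32)
result = asr(audio)
print(result["text"])
```

The same `asr` object accepts a path to a `.wav` file directly, which is typically how you would wire it into a demo for deployment.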
Libraries: HuggingFace, SpeechBrain, PyTorch & Librosa
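Step 3 follows the same pattern: the HuggingFace `pipeline` API also exposes an `audio-classification` task. A minimal sketch, assuming the `superb/wav2vec2-base-superb-ks` keyword-spotting checkpoint from the Hub as an example (any audio-classification model, including one fine-tuned with SpeechBrain, can be substituted):

```python
import numpy as np
from transformers import pipeline

# Load a pre-trained audio classifier; the checkpoint name is an
# example choice, not prescribed by the tutorial.
clf = pipeline("audio-classification",
               model="superb/wav2vec2-base-superb-ks")

# One second of silence at 16 kHz stands in for a real clip.
audio = np.zeros(16000, dtype=np.float32)
preds = clf(audio)

# Each prediction is a dict with a class label and a confidence score.
for p in preds:
    print(f"{p['label']}: {p['score']:.3f}")
```

Swapping the checkpoint is all it takes to repurpose this pipeline for other downstream audio tasks (emotion recognition, language ID, etc.).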