The human voice is becoming an increasingly important way of interacting with devices, but the current state-of-the-art solutions are proprietary and strive for user lock-in. Mozilla’s DeepSpeech and Common Voice projects aim to change this.
In contrast to classic speech-to-text (STT) approaches, DeepSpeech features a modern end-to-end deep learning solution. Based on Baidu's Deep Speech research paper, it trains a model using machine learning techniques. This model translates raw audio data directly into text, without any domain-specific code in between.
To train systems like DeepSpeech, an extremely large amount of voice data is required. Most of the data used by large companies isn’t available to the majority of people. That's why Mozilla launched Common Voice, a project to help make voice recognition open to everyone.
Introduction (10 min)
Talk:
- A short history of automatic speech recognition (ASR)
- What is the motivation behind project DeepSpeech?
- How does DeepSpeech work?
- How good is DeepSpeech compared to other solutions?
Using DeepSpeech (10 min)
Demo:
- How to voice-enable a project using DeepSpeech
- How to translate audio data into text (see the sketch after this list)
- Looking into the demo code
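
A minimal sketch of what the transcription demo boils down to, using the deepspeech Python package. The model and audio paths are placeholders, and the exact Model constructor and stt signatures have varied between DeepSpeech releases (early versions also took feature, alphabet, and sample-rate parameters), so treat this as an illustration rather than version-exact code:

```python
import wave

import numpy as np
from deepspeech import Model

MODEL_PATH = "output_graph.pb"  # placeholder: an exported DeepSpeech model
AUDIO_PATH = "hello.wav"        # placeholder: a 16 kHz, 16-bit mono WAV file

model = Model(MODEL_PATH)

with wave.open(AUDIO_PATH, "rb") as wav:
    assert wav.getframerate() == 16000, "DeepSpeech models expect 16 kHz audio"
    assert wav.getnchannels() == 1, "expecting mono audio"
    # Raw 16-bit PCM samples: exactly what the acoustic model consumes.
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

# End-to-end: raw samples in, text out, with no hand-written phonetic rules in between.
print(model.stt(audio))
```

Voice-enabling a project then mostly comes down to capturing microphone input in the same 16 kHz, 16-bit mono format and passing it through the same call.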
About data (10 min)
Talk:
- What kind of data is required to train a model?
- Which corpora are we currently using to train our models?
- Common Voice: What is it and how to contribute?
- Unboxing the first version of the Common Voice corpus
Demo:
- How to import existing corpus data
- How to create a simple corpus from your own samples
- How to augment samples with noise (see the sketch after this list)
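
To make the corpus demos concrete, here is a sketch of both steps in plain Python. The three-column CSV layout (wav_filename, wav_filesize, transcript) is the one DeepSpeech's import scripts produce for training; the sample paths, transcripts, and noise level are made up for illustration:

```python
import csv
import os
import wave

import numpy as np

# Hypothetical samples: each WAV file paired with its transcript.
TRANSCRIPTS = {
    "samples/hello.wav": "hello world",
    "samples/test.wav": "this is a test",
}

def write_corpus_csv(transcripts, csv_path):
    """Write the three-column CSV that DeepSpeech's training scripts read."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, text in transcripts.items():
            writer.writerow([wav_path, os.path.getsize(wav_path), text])

def add_noise(in_path, out_path, noise_level=0.05):
    """Augment one sample by mixing in Gaussian noise at the given level."""
    with wave.open(in_path, "rb") as wav:
        params = wav.getparams()
        audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)
    noise = np.random.normal(0, noise_level * np.iinfo(np.int16).max, audio.shape)
    noisy = np.clip(audio + noise, np.iinfo(np.int16).min, np.iinfo(np.int16).max)
    with wave.open(out_path, "wb") as wav:
        wav.setparams(params)
        wav.writeframes(noisy.astype(np.int16).tobytes())

write_corpus_csv(TRANSCRIPTS, "train.csv")
for path, text in list(TRANSCRIPTS.items()):
    noisy_path = path.replace(".wav", "_noisy.wav")
    add_noise(path, noisy_path)
    TRANSCRIPTS[noisy_path] = text  # the transcript is unchanged by noise
write_corpus_csv(TRANSCRIPTS, "train_augmented.csv")
```

Note that an augmented copy keeps the original transcript: the point of noise augmentation is to give the model more acoustic variety for the same label.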
Training a DeepSpeech model (10 min)
Talk:
- How does training work?
- What are the train, dev, and test datasets used for?
- What are the required software components?
- What are the hardware requirements?
Demo:
- How to train a simple model (see the sketch below)
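
As a rough sketch of what the training demo runs, assuming a local checkout of the mozilla/DeepSpeech repository: training is driven by the DeepSpeech.py script, pointed at CSV files for the three splits. The specific flag values here are placeholders, and flag names have shifted between releases, so consult the repository's documentation for the exact set:

```python
import subprocess

# Placeholder CSVs (e.g. produced as in the corpus sketch above) and output paths.
subprocess.run(
    [
        "python", "DeepSpeech.py",        # run from a mozilla/DeepSpeech checkout
        "--train_files", "train.csv",     # fits the network weights
        "--dev_files", "dev.csv",         # validation between epochs
        "--test_files", "test.csv",       # held out for the final accuracy report
        "--train_batch_size", "16",
        "--n_hidden", "512",              # small layer width, enough for a quick demo model
        "--checkpoint_dir", "checkpoints/",
        "--export_dir", "model/",         # where the inference graph gets exported
    ],
    check=True,
)
```

CPU-only training is workable for a toy model like this; realistic models typically call for one or more CUDA-capable GPUs and the GPU build of TensorFlow.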
Roadmap (5 min)
Talk:
- What's our roadmap for 2018+?
Q&A (15 min)