Build Speech to Text Model from Scratch

4 min read · Jun 1, 2023

What we need to train a model that converts speech into text, from collecting data to inference.

Training a Speech-to-Text (STT) model involves several steps. STT is an NLP task that converts spoken language into written text. It is frequently used in voice recognition systems, transcription services, and various applications in linguistics.

Training an STT model involves deep learning, specifically recurrent neural networks (RNNs) such as Long Short-Term Memory (LSTM) networks, or transformers. However, these state-of-the-art models change rapidly, so I am going to present the high-level steps that are usually followed to build a working model.

In future posts, I will present a working demo from a selected pre-trained model.

1. Gather and Preprocess Your Data

First, you’ll need a large dataset of audio files along with their transcriptions (text).

You can either create this dataset yourself (which can be really expensive; you will usually need around 60 hours of recordings to train a good model 💁) or use one of the many publicly available datasets, such as:

- LibriSpeech
- Mozilla’s Common Voice
- VoxForge
- TED-LIUM
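
Whichever source you pick, the first practical step is pairing every audio file with its transcription and checking how many hours you actually have. Here is a minimal sketch of that, using only the Python standard library. It assumes a hypothetical flat folder where each `clip.wav` sits next to a `clip.txt` holding its transcription; the function names and the folder layout are my own illustration, and the public datasets above each ship in their own layout and audio format, so you would need to adapt the loading part.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def build_manifest(data_dir):
    """Pair each .wav file with the .txt transcription of the same name.

    Assumed (hypothetical) layout: data_dir/clip.wav + data_dir/clip.txt.
    """
    manifest = []
    for audio_path in sorted(Path(data_dir).glob("*.wav")):
        text_path = audio_path.with_suffix(".txt")
        if not text_path.exists():
            continue  # skip audio that has no transcription
        manifest.append({
            "audio": str(audio_path),
            "text": text_path.read_text(encoding="utf-8").strip(),
            "duration": wav_duration_seconds(audio_path),
        })
    return manifest


def total_hours(manifest):
    """Sum the clip durations and convert seconds to hours."""
    return sum(item["duration"] for item in manifest) / 3600.0
```

Calling `total_hours(build_manifest("my_dataset"))` (where `my_dataset` is a placeholder folder name) gives a quick sanity check of how close you are to the amount of audio you are aiming for before you spend any time on training.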

Written by Zahra Ahmad