Performing Audio Transcription
This guide will explain how to transcribe audio files using SpeechLine.
First, load your transcription model by passing its Hugging Face model checkpoint to Wav2Vec2Transcriber.
from speechline.transcribers import Wav2Vec2Transcriber
transcriber = Wav2Vec2Transcriber("bookbot/wav2vec2-ljspeech-gruut")
Next, you will need to transform your input audio file (sample.wav in this example) into a Dataset and resample it to the transcriber's sampling rate, like the following:
from datasets import Dataset, Audio
dataset = Dataset.from_dict({"audio": ["sample.wav"]})
dataset = dataset.cast_column("audio", Audio(sampling_rate=transcriber.sampling_rate))
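The Audio feature decodes each file into a float array at the requested rate. If you want to double-check the resampling before transcribing, you can inspect the first example:

print(dataset[0]["audio"]["sampling_rate"])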
Once preprocessing is finished, simply pass the input data to the transcriber.
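A minimal sketch of that call, assuming the transcriber exposes a predict method with an output_offsets flag (inferred here from the offset-style output shown further below):

output_offsets = transcriber.predict(dataset, output_offsets=True)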
Transcribing Audios: 0%| | 0/1 [00:00<?, ?ex/s]
The output format of the transcription model is shown below. It is a nested list of dictionaries (one inner list per input audio file), each dictionary containing the transcribed text and the start_time and end_time stamps of the corresponding phoneme token.
[[{'end_time': 0.02, 'start_time': 0.0, 'text': 'ɪ'},
{'end_time': 0.3, 'start_time': 0.26, 'text': 't'},
{'end_time': 0.36, 'start_time': 0.34, 'text': 'ɪ'},
{'end_time': 0.44, 'start_time': 0.42, 'text': 'z'},
{'end_time': 0.54, 'start_time': 0.5, 'text': 'n'},
{'end_time': 0.58, 'start_time': 0.54, 'text': 'oʊ'},
{'end_time': 0.62, 'start_time': 0.58, 'text': 't'},
{'end_time': 0.78, 'start_time': 0.76, 'text': 'ʌ'},
{'end_time': 0.94, 'start_time': 0.92, 'text': 'p'}]]
You can manually check the model output by playing back segments of your input audio file, using the start and end timestamps.
First, load your audio file.
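A minimal sketch, assuming pydub is used here; the millisecond slicing inside play_segment below matches pydub's AudioSegment interface:

from pydub import AudioSegment

audio = AudioSegment.from_file("sample.wav")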
You can use the following function to play a segment of your audio at a given offset:
def play_segment(offsets, index: int):
    # Timestamps are given in seconds; pydub slices AudioSegments in milliseconds
    start = offsets[index]["start_time"]
    end = offsets[index]["end_time"]
    # Print the phoneme token, then return the playable slice
    print(offsets[index]["text"])
    return audio[start * 1000 : end * 1000]
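For example, to print and play back the first phoneme token (indexing the first element of the nested output; output_offsets is the variable from the hedged predict sketch above):

segment = play_segment(output_offsets[0], 0)

You can then listen to the returned slice, for instance with pydub.playback.play(segment).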
Here are some examples of the phoneme segments, produced as sketched below.
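Each line below is the token printed by one such call, e.g. from a loop over all offsets (again assuming output_offsets from above):

for i in range(len(output_offsets[0])):
    play_segment(output_offsets[0], i)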
ɪ
t
ɪ
z
n
oʊ
t
ʌ
p