Performing Audio Transcription
This guide will explain how to transcribe audio files using SpeechLine.
First, load your transcription model by passing its Hugging Face model checkpoint into `Wav2Vec2Transcriber`.
```python
from speechline.transcribers import Wav2Vec2Transcriber

transcriber = Wav2Vec2Transcriber("bookbot/wav2vec2-ljspeech-gruut")
```
Next, transform your input audio file (here, `sample.wav`) into a Hugging Face `Dataset`, resampled to the transcriber's expected sampling rate:
```python
from datasets import Dataset, Audio

dataset = Dataset.from_dict({"audio": ["sample.wav"]})
# Resample the audio to the sampling rate the transcriber expects
dataset = dataset.cast_column("audio", Audio(sampling_rate=transcriber.sampling_rate))
```
Once preprocessing is finished, simply pass the input data into the transcriber.
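A minimal sketch of the call, assuming the transcriber exposes a `predict` method that accepts the dataset and an `output_offsets` flag for per-phoneme timestamps (both names are assumptions, so check the SpeechLine API):

```python
# Sketch: `predict` and `output_offsets` are assumed names,
# not verified against the SpeechLine API.
transcripts = transcriber.predict(dataset, output_offsets=True)
```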
```
Transcribing Audios:   0%|          | 0/1 [00:00<?, ?ex/s]
```
The output format of the transcription model is shown below. It is a list of dictionaries containing the transcribed `text`, plus the `start_time` and `end_time` stamps of the corresponding phoneme token.
```python
[[{'end_time': 0.02, 'start_time': 0.0, 'text': 'ɪ'},
  {'end_time': 0.3, 'start_time': 0.26, 'text': 't'},
  {'end_time': 0.36, 'start_time': 0.34, 'text': 'ɪ'},
  {'end_time': 0.44, 'start_time': 0.42, 'text': 'z'},
  {'end_time': 0.54, 'start_time': 0.5, 'text': 'n'},
  {'end_time': 0.58, 'start_time': 0.54, 'text': 'oʊ'},
  {'end_time': 0.62, 'start_time': 0.58, 'text': 't'},
  {'end_time': 0.78, 'start_time': 0.76, 'text': 'ʌ'},
  {'end_time': 0.94, 'start_time': 0.92, 'text': 'p'}]]
```
You can manually check the model output by playing a segment (using the start and end timestamps) of your input audio file.
First, load your audio file.
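The helper function below slices audio in milliseconds, which matches pydub's `AudioSegment` API; a minimal sketch assuming pydub is installed:

```python
from pydub import AudioSegment

# Load the same file that was transcribed above
audio = AudioSegment.from_wav("sample.wav")
```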
You can use the following function to play a segment of your audio at a given offset index:
```python
def play_segment(offsets, index: int):
    """Print the phoneme at `index` and return its audio slice."""
    start = offsets[index]["start_time"]
    end = offsets[index]["end_time"]
    print(offsets[index]["text"])
    # pydub slices audio segments in milliseconds
    return audio[start * 1000 : end * 1000]
```
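For example, assuming the transcription output above is stored in `transcripts` (a list with one entry per audio file):

```python
# Offsets for the first (and only) audio file
offsets = transcripts[0]
play_segment(offsets, 0)  # prints "ɪ" and returns the 0.0–0.02 s slice
```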
Here are some examples of the phoneme segments, printed in order by `play_segment`:

`ɪ`, `t`, `ɪ`, `z`, `n`, `oʊ`, `t`, `ʌ`, `p`