Performing Audio Transcription

This guide will explain how to transcribe audio files using SpeechLine.

First, load your transcription model by passing its Hugging Face model checkpoint into Wav2Vec2Transcriber.

from speechline.transcribers import Wav2Vec2Transcriber

transcriber = Wav2Vec2Transcriber("bookbot/wav2vec2-ljspeech-gruut")

Next, you will need to transform your input audio file (here, sample.wav) into a Dataset, resampled to the transcriber's sampling rate, as follows:

from datasets import Dataset, Audio

dataset = Dataset.from_dict({"audio": ["sample.wav"]})
dataset = dataset.cast_column("audio", Audio(sampling_rate=transcriber.sampling_rate))
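
If you have more than one audio file to transcribe, the same preprocessing applies; simply list every file path in the dictionary. A minimal sketch (the second file name below is just a placeholder):

from datasets import Dataset, Audio

dataset = Dataset.from_dict({"audio": ["sample.wav", "another_sample.wav"]})
dataset = dataset.cast_column("audio", Audio(sampling_rate=transcriber.sampling_rate))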

Once preprocessing is finished, simply pass the input data into the transcriber.

phoneme_offsets = transcriber.predict(dataset, output_offsets=True, return_timestamps="char")
Transcribing Audios:   0%|          | 0/1 [00:00<?, ?ex/s]

The output format of the transcription model is shown below. It is a list with one entry per input audio file, where each entry is a list of dictionaries containing the transcribed text along with the start_time and end_time timestamps of the corresponding phoneme token.

phoneme_offsets
[[{'end_time': 0.02, 'start_time': 0.0, 'text': 'ɪ'},
  {'end_time': 0.3, 'start_time': 0.26, 'text': 't'},
  {'end_time': 0.36, 'start_time': 0.34, 'text': 'ɪ'},
  {'end_time': 0.44, 'start_time': 0.42, 'text': 'z'},
  {'end_time': 0.54, 'start_time': 0.5, 'text': 'n'},
  {'end_time': 0.58, 'start_time': 0.54, 'text': 'oʊ'},
  {'end_time': 0.62, 'start_time': 0.58, 'text': 't'},
  {'end_time': 0.78, 'start_time': 0.76, 'text': 'ʌ'},
  {'end_time': 0.94, 'start_time': 0.92, 'text': 'p'}]]
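
Because the offsets are plain Python dictionaries, you can post-process them directly. For example, a minimal sketch that joins the phoneme tokens of the first audio file into a single space-separated string:

transcript = " ".join(offset["text"] for offset in phoneme_offsets[0])
transcript
'ɪ t ɪ z n oʊ t ʌ p'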

You can manually verify the model output by playing back segments of your input audio file using the predicted start and end timestamps.

First, load your audio file.

from pydub import AudioSegment

audio = AudioSegment.from_file("sample.wav")
audio
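
As a quick sanity check, you can inspect the loaded audio's duration and sampling rate; duration_seconds and frame_rate are standard pydub attributes:

print(audio.duration_seconds, audio.frame_rate)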

You can use the following helper function to play the segment of your audio corresponding to a given offset:

def play_segment(offsets, index: int):
    # Print the phoneme text and return the corresponding audio slice.
    start = offsets[index]["start_time"]
    end = offsets[index]["end_time"]
    print(offsets[index]["text"])
    # pydub slices audio in milliseconds, so convert the second-based offsets.
    return audio[start * 1000 : end * 1000]

Here are some examples of the resulting phoneme segments:

play_segment(phoneme_offsets[0], 0)
ɪ

play_segment(phoneme_offsets[0], 1)
t

play_segment(phoneme_offsets[0], 2)
ɪ

play_segment(phoneme_offsets[0], 3)
z

play_segment(phoneme_offsets[0], 4)
n

play_segment(phoneme_offsets[0], 5)
oʊ

play_segment(phoneme_offsets[0], 6)
t

play_segment(phoneme_offsets[0], 7)
ʌ

play_segment(phoneme_offsets[0], 8)
p
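
If you want to keep a segment for later inspection rather than playing it inline, you can export it with pydub. A minimal sketch (the output file name is arbitrary):

segment = play_segment(phoneme_offsets[0], 5)
segment.export("phoneme_segment.wav", format="wav")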