Audio Labeling Pipeline

Label Pipeline

This repository hosts the AWS Lambda scripts for an automated audio labeling pipeline. The main components of the pipeline are:

Component	Description
Audio Transcription using AWS Transcribe	Transcribe incoming audio files stored in S3 using AWS Transcribe. After transcription, align each audio file against its ground-truth text and save the annotations.
Audio Splitting	Segment each audio file according to the aligned transcription and save the segments back to S3 as separate files.
Audio Adult/Child Classifier	Classify incoming audio files stored in S3 as either adult or child speech.
Integration with AirTable Dashboards	Export AirTable audio annotations (transcripts and labels) to S3 by moving files according to their labels.
Audio Recording Logger	Log daily audio recording data from S3 Inventory to AirTable.
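
As a rough illustration of the transcription component, the sketch below shows how a Lambda function might turn an S3 "object created" event into an AWS Transcribe job. It is not the repository's actual handler: the helper name `build_transcribe_request`, the default `en-US` language code, and the choice to write the transcript back to the source bucket are all assumptions made for this example.

```python
import posixpath
import re
from urllib.parse import unquote_plus


def build_transcribe_request(event: dict, language_code: str = "en-US") -> dict:
    """Build start_transcription_job keyword arguments from an S3 event.

    Hypothetical helper for illustration; the real pipeline may differ.
    """
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys in S3 event notifications are URL-encoded (spaces arrive as "+").
    key = unquote_plus(record["object"]["key"])
    stem, extension = posixpath.splitext(key)
    # Transcribe job names may only contain letters, digits, ".", "_" and "-".
    job_name = re.sub(r"[^0-9a-zA-Z._-]", "-", stem)[:200]
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,  # assumed default
        "MediaFormat": extension.lstrip(".").lower(),
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},
        "OutputBucketName": bucket,  # assumption: transcript goes to the same bucket
    }


def lambda_handler(event, context):
    # boto3 is imported lazily so the helper above can be exercised offline.
    import boto3

    request = build_transcribe_request(event)
    boto3.client("transcribe").start_transcription_job(**request)
    return {"job": request["TranscriptionJobName"]}
```

Keeping the request construction in a pure function means it can be unit-tested without AWS credentials; only `lambda_handler` touches the Transcribe API.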

For more details on each component, see the README file in its subdirectory.
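
To make the audio splitting step more concrete, here is a minimal sketch that cuts a WAV file into segments given aligned (start, end) times in seconds. It uses only Python's standard `wave` module and works on in-memory bytes. The function name `split_wav` and the segment format are assumptions for illustration, and the repository's own splitter may use a different audio library or file format.

```python
import io
import wave


def split_wav(wav_bytes: bytes, segments: list[tuple[float, float]]) -> list[bytes]:
    """Cut one WAV file into several, one per (start_sec, end_sec) segment.

    Hypothetical helper for illustration; segment times are assumed to come
    from the alignment step and to satisfy start_sec <= end_sec.
    """
    chunks = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as source:
        params = source.getparams()
        rate = source.getframerate()
        total = source.getnframes()
        for start_sec, end_sec in segments:
            # Convert times to frame offsets, clamped to the file length.
            start_frame = min(int(start_sec * rate), total)
            end_frame = min(int(end_sec * rate), total)
            source.setpos(start_frame)
            frames = source.readframes(end_frame - start_frame)
            out = io.BytesIO()
            with wave.open(out, "wb") as target:
                # Reuse the source format; the frame count in the header is
                # corrected automatically when the writer is closed.
                target.setparams(params)
                target.writeframes(frames)
            chunks.append(out.getvalue())
    return chunks
```

In a Lambda setting, the input bytes would typically be read from S3 and each returned chunk uploaded back under its own key.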

Pipeline Overview

The high-level overview of this pipeline is shown below.

Installation

git clone https://github.com/bookbot-kids/label-pipeline.git
cd label-pipeline
pip install -r requirements.txt

References

@misc{label-studio-no-date,
    author = {{Label Studio}},
    title = {{Improve Audio Transcriptions with Label Studio}},
    url = {https://labelstud.io/blog/Improve-Audio-Transcriptions-with-Label-Studio.html},
}
