Label Pipeline

This repository hosts the AWS Lambda scripts needed to run an automated audio labeling pipeline. The main components of the pipeline are:

Component | Description
Audio Transcription using AWS Transcribe | Transcribes incoming audio files stored in S3 using AWS Transcribe. After transcription, aligns the audio against ground-truth values and saves the annotations.
Audio Splitting | Segments audio based on the aligned transcriptions and splits it into separate files before saving them back to S3.
Audio Adult/Child Classifier | Classifies incoming audio files stored in S3 as either adult or child audio.
Integration with AirTable Dashboards | Exports AirTable audio annotations (transcripts and labels) to S3 by moving files according to their labels.
Audio Recording Logger | Logs daily audio recording data from S3 Inventory to AirTable.
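
Several of the components above act on incoming audio files stored in S3. The following is a minimal sketch of how a Lambda function might pick those files out of an S3 event notification. It assumes the functions are triggered by S3 object-created events, which this README does not state. The names `AUDIO_EXTENSIONS` and `audio_objects_from_event`, and the list of extensions, are hypothetical and not taken from the repository.

```python
from urllib.parse import unquote_plus

# Hypothetical set of audio extensions; the repository may accept others.
AUDIO_EXTENSIONS = (".wav", ".m4a", ".aac", ".mp3")


def audio_objects_from_event(event):
    """Return (bucket, key) pairs for audio files named in an S3 event."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in event notifications (spaces become "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.lower().endswith(AUDIO_EXTENSIONS):
            objects.append((bucket, key))
    return objects


def lambda_handler(event, context):
    """Entry point sketch: report which audio files would be processed."""
    objects = audio_objects_from_event(event)
    return {"audio_files": [f"s3://{bucket}/{key}" for bucket, key in objects]}
```

In a real component, the handler would pass each bucket and key on to the relevant step (transcription, classification, and so on) instead of only returning the list. Each subdirectory's README describes the actual behavior.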

For more details on each component, see the README file in its subdirectory.

Pipeline Overview

The high-level overview of this pipeline is shown below.

Installation

git clone https://github.com/bookbot-kids/label-pipeline.git
cd label-pipeline
pip install -r requirements.txt

References

@misc{label-studio-no-date,
    author = {{Label Studio}},
    title = {{Improve Audio Transcriptions with Label Studio}},
    url = {https://labelstud.io/blog/Improve-Audio-Transcriptions-with-Label-Studio.html},
}

Contributors