Example Config File

{
    "do_classify": true,
    "filter_empty_transcript": true,
    "classifier": {
        "model": "bookbot/distil-wav2vec2-adult-child-cls-52m",
        "max_duration_s": 3.0
    },
    "transcriber": {
        "type": "wav2vec2",
        "model": "bookbot/wav2vec2-bookbot-en-lm",
        "return_timestamps": "word",
        "chunk_length_s": 30
    },
    "do_noise_classify": true,
    "noise_classifier": {
        "model": "bookbot/distil-ast-audioset",
        "minimum_empty_duration": 0.3,
        "threshold": 0.2
    },
    "segmenter": {
        "type": "word_overlap",
        "minimum_chunk_duration": 1.0
    }
}

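Before handing a config file to SpeechLine, it can help to parse it with the standard `json` module and confirm that every enabled stage has its sub-config. The sketch below is an illustration only: `check_config` is a hypothetical helper, not part of SpeechLine, and its rules are inferred from how the `Config` class on this page reads the file.

```python
import json

# A trimmed version of the example config; model names are taken from it.
EXAMPLE = """
{
    "do_classify": true,
    "classifier": {"model": "bookbot/distil-wav2vec2-adult-child-cls-52m"},
    "transcriber": {
        "type": "wav2vec2",
        "model": "bookbot/wav2vec2-bookbot-en-lm",
        "return_timestamps": "word",
        "chunk_length_s": 30
    },
    "segmenter": {"type": "word_overlap"}
}
"""


def check_config(config: dict) -> list:
    """Return the top-level sections that are needed but missing (hypothetical helper)."""
    required = ["transcriber", "segmenter"]  # always read by Config
    if config.get("do_classify", False):
        required.append("classifier")
    if config.get("do_noise_classify", False):
        required.append("noise_classifier")
    return [key for key in required if key not in config]


config = json.loads(EXAMPLE)
missing = check_config(config)
print(missing)
```

An empty list means every section that `Config` would read is present; it does not validate the values inside each section.
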
speechline.config.NoiseClassifierConfig dataclass

Noise classifier config.


| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `str` | HuggingFace Hub model checkpoint. | required |
| `minimum_empty_duration` | `float` | Minimum duration (in seconds) of a non-transcribed segment for it to be segmented and passed to the noise classifier. | `1.0` |
| `threshold` | `float` | Probability threshold for multi-label classification. | `0.3` |
| `batch_size` | `int` | Batch size during inference. | `1` |

Source code in speechline/
from dataclasses import dataclass


@dataclass
class NoiseClassifierConfig:
    """
    Noise classifier config.

    Args:
        model (str):
            HuggingFace Hub model checkpoint.
        minimum_empty_duration (float, optional):
            Minimum duration (in seconds) of a non-transcribed segment
            for it to be segmented and passed to the noise classifier.
            Defaults to `1.0` seconds.
        threshold (float, optional):
            Probability threshold for multi-label classification.
            Defaults to `0.3`.
        batch_size (int, optional):
            Batch size during inference. Defaults to `1`.
    """

    model: str
    minimum_empty_duration: float = 1.0
    threshold: float = 0.3
    batch_size: int = 1
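
Because the sub-configs are plain dataclasses built with `**` unpacking, the keys in the JSON section must match the field names exactly. The sketch below re-declares the dataclass locally (so it runs without SpeechLine installed) and shows that the field is `minimum_empty_duration`; a shortened key such as `min_empty_duration` is rejected by the generated `__init__`.

```python
from dataclasses import dataclass


# Local stand-in mirroring the fields of NoiseClassifierConfig shown above;
# re-declared here only to keep the example self-contained.
@dataclass
class NoiseClassifierConfig:
    model: str
    minimum_empty_duration: float = 1.0
    threshold: float = 0.3
    batch_size: int = 1


# The "noise_classifier" section from the example config file.
section = {
    "model": "bookbot/distil-ast-audioset",
    "minimum_empty_duration": 0.3,
    "threshold": 0.2,
}
cfg = NoiseClassifierConfig(**section)
print(cfg)

# A misspelled key raises TypeError instead of being silently ignored.
try:
    NoiseClassifierConfig(model="bookbot/distil-ast-audioset", min_empty_duration=0.3)
    rejected = False
except TypeError:
    rejected = True
print(rejected)
```

Fields omitted from the JSON section, such as `batch_size` here, fall back to the dataclass defaults.
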

speechline.config.ClassifierConfig dataclass

Audio classifier config.


| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `str` | HuggingFace Hub model checkpoint. | required |
| `max_duration_s` | `float` | Maximum audio duration (in seconds) for padding. | `3.0` |
| `batch_size` | `int` | Batch size during inference. | `1` |

Source code in speechline/
from dataclasses import dataclass


@dataclass
class ClassifierConfig:
    """
    Audio classifier config.

    Args:
        model (str):
            HuggingFace Hub model checkpoint.
        max_duration_s (float, optional):
            Maximum audio duration for padding. Defaults to `3.0` seconds.
        batch_size (int, optional):
            Batch size during inference. Defaults to `1`.
    """

    model: str
    max_duration_s: float = 3.0
    batch_size: int = 1

speechline.config.TranscriberConfig dataclass

Audio transcriber config.


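| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `type` | `str` | Transcriber model architecture type. Supported values are `"wav2vec2"` and `"whisper"`. | required |
| `model` | `str` | HuggingFace Hub model checkpoint. | required |
| `return_timestamps` | `Union[str, bool]` | `return_timestamps` argument in `AutomaticSpeechRecognitionPipeline`'s `__call__` method. Use `"word"` or `"char"` for wav2vec2 (CTC-based) models and `True` for Whisper-based models. | required |
| `chunk_length_s` | `int` | Audio chunk length in seconds. | required |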

Source code in speechline/
from dataclasses import dataclass
from typing import Union


@dataclass
class TranscriberConfig:
    """
    Audio transcriber config.

    Args:
        type (str):
            Transcriber model architecture type.
        model (str):
            HuggingFace Hub model checkpoint.
        return_timestamps (Union[str, bool]):
            `return_timestamps` argument in `AutomaticSpeechRecognitionPipeline`'s
            `__call__` method. Use `"word"` or `"char"` for wav2vec2 (CTC-based)
            models and `True` for Whisper-based models.
        chunk_length_s (int):
            Audio chunk length in seconds.
    """

    type: str
    model: str
    return_timestamps: Union[str, bool]
    chunk_length_s: int

    def __post_init__(self):
        SUPPORTED_MODELS = {"wav2vec2", "whisper"}
        WAV2VEC_TIMESTAMPS = {"word", "char"}

        # Reject unknown architectures before checking timestamp modes.
        if self.type not in SUPPORTED_MODELS:
            raise ValueError(f"Transcriber of type {self.type} is not yet supported!")

        # Each architecture accepts only its own timestamp modes.
        if self.type == "wav2vec2" and self.return_timestamps not in WAV2VEC_TIMESTAMPS:
            raise ValueError("wav2vec2 only supports `'word'` or `'char'` timestamps!")
        elif self.type == "whisper" and self.return_timestamps is not True:
            raise ValueError("Whisper only supports `True` timestamps!")
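
The validation in `__post_init__` ties each transcriber type to specific `return_timestamps` values. The sketch below re-declares the dataclass locally so it runs without SpeechLine installed and probes a few combinations; `is_accepted` is a hypothetical helper, and `"placeholder/checkpoint"` and `"hubert"` are dummy values chosen only to exercise the checks.

```python
from dataclasses import dataclass
from typing import Union


# Local copy of the TranscriberConfig validation shown above.
@dataclass
class TranscriberConfig:
    type: str
    model: str
    return_timestamps: Union[str, bool]
    chunk_length_s: int

    def __post_init__(self):
        supported_models = {"wav2vec2", "whisper"}
        wav2vec_timestamps = {"word", "char"}
        if self.type not in supported_models:
            raise ValueError(f"Transcriber of type {self.type} is not yet supported!")
        if self.type == "wav2vec2" and self.return_timestamps not in wav2vec_timestamps:
            raise ValueError("wav2vec2 only supports `'word'` or `'char'` timestamps!")
        elif self.type == "whisper" and self.return_timestamps is not True:
            raise ValueError("Whisper only supports `True` timestamps!")


def is_accepted(type_: str, return_timestamps: Union[str, bool]) -> bool:
    """Return True if the combination passes __post_init__ validation."""
    try:
        # The model name is never loaded here, so a dummy string is enough.
        TranscriberConfig(type_, "placeholder/checkpoint", return_timestamps, 30)
        return True
    except ValueError:
        return False


combos = [
    ("wav2vec2", "word"),
    ("wav2vec2", "char"),
    ("wav2vec2", True),
    ("whisper", True),
    ("whisper", "word"),
    ("hubert", "word"),
]
for combo in combos:
    print(combo, is_accepted(*combo))
```

Note that the Whisper check uses `is not True`, so only the boolean `True` passes; a truthy value such as `1` or `"true"` would still be rejected.
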

speechline.config.SegmenterConfig dataclass

Audio segmenter config.


| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `type` | `str` | Segmenter type. Supported values are `"silence"`, `"word_overlap"`, and `"phoneme_overlap"`. | required |
| `silence_duration` | `float` | Minimum in-between silence duration (in seconds) to consider as a gap. | `0.0` |
| `minimum_chunk_duration` | `float` | Minimum chunk duration (in seconds) to be exported. | `0.2` |
| `lexicon_path` | `str` | Path to lexicon file. | `None` |
| `keep_whitespace` | `bool` | Whether to keep whitespace in the transcript. | `False` |

Source code in speechline/
from dataclasses import dataclass
from typing import Optional


@dataclass
class SegmenterConfig:
    """
    Audio segmenter config.

    Args:
        type (str):
            Segmenter type. One of `"silence"`, `"word_overlap"`,
            or `"phoneme_overlap"`.
        silence_duration (float, optional):
            Minimum in-between silence duration (in seconds) to consider as gaps.
            Defaults to `0.0` seconds.
        minimum_chunk_duration (float, optional):
            Minimum chunk duration (in seconds) to be exported.
            Defaults to `0.2` seconds.
        lexicon_path (str, optional):
            Path to lexicon file. Defaults to `None`.
        keep_whitespace (bool, optional):
            Whether to keep whitespace in transcript. Defaults to `False`.
    """

    type: str
    silence_duration: float = 0.0
    minimum_chunk_duration: float = 0.2
    lexicon_path: Optional[str] = None
    keep_whitespace: bool = False

    def __post_init__(self):
        SUPPORTED_TYPES = {"silence", "word_overlap", "phoneme_overlap"}

        if self.type not in SUPPORTED_TYPES:
            raise ValueError(f"Segmenter of type {self.type} is not yet supported!")
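
The two duration parameters can be read as a split-then-filter rule: start a new chunk wherever the silence between consecutive words reaches `silence_duration`, then drop chunks shorter than `minimum_chunk_duration`. The sketch below is one plausible interpretation of the parameter descriptions, not SpeechLine's actual segmenter; `split_on_silence` is a hypothetical function, and the word-offset format (`text`, `start_time`, `end_time`) is assumed for illustration.

```python
from typing import Dict, List, Union

Word = Dict[str, Union[str, float]]  # assumed keys: "text", "start_time", "end_time"


def split_on_silence(
    words: List[Word],
    silence_duration: float,
    minimum_chunk_duration: float,
) -> List[List[Word]]:
    """Group words into chunks separated by long silences (hypothetical sketch)."""
    chunks: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        # A gap at least as long as silence_duration closes the current chunk.
        if current and word["start_time"] - current[-1]["end_time"] >= silence_duration:
            chunks.append(current)
            current = []
        current.append(word)
    if current:
        chunks.append(current)
    # Keep only chunks long enough to be exported.
    return [
        chunk
        for chunk in chunks
        if chunk[-1]["end_time"] - chunk[0]["start_time"] >= minimum_chunk_duration
    ]


words = [
    {"text": "hello", "start_time": 0.0, "end_time": 0.4},
    {"text": "world", "start_time": 0.5, "end_time": 0.9},
    {"text": "ok", "start_time": 2.0, "end_time": 2.1},
    {"text": "good", "start_time": 3.0, "end_time": 3.3},
    {"text": "morning", "start_time": 3.4, "end_time": 3.9},
]
chunks = split_on_silence(words, silence_duration=0.5, minimum_chunk_duration=0.2)
texts = [[w["text"] for w in chunk] for chunk in chunks]
print(texts)
```

Under this reading, the isolated word "ok" forms its own chunk but lasts only about 0.1 seconds, so the minimum-duration filter drops it.
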

speechline.config.Config dataclass

Main SpeechLine config, contains all other subconfigs.


| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `path` | `str` | Path to JSON config file. | required |

Source code in speechline/
import json
from dataclasses import dataclass


@dataclass
class Config:
    """
    Main SpeechLine config, contains all other subconfigs.

    Args:
        path (str):
            Path to JSON config file.
    """

    path: str

    def __post_init__(self):
        # Use a context manager so the file handle is closed after parsing.
        with open(self.path) as f:
            config = json.load(f)
        self.do_classify = config.get("do_classify", False)
        self.do_noise_classify = config.get("do_noise_classify", False)
        self.filter_empty_transcript = config.get("filter_empty_transcript", False)

        # Optional sub-configs are only built when their flag is enabled.
        if self.do_classify:
            self.classifier = ClassifierConfig(**config["classifier"])

        if self.do_noise_classify:
            self.noise_classifier = NoiseClassifierConfig(**config["noise_classifier"])

        self.transcriber = TranscriberConfig(**config["transcriber"])
        self.segmenter = SegmenterConfig(**config["segmenter"])
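
One consequence of this loading logic is that `classifier` and `noise_classifier` exist as attributes only when their flags are enabled, so calling code should check `do_classify` and `do_noise_classify` first. The sketch below is a simplified stand-in for `Config` that keeps the sub-configs as plain dicts (an assumption made to keep the example short and runnable without SpeechLine) and loads a minimal file from a temporary directory.

```python
import json
import os
import tempfile
from dataclasses import dataclass


# Simplified stand-in for Config: same flag handling, but sub-configs stay
# as plain dicts instead of the dataclasses documented above.
@dataclass
class Config:
    path: str

    def __post_init__(self):
        with open(self.path) as f:
            config = json.load(f)
        self.do_classify = config.get("do_classify", False)
        self.do_noise_classify = config.get("do_noise_classify", False)
        self.filter_empty_transcript = config.get("filter_empty_transcript", False)
        if self.do_classify:
            self.classifier = config["classifier"]
        if self.do_noise_classify:
            self.noise_classifier = config["noise_classifier"]
        self.transcriber = config["transcriber"]
        self.segmenter = config["segmenter"]


# A minimal config: only the two sections that are always read.
minimal = {
    "transcriber": {
        "type": "wav2vec2",
        "model": "bookbot/wav2vec2-bookbot-en-lm",
        "return_timestamps": "word",
        "chunk_length_s": 30,
    },
    "segmenter": {"type": "silence"},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    with open(path, "w") as f:
        json.dump(minimal, f)
    cfg = Config(path)

has_classifier = hasattr(cfg, "classifier")
print(cfg.do_classify, has_classifier)
```

With the flags omitted they default to `False`, and accessing `cfg.classifier` would raise `AttributeError`.
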