Config

Example Config File

example_config.json
{
    "do_classify": true,
    "filter_empty_transcript": true,
    "classifier": {
        "model": "bookbot/distil-wav2vec2-adult-child-cls-52m",
        "max_duration_s": 3.0
    },
    "transcriber": {
        "type": "wav2vec2",
        "model": "bookbot/wav2vec2-bookbot-en-lm",
        "return_timestamps": "word",
        "chunk_length_s": 30
    },
    "do_noise_classify": true,
    "noise_classifier": {
        "model": "bookbot/distil-ast-audioset",
        "minimum_empty_duration": 0.3,
        "threshold": 0.2
    },
    "segmenter": {
        "type": "word_overlap",
        "minimum_chunk_duration": 1.0
    }
}
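
Assuming the JSON above is saved as example_config.json, loading it with the Config dataclass documented below might look like this sketch:

from speechline.config import Config

config = Config(path="example_config.json")

# Subconfigs are built from the matching JSON sections; classifier and
# noise_classifier exist only because do_classify and do_noise_classify
# are both true in the file above.
print(config.transcriber.model)           # bookbot/wav2vec2-bookbot-en-lm
print(config.classifier.max_duration_s)   # 3.0
print(config.noise_classifier.threshold)  # 0.2
print(config.segmenter.type)              # word_overlap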

speechline.config.NoiseClassifierConfig dataclass

Noise classifier config.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `str` | HuggingFace Hub model checkpoint. | required |
| `minimum_empty_duration` | `float` | Minimum non-transcribed segment duration (in seconds) to be segmented and passed to the noise classifier. | `1.0` |
| `threshold` | `float` | Probability threshold for multi-label classification. | `0.3` |
| `batch_size` | `int` | Batch size during inference. | `1` |
Source code in speechline/config.py
@dataclass
class NoiseClassifierConfig:
    """
    Noise classifier config.

    Args:
        model (str):
            HuggingFace Hub model checkpoint.
        minimum_empty_duration (float, optional):
            Minimum non-transcribed segment duration (in seconds) to be
            segmented and passed to the noise classifier.
            Defaults to `1.0` seconds.
        threshold (float, optional):
            The probability threshold for the multi label classification.
            Defaults to `0.3`.
        batch_size (int, optional):
            Batch size during inference. Defaults to `1`.

    """

    model: str
    minimum_empty_duration: float = 1.0
    threshold: float = 0.3
    batch_size: int = 1
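
As a quick sketch, the noise_classifier values from the example file above can be passed directly to the dataclass; unset fields keep their defaults:

from speechline.config import NoiseClassifierConfig

cfg = NoiseClassifierConfig(
    model="bookbot/distil-ast-audioset",
    minimum_empty_duration=0.3,
    threshold=0.2,
)
assert cfg.batch_size == 1  # default batch size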

speechline.config.ClassifierConfig dataclass

Audio classifier config.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `str` | HuggingFace Hub model checkpoint. | required |
| `max_duration_s` | `float` | Maximum audio duration (in seconds) for padding. | `3.0` |
| `batch_size` | `int` | Batch size during inference. | `1` |
Source code in speechline/config.py
@dataclass
class ClassifierConfig:
    """
    Audio classifier config.

    Args:
        model (str):
            HuggingFace Hub model checkpoint.
        max_duration_s (float, optional):
            Maximum audio duration for padding. Defaults to `3.0` seconds.
        batch_size (int, optional):
            Batch size during inference. Defaults to `1`.
    """

    model: str
    max_duration_s: float = 3.0
    batch_size: int = 1
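
A similar sketch for the classifier section of the example file; only model is required:

from speechline.config import ClassifierConfig

cfg = ClassifierConfig(model="bookbot/distil-wav2vec2-adult-child-cls-52m")
assert cfg.max_duration_s == 3.0  # default padding duration
assert cfg.batch_size == 1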

speechline.config.TranscriberConfig dataclass

Audio transcriber config.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `type` | `str` | Transcriber model architecture type. | required |
| `model` | `str` | HuggingFace Hub model checkpoint (not required for `'gentle'`). | `None` |
| `return_timestamps` | `Union[str, bool]` | `return_timestamps` argument in `AutomaticSpeechRecognitionPipeline`'s `__call__` method. Use `"word"` or `"char"` for CTC-based models and `True` for Whisper-based models. | `None` |
| `chunk_length_s` | `int` | Audio chunk length in seconds. | `None` |
| `transcriber_device` | `str` | Device on which the transcriber runs. | `'cuda'` |
| `torch_dtype` | `str` | Torch dtype for model weights (e.g., `'float16'`). Used by the Canary transcriber. | `None` |
| `validate_alignment` | `bool` | Enable alignment validation mode. | `False` |
| `token_confidence_threshold` | `float` | Minimum confidence for token acceptance (0-1). | `0.7` |
| `min_alignment_ratio` | `float` | Minimum fraction of tokens that must align (0-1). | `0.8` |
| `nfa_model` | `str` | NeMo model for forced alignment. | `'nvidia/parakeet-ctc-1.1b'` |
| `gentle_path` | `str` | Path to Gentle installation. Only used when type is `"gentle"`. | `'/mnt/4090_projects/Projects/AudioProcessing/gentle'` |
| `output_phonemes` | `bool` | Include phoneme sequences in Gentle output. | `True` |
| `output_word_boundaries` | `bool` | Include word boundary timestamps in Gentle output. | `True` |
Source code in speechline/config.py
@dataclass
class TranscriberConfig:
    """
    Audio transcriber config.

    Args:
        type (str):
            Transcriber model architecture type.
        model (str, optional):
            HuggingFace Hub model checkpoint (not required for `'gentle'`).
        return_timestamps (Union[str, bool], optional):
            `return_timestamps` argument in `AutomaticSpeechRecognitionPipeline`'s
            `__call__` method. Use `"word"` or `"char"` for CTC-based models and
            `True` for Whisper-based models.
        chunk_length_s (int, optional):
            Audio chunk length in seconds.
        transcriber_device (str, optional):
            Device on which the transcriber runs. Defaults to `"cuda"`.
        torch_dtype (str, optional):
            Torch dtype for model weights (e.g., `'float16'`). Used by the Canary transcriber.
        validate_alignment (bool, optional):
            Enable alignment validation mode. Defaults to False.
        token_confidence_threshold (float, optional):
            Minimum confidence for token acceptance (0-1). Defaults to 0.7.
        min_alignment_ratio (float, optional):
            Minimum fraction of tokens that must align (0-1). Defaults to 0.8.
        nfa_model (str, optional):
            NeMo model for forced alignment. Defaults to "nvidia/parakeet-ctc-1.1b".
        gentle_path (str, optional):
            Path to Gentle installation. Only used when type is `"gentle"`.
            Defaults to `"/mnt/4090_projects/Projects/AudioProcessing/gentle"`.
        output_phonemes (bool, optional):
            Include phoneme sequences in Gentle output. Defaults to True.
        output_word_boundaries (bool, optional):
            Include word boundary timestamps in Gentle output. Defaults to True.
    """

    type: str
    model: str = None
    return_timestamps: Union[str, bool] = None
    chunk_length_s: Optional[int] = None
    transcriber_device: str = "cuda"
    torch_dtype: Optional[str] = None
    validate_alignment: bool = False
    token_confidence_threshold: float = 0.7
    min_alignment_ratio: float = 0.8
    nfa_model: str = "nvidia/parakeet-ctc-1.1b"
    gentle_path: str = "/mnt/4090_projects/Projects/AudioProcessing/gentle"
    output_phonemes: bool = True
    output_word_boundaries: bool = True

    def __post_init__(self):
        SUPPORTED_MODELS = {"wav2vec2", "whisper", "parakeet", "parakeet_tdt", "canary", "gentle"}
        WAV2VEC_TIMESTAMPS = {"word", "char"}
        PARAKEET_TIMESTAMPS = {"word"}
        PARAKEET_TDT_TIMESTAMPS = {"word", "char"}
        GENTLE_TIMESTAMPS = {"word"}

        if self.type not in SUPPORTED_MODELS:
            raise ValueError(f"Transcriber of type {self.type} is not yet supported!")

        # Gentle has different requirements
        if self.type == "gentle":
            # Gentle doesn't need model checkpoint
            if self.return_timestamps is None:
                self.return_timestamps = "word"
            if self.return_timestamps not in GENTLE_TIMESTAMPS:
                raise ValueError("gentle only supports `'word'` timestamps!")
            return

        # All other transcribers require model
        if self.model is None:
            raise ValueError(f"model is required for {self.type} transcriber")

        if self.type == "wav2vec2" and self.return_timestamps not in WAV2VEC_TIMESTAMPS:
            raise ValueError("wav2vec2 only supports `'word'` or `'char'` timestamps!")
        elif self.type == "parakeet" and self.return_timestamps not in PARAKEET_TIMESTAMPS:
            raise ValueError("parakeet only supports `word` timestamps!")
        elif self.type == "parakeet_tdt" and self.return_timestamps not in PARAKEET_TDT_TIMESTAMPS:
            raise ValueError("parakeet_tdt only supports `'word'` or `'char'` timestamps!")
        elif self.type in {"whisper", "canary"} and self.return_timestamps is not True:
            raise ValueError(f"{self.type} only supports `True` timestamps!")

        # Add validation for chunk_length_s requirement
        if self.type in {"wav2vec2", "whisper", "canary", "parakeet_tdt"} and self.chunk_length_s is None:
            raise ValueError(f"chunk_length_s is required for {self.type} models")

speechline.config.SegmenterConfig dataclass

Audio segmenter config.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `type` | `str` | Segmenter type. One of `"silence"`, `"word_overlap"`, or `"phoneme_overlap"`. | required |
| `silence_duration` | `float` | Minimum in-between silence duration (in seconds) to consider as gaps. | `0.0` |
| `minimum_chunk_duration` | `float` | Minimum chunk duration (in seconds) to be exported. | `0.2` |
| `lexicon_path` | `str` | Path to lexicon file. | `None` |
| `keep_whitespace` | `bool` | Whether to keep whitespace in transcript. | `False` |
| `segment_with_ground_truth` | `bool` | Whether to segment using ground-truth transcripts. | `False` |
Source code in speechline/config.py
@dataclass
class SegmenterConfig:
    """
    Audio segmenter config.

    Args:
        type (str):
            Segmenter type. One of `"silence"`, `"word_overlap"`,
            or `"phoneme_overlap"`.
        silence_duration (float, optional):
            Minimum in-between silence duration (in seconds) to consider as gaps.
            Defaults to `0.0` seconds.
        minimum_chunk_duration (float, optional):
            Minimum chunk duration (in seconds) to be exported.
            Defaults to `0.2` seconds.
        lexicon_path (str, optional):
            Path to lexicon file. Defaults to `None`.
        keep_whitespace (bool, optional):
            Whether to keep whitespace in transcript. Defaults to `False`.
        segment_with_ground_truth (bool, optional):
            Whether to segment using ground-truth transcripts. Defaults to `False`.
    """

    type: str
    silence_duration: float = 0.0
    minimum_chunk_duration: float = 0.2
    lexicon_path: str = None
    keep_whitespace: bool = False
    segment_with_ground_truth: bool = False

    def __post_init__(self):
        SUPPORTED_TYPES = {"silence", "word_overlap", "phoneme_overlap"}

        if self.type not in SUPPORTED_TYPES:
            raise ValueError(f"Segmenter of type {self.type} is not yet supported!")

speechline.config.Config dataclass

Main SpeechLine config, contains all other subconfigs.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `path` | `str` | Path to JSON config file. | required |
Source code in speechline/config.py
@dataclass
class Config:
    """
    Main SpeechLine config, contains all other subconfigs.

    Args:
        path (str):
            Path to JSON config file.
    """

    path: str

    def __post_init__(self):
        with open(self.path) as f:
            config = json.load(f)
        self.do_classify = config.get("do_classify", False)
        self.do_noise_classify = config.get("do_noise_classify", False)
        self.filter_empty_transcript = config.get("filter_empty_transcript", False)
        self.audio_extension = config.get("audio_extension", "wav")
        self.folder_filter = config.get("folder_filter", None)

        if self.do_classify:
            self.classifier = ClassifierConfig(**config["classifier"])

        if self.do_noise_classify:
            self.noise_classifier = NoiseClassifierConfig(**config["noise_classifier"])

        self.transcriber = TranscriberConfig(**config["transcriber"])
        self.segmenter = SegmenterConfig(**config["segmenter"])
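
As an end-to-end illustration, the following self-contained sketch writes a minimal config to a temporary file and parses it; "transcriber" and "segmenter" are always required, while "classifier" and "noise_classifier" are only read when their do_* flags are set:

import json
import tempfile

from speechline.config import Config

minimal = {
    "transcriber": {
        "type": "wav2vec2",
        "model": "bookbot/wav2vec2-bookbot-en-lm",
        "return_timestamps": "word",
        "chunk_length_s": 30,
    },
    "segmenter": {"type": "silence"},
}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(minimal, f)
    path = f.name

config = Config(path)
assert config.do_classify is False      # defaults to False
assert config.audio_extension == "wav"  # defaults to "wav"
assert config.segmenter.silence_duration == 0.0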