# Config

## Example Config File
```json
{
    "do_classify": true,
    "filter_empty_transcript": true,
    "classifier": {
        "model": "bookbot/distil-wav2vec2-adult-child-cls-52m",
        "max_duration_s": 3.0
    },
    "transcriber": {
        "type": "wav2vec2",
        "model": "bookbot/wav2vec2-bookbot-en-lm",
        "return_timestamps": "word",
        "chunk_length_s": 30
    },
    "do_noise_classify": true,
    "noise_classifier": {
        "model": "bookbot/distil-ast-audioset",
        "minimum_empty_duration": 0.3,
        "threshold": 0.2
    },
    "segmenter": {
        "type": "word_overlap",
        "minimum_chunk_duration": 1.0
    }
}
```
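As a rough usage sketch (assuming the JSON above is saved as `config.json`), this file can be parsed with the `Config` class documented at the bottom of this page:

```python
from speechline.config import Config

# Parse and validate the example file above;
# each block is loaded into its corresponding subconfig dataclass.
config = Config("config.json")

print(config.transcriber.model)  # bookbot/wav2vec2-bookbot-en-lm
print(config.segmenter.minimum_chunk_duration)  # 1.0
```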
## speechline.config.NoiseClassifierConfig

`dataclass`

Noise classifier config.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `str` | HuggingFace Hub model checkpoint. | *required* |
| `minimum_empty_duration` | `float` | Minimum non-transcribed segment duration to be segmented and passed to the noise classifier. Defaults to `1.0` seconds. | `1.0` |
| `threshold` | `float` | Probability threshold for the multi-label classification. Defaults to `0.3`. | `0.3` |
| `batch_size` | `int` | Batch size during inference. Defaults to `1`. | `1` |
Source code in `speechline/config.py`:

```python
@dataclass
class NoiseClassifierConfig:
    """
    Noise classifier config.

    Args:
        model (str):
            HuggingFace Hub model checkpoint.
        minimum_empty_duration (float, optional):
            Minimum non-transcribed segment duration to be segmented
            and passed to the noise classifier.
            Defaults to `1.0` seconds.
        threshold (float, optional):
            The probability threshold for the multi-label classification.
            Defaults to `0.3`.
        batch_size (int, optional):
            Batch size during inference. Defaults to `1`.
    """

    model: str
    minimum_empty_duration: float = 1.0
    threshold: float = 0.3
    batch_size: int = 1
```
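As a minimal sketch, constructing the config directly shows which fields are required and which fall back to defaults (the checkpoint name is taken from the example config above):

```python
from speechline.config import NoiseClassifierConfig

# Only `model` is required; unset fields keep their defaults.
noise_config = NoiseClassifierConfig(
    model="bookbot/distil-ast-audioset", threshold=0.2
)
print(noise_config.minimum_empty_duration)  # 1.0 (default)
print(noise_config.batch_size)  # 1 (default)
```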
## speechline.config.ClassifierConfig

`dataclass`

Audio classifier config.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `str` | HuggingFace Hub model checkpoint. | *required* |
| `max_duration_s` | `float` | Maximum audio duration for padding. Defaults to `3.0` seconds. | `3.0` |
| `batch_size` | `int` | Batch size during inference. Defaults to `1`. | `1` |
Source code in `speechline/config.py`:

```python
@dataclass
class ClassifierConfig:
    """
    Audio classifier config.

    Args:
        model (str):
            HuggingFace Hub model checkpoint.
        max_duration_s (float, optional):
            Maximum audio duration for padding. Defaults to `3.0` seconds.
        batch_size (int, optional):
            Batch size during inference. Defaults to `1`.
    """

    model: str
    max_duration_s: float = 3.0
    batch_size: int = 1
```
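For instance, mirroring the `classifier` block of the example config:

```python
from speechline.config import ClassifierConfig

classifier_config = ClassifierConfig(
    model="bookbot/distil-wav2vec2-adult-child-cls-52m"
)
print(classifier_config.max_duration_s)  # 3.0 (default)
print(classifier_config.batch_size)  # 1 (default)
```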
## speechline.config.TranscriberConfig

`dataclass`

Audio transcriber config.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `type` | `str` | Transcriber model architecture type. | *required* |
| `model` | `str` | HuggingFace Hub model checkpoint. | *required* |
| `return_timestamps` | `Union[str, bool]` | `return_timestamps` argument in `AutomaticSpeechRecognitionPipeline`'s `__call__` method. Use `"word"` or `"char"` for wav2vec 2.0 models and `True` for Whisper-based models. | *required* |
| `chunk_length_s` | `int` | Audio chunk length in seconds. | *required* |
Source code in `speechline/config.py`:

```python
@dataclass
class TranscriberConfig:
    """
    Audio transcriber config.

    Args:
        type (str):
            Transcriber model architecture type.
        model (str):
            HuggingFace Hub model checkpoint.
        return_timestamps (Union[str, bool]):
            `return_timestamps` argument in `AutomaticSpeechRecognitionPipeline`'s
            `__call__` method. Use `"word"` or `"char"` for wav2vec 2.0
            (CTC-based) models and `True` for Whisper-based models.
        chunk_length_s (int):
            Audio chunk length in seconds.
    """

    type: str
    model: str
    return_timestamps: Union[str, bool]
    chunk_length_s: int

    def __post_init__(self):
        SUPPORTED_MODELS = {"wav2vec2", "whisper"}
        WAV2VEC_TIMESTAMPS = {"word", "char"}
        if self.type not in SUPPORTED_MODELS:
            raise ValueError(f"Transcriber of type {self.type} is not yet supported!")
        if self.type == "wav2vec2" and self.return_timestamps not in WAV2VEC_TIMESTAMPS:
            raise ValueError("wav2vec2 only supports `'word'` or `'char'` timestamps!")
        elif self.type == "whisper" and self.return_timestamps is not True:
            raise ValueError("Whisper only supports `True` timestamps!")
```
## speechline.config.SegmenterConfig

`dataclass`

Audio segmenter config.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `type` | `str` | Segmenter type. | *required* |
| `silence_duration` | `float` | Minimum in-between silence duration (in seconds) to consider as gaps. Defaults to `0.0` seconds. | `0.0` |
| `minimum_chunk_duration` | `float` | Minimum chunk duration (in seconds) to be exported. Defaults to `0.2` seconds. | `0.2` |
| `lexicon_path` | `str` | Path to lexicon file. Defaults to `None`. | `None` |
| `keep_whitespace` | `bool` | Whether to keep whitespace in transcript. Defaults to `False`. | `False` |
Source code in `speechline/config.py`:

```python
@dataclass
class SegmenterConfig:
    """
    Audio segmenter config.

    Args:
        type (str):
            Segmenter type.
        silence_duration (float, optional):
            Minimum in-between silence duration (in seconds) to consider as gaps.
            Defaults to `0.0` seconds.
        minimum_chunk_duration (float, optional):
            Minimum chunk duration (in seconds) to be exported.
            Defaults to `0.2` seconds.
        lexicon_path (str, optional):
            Path to lexicon file. Defaults to `None`.
        keep_whitespace (bool, optional):
            Whether to keep whitespace in transcript. Defaults to `False`.
    """

    type: str
    silence_duration: float = 0.0
    minimum_chunk_duration: float = 0.2
    lexicon_path: str = None
    keep_whitespace: bool = False

    def __post_init__(self):
        SUPPORTED_TYPES = {"silence", "word_overlap", "phoneme_overlap"}
        if self.type not in SUPPORTED_TYPES:
            raise ValueError(f"Segmenter of type {self.type} is not yet supported!")
```
## speechline.config.Config

`dataclass`

Main SpeechLine config, contains all other subconfigs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to JSON config file. | *required* |
Source code in `speechline/config.py`:

```python
@dataclass
class Config:
    """
    Main SpeechLine config, contains all other subconfigs.

    Args:
        path (str):
            Path to JSON config file.
    """

    path: str

    def __post_init__(self):
        with open(self.path) as file:
            config = json.load(file)
        self.do_classify = config.get("do_classify", False)
        self.do_noise_classify = config.get("do_noise_classify", False)
        self.filter_empty_transcript = config.get("filter_empty_transcript", False)
        if self.do_classify:
            self.classifier = ClassifierConfig(**config["classifier"])
        if self.do_noise_classify:
            self.noise_classifier = NoiseClassifierConfig(**config["noise_classifier"])
        self.transcriber = TranscriberConfig(**config["transcriber"])
        self.segmenter = SegmenterConfig(**config["segmenter"])
```
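Note that `classifier` and `noise_classifier` are only set when their respective flags are enabled, so downstream code should guard on the flags. A minimal sketch, assuming the example `config.json` from the top of this page:

```python
from speechline.config import Config

config = Config("config.json")

# Subconfigs behind flags may not exist; check the flag first.
if config.do_classify:
    print(config.classifier.model)
if config.do_noise_classify:
    print(config.noise_classifier.threshold)  # 0.2
```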