# Config

## Example Config File

```json
{
  "do_classify": true,
  "filter_empty_transcript": true,
  "classifier": {
    "model": "bookbot/distil-wav2vec2-adult-child-cls-52m",
    "max_duration_s": 3.0
  },
  "transcriber": {
    "type": "wav2vec2",
    "model": "bookbot/wav2vec2-bookbot-en-lm",
    "return_timestamps": "word",
    "chunk_length_s": 30
  },
  "do_noise_classify": true,
  "noise_classifier": {
    "model": "bookbot/distil-ast-audioset",
    "minimum_empty_duration": 0.3,
    "threshold": 0.2
  },
  "segmenter": {
    "type": "word_overlap",
    "minimum_chunk_duration": 1.0
  }
}
```
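Since the file is plain JSON, edits to it can be sanity-checked with the standard library alone. The sketch below embeds the example above verbatim, parses it, and verifies that each `do_*` flag has its matching section (the `Config` loader at the bottom of this page only reads `classifier` and `noise_classifier` when their flags are true):

```python
import json

# The example config above, embedded verbatim for a self-contained check.
EXAMPLE = """
{
  "do_classify": true,
  "filter_empty_transcript": true,
  "classifier": {
    "model": "bookbot/distil-wav2vec2-adult-child-cls-52m",
    "max_duration_s": 3.0
  },
  "transcriber": {
    "type": "wav2vec2",
    "model": "bookbot/wav2vec2-bookbot-en-lm",
    "return_timestamps": "word",
    "chunk_length_s": 30
  },
  "do_noise_classify": true,
  "noise_classifier": {
    "model": "bookbot/distil-ast-audioset",
    "minimum_empty_duration": 0.3,
    "threshold": 0.2
  },
  "segmenter": {
    "type": "word_overlap",
    "minimum_chunk_duration": 1.0
  }
}
"""

config = json.loads(EXAMPLE)

# The optional sections are only consumed when their flags are set,
# so check that the pairing holds before handing the file to speechline.
if config.get("do_classify", False):
    assert "classifier" in config
if config.get("do_noise_classify", False):
    assert "noise_classifier" in config

print(config["transcriber"]["model"])  # bookbot/wav2vec2-bookbot-en-lm
```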
## speechline.config.NoiseClassifierConfig

`dataclass`

Noise classifier config.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `str` | HuggingFace Hub model checkpoint. | *required* |
| `minimum_empty_duration` | `float` | Minimum non-transcribed segment duration (in seconds) to be segmented and passed to the noise classifier. | `1.0` |
| `threshold` | `float` | Probability threshold for multi-label classification. | `0.3` |
| `batch_size` | `int` | Batch size during inference. | `1` |
Source code in `speechline/config.py`:

```python
@dataclass
class NoiseClassifierConfig:
    """
    Noise classifier config.

    Args:
        model (str):
            HuggingFace Hub model checkpoint.
        minimum_empty_duration (float, optional):
            Minimum non-transcribed segment duration to be segmented
            and passed to the noise classifier.
            Defaults to `1.0` seconds.
        threshold (float, optional):
            The probability threshold for multi-label classification.
            Defaults to `0.3`.
        batch_size (int, optional):
            Batch size during inference. Defaults to `1`.
    """

    model: str
    minimum_empty_duration: float = 1.0
    threshold: float = 0.3
    batch_size: int = 1
```
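Only `model` is required; the other fields fall back to their defaults. The sketch below re-declares the dataclass from the source above so it runs standalone (outside the package it would be imported from `speechline.config` instead):

```python
from dataclasses import dataclass

# Re-declared from the source above so this snippet is self-contained.
@dataclass
class NoiseClassifierConfig:
    model: str
    minimum_empty_duration: float = 1.0
    threshold: float = 0.3
    batch_size: int = 1

# Only `model` is required; everything else takes its default.
cfg = NoiseClassifierConfig(model="bookbot/distil-ast-audioset")
print(cfg.minimum_empty_duration, cfg.threshold)  # 1.0 0.3

# Note the field name is `minimum_empty_duration`, matching the
# "noise_classifier" section of the example config file above.
cfg2 = NoiseClassifierConfig(
    model="bookbot/distil-ast-audioset",
    minimum_empty_duration=0.3,
    threshold=0.2,
)
```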
## speechline.config.ClassifierConfig

`dataclass`

Audio classifier config.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `str` | HuggingFace Hub model checkpoint. | *required* |
| `max_duration_s` | `float` | Maximum audio duration (in seconds) for padding. | `3.0` |
| `batch_size` | `int` | Batch size during inference. | `1` |
Source code in `speechline/config.py`:

```python
@dataclass
class ClassifierConfig:
    """
    Audio classifier config.

    Args:
        model (str):
            HuggingFace Hub model checkpoint.
        max_duration_s (float, optional):
            Maximum audio duration for padding. Defaults to `3.0` seconds.
        batch_size (int, optional):
            Batch size during inference. Defaults to `1`.
    """

    model: str
    max_duration_s: float = 3.0
    batch_size: int = 1
```
## speechline.config.TranscriberConfig

`dataclass`

Audio transcriber config.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `type` | `str` | Transcriber model architecture type. | *required* |
| `model` | `str` | HuggingFace Hub model checkpoint (not required for `"gentle"`). | `None` |
| `return_timestamps` | `Union[str, bool]` | `return_timestamps` argument in `AutomaticSpeechRecognitionPipeline`'s `__call__` method. Use `"char"` for CTC-based models and `True` for Whisper-based models. | `None` |
| `chunk_length_s` | `int` | Audio chunk length in seconds. | `None` |
| `transcriber_device` | `str` | Device on which to run the transcriber. | `'cuda'` |
| `torch_dtype` | `str` | Torch dtype for model weights (e.g. `'float16'`). Used by the Canary transcriber. | `None` |
| `validate_alignment` | `bool` | Enable alignment validation mode. | `False` |
| `token_confidence_threshold` | `float` | Minimum confidence for token acceptance (0-1). | `0.7` |
| `min_alignment_ratio` | `float` | Minimum fraction of tokens that must align (0-1). | `0.8` |
| `nfa_model` | `str` | NeMo model used for forced alignment. | `'nvidia/parakeet-ctc-1.1b'` |
| `gentle_path` | `str` | Path to the Gentle installation. Only used when `type` is `"gentle"`. | `'/mnt/4090_projects/Projects/AudioProcessing/gentle'` |
| `output_phonemes` | `bool` | Include phoneme sequences in Gentle output. | `True` |
| `output_word_boundaries` | `bool` | Include word boundary timestamps in Gentle output. | `True` |
Source code in `speechline/config.py`:

```python
@dataclass
class TranscriberConfig:
    """
    Audio transcriber config.

    Args:
        type (str):
            Transcriber model architecture type.
        model (str, optional):
            HuggingFace Hub model checkpoint (not required for 'gentle').
        return_timestamps (Union[str, bool]):
            `return_timestamps` argument in `AutomaticSpeechRecognitionPipeline`'s
            `__call__` method. Use `"char"` for CTC-based models and
            `True` for Whisper-based models.
        chunk_length_s (int):
            Audio chunk length in seconds.
        transcriber_device (str, optional):
            Device on which to run the transcriber. Defaults to `"cuda"`.
        torch_dtype (str, optional):
            Torch dtype for model weights (e.g., 'float16'). Used by the Canary
            transcriber.
        validate_alignment (bool, optional):
            Enable alignment validation mode. Defaults to `False`.
        token_confidence_threshold (float, optional):
            Minimum confidence for token acceptance (0-1). Defaults to `0.7`.
        min_alignment_ratio (float, optional):
            Minimum fraction of tokens that must align (0-1). Defaults to `0.8`.
        nfa_model (str, optional):
            NeMo model for forced alignment. Defaults to "nvidia/parakeet-ctc-1.1b".
        gentle_path (str, optional):
            Path to the Gentle installation. Only used when type is "gentle".
            Defaults to "/mnt/4090_projects/Projects/AudioProcessing/gentle".
        output_phonemes (bool, optional):
            Include phoneme sequences in Gentle output. Defaults to `True`.
        output_word_boundaries (bool, optional):
            Include word boundary timestamps in Gentle output. Defaults to `True`.
    """

    type: str
    model: str = None
    return_timestamps: Union[str, bool] = None
    chunk_length_s: Optional[int] = None
    transcriber_device: str = "cuda"
    torch_dtype: Optional[str] = None
    validate_alignment: bool = False
    token_confidence_threshold: float = 0.7
    min_alignment_ratio: float = 0.8
    nfa_model: str = "nvidia/parakeet-ctc-1.1b"
    gentle_path: str = "/mnt/4090_projects/Projects/AudioProcessing/gentle"
    output_phonemes: bool = True
    output_word_boundaries: bool = True

    def __post_init__(self):
        SUPPORTED_MODELS = {"wav2vec2", "whisper", "parakeet", "parakeet_tdt", "canary", "gentle"}
        WAV2VEC_TIMESTAMPS = {"word", "char"}
        PARAKEET_TIMESTAMPS = {"word"}
        PARAKEET_TDT_TIMESTAMPS = {"word", "char"}
        GENTLE_TIMESTAMPS = {"word"}

        if self.type not in SUPPORTED_MODELS:
            raise ValueError(f"Transcriber of type {self.type} is not yet supported!")

        # Gentle has different requirements
        if self.type == "gentle":
            # Gentle doesn't need a model checkpoint
            if self.return_timestamps is None:
                self.return_timestamps = "word"
            if self.return_timestamps not in GENTLE_TIMESTAMPS:
                raise ValueError("gentle only supports `'word'` timestamps!")
            return

        # All other transcribers require a model
        if self.model is None:
            raise ValueError(f"model is required for {self.type} transcriber")

        if self.type == "wav2vec2" and self.return_timestamps not in WAV2VEC_TIMESTAMPS:
            raise ValueError("wav2vec2 only supports `'word'` or `'char'` timestamps!")
        elif self.type == "parakeet" and self.return_timestamps not in PARAKEET_TIMESTAMPS:
            raise ValueError("parakeet only supports `'word'` timestamps!")
        elif self.type == "parakeet_tdt" and self.return_timestamps not in PARAKEET_TDT_TIMESTAMPS:
            raise ValueError("parakeet_tdt only supports `'word'` or `'char'` timestamps!")
        elif self.type in {"whisper", "canary"} and self.return_timestamps is not True:
            raise ValueError(f"{self.type} only supports `True` timestamps!")

        # chunk_length_s is required for chunk-based transcribers
        if self.type in {"wav2vec2", "whisper", "canary", "parakeet_tdt"} and self.chunk_length_s is None:
            raise ValueError(f"chunk_length_s is required for {self.type} models")
```
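The `__post_init__` checks above can be exercised without the full package. The sketch below re-declares a trimmed version of the dataclass (only the fields involved in validation; `TranscriberConfigDemo` is an illustrative stand-in, not the real import) to show which combinations pass and which raise:

```python
from dataclasses import dataclass
from typing import Optional, Union

# Trimmed re-declaration of TranscriberConfig's validation rules from the
# source above, so the checks can be run standalone. Fields that play no
# role in validation are omitted.
@dataclass
class TranscriberConfigDemo:
    type: str
    model: Optional[str] = None
    return_timestamps: Union[str, bool, None] = None
    chunk_length_s: Optional[int] = None

    def __post_init__(self):
        supported = {"wav2vec2", "whisper", "parakeet", "parakeet_tdt", "canary", "gentle"}
        if self.type not in supported:
            raise ValueError(f"Transcriber of type {self.type} is not yet supported!")
        if self.type == "gentle":
            # Gentle needs no checkpoint and defaults to word timestamps.
            if self.return_timestamps is None:
                self.return_timestamps = "word"
            if self.return_timestamps != "word":
                raise ValueError("gentle only supports `'word'` timestamps!")
            return
        if self.model is None:
            raise ValueError(f"model is required for {self.type} transcriber")
        if self.type == "wav2vec2" and self.return_timestamps not in {"word", "char"}:
            raise ValueError("wav2vec2 only supports `'word'` or `'char'` timestamps!")
        if self.type in {"whisper", "canary"} and self.return_timestamps is not True:
            raise ValueError(f"{self.type} only supports `True` timestamps!")
        if self.type in {"wav2vec2", "whisper", "canary", "parakeet_tdt"} and self.chunk_length_s is None:
            raise ValueError(f"chunk_length_s is required for {self.type} models")

# A valid wav2vec2 config: word timestamps plus a chunk length.
ok = TranscriberConfigDemo(
    type="wav2vec2",
    model="bookbot/wav2vec2-bookbot-en-lm",
    return_timestamps="word",
    chunk_length_s=30,
)

# Gentle requires no model and falls back to word timestamps.
gentle = TranscriberConfigDemo(type="gentle")
print(gentle.return_timestamps)  # word

# wav2vec2 without chunk_length_s is rejected at construction time.
try:
    TranscriberConfigDemo(type="wav2vec2", model="m", return_timestamps="word")
except ValueError as e:
    print(e)  # chunk_length_s is required for wav2vec2 models
```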
## speechline.config.SegmenterConfig

`dataclass`

Audio segmenter config.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `type` | `str` | Segmenter type. One of `"silence"`, `"word_overlap"`, or `"phoneme_overlap"`. | *required* |
| `silence_duration` | `float` | Minimum in-between silence duration (in seconds) to consider as a gap. | `0.0` |
| `minimum_chunk_duration` | `float` | Minimum chunk duration (in seconds) to be exported. | `0.2` |
| `lexicon_path` | `str` | Path to lexicon file. | `None` |
| `keep_whitespace` | `bool` | Whether to keep whitespace in the transcript. | `False` |
| `segment_with_ground_truth` | `bool` | Whether to segment using the ground-truth transcript. | `False` |
Source code in `speechline/config.py`:

```python
@dataclass
class SegmenterConfig:
    """
    Audio segmenter config.

    Args:
        type (str):
            Segmenter type. One of `"silence"`, `"word_overlap"`,
            or `"phoneme_overlap"`.
        silence_duration (float, optional):
            Minimum in-between silence duration (in seconds) to consider as gaps.
            Defaults to `0.0` seconds.
        minimum_chunk_duration (float, optional):
            Minimum chunk duration (in seconds) to be exported.
            Defaults to `0.2` seconds.
        lexicon_path (str, optional):
            Path to lexicon file. Defaults to `None`.
        keep_whitespace (bool, optional):
            Whether to keep whitespace in transcript. Defaults to `False`.
    """

    type: str
    silence_duration: float = 0.0
    minimum_chunk_duration: float = 0.2
    lexicon_path: str = None
    keep_whitespace: bool = False
    segment_with_ground_truth: bool = False

    def __post_init__(self):
        SUPPORTED_TYPES = {"silence", "word_overlap", "phoneme_overlap"}
        if self.type not in SUPPORTED_TYPES:
            raise ValueError(f"Segmenter of type {self.type} is not yet supported!")
```
## speechline.config.Config

`dataclass`

Main SpeechLine config; contains all other subconfigs.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to JSON config file. | *required* |

Source code in `speechline/config.py`:

```python
@dataclass
class Config:
    """
    Main SpeechLine config, contains all other subconfigs.

    Args:
        path (str):
            Path to JSON config file.
    """

    path: str

    def __post_init__(self):
        config = json.load(open(self.path))
        self.do_classify = config.get("do_classify", False)
        self.do_noise_classify = config.get("do_noise_classify", False)
        self.filter_empty_transcript = config.get("filter_empty_transcript", False)
        self.audio_extension = config.get("audio_extension", "wav")
        self.folder_filter = config.get("folder_filter", None)
        if self.do_classify:
            self.classifier = ClassifierConfig(**config["classifier"])
        if self.do_noise_classify:
            self.noise_classifier = NoiseClassifierConfig(**config["noise_classifier"])
        self.transcriber = TranscriberConfig(**config["transcriber"])
        self.segmenter = SegmenterConfig(**config["segmenter"])
```
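Only `transcriber` and `segmenter` are mandatory sections; everything else is optional with a fallback. A minimal sketch of that loading behavior, using a temporary file and the same `dict.get` defaulting as `__post_init__` above (without importing `speechline` itself):

```python
import json
import tempfile

# A minimal config: only the two mandatory sections are present.
minimal = {
    "transcriber": {"type": "gentle"},
    "segmenter": {"type": "silence"},
}

# Write it to disk and read it back, as Config would from its `path`.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(minimal, f)
    path = f.name

with open(path) as f:
    config = json.load(f)

# The same defaulting logic as Config.__post_init__ above: absent flags
# are False, the audio extension falls back to "wav".
do_classify = config.get("do_classify", False)
do_noise_classify = config.get("do_noise_classify", False)
filter_empty_transcript = config.get("filter_empty_transcript", False)
audio_extension = config.get("audio_extension", "wav")
folder_filter = config.get("folder_filter", None)

print(do_classify, audio_extension)  # False wav
```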