Word Tokenizer
speechline.utils.tokenizer.WordTokenizer
Basic word-based splitting.
Source code in speechline/utils/tokenizer.py
```python
from string import punctuation
from typing import List

from nltk.tokenize import TweetTokenizer


class WordTokenizer:
    """
    Basic word-based splitting.
    """

    tokenizer = TweetTokenizer(preserve_case=False)

    def __call__(self, text: str) -> List[str]:
        """
        Splits text into words, ignoring punctuation and case.

        Args:
            text (str):
                Text to tokenize.

        Returns:
            List[str]:
                List of tokens.
        """
        tokens = self.tokenizer.tokenize(text)
        tokens = [token for token in tokens if token not in punctuation]
        return tokens
```
`__call__(self, text)`
Splits text into words, ignoring punctuation and case.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Text to tokenize. | *required* |
Returns:

| Type | Description |
|---|---|
| `List[str]` | List of tokens. |
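For reference, a minimal usage sketch. It assumes `nltk` is installed and reproduces the class standalone rather than importing it from `speechline`; the example sentence is illustrative only:

```python
from string import punctuation
from typing import List

from nltk.tokenize import TweetTokenizer


class WordTokenizer:
    """Basic word-based splitting, reproduced from the source above."""

    tokenizer = TweetTokenizer(preserve_case=False)

    def __call__(self, text: str) -> List[str]:
        # Lowercase and tokenize, then drop single-character punctuation tokens.
        tokens = self.tokenizer.tokenize(text)
        return [token for token in tokens if token not in punctuation]


tokenizer = WordTokenizer()
print(tokenizer("Hello, World! This is a test."))
# → ['hello', 'world', 'this', 'is', 'a', 'test']
```

Note that the punctuation filter only removes tokens that are exactly one character from `string.punctuation`; multi-character punctuation tokens, if the tokenizer emits any, pass through unchanged.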