Mispronunciation

`transcribe.mispronunciation`

`Mispronunciation`

A class representing a mispronunciation. Its attributes hold the mispronunciation type and the differences that led to that verdict.

Parameters:

Name	Type	Description	Default
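`job_name`	`str`	Job name/id. Not a constructor argument; initialized to `None`.	`None`
`audio_url`	`str`	URL to audio file. Not a constructor argument; initialized to `None`.	`None`
`language`	`str`	Language of audio. Not a constructor argument; initialized to `None`.	`None`
`type`	`MispronunciationType`	Type of mispronunciation/disfluency present.	required
`lists`	`Tuple[List[str], List[str]]`	Pair of input word lists that were compared.	required
`differences`	`Tuple[List[str], List[str]]`	Words from each input list that led to the type verdict.	required
`opcodes`	`List[Tuple[str, int, int, int, int]]`	Opcodes from `diff` library.	required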

Source code in src/transcribe/mispronunciation.py

class Mispronunciation:
    """
    A class representing a mispronunciation.
    Its attributes hold the type and differences.

    Arguments:
        job_name (str): Job name/id.
        audio_url (str): URL to audio file.
        language (str): Language of audio.
        type (MispronunciationType): Type of mispronunciation/disfluency present.
        lists (Tuple[List[str], List[str]]): Input list of strings taken for comparison.
        differences (Tuple[List[str], List[str]]): Differences of list of strings that
                                                   resulted in the type verdict.
    """

    def __init__(
        self,
        type: MispronunciationType,
        lists: Tuple[List[str], List[str]],
        differences: Tuple[List[str], List[str]],
        opcodes: List[Tuple[str, int, int, int, int]],
    ):
        """Constructor for the `Mispronunciation` class.

        Args:
            type (MispronunciationType): Type of mispronunciation/disfluency present.
            lists (Tuple[List[str], List[str]]): Input list of strings taken for
                                                 comparison.
            differences (Tuple[List[str], List[str]]): Differences of list of strings
                                                    that resulted in the type verdict.
            opcodes (List[Tuple[str, int, int, int, int]]): Opcodes from `diff` library.
        """
        self.job_name = None
        self.audio_url = None
        self.language = None
        self.type = type
        self.lists = lists
        self.differences = differences
        self.opcodes = opcodes

`__init__(type, lists, differences, opcodes)`

Constructor for the Mispronunciation class.

Parameters:

Name	Type	Description	Default
`type`	`MispronunciationType`	Type of mispronunciation/disfluency present.	required
`lists`	`Tuple[List[str], List[str]]`	Pair of input word lists that were compared.	required
`differences`	`Tuple[List[str], List[str]]`	Words from each input list that led to the type verdict.	required
`opcodes`	`List[Tuple[str, int, int, int, int]]`	Opcodes from `diff` library.	required
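
The `opcodes` type, `List[Tuple[str, int, int, int, int]]`, has the same shape as the output of `SequenceMatcher.get_opcodes()` in Python's standard `difflib` module. The following is a minimal sketch, assuming the `diff` library mentioned above behaves like `difflib` and that the transcript is passed as the first sequence; neither assumption is confirmed by this page, and `word_opcodes` is a hypothetical helper name.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Opcode = Tuple[str, int, int, int, int]


def word_opcodes(transcript: List[str], ground_truth: List[str]) -> List[Opcode]:
    """Hypothetical helper: word-level opcodes via difflib (an assumption)."""
    # autojunk=False disables difflib's popularity heuristic for long inputs.
    matcher = SequenceMatcher(None, transcript, ground_truth, autojunk=False)
    # Each opcode is (tag, i1, i2, j1, j2): transcript[i1:i2] relates to
    # ground_truth[j1:j2] with tag "equal", "replace", "delete" or "insert".
    return [tuple(op) for op in matcher.get_opcodes()]


transcript = "skel is not a skeleton".split()
ground_truth = "skel is a skeleton".split()

for tag, i1, i2, j1, j2 in word_opcodes(transcript, ground_truth):
    print(tag, transcript[i1:i2], ground_truth[j1:j2])
# equal ['skel', 'is'] ['skel', 'is']
# delete ['not'] []
# equal ['a', 'skeleton'] ['a', 'skeleton']
```

In this orientation, a `delete` opcode marks a word present in the transcript but absent from the ground truth, which is an added word.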


`detect_mispronunciation(ground_truth, transcript, homophones=None)`

Detects whether a ground truth and transcript pair counts as a mispronunciation.

A mispronunciation is defined as an addition (A), a substitution (S), or both (A & S). Deletions (D), 100% matches (M), and single-word ground truths (X) are ignored, and `None` is returned for them. Homophones from a predefined list are treated as matches.

Parameters:

Name	Type	Description	Default
`ground_truth`	`List[str]`	List of ground truth words.	required
`transcript`	`List[str]`	List of transcript words.	required
`homophones`	`List[Set[str]]`	List of homophone families. If `None`, `HOMOPHONES["en"]` is used.	`None`

Returns:

Name	Type	Description
`Mispronunciation`	`Mispronunciation`	The detected mispronunciation, or `None` if no mispronunciation is detected.

Examples

#	Ground Truth	Transcript	Verdict
1	skel is a skeleton	skel is a skeleton	M
2	skel is a skeleton	skel is not a skeleton	A
3	skel is a skeleton	skel is a zombie	S
4	skel is a skeleton	skel is not a zombie	A & S
5	skel is a skeleton	skel is skeleton	D
6	skel is a skeleton	skel is zombie	D
7	vain is a skeleton	vein is a skeleton	M
8	skel	skel is a skeleton	X

Algorithm

Here, residue means the words of one list that were not aligned to any word of the other list.

BASE CASES (return `None`) if:

- single-word ground truth
- empty transcript (after filler removal)
- zero alignment

MATCH if:

- both residues are empty (100% match)

DELETION if:

- zero transcript residue, at least 1 ground truth residue
    - all spoken words are correct, but some words are missing
- more residue in ground truth than in transcript
    - less strict condition than above
    - may contain substitutions, but likely only a few

ADDITION if:

- zero ground truth residue, at least 1 transcript residue
    - all words in ground truth are spoken correctly, but additional words are present

SUBSTITUTION if:

- same amount of residue, at exactly the same positions
    - strict form of substitution, only 1-to-1 changes per position

ADDITION & SUBSTITUTION if:

- at least as much residue in transcript as in ground truth
    - with at least 1 match
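
The cases above can be sketched end to end as follows. This is an illustration under stated assumptions, not the package's implementation: `match_sequence` is not shown on this page, so Python's `difflib.SequenceMatcher` stands in for it, homophones are handled by mapping each word to a representative of its family, and the function returns a verdict letter from the examples table instead of a `Mispronunciation` object. The names `classify`, `canonical` and `FILLERS` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Set

FILLERS = ("", "uh", "huh", "mm", "yeah", "mhm", "hmm", "hm")


def canonical(word: str, homophones: List[Set[str]]) -> str:
    """Map a word to one fixed representative of its homophone family."""
    for family in homophones:
        if word in family:
            return min(family)
    return word


def classify(
    ground_truth: List[str],
    transcript: List[str],
    homophones: Optional[List[Set[str]]] = None,
) -> Optional[str]:
    """Return "M", "A", "S", "A & S", "D" or "X"; None for the other base cases."""
    homophones = homophones or []
    transcript = [w for w in transcript if w not in FILLERS]
    if len(ground_truth) == 1:
        return "X"  # single-word ground truth
    if not transcript:
        return None  # filler-only or empty transcript

    gt = [canonical(w, homophones) for w in ground_truth]
    tsc = [canonical(w, homophones) for w in transcript]

    # Stand-in for match_sequence: collect the indices of aligned words.
    matcher = SequenceMatcher(None, tsc, gt, autojunk=False)
    aligned_tsc, aligned_gt = set(), set()
    for block in matcher.get_matching_blocks():
        aligned_tsc.update(range(block.a, block.a + block.size))
        aligned_gt.update(range(block.b, block.b + block.size))
    if not aligned_tsc and not aligned_gt:
        return None  # zero alignment

    # Residue: indices that were not aligned on either side.
    tsc_diff = set(range(len(tsc))) - aligned_tsc
    gt_diff = set(range(len(gt))) - aligned_gt

    if not gt_diff and not tsc_diff:
        return "M"
    if not tsc_diff:
        return "D"
    if not gt_diff:
        return "A"
    if tsc_diff == gt_diff:
        return "S"
    if len(tsc_diff) >= len(gt_diff):
        return "A & S"
    return "D"


gt = "skel is a skeleton".split()
print(classify(gt, "skel is not a skeleton".split()))  # A
print(classify(gt, "skel is a zombie".split()))  # S
print(classify(gt, "skel is not a zombie".split()))  # A & S
```

With these stand-ins, the sketch reproduces all eight verdicts in the examples table, including the homophone case when `[{"vain", "vein"}]` is passed as `homophones`.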

Source code in src/transcribe/mispronunciation.py

def detect_mispronunciation(
    ground_truth: List[str], transcript: List[str], homophones: List[Set[str]] = None
) -> Mispronunciation:
    """Detects if the pair of ground truth and transcript is considered as a
    mispronunciation.

    A mispronunciation is an addition (A), a substitution (S), or both (A & S).
    Deletions (D), 100% matches (M) and single-word GTs (X) are ignored, returning
    `None`. Homophones from a predefined list are treated as matches.

    Args:
        ground_truth (List[str]): List of ground truth words.
        transcript (List[str]): List of transcript words.
        homophones (List[Set[str]], optional): List of homophone families. Defaults
                                               to None.

    Returns:
        Mispronunciation: Object of mispronunciation present. Otherwise, `None`.

    Examples
    -------------------------------------------------------------
    | # | Ground Truth       | Transcript             | Verdict |
    |:-:|--------------------|------------------------|:-------:|
    | 1 | skel is a skeleton | skel is a skeleton     |    M    |
    | 2 | skel is a skeleton | skel is not a skeleton |    A    |
    | 3 | skel is a skeleton | skel is a zombie       |    S    |
    | 4 | skel is a skeleton | skel is not a zombie   |  A & S  |
    | 5 | skel is a skeleton | skel is skeleton       |    D    |
    | 6 | skel is a skeleton | skel is zombie         |    D    |
    | 7 | vain is a skeleton | vein is a skeleton     |    M    |
    | 8 | skel               | skel is a skeleton     |    X    |

    Algorithm
    ----------
    BASE CASES if:

    - single-word ground truth
    - empty transcript
    - zero alignment

    MATCH if:

    - both residues are empty (100% match)

    DELETION if:

    - zero transcript residue, at least 1 ground truth residue
        - all spoken words are correct, but some words are missing
    - more residue in ground truth than in transcript
        - less strict condition than above
        - may contain substitutions, but likely only a few

    ADDITION if:

    - zero ground truth residue, at least 1 transcript residue
        - all words in ground truth are spoken correctly, but additional words
        are present

    SUBSTITUTION if:

    - same amount of residue, at exactly the same positions
        - strict form of substitution, only 1-to-1 changes per position

    ADDITION & SUBSTITUTION if:

    - at least as much residue in transcript as in ground truth
        - with at least 1 match
    """
    if homophones is None:
        homophones = HOMOPHONES["en"]

    transcript = list(filter(remove_fillers, transcript))

    if len(ground_truth) == 1 or len(transcript) == 0:
        return None  # single word or filler-only transcript

    tsc_idx = set(range(len(transcript)))
    gt_idx = set(range(len(ground_truth)))

    aligned_tsc, aligned_gt, opcodes = match_sequence(
        transcript, ground_truth, homophones
    )

    if len(aligned_tsc) == 0 and len(aligned_gt) == 0:
        return None  # zero matches/alignments, pretty much random

    tsc_diff = tsc_idx.difference(aligned_tsc)
    gt_diff = gt_idx.difference(aligned_gt)

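    # sort the indices so the words keep their original order;
    # set iteration order is not guaranteed
    tsc_diff_words = [transcript[idx] for idx in sorted(tsc_diff)]
    gt_diff_words = [ground_truth[idx] for idx in sorted(gt_diff)]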

    mispronunciation = Mispronunciation(
        None, (ground_truth, transcript), (gt_diff_words, tsc_diff_words), opcodes
    )

    if len(gt_diff) == 0 and len(tsc_diff) == 0:
        return None  # 100% match
    elif len(gt_diff) > 0 and len(tsc_diff) == 0:
        return None  # deletion only
    elif len(gt_diff) == 0 and len(tsc_diff) > 0:
        mispronunciation.type = MispronunciationType.ADDITION
        return mispronunciation  # addition only
    elif len(tsc_diff) == len(gt_diff) and tsc_diff == gt_diff:
        mispronunciation.type = MispronunciationType.SUBSTITUTION
        return mispronunciation  # strict substitution only
    elif len(tsc_diff) >= len(gt_diff):
        mispronunciation.type = MispronunciationType.ADDITION_SUBSTITUTION
        return mispronunciation  # addition & substitution
    else:
        # When the transcript has fewer residual words than the GT, we assume
        # mostly deletion, although there may also be substitutions. We consider
        # such a transcript to contain little to no information that is useful
        # for training.
        return None

`remove_fillers(word)`

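Checks a word against a fixed list of filler words. Despite its name, the function removes nothing itself: it is a predicate for `filter` that returns `True` for words to keep.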

Parameters:

Name	Type	Description	Default
`word`	`str`	Any word (sequence of characters).	required

Returns:

Name	Type	Description
`bool`	`bool`	`True` if word is not a filler. `False` otherwise.
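
As a short usage sketch, the predicate can be passed to `filter`, which is how `detect_mispronunciation` applies it to the transcript. The function body is copied from the source listing on this page; the sample word list is made up for illustration.

```python
def remove_fillers(word: str) -> bool:
    """Return True if `word` is not a filler (copied from the source listing)."""
    fillers = ("", "uh", "huh", "mm", "yeah", "mhm", "hmm", "hm")
    return word not in fillers


# Hypothetical transcript containing fillers and an empty token.
words = ["uh", "skel", "is", "hmm", "a", "skeleton", ""]
kept = list(filter(remove_fillers, words))
print(kept)  # ['skel', 'is', 'a', 'skeleton']

# The comparison is an exact string match, so differently cased fillers are kept.
print(remove_fillers("Uh"))  # True
```

Because the match is exact, callers who want case-insensitive filtering would need to lowercase words before applying the predicate.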

Source code in src/transcribe/mispronunciation.py

def remove_fillers(word: str) -> bool:
    """Manually checks if a word is a filler word

    Args:
        word (str): Any word (sequence of characters).

    Returns:
        bool: `True` if word is not a filler. `False` otherwise.
    """
    fillers = ("", "uh", "huh", "mm", "yeah", "mhm", "hmm", "hm")
    return word not in fillers
