The `calculateErrorRate` action compares reference (ground truth) transcripts with hypothesis (predicted) transcripts and calculates error rates at the character, word, and sentence levels. These rates are a common way to evaluate the performance of a speech-to-text model. The action takes two input tables, one for the reference text and one for the hypothesis text, and matches transcripts across them by ID.

| Parameter | Description |
|---|---|
| table | Specifies the input table that contains the hypothesis transcripts generated by the speech-to-text model. |
| reference | Specifies the input table that contains the ground truth (reference) transcripts. |
| tableId | Specifies the variable in the hypothesis table that contains the unique identifier for each transcript. |
| tableText | Specifies the variable in the hypothesis table that contains the transcribed text to be evaluated. |
| referenceId | Specifies the variable in the reference table that contains the unique identifier for each transcript. |
| referenceText | Specifies the variable in the reference table that contains the ground truth text. |
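
For orientation, the word-level rate is conventionally defined through edit distance. The formula below is the standard textbook definition, given as a reference sketch. This page does not state the exact normalization the action applies (for example, how it treats case, punctuation, or whitespace), so treat that as an assumption.

$$
\mathrm{WER} = \frac{S + D + I}{N}
$$

Here $S$, $D$, and $I$ are the numbers of substituted, deleted, and inserted words in a minimum-edit alignment of the hypothesis against the reference, and $N$ is the number of words in the reference. The character error rate uses the same ratio with characters in place of words, and the sentence error rate is usually the fraction of sentences whose hypothesis contains at least one error.
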
First, we create two CAS tables. `reference_transcripts` holds the correct, or 'ground truth', text. `hypothesis_transcripts` holds the text generated by our speech-to-text model. Both tables include an ID to match the corresponding sentences.

```sas
/* Assumes a CAS libref named mycas is already assigned. */

/* Reference (ground truth) transcripts: one row per utterance. */
DATA mycas.reference_transcripts;
    INFILE DATALINES dsd;
    LENGTH id $ 10 text $ 100;
    INPUT id $ text $;
    DATALINES;
utt1,this is a sample sentence
utt2,another test for accuracy
;
RUN;

/* Hypothesis transcripts: the same IDs, with recognition errors. */
DATA mycas.hypothesis_transcripts;
    INFILE DATALINES dsd;
    LENGTH hyp_id $ 10 hyp_text $ 100;
    INPUT hyp_id $ hyp_text $;
    DATALINES;
utt1,this is a sample sentience
utt2,an other test for accuracy
;
RUN;
```

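
Before running the action, it can help to know roughly what to expect from the sample rows. The following is a minimal sketch, not part of the action itself: it joins the two tables by ID and counts character-level edits with the SAS `COMPLEV` function (Levenshtein distance). The output table name `work.char_check` and the column names `char_edits`, `ref_chars`, and `approx_cer` are hypothetical, and the action may normalize text differently, so its reported character error rate need not match this figure exactly.

```sas
/* Assumption: the mycas libref and the two sample tables already exist. */
PROC SQL;
    /* Pair each reference with its hypothesis by ID and count
       character-level edits with COMPLEV (Levenshtein distance). */
    CREATE TABLE work.char_check AS
    SELECT r.id,
           COMPLEV(STRIP(r.text), STRIP(h.hyp_text)) AS char_edits,
           LENGTH(STRIP(r.text))                     AS ref_chars
    FROM mycas.reference_transcripts AS r
         INNER JOIN mycas.hypothesis_transcripts AS h
         ON r.id = h.hyp_id;

    /* Pooled ratio: total edits divided by total reference characters. */
    SELECT SUM(char_edits) / SUM(ref_chars) AS approx_cer FORMAT=PERCENT8.2
    FROM work.char_check;
QUIT;
```

Counting by hand, each utterance differs from its reference by one inserted character ("sentience" adds an "i", and "an other" adds a space), and each reference is 25 characters long including spaces. At the word level, `utt1` has one substituted word and `utt2` has one substitution plus one insertion, which is three word errors against nine reference words.
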
This example calculates the error rate by providing the reference and hypothesis tables. By default, the action assumes the first column is the ID and the second is the text for both tables.

```sas
PROC CAS;
    /* Compare hypothesis transcripts against the reference transcripts. */
    langModel.calculateErrorRate /
        table={name='hypothesis_transcripts'},
        reference={name='reference_transcripts'};
RUN;
QUIT;
```

This example demonstrates how to specify the exact columns for IDs and text in both the hypothesis and reference tables, which is useful when tables have multiple columns or non-standard naming conventions.

```sas
PROC CAS;
    /* Name the ID and text columns explicitly for both tables. */
    langModel.calculateErrorRate /
        table={name='hypothesis_transcripts'},
        reference={name='reference_transcripts'},
        tableId='hyp_id',
        tableText='hyp_text',
        referenceId='id',
        referenceText='text';
RUN;
QUIT;
```
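
To work with the results programmatically rather than only reading the printed output, the action's results can be captured in a CASL variable. The sketch below uses the general CASL `result=` option and the `describe` and `print` statements. The variable name `r` is hypothetical, and the `loadactionset` call assumes the `langModel` action set is available but not yet loaded in the session. The layout of the returned results is not documented on this page, so the code inspects it instead of assuming any table or column names.

```sas
PROC CAS;
    /* Load the action set in case it is not already loaded in this session. */
    loadactionset "langModel";

    /* Capture the action's results in a CASL variable (r is a hypothetical name). */
    langModel.calculateErrorRate result=r /
        table={name='hypothesis_transcripts'},
        reference={name='reference_transcripts'},
        tableId='hyp_id',
        tableText='hyp_text',
        referenceId='id',
        referenceText='text';

    /* Show the structure of the results, then print their contents. */
    describe r;
    print r;
RUN;
QUIT;
```

Running `describe` first shows which result tables and columns come back, which makes it easier to pick out individual values in later CASL code.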