langModel

calculateErrorRate

Description

The `calculateErrorRate` action compares reference (ground truth) transcripts with hypothesis (predicted) transcripts to calculate error rates at the character, word, and sentence levels. This is crucial for evaluating the performance of a speech-to-text model. It requires two input tables: one for the reference text and one for the hypothesis text, and it matches the transcripts based on their IDs.
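
To make the three metrics concrete, here is a minimal Python sketch of how word, character, and sentence error rates are conventionally defined using Levenshtein (edit) distance. It is not the action's implementation. The function names are illustrative, and two details are assumptions that the action may handle differently: words are split on whitespace, and spaces count as characters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rates(pairs):
    """pairs: list of (reference_text, hypothesis_text) tuples."""
    word_errs = word_total = char_errs = char_total = sent_errs = 0
    for ref, hyp in pairs:
        ref_words, hyp_words = ref.split(), hyp.split()  # assumption: whitespace tokens
        word_errs += edit_distance(ref_words, hyp_words)
        word_total += len(ref_words)
        char_errs += edit_distance(ref, hyp)             # assumption: spaces count
        char_total += len(ref)
        sent_errs += ref != hyp                          # any mismatch is a sentence error
    return {
        "wer": word_errs / word_total,
        "cer": char_errs / char_total,
        "ser": sent_errs / len(pairs),
    }


pairs = [
    ("this is a sample sentence", "this is a sample sentience"),
    ("another test for accuracy", "an other test for accuracy"),
]
rates = error_rates(pairs)
print(rates)
```

Under these assumptions, the two sample pairs contain 3 word errors over 9 reference words, 2 character errors over 50 reference characters, and 2 mismatched sentences out of 2. The action's reported values may differ if it normalizes or tokenizes text differently.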

proc cas;
    langModel.calculateErrorRate /
        table={...}
        reference={...}
        tableId="string"
        tableText="string"
        referenceId="string"
        referenceText="string";
run;

Settings
| Parameter | Description |
| --- | --- |
| `table` | Specifies the input table that contains the hypothesis transcripts generated by the speech-to-text model. |
| `reference` | Specifies the input table that contains the ground truth (reference) transcripts. |
| `tableId` | Specifies the variable in the hypothesis table that contains the unique identifier for each transcript. |
| `tableText` | Specifies the variable in the hypothesis table that contains the transcribed text to be evaluated. |
| `referenceId` | Specifies the variable in the reference table that contains the unique identifier for each transcript. |
| `referenceText` | Specifies the variable in the reference table that contains the ground truth text. |
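
The role of the four column parameters can be illustrated with a short Python sketch that pairs hypothesis rows with reference rows on their ID columns. This only illustrates the matching idea and is not the action's code. The helper `pair_by_id` is hypothetical, and how the action treats IDs that have no counterpart is not stated here, so the sketch simply collects them.

```python
def pair_by_id(hyp_rows, ref_rows, table_id, table_text,
               reference_id, reference_text):
    """Join hypothesis rows to reference rows on their ID columns."""
    ref_by_id = {row[reference_id]: row[reference_text] for row in ref_rows}
    pairs, unmatched = [], []
    for row in hyp_rows:
        key = row[table_id]
        if key in ref_by_id:
            # (id, reference text, hypothesis text)
            pairs.append((key, ref_by_id[key], row[table_text]))
        else:
            unmatched.append(key)  # assumption: just report IDs without a reference
    return pairs, unmatched


hyp_rows = [
    {"hyp_id": "utt1", "hyp_text": "this is a sample sentience"},
    {"hyp_id": "utt2", "hyp_text": "an other test for accuracy"},
    {"hyp_id": "utt3", "hyp_text": "no reference for this one"},  # hypothetical extra row
]
ref_rows = [
    {"id": "utt1", "text": "this is a sample sentence"},
    {"id": "utt2", "text": "another test for accuracy"},
]

pairs, unmatched = pair_by_id(
    hyp_rows, ref_rows,
    table_id="hyp_id", table_text="hyp_text",
    reference_id="id", reference_text="text",
)
print(pairs)
print(unmatched)
```

The keyword arguments mirror the `tableId`, `tableText`, `referenceId`, and `referenceText` parameters, so the ID and text columns can be named differently in the two tables.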
Data Preparation
Creating Reference and Hypothesis Data

First, we create two CAS tables. `reference_transcripts` holds the correct, or 'ground truth', text. `hypothesis_transcripts` holds the text generated by our speech-to-text model. Both tables include an ID to match the corresponding sentences.

DATA mycas.reference_transcripts;
    INFILE DATALINES dsd;
    LENGTH id $ 10 text $ 100;
    INPUT id $ text $;
    DATALINES;
utt1,this is a sample sentence
utt2,another test for accuracy
;
RUN;

DATA mycas.hypothesis_transcripts;
    INFILE DATALINES dsd;
    LENGTH hyp_id $ 10 hyp_text $ 100;
    INPUT hyp_id $ hyp_text $;
    DATALINES;
utt1,this is a sample sentience
utt2,an other test for accuracy
;
RUN;

Examples

This example calculates the error rate by providing the reference and hypothesis tables. By default, the action assumes the first column is the ID and the second is the text for both tables.

SAS® / CAS Code (awaiting community validation)
PROC CAS;
    langModel.calculateErrorRate /
        TABLE={name='hypothesis_transcripts'},
        reference={name='reference_transcripts'};
RUN;
Result:
The action returns a result table summarizing the Word Error Rate (WER), Character Error Rate (CER), and Sentence Error Rate (SER), along with counts of substitutions, insertions, and deletions for both words and characters.

This example demonstrates how to specify the exact columns for IDs and text in both the hypothesis and reference tables, which is useful when tables have multiple columns or non-standard naming conventions.

SAS® / CAS Code (awaiting community validation)
PROC CAS;
    langModel.calculateErrorRate /
        TABLE={name='hypothesis_transcripts'},
        reference={name='reference_transcripts'},
        tableId='hyp_id',
        tableText='hyp_text',
        referenceId='id',
        referenceText='text';
RUN;
Result:
The output is a comprehensive report detailing the error rates. It includes overall statistics (WER, CER, SER) and a breakdown of errors (substitutions, deletions, insertions) for each transcript pair, allowing for a granular analysis of the model's performance.
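
To show what a per-transcript breakdown of substitutions, deletions, and insertions means, the following Python sketch aligns each reference and hypothesis pair with a Levenshtein table and walks back through one minimum-cost alignment. It illustrates the general technique only. The function name, the tie-breaking order, whitespace tokenization, and counting spaces as characters are all assumptions, and the action's own output columns may be organized differently.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) for one minimum-cost alignment."""
    n, m = len(ref), len(hyp)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i  # delete every reference item
    for j in range(m + 1):
        dist[0][j] = j  # insert every hypothesis item
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,
                             dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1)

    # Backtrace from the bottom-right cell, preferring the diagonal on ties.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + mismatch:
            subs += mismatch
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins


transcripts = {
    "utt1": ("this is a sample sentence", "this is a sample sentience"),
    "utt2": ("another test for accuracy", "an other test for accuracy"),
}

breakdown = {}
for utt_id, (ref, hyp) in transcripts.items():
    breakdown[utt_id] = {
        "word": align_counts(ref.split(), hyp.split()),  # word-level (S, D, I)
        "char": align_counts(ref, hyp),                  # character-level (S, D, I)
    }
    print(utt_id, breakdown[utt_id])
```

For the sample data, `utt1` has one word substitution (`sentence` to `sentience`), while `utt2` has one word substitution plus one word insertion because `another` was split into `an other`. At the character level, each pair differs by a single inserted character.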