Robustness to Missing Data and ID Mismatches

Business Context

In a real-world pipeline, audio files sometimes fail to process, or metadata gets corrupted. This test simulates a 'dirty' dataset where some hypothesis IDs are missing (audio failure), some reference IDs are missing (extra predictions), and some text fields are empty.

About the Set: langModel

Management of Large Language Models (LLM) and NLP.

Data Preparation

Creation of two partially overlapping tables: 'ID_A' exists in both, 'ID_B' only in the reference, 'ID_C' only in the hypothesis, and 'ID_D' exists in both but has empty text in the hypothesis.

DATA mycas.dirty_ref;
LENGTH uid $5 text $50;
INPUT uid $ text &;
DATALINES;
ID_A The quick brown fox
ID_B Jumps over the dog
ID_D Silent audio segment
;
RUN;

DATA mycas.dirty_hyp;
LENGTH uid $5 text $50;
/* TRUNCOVER: a line with no text (ID_D) yields a missing value instead of reading the next line */
INFILE DATALINES TRUNCOVER;
INPUT uid $ text &;
DATALINES;
ID_A The quick brown fox
ID_C New unmatched sentence
ID_D
;
RUN;
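Before scoring, it can help to confirm which IDs fall into each category. The following is a minimal sketch and not part of the original test: it assumes the mycas libref used above can be read by PROC SQL, and the status column and its labels are hypothetical names chosen for illustration.

```sas
/* Sketch: classify every uid by where it appears and whether its hypothesis text is empty. */
PROC SQL;
  SELECT COALESCE(r.uid, h.uid) AS uid,
         CASE
           WHEN r.uid IS NULL   THEN 'hyp only'   /* extra prediction */
           WHEN h.uid IS NULL   THEN 'ref only'   /* audio failed to process */
           WHEN MISSING(h.text) THEN 'empty hyp'  /* ID matched, no text */
           ELSE 'matched'
         END AS status
  FROM mycas.dirty_ref AS r
  FULL JOIN mycas.dirty_hyp AS h
    ON r.uid = h.uid;
QUIT;
```

If the tables loaded as intended, ID_A should be listed as matched, ID_B as ref only, ID_C as hyp only, and ID_D as empty hyp.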

Implementation Steps

Fetch both tables to confirm that they loaded and to inspect their rows.

PROC CAS;
TABLE.fetch / TABLE='dirty_ref';
TABLE.fetch / TABLE='dirty_hyp';
RUN;

Run the action, which is expected to handle unmatched keys and empty strings.

PROC CAS;
langModel.calculateErrorRate /
  TABLE={name='dirty_hyp'}
  reference={name='dirty_ref'}
  tableId='uid'
  referenceId='uid';
RUN;
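To check the "must not crash" requirement programmatically, the call can capture the action's status. This is a hedged sketch: it relies on the general CASL result= and status= options, the variable names res and rc are arbitrary, and the exact contents of the result for this action are not described in the source.

```sas
PROC CAS;
  /* Capture the result and status instead of relying on printed output only. */
  langModel.calculateErrorRate result=res status=rc /
    TABLE={name='dirty_hyp'}
    reference={name='dirty_ref'}
    tableId='uid'
    referenceId='uid';

  /* In CASL, severity 0 is normal, 1 is a warning, 2 is an error. */
  IF rc.severity > 1 THEN
    PRINT "calculateErrorRate reported an error: " rc.status;
  ELSE
    PRINT res;
RUN;
QUIT;
```

A severity of 1 would be consistent with the unmatched IDs being flagged as a warning while execution continues.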

Expected Result

The action should not crash. It should calculate error rates only for the IDs present in both tables ('ID_A' and 'ID_D'). 'ID_A' matches exactly, so its error rate should be 0%. For 'ID_D' (empty hypothesis text against a three-word reference), it should report a 100% error rate made up entirely of deletions. Unmatched IDs ('ID_B', 'ID_C') should ideally be ignored or flagged in a log/warning, but must not stop the execution.
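As a worked check of the two scored IDs, assume the action follows the standard word error rate definition (the source does not state the formula), where S, D, and I are the substituted, deleted, and inserted words and N is the number of words in the reference:

```latex
\[
\mathrm{WER} = \frac{S + D + I}{N}
\]
\[
\text{ID\_A: } \frac{0 + 0 + 0}{4} = 0\%
\qquad
\text{ID\_D: } \frac{0 + 3 + 0}{3} = 100\%
\]
```

For ID_A the reference "The quick brown fox" has four words and the hypothesis is identical. For ID_D the reference "Silent audio segment" has three words and the hypothesis is empty, so all three count as deletions.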

See the technical documentation for calculateErrorRate