conditionalRandomFields crfScore

Robustness to Missing Values and Unknown Tokens

Test Scenario & Use Case

Business Context

A social media monitoring tool processes tweets that may be empty, contain special characters, or include words absent from the training dictionary (out-of-vocabulary, OOV).

Data Preparation

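Create a dataset that covers the edge cases: an empty text field (TW01), a lone period, which SAS list input reads as a missing character value (TW02), a word that is not in the attribute table (TW03), and a normal control row (TW04).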

/* Edge-case input: empty text, a lone period, an out-of-vocabulary word, a normal row */
DATA mycas.dirty_tweets;
  LENGTH tweet_id $10 text $100;
  INFILE DATALINES delimiter='|' missover;
  INPUT tweet_id $ text $;
  DATALINES;
TW01|
TW02|.
TW03|Supercalifragilisticexpialidocious
TW04|Standard text
;
RUN;

/* Minimal model tables. DATALINES must be the last statement; data rows start on the next line. */
DATA mycas.e_label; LENGTH _label_ $20 _type_ $20; INPUT _label_ $ _type_ $; DATALINES;
O OTHER
;
RUN;
DATA mycas.e_attr; LENGTH _attr_ $50 _value_ $50; INPUT _attr_ $ _value_ $; DATALINES;
WORD[0] Standard
;
RUN;
DATA mycas.e_feat; LENGTH _feature_ $50; INPUT _feature_ $; DATALINES;
U00:Standard
;
RUN;
DATA mycas.e_attr_feat; INPUT _attrid_ _featureid_ _weight_; DATALINES;
1 1 1.0
;
RUN;
DATA mycas.e_temp; LENGTH _template_ $100; INPUT _template_ $; DATALINES;
U00:%w[0]
;
RUN;
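
Before scoring, it can help to confirm that the edge cases actually landed in the input table as intended. The following is a minimal sketch, assuming the `mycas` libref from the data preparation step is assigned and the `dirty_tweets` table was created; the column aliases `n_rows` and `n_blank_text` are illustrative names, not part of the scenario.

```sas
/* Sketch: count all rows and the rows whose text is blank (missing) */
PROC SQL;
  SELECT COUNT(*)           AS n_rows,
         SUM(MISSING(text)) AS n_blank_text
  FROM mycas.dirty_tweets;
QUIT;
```

If the blank count does not match the number of empty or null rows you meant to create, the input step is the place to look before blaming the scoring action.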

Implementation Steps

1
Attempt to score the dirty data with crfScore.
PROC CAS;
  /* Make sure the action set is available in this session */
  loadactionset "conditionalRandomFields";
  /* Score the edge-case table with the minimal CRF model */
  conditionalRandomFields.crfScore /
    table={name='dirty_tweets'},
    model={
      attr={name='e_attr'},
      attrfeature={name='e_attr_feat'},
      feature={name='e_feat'},
      label={name='e_label'},
      template={name='e_temp'}
    },
    casOut={name='dirty_scored', replace=true},
    target='tags';
RUN;
QUIT;

Expected Result


The action completes without error. Empty texts produce empty or 'O' (Other) sequences, depending on model logic. Unknown words (such as 'Supercalifragilistic...') are handled gracefully, usually tagged 'O' or the default label, and do not crash the session.
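
One way to check this result is to confirm that the output table exists and to inspect its first rows. The sketch below assumes the scoring step wrote `dirty_scored` to the active caslib; it uses the standard `table.tableExists` and `table.fetch` actions, and it makes no assumption about the column names that crfScore produces. The result variable `r` is an illustrative name.

```sas
PROC CAS;
  /* exists is nonzero when the table is found in the active caslib */
  table.tableExists result=r / name='dirty_scored';
  PRINT r;
  /* Display up to 10 rows of the scored output for manual inspection */
  table.fetch / table={name='dirty_scored'}, to=10;
RUN;
QUIT;
```

Reviewing the fetched rows for TW01 through TW04 shows how the empty, null, and out-of-vocabulary inputs were labeled.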