Robustness to Missing Values and Unknown Tokens

Business Context

A social media monitoring tool processes tweets that may be empty, contain special characters, or contain words not present in the training dictionary (out-of-vocabulary, or OOV).

Data Preparation

Create a dataset with edge cases (an empty string, a missing value, and a word not in the attribute table), along with a minimal set of model tables to score against.

/* Edge cases: TW01 has no text, TW02 is a lone period (read as missing), */
/* TW03 is an out-of-vocabulary word, TW04 starts with a known word.      */
DATA mycas.dirty_tweets;
  LENGTH tweet_id $10 text $100;
  INFILE DATALINES delimiter='|' missover;
  INPUT tweet_id $ text $;
  DATALINES;
TW01|
TW02|.
TW03|Supercalifragilisticexpialidocious
TW04|Standard text
;
RUN;

/* Minimal model tables */
DATA mycas.e_label; LENGTH _label_ $20 _type_ $20; INPUT _label_ $ _type_ $; DATALINES;
O OTHER
;
RUN;
DATA mycas.e_attr; LENGTH _attr_ $50 _value_ $50; INPUT _attr_ $ _value_ $; DATALINES;
WORD[0] Standard
;
RUN;
DATA mycas.e_feat; LENGTH _feature_ $50; INPUT _feature_ $; DATALINES;
U00:Standard
;
RUN;
DATA mycas.e_attr_feat; INPUT _attrid_ _featureid_ _weight_; DATALINES;
1 1 1.0
;
RUN;
DATA mycas.e_temp; LENGTH _template_ $100; INPUT _template_ $; DATALINES;
U00:%w[0]
;
RUN;
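
Before scoring, it can help to confirm that the edge cases were loaded as intended. The following is a minimal sketch, assuming the `mycas` libref from the step above is assigned to an active CAS session; the column alias `text_is_missing` is introduced here for illustration only.

```sas
/* List each tweet and flag rows whose text is missing. */
/* MISSING() returns 1 for a blank character value, 0 otherwise. */
PROC SQL;
  SELECT tweet_id,
         text,
         MISSING(text) AS text_is_missing
  FROM mycas.dirty_tweets;
QUIT;
```

If TW01 or TW02 is not flagged as missing, check the `INFILE` options first, since they control how an empty field and a lone period are read.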

Implementation Steps

Score the dirty data with the crfScore action, using the minimal model tables created above.

PROC CAS;
  /* Score the edge-case table; the slash separates the action name from its parameters */
  conditionalRandomFields.crfScore /
    table={name='dirty_tweets'},
    model={
      attr={name='e_attr'},
      attrfeature={name='e_attr_feat'},
      feature={name='e_feat'},
      label={name='e_label'},
      template={name='e_temp'}
    },
    casOut={name='dirty_scored', replace=true},
    target='tags';
RUN;

Expected Result

The action completes without error. Empty texts generate empty or 'O' (Other) sequences depending on model logic. Unknown words (like 'Supercalifragilistic...') are handled gracefully (usually tagged as 'O' or default label) without crashing the session.
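
One way to check this result is to confirm that the output table exists and then look at its rows. This is a sketch rather than a prescribed verification step: it assumes the `dirty_scored` table was written to the active caslib, and it makes no assumption about which columns crfScore produces, so the rows are displayed as they are.

```sas
PROC CAS;
  /* Confirm that the scoring step created the output table */
  table.tableExists / name='dirty_scored';

  /* Display up to 20 scored rows for manual inspection */
  table.fetch / table={name='dirty_scored'}, to=20;
RUN;
```

Comparing the rows for TW01 through TW04 against the input shows how the empty, missing, and out-of-vocabulary cases were each tagged.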

See the crfScore technical documentation