High-Volume Medical POS Tagging with LBFGS Optimization

Business Context

A hospital system processes thousands of medical notes daily. It needs a robust model to tag medical terms, with optimization tuned to prevent overfitting on a large, noisy dataset.

Data Preparation

The DATA step below simulates a larger dataset of medical notes: 1000 three-token sequences (3000 rows) used to test the optimizer's performance.

/* Simulate 1000 identical three-token notes in the casuser caslib. */
DATA casuser.medical_notes;
   /* LENGTH is declared first, so these three columns lead the column order. */
   LENGTH _token_ $20 feature_suffix $3 label $10;
   DO i=1 TO 1000;
      /* Token 1 of 3 */
      _start_='BEGIN';
      _end_='WORD';
      _token_='Patient';
      feature_suffix='ent';
      label='O';
      OUTPUT;
      /* Token 2 of 3 */
      _start_='WORD';
      _end_='WORD';
      _token_='shows';
      feature_suffix='ows';
      label='O';
      OUTPUT;
      /* Token 3 of 3: the only token tagged as a symptom */
      _start_='WORD';
      _end_='END';
      _token_='symptoms';
      feature_suffix='oms';
      label='B-SYM';
      OUTPUT;
   END;
RUN;
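
Before training, it can help to confirm that the table has the expected shape. The following is a minimal sketch, assuming the casuser libref used by the DATA step is still assigned in the session. It prints a few rows (CAS does not guarantee row order) and counts the labels, which for this simulated data should come to 2000 rows labeled O and 1000 rows labeled B-SYM.

```sas
/* Sketch: look at a few simulated rows. */
PROC PRINT DATA=casuser.medical_notes(OBS=3);
   VAR _start_ _end_ _token_ feature_suffix label;
RUN;

/* Sketch: label distribution across all 3000 rows. */
PROC FREQ DATA=casuser.medical_notes;
   TABLES label;
RUN;
```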

Implementation Steps

Train the model with the LBFGS algorithm, applying L1 regularization (regL1=0.2), a limit of 100 iterations (maxIters=100), and the Wolfe line search method.

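PROC CAS;
   /* Load the action set in case it is not already loaded in this session. */
   loadactionset 'conditionalRandomFields';
   /* The template string contains a line break between its two rules; the
      second rule starts in column 1 so no indentation ends up in the string. */
   conditionalRandomFields.crfTrain /
      table={name='medical_notes', caslib='casuser'}
      target='label'
      template='U00:%x[0,0]
U01:%x[0,1]'
      nloOpts={algorithm='LBFGS',
               optmlOpt={regL1=0.2, maxIters=100},
               lbfgsOpt={lineSearchMethod='WOLFE'}}
      model={label={name='med_labels'},
             attr={name='med_attrs'},
             feature={name='med_features'},
             attrfeature={name='med_attrfeats'},
             template={name='med_template'}};
RUN;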
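
For orientation, the two optimizer settings can be read in generic textbook terms. This is a sketch, not a statement of crfTrain internals: the exact objective, any scaling of the penalty, and the constants used by the line search are assumptions here. With $\lambda$ standing for regL1, an L1-regularized CRF fit minimizes the negative log-likelihood plus a penalty on the absolute weights:

```latex
\min_{w}\; -\sum_{n=1}^{N} \log p\!\left(y^{(n)} \mid x^{(n)};\, w\right) \;+\; \lambda \lVert w \rVert_1,
\qquad \lambda = 0.2
```

For a smooth objective $f$, search direction $p_k$, and step size $\alpha$, a Wolfe line search accepts a step when both the sufficient-decrease and curvature conditions hold:

```latex
f(w_k + \alpha p_k) \le f(w_k) + c_1\, \alpha\, \nabla f(w_k)^{\top} p_k,
\qquad
\nabla f(w_k + \alpha p_k)^{\top} p_k \ge c_2\, \nabla f(w_k)^{\top} p_k,
\qquad 0 < c_1 < c_2 < 1
```

A larger $\lambda$ pushes more feature weights to exactly zero, which is the mechanism by which L1 regularization limits overfitting.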

Expected Result

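The model trains with the LBFGS solver, and the output log should reflect the Wolfe line search and the L1 regularization setting (regL1=0.2). Optimization should finish within the 100-iteration limit set by maxIters, and the model is stored in the five tables named in the model parameter (med_labels, med_attrs, med_features, med_attrfeats, med_template).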
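
One way to confirm that training produced its output is to inspect one of the model tables. The following is a minimal sketch, assuming the model tables were written to the session's active caslib (the training call names the tables but does not set a caslib for them). It uses the standard table.tableInfo and table.fetch actions; the contents of the table are not described here.

```sas
PROC CAS;
   /* Sketch: show metadata (row and column counts) for the label table. */
   table.tableInfo / name='med_labels';
   /* Sketch: fetch up to 10 rows of the label table for a quick look. */
   table.fetch / table={name='med_labels'} to=10;
RUN;
```

The same two calls can be repeated for med_attrs, med_features, med_attrfeats, and med_template.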

See the crfTrain technical documentation