The crfTrain action trains a Conditional Random Fields (CRF) model for sequence labeling. CRF is a statistical modeling method often applied in pattern recognition and machine learning for structured prediction. It is particularly suited for tasks like named entity recognition, part-of-speech tagging, and other sequence labeling tasks where the context of the sequence is important for making predictions.
| Parameter | Description |
|---|---|
| model | Specifies the output model tables: labels (label), attributes (attr), features (feature), attribute-feature data (attrfeature), and the template used (template). |
| nloOpts | Specifies the nonlinear optimization parameters for the training process, such as the algorithm (LBFGS, SGD, etc.), tolerances, and iteration limits. |
| table | Specifies the input CAS table that contains the data for training the model. |
| target | Specifies the name of the variable in the input table that contains the labels (the predicted/hidden variable). |
| template | Specifies the textual template that defines the features to be extracted from the input data for training the CRF model. |
This example creates a sample table named 'my_cas_library.sequence_data'. It holds two short sequences, one token per row, and each token has two features and a label. The '_start_', '_end_', and '_token_' columns define the structure of the sequences, which is essential for the CRF training process.

    /* 'my_cas_library' is a placeholder: an actual libref is limited to
       8 characters and must be assigned to a CAS caslib. */
    DATA my_cas_library.sequence_data;
      INFILE DATALINES delimiter=',';
      /* All six columns are read as character values. */
      INPUT _start_ $ _end_ $ _token_ $ feature1 $ feature2 $ label $;
      DATALINES;
    BEGIN,WORD,This,Cap,Len3,O
    WORD,WORD,is,Low,Len2,O
    WORD,END,a,Low,Len1,O
    BEGIN,WORD,SAS,Up,Len3,B-ORG
    WORD,END,Viya,Cap,Len4,I-ORG
    ;
    RUN;
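Before training, it can help to confirm that the table loaded as expected. The following is a minimal sketch, assuming an active CAS session and that the table is reachable as 'sequence_data' in the 'my_cas_library' caslib. It uses only the general-purpose table.columnInfo and table.fetch actions, which are not part of the CRF action set.

```sas
PROC CAS;
  /* List column names and types; the six columns should all be character. */
  table.columnInfo / table={name='sequence_data', caslib='my_cas_library'};

  /* Display up to 10 rows (the sample has five). The row order in the
     output may differ from the DATALINES order. */
  table.fetch / table={name='sequence_data', caslib='my_cas_library'}, to=10;
RUN;
```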
This example demonstrates basic training of a Conditional Random Fields model. It uses the 'sequence_data' table as input, specifies 'label' as the target variable, and provides a template with a single feature. The trained model components are saved into separate tables for labels, attributes, features, and the template itself, plus the 'attrfeature' table named in the model parameter.

    PROC CAS;
      conditionalRandomFields.crfTrain /
        table={name='sequence_data', caslib='my_cas_library'},
        target='label',
        template='U00:%x[0,0]',
        model={
          label={name='crf_labels', replace=true},
          attr={name='crf_attrs', replace=true},
          feature={name='crf_features', replace=true},
          attrfeature={name='crf_attrfeatures', replace=true},
          template={name='crf_template', replace=true}
        };
    RUN;
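After the action completes, the model tables can be inspected like any other CAS table. This is a minimal sketch, assuming the training call succeeded and that the tables were written to the session's active caslib, since the example gives no caslib for them. table.fetch is a general-purpose action, and the column layout of the model tables is not described here.

```sas
PROC CAS;
  /* Output table named by model.label in the training call. */
  table.fetch / table={name='crf_labels'}, to=20;

  /* Output table named by model.template in the training call. */
  table.fetch / table={name='crf_template'}, to=20;
RUN;
```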
This example shows a more advanced training configuration. It uses the LBFGS optimization algorithm and sets several solver options: an L1 regularization weight (regL1=0.5) to limit overfitting, a maximum of 50 iterations, and detailed print output. The feature template is also more complex. It has three unigram entries (U00 to U02) for the current, previous, and next token, and one bigram entry (B01) that combines the previous and current token.

    PROC CAS;
      conditionalRandomFields.crfTrain /
        table={name='sequence_data', caslib='my_cas_library'},
        target='label',
        template='U00:%x[0,0]
    U01:%x[-1,0]
    U02:%x[1,0]
    B01:%x[-1,0]/%x[0,0]',
        nloOpts={
          algorithm='LBFGS',
          optmlOpt={regL1=0.5, maxIters=50},
          printOpt={printLevel='PRINTDETAIL'}
        },
        model={
          label={name='crf_labels_detailed', replace=true, caslib='my_cas_library'},
          attr={name='crf_attrs_detailed', replace=true, caslib='my_cas_library'},
          feature={name='crf_features_detailed', replace=true, caslib='my_cas_library'},
          attrfeature={name='crf_attrfeatures_detailed', replace=true, caslib='my_cas_library'},
          template={name='crf_template_detailed', replace=true, caslib='my_cas_library'}
        };
    RUN;
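To check that all five model tables were created in 'my_cas_library', one option is to loop over their names and request table metadata. This is a sketch under the assumption that the training call above succeeded. The variable names modelTables and tbl are illustrative, and table.tableInfo is a general-purpose action rather than part of the CRF action set.

```sas
PROC CAS;
  /* Names of the five model tables written by the training call above. */
  modelTables = {'crf_labels_detailed', 'crf_attrs_detailed',
                 'crf_features_detailed', 'crf_attrfeatures_detailed',
                 'crf_template_detailed'};

  do tbl over modelTables;
    /* Report row and column counts for each table. */
    table.tableInfo / caslib='my_cas_library', name=tbl;
  end;
RUN;
```

A missing table shows up as an error or empty result for that name, which makes it easy to see which output was not produced.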