The crfScore action uses a Conditional Random Fields (CRF) model to score input data. It performs sequence labeling on the documents, predicting a sequence of labels for a sequence of tokens. This is commonly used for tasks like Named Entity Recognition (NER). The action requires a pre-trained CRF model, which consists of several tables (attributes, features, labels, etc.), and an input table containing the text to be scored.
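To make "predicting a sequence of labels for a sequence of tokens" concrete, the following is a minimal Python sketch of Viterbi decoding, the standard way a linear-chain CRF picks its best label sequence. It is not the implementation behind `crfScore`. The function name `viterbi` and the `emit` and `trans` score dictionaries are hypothetical, and the weights are made up for illustration.

```python
from typing import Dict, List, Tuple


def viterbi(
    tokens: List[str],
    labels: List[str],
    emit: Dict[Tuple[str, str], float],
    trans: Dict[Tuple[str, str], float],
) -> List[str]:
    """Return the highest-scoring label sequence under additive scores.

    emit[(token, label)] and trans[(prev_label, label)] are hypothetical
    weights; missing entries count as 0.0.
    """
    if not tokens:
        return []
    # best[i][y] = best score of any labeling of tokens[:i + 1] that ends in y
    best = [{y: emit.get((tokens[0], y), 0.0) for y in labels}]
    back = []  # back[i][y] = best previous label when tokens[i + 1] is labeled y
    for tok in tokens[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + trans.get((p, y), 0.0))
            scores[y] = best[-1][prev] + trans.get((prev, y), 0.0) + emit.get((tok, y), 0.0)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Made-up weights for a three-token sentence.
LABELS = ["B-PER", "I-PER", "O"]
EMIT = {("John", "B-PER"): 1.5, ("Smith", "B-PER"): 0.5,
        ("Smith", "I-PER"): 0.4, ("lives", "O"): 1.0}
TRANS = {("B-PER", "I-PER"): 1.0, ("O", "I-PER"): -5.0}

print(viterbi(["John", "Smith", "lives"], LABELS, EMIT, TRANS))
```

Taken alone, "Smith" scores slightly higher as `B-PER` than as `I-PER`, but the transition weight from `B-PER` to `I-PER` makes the whole sequence `B-PER, I-PER, O` the best one. This is the advantage of sequence labeling over labeling each token independently.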
| Parameter | Description |
|---|---|
| casOut | Specifies the output CAS table to store the tagged data. This table will contain the original data along with the predicted labels for each token. |
| model | Specifies the input tables that constitute the trained CRF model. This is a dictionary parameter that must include the 'attr', 'attrfeature', 'feature', 'label', and 'template' tables. |
| table | Specifies the input CAS table that contains the documents to be scored. |
| target | Specifies the name of the variable in the output table that will contain the predicted labels (the hidden sequence). |
This example first creates a sample input table 'score_data' with document IDs and text. Then, it simulates the creation of the five required model tables ('crf_attr', 'crf_attr_feature', 'crf_feature', 'crf_label', 'crf_template') that would typically be generated by the 'crfTrain' action. These tables are necessary for the 'crfScore' action to function.
```sas
/* 1. Create sample data to score */
/* Assumes an active CAS session and a CAS engine libref named mycas */
DATA mycas.score_data;
    INFILE DATALINES delimiter='|';
    LENGTH docid $ 10 text $ 300;
    INPUT docid $ text $;
    DATALINES;
1|John Smith lives in New York.
2|Mary works for SAS Institute.
;
RUN;

/* 2. Simulate pre-existing CRF model tables (usually created by crfTrain) */

/* Label Table: one row per label and its entity type */
DATA mycas.crf_label;
    INFILE DATALINES delimiter=',';
    LENGTH _label_ $20 _type_ $20;
    INPUT _label_ $ _type_ $;
    DATALINES;
B-PER,PERSON
I-PER,PERSON
B-ORG,ORGANIZATION
I-ORG,ORGANIZATION
B-LOC,LOCATION
I-LOC,LOCATION
O,OTHER
;
RUN;

/* Attribute Table: one row per observed attribute value */
DATA mycas.crf_attr;
    INFILE DATALINES delimiter=',';
    LENGTH _attr_ $50 _value_ $50;
    INPUT _attr_ $ _value_ $;
    DATALINES;
WORD[0],John
WORD[0],Smith
WORD[0],lives
WORD[0],in
WORD[0],New
WORD[0],York
WORD[0],Mary
WORD[0],works
WORD[0],for
WORD[0],SAS
WORD[0],Institute
;
RUN;

/* Feature Table */
DATA mycas.crf_feature;
    INFILE DATALINES delimiter=',';
    LENGTH _feature_ $50;
    INPUT _feature_ $;
    DATALINES;
U01:York
U02:New
L:B-PER
U00:John
U00:Smith
U00:Mary
U00:SAS
;
RUN;

/* Attribute-Feature Table: links attribute IDs to feature IDs with a weight */
/* Data lines are comma-separated to match delimiter=',' */
DATA mycas.crf_attr_feature;
    INFILE DATALINES delimiter=',';
    INPUT _attrid_ _featureid_ _weight_;
    DATALINES;
1,4,1.5
2,5,1.6
3,1,0.2
4,2,0.3
5,6,1.7
6,7,1.8
;
RUN;

/* Template Table */
DATA mycas.crf_template;
    INFILE DATALINES delimiter=',';
    LENGTH _template_ $100;
    INPUT _template_ $;
    DATALINES;
U00:%w[0]
U01:%w[1]
U02:%w[-1]
L
;
RUN;
```
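The simulated template and feature tables are easier to read with a picture of how a template could turn tokens into feature strings. The Python sketch below shows one plausible reading of the `U00:%w[0]`, `U01:%w[1]`, and `U02:%w[-1]` entries, where `%w[k]` stands for the word at relative offset `k`. This interpretation and the helper name `expand_templates` are assumptions made for illustration, not the documented SAS template semantics.

```python
import re
from typing import List

# Matches entries such as "U01:%w[1]" (assumed notation: %w[k] = word at offset k).
TEMPLATE_RE = re.compile(r"^(?P<id>\w+):%w\[(?P<offset>-?\d+)\]$")


def expand_templates(tokens: List[str], templates: List[str]) -> List[List[str]]:
    """Return, for each token position, the feature strings the templates yield."""
    features = []
    for i in range(len(tokens)):
        row = []
        for template in templates:
            match = TEMPLATE_RE.match(template)
            if match is None:
                continue  # entries without a %w[k] part (such as "L") are skipped here
            j = i + int(match.group("offset"))
            if 0 <= j < len(tokens):  # offsets outside the sentence produce no feature
                row.append(f"{match.group('id')}:{tokens[j]}")
        features.append(row)
    return features


TEMPLATES = ["U00:%w[0]", "U01:%w[1]", "U02:%w[-1]", "L"]
for token, feats in zip(["New", "York"], expand_templates(["New", "York"], TEMPLATES)):
    print(token, feats)
```

Under this reading, the token "New" produces `U01:York` (the next word) and the token "York" produces `U02:New` (the previous word), which matches the first two rows of the simulated feature table.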
This example demonstrates how to use the `crfScore` action to apply a trained Conditional Random Fields model to new data. It specifies the input data table ('score_data'), the set of model tables, and the output table ('crf_scored_output'). The `target` parameter names the new column that will hold the predicted entity labels.
```sas
PROC CAS;
    /* Action parameters follow the slash */
    conditionalRandomFields.crfScore /
        table={name='score_data'},
        model={
            attr={name='crf_attr'},
            attrfeature={name='crf_attr_feature'},
            feature={name='crf_feature'},
            label={name='crf_label'},
            template={name='crf_template'}
        },
        casOut={name='crf_scored_output', replace=true},
        target='predicted_label';
RUN;
QUIT;

/* Display the scored results */
PROC PRINT DATA=mycas.crf_scored_output;
RUN;
```
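Token-level labels such as `B-PER` and `I-PER` usually need to be merged into whole entities before they are useful downstream. The Python sketch below shows one common way to do that for BIO-style labels like those in the simulated label table. It assumes the scored output has been exported as parallel lists of tokens and predicted labels; the exact layout of the `crfScore` output table is not shown here, and `bio_to_entities` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def bio_to_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge BIO token labels into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the entity in progress, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, label in zip(tokens, labels):
        prefix, _, entity_type = label.partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # A "B-" label, or an "I-" label that does not continue the current
            # entity, starts a new entity (a lenient choice for noisy predictions).
            flush()
            current_tokens, current_type = [token], entity_type
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O" or any unrecognized label ends the current entity.
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities


tokens = ["John", "Smith", "lives", "in", "New", "York"]
labels = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
print(bio_to_entities(tokens, labels))
```

For the first sample document, this groups "John Smith" as a `PER` entity and "New York" as a `LOC` entity.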
A hospital wants to automatically structure unstructured patient notes by identifying symptoms and medications using a pre-trained CRF model.
An e-commerce platform needs to tag named entities (Product Names, Brands) in thousands of daily customer reviews to analyze trends. Performance and stability on larger datasets...
A social media monitoring tool processes tweets that may be empty, contain special characters, or words not present in the training dictionary (Out-Of-Vocabulary).
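For the last scenario, a light pre-scoring check can set aside empty documents and report out-of-vocabulary tokens before the data reaches the model. The Python sketch below is one way to do this under simplifying assumptions: documents are held in a dictionary keyed by document ID, tokens are split on whitespace (real tokenization may differ), and the vocabulary is the set of word values from the attribute table. The helper names `partition_documents` and `oov_tokens` are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def partition_documents(
    docs: Dict[str, Optional[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Split documents into scoreable ones and the IDs of empty ones."""
    scoreable: Dict[str, str] = {}
    skipped: List[str] = []
    for docid, text in docs.items():
        if text is None or not text.strip():
            skipped.append(docid)  # missing, empty, or whitespace-only text
        else:
            scoreable[docid] = text
    return scoreable, skipped


def oov_tokens(text: str, vocabulary: Set[str]) -> List[str]:
    """Return whitespace-separated tokens that are not in the vocabulary."""
    return [token for token in text.split() if token not in vocabulary]


docs = {"1": "John lives in Paris", "2": "   ", "3": ""}
vocabulary = {"John", "lives", "in"}

scoreable, skipped = partition_documents(docs)
print(skipped)
print({docid: oov_tokens(text, vocabulary) for docid, text in scoreable.items()})
```

A high share of out-of-vocabulary tokens in a document suggests that its predicted labels deserve extra scrutiny.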