The match action processes an input data table to identify and group similar records, a process known as entity resolution or matching. It assigns a unique cluster ID to each group of matching records based on user-defined rules, facilitating the identification of duplicate or related entities within a dataset. This is a fundamental step in data quality and master data management.
| Parameter | Description |
|---|---|
| algorithm | Specifies the algorithm to use for matching. AUTO lets CAS decide, DISTRIBUTED runs the matching across all nodes, and SINGLE runs it on a single node. |
| clusterId | Specifies the name of the output column that will contain the generated cluster IDs for matching records. |
| clusterIdLabel | Specifies the descriptive label for the cluster ID column in the output table. |
| clusterIdType | Specifies the data type for the cluster ID column. Can be CHAR, DOUBLE, or INT. |
| columns | Specifies which columns from the input table to include (pass through) in the output table. If omitted, all columns are included. |
| doNotCluster | Specifies an input column that acts as a flag. If a row's value in this column is 'true' or '1', it will be placed in its own unique cluster. |
| emptyStringIsNull | When set to TRUE, treats empty string values in matching columns as NULL values. |
| inTable | Specifies the input data table to be processed for entity resolution. |
| matchRules | Defines the set of rules used to determine if records match. Each rule specifies one or more columns to compare. |
| nThreads | Specifies the number of threads to use for the operation on each worker node. A value of 0 uses the system default. |
| nullValuesMatch | When set to TRUE, records where the matching columns are all NULL will be grouped together. |
| outTable | Specifies the output data table where the results, including the cluster IDs, will be written. |
This example creates a sample CAS table named 'CONTACTS' with names and addresses. This table will be used to demonstrate the matching capabilities of the entityRes.match action.
| 1 | PROC CAS; |
| 2 | LOADACTIONSET 'dataStep'; |
| 3 | RUN; |
| 4 | dataStep.runCode / code='data casuser.contacts; |
| 5 | length name $ 20 address $ 30; |
| 6 | infile datalines delimiter=","; |
| 7 | input name address; |
| 8 | datalines; |
| 9 | John Smith,123 Main St |
| 10 | Jon Smith,123 Main Street |
| 11 | Jane Doe,456 Oak Ave |
| 12 | Jane Dow,456 Oak Avenue |
| 13 | Peter Jones,789 Pine Ln |
| 14 | ; |
| 15 | run;'; |
| 16 | RUN; |
| 17 |
This example performs a simple match on the 'CONTACTS' table. It uses a single rule that groups records if the 'name' column is identical. The resulting table 'CONTACTS_MATCHED' will contain the original data plus a cluster ID ('cID') for each group of matched records.
| 1 | PROC CAS; |
| 2 | LOADACTIONSET 'entityRes'; |
| 3 | RUN; |
| 4 | entityRes.match / |
| 5 | inTable={name='contacts' caslib='casuser'}, |
| 6 | outTable={name='contacts_matched' caslib='casuser', replace=true}, |
| 7 | clusterId='cID', |
| 8 | matchRules={{rule={columns={'name'}} }}; |
| 9 | RUN; |
| 10 | fedsql.execDirect query='select * from casuser.contacts_matched;'; |
| 11 | RUN; |
This example demonstrates a more complex matching scenario. It uses two matching rules: the first rule matches records with identical 'name', and the second rule matches records with identical 'address'. Records that satisfy either rule will be placed in the same cluster. All columns from the input table are passed to the output table 'CONTACTS_MATCHED_DETAILED', and the cluster ID column is explicitly named 'GroupID'.
| 1 | PROC CAS; |
| 2 | LOADACTIONSET 'entityRes'; |
| 3 | RUN; |
| 4 | entityRes.match / |
| 5 | inTable={name='contacts', caslib='casuser'}, |
| 6 | columns={'name', 'address'}, |
| 7 | outTable={name='contacts_matched_detailed', caslib='casuser', replace=true}, |
| 8 | clusterId='GroupID', |
| 9 | clusterIdLabel='Group Identifier', |
| 10 | clusterIdType='INT', |
| 11 | matchRules=[ |
| 12 | {rule={columns={'name'}}} |
| 13 | {rule={columns={'address'}}} |
| 14 | ]; |
| 15 | RUN; |
| 16 | TABLE.fetch / TABLE={name='contacts_matched_detailed', caslib='casuser'}; |
| 17 | RUN; |