entityRes

match

Description

The match action processes an input data table to identify and group similar records, a process known as entity resolution or matching. It assigns a unique cluster ID to each group of matching records based on user-defined rules, facilitating the identification of duplicate or related entities within a dataset. This is a fundamental step in data quality and master data management.

entityRes.match { algorithm="AUTO" | "DISTRIBUTED" | "SINGLE", clusterId="string", clusterIdLabel="string", clusterIdType="CHAR" | "DOUBLE" | "INT", columns={"variable-name-1" <, "variable-name-2", ...>}, doNotCluster="string", emptyStringIsNull=TRUE | FALSE, inTable={<casintable>}, matchRules={{ruleConditions-1} <, {ruleConditions-2}, ...>}, nThreads=integer, nullValuesMatch=TRUE | FALSE, outTable={<casouttable>} };
Settings
ParameterDescription
algorithmSpecifies the algorithm to use for matching. AUTO lets CAS decide, DISTRIBUTED runs the matching across all nodes, and SINGLE runs it on a single node.
clusterIdSpecifies the name of the output column that will contain the generated cluster IDs for matching records.
clusterIdLabelSpecifies the descriptive label for the cluster ID column in the output table.
clusterIdTypeSpecifies the data type for the cluster ID column. Can be CHAR, DOUBLE, or INT.
columnsSpecifies which columns from the input table to include (pass through) in the output table. If omitted, all columns are included.
doNotClusterSpecifies an input column that acts as a flag. If a row's value in this column is 'true' or '1', it will be placed in its own unique cluster.
emptyStringIsNullWhen set to TRUE, treats empty string values in matching columns as NULL values.
inTableSpecifies the input data table to be processed for entity resolution.
matchRulesDefines the set of rules used to determine if records match. Each rule specifies one or more columns to compare.
nThreadsSpecifies the number of threads to use for the operation on each worker node. A value of 0 uses the system default.
nullValuesMatchWhen set to TRUE, records where the matching columns are all NULL will be grouped together.
outTableSpecifies the output data table where the results, including the cluster IDs, will be written.
Data Preparation View data prep sheet
Create Sample Contact Data

This example creates a sample CAS table named 'CONTACTS' with names and addresses. This table will be used to demonstrate the matching capabilities of the entityRes.match action.

Copied!
1PROC CAS;
2 LOADACTIONSET 'dataStep';
3 RUN;
4 dataStep.runCode / code='data casuser.contacts;
5 length name $ 20 address $ 30;
6 infile datalines delimiter=",";
7 input name address;
8 datalines;
9 John Smith,123 Main St
10 Jon Smith,123 Main Street
11 Jane Doe,456 Oak Ave
12 Jane Dow,456 Oak Avenue
13 Peter Jones,789 Pine Ln
14 ;
15 run;';
16RUN;
17 

Examples

This example performs a simple match on the 'CONTACTS' table. It uses a single rule that groups records if the 'name' column is identical. The resulting table 'CONTACTS_MATCHED' will contain the original data plus a cluster ID ('cID') for each group of matched records.

SAS® / CAS Code Code awaiting community validation
Copied!
1PROC CAS;
2 LOADACTIONSET 'entityRes';
3 RUN;
4 entityRes.match /
5 inTable={name='contacts' caslib='casuser'},
6 outTable={name='contacts_matched' caslib='casuser', replace=true},
7 clusterId='cID',
8 matchRules={{rule={columns={'name'}} }};
9 RUN;
10 fedsql.execDirect query='select * from casuser.contacts_matched;';
11RUN;

This example demonstrates a more complex matching scenario. It uses two matching rules: the first rule matches records with identical 'name', and the second rule matches records with identical 'address'. Records that satisfy either rule will be placed in the same cluster. All columns from the input table are passed to the output table 'CONTACTS_MATCHED_DETAILED', and the cluster ID column is explicitly named 'GroupID'.

SAS® / CAS Code Code awaiting community validation
Copied!
1PROC CAS;
2 LOADACTIONSET 'entityRes';
3 RUN;
4 entityRes.match /
5 inTable={name='contacts', caslib='casuser'},
6 columns={'name', 'address'},
7 outTable={name='contacts_matched_detailed', caslib='casuser', replace=true},
8 clusterId='GroupID',
9 clusterIdLabel='Group Identifier',
10 clusterIdType='INT',
11 matchRules=[
12 {rule={columns={'name'}}}
13 {rule={columns={'address'}}}
14 ];
15 RUN;
16 TABLE.fetch / TABLE={name='contacts_matched_detailed', caslib='casuser'};
17RUN;

FAQ

What is the purpose of the `match` action in the Entity Resolution action set?
What does the `algorithm` parameter do?
How do I specify the input table for the matching process?
How can I define the matching criteria?
How is the output of the matching process stored?
Can I prevent certain rows from being clustered?
How are NULL or empty string values handled during matching?
Is it possible to control which columns from the input table appear in the output?