match - WeAreCAS

Q: What is the purpose of the `match` action in the Entity Resolution action set?

The `match` action is used for Data Management Matching, which involves grouping rows into clusters based on specified rules.

Q: What does the `algorithm` parameter do?

The `algorithm` parameter specifies the algorithm to use for matching. The available options are 'AUTO', 'DISTRIBUTED', or 'SINGLE'. 'AUTO' is the default.
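As a rough illustration, the call below pins the algorithm to SINGLE instead of leaving it on AUTO. It assumes the casuser.contacts table built in the data preparation section of this page. The output table name contacts_single is hypothetical, and the matchRules form is borrowed from this page's own examples, which are still awaiting community validation.

```sas
proc cas;
   loadactionset 'entityRes';

   /* Run the match on a single node; contacts_single is a hypothetical output name. */
   entityRes.match /
      algorithm='SINGLE',
      inTable={name='contacts', caslib='casuser'},
      outTable={name='contacts_single', caslib='casuser', replace=true},
      matchRules={{rule={columns={'name'}}}};
   run;
quit;
```

Leaving `algorithm` unset keeps the default ('AUTO'), which is probably the sensible choice unless you have a specific reason to pin it.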

Q: How do I specify the input table for the matching process?

Use the `inTable` parameter to specify the input data table. You need to provide the table name and optionally the caslib.

Q: How can I define the matching criteria?

The matching criteria are defined using the `matchRules` parameter. It requires a list of rules, where each rule specifies the set of columns to be used for matching rows.

Q: How is the output of the matching process stored?

The `outTable` parameter specifies the output data table where the results, including the original columns and the new cluster IDs, are written. The `clusterId` parameter lets you name the column containing the cluster IDs, which defaults to 'cID'.

Q: Can I prevent certain rows from being clustered?

Yes, use the `doNotCluster` parameter. Set it to the name of an input-table column that holds a Boolean flag. Any row whose flag value is 'true' or '1' is excluded from matching and placed in its own separate cluster.
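A minimal sketch of how this might look in practice. The input table contacts_flagged and its flag column skip_flag are hypothetical (the sample CONTACTS table on this page has no such column), and the matchRules form follows this page's examples, which are awaiting community validation.

```sas
proc cas;
   loadactionset 'entityRes';

   /* Rows where skip_flag is 'true' or '1' should each end up in their own cluster. */
   /* contacts_flagged and skip_flag are assumed names, not part of the sample data. */
   entityRes.match /
      inTable={name='contacts_flagged', caslib='casuser'},
      outTable={name='contacts_flagged_matched', caslib='casuser', replace=true},
      doNotCluster='skip_flag',
      matchRules={{rule={columns={'name'}}}};
   run;
quit;
```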

Q: How are NULL or empty string values handled during matching?

By default, empty strings are treated as NULL values (`emptyStringIsNull` is TRUE). However, NULL values are not matched with each other (`nullValuesMatch` is FALSE). You can change these behaviors by setting these boolean parameters.
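The sketch below flips the NULL behavior so that rows with NULL (or empty) values in the matching column can be grouped together. It assumes the casuser.contacts table from the data preparation section. The output table name is hypothetical, and the matchRules form is taken from this page's examples, which are awaiting community validation.

```sas
proc cas;
   loadactionset 'entityRes';

   /* emptyStringIsNull=true is already the default; it is spelled out for clarity. */
   /* nullValuesMatch=true overrides the default (FALSE) so NULLs match each other.  */
   entityRes.match /
      inTable={name='contacts', caslib='casuser'},
      outTable={name='contacts_nullmatch', caslib='casuser', replace=true},
      emptyStringIsNull=true,
      nullValuesMatch=true,
      matchRules={{rule={columns={'address'}}}};
   run;
quit;
```

With both parameters left at their defaults, rows with an empty or NULL `address` would not be clustered with one another.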

Q: Is it possible to control which columns from the input table appear in the output?

Yes, the `columns` parameter allows you to specify a list of column names from the input table to be passed through to the output table. If this parameter is not specified, all input columns will be included in the output.
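For instance, the following sketch passes only the `name` column through to the output, so `address` would be dropped. It assumes the casuser.contacts table from the data preparation section. The output table name is hypothetical, and the matchRules form follows this page's examples, which are awaiting community validation.

```sas
proc cas;
   loadactionset 'entityRes';

   /* Only 'name' is passed through; the output should hold 'name' plus the cluster ID. */
   entityRes.match /
      inTable={name='contacts', caslib='casuser'},
      columns={'name'},
      outTable={name='contacts_names_only', caslib='casuser', replace=true},
      matchRules={{rule={columns={'name'}}}};
   run;
quit;
```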

Description

The match action processes an input data table to identify and group similar records, a process known as entity resolution or matching. It assigns a unique cluster ID to each group of matching records based on user-defined rules, facilitating the identification of duplicate or related entities within a dataset. This is a fundamental step in data quality and master data management.

entityRes.match {
   algorithm="AUTO" | "DISTRIBUTED" | "SINGLE",
   clusterId="string",
   clusterIdLabel="string",
   clusterIdType="CHAR" | "DOUBLE" | "INT",
   columns={"variable-name-1" <, "variable-name-2", ...>},
   doNotCluster="string",
   emptyStringIsNull=TRUE | FALSE,
   inTable={<casintable>},
   matchRules={{ruleConditions-1} <, {ruleConditions-2}, ...>},
   nThreads=integer,
   nullValuesMatch=TRUE | FALSE,
   outTable={<casouttable>}
};

Settings

Parameter	Description
algorithm	Specifies the algorithm to use for matching. AUTO (the default) lets CAS decide, DISTRIBUTED runs the matching across all nodes, and SINGLE runs it on a single node.
clusterId	Specifies the name of the output column that will contain the generated cluster IDs for matching records. Defaults to 'cID'.
clusterIdLabel	Specifies the descriptive label for the cluster ID column in the output table.
clusterIdType	Specifies the data type for the cluster ID column. Can be CHAR, DOUBLE, or INT.
columns	Specifies which columns from the input table to include (pass through) in the output table. If omitted, all columns are included.
doNotCluster	Specifies an input column that acts as a flag. If a row's value in this column is 'true' or '1', it will be placed in its own unique cluster.
emptyStringIsNull	When set to TRUE (the default), treats empty string values in matching columns as NULL values.
inTable	Specifies the input data table to be processed for entity resolution.
matchRules	Defines the set of rules used to determine if records match. Each rule specifies one or more columns to compare.
nThreads	Specifies the number of threads to use for the operation on each worker node. A value of 0 uses the system default.
nullValuesMatch	When set to TRUE, records where the matching columns are all NULL will be grouped together. Defaults to FALSE, so NULL values do not match each other.
outTable	Specifies the output data table where the results, including the cluster IDs, will be written.

Data Preparation

Create Sample Contact Data

This example creates a sample CAS table named 'CONTACTS' with names and addresses. This table will be used to demonstrate the matching capabilities of the entityRes.match action.

/* Assumes an active CAS session. The libref below points at the CASUSER caslib. */
libname casuser cas caslib='casuser';

/* DATALINES is not supported by the DATA step that runs inside CAS (dataStep.runCode), */
/* so the rows are read on the SAS client and written to the CAS table.                 */
data casuser.contacts;
   length name $ 20 address $ 30;
   infile datalines delimiter=',';
   input name address;
   datalines;
John Smith,123 Main St
Jon Smith,123 Main Street
Jane Doe,456 Oak Ave
Jane Dow,456 Oak Avenue
Peter Jones,789 Pine Ln
;
run;

Examples

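Basic Match on a Single Column

This example performs a simple match on the 'CONTACTS' table. It uses a single rule that groups records whose 'name' values are identical. The resulting table 'CONTACTS_MATCHED' contains the original data plus a cluster ID column ('cID') identifying each group of matched records. Because no two names in the sample data are spelled identically, each row is expected to receive its own cluster ID.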

SAS® / CAS Code (awaiting community validation)

proc cas;
   /* Load the Entity Resolution action set. */
   loadactionset 'entityRes';

   /* Cluster rows whose 'name' values are identical. */
   entityRes.match /
      inTable={name='contacts', caslib='casuser'},
      outTable={name='contacts_matched', caslib='casuser', replace=true},
      clusterId='cID',
      matchRules={{rule={columns={'name'}}}};
   run;

   /* Display the clustered output. */
   fedSql.execDirect / query='select * from casuser.contacts_matched';
   run;
quit;

Detailed Match with Multiple Rules and All Columns

This example demonstrates a more complex matching scenario. It uses two matching rules: the first matches records with identical 'name' values, and the second matches records with identical 'address' values. Records that satisfy either rule are placed in the same cluster. Both input columns are passed through to the output table 'CONTACTS_MATCHED_DETAILED', and the cluster ID column is explicitly named 'GroupID', given the label 'Group Identifier', and typed as INT.

SAS® / CAS Code (awaiting community validation)

proc cas;
   /* Load the Entity Resolution action set. */
   loadactionset 'entityRes';

   /* Two rules: a row pair matches if either 'name' or 'address' is identical. */
   entityRes.match /
      inTable={name='contacts', caslib='casuser'},
      columns={'name', 'address'},
      outTable={name='contacts_matched_detailed', caslib='casuser', replace=true},
      clusterId='GroupID',
      clusterIdLabel='Group Identifier',
      clusterIdType='INT',
      matchRules={
         {rule={columns={'name'}}},
         {rule={columns={'address'}}}
      };
   run;

   /* Display the clustered output. */
   table.fetch / table={name='contacts_matched_detailed', caslib='casuser'};
   run;
quit;
