alJoin - WeAreCAS

Q: What is the purpose of the alJoin action?

The alJoin action joins a data table with an annotation table. It is part of the Active Learning (activeLearn) action set.

Q: What are the required parameters for the alJoin action?

The alJoin action requires three main parameters: 'table' which specifies the primary data table, 'annotatedTable' which specifies the table containing annotation data, and 'casOut' which defines the output table for the joined results.

Q: How do you specify the tables to be joined?

You use the 'table' parameter for the main data table (aliased as 'left') and the 'annotatedTable' parameter for the annotation data (aliased as 'right' or 'annotation').

Q: What is the function of the 'id' parameter in the alJoin action?

The 'id' parameter (aliased as 'idVar') specifies the identifier column that is used to join the data table and the annotation table.

Q: Which types of joins does the alJoin action support?

The 'joinType' parameter allows you to specify the join method. Supported types are: APPEND, FULL, INNER, LEFT, and RIGHT. The default join type is LEFT.

Q: How is the output of the alJoin action specified?

The output is specified using the 'casOut' parameter, which is a required parameter that defines the properties of the new in-memory table that will store the joined results.

At a glance

Building a training dataset often means merging a feature table with labels that were produced elsewhere. The alJoin action in the activeLearn action set handles this step: it matches rows in a data table to rows in an annotation table through a shared identifier column. Several join types are available, so ground truth labels can be brought back into the primary workflow for model refinement. The FAQ covers the required parameters, the identifier column, the supported join types, and how the output table is specified.

Description

The alJoin action is a fundamental tool in active learning workflows, designed to merge a primary data table with an annotation table. This process is crucial for enriching the dataset with labels or other metadata, which is then used in subsequent training iterations. The action supports various join types (such as INNER, LEFT, RIGHT, FULL) to provide flexibility in how the data is combined, based on a shared identifier column.

activeLearn.alJoin { annotatedTable={...}, casOut={...}, id="string", joinType="string", logLevel=integer, table={...} }

Settings

Parameter	Description
annotatedTable	Specifies the in-memory table containing the annotation data for the join operation. This table is typically the source of labels or other metadata.
casOut	Specifies the output table to store the results of the join operation. This new table will contain the combined data from the input and annotation tables.
id	Specifies the identifier column used to join the data table and the annotation table. This column must exist in both tables to match records.
joinType	Defines the method for joining the tables. Options include APPEND, INNER, LEFT, RIGHT, and FULL. The default is LEFT join.
logLevel	Controls the level of detail for progress messages sent to the client. Level 0 (default) sends no messages, 1 sends start/end messages, and 2 includes iteration history.
table	Specifies the primary in-memory data table for the join operation.
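
As a rough illustration of how the parameters above combine, the following sketch requests a FULL join and turns on start/end progress messages. It assumes the 'raw_data' and 'annotations' tables created in the data preparation step of this page already exist in the active caslib; the output name 'joined_data_full' is hypothetical and chosen only for this example.

```sas
PROC CAS;
  /* Sketch only: uses the parameters listed in the Settings table.   */
  /* 'joined_data_full' is a hypothetical output table name.          */
  ACTION activeLearn.alJoin /
    table={name='raw_data'},
    annotatedTable={name='annotations'},
    id='i',
    joinType='FULL',      /* keep identifiers found in either table */
    logLevel=1,           /* start/end messages only                */
    casOut={name='joined_data_full', replace=true};
  RUN;
QUIT;
```

With the sample tables, every identifier in 'annotations' also appears in 'raw_data', so a FULL join would be expected to return the same identifiers as a LEFT join. The difference should only show up when the annotation table contains identifiers that are absent from the data table.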

Data Preparation

Data Creation for alJoin Examples

This code snippet creates two sample tables: 'raw_data' containing the main dataset with features, and 'annotations' containing the labels. These tables will be used to demonstrate the alJoin action.

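/* Assumes a CAS session named casauto is already running. */
/* The libref lets the DATA steps write to the casuser caslib. */
LIBNAME casuser CAS CASLIB='casuser' SESSREF=casauto;

/* Main dataset: 10 rows with identifier i and two random features */
DATA casuser.raw_data;
  DO i = 1 TO 10;
    x1 = rand('UNIFORM');
    x2 = rand('UNIFORM');
    OUTPUT;
  END;
RUN;

/* Annotations: labels for the odd-numbered identifiers only */
DATA casuser.annotations;
  LENGTH label $1;
  DO i = 1, 3, 5, 7, 9;
    /* IFC returns a character value; IFN would return a number */
    label = ifc(rand('UNIFORM') > 0.5, 'A', 'B');
    OUTPUT;
  END;
RUN;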
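
Because the join depends on the identifier column being present in both tables, it can help to inspect the column metadata before calling alJoin. The sketch below uses the table.columnInfo action for that purpose. It assumes both tables were written to the casuser caslib as in the data creation step; it is one possible check, not a required part of the workflow.

```sas
PROC CAS;
  /* List the columns of each input table. The column named in the */
  /* alJoin 'id' parameter ('i' here) should appear in both lists. */
  ACTION table.columnInfo / table={name='raw_data', caslib='casuser'};
  RUN;
  ACTION table.columnInfo / table={name='annotations', caslib='casuser'};
  RUN;
QUIT;
```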

Examples

Simple Left Join

This example performs a simple LEFT join, merging the 'raw_data' table with the 'annotations' table using the 'i' column as the identifier. The result is stored in 'joined_data'.

SAS® / CAS Code (awaiting community validation)

PROC CAS;
  /* LEFT is the default joinType, so it is not set explicitly */
  ACTION activeLearn.alJoin /
    table={name='raw_data'},
    annotatedTable={name='annotations'},
    id='i',
    casOut={name='joined_data', replace=true};
  RUN;
QUIT;

Result:
An output table named 'joined_data' is created in the active caslib (casuser, assuming it is the session's active caslib, since casOut does not name one). It contains all 10 rows from 'raw_data', with the 'label' column populated for rows where 'i' matches a record in the 'annotations' table. Rows with no match have a missing value for 'label'.
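
One way to confirm this result is to fetch the rows of the output table and look at which ones carry a label. The following is a minimal sketch using the table.fetch action. It assumes 'joined_data' was written to the session's active caslib, as described above.

```sas
PROC CAS;
  /* Fetch up to 10 rows, ordered by the identifier, to see which */
  /* rows received a label and which have a missing value.        */
  ACTION table.fetch /
    table={name='joined_data'},
    sortBy={{name='i'}},
    to=10;
  RUN;
QUIT;
```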

Inner Join with Different Join Type

This example demonstrates an INNER join, which only includes rows where the identifier 'i' exists in both the 'raw_data' and 'annotations' tables. This is useful for creating a dataset containing only labeled observations.

SAS® / CAS Code (awaiting community validation)

PROC CAS;
  /* INNER keeps only identifiers present in both tables */
  ACTION activeLearn.alJoin /
    table={name='raw_data'},
    annotatedTable={name='annotations'},
    id='i',
    joinType='INNER',
    casOut={name='joined_data_inner', caslib='casuser', replace=true};
  RUN;
QUIT;

Result:
A new table 'joined_data_inner' is created. It will contain only 5 rows (for i=1, 3, 5, 7, 9) where both the features from 'raw_data' and the labels from 'annotations' are present.
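
The expected row count can be checked after the join. The sketch below uses the table.recordCount action and stores its result in a CASL variable named r, which is a name chosen here for illustration.

```sas
PROC CAS;
  /* Count the rows of the INNER join output; with the sample data */
  /* described above, the count should match the 5 labeled ids.    */
  ACTION table.recordCount result=r /
    table={name='joined_data_inner', caslib='casuser'};
  RUN;
  PRINT r;
QUIT;
```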

Associated Scenarios

Use Case

Standard Case: Merging Customer Data with Campaign Responses

A marketing department needs to analyze the effectiveness of a recent email campaign. They have a main table of all customers and a smaller table containing the IDs of customers...

Use Case

Performance Case: Isolating Annotated Events from High-Volume Sensor Data

In a manufacturing plant, millions of sensor readings are generated daily. A separate system allows engineers to log specific timestamps where a machine failure was observed. Th...

Use Case

Edge Case: Reconciling Disparate Datasets with a Full Join

A data quality team is tasked with reconciling two datasets: a primary customer list and a secondary list of service subscriptions. Due to different data entry processes, some c...
