activeLearn

alJoin

Description

The alJoin action merges a primary data table with an annotation table as part of an active learning workflow. The merge enriches the dataset with labels or other metadata, which are then used in subsequent training iterations. The two tables are matched on a shared identifier column, and the joinType parameter (APPEND, INNER, LEFT, RIGHT, or FULL) controls how their rows are combined.

activeLearn.alJoin { annotatedTable={...}, casOut={...}, id="string", joinType="string", logLevel=integer, table={...} }

Settings

Parameter | Description
annotatedTable | Specifies the in-memory table containing the annotation data for the join operation. This table is typically the source of labels or other metadata.
casOut | Specifies the output table to store the results of the join operation. This new table will contain the combined data from the input and annotation tables.
id | Specifies the identifier column used to join the data table and the annotation table. This column must exist in both tables to match records.
joinType | Defines the method for joining the tables. Options include APPEND, INNER, LEFT, RIGHT, and FULL. The default is LEFT.
logLevel | Controls the level of detail for progress messages sent to the client. Level 0 (default) sends no messages, 1 sends start/end messages, and 2 includes iteration history.
table | Specifies the primary in-memory data table for the join operation.
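
As a rough sketch of how these settings fit together, the call below sets every parameter explicitly instead of relying on defaults. The table names raw_data and annotations, the identifier column i, and the output name joined_data_full are illustrative assumptions that match the sample tables used in the examples on this page. The sketch also assumes an active CAS session in which both tables already exist in the casuser caslib, and that FULL behaves like a conventional full outer join.

```sas
PROC CAS;
  /* Assumption: the activeLearn action set may not be loaded yet */
  LOADACTIONSET 'activeLearn';

  ACTION activeLearn.alJoin /
    table={name='raw_data', caslib='casuser'},            /* primary data table */
    annotatedTable={name='annotations', caslib='casuser'}, /* table holding the labels */
    id='i',                                               /* identifier present in both tables */
    joinType='FULL',                                      /* overrides the default LEFT join */
    logLevel=1,                                           /* request start/end messages */
    casOut={name='joined_data_full', caslib='casuser', replace=true};
  RUN;
QUIT;
```

Setting logLevel=1 is optional; it is included here only to show where the parameter goes.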
Data Preparation
Data Creation for alJoin Examples

This code snippet creates two sample tables: 'raw_data' containing the main dataset with features, and 'annotations' containing the labels. These tables will be used to demonstrate the alJoin action.

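/* Assign a libref to the casuser caslib.
   Assumes the CAS session casauto is already running. */
LIBNAME casuser CAS CASLIB='casuser' SESSREF=casauto;

/* Main dataset: 10 rows with an identifier (i) and two features */
DATA casuser.raw_data;
  DO i = 1 TO 10;
    x1 = rand('UNIFORM');
    x2 = rand('UNIFORM');
    OUTPUT;
  END;
RUN;

/* Annotations: labels for a subset of the identifiers */
DATA casuser.annotations;
  DO i = 1, 3, 5, 7, 9;
    /* ifc returns a character value; ifn only returns numeric values */
    label = ifc(rand('UNIFORM') > 0.5, 'A', 'B');
    OUTPUT;
  END;
RUN;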

Examples

This example performs a LEFT join, merging the 'raw_data' table with the 'annotations' table using the 'i' column as the identifier. Because joinType is not specified, the default (LEFT) applies. The result is stored in 'joined_data'.

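PROC CAS;
  /* Load the action set in case it is not already available in the session */
  LOADACTIONSET 'activeLearn';

  /* joinType is omitted, so the default (LEFT) is used */
  ACTION activeLearn.alJoin /
    table={name='raw_data'},
    annotatedTable={name='annotations'},
    id='i',
    casOut={name='joined_data', replace=true};
  RUN;
QUIT;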
Result:
An output table named 'joined_data' is created in the active caslib (casuser in this setup). It contains all rows from 'raw_data', with the 'label' column populated for rows where 'i' matches a record in the 'annotations' table. Rows with no match have a missing value for 'label'.
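
One way to check this result is to fetch the joined rows and look at the 'label' column. The sketch below uses the table.fetch action and assumes the LEFT join example above has already run in the same session, so that 'joined_data' exists in the active caslib.

```sas
PROC CAS;
  /* Fetch up to 10 rows, sorted by the identifier, to inspect the labels */
  ACTION table.fetch /
    table={name='joined_data'},
    sortBy={{name='i'}},
    to=10;
  RUN;
QUIT;
```

If the join behaved as described, rows with i = 1, 3, 5, 7, 9 should show a label and the remaining rows should show a missing value.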

This example demonstrates an INNER join, which only includes rows where the identifier 'i' exists in both the 'raw_data' and 'annotations' tables. This is useful for creating a dataset containing only labeled observations.

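PROC CAS;
  /* Load the action set in case it is not already available in the session */
  LOADACTIONSET 'activeLearn';

  /* Keep only rows whose identifier appears in both tables */
  ACTION activeLearn.alJoin /
    table={name='raw_data'},
    annotatedTable={name='annotations'},
    id='i',
    joinType='INNER',
    casOut={name='joined_data_inner', caslib='casuser', replace=true};
  RUN;
QUIT;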
Result:
A new table 'joined_data_inner' is created. It contains only the 5 rows (i = 1, 3, 5, 7, 9) for which both the features from 'raw_data' and the labels from 'annotations' are present.
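
The expected row count can be confirmed with a quick check. This minimal sketch uses the simple.numRows action and assumes the INNER join example above has already run, so that 'joined_data_inner' exists in the casuser caslib. The variable name r is arbitrary.

```sas
PROC CAS;
  /* Count the rows in the joined table and store the result in r */
  ACTION simple.numRows result=r /
    table={name='joined_data_inner', caslib='casuser'};
  /* Print the whole result dictionary to the log */
  PRINT r;
  RUN;
QUIT;
```

With the sample data from the Data Preparation section, the count should come out as 5 if the join worked as described.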

FAQ

What is the purpose of the alJoin action?
What are the required parameters for the alJoin action?
How do you specify the tables to be joined?
What is the function of the 'id' parameter in the alJoin action?
Which types of joins does the alJoin action support?
How is the output of the alJoin action specified?

Associated Scenarios

Use Case
Standard Case: Merging Customer Data with Campaign Responses

A marketing department needs to analyze the effectiveness of a recent email campaign. They have a main table of all customers and a smaller table containing the IDs of customers...

Use Case
Performance Case: Isolating Annotated Events from High-Volume Sensor Data

In a manufacturing plant, millions of sensor readings are generated daily. A separate system allows engineers to log specific timestamps where a machine failure was observed. Th...

Use Case
Edge Case: Reconciling Disparate Datasets with a Full Join

A data quality team is tasked with reconciling two datasets: a primary customer list and a secondary list of service subscriptions. Due to different data entry processes, some c...