The alJoin action is a fundamental tool in active learning workflows, designed to merge a primary data table with an annotation table. This process is crucial for enriching the dataset with labels or other metadata, which is then used in subsequent training iterations. The action supports various join types (such as INNER, LEFT, RIGHT, FULL) to provide flexibility in how the data is combined, based on a shared identifier column.
| Parameter | Description |
|---|---|
| annotatedTable | Specifies the in-memory table containing the annotation data for the join operation. This table is typically the source of labels or other metadata. |
| casOut | Specifies the output table to store the results of the join operation. This new table will contain the combined data from the input and annotation tables. |
| id | Specifies the identifier column used to join the data table and the annotation table. This column must exist in both tables to match records. |
| joinType | Defines the method for joining the tables. Options include APPEND, INNER, LEFT, RIGHT, and FULL. The default is LEFT join. |
| logLevel | Controls the level of detail for progress messages sent to the client. Level 0 (default) sends no messages, 1 sends start/end messages, and 2 includes iteration history. |
| table | Specifies the primary in-memory data table for the join operation. |
This code snippet creates two sample tables: 'raw_data' containing the main dataset with features, and 'annotations' containing the labels. These tables will be used to demonstrate the alJoin action.
| 1 | PROC CAS; SESSION casauto; |
| 2 | |
| 3 | DATA casuser.raw_data; |
| 4 | DO i = 1 to 10; |
| 5 | x1 = rand('UNIFORM'); |
| 6 | x2 = rand('UNIFORM'); |
| 7 | OUTPUT; |
| 8 | END; |
| 9 | RUN; |
| 10 | |
| 11 | DATA casuser.annotations; |
| 12 | DO i = 1, 3, 5, 7, 9; |
| 13 | label = ifn(rand('UNIFORM') > 0.5, 'A', 'B'); |
| 14 | OUTPUT; |
| 15 | END; |
| 16 | RUN; |
| 17 | |
| 18 | QUIT; |
This example performs a simple LEFT join, merging the 'raw_data' table with the 'annotations' table using the 'i' column as the identifier. The result is stored in 'joined_data'.
| 1 | PROC CAS; |
| 2 | ACTION activeLearn.alJoin / |
| 3 | TABLE={name='raw_data'} |
| 4 | annotatedTable={name='annotations'} |
| 5 | id='i' |
| 6 | casOut={name='joined_data', replace=true}; |
| 7 | RUN; |
| 8 | QUIT; |
This example demonstrates an INNER join, which only includes rows where the identifier 'i' exists in both the 'raw_data' and 'annotations' tables. This is useful for creating a dataset containing only labeled observations.
| 1 | PROC CAS; |
| 2 | ACTION activeLearn.alJoin / |
| 3 | TABLE={name='raw_data'} |
| 4 | annotatedTable={name='annotations'} |
| 5 | id='i' |
| 6 | joinType='INNER' |
| 7 | casOut={name='joined_data_inner', caslib='casuser', replace=true}; |
| 8 | RUN; |
| 9 | QUIT; |
A marketing department needs to analyze the effectiveness of a recent email campaign. They have a main table of all customers and a smaller table containing the IDs of customers...
In a manufacturing plant, millions of sensor readings are generated daily. A separate system allows engineers to log specific timestamps where a machine failure was observed. Th...
A data quality team is tasked with reconciling two datasets: a primary customer list and a secondary list of service subscriptions. Due to different data entry processes, some c...