activeLearn alJoin

Edge Case: Reconciling Disparate Datasets with a Full Join

Scénario de test & Cas d'usage

Business Context

A data quality team is tasked with reconciling two datasets: a primary customer list and a secondary list of service subscriptions. Due to different data entry processes, some customers may be missing from the subscription list, and some subscriptions may belong to customers not in the primary list. The goal is to create a complete view of all records from both tables to identify and resolve discrepancies.
About the Set : activeLearn

Active learning to optimize data labeling.

Discover all actions of activeLearn
Data Preparation

Creates two tables with non-perfectly overlapping keys. 'customer_list' has customers 1-5. 'subscription_list' has subscriptions for customers 4, 5, and 6 (an orphan record).

Copied!
1DATA casuser.customer_list;
2 LENGTH cust_id $ 5;
3 DO i = 1 TO 5;
4 cust_id = 'C' || LEFT(PUT(i, 4.));
5 region = CHOOSE(i, 'NORTH', 'SOUTH', 'EAST', 'WEST', 'NORTH');
6 OUTPUT;
7 END;
8RUN;
9 
10DATA casuser.subscription_list;
11 LENGTH cust_id $ 5 sub_type $ 10;
12 cust_id = 'C4'; sub_type = 'PREMIUM'; OUTPUT;
13 cust_id = 'C5'; sub_type = 'BASIC'; OUTPUT;
14 cust_id = 'C6'; sub_type = 'PREMIUM'; OUTPUT;
15RUN;

Étapes de réalisation

1
Perform a FULL join to ensure that no records are dropped from either table. This is essential for data reconciliation tasks where the goal is to see all data from both sources.
Copied!
1PROC CAS;
2 ACTION activeLearn.alJoin /
3 TABLE={name='customer_list'},
4 annotatedTable={name='subscription_list'},
5 id='cust_id',
6 joinType='FULL',
7 casOut={name='reconciliation_view', replace=true};
8RUN;
9QUIT;
2
Analyze the joined table to identify discrepancies. We will check the total row count and fetch the full table to observe the missing values on both sides of the join.
Copied!
1PROC CAS;
2 TABLE.fetch / TABLE={name='reconciliation_view'};
3 TABLE.rowCount / TABLE={name='reconciliation_view'};
4RUN;
5QUIT;

Expected Result


The output table 'reconciliation_view' must contain 6 rows in total. Customers C1, C2, C3 will have missing values for 'sub_type'. Customers C4 and C5 will have complete data. Customer C6 will have a value for 'sub_type' but missing values for 'region'. This confirms the FULL join works as expected, retaining all records and highlighting data integrity issues.