activeLearn alJoin

Performance Case: Isolating Annotated Events from High-Volume Sensor Data

Scénario de test & Cas d'usage

Business Context

In a manufacturing plant, millions of sensor readings are generated daily. A separate system allows engineers to log specific timestamps where a machine failure was observed. The goal is to efficiently extract only the sensor data points that correspond exactly to the annotated failure events for root cause analysis.
About the Set : activeLearn

Active learning to optimize data labeling.

Discover all actions of activeLearn
Data Preparation

Creates a large table of simulated sensor data ('iot_sensor_data') with 100,000 records and a very small table ('failure_annotations') with only a few annotated failure events. The join key is 'event_id'.

Copied!
1DATA casuser.iot_sensor_data;
2 LENGTH event_id $ 20;
3 DO i = 1 TO 100000;
4 event_id = 'EVT-' || PUT(i, Z10.);
5 sensor_a = RAND('NORMAL', 100, 5);
6 sensor_b = RAND('NORMAL', 50, 2);
7 OUTPUT;
8 END;
9RUN;
10 
11DATA casuser.failure_annotations;
12 LENGTH event_id $ 20 failure_code $ 10;
13 event_id = 'EVT-00045123'; failure_code = 'OVERHEAT'; OUTPUT;
14 event_id = 'EVT-00078901'; failure_code = 'PRESSURE'; OUTPUT;
15RUN;

Étapes de réalisation

1
Execute an INNER join to quickly filter the massive sensor table down to only the records that have a matching failure annotation. This is the most efficient way to isolate the events of interest.
Copied!
1PROC CAS;
2 ACTION activeLearn.alJoin /
3 TABLE={name='iot_sensor_data'},
4 annotatedTable={name='failure_annotations'},
5 id='event_id',
6 joinType='INNER',
7 casOut={name='failure_data_points', replace=true};
8RUN;
9QUIT;
2
Verify the result. The output table should be very small, containing only the rows for the specified failure events.
Copied!
1PROC CAS;
2 TABLE.fetch / TABLE={name='failure_data_points'};
3 TABLE.rowCount / TABLE={name='failure_data_points'};
4RUN;
5QUIT;

Expected Result


The output table 'failure_data_points' must contain exactly 2 rows. This demonstrates the action's efficiency in using an INNER join to select a tiny subset of data from a very large table based on a small annotation table, which is a critical performance requirement for big data analytics.