Performance Case: Isolating Annotated Events from High-Volume Sensor Data

Business Context

In a manufacturing plant, millions of sensor readings are generated daily. A separate system allows engineers to log specific timestamps where a machine failure was observed. The goal is to efficiently extract only the sensor data points that correspond exactly to the annotated failure events for root cause analysis.

About the Set : activeLearn

Active learning to optimize data labeling.

Discover all actions of activeLearn

Data Preparation

Creates a large table of simulated sensor data ('iot_sensor_data') with 100,000 records and a very small table ('failure_annotations') with only a few annotated failure events. The join key is 'event_id'.

Copied!

1	DATA casuser.iot_sensor_data;
2	LENGTH event_id $ 20;
3	DO i = 1 TO 100000;
4	event_id = 'EVT-' \|\| PUT(i, Z10.);
5	sensor_a = RAND('NORMAL', 100, 5);
6	sensor_b = RAND('NORMAL', 50, 2);
7	OUTPUT;
8	END;
9	RUN;
10
11	DATA casuser.failure_annotations;
12	LENGTH event_id $ 20 failure_code $ 10;
13	event_id = 'EVT-00045123'; failure_code = 'OVERHEAT'; OUTPUT;
14	event_id = 'EVT-00078901'; failure_code = 'PRESSURE'; OUTPUT;
15	RUN;

Étapes de réalisation

Execute an INNER join to quickly filter the massive sensor table down to only the records that have a matching failure annotation. This is the most efficient way to isolate the events of interest.

Copied!

1	PROC CAS;
2	ACTION activeLearn.alJoin /
3	TABLE={name='iot_sensor_data'},
4	annotatedTable={name='failure_annotations'},
5	id='event_id',
6	joinType='INNER',
7	casOut={name='failure_data_points', replace=true};
8	RUN;
9	QUIT;

Verify the result. The output table should be very small, containing only the rows for the specified failure events.

Copied!

1	PROC CAS;
2	TABLE.fetch / TABLE={name='failure_data_points'};
3	TABLE.rowCount / TABLE={name='failure_data_points'};
4	RUN;
5	QUIT;

Expected Result

The output table 'failure_data_points' must contain exactly 2 rows. This demonstrates the action's efficiency in using an INNER join to select a tiny subset of data from a very large table based on a small annotation table, which is a critical performance requirement for big data analytics.

Voir la documentation technique de alJoin