dataSciencePilot analyzeMissingPatterns

Performance Case: Analyzing Missing Sensor Readings in High-Volume IoT Data

Test Scenario & Use Case

Business Context

A manufacturing plant monitors its equipment with thousands of IoT sensors, some of which intermittently fail to report data. The goal is to analyze these missing-data patterns at scale and identify sensor models or locations that may be failing, without exhausting memory while counting distinct values.
About the Set: dataSciencePilot

Automated Machine Learning (AutoML) and pipeline generation.

Data Preparation

Create a summarized dataset of IoT sensor readings. 'sensor_id' has high cardinality, 'reading_count' serves as the frequency weight, and 'temperature' and 'pressure' contain missing values.

DATA casuser.iot_sensor_logs;
    LENGTH sensor_id $ 15. location $ 10.;
    INPUT sensor_id $ location $ temperature pressure reading_count;
    CARDS;
SENSOR_A-001 Assembly1 25.5 101.2 1500
SENSOR_B-734 Assembly1 . 101.5 50
SENSOR_C-109 Painting2 30.1 . 800
SENSOR_A-002 Assembly1 25.6 101.3 2000
SENSOR_D-500 Painting2 . . 25
SENSOR_B-735 Assembly1 28.2 101.9 1200
;
RUN;

Implementation Steps

1. Load the summarized sensor data into CAS.
PROC CASUTIL;
    load DATA=casuser.iot_sensor_logs outcaslib='casuser' casout='iot_sensor_logs' replace;
RUN;
QUIT;
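2. Run the analysis with 'reading_count' as the frequency variable. Set a low 'distinctCountLimit' so that the high-cardinality 'sensor_id' variable exceeds it and the Misra-Gries algorithm is used. The six-row sample above stays under the limit of 100; the scenario targets a production table with thousands of distinct sensors.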
PROC CAS;
    dataSciencePilot.analyzeMissingPatterns /
        TABLE={name='iot_sensor_logs', caslib='casuser'},
        inputs={{name='sensor_id'}, {name='location'}, {name='temperature'}, {name='pressure'}},
        nominals={'sensor_id', 'location'},
        freq='reading_count',
        distinctCountLimit=100,
        misraGries=TRUE,
        casOut={name='iot_missing_perf_test', caslib='casuser', replace=true};
RUN;
QUIT;
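To give a feel for why Misra-Gries keeps memory bounded on a high-cardinality column, here is a minimal Python sketch of a weighted variant of the algorithm. It illustrates the general technique only; how the action implements it internally, and whether it applies the frequency weight this way, is not documented here and is an assumption. The function name `misra_gries` and the parameter `k` are hypothetical.

```python
def misra_gries(stream, k):
    """Weighted Misra-Gries summary that keeps at most k - 1 counters.

    `stream` yields (item, weight) pairs with positive weights.
    Each stored estimate undercounts the true weighted frequency
    by at most (total weight) / k.
    """
    counters = {}
    for item, weight in stream:
        if item in counters:
            counters[item] += weight
        elif len(counters) < k - 1:
            counters[item] = weight
        else:
            # No free counter: subtract the same amount from every counter
            # and from the incoming weight, then drop counters that hit zero.
            dec = min(weight, min(counters.values()))
            for key in list(counters):
                counters[key] -= dec
                if counters[key] <= 0:
                    del counters[key]
            if weight > dec:
                # A counter was freed, so the remainder of the weight fits.
                counters[item] = weight - dec
    return counters


# (sensor_id, reading_count) pairs taken from the sample dataset above.
readings = [
    ("SENSOR_A-001", 1500),
    ("SENSOR_B-734", 50),
    ("SENSOR_C-109", 800),
    ("SENSOR_A-002", 2000),
    ("SENSOR_D-500", 25),
    ("SENSOR_B-735", 1200),
]

# k = 3 keeps at most two counters, no matter how many sensors appear.
summary = misra_gries(readings, k=3)
print(summary)
```

The summary never holds more than `k - 1` entries, which is the property that matters when the number of distinct sensor IDs is far larger than the configured limit. The estimates are lower bounds, so they suit spotting heavy hitters rather than exact counting.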

Expected Result


The action should complete successfully even when the cardinality of 'sensor_id' exceeds 'distinctCountLimit', by leveraging the Misra-Gries algorithm. The 'MissingCounts' table in the output should show counts and percentages weighted by the 'reading_count' variable. The pattern in which both 'temperature' and 'pressure' are missing (representing total sensor failure) should have a weighted count of 25, contributed by the single SENSOR_D-500 row.
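The expected weighted counts can be checked by hand. The following Python sketch tallies missing-value patterns for 'temperature' and 'pressure' from the sample rows, weighting each row by 'reading_count'. It is an independent cross-check, not a reproduction of the action's output format; the helper name `weighted_missing_patterns` is hypothetical, and 'sensor_id' and 'location' are left out of the pattern because they have no missing values in the sample.

```python
from collections import Counter

# (sensor_id, location, temperature, pressure, reading_count);
# None stands for a missing value ('.' in the SAS data step).
rows = [
    ("SENSOR_A-001", "Assembly1", 25.5, 101.2, 1500),
    ("SENSOR_B-734", "Assembly1", None, 101.5, 50),
    ("SENSOR_C-109", "Painting2", 30.1, None, 800),
    ("SENSOR_A-002", "Assembly1", 25.6, 101.3, 2000),
    ("SENSOR_D-500", "Painting2", None, None, 25),
    ("SENSOR_B-735", "Assembly1", 28.2, 101.9, 1200),
]


def weighted_missing_patterns(rows):
    """Sum reading_count per (temperature missing, pressure missing) pattern."""
    counts = Counter()
    for _sensor_id, _location, temperature, pressure, reading_count in rows:
        pattern = (temperature is None, pressure is None)
        counts[pattern] += reading_count
    return counts


patterns = weighted_missing_patterns(rows)
total = sum(patterns.values())
for pattern, count in sorted(patterns.items()):
    print(pattern, count, f"{100 * count / total:.2f}%")
```

The pattern `(True, True)`, where both readings are missing, receives a weight of 25, which matches the expected result stated above.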