Fixed-Size Input Generation for Edge AI Keyword Spotting

Business Context

The goal is to deploy a lightweight wake-word detection model (e.g., 'Hello SAS') on edge devices. The Convolutional Neural Network (CNN) requires a strictly fixed input size of 50 frames. Because input audio clips vary in duration, short clips must be padded and long clips truncated to exactly 50 frames.

Data Preparation

Create the 'COMMAND_INPUTS' table, which holds audio clips of inconsistent lengths (some short, some long).

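DATA casuser.command_inputs;
  LENGTH cmd_label $10 audio_clip $3000;
  /* TRUNCOVER keeps one row per data line; audio_clip is not supplied in this placeholder data */
  INFILE DATALINES TRUNCOVER;
  INPUT cmd_label $ audio_clip $;
  DATALINES;
SHORT
LONG
;
RUN;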

Implementation Steps

Extract features with a Hamming window, restrict the mel filter bank frequency range (lowFreq=20, highFreq=4000), and force exactly 50 output frames.

PROC CAS;
  audio.computeFeatures /
    TABLE={name='command_inputs', caslib='casuser'}
    audioColumn='audio_clip'
    frameExtractionOptions={windowType='HAMMING', frameLength=25, frameShift=10}
    melBanksOptions={lowFreq=20, highFreq=4000}
    nOutputFrames=50
    casOut={name='fixed_size_tensor', caslib='casuser', replace=true};
RUN;
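
As a rough sanity check on what nOutputFrames=50 implies, the sketch below estimates how many frames a clip would naturally produce and whether it would be padded or truncated. It assumes that frameLength and frameShift are expressed in milliseconds and that only frames fitting entirely inside the clip are counted; neither convention is stated in this example, and the two clip durations (300 ms and 1200 ms) are hypothetical.

```sas
/* Illustration only: this is not how computeFeatures is implemented. */
DATA _NULL_;
  LENGTH action $8;
  frame_length = 25;   /* assumed ms, mirrors frameLength above */
  frame_shift  = 10;   /* assumed ms, mirrors frameShift above  */
  target       = 50;   /* mirrors nOutputFrames above           */
  DO duration_ms = 300, 1200;   /* hypothetical clip durations */
    /* Frames that fit entirely inside the clip (assumed convention) */
    n_frames = FLOOR((duration_ms - frame_length) / frame_shift) + 1;
    IF n_frames < target THEN action = 'pad';
    ELSE IF n_frames > target THEN action = 'truncate';
    ELSE action = 'keep';
    delta = ABS(target - n_frames);   /* frames added or dropped */
    PUT duration_ms= n_frames= action= delta=;
  END;
RUN;
```

Under these assumptions, 50 frames cover 25 + 49 × 10 = 515 ms of audio, so clips noticeably shorter than that would be padded and longer ones truncated.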

Expected Result

The output table 'fixed_size_tensor' must contain exactly 50 frames for every input row. Short recordings are zero-padded, and long recordings are truncated to meet the 50-frame requirement.
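
One way to check this is to inspect the output table after the action has run. The sketch below uses the general-purpose table.recordCount and table.columnInfo actions. How computeFeatures lays out the 50 frames across the output columns is not described here, so treat the column listing as something to inspect rather than a fixed expectation.

```sas
PROC CAS;
  /* Number of rows in the output table */
  table.recordCount /
    table={name='fixed_size_tensor', caslib='casuser'};
  /* Column names and types, to confirm every row shares one fixed layout */
  table.columnInfo /
    table={name='fixed_size_tensor', caslib='casuser'};
RUN;
```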

See the technical documentation for computeFeatures