audio computeFeatures

Fixed-Size Input Generation for Edge AI Keyword Spotting

Test Scenario & Use Case

Business Context

The goal is to deploy a lightweight wake-word detection model (e.g., 'Hello SAS') on edge devices. The Convolutional Neural Network (CNN) requires a fixed input size of exactly 50 frames. Because input audio clips vary in duration, short clips must be padded and long clips truncated to exactly 50 frames.

Data Preparation

Create the 'COMMAND_INPUTS' table with audio clips of inconsistent lengths (some short, some long).

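/* Assumes an active CAS session with the casuser libref assigned. */
DATA casuser.command_inputs;
   LENGTH cmd_label $10 audio_clip $3000;
   /* TRUNCOVER keeps one row per data line; the audio_clip values are not shown in the source. */
   INFILE DATALINES TRUNCOVER;
   INPUT cmd_label $ audio_clip $;
   DATALINES;
SHORT
LONG
;
RUN;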

Implementation Steps

1. Extract features with a Hamming window, restrict the mel filter-bank frequency range (lowFreq=20, highFreq=4000), and force exactly 50 output frames.
PROC CAS;
   /* Action parameters must be comma-separated in CASL. */
   audio.computeFeatures /
      TABLE={name='command_inputs', caslib='casuser'},
      audioColumn='audio_clip',
      frameExtractionOptions={windowType='HAMMING', frameLength=25, frameShift=10},
      melBanksOptions={lowFreq=20, highFreq=4000},
      /* Pad or truncate every clip to exactly 50 frames. */
      nOutputFrames=50,
      casOut={name='fixed_size_tensor', caslib='casuser', replace=true};
RUN;
QUIT;
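
To see what nOutputFrames=50 implies for clip duration, the arithmetic can be sketched in Python. This is an illustration only, not part of the audio action set. It assumes that frameLength and frameShift are expressed in milliseconds and that only full windows are counted (no edge padding). The helper names frames_in_clip and span_of_frames are hypothetical.

```python
def frames_in_clip(duration_ms: int,
                   frame_length_ms: int = 25,
                   frame_shift_ms: int = 10) -> int:
    """Count the full analysis windows that fit in a clip.

    Assumed convention: a frame is produced only when a complete window
    fits, so clips shorter than one window yield zero frames.
    """
    if duration_ms < frame_length_ms:
        return 0
    return 1 + (duration_ms - frame_length_ms) // frame_shift_ms


def span_of_frames(n_frames: int,
                   frame_length_ms: int = 25,
                   frame_shift_ms: int = 10) -> int:
    """Duration in ms covered by n_frames overlapping windows."""
    if n_frames <= 0:
        return 0
    return frame_length_ms + (n_frames - 1) * frame_shift_ms


if __name__ == "__main__":
    print(span_of_frames(50))    # 515
    print(frames_in_clip(300))   # 28
    print(frames_in_clip(1000))  # 98
```

Under these assumptions, 50 frames span 515 ms of audio. A 300 ms clip would yield 28 frames and need padding, while a 1-second clip would yield 98 frames and be truncated.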

Expected Result


The output table 'fixed_size_tensor' must contain exactly 50 frames for every input row. Short recordings are zero-padded, and long recordings are truncated to meet the 50-frame requirement.
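
The pad-or-truncate rule described above can be illustrated with a minimal Python sketch. It is not the action's implementation. It assumes that zero frames are appended at the end of short clips and that long clips keep their first 50 frames, which the scenario does not specify. The function name fix_frame_count and the feature width of 40 are hypothetical.

```python
from typing import List

Frame = List[float]


def fix_frame_count(frames: List[Frame], n_output_frames: int = 50) -> List[Frame]:
    """Return exactly n_output_frames frames.

    Assumption: extra frames are dropped from the end, and missing
    frames are filled with all-zero frames appended at the end.
    """
    if n_output_frames < 0:
        raise ValueError("n_output_frames must be non-negative")
    if not frames:
        raise ValueError("at least one frame is needed to infer the feature width")
    width = len(frames[0])
    # Truncate: keep at most the first n_output_frames frames.
    kept = [list(frame) for frame in frames[:n_output_frames]]
    # Pad: append zero frames until the target count is reached.
    padding = [[0.0] * width for _ in range(n_output_frames - len(kept))]
    return kept + padding


if __name__ == "__main__":
    short_clip = [[1.0] * 40 for _ in range(28)]  # hypothetical 28-frame clip
    long_clip = [[2.0] * 40 for _ in range(98)]   # hypothetical 98-frame clip
    print(len(fix_frame_count(short_clip)))  # 50
    print(len(fix_frame_count(long_clip)))   # 50
```

A check of the same kind on 'fixed_size_tensor' would confirm that every input row yields 50 frames regardless of the original clip length.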