aggregation

Aggregate Action CAS: Master SAS Viya Data Summarization

aggregate

L'essentiel
At a glance
For data scientists and analysts dealing with high-cardinality datasets, the aggregate action serves as a robust engine for manipulating data granularity. By efficiently computing descriptive statistics across user-defined groups or sliding time intervals, this tool streamlines the transition from granular event logs to actionable time-series data. It acts as a fundamental component of the data wrangling pipeline, ensuring that datasets are properly summarized before feeding into visualization tools or machine learning algorithms. We have compiled a detailed Q&A below to guide you through parameter configuration, memory management, and advanced grouping strategies for this specific tool.
Doc Officielle
Aggregation Action Set: Syntax

Provides actions for aggregating the values of one or more variables

Consulter

Description

The aggregate action performs aggregation on selected variables. It provides a wide range of summary statistics, including sum, mean, standard deviation, and more. This action is highly flexible, allowing for grouping by one or more variables, handling of time-series data with specified intervals, and creating customized output tables. It is a powerful tool for data summarization and preparation for further analysis.

aggregation.aggregate result=results status=rc / align="BEGINNING" | "ENDING" | "MIDDLE", bin={double-1, double-2, ...}, casOut={...}, contribute="variable-name", contributeColumnLabel="string", contributeColumnName="string", contributeDelimiter="string", contributeTrim=TRUE | FALSE, contributeUnroll=TRUE | FALSE, copyVars={"variable-name-1", "variable-name-2", ...}, doESP=TRUE | FALSE, edgeId="variable-name", exclnpwgt=TRUE | FALSE, excludeSelf=TRUE | FALSE, freq="variable-name", freqStrict=TRUE | FALSE, groupByLimit=64-bit-integer, groupedIntervalOutput=TRUE | FALSE, id="variable-name", idEnd=double, idOutputName="string", idRange={double-1, double-2, ...}, idStart=double, includeEmptyInterval=TRUE | FALSE, includeMissing=TRUE | FALSE, inputs={{...}, ...}, interval="string", jumpingWindow=TRUE | FALSE, keepRecord=TRUE | FALSE, keepRecordId=TRUE | FALSE, modeSingle=TRUE | FALSE, offset=integer, partKey={"string-1", "string-2", ...}, pctlDef=integer, pti=double, ptw=double, raw=TRUE | FALSE, saveGroupbyFormat=TRUE | FALSE, saveGroupbyRaw=TRUE | FALSE, saveVariableColumn=TRUE | FALSE, saveVariableSpecification=TRUE | FALSE, subBinOffset=double, subBinWidth=double, subInterval="string", table={...}, varSpecs={{...}, ...}, weight="variable-name", windowBin={double-1, double-2, ...}, windowInt="string", windowOffset=integer, windowSubBinOffset=double, windowSubBinWidth=double, windowSubInt="string";
Settings
ParameterDescription
align Specifies the alignment of the representative value with respect to an interval or bin.
bin Specifies the minimum and maximum values of a bin.
casOut Specifies the settings for an output table.
contribute When the doESP parameter is True, you can specify a variable whose values within each aggregation interval are recorded.
contributeColumnLabel Specifies a value to override the variable label from the contribute variable.
contributeColumnName Specifies a value to override the variable name from the contribute variable.
contributeDelimiter Specifies a delimiter that is used between concatenated values of the contribute variable.
contributeTrim When set to True, leading and trailing blanks are removed from the formatted value of the contribute variable.
contributeUnroll When set to True, each raw value from the contribute variable adds a row to the result table.
copyVars Specifies the variables to copy from the input table to the output table.
doESP When set to True, the action can take advantage of partitioning and ordering on the input table.
edgeId Specifies a numeric variable whose values are used to order the values of each varSpecs specification that uses the FIRST, LAST, FNE, or LNE aggregator.
exclnpwgt When set to True and a weight variable is specified, then observations with a non-positive weight value are excluded from the analysis.
excludeSelf When set to True and the doESP parameter is True, the aggregation excludes the current observation's contribution.
freq Specifies a numeric variable whose values are used as the frequency of analysis variable values.
freqStrict This parameter is related to using the MODE aggregator. By default, observations with frequency values that are missing or less than 1 are excluded.
groupByLimit Specifies the maximum number of levels in a group-by set.
groupedIntervalOutput When set to True, save only one of the same aggregated intervals with respect to the last Id value.
id Specifies a numeric variable that identifies the timestamp that is associated with each observation in the input table.
idEnd Specifies the inclusive maximum value of the ID variable to be considered in the analysis.
idOutputName Specifies the new name of the ID variable in the output table.
idRange Specifies the inclusive minimum and maximum values of the ID variable to be considered in the analysis.
idStart Specifies the inclusive minimum value of the ID variable to be considered in the analysis.
includeEmptyInterval By default, intervals with a missing value for the ID variable are included in the output. When set to False, these intervals are excluded.
includeMissing By default, missing values are included in the analysis. When set to False, observations with missing values are excluded.
inputs Specifies the input variables to use in the analysis. For raw numeric variables, the default aggregator is SUMMARY. For all other situations, the default aggregator is N.
interval Specifies the time period for the accumulation of observations.
jumpingWindow When set to True, specifies that aggregation occurs over a time window that can contain multiple intervals and aggregation is reset when the specified time range elapses.
keepRecord When set to True, each observation's original value for the ID variable is kept without performing interval alignment.
keepRecordId When set to True and the doESP parameter is True, each observation's original ID value is kept without performing interval alignment.
modeSingle This parameter is related to using the MODE aggregator. By default, the most frequent value is 'missing' if all the distinct values have a frequency of 1. When set to True, the minimum of these distinct values is returned.
offset Specifies the offset of each interval.
partKey When the table is partitioned, you can specify a partition key so that the results are computed for the partition only.
pctlDef Specifies how to compute quantile statistics (percentiles).
pti Specifies the time value when the aggregation within an interval or a bin is terminated.
ptw Specifies the subinterval with respect to each window interval.
raw When set to True, raw values of the variables in the input parameter are used.
saveGroupbyFormat By default, the formatted values of the groupBy variables from the input table are copied to the results. When set to False, the formatted values are not copied.
saveGroupbyRaw By default, the raw values of the groupBy variables from the input table are copied to the results. When set to False, the raw values are not copied.
saveVariableColumn By default, the variable name for each analysis variable is included in the results. When set to False, this column is not included.
saveVariableSpecification By default, the results include a column that is named 'Variable Specification' to identify the varSpecs specification that produced the result. When set to False, this column is not included.
subBinOffset Specifies an offset from the beginning of a bin.
subBinWidth Specifies the width of the sub bin within a bin.
subInterval Specifies a smaller interval to control the time period alignment within each interval for the aggregation of observations.
table Specifies the input CAS table to be used by the action.
varSpecs Specifies the variable to aggregate and the settings for the aggregator.
weight Specifies a numeric variable whose values are used as the weight of numeric analysis variable values when the aggregator is SUMMARY.
windowBin Specifies the minimum and maximum values of a window bin.
windowInt Specifies the time window for the accumulation of observations with respect to each time interval.
windowOffset Specifies the offset of each window interval.
windowSubBinOffset Specifies the starting point within a window bin in which record values are aggregated.
windowSubBinWidth Specifies the width of the sub bin within each windowBin.
windowSubInt Specifies a smaller interval to control the sub time period alignment within each window interval for the aggregation of observations.
Data Preparation View data prep sheet
Creating the 'cars' dataset

The following code creates a sample CAS table named 'cars' which contains information about various car models, including their make, model, type, origin, horsepower, and MSRP. This table will be used in the subsequent examples to demonstrate the capabilities of the 'aggregate' action.

Copied!
1DATA casuser.cars;
2 LENGTH Make $ 13 Model $ 35 Type $ 7 Origin $ 6 DriveTrain $ 5;
3 INFILE DATALINES delimiter=',';
4 INPUT Make $ Model $ Type $ Origin $ DriveTrain $ MSRP Horsepower;
5 DATALINES;
6Acura,MDX,SUV,Asia,All,36945,265
7Acura,RSX Type S 2dr,Sedan,Asia,Front,23820,200
8Acura,TSX 4dr,Sedan,Asia,Front,26990,200
9Acura,TL 4dr,Sedan,Asia,Front,33195,270
10Acura,3.5 RL 4dr,Sedan,Asia,Front,43755,225
11Acura,3.5 RL w/Navigation 4dr,Sedan,Asia,Front,46100,225
12Audi,A4 1.8T 4dr,Sedan,Europe,Front,25940,170
13Audi,A41.8T convertible 2dr,Sedan,Europe,Front,35940,170
14Audi,A4 3.0 4dr,Sedan,Europe,Front,31840,220
15Audi,A4 3.0 Quattro 4dr,Sedan,Europe,All,33430,220
16;
17RUN;

Examples

This example demonstrates a basic aggregation on the 'cars' table. It calculates summary statistics (minimum, maximum, and mean) for the 'Horsepower' and 'MSRP' variables across all records in the table. The results are stored in a new CAS table named 'cars_aggregated'.

SAS® / CAS Code Code awaiting community validation
Copied!
1PROC CAS;
2 aggregation.aggregate /
3 TABLE={name='cars'},
4 varSpecs={{name='Horsepower', agg='min max mean'},
5 {name='MSRP', agg='min max mean'}},
6 casOut={name='cars_aggregated', replace=true};
7RUN;
Result :
The output table 'cars_aggregated' will contain the calculated minimum, maximum, and mean values for the specified variables, providing a quick summary of the data.

This example performs a more detailed aggregation by grouping the data based on the 'Origin' of the cars. It calculates the summary statistics for 'Horsepower' and 'MSRP' for each origin group (Asia and Europe). This allows for a comparative analysis of vehicle characteristics based on their manufacturing region.

SAS® / CAS Code Code awaiting community validation
Copied!
1PROC CAS;
2 aggregation.aggregate /
3 TABLE={name='cars', groupBy={'Origin'}},
4 varSpecs={{name='Horsepower', agg='min max mean'},
5 {name='MSRP', agg='min max mean'}},
6 casOut={name='cars_aggregated_by_origin', replace=true};
7RUN;
Result :
The output table 'cars_aggregated_by_origin' will contain separate rows for each car 'Origin', with columns for the minimum, maximum, and mean of 'Horsepower' and 'MSRP' for that group.

FAQ

What is the primary purpose of the 'aggregation.aggregate' action?
What is the main required parameter for the 'aggregate' action?
How can I specify which variables to aggregate and which statistics to compute?
Is it possible to group the aggregation results by certain categories?
How can I perform time-based aggregation, for example, by month or day?
What is the difference between a sliding window and a jumping window aggregation?

Associated Scenarios

Use Case
Monthly Customer Transaction Analysis

A retail bank needs to generate a monthly summary of customer transactions. The goal is to calculate the total transaction amount, the average transaction value, and the number ...

Use Case
High-Volume IoT Sensor Data Aggregation for Predictive Maintenance

A manufacturing plant monitors thousands of machines using IoT sensors that stream data every second. To prevent failures, they need to analyze this high-volume data in near rea...

Use Case
Retail Sales Analysis with Sliding Window and Missing Data

A national retail chain wants to analyze daily sales performance using a 3-day sliding window to smooth out daily fluctuations. The analysis must handle days where sales data mi...