Aggregate Action CAS: Master SAS Viya Data Summarization

Q: What is the primary purpose of the 'aggregation.aggregate' action?

The 'aggregation.aggregate' action is used to perform aggregation on selected variables within a CAS table. It computes summary statistics for these variables, optionally grouping the results by other variables.

Q: What is the main required parameter for the 'aggregate' action?

The 'table' parameter is required. It specifies the input CAS table that you want to process and perform aggregations on.

Q: How can I specify which variables to aggregate and which statistics to compute?

You can use the 'varSpecs' parameter. It is a list of specifications where each element defines an aggregation. For each specification, you can set the variable name with the 'name' subparameter and the aggregation method with the 'agg' subparameter (e.g., 'SUMMARY', 'N', 'MEAN', 'MAXIMUM').

Q: Is it possible to group the aggregation results by certain categories?

Yes, you can group results by using the 'groupBy' subparameter within the main 'table' parameter. You provide a list of variable names that the action will use to group the data before performing the aggregation.

Q: How can I perform time-based aggregation, for example, by month or day?

To perform time-based aggregation, you must specify a timestamp variable using the 'id' parameter. Then, use the 'interval' parameter to define the time period for accumulation, such as 'MONTH', 'DAY', or 'HOUR'. Parameters like 'align' and 'offset' offer further control over the time intervals.

Q: What is the difference between a sliding window and a jumping window aggregation?

By default, the action uses a sliding window. A jumping (or tumbling) window can be enabled by setting the 'jumpingWindow' parameter to TRUE. A jumping window contains multiple intervals and the aggregation is reset when the time range elapses, whereas a sliding window always retains the same number of intervals by shifting one interval at a time.

At a glance

For data scientists and analysts dealing with high-cardinality datasets, the aggregate action serves as a robust engine for manipulating data granularity. By efficiently computing descriptive statistics across user-defined groups or sliding time intervals, this tool streamlines the transition from granular event logs to actionable time-series data. It acts as a fundamental component of the data wrangling pipeline, ensuring that datasets are properly summarized before feeding into visualization tools or machine learning algorithms. We have compiled a detailed Q&A below to guide you through parameter configuration, memory management, and advanced grouping strategies for this specific tool.

Doc Officielle

Aggregation Action Set: Syntax

Provides actions for aggregating the values of one or more variables

Consulter

Codes SAS Liés

From Raw CSV to Formatted Table: The Efficient Way to Load Local Data to CAS

Example 4 for PROC GENMOD

Description

The aggregate action performs aggregation on selected variables. It provides a wide range of summary statistics, including sum, mean, standard deviation, and more. This action is highly flexible, allowing for grouping by one or more variables, handling of time-series data with specified intervals, and creating customized output tables. It is a powerful tool for data summarization and preparation for further analysis.

aggregation.aggregate result=results status=rc / align="BEGINNING" | "ENDING" | "MIDDLE", bin={double-1, double-2, ...}, casOut={...}, contribute="variable-name", contributeColumnLabel="string", contributeColumnName="string", contributeDelimiter="string", contributeTrim=TRUE | FALSE, contributeUnroll=TRUE | FALSE, copyVars={"variable-name-1", "variable-name-2", ...}, doESP=TRUE | FALSE, edgeId="variable-name", exclnpwgt=TRUE | FALSE, excludeSelf=TRUE | FALSE, freq="variable-name", freqStrict=TRUE | FALSE, groupByLimit=64-bit-integer, groupedIntervalOutput=TRUE | FALSE, id="variable-name", idEnd=double, idOutputName="string", idRange={double-1, double-2, ...}, idStart=double, includeEmptyInterval=TRUE | FALSE, includeMissing=TRUE | FALSE, inputs={{...}, ...}, interval="string", jumpingWindow=TRUE | FALSE, keepRecord=TRUE | FALSE, keepRecordId=TRUE | FALSE, modeSingle=TRUE | FALSE, offset=integer, partKey={"string-1", "string-2", ...}, pctlDef=integer, pti=double, ptw=double, raw=TRUE | FALSE, saveGroupbyFormat=TRUE | FALSE, saveGroupbyRaw=TRUE | FALSE, saveVariableColumn=TRUE | FALSE, saveVariableSpecification=TRUE | FALSE, subBinOffset=double, subBinWidth=double, subInterval="string", table={...}, varSpecs={{...}, ...}, weight="variable-name", windowBin={double-1, double-2, ...}, windowInt="string", windowOffset=integer, windowSubBinOffset=double, windowSubBinWidth=double, windowSubInt="string";

Settings

Parameter	Description
align	Specifies the alignment of the representative value with respect to an interval or bin.
bin	Specifies the minimum and maximum values of a bin.
casOut	Specifies the settings for an output table.
contribute	When the doESP parameter is True, you can specify a variable whose values within each aggregation interval are recorded.
contributeColumnLabel	Specifies a value to override the variable label from the contribute variable.
contributeColumnName	Specifies a value to override the variable name from the contribute variable.
contributeDelimiter	Specifies a delimiter that is used between concatenated values of the contribute variable.
contributeTrim	When set to True, leading and trailing blanks are removed from the formatted value of the contribute variable.
contributeUnroll	When set to True, each raw value from the contribute variable adds a row to the result table.
copyVars	Specifies the variables to copy from the input table to the output table.
doESP	When set to True, the action can take advantage of partitioning and ordering on the input table.
edgeId	Specifies a numeric variable whose values are used to order the values of each varSpecs specification that uses the FIRST, LAST, FNE, or LNE aggregator.
exclnpwgt	When set to True and a weight variable is specified, then observations with a non-positive weight value are excluded from the analysis.
excludeSelf	When set to True and the doESP parameter is True, the aggregation excludes the current observation's contribution.
freq	Specifies a numeric variable whose values are used as the frequency of analysis variable values.
freqStrict	This parameter is related to using the MODE aggregator. By default, observations with frequency values that are missing or less than 1 are excluded.
groupByLimit	Specifies the maximum number of levels in a group-by set.
groupedIntervalOutput	When set to True, save only one of the same aggregated intervals with respect to the last Id value.
id	Specifies a numeric variable that identifies the timestamp that is associated with each observation in the input table.
idEnd	Specifies the inclusive maximum value of the ID variable to be considered in the analysis.
idOutputName	Specifies the new name of the ID variable in the output table.
idRange	Specifies the inclusive minimum and maximum values of the ID variable to be considered in the analysis.
idStart	Specifies the inclusive minimum value of the ID variable to be considered in the analysis.
includeEmptyInterval	By default, intervals with a missing value for the ID variable are included in the output. When set to False, these intervals are excluded.
includeMissing	By default, missing values are included in the analysis. When set to False, observations with missing values are excluded.
inputs	Specifies the input variables to use in the analysis. For raw numeric variables, the default aggregator is SUMMARY. For all other situations, the default aggregator is N.
interval	Specifies the time period for the accumulation of observations.
jumpingWindow	When set to True, specifies that aggregation occurs over a time window that can contain multiple intervals and aggregation is reset when the specified time range elapses.
keepRecord	When set to True, each observation's original value for the ID variable is kept without performing interval alignment.
keepRecordId	When set to True and the doESP parameter is True, each observation's original ID value is kept without performing interval alignment.
modeSingle	This parameter is related to using the MODE aggregator. By default, the most frequent value is 'missing' if all the distinct values have a frequency of 1. When set to True, the minimum of these distinct values is returned.
offset	Specifies the offset of each interval.
partKey	When the table is partitioned, you can specify a partition key so that the results are computed for the partition only.
pctlDef	Specifies how to compute quantile statistics (percentiles).
pti	Specifies the time value when the aggregation within an interval or a bin is terminated.
ptw	Specifies the subinterval with respect to each window interval.
raw	When set to True, raw values of the variables in the input parameter are used.
saveGroupbyFormat	By default, the formatted values of the groupBy variables from the input table are copied to the results. When set to False, the formatted values are not copied.
saveGroupbyRaw	By default, the raw values of the groupBy variables from the input table are copied to the results. When set to False, the raw values are not copied.
saveVariableColumn	By default, the variable name for each analysis variable is included in the results. When set to False, this column is not included.
saveVariableSpecification	By default, the results include a column that is named 'Variable Specification' to identify the varSpecs specification that produced the result. When set to False, this column is not included.
subBinOffset	Specifies an offset from the beginning of a bin.
subBinWidth	Specifies the width of the sub bin within a bin.
subInterval	Specifies a smaller interval to control the time period alignment within each interval for the aggregation of observations.
table	Specifies the input CAS table to be used by the action.
varSpecs	Specifies the variable to aggregate and the settings for the aggregator.
weight	Specifies a numeric variable whose values are used as the weight of numeric analysis variable values when the aggregator is SUMMARY.
windowBin	Specifies the minimum and maximum values of a window bin.
windowInt	Specifies the time window for the accumulation of observations with respect to each time interval.
windowOffset	Specifies the offset of each window interval.
windowSubBinOffset	Specifies the starting point within a window bin in which record values are aggregated.
windowSubBinWidth	Specifies the width of the sub bin within each windowBin.
windowSubInt	Specifies a smaller interval to control the sub time period alignment within each window interval for the aggregation of observations.

Data Preparation View data prep sheet

Creating the 'cars' dataset

The following code creates a sample CAS table named 'cars' which contains information about various car models, including their make, model, type, origin, horsepower, and MSRP. This table will be used in the subsequent examples to demonstrate the capabilities of the 'aggregate' action.

Copied!

1	DATA casuser.cars;
2	LENGTH Make $ 13 Model $ 35 Type $ 7 Origin $ 6 DriveTrain $ 5;
3	INFILE DATALINES delimiter=',';
4	INPUT Make $ Model $ Type $ Origin $ DriveTrain $ MSRP Horsepower;
5	DATALINES;
6	Acura,MDX,SUV,Asia,All,36945,265
7	Acura,RSX Type S 2dr,Sedan,Asia,Front,23820,200
8	Acura,TSX 4dr,Sedan,Asia,Front,26990,200
9	Acura,TL 4dr,Sedan,Asia,Front,33195,270
10	Acura,3.5 RL 4dr,Sedan,Asia,Front,43755,225
11	Acura,3.5 RL w/Navigation 4dr,Sedan,Asia,Front,46100,225
12	Audi,A4 1.8T 4dr,Sedan,Europe,Front,25940,170
13	Audi,A41.8T convertible 2dr,Sedan,Europe,Front,35940,170
14	Audi,A4 3.0 4dr,Sedan,Europe,Front,31840,220
15	Audi,A4 3.0 Quattro 4dr,Sedan,Europe,All,33430,220
16	;
17	RUN;

Examples

This example demonstrates a basic aggregation on the 'cars' table. It calculates summary statistics (minimum, maximum, and mean) for the 'Horsepower' and 'MSRP' variables across all records in the table. The results are stored in a new CAS table named 'cars_aggregated'.

SAS® / CAS Code Code awaiting community validation

Copied!

1	PROC CAS;
2	aggregation.aggregate /
3	TABLE={name='cars'},
4	varSpecs={{name='Horsepower', agg='min max mean'},
5	{name='MSRP', agg='min max mean'}},
6	casOut={name='cars_aggregated', replace=true};
7	RUN;

Result :
The output table 'cars_aggregated' will contain the calculated minimum, maximum, and mean values for the specified variables, providing a quick summary of the data.

This example performs a more detailed aggregation by grouping the data based on the 'Origin' of the cars. It calculates the summary statistics for 'Horsepower' and 'MSRP' for each origin group (Asia and Europe). This allows for a comparative analysis of vehicle characteristics based on their manufacturing region.

SAS® / CAS Code Code awaiting community validation

Copied!

1	PROC CAS;
2	aggregation.aggregate /
3	TABLE={name='cars', groupBy={'Origin'}},
4	varSpecs={{name='Horsepower', agg='min max mean'},
5	{name='MSRP', agg='min max mean'}},
6	casOut={name='cars_aggregated_by_origin', replace=true};
7	RUN;

Result :
The output table 'cars_aggregated_by_origin' will contain separate rows for each car 'Origin', with columns for the minimum, maximum, and mean of 'Horsepower' and 'MSRP' for that group.

FAQ

What is the primary purpose of the 'aggregation.aggregate' action?

What is the main required parameter for the 'aggregate' action?

How can I specify which variables to aggregate and which statistics to compute?

Is it possible to group the aggregation results by certain categories?

How can I perform time-based aggregation, for example, by month or day?

What is the difference between a sliding window and a jumping window aggregation?

Associated Scenarios

Use Case

Monthly Customer Transaction Analysis

A retail bank needs to generate a monthly summary of customer transactions. The goal is to calculate the total transaction amount, the average transaction value, and the number ...

View scenario

Use Case

High-Volume IoT Sensor Data Aggregation for Predictive Maintenance

A manufacturing plant monitors thousands of machines using IoT sensors that stream data every second. To prevent failures, they need to analyze this high-volume data in near rea...

View scenario

Use Case

Retail Sales Analysis with Sliding Window and Missing Data

A national retail chain wants to analyze daily sales performance using a 3-day sliding window to smooth out daily fluctuations. The analysis must handle days where sales data mi...

View scenario

Table of Contents

At a glance

Doc Officielle

From Raw CSV to Formatted Table: The Efficient Way to Load Local Data to CAS

Example 4 for PROC GENMOD

Description

Creating the 'cars' dataset

Examples

Simple Aggregation of Car Data

Detailed Aggregation by Car Origin

FAQ

Associated Scenarios

Use Case

Monthly Customer Transaction Analysis

Use Case

High-Volume IoT Sensor Data Aggregation for Predictive Maintenance

Use Case

Retail Sales Analysis with Sliding Window and Missing Data