aggregation

aggregate

Description

The aggregate action performs aggregation on selected variables. It provides a wide range of summary statistics, including sum, mean, standard deviation, and more. This action is highly flexible, allowing for grouping by one or more variables, handling of time-series data with specified intervals, and creating customized output tables. It is a powerful tool for data summarization and preparation for further analysis.

aggregation.aggregate result=results status=rc / align="BEGINNING" | "ENDING" | "MIDDLE", bin={double-1, double-2, ...}, casOut={...}, contribute="variable-name", contributeColumnLabel="string", contributeColumnName="string", contributeDelimiter="string", contributeTrim=TRUE | FALSE, contributeUnroll=TRUE | FALSE, copyVars={"variable-name-1", "variable-name-2", ...}, doESP=TRUE | FALSE, edgeId="variable-name", exclnpwgt=TRUE | FALSE, excludeSelf=TRUE | FALSE, freq="variable-name", freqStrict=TRUE | FALSE, groupByLimit=64-bit-integer, groupedIntervalOutput=TRUE | FALSE, id="variable-name", idEnd=double, idOutputName="string", idRange={double-1, double-2, ...}, idStart=double, includeEmptyInterval=TRUE | FALSE, includeMissing=TRUE | FALSE, inputs={{...}, ...}, interval="string", jumpingWindow=TRUE | FALSE, keepRecord=TRUE | FALSE, keepRecordId=TRUE | FALSE, modeSingle=TRUE | FALSE, offset=integer, partKey={"string-1", "string-2", ...}, pctlDef=integer, pti=double, ptw=double, raw=TRUE | FALSE, saveGroupbyFormat=TRUE | FALSE, saveGroupbyRaw=TRUE | FALSE, saveVariableColumn=TRUE | FALSE, saveVariableSpecification=TRUE | FALSE, subBinOffset=double, subBinWidth=double, subInterval="string", table={...}, varSpecs={{...}, ...}, weight="variable-name", windowBin={double-1, double-2, ...}, windowInt="string", windowOffset=integer, windowSubBinOffset=double, windowSubBinWidth=double, windowSubInt="string";
Settings
ParameterDescription
alignSpecifies the alignment of the representative value with respect to an interval or bin.
binSpecifies the minimum and maximum values of a bin.
casOutSpecifies the settings for an output table.
contributeWhen the doESP parameter is True, you can specify a variable whose values within each aggregation interval are recorded.
contributeColumnLabelSpecifies a value to override the variable label from the contribute variable.
contributeColumnNameSpecifies a value to override the variable name from the contribute variable.
contributeDelimiterSpecifies a delimiter that is used between concatenated values of the contribute variable.
contributeTrimWhen set to True, leading and trailing blanks are removed from the formatted value of the contribute variable.
contributeUnrollWhen set to True, each raw value from the contribute variable adds a row to the result table.
copyVarsSpecifies the variables to copy from the input table to the output table.
doESPWhen set to True, the action can take advantage of partitioning and ordering on the input table.
edgeIdSpecifies a numeric variable whose values are used to order the values of each varSpecs specification that uses the FIRST, LAST, FNE, or LNE aggregator.
exclnpwgtWhen set to True and a weight variable is specified, then observations with a non-positive weight value are excluded from the analysis.
excludeSelfWhen set to True and the doESP parameter is True, the aggregation excludes the current observation's contribution.
freqSpecifies a numeric variable whose values are used as the frequency of analysis variable values.
freqStrictThis parameter is related to using the MODE aggregator. By default, observations with frequency values that are missing or less than 1 are excluded.
groupByLimitSpecifies the maximum number of levels in a group-by set.
groupedIntervalOutputWhen set to True, save only one of the same aggregated intervals with respect to the last Id value.
idSpecifies a numeric variable that identifies the timestamp that is associated with each observation in the input table.
idEndSpecifies the inclusive maximum value of the ID variable to be considered in the analysis.
idOutputNameSpecifies the new name of the ID variable in the output table.
idRangeSpecifies the inclusive minimum and maximum values of the ID variable to be considered in the analysis.
idStartSpecifies the inclusive minimum value of the ID variable to be considered in the analysis.
includeEmptyIntervalBy default, intervals with a missing value for the ID variable are included in the output. When set to False, these intervals are excluded.
includeMissingBy default, missing values are included in the analysis. When set to False, observations with missing values are excluded.
inputsSpecifies the input variables to use in the analysis. For raw numeric variables, the default aggregator is SUMMARY. For all other situations, the default aggregator is N.
intervalSpecifies the time period for the accumulation of observations.
jumpingWindowWhen set to True, specifies that aggregation occurs over a time window that can contain multiple intervals and aggregation is reset when the specified time range elapses.
keepRecordWhen set to True, each observation's original value for the ID variable is kept without performing interval alignment.
keepRecordIdWhen set to True and the doESP parameter is True, each observation's original ID value is kept without performing interval alignment.
modeSingleThis parameter is related to using the MODE aggregator. By default, the most frequent value is 'missing' if all the distinct values have a frequency of 1. When set to True, the minimum of these distinct values is returned.
offsetSpecifies the offset of each interval.
partKeyWhen the table is partitioned, you can specify a partition key so that the results are computed for the partition only.
pctlDefSpecifies how to compute quantile statistics (percentiles).
ptiSpecifies the time value when the aggregation within an interval or a bin is terminated.
ptwSpecifies the subinterval with respect to each window interval.
rawWhen set to True, raw values of the variables in the input parameter are used.
saveGroupbyFormatBy default, the formatted values of the groupBy variables from the input table are copied to the results. When set to False, the formatted values are not copied.
saveGroupbyRawBy default, the raw values of the groupBy variables from the input table are copied to the results. When set to False, the raw values are not copied.
saveVariableColumnBy default, the variable name for each analysis variable is included in the results. When set to False, this column is not included.
saveVariableSpecificationBy default, the results include a column that is named 'Variable Specification' to identify the varSpecs specification that produced the result. When set to False, this column is not included.
subBinOffsetSpecifies an offset from the beginning of a bin.
subBinWidthSpecifies the width of the sub bin within a bin.
subIntervalSpecifies a smaller interval to control the time period alignment within each interval for the aggregation of observations.
tableSpecifies the input CAS table to be used by the action.
varSpecsSpecifies the variable to aggregate and the settings for the aggregator.
weightSpecifies a numeric variable whose values are used as the weight of numeric analysis variable values when the aggregator is SUMMARY.
windowBinSpecifies the minimum and maximum values of a window bin.
windowIntSpecifies the time window for the accumulation of observations with respect to each time interval.
windowOffsetSpecifies the offset of each window interval.
windowSubBinOffsetSpecifies the starting point within a window bin in which record values are aggregated.
windowSubBinWidthSpecifies the width of the sub bin within each windowBin.
windowSubIntSpecifies a smaller interval to control the sub time period alignment within each window interval for the aggregation of observations.
Data Preparation View data prep sheet
Creating the 'cars' dataset

The following code creates a sample CAS table named 'cars' which contains information about various car models, including their make, model, type, origin, horsepower, and MSRP. This table will be used in the subsequent examples to demonstrate the capabilities of the 'aggregate' action.

Copied!
1DATA casuser.cars;
2 LENGTH Make $ 13 Model $ 35 Type $ 7 Origin $ 6 DriveTrain $ 5;
3 INFILE DATALINES delimiter=',';
4 INPUT Make $ Model $ Type $ Origin $ DriveTrain $ MSRP Horsepower;
5 DATALINES;
6Acura,MDX,SUV,Asia,All,36945,265
7Acura,RSX Type S 2dr,Sedan,Asia,Front,23820,200
8Acura,TSX 4dr,Sedan,Asia,Front,26990,200
9Acura,TL 4dr,Sedan,Asia,Front,33195,270
10Acura,3.5 RL 4dr,Sedan,Asia,Front,43755,225
11Acura,3.5 RL w/Navigation 4dr,Sedan,Asia,Front,46100,225
12Audi,A4 1.8T 4dr,Sedan,Europe,Front,25940,170
13Audi,A41.8T convertible 2dr,Sedan,Europe,Front,35940,170
14Audi,A4 3.0 4dr,Sedan,Europe,Front,31840,220
15Audi,A4 3.0 Quattro 4dr,Sedan,Europe,All,33430,220
16;
17RUN;

Examples

This example demonstrates a basic aggregation on the 'cars' table. It calculates summary statistics (minimum, maximum, and mean) for the 'Horsepower' and 'MSRP' variables across all records in the table. The results are stored in a new CAS table named 'cars_aggregated'.

SAS® / CAS Code Code awaiting community validation
Copied!
1PROC CAS;
2 aggregation.aggregate /
3 TABLE={name='cars'},
4 varSpecs={{name='Horsepower', agg='min max mean'},
5 {name='MSRP', agg='min max mean'}},
6 casOut={name='cars_aggregated', replace=true};
7RUN;
Result :
The output table 'cars_aggregated' will contain the calculated minimum, maximum, and mean values for the specified variables, providing a quick summary of the data.

This example performs a more detailed aggregation by grouping the data based on the 'Origin' of the cars. It calculates the summary statistics for 'Horsepower' and 'MSRP' for each origin group (Asia and Europe). This allows for a comparative analysis of vehicle characteristics based on their manufacturing region.

SAS® / CAS Code Code awaiting community validation
Copied!
1PROC CAS;
2 aggregation.aggregate /
3 TABLE={name='cars', groupBy={'Origin'}},
4 varSpecs={{name='Horsepower', agg='min max mean'},
5 {name='MSRP', agg='min max mean'}},
6 casOut={name='cars_aggregated_by_origin', replace=true};
7RUN;
Result :
The output table 'cars_aggregated_by_origin' will contain separate rows for each car 'Origin', with columns for the minimum, maximum, and mean of 'Horsepower' and 'MSRP' for that group.

FAQ

What is the primary purpose of the 'aggregation.aggregate' action?
What is the main required parameter for the 'aggregate' action?
How can I specify which variables to aggregate and which statistics to compute?
Is it possible to group the aggregation results by certain categories?
How can I perform time-based aggregation, for example, by month or day?
What is the difference between a sliding window and a jumping window aggregation?

Associated Scenarios

Use Case
Monthly Customer Transaction Analysis

A retail bank needs to generate a monthly summary of customer transactions. The goal is to calculate the total transaction amount, the average transaction value, and the number ...

Use Case
High-Volume IoT Sensor Data Aggregation for Predictive Maintenance

A manufacturing plant monitors thousands of machines using IoT sensors that stream data every second. To prevent failures, they need to analyze this high-volume data in near rea...

Use Case
Retail Sales Analysis with Sliding Window and Missing Data

A national retail chain wants to analyze daily sales performance using a 3-day sliding window to smooth out daily fluctuations. The analysis must handle days where sales data mi...