simple

groupByInfo

Description

The groupByInfo action computes the index and frequency of each group, and the index of each record within its group. This action is part of the Simple Analytics Action Set, which provides basic analytic functions.

simple.groupByInfo <result=results> <status=rc> / <algorithm2>=TRUE | FALSE, <attributes>={{format="string", formattedLength=integer, label="string", name="variable-name", nfd=integer, nfl=integer}, {...}}, <casOut>={caslib="string", compress=TRUE | FALSE, indexVars={"variable-name-1" <, "variable-name-2">}, label="string", lifetime=64-bit-integer, maxMemSize=64-bit-integer, memoryFormat="DVR" | "INHERIT" | "STANDARD", name="table-name", promote=TRUE | FALSE, replace=TRUE | FALSE, replication=integer, tableRedistUpPolicy="DEFER" | "NOREDIST" | "REBALANCE", threadBlockSize=64-bit-integer, timeStamp="string", where={"string-1" <, "string-2">}}, <copyVars>={"variable-name-1" <, "variable-name-2">}, <cumFreqName>="variable-name", <details>=TRUE | FALSE, <divider>="string", <frequencyName>="variable-name", <generatedColumns>="ALL" | "CUMFREQ" | "F" | "FREQUENCY" | "GROUPID" | "NONE" | "POSITION", <groupByLimit>=64-bit-integer, <groupbyTable>={casLib="string", dataSourceOptions={adls_noreq-parameters | bigquery-parameters | cas_noreq-parameters | clouddex-parameters | db2-parameters | dnfs-parameters | esp-parameters | fedsvr-parameters | gcs_noreq-parameters | hadoop-parameters | hana-parameters | impala-parameters | informix-parameters | jdbc-parameters | mongodb-parameters | mysql-parameters | odbc-parameters | oracle-parameters | path-parameters | postgres-parameters | redshift-parameters | s3-parameters | sapiq-parameters | sforce-parameters | singlestore_standard-parameters | snowflake-parameters | spark-parameters | spde-parameters | sqlserver-parameters | ss_noreq-parameters | teradata-parameters | vertica-parameters | yellowbrick-parameters}, importOptions={fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DELIMITED" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SOUND" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters}, name="table-name", where="where-expression"}, <groupIDName>="variable-name", <includeDuplicates>=TRUE | FALSE, <includeMissing>=TRUE | FALSE, <inputRegEx>={"string-1" <, "string-2">}, <inputs>={{format="string", formattedLength=integer, label="string", name="variable-name", nfd=integer, nfl=integer}, {...}}, <journalTrace>=TRUE | FALSE, <keyModify>={"string-1" <, "string-2">}, <maxFrequency>=64-bit-integer, <maxPosition>=64-bit-integer, <minFrequency>=64-bit-integer, <minGroupsLL>=64-bit-integer, <minPosition>=64-bit-integer, <noVars>=TRUE | FALSE, <nSubGroupVars>=integer, <position>=64-bit-integer, <positionName>="variable-name", <sparse>=TRUE | FALSE, <subGroupCumFreqName>="variable-name", <subGroupFrequencyName>="variable-name", <subGroupIDName>="variable-name", <subGroupPositionName>="variable-name", table={caslib="string", computedOnDemand=TRUE | FALSE, computedVars={{format="string", formattedLength=integer, label="string", name="variable-name", nfd=integer, nfl=integer}, {...}}, computedVarsProgram="string", dataSourceOptions={key-1=any-list-or-data-type-1 <, key-2=any-list-or-data-type-2>}, groupBy={{format="string", formattedLength=integer, label="string", name="variable-name", nfd=integer, nfl=integer}, {...}}, groupByMode="NOSORT" | "REDISTRIBUTE", importOptions={fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DELIMITED" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SOUND" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters}, name="table-name", orderBy={{format="string", formattedLength=integer, label="string", name="variable-name", nfd=integer, nfl=integer}, {...}}, singlePass=TRUE | FALSE, vars={{format="string", formattedLength=integer, label="string", name="variable-name", nfd=integer, nfl=integer}, {...}}, where="where-expression", whereTable={casLib="string", dataSourceOptions={adls_noreq-parameters | bigquery-parameters | cas_noreq-parameters | clouddex-parameters | db2-parameters | dnfs-parameters | esp-parameters | fedsvr-parameters | gcs_noreq-parameters | hadoop-parameters | hana-parameters | impala-parameters | informix-parameters | jdbc-parameters | mongodb-parameters | mysql-parameters | odbc-parameters | oracle-parameters | path-parameters | postgres-parameters | redshift-parameters | s3-parameters | sapiq-parameters | sforce-parameters | singlestore_standard-parameters | snowflake-parameters | spark-parameters | spde-parameters | sqlserver-parameters | ss_noreq-parameters | teradata-parameters | vertica-parameters | yellowbrick-parameters}, importOptions={fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DELIMITED" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SOUND" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters}, name="table-name", vars={{format="string", formattedLength=integer, label="string", name="variable-name", nfd=integer, nfl=integer}, {...}}, where="where-expression"} ;
Settings
ParameterDescription
algorithm2 Specifies a more efficient algorithm for processing group information. Processing with this algorithm might be faster when the data contains a large number of groups and the grid that you are using has many nodes. When True is specified, the output table specified by casOut= does not contain formatted-value columns for the input variables (input-variable-name_F columns).
attributes Specifies the variable attributes.
casOut Specifies the settings for an output table.
copyVars Specifies the columns to copy from the input table to the casOut output table. Variables specified in the inputs parameter and groupBy columns are automatically copied.
cumFreqName Specifies the column name for the cumulative frequency in the output table. The column shows the cumulative frequency of the group to which the record belongs, with respect to the input table. The default column name is _CumFreq_.
details Specifies that messages such as the number of groups and other details are written to the SAS log.
divider Specifies the character that is used to divide character values of the input variables. Specify a character that is not in these variables.
frequencyName Specifies the column name for the group frequencies in the output table. The default column name is _Frequency_.
generatedColumns Specifies the generated columns to include in the output table. Values can be ALL, CUMFREQ, F, FREQUENCY, GROUPID, NONE, POSITION.
groupByLimit Specifies the maximum number of levels in a group-by set. When the server determines this number of levels, the server stops and does not return a result. Specify this parameter if you want to avoid creating large result sets in group-by operations.
groupbyTable Specifies an input table that contains the groups to use in a group-by analysis.
casLib Specifies the caslib for the filter table. By default, the active caslib is used.
dataSourceOptions Specifies data source options.
importOptions Specifies the settings for reading a table from a data source.
name Specifies the name of the filter table.
where Specifies an expression for subsetting the data from the filter table.
groupIDName Specifies the column name for the group IDs in the output table. The default column name is _GroupID_.
includeDuplicates Specifies that the output table contains duplicate records.
includeMissing Specifies that groups that have a missing value in a groupBy variable are included in the casOut table.
inputRegEx Specifies a list of regular expressions, one for each variable in the inputs list.
inputs Specifies the input variables for the analysis.
journalTrace When set to True, the action writes detailed tracing information to the SAS log.
keyModify Specifies whether you want character values in input variables to be left-justified (L), right-justified (R), converted to uppercase (U), and consecutive blanks condensed to one blank (C).
maxFrequency Specifies that the output table contains records with _Frequency_ less than or equal to this number.
maxPosition Specifies that the output table contains records with _Position_ less than or equal to this number.
minFrequency Specifies that the output table contains records with _Frequency_ greater than or equal to this number.
minGroupsLL Specifies the minimum number of groups that are required to use post-read parallel algorithms.
minPosition Specifies that the output table contains records with _Position_ greater than or equal to this number.
noVars When set to True, no variables are automatically copied from the input table to the output table. By default, all columns that are not in the inputs, groupBy, weight, or frequency parameters are copied.
nSubGroupVars Specifies the number of variables in the inputs varlist whose values determine subgroups.
position Specifies to filter the records for the output table. Only the records that match the specified value in the _Position_ column are included in the output table.
positionName Specifies the column name for the position of the record within its group in the output table. The default column name is _Position_.
sparse When set to True and the action is running on a distributed server, each thread reads data that belongs to only a few groups. By default, each thread reads records from many or all groups. Specify True to improve performance when there are many threads on all workers, many groups, or both. In some cases, sparse is enabled automatically.
subGroupCumFreqName Specifies the column name for the cumulative frequency of subgroups that each record belongs to. The default column name is _SubCumFreq_.
subGroupFrequencyName Specifies the column name for the frequency of the record within its subgroup. The default column name is _SubFrequency_.
subGroupIDName Specifies the column name for the subgroup IDs. The default column name is _SubGroupID_.
subGroupPositionName Specifies column name for the subgroup frequencies. The default column name is _SubPosition_.
table Specifies the table name, caslib, and other common parameters.
computedOnDemand Specifies whether computed variables are to be computed on demand.
computedVars Specifies the computed variables to add to the table.
computedVarsProgram Specifies the string that contains a single DS2 program that is to be executed in CAS.
groupBy Specifies the variables to use for grouping.
groupByMode Specifies how to handle grouping variables. Values can be NOSORT, REDISTRIBUTE.
orderBy Specifies the variables to use for ordering.
singlePass When set to True, only one pass is made over the data to collect statistics.
vars Specifies the variables to use.
whereTable Specifies a table that contains a WHERE expression. This is for internal use only.
Data Preparation View data prep sheet
Create Data for GroupByInfo Example

This is a generic placeholder as no explicit data creation example was found in the provided HTML snippet. A typical data creation would involve loading a table into CAS.

Copied!
1DATA casuser.mydata;
2 INPUT ID $ X Y;
3 DATALINES;
4 A 1 10
5 B 2 20
6 A 3 30
7 C 4 40
8 B 5 50
9 A 6 60
10 ;
11RUN;
12PROC CAS;
13 caslib _all_ assign;
14 load DATA=casuser.mydata out=mydata;
15RUN; QUIT;

Examples

This example demonstrates how to use groupByInfo with only the required parameters to compute group information for a table.

SAS® / CAS Code Code awaiting community validation
Copied!
1PROC CAS;
2 SIMPLE.groupByInfo RESULT=r / TABLE={name='mydata', groupBy={'ID'}};
3 PRINT r;
4QUIT;
Result :
A result table showing the group information for 'mydata' based on the 'ID' variable, including group IDs, frequencies, and positions.

This example illustrates the use of 'algorithm2' for potentially faster processing with a large number of groups, and also includes duplicate records in the output table.

SAS® / CAS Code Code awaiting community validation
Copied!
1PROC CAS;
2 SIMPLE.groupByInfo RESULT=r /
3 TABLE={name='mydata', groupBy={'ID', 'X'}},
4 algorithm2=TRUE,
5 includeDuplicates=TRUE,
6 generatedColumns={'ALL'};
7 PRINT r;
8QUIT;
Result :
A detailed result table with group information, using the efficient algorithm and including duplicate records, showing all generated columns like _GroupID_, _Frequency_, _CumFreq_, and _Position_ for each group defined by 'ID' and 'X'.

This example shows how to filter the output table to include only groups that meet a minimum frequency and minimum position within their group.

SAS® / CAS Code Code awaiting community validation
Copied!
1PROC CAS;
2 SIMPLE.groupByInfo RESULT=r /
3 TABLE={name='mydata', groupBy={'ID'}},
4 minFrequency=2,
5 minPosition=2,
6 generatedColumns={'GROUPID', 'FREQUENCY', 'POSITION'};
7 PRINT r;
8QUIT;
Result :
A result table containing group IDs, frequencies, and positions, but only for groups where the frequency is 2 or more, and the record's position within its group is 2 or more.

FAQ

What is the simple.groupByInfo Action?
What are the required parameters for simple.groupByInfo?
How can I specify the output table for simple.groupByInfo?
What are the options for generatedColumns in simple.groupByInfo?
Can I use an alternative algorithm for groupByInfo?
How can I filter records based on frequency or position in the output table?
What is the purpose of the `noVars` parameter in simple.groupByInfo?
Can I include groups with missing values in the output?
What is the `groupByLimit` parameter used for?
What parameters are used for reading input tables?
What parameters are used for creating output tables?
What does the 'cumFreqName' parameter do?
What does the 'frequencyName' parameter do?
What does the 'generatedColumns' parameter do?
What does the 'groupIDName' parameter do?
What does the 'positionName' parameter do?
What does the 'subGroupCumFreqName' parameter do?
What does the 'subGroupFrequencyName' parameter do?
What does the 'subGroupIDName' parameter do?
What does the 'subGroupPositionName' parameter do?
What does the 'algorithm2' parameter do?
What does the 'details' parameter do?
What does the 'divider' parameter do?
What does the 'groupByLimit' parameter do?
What does the 'includeDuplicates' parameter do?
What does the 'includeMissing' parameter do?
What does the 'inputRegEx' parameter do?
What does the 'journalTrace' parameter do?
What does the 'keyModify' parameter do?
What does the 'maxFrequency' parameter do?
What does the 'maxPosition' parameter do?
What does the 'minFrequency' parameter do?
What does the 'minGroupsLL' parameter do?
What does the 'minPosition' parameter do?
What does the 'noVars' parameter do?
What does the 'nSubGroupVars' parameter do?
What does the 'position' parameter do?
What does the 'sparse' parameter do?
What are the variable attributes?
What are the settings for an output table?
Which columns are copied from the input table to the casOut output table?
How to specify an input table for group-by analysis?
What are the input variables for the analysis?
How to specify the table name, caslib, and other common parameters?