groupByInfo - WeAreCAS

Q: What is the simple.groupByInfo Action?

The simple.groupByInfo action computes the index and frequency of each group, and the index of each record within its group. It is part of the Simple Analytics Action Set.

Q: What are the required parameters for simple.groupByInfo?

The `table` parameter, which specifies the input table, is required. Its `name` subparameter is also required.

Q: How can I specify the output table for simple.groupByInfo?

You can specify the settings for an output table using the `casOut` parameter. This parameter supports subparameters such as `caslib`, `name`, `replace`, etc.

Q: What are the options for generatedColumns in simple.groupByInfo?

The `generatedColumns` parameter (alias `genCols`) allows you to specify which generated columns to include in the output table. Options include 'ALL', 'CUMFREQ', 'F', 'FREQUENCY', 'GROUPID', 'NONE', and 'POSITION'. These values must be unique.

Q: Can I use an alternative algorithm for groupByInfo?

Yes, you can specify `algorithm2=TRUE` to use a more efficient algorithm for processing group information. This might be faster for data with a large number of groups or many nodes. When `algorithm2=TRUE`, certain parameters like `divider`, `includeMissing`, `inputRegex`, `keyModify`, `nSubgroupVars`, `subgroupCumFreqName`, `subgroupFrequencyName`, `subgroupIdName`, and `subgroupPositionName` are ignored, and formatted-value columns are not included in the output table.

Q: How can I filter records based on frequency or position in the output table?

You can use `maxFrequency` (alias `maxFreq`) to include records with _Frequency_ less than or equal to a specified number, `minFrequency` (alias `minFreq`) for records with _Frequency_ greater than or equal to a specified number, `maxPosition` (alias `maxIIG`) for records with _Position_ less than or equal to a specified number, and `minPosition` (alias `minPos`) for records with _Position_ greater than or equal to a specified number.

Q: What is the purpose of the `noVars` parameter in simple.groupByInfo?

When `noVars=TRUE`, no variables are automatically copied from the input table to the output table. By default, all columns not in the `inputs`, `groupBy`, `weight`, or `frequency` parameters are copied.

Q: Can I include groups with missing values in the output?

Yes, `includeMissing=TRUE` specifies that groups with a missing value in a groupBy variable are included in the `casOut` table. The default is `FALSE`.

Q: What is the `groupByLimit` parameter used for?

The `groupByLimit` parameter specifies the maximum number of levels in a group-by set. The server stops processing if this limit is reached, which can prevent the creation of excessively large result sets.

Q: What parameters are used for reading input tables?

The parameters for reading input tables are: 'groupbyTable' which specifies an input table that contains the groups to use in a group-by analysis, and 'table' which specifies the table name, caslib, and other common parameters.

Description

The groupByInfo action computes the index and frequency of each group, and the index of each record within its group. This action is part of the Simple Analytics Action Set, which provides basic analytic functions.

simple.groupByInfo <result=results> <status=rc> / <algorithm2>=TRUE | FALSE, <attributes>={{format="string", formattedLength=integer, label="string", name="variable-name", nfd=integer, nfl=integer}, {...}}, <casOut>={caslib="string", compress=TRUE | FALSE, indexVars={"variable-name-1" <, "variable-name-2">}, label="string", lifetime=64-bit-integer, maxMemSize=64-bit-integer, memoryFormat="DVR" | "INHERIT" | "STANDARD", name="table-name", promote=TRUE | FALSE, replace=TRUE | FALSE, replication=integer, tableRedistUpPolicy="DEFER" | "NOREDIST" | "REBALANCE", threadBlockSize=64-bit-integer, timeStamp="string", where={"string-1" <, "string-2">}}, <copyVars>={"variable-name-1" <, "variable-name-2">}, <cumFreqName>="variable-name", <details>=TRUE | FALSE, <divider>="string", <frequencyName>="variable-name", <generatedColumns>="ALL" | "CUMFREQ" | "F" | "FREQUENCY" | "GROUPID" | "NONE" | "POSITION", <groupByLimit>=64-bit-integer, <groupbyTable>={casLib="string", dataSourceOptions={adls_noreq-parameters | bigquery-parameters | cas_noreq-parameters | clouddex-parameters | db2-parameters | dnfs-parameters | esp-parameters | fedsvr-parameters | gcs_noreq-parameters | hadoop-parameters | hana-parameters | impala-parameters | informix-parameters | jdbc-parameters | mongodb-parameters | mysql-parameters | odbc-parameters | oracle-parameters | path-parameters | postgres-parameters | redshift-parameters | s3-parameters | sapiq-parameters | sforce-parameters | singlestore_standard-parameters | snowflake-parameters | spark-parameters | spde-parameters | sqlserver-parameters | ss_noreq-parameters | teradata-parameters | vertica-parameters | yellowbrick-parameters}, importOptions={fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DELIMITED" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SOUND" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters}, name="table-name", where="where-expression"}, <groupIDName>="variable-name", <includeDuplicates>=TRUE | FALSE, <includeMissing>=TRUE | FALSE, <inputRegEx>={"string-1" <, "string-2">}, <inputs>={{format="string", formattedLength=integer, label="string", name="variable-name", nfd=integer, nfl=integer}, {...}}, <journalTrace>=TRUE | FALSE, <keyModify>={"string-1" <, "string-2">}, <maxFrequency>=64-bit-integer, <maxPosition>=64-bit-integer, <minFrequency>=64-bit-integer, <minGroupsLL>=64-bit-integer, <minPosition>=64-bit-integer, <noVars>=TRUE | FALSE, <nSubGroupVars>=integer, <position>=64-bit-integer, <positionName>="variable-name", <sparse>=TRUE | FALSE, <subGroupCumFreqName>="variable-name", <subGroupFrequencyName>="variable-name", <subGroupIDName>="variable-name", <subGroupPositionName>="variable-name", table={caslib="string", computedOnDemand=TRUE | FALSE, computedVars={{format="string", formattedLength=integer, label="string", name="variable-name", nfd=integer, nfl=integer}, {...}}, computedVarsProgram="string", dataSourceOptions={key-1=any-list-or-data-type-1 <, key-2=any-list-or-data-type-2>}, groupBy={{format="string", formattedLength=integer, label="string", name="variable-name", nfd=integer, nfl=integer}, {...}}, groupByMode="NOSORT" | "REDISTRIBUTE", importOptions={fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DELIMITED" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SOUND" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters}, name="table-name", orderBy={{format="string", formattedLength=integer, label="string", name="variable-name", nfd=integer, nfl=integer}, {...}}, singlePass=TRUE | FALSE, vars={{format="string", formattedLength=integer, label="string", name="variable-name", nfd=integer, nfl=integer}, {...}}, where="where-expression", whereTable={casLib="string", dataSourceOptions={adls_noreq-parameters | bigquery-parameters | cas_noreq-parameters | clouddex-parameters | db2-parameters | dnfs-parameters | esp-parameters | fedsvr-parameters | gcs_noreq-parameters | hadoop-parameters | hana-parameters | impala-parameters | informix-parameters | jdbc-parameters | mongodb-parameters | mysql-parameters | odbc-parameters | oracle-parameters | path-parameters | postgres-parameters | redshift-parameters | s3-parameters | sapiq-parameters | sforce-parameters | singlestore_standard-parameters | snowflake-parameters | spark-parameters | spde-parameters | sqlserver-parameters | ss_noreq-parameters | teradata-parameters | vertica-parameters | yellowbrick-parameters}, importOptions={fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DELIMITED" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SOUND" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters}, name="table-name", vars={{format="string", formattedLength=integer, label="string", name="variable-name", nfd=integer, nfl=integer}, {...}}, where="where-expression"} ;

Settings

Parameter	Description
algorithm2	Specifies a more efficient algorithm for processing group information. Processing with this algorithm might be faster when the data contains a large number of groups and the grid that you are using has many nodes. When True is specified, the output table specified by casOut= does not contain formatted-value columns for the input variables (input-variable-name_F columns).
attributes	Specifies the variable attributes.
casOut	Specifies the settings for an output table.
copyVars	Specifies the columns to copy from the input table to the casOut output table. Variables specified in the inputs parameter and groupBy columns are automatically copied.
cumFreqName	Specifies the column name for the cumulative frequency in the output table. The column shows the cumulative frequency of the group to which the record belongs, with respect to the input table. The default column name is _CumFreq_.
details	Specifies that messages such as the number of groups and other details are written to the SAS log.
divider	Specifies the character that is used to divide character values of the input variables. Specify a character that is not in these variables.
frequencyName	Specifies the column name for the group frequencies in the output table. The default column name is _Frequency_.
generatedColumns	Specifies the generated columns to include in the output table. Values can be ALL, CUMFREQ, F, FREQUENCY, GROUPID, NONE, POSITION.
groupByLimit	Specifies the maximum number of levels in a group-by set. When the server determines this number of levels, the server stops and does not return a result. Specify this parameter if you want to avoid creating large result sets in group-by operations.
groupbyTable	Specifies an input table that contains the groups to use in a group-by analysis.
casLib	Specifies the caslib for the filter table. By default, the active caslib is used.
dataSourceOptions	Specifies data source options.
importOptions	Specifies the settings for reading a table from a data source.
name	Specifies the name of the filter table.
where	Specifies an expression for subsetting the data from the filter table.
groupIDName	Specifies the column name for the group IDs in the output table. The default column name is _GroupID_.
includeDuplicates	Specifies that the output table contains duplicate records.
includeMissing	Specifies that groups that have a missing value in a groupBy variable are included in the casOut table.
inputRegEx	Specifies a list of regular expressions, one for each variable in the inputs list.
inputs	Specifies the input variables for the analysis.
journalTrace	When set to True, the action writes detailed tracing information to the SAS log.
keyModify	Specifies whether you want character values in input variables to be left-justified (L), right-justified (R), converted to uppercase (U), and consecutive blanks condensed to one blank (C).
maxFrequency	Specifies that the output table contains records with _Frequency_ less than or equal to this number.
maxPosition	Specifies that the output table contains records with _Position_ less than or equal to this number.
minFrequency	Specifies that the output table contains records with _Frequency_ greater than or equal to this number.
minGroupsLL	Specifies the minimum number of groups that are required to use post-read parallel algorithms.
minPosition	Specifies that the output table contains records with _Position_ greater than or equal to this number.
noVars	When set to True, no variables are automatically copied from the input table to the output table. By default, all columns that are not in the inputs, groupBy, weight, or frequency parameters are copied.
nSubGroupVars	Specifies the number of variables in the inputs varlist whose values determine subgroups.
position	Specifies to filter the records for the output table. Only the records that match the specified value in the _Position_ column are included in the output table.
positionName	Specifies the column name for the position of the record within its group in the output table. The default column name is _Position_.
sparse	When set to True and the action is running on a distributed server, each thread reads data that belongs to only a few groups. By default, each thread reads records from many or all groups. Specify True to improve performance when there are many threads on all workers, many groups, or both. In some cases, sparse is enabled automatically.
subGroupCumFreqName	Specifies the column name for the cumulative frequency of subgroups that each record belongs to. The default column name is _SubCumFreq_.
subGroupFrequencyName	Specifies the column name for the frequency of the record within its subgroup. The default column name is _SubFrequency_.
subGroupIDName	Specifies the column name for the subgroup IDs. The default column name is _SubGroupID_.
subGroupPositionName	Specifies column name for the subgroup frequencies. The default column name is _SubPosition_.
table	Specifies the table name, caslib, and other common parameters.
computedOnDemand	Specifies whether computed variables are to be computed on demand.
computedVars	Specifies the computed variables to add to the table.
computedVarsProgram	Specifies the string that contains a single DS2 program that is to be executed in CAS.
groupBy	Specifies the variables to use for grouping.
groupByMode	Specifies how to handle grouping variables. Values can be NOSORT, REDISTRIBUTE.
orderBy	Specifies the variables to use for ordering.
singlePass	When set to True, only one pass is made over the data to collect statistics.
vars	Specifies the variables to use.
whereTable	Specifies a table that contains a WHERE expression. This is for internal use only.

Data Preparation View data prep sheet

Create Data for GroupByInfo Example

This is a generic placeholder as no explicit data creation example was found in the provided HTML snippet. A typical data creation would involve loading a table into CAS.

Copied!

1	DATA casuser.mydata;
2	INPUT ID $ X Y;
3	DATALINES;
4	A 1 10
5	B 2 20
6	A 3 30
7	C 4 40
8	B 5 50
9	A 6 60
10	;
11	RUN;
12	PROC CAS;
13	caslib _all_ assign;
14	load DATA=casuser.mydata out=mydata;
15	RUN; QUIT;

Examples

This example demonstrates how to use groupByInfo with only the required parameters to compute group information for a table.

SAS® / CAS Code Code awaiting community validation

Copied!

1	PROC CAS;
2	SIMPLE.groupByInfo RESULT=r / TABLE={name='mydata', groupBy={'ID'}};
3	PRINT r;
4	QUIT;

Result :
A result table showing the group information for 'mydata' based on the 'ID' variable, including group IDs, frequencies, and positions.

This example illustrates the use of 'algorithm2' for potentially faster processing with a large number of groups, and also includes duplicate records in the output table.

SAS® / CAS Code Code awaiting community validation

Copied!

1	PROC CAS;
2	SIMPLE.groupByInfo RESULT=r /
3	TABLE={name='mydata', groupBy={'ID', 'X'}},
4	algorithm2=TRUE,
5	includeDuplicates=TRUE,
6	generatedColumns={'ALL'};
7	PRINT r;
8	QUIT;

Result :
A detailed result table with group information, using the efficient algorithm and including duplicate records, showing all generated columns like _GroupID_, _Frequency_, _CumFreq_, and _Position_ for each group defined by 'ID' and 'X'.

This example shows how to filter the output table to include only groups that meet a minimum frequency and minimum position within their group.

SAS® / CAS Code Code awaiting community validation

Copied!

1	PROC CAS;
2	SIMPLE.groupByInfo RESULT=r /
3	TABLE={name='mydata', groupBy={'ID'}},
4	minFrequency=2,
5	minPosition=2,
6	generatedColumns={'GROUPID', 'FREQUENCY', 'POSITION'};
7	PRINT r;
8	QUIT;

Result :
A result table containing group IDs, frequencies, and positions, but only for groups where the frequency is 2 or more, and the record's position within its group is 2 or more.

FAQ

What is the simple.groupByInfo Action?

What are the required parameters for simple.groupByInfo?

How can I specify the output table for simple.groupByInfo?

What are the options for generatedColumns in simple.groupByInfo?

Can I use an alternative algorithm for groupByInfo?

How can I filter records based on frequency or position in the output table?

What is the purpose of the `noVars` parameter in simple.groupByInfo?

Can I include groups with missing values in the output?

What is the `groupByLimit` parameter used for?

What parameters are used for reading input tables?

What parameters are used for creating output tables?

What does the 'cumFreqName' parameter do?

What does the 'frequencyName' parameter do?

What does the 'generatedColumns' parameter do?

What does the 'groupIDName' parameter do?

What does the 'positionName' parameter do?

What does the 'subGroupCumFreqName' parameter do?

What does the 'subGroupFrequencyName' parameter do?

What does the 'subGroupIDName' parameter do?

What does the 'subGroupPositionName' parameter do?

What does the 'algorithm2' parameter do?

What does the 'details' parameter do?

What does the 'divider' parameter do?

What does the 'groupByLimit' parameter do?

What does the 'includeDuplicates' parameter do?

What does the 'includeMissing' parameter do?

What does the 'inputRegEx' parameter do?

What does the 'journalTrace' parameter do?

What does the 'keyModify' parameter do?

What does the 'maxFrequency' parameter do?

What does the 'maxPosition' parameter do?

What does the 'minFrequency' parameter do?

What does the 'minGroupsLL' parameter do?

What does the 'minPosition' parameter do?

What does the 'noVars' parameter do?

What does the 'nSubGroupVars' parameter do?

What does the 'position' parameter do?

What does the 'sparse' parameter do?

What are the variable attributes?

What are the settings for an output table?

Which columns are copied from the input table to the casOut output table?

How to specify an input table for group-by analysis?

What are the input variables for the analysis?

How to specify the table name, caslib, and other common parameters?

Actions associées

simple

compare

Compares two tables by computing the index and frequency of each group, cumul...

simple

correlation

The correlation action computes Pearson product-moment correlations. This is ...

simple

crossTab

The crossTab action performs one-way or two-way tabulations, also known as fr...

simple

freq

Generates a frequency distribution for one or more variables. It calculates c...

simple

groupBy

Builds BY groups in terms of the variable value combinations given the variab...

simple

mdSummary

The mdSummary action calculates multidimensional summaries of numeric variabl...

Table of Contents

Description

Create Data for GroupByInfo Example

Examples

Basic Usage with Minimum Required Parameters

Using algorithm2 for Efficient Grouping and Including Duplicates

Filtering by Minimum Frequency and Position

FAQ

Actions associées

compare

correlation

crossTab

freq

groupBy

mdSummary