simple

groupByInfo

Beschreibung

The groupByInfo action computes the index and frequency of each group, and the index of each record within its group. This action is part of the Simple Analytics Action Set, which provides basic analytic functions.

simple.groupByInfo <result=results> <status=rc> / <algorithm2>=TRUE | FALSE, <attributes>={{format="string", formattedLength=integer, label="string", name="variable-name", nfd=integer, nfl=integer}, {...}}, <casOut>={caslib="string", compress=TRUE | FALSE, indexVars={"variable-name-1" <, "variable-name-2">}, label="string", lifetime=64-bit-integer, maxMemSize=64-bit-integer, memoryFormat="DVR" | "INHERIT" | "STANDARD", name="table-name", promote=TRUE | FALSE, replace=TRUE | FALSE, replication=integer, tableRedistUpPolicy="DEFER" | "NOREDIST" | "REBALANCE", threadBlockSize=64-bit-integer, timeStamp="string", where={"string-1" <, "string-2">}}, <copyVars>={"variable-name-1" <, "variable-name-2">}, <cumFreqName>="variable-name", <details>=TRUE | FALSE, <divider>="string", <frequencyName>="variable-name", <generatedColumns>="ALL" | "CUMFREQ" | "F" | "FREQUENCY" | "GROUPID" | "NONE" | "POSITION", <groupByLimit>=64-bit-integer, <groupbyTable>={casLib="string", dataSourceOptions={adls_noreq-parameters | bigquery-parameters | cas_noreq-parameters | clouddex-parameters | db2-parameters | dnfs-parameters | esp-parameters | fedsvr-parameters | gcs_noreq-parameters | hadoop-parameters | hana-parameters | impala-parameters | informix-parameters | jdbc-parameters | mongodb-parameters | mysql-parameters | odbc-parameters | oracle-parameters | path-parameters | postgres-parameters | redshift-parameters | s3-parameters | sapiq-parameters | sforce-parameters | singlestore_standard-parameters | snowflake-parameters | spark-parameters | spde-parameters | sqlserver-parameters | ss_noreq-parameters | teradata-parameters | vertica-parameters | yellowbrick-parameters}, importOptions={fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DELIMITED" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SOUND" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters}, name="table-name", where="where-expression"}, <groupIDName>="variable-name", <includeDuplicates>=TRUE | FALSE, <includeMissing>=TRUE | FALSE, <inputRegEx>={"string-1" <, "string-2">}, <inputs>={{format="string", formattedLength=integer, label="string", name="variable-name", nfd=integer, nfl=integer}, {...}}, <journalTrace>=TRUE | FALSE, <keyModify>={"string-1" <, "string-2">}, <maxFrequency>=64-bit-integer, <maxPosition>=64-bit-integer, <minFrequency>=64-bit-integer, <minGroupsLL>=64-bit-integer, <minPosition>=64-bit-integer, <noVars>=TRUE | FALSE, <nSubGroupVars>=integer, <position>=64-bit-integer, <positionName>="variable-name", <sparse>=TRUE | FALSE, <subGroupCumFreqName>="variable-name", <subGroupFrequencyName>="variable-name", <subGroupIDName>="variable-name", <subGroupPositionName>="variable-name", table={caslib="string", computedOnDemand=TRUE | FALSE, computedVars={{format="string", formattedLength=integer, label="string", name="variable-name", nfd=integer, nfl=integer}, {...}}, computedVarsProgram="string", dataSourceOptions={key-1=any-list-or-data-type-1 <, key-2=any-list-or-data-type-2>}, groupBy={{format="string", formattedLength=integer, label="string", name="variable-name", nfd=integer, nfl=integer}, {...}}, groupByMode="NOSORT" | "REDISTRIBUTE", importOptions={fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DELIMITED" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SOUND" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters}, name="table-name", orderBy={{format="string", formattedLength=integer, label="string", name="variable-name", nfd=integer, nfl=integer}, {...}}, singlePass=TRUE | FALSE, vars={{format="string", formattedLength=integer, label="string", name="variable-name", nfd=integer, nfl=integer}, {...}}, where="where-expression", whereTable={casLib="string", dataSourceOptions={adls_noreq-parameters | bigquery-parameters | cas_noreq-parameters | clouddex-parameters | db2-parameters | dnfs-parameters | esp-parameters | fedsvr-parameters | gcs_noreq-parameters | hadoop-parameters | hana-parameters | impala-parameters | informix-parameters | jdbc-parameters | mongodb-parameters | mysql-parameters | odbc-parameters | oracle-parameters | path-parameters | postgres-parameters | redshift-parameters | s3-parameters | sapiq-parameters | sforce-parameters | singlestore_standard-parameters | snowflake-parameters | spark-parameters | spde-parameters | sqlserver-parameters | ss_noreq-parameters | teradata-parameters | vertica-parameters | yellowbrick-parameters}, importOptions={fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DELIMITED" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SOUND" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters}, name="table-name", vars={{format="string", formattedLength=integer, label="string", name="variable-name", nfd=integer, nfl=integer}, {...}}, where="where-expression"} ;
Einstellungen
ParameterBeschreibung
algorithm2Specifies a more efficient algorithm for processing group information. Processing with this algorithm might be faster when the data contains a large number of groups and the grid that you are using has many nodes. When True is specified, the output table specified by casOut= does not contain formatted-value columns for the input variables (input-variable-name_F columns).
attributesSpecifies the variable attributes.
casOutSpecifies the settings for an output table.
copyVarsSpecifies the columns to copy from the input table to the casOut output table. Variables specified in the inputs parameter and groupBy columns are automatically copied.
cumFreqNameSpecifies the column name for the cumulative frequency in the output table. The column shows the cumulative frequency of the group to which the record belongs, with respect to the input table. The default column name is _CumFreq_.
detailsSpecifies that messages such as the number of groups and other details are written to the SAS log.
dividerSpecifies the character that is used to divide character values of the input variables. Specify a character that is not in these variables.
frequencyNameSpecifies the column name for the group frequencies in the output table. The default column name is _Frequency_.
generatedColumnsSpecifies the generated columns to include in the output table. Values can be ALL, CUMFREQ, F, FREQUENCY, GROUPID, NONE, POSITION.
groupByLimitSpecifies the maximum number of levels in a group-by set. When the server determines this number of levels, the server stops and does not return a result. Specify this parameter if you want to avoid creating large result sets in group-by operations.
groupbyTableSpecifies an input table that contains the groups to use in a group-by analysis.
casLibSpecifies the caslib for the filter table. By default, the active caslib is used.
dataSourceOptionsSpecifies data source options.
importOptionsSpecifies the settings for reading a table from a data source.
nameSpecifies the name of the filter table.
whereSpecifies an expression for subsetting the data from the filter table.
groupIDNameSpecifies the column name for the group IDs in the output table. The default column name is _GroupID_.
includeDuplicatesSpecifies that the output table contains duplicate records.
includeMissingSpecifies that groups that have a missing value in a groupBy variable are included in the casOut table.
inputRegExSpecifies a list of regular expressions, one for each variable in the inputs list.
inputsSpecifies the input variables for the analysis.
journalTraceWhen set to True, the action writes detailed tracing information to the SAS log.
keyModifySpecifies whether you want character values in input variables to be left-justified (L), right-justified (R), converted to uppercase (U), and consecutive blanks condensed to one blank (C).
maxFrequencySpecifies that the output table contains records with _Frequency_ less than or equal to this number.
maxPositionSpecifies that the output table contains records with _Position_ less than or equal to this number.
minFrequencySpecifies that the output table contains records with _Frequency_ greater than or equal to this number.
minGroupsLLSpecifies the minimum number of groups that are required to use post-read parallel algorithms.
minPositionSpecifies that the output table contains records with _Position_ greater than or equal to this number.
noVarsWhen set to True, no variables are automatically copied from the input table to the output table. By default, all columns that are not in the inputs, groupBy, weight, or frequency parameters are copied.
nSubGroupVarsSpecifies the number of variables in the inputs varlist whose values determine subgroups.
positionSpecifies to filter the records for the output table. Only the records that match the specified value in the _Position_ column are included in the output table.
positionNameSpecifies the column name for the position of the record within its group in the output table. The default column name is _Position_.
sparseWhen set to True and the action is running on a distributed server, each thread reads data that belongs to only a few groups. By default, each thread reads records from many or all groups. Specify True to improve performance when there are many threads on all workers, many groups, or both. In some cases, sparse is enabled automatically.
subGroupCumFreqNameSpecifies the column name for the cumulative frequency of subgroups that each record belongs to. The default column name is _SubCumFreq_.
subGroupFrequencyNameSpecifies the column name for the frequency of the record within its subgroup. The default column name is _SubFrequency_.
subGroupIDNameSpecifies the column name for the subgroup IDs. The default column name is _SubGroupID_.
subGroupPositionNameSpecifies column name for the subgroup frequencies. The default column name is _SubPosition_.
tableSpecifies the table name, caslib, and other common parameters.
computedOnDemandSpecifies whether computed variables are to be computed on demand.
computedVarsSpecifies the computed variables to add to the table.
computedVarsProgramSpecifies the string that contains a single DS2 program that is to be executed in CAS.
groupBySpecifies the variables to use for grouping.
groupByModeSpecifies how to handle grouping variables. Values can be NOSORT, REDISTRIBUTE.
orderBySpecifies the variables to use for ordering.
singlePassWhen set to True, only one pass is made over the data to collect statistics.
varsSpecifies the variables to use.
whereTableSpecifies a table that contains a WHERE expression. This is for internal use only.
Create Data for GroupByInfo Example

This is a generic placeholder as no explicit data creation example was found in the provided HTML snippet. A typical data creation would involve loading a table into CAS.

Kopiert!
1DATA casuser.mydata;
2 INPUT ID $ X Y;
3 DATALINES;
4 A 1 10
5 B 2 20
6 A 3 30
7 C 4 40
8 B 5 50
9 A 6 60
10 ;
11RUN;
12PROC CAS;
13 caslib _all_ assign;
14 load DATA=casuser.mydata out=mydata;
15RUN; QUIT;

Beispiele

This example demonstrates how to use groupByInfo with only the required parameters to compute group information for a table.

SAS® / CAS-Code Code wartet auf Validierung durch die Community
Kopiert!
1PROC CAS;
2 SIMPLE.groupByInfo RESULT=r / TABLE={name='mydata', groupBy={'ID'}};
3 PRINT r;
4QUIT;
Ergebnis :
A result table showing the group information for 'mydata' based on the 'ID' variable, including group IDs, frequencies, and positions.

This example illustrates the use of 'algorithm2' for potentially faster processing with a large number of groups, and also includes duplicate records in the output table.

SAS® / CAS-Code Code wartet auf Validierung durch die Community
Kopiert!
1PROC CAS;
2 SIMPLE.groupByInfo RESULT=r /
3 TABLE={name='mydata', groupBy={'ID', 'X'}},
4 algorithm2=TRUE,
5 includeDuplicates=TRUE,
6 generatedColumns={'ALL'};
7 PRINT r;
8QUIT;
Ergebnis :
A detailed result table with group information, using the efficient algorithm and including duplicate records, showing all generated columns like _GroupID_, _Frequency_, _CumFreq_, and _Position_ for each group defined by 'ID' and 'X'.

This example shows how to filter the output table to include only groups that meet a minimum frequency and minimum position within their group.

SAS® / CAS-Code Code wartet auf Validierung durch die Community
Kopiert!
1PROC CAS;
2 SIMPLE.groupByInfo RESULT=r /
3 TABLE={name='mydata', groupBy={'ID'}},
4 minFrequency=2,
5 minPosition=2,
6 generatedColumns={'GROUPID', 'FREQUENCY', 'POSITION'};
7 PRINT r;
8QUIT;
Ergebnis :
A result table containing group IDs, frequencies, and positions, but only for groups where the frequency is 2 or more, and the record's position within its group is 2 or more.

FAQ

What is the simple.groupByInfo Action?
What are the required parameters for simple.groupByInfo?
How can I specify the output table for simple.groupByInfo?
What are the options for generatedColumns in simple.groupByInfo?
Can I use an alternative algorithm for groupByInfo?
How can I filter records based on frequency or position in the output table?
What is the purpose of the `noVars` parameter in simple.groupByInfo?
Can I include groups with missing values in the output?
What is the `groupByLimit` parameter used for?
What parameters are used for reading input tables?
What parameters are used for creating output tables?
What does the 'cumFreqName' parameter do?
What does the 'frequencyName' parameter do?
What does the 'generatedColumns' parameter do?
What does the 'groupIDName' parameter do?
What does the 'positionName' parameter do?
What does the 'subGroupCumFreqName' parameter do?
What does the 'subGroupFrequencyName' parameter do?
What does the 'subGroupIDName' parameter do?
What does the 'subGroupPositionName' parameter do?
What does the 'algorithm2' parameter do?
What does the 'details' parameter do?
What does the 'divider' parameter do?
What does the 'groupByLimit' parameter do?
What does the 'includeDuplicates' parameter do?
What does the 'includeMissing' parameter do?
What does the 'inputRegEx' parameter do?
What does the 'journalTrace' parameter do?
What does the 'keyModify' parameter do?
What does the 'maxFrequency' parameter do?
What does the 'maxPosition' parameter do?
What does the 'minFrequency' parameter do?
What does the 'minGroupsLL' parameter do?
What does the 'minPosition' parameter do?
What does the 'noVars' parameter do?
What does the 'nSubGroupVars' parameter do?
What does the 'position' parameter do?
What does the 'sparse' parameter do?
What are the variable attributes?
What are the settings for an output table?
Which columns are copied from the input table to the casOut output table?
How to specify an input table for group-by analysis?
What are the input variables for the analysis?
How to specify the table name, caslib, and other common parameters?