dataSciencePilot

generateShadowFeatures

Description

Generate shadow features.

Settings
Parameter: Description
casOut: specifies the CAS table to store the analysis results. The casouttable value can be one or more of the following:
  caslib="string": specifies the name of the caslib for the output table.
  indexVars={"variable-name-1" <, "variable-name-2", ...>}: specifies the list of variables to create indexes for in the output data.
  lifetime=64-bit-integer: specifies the number of seconds to keep the table in memory after it is last accessed. The table is dropped if it is not accessed for the specified number of seconds. Default: 0. Minimum value: 0.
  memoryFormat="DVR" | "INHERIT" | "STANDARD": specifies the memory format for the output table. Default: INHERIT.
    DVR: use the duplicate value reduction memory format. This memory format can reduce the memory consumption and file size when the input data contains duplicate values.
    INHERIT: use the default memory format that is set for the server. By default, the server uses the standard memory format. If an administrator sets the CAS_DEFAULT_MEMORY_FORMAT environment variable to DVR, then the DVR memory format is set as the default for the server.
    STANDARD: use the standard memory format.
  name="table-name": specifies the name for the output table.
  promote=TRUE | FALSE: when set to True, adds the output table with a global scope. This enables other sessions to access the table, subject to access controls. The target caslib must also have a global scope. Default: FALSE.
  replace=TRUE | FALSE: when set to True, overwrites an existing table that has the same name. Default: FALSE.
  tableRedistUpPolicy="DEFER" | "NOREDIST" | "REBALANCE": specifies the table redistribution policy when the number of worker pods increases on a running CAS server.
    DEFER: defer redistribution policy selection to a higher-level entity.
    NOREDIST: do not redistribute table data when the number of worker pods changes on a running CAS server.
    REBALANCE: rebalance table data when the number of worker pods changes on a running CAS server.
copyVars: specifies the names of variables to be copied to the output table.
distinctCountLimit: specifies the distinct count limit. If the limit is exceeded and the misraGries parameter is set to True, the Misra-Gries frequency sketch algorithm is used to estimate the frequency distribution. Otherwise, the distinct count operation is aborted. Default: 10000. Minimum value: 256.
ecdfTolerance: specifies the tolerance value for the empirical cumulative distribution function. This value is used by the quantile sketch algorithm. Default: 0.001. Range: 1E-06–0.1.
freq: specifies the frequency variable.
generateLevels: specifies that levels, instead of raw values, be generated. Default: FALSE.
inputs: specifies the variables to use for the analysis. You can specify a subset of the variables from the input table. For more information about specifying the inputs parameter, see the common casinvardesc parameter (Appendix A: Common Parameters). Alias: vars.
misraGries: when set to True, uses the Misra-Gries algorithm to estimate the frequency distribution if the distinct count limit is exceeded. Default: TRUE.
nominals: specifies the nominal variables.
nProbes: specifies the number of shadow features to generate for each variable. Default: 5. Range: 1–20.
probeMissing: when set to True, generates missing values at the observed missing rate. Default: TRUE.
rareThreshold: specifies the rare frequency threshold. Alias: rareFreqCutOff. Minimum value (exclusive): 0.
rareThresholdPercent: specifies the rare frequency threshold percentage. Levels whose frequencies are below the threshold are grouped together. Alias: rareThresholdPercentage. Range: (0, 100).
sample: specifies the options for sampling the shadow features. The featureProbeSample value can be one or more of the following:
  nRecords=64-bit-integer: specifies the number of observations to sample using the specified model (astore). Alias: nObs. Default: 1000. Minimum value: 1.
  rstore={castable}: specifies an input blob table from which to read the model and the state. The castable value can be one or more of the following:
    caslib="string": specifies the caslib for the input table that you want to use with the action. By default, the active caslib is used. Specify a value only if you need to access a table from a different caslib.
    dataSourceOptions={key-1=any-list-or-data-type-1 <, key-2=any-list-or-data-type-2, ...>}: specifies data source options. Aliases: options, dataSource.
    name="table-name": specifies the name of the input table.
    whereTable={groupbytable}: specifies an input table that contains rows to use as a WHERE filter. If the vars parameter is not specified, then all the variable names that are common to the input table and the filtering table are used to find matching rows. If the where parameter for the input table and this parameter are specified, then this filtering table is applied first. The groupbytable value can be one or more of the following:
      casLib="string": specifies the caslib for the filter table. By default, the active caslib is used.
      dataSourceOptions={adls_noreq-parameters | bigquery-parameters | cas_noreq-parameters | clouddex-parameters | db2-parameters | dnfs-parameters | esp-parameters | fedsvr-parameters | gcs_noreq-parameters | hadoop-parameters | hana-parameters | impala-parameters | informix-parameters | jdbc-parameters | mongodb-parameters | mysql-parameters | odbc-parameters | oracle-parameters | path-parameters | postgres-parameters | redshift-parameters | s3-parameters | sapiq-parameters | sforce-parameters | singlestore_standard-parameters | snowflake-parameters | spark-parameters | spde-parameters | sqlserver-parameters | ss_noreq-parameters | teradata-parameters | vertica-parameters | yellowbrick-parameters}: specifies data source options. Aliases: options, dataSource. For more information about specifying the dataSourceOptions parameter, see the common dataSourceOptions parameter (Appendix A: Common Parameters).
      importOptions={fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DELIMITED" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SOUND" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters}: specifies the settings for reading a table from a data source. Alias: import. For more information about specifying the importOptions parameter, see the common importOptions parameter (Appendix A: Common Parameters).
      name="table-name": specifies the name of the filter table.
      vars={{format="string", formattedLength=integer, label="string", name="variable-name", nfd=integer, nfl=integer}, {...}}: specifies the variable names to use from the filter table. The casinvardesc value can be one or more of the following:
        format="string": specifies the format to apply to the variable.
        formattedLength=integer: specifies the length of the format field plus the length of the format precision.
        label="string": specifies the descriptive label for the variable.
        name="variable-name": specifies the name for the variable.
        nfd=integer: specifies the length of the format precision.
        nfl=integer: specifies the length of the format field.
      where="where-expression": specifies an expression for subsetting the data from the filter table.
saveState: specifies the CAS table to store the feature transformation and generation model. Alias: saveModel. The casouttable value can be one or more of the following:
  caslib="string": specifies the name of the caslib for the output table.
  indexVars={"variable-name-1" <, "variable-name-2", ...>}: specifies the list of variables to create indexes for in the output data.
  lifetime=64-bit-integer: specifies the number of seconds to keep the table in memory after it is last accessed. The table is dropped if it is not accessed for the specified number of seconds. Default: 0. Minimum value: 0.
  memoryFormat="DVR" | "INHERIT" | "STANDARD": specifies the memory format for the output table. Default: INHERIT.
    DVR: use the duplicate value reduction memory format. This memory format can reduce the memory consumption and file size when the input data contains duplicate values.
    INHERIT: use the default memory format that is set for the server. By default, the server uses the standard memory format. If an administrator sets the CAS_DEFAULT_MEMORY_FORMAT environment variable to DVR, then the DVR memory format is set as the default for the server.
    STANDARD: use the standard memory format.
  name="table-name": specifies the name for the output table.
  promote=TRUE | FALSE: when set to True, adds the output table with a global scope. This enables other sessions to access the table, subject to access controls. The target caslib must also have a global scope. Default: FALSE.
  replace=TRUE | FALSE: when set to True, overwrites an existing table that has the same name. Default: FALSE.
  tableRedistUpPolicy="DEFER" | "NOREDIST" | "REBALANCE": specifies the table redistribution policy when the number of worker pods increases on a running CAS server.
    DEFER: defer redistribution policy selection to a higher-level entity.
    NOREDIST: do not redistribute table data when the number of worker pods changes on a running CAS server.
    REBALANCE: rebalance table data when the number of worker pods changes on a running CAS server.
seed: specifies a seed value for random number generation. This value is used for repeatable random number generation in some scenarios. Default: 0.
table: specifies the table name, caslib, and other common parameters. The castable value can be one or more of the following:
  caslib="string": specifies the caslib for the input table that you want to use with the action. By default, the active caslib is used. Specify a value only if you need to access a table from a different caslib.
  computedOnDemand=TRUE | FALSE: when set to True, creates the computed variables when the table is loaded instead of when the action begins. Alias: compOnDemand. Default: FALSE.
  computedVars={{casinvardesc-1} <, {casinvardesc-2}, ...>}: specifies the names of the computed variables to create. Specify an expression for each variable in the computedVarsProgram parameter. If you do not specify this parameter, then all variables from computedVarsProgram are automatically included. Alias: compVars. The casinvardesc value can be one or more of the following:
    format="string": specifies the format to apply to the variable.
    formattedLength=integer: specifies the length of the format field plus the length of the format precision.
    label="string": specifies the descriptive label for the variable.
    name="variable-name": specifies the name for the variable.
    nfd=integer: specifies the length of the format precision.
    nfl=integer: specifies the length of the format field.
  computedVarsProgram="string": specifies an expression for each computed variable that you include in the computedVars parameter. Alias: compPgm.
  dataSourceOptions={key-1=any-list-or-data-type-1 <, key-2=any-list-or-data-type-2, ...>}: specifies data source options. Aliases: options, dataSource.
  importOptions={fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DELIMITED" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SOUND" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters}: specifies the settings for reading a table from a data source. Alias: import. For more information about specifying the importOptions parameter, see the common importOptions parameter (Appendix A: Common Parameters).
  name="table-name": specifies the name of the input table.
  singlePass=TRUE | FALSE: when set to True, does not create a transient table on the server. Setting this parameter to True can be efficient, but the data might not have stable ordering upon repeated runs. Default: FALSE.
  vars={{casinvardesc-1} <, {casinvardesc-2}, ...>}: specifies the variables to use in the action. The casinvardesc value can be one or more of the following:
    format="string": specifies the format to apply to the variable.
    formattedLength=integer: specifies the length of the format field plus the length of the format precision.
    label="string": specifies the descriptive label for the variable.
    name="variable-name": specifies the name for the variable.
    nfd=integer: specifies the length of the format precision.
    nfl=integer: specifies the length of the format field.
  where="where-expression": specifies an expression for subsetting the input data.
  whereTable={groupbytable}: specifies an input table that contains rows to use as a WHERE filter. If the vars parameter is not specified, then all the variable names that are common to the input table and the filtering table are used to find matching rows. If the where parameter for the input table and this parameter are specified, then this filtering table is applied first. The groupbytable value can be one or more of the following:
    casLib="string": specifies the caslib for the filter table. By default, the active caslib is used.
    dataSourceOptions={adls_noreq-parameters | bigquery-parameters | cas_noreq-parameters | clouddex-parameters | db2-parameters | dnfs-parameters | esp-parameters | fedsvr-parameters | gcs_noreq-parameters | hadoop-parameters | hana-parameters | impala-parameters | informix-parameters | jdbc-parameters | mongodb-parameters | mysql-parameters | odbc-parameters | oracle-parameters | path-parameters | postgres-parameters | redshift-parameters | s3-parameters | sapiq-parameters | sforce-parameters | singlestore_standard-parameters | snowflake-parameters | spark-parameters | spde-parameters | sqlserver-parameters | ss_noreq-parameters | teradata-parameters | vertica-parameters | yellowbrick-parameters}: specifies data source options. Aliases: options, dataSource. For more information about specifying the dataSourceOptions parameter, see the common dataSourceOptions parameter (Appendix A: Common Parameters).
    importOptions={fileType="ANY" | "AUDIO" | "AUTO" | "BASESAS" | "CSV" | "DELIMITED" | "DOCUMENT" | "DTA" | "ESP" | "EXCEL" | "FMT" | "HDAT" | "IMAGE" | "JMP" | "LASR" | "PARQUET" | "SOUND" | "SPSS" | "VIDEO" | "XLS", fileType-specific-parameters}: specifies the settings for reading a table from a data source. Alias: import. For more information about specifying the importOptions parameter, see the common importOptions parameter (Appendix A: Common Parameters).
    name="table-name": specifies the name of the filter table.
    vars={{format="string", formattedLength=integer, label="string", name="variable-name", nfd=integer, nfl=integer}, {...}}: specifies the variable names to use from the filter table. The casinvardesc value can be one or more of the following:
      format="string": specifies the format to apply to the variable.
      formattedLength=integer: specifies the length of the format field plus the length of the format precision.
      label="string": specifies the descriptive label for the variable.
      name="variable-name": specifies the name for the variable.
      nfd=integer: specifies the length of the format precision.
      nfl=integer: specifies the length of the format field.
    where="where-expression": specifies an expression for subsetting the data from the filter table.
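
The interaction between distinctCountLimit and misraGries described in the table can be illustrated with a small, self-contained Python sketch. It is not the action's implementation: the function names, the return shape, and the choice of sketch size (k = limit + 1) are assumptions made for illustration. It counts exactly while the number of distinct values stays within the limit, then either falls back to a Misra-Gries frequency sketch or aborts.

```python
from collections import Counter


def misra_gries_sketch(values, k):
    """Misra-Gries summary that keeps at most k - 1 counters.

    Each returned count underestimates the true count by at most len(values) / k.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    counters = {}
    for item in values:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def frequency_distribution(values, distinct_count_limit=10000, misra_gries=True):
    """Return (counts, is_estimate) for a list of values.

    Hypothetical helper mirroring the documented behavior: exact counts while the
    number of distinct values stays within the limit; otherwise a sketch when
    misra_gries is True, or an error (the "aborted" case) when it is False.
    """
    exact = Counter()
    for v in values:
        exact[v] += 1
        if len(exact) > distinct_count_limit:
            if not misra_gries:
                raise RuntimeError("distinct count limit exceeded")
            # Assumption: size the sketch so it holds distinct_count_limit counters.
            return misra_gries_sketch(values, distinct_count_limit + 1), True
    return dict(exact), False
```

With a stream of 100 values in which one level appears 50 times, a limit of 2 triggers the sketch, and the dominant level is still reported, although with an underestimated count.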

Examples
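
The source does not describe how the action builds shadow features internally. A common construction, assumed here purely for illustration, draws each shadow value at random from the observed values of the original variable, so the shadow keeps the variable's marginal distribution but carries no relationship to the other columns. The sketch below mirrors the documented nProbes, probeMissing, and seed parameters in plain Python; the function name and the `<variable>_shadow<i>` column naming are hypothetical.

```python
import random


def generate_shadow_features(rows, inputs, n_probes=5, probe_missing=True, seed=0):
    """Return one dict of shadow columns per input row.

    rows: list of dicts; a missing value is represented by None.
    inputs: names of the variables to build shadow features for.
    """
    if not 1 <= n_probes <= 20:  # documented range for nProbes
        raise ValueError("n_probes must be between 1 and 20")
    # Assumption: the seed is used directly, so every seed value (including 0)
    # gives repeatable output. The action's handling of seed=0 is not modeled.
    rng = random.Random(seed)
    n = len(rows)
    shadows = [{} for _ in rows]
    for var in inputs:
        observed = [r[var] for r in rows if r[var] is not None]
        missing_rate = 1 - len(observed) / n if n else 0.0
        for p in range(1, n_probes + 1):
            name = f"{var}_shadow{p}"  # hypothetical naming scheme
            for out in shadows:
                if not observed:
                    # Nothing to sample from: the shadow is entirely missing.
                    out[name] = None
                elif probe_missing and rng.random() < missing_rate:
                    # Inject a missing value at the observed missing rate.
                    out[name] = None
                else:
                    out[name] = rng.choice(observed)
    return shadows


rows = [
    {"x": 1.0, "c": "a"},
    {"x": None, "c": "b"},
    {"x": 3.0, "c": "a"},
    {"x": 4.0, "c": "b"},
]
shadows = generate_shadow_features(rows, ["x", "c"], n_probes=2, seed=42)
```

Each element of `shadows` holds four columns (two per input variable). Because `c` has no missing values, its shadows never contain None, while the shadows of `x` may.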

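The rareThresholdPercent parameter groups together levels whose frequencies fall below a percentage threshold. A minimal sketch of that idea follows. It assumes the percentage is taken relative to the total number of observations and uses a hypothetical `_RARE_` label for the pooled level; neither detail is confirmed by the source.

```python
from collections import Counter


def group_rare_levels(values, rare_threshold_percent, rare_label="_RARE_"):
    """Replace levels whose frequency is below the threshold with one pooled label."""
    if not 0 < rare_threshold_percent < 100:  # documented range is (0, 100)
        raise ValueError("rare_threshold_percent must be strictly between 0 and 100")
    counts = Counter(values)
    # Assumption: the cutoff is a share of all observations.
    cutoff = len(values) * rare_threshold_percent / 100
    return [v if counts[v] >= cutoff else rare_label for v in values]


levels = ["a"] * 6 + ["b"] * 3 + ["c"]
grouped = group_rare_levels(levels, rare_threshold_percent=20)
```

Here the cutoff is 2 observations, so only `c` (one observation) is pooled; raising the threshold to 35 percent would pool `b` as well.
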
FAQ

Parameters for Reading Input Tables
Parameters for Creating Output Tables