Stop Bloating Your Datasets: The Definitive Guide to Variable Selection with KEEP

Difficulty Level

Beginner

Published on : 15/03/2023

Expert Advice

Michael
Viya Infrastructure Manager.

Never use KEEP and DROP in the same DATA step; it creates logical redundancy and can lead to confusion, since DROP always takes precedence. Additionally, if you are reading from a massive data set but only need a few columns, use the KEEP= data set option on the data set named in the SET statement. This prevents the unneeded variables from ever entering the program data vector (PDV), so the step reads and processes less data, which is typically where the largest performance gain comes from.
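As a minimal sketch of the input-side approach described above, the step below reads only two columns from SASHELP.CLASS. The output data set name `class_small` is a hypothetical name chosen for illustration.

```sas
/* KEEP= on the input data set: only name and age are read into the PDV. */
/* The other variables of SASHELP.CLASS (sex, height, weight) are never  */
/* loaded, so they cannot be used or written by this step.               */
data class_small;   /* class_small is an illustrative name */
   set sashelp.class(keep=name age);
run;
```

The trade-off is that any variable left out of KEEP= on the SET statement is unavailable to later statements in the step, so list every variable the step's calculations need.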

The KEEP statement causes a DATA step to write only the variables you specify to one or more SAS data sets. The KEEP statement applies to all SAS data sets created within the same DATA step and can appear anywhere in the step. If no KEEP or DROP statement appears, all data sets created in the DATA step contain all variables.
If the same variable is listed in both DROP and KEEP statements, DROP takes precedence over KEEP, regardless of the order of the statements, and the variable is dropped.
Note: Do not use KEEP and DROP statements in the same DATA step.
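To observe the precedence rule on your own installation, a small experiment along these lines can help. It deliberately mixes both statements, which is exactly the pattern the note advises against, so treat it as a demonstration only. The data set name `precedence_demo` is a hypothetical name.

```sas
/* Illustration only: mixing KEEP and DROP is discouraged in real code. */
data precedence_demo;   /* precedence_demo is an illustrative name */
   set sashelp.class;
   keep name age;
   drop age;            /* age is listed in both statements */
run;

/* List the variables that were actually written to the output data set. */
proc contents data=precedence_demo varnum;
run;
```

Per the rule stated above, `age` should be absent from the PROC CONTENTS listing even though it also appears in the KEEP statement.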
Comparisons:
* The KEEP statement cannot be used in SAS PROC steps. The KEEP= data set option can.
* The KEEP statement applies to all output data sets named in the DATA statement. To write different variables to different data sets, you must use the data set option KEEP=.
* The DROP statement is a parallel statement that specifies variables to omit from the output data set.
* KEEP and DROP statements select variables to include or exclude from output data sets. The subsetting IF statement selects observations.
* Do not confuse the KEEP statement with the RETAIN statement. The RETAIN statement causes SAS to retain the value of a variable from one DATA step iteration to the next. The KEEP statement does not affect the value of variables, but only specifies which variables to include in the output data sets.
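The comparisons above can be sketched with SASHELP.CLASS as follows. The output names `heights` and `weights` and the age threshold are assumptions made for illustration.

```sas
/* One DATA step, two output data sets, each with its own variable list. */
/* A KEEP statement could not do this, because it applies to every       */
/* output data set; the KEEP= data set option is set per data set.       */
data heights(keep=name height)    /* illustrative output names */
     weights(keep=name weight);
   set sashelp.class;
   if age >= 13;   /* subsetting IF selects observations, not variables */
run;

/* The KEEP statement is not available in a PROC step, */
/* but the KEEP= data set option is.                   */
proc print data=sashelp.class(keep=name age);
run;
```

Because the step has no explicit OUTPUT statement, each observation that passes the subsetting IF is written to both data sets, with the variable list of each controlled by its own KEEP= option.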

Data Analysis

Type : CREATION_INTERNE

Examples use generated data (datalines) or SASHELP.

1 Code Block

DATA STEP Data

Explanation :
This example demonstrates how to use the KEEP statement to specify which variables to keep in a new `employees_subset` data set. Only the specified variables (`name`, `address`, `city`, `state`, `zip`, `phone`) will be included in the final data set.

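DATA employees;
   /* Comma-delimited input so that values containing blanks are read intact */
   INFILE DATALINES DLM=',';
   LENGTH name $ 20 address $ 30 city $ 20 state $ 2 zip $ 5 phone $ 8;
   /* salary is an extra, illustrative variable so that KEEP has something to exclude */
   INPUT name $ address $ city $ state $ zip $ phone $ salary;
   DATALINES;
John Doe,123 Main St,Anytown,CA,90210,555-1234,52000
Jane Smith,456 Oak Ave,Othercity,NY,10001,555-5678,61000
;
RUN;

DATA employees_subset;
   SET employees;
   /* Only the six listed variables are written to employees_subset */
   KEEP name address city state zip phone;
RUN;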

2 Code Block

DATA STEP Data

Explanation :
This example uses the KEEP statement to include only the `name` and `avg` variables in the `average` output data set. Variables `score1` through `score20`, from which `avg` is calculated, are not written to the `average` data set.

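DATA scores;
   /* Each line holds a name followed by 20 numeric scores */
   INPUT name $ score1-score20;
   DATALINES;
Alice 85 90 78 92 88 76 95 89 80 82 77 91 85 93 86 79 90 84 87 94
Bob 70 65 72 75 68 80 73 78 71 76 69 81 74 79 70 82 75 77 71 80
;
RUN;

DATA average;
   SET scores;
   /* score1-score20 are in the PDV, so they can be used in the calculation */
   avg = MEAN(OF score1-score20);
   /* Write only name and avg; the individual scores are not output */
   KEEP name avg;
RUN;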
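The same idea, computing a value from variables that are then left out of the output, can also be written with KEEP= data set options on both sides of the step. The sketch below uses SASHELP.CLASS so that it runs on its own; the data set name `size_index` and the variable `ratio` are hypothetical names chosen for illustration.

```sas
/* Input side:  read only the variables the calculation needs.  */
/* Output side: write only the identifier and the derived value. */
data size_index(keep=name ratio);   /* illustrative names */
   set sashelp.class(keep=name height weight);
   ratio = weight / height;         /* uses variables that are not written out */
run;

/* Check which variables ended up in the output data set. */
proc contents data=size_index varnum;
run;
```

Note that the input-side KEEP= must list `height` and `weight`: a variable excluded there never reaches the PDV and so cannot feed the calculation, whereas the output-side KEEP= only limits what is written.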

This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
