SAS9

How to Conditionally Extract and Concatenate Variable Names

Simon 12 vues

In clinical or marketing data management, it is common to encounter tables where information is scattered across multiple columns. A common task is to identify which columns contain data for a given individual, and then create a summary list of these column names in a single variable.

This article explains how to use the ARRAY, VNAME, and CATX functions in SAS© to transform a "wide" structure into a descriptive concatenated list.

How to Conditionally Extract and Concatenate Variable Names -

The Problem

Let's imagine a dataset containing patient identifiers and several columns corresponding to different studies (Study1ID, Study2ID, etc.). If a patient is enrolled in a study, the column contains an ID. Otherwise, it is empty or missing.

Input Data:

CurrentStudyStudy1IDStudy2IDStudy3IDStudy4ID
001180780.560
002.790.520
003...231

Desired Result:

We want to create a new variable (e.g., Other_enrolled_studies) that lists the names of the studies the patient is participating in, separated by a comma.

CurrentStudyOther_enrolled_studies
001Study1ID, Study2ID, Study4ID
002Study2ID, Study4ID
003Study4ID

The Solution: The Power of VNAME

The key to solving this problem lies in the VNAME() function. Unlike most SAS© functions that process the value of a variable, VNAME returns the name of the variable itself.

Combined with a loop over an array (ARRAY), this function allows for dynamic iteration through columns.

Method 1: The Iterative Approach (Recommended)

This method is the most direct and often the most efficient. It consists of:

  1. Declaring an ARRAY that groups all the columns of interest.

  2. Looping over this array.

  3. Checking if the value is not missing.

  4. If a value exists, concatenate the variable name (via VNAME) to our result variable.

1DATA want;
2 SET have;
3 LENGTH studies $200; /* Définir la longueur pour éviter la troncature */
4
5 /* 1. Déclarer un tableau regroupant les variables commençant par 'Study' */
6 array study(*) study: ;
7
8 /* 2. Boucler sur chaque élément du tableau */
9 DO i=1 to dim(study);
10 /* 3. Vérifier si la valeur n'est pas manquante */
11 IF not missing(study(i)) THEN DO;
12 /* 4. Concaténer le nom de la variable */
13 studies = catx(", ", studies, vname(study(i)));
14 END;
15 END;
16
17 drop i;
18RUN;

Key points of the code:

  • study:: The colon is a wildcard character that selects all variables starting with the prefix "Study".

  • catx(", ", ...): This function is ideal because it automatically handles separators. It does not add a comma at the beginning of the string if it is empty, and it removes unnecessary spaces.

Method 2: The Parallel Arrays Approach

Another way to do this is to use two arrays: one for the numeric values and another to store the names as characters, then concatenate everything at the end.

1
 
1DATA want;
2 SET have;
3 /* Tableau des valeurs sources */
4 array var1(*) Study1ID Study2ID Study3ID Study4ID;
5 /* Tableau temporaire pour stocker les noms */
6 array var2(*) $32 Study1IDc Study2IDc Study3IDc Study4IDc;
7
8 DO i = 1 to dim(var1);
9 IF var1(i) ne . THEN var2(i) = vname(var1(i));
10 END;
11
12 /* Concaténation finale de tout le tableau de noms */
13 newvar = catx(',', of var2(*));
14
15 drop i Study1IDc Study2IDc Study3IDc Study4IDc;
16RUN;

Note on Performance

Although the second method works, it can be less performant on very large datasets if the catx function is placed inside the loop. Indeed, concatenating an entire list in each iteration is an unnecessarily costly operation.

If you use the iterative approach (Method 1), using catx line by line as the conditions are met remains very efficient because only one character string is built progressively.

The combined use of ARRAY and VNAME offers tremendous flexibility for Data Munging. This allows transforming metadata structure (the column names) into usable data (cell values), which is particularly useful for generating automated summary reports.