Data Step

SAS Data Transformation: Switching from "Long" to "Wide" Format and Custom Sorts

Simon 12 views

When manipulating data in SAS©, it is often necessary to reorganize the structure of a table. A classic case is converting a dataset where categories are listed in rows ("long" format) to a format where these categories become distinct columns ("large" or "wide" format).

This article explores how to perform this transformation, calculate sums robustly, and apply complex sorting logic.

The Scenario

Let's imagine a source table named Population1 containing demographic data by nation and gender:

SAS Data Transformation: Switching from "Long" to "Wide" Format and Custom Sorts -

The objective is to obtain a table Population2 structured as follows, with a column for the total and separate columns for each gender:

SAS Data Transformation: Switching from "Long" to "Wide" Format and Custom Sorts -

Step 1: Data Transposition

Although it is technically possible to use a complex DATA step with RETAIN statements and conditions, the most efficient and least error-prone method relies on the TRANSPOSE procedure.

Before transposing, it is imperative to sort the data by the grouping variable (here, Nation).

1/* 1. Tri préalable indispensable */
2PROC SORT DATA = population1;
3 BY Nation;
4RUN;
5 
6/* 2. Transposition */
7PROC TRANSPOSE DATA = population1 out = temp_pop(drop=_name_);
8 BY Nation; /* La ligne devient unique par Nation */
9 id Gender; /* Les valeurs de Gender deviennent les noms de colonnes */
10 var Number; /* Ce sont ces chiffres qui remplissent les colonnes */
11RUN;
At this stage, the table contains the Nation, Men, and Women columns.
SAS Data Transformation: Switching from "Long" to "Wide" Format and Custom Sorts -

Step 2: Calculating the Total and Handling Missing Values

Once the columns are created, we need to calculate the total population. There is a crucial difference between simple arithmetic addition and the sum function in SAS©.

The naive approach

1Population = Men + Women;
If one of the two variables (Men or Women) contains a missing value, the result of the operation will also be missing.

The recommended approach

1Population = sum(Men, Women);
The SUM() function ignores missing values (treating them as zeros). If data is available for men but not for women, the total will still show the figure for men instead of a null value, which is generally the desired behavior.

Step 3: Custom Sort (Conditional Logic)

Now let's assume a specific sorting constraint:

  1. Sort the nations by descending Population.

  2. Exception: A specific nation (e.g., "MyCountry") must always appear in the last position, regardless of its population figure.

To achieve this without modifying the displayed data, the trick is to create a temporary sorting variable (or "rank").

The Logic

We assign the population value to this sorting variable for all nations, except for the special one, to which we assign an artificially low value (for example, -1).

1DATA population2;
2 SET temp_pop;
3
4 /* Calcul de la somme robuste */
5 Population = sum(Men, Women);
6 
7 /* Création de la variable de tri temporaire */
8 IF Nation = "MyCountry" THEN SortKey = -1;
9 ELSE SortKey = Population;
10RUN;
11 
12/* Tri final sur la clé temporaire */
13PROC SORT DATA = population2;
14 BY descending SortKey;
15RUN;
16 
17/* Nettoyage final : on retire la clé de tri */
18DATA population2;
19 SET population2(drop=SortKey);
20RUN;

To efficiently transform a table:

  1. Use PROC SORT then PROC TRANSPOSE with an ID statement to pivot rows into columns.

  2. Prefer the SUM() function over the + operator to avoid the propagation of missing values.

  3. For complex sorts (like forcing a row to the bottom of the table), create a temporary technical variable, sort by it, and then drop it.