SAS9

Understanding the Hierarchy of Multi-Level Sorting with PROC SORT

Simon 13 views

One of the first commands you learn in SAS© programming is the PROC SORT. While its use seems trivial for a single variable, the results can sometimes confuse users when it comes to applying a sort on three or more criteria.

A common confusion arises when trying to sort a table according to a complex logic, for example: by city (ascending), then by debt (descending), and finally by account number (ascending). When inspecting the result, it often happens that the last variable (the account number) seems completely out of order.

Is it a code error? A SAS© bug? No, it's simply a misinterpretation of the hierarchical logic of the sort.

Understanding the Hierarchy of Multi-Level Sorting with PROC SORT -

The "Tie-Breaker" Principle

To understand why your third variable does not seem to be sorted, you have to visualize the sort as a series of filters or Russian dolls.

When you submit the following instruction:

Note :
PROC SORT data=ma_table;
BY town DESCENDING debt accountnumber;
RUN;

SAS© does not try to sort all columns independently. It applies a strict hierarchy:

  1. Level 1 (Town): SAS© first sorts the entire dataset by city.

  2. Level 2 (Debt): The sort by debt is activated only within the groups where the city is identical. It is a tie-breaking criterion.

  3. Level 3 (AccountNumber): This sort only becomes visible and effective if the first two criteria (City AND Debt) are strictly identical for several observations.

The Illusion of Disorder

The perception problem comes from the fact that, in many real-world datasets, the combination of the first two variables is often unique.

Imagine a row where the city is "Paris" and the debt is "500". If there is no other row with exactly "Paris" and "500", the sort by account number never has a chance to be applied. The account number will therefore appear in the position determined solely by the first two criteria. If you look at the accountnumber column from top to bottom without considering the others, it will seem random.

A Concrete Example

Let's take a specific case to illustrate how this works. Suppose we have several observations, but the sort by account number is only effective for a specific subgroup:

  • Group A (Town='Apex', Debt=50.00): A single line. The account number is displayed as is.

  • Group B (Town='Apex', Debt=37.95): Three lines exist with these exact values.

  • Group C (Town='Boston', Debt=100.00): A single line.

In this scenario, the sorting of the account number (3rd variable) will only be visible for Group B. If the account numbers for this group are 3131, 5108, and 9923, they will indeed appear in this ascending order.

However, if you visually compare the account number of Group A with that of Group B, there will be no apparent sorting logic between them, because the higher hierarchy (the Debt) has taken precedence.

If your third sorting variable seems out of order, check your data. It is very likely that you have very few duplicates on the combination of the first two variables.

The sort works correctly, but its effect is "masked" by the high cardinality of the preceding criteria. The sort by variable n is only useful for breaking ties from variables 1 to n-1.