Data Step

How to efficiently extract the last observations from a table?

Simon 14 vistas

During data exploration or debugging, it is common to want to isolate only the "end" of a table (dataset). Unlike the OBS=10 option which gives us the beginning, retrieving the last 10 observations from a table containing $N$ rows requires a more thoughtful approach, especially if the table is large.

Here is a compilation of the best methods from the SAS© community, ranked by performance and use case.

Note :
The Naive Approach (Descending Sort)
The first idea that often comes to mind is to sort the table in reverse chronological order and take the first few rows.
1PROC SORT DATA=MA_TABLE out=sorted_table;
2 BY descending date_variable;
3RUN;
4 
5DATA last_ten;
6 SET sorted_table (obs=10);
7RUN;
Verdict: To be avoided on large tables.

Why? Sorting a table with millions of rows consumes a huge amount of CPU and I/O resources just to read 10 lines. It's inefficient.
Note :
The Arithmetic Approach (NOBS and IF)
A much cleaner method is to use the NOBS option in the SET statement. This option stores the total number of observations in a temporary variable at compile time, without having to read the table.
1DATA last_ten;
2 SET MA_TABLE nobs=total_obs;
3 /* On ne garde que si le numéro de ligne (_N_) est dans les 10 derniers */
4 IF _N_ > (total_obs - 10);
5RUN;
Verdict: Good for medium-sized tables.

Why? Although we only keep 10 rows, SAS© still has to sequentially read the entire table to the end to test the IF condition. On a table with a billion rows, this will be slow.
Note :
The High-Performance Approach (Direct Access with POINT=)
This is the optimal solution recommended by experts. The goal is to use random access (Direct Access) to jump directly to the end of the file without reading the beginning.

We combine NOBS (to know the end) and POINT (to go to a specific line).
1DATA last_ten;
2 /* Boucle de la ligne N-9 jusqu'à la ligne N */
3 DO i = (total_obs - 9) to total_obs;
4 SET MA_TABLE nobs=total_obs point=i;
5 OUTPUT;
6 END;
7 stop; /* IMPORTANT : stop évite une boucle infinie avec l'instruction POINT */
8RUN;
Verdict: The best performance.

Why? SAS© only reads 10 lines, even if the table contains 100 million. The execution is almost instantaneous.

Technical note: The STOP statement is mandatory because the SET statement with POINT= never encounters an End-of-File marker.
Note :
The Macro Approach (Dynamic FIRSTOBS)
An interesting alternative is to calculate the starting point in a previous step, then use the FIRSTOBS option.
1/* Étape 1 : Calculer le point de départ */
2DATA _null_;
3 SET MA_TABLE nobs=n;
4 start_point = n - 9;
5 call symput('start_obs', start_point);
6RUN;
7 
8/* Étape 2 : Lire à partir de ce point */
9DATA last_ten;
10 SET MA_TABLE (firstobs=&start_obs);
11RUN;
Verdict: Functional but verbose. Method 3 remains superior because it is done in a single Data step.

Points of Attention and Limitations

To conclude, here are some technical nuances raised during the discussion:

  1. Views and DBMS: The methods using NOBS and POINT= work perfectly on native Base SAS© tables. However, they can fail on Views or tables from external databases (Oracle, SQL Server via SAS©/ACCESS) because these engines do not always know the number of rows in advance or do not support direct access by row number.

  2. The _N_ variable: In method 3, you can use the automatic variable _N_ as a loop counter (instead of i).

    • Advantage: _N_ is not written to the output table (no need for DROP).

    • Disadvantage: This can be confusing for beginners, as _N_ usually refers to the number of Data step iterations.

In summary: For a large, classic SAS© table, always prefer Method 3 (POINT=).