SAS9

Delete all duplicates (without keeping a copy)

Simon 12 Aufrufe

In data management, deduplication is a classic task. Usually, when we talk about "removing duplicates," we want to keep one unique row for each key and eliminate repetitions. This is what a PROC SORT with the NODUPKEY option does very well.

However, there is a stricter scenario: completely delete any group of data that has duplicates. If an identifier appears multiple times, the data is considered "contaminated" or ambiguous, and we want to keep no trace of these records. Only truly unique identifiers (those that appear only once in the original table) must survive.

Delete all duplicates (without keeping a copy) -

The Problem

Let's take the example of a table containing a customer identifier (ID) and a year (Year). An observation is defined by the combination ID + Year.

Input Data:

ObsIDYearVar1Status
1119995Unique (to keep)
22200010Duplicate
3220008Duplicate
4220006Duplicate
5320017Unique (to keep)
64200212Duplicate
74200215Duplicate

The goal is to obtain a table that contains only observations 1 and 5. The groups for IDs 2 and 4 must be completely removed.

If you use PROC SORT NODUPKEY, SAS© would keep the first row of each group (row 2 and row 6 would remain), which is not the desired result here.

The SQL Solution: GROUP BY and HAVING

The most elegant and concise method to perform this operation uses PROC SQL. The idea is to group the data by the identification key, count the number of elements in each group, and filter only those whose count is strictly equal to 1.

The Code:

1PROC SQL;
2 create TABLE want as
3 select *
4 from have
5 group BY id, year
6 having count(*) = 1;
7QUIT;

How does it work?

  1. GROUP BY id, year: SAS© virtually groups the rows that share the same ID/Year pair.

  2. HAVING count(*) = 1: This is the filtering condition applied after grouping.

    • For ID 1 (Year 1999), the count is 1. -> Kept.

    • For ID 2 (Year 2000), the count is 3. -> Rejected (all rows in the group are ignored).

This approach is very efficient because it avoids multiple steps of sorting and flagging in a classic Data step.

Data Step Alternative (For information)

For DATA step purists, achieving the same result would require a "double-read" logic or using the automatic variables first. and last. after a sort, by checking if first.id is equal to last.id (which means there is only one row for this ID).

1/* Nécessite un tri préalable */
2PROC SORT DATA=have; BY id year; RUN;
3 
4DATA want;
5 SET have;
6 BY id year;
7 /* On ne garde que si c'est à la fois le premier et le dernier du groupe */
8 IF first.year and last.year THEN OUTPUT;
9RUN;