In data management, deduplication is a classic task. Usually, when we talk about "removing duplicates," we want to keep one unique row for each key and eliminate repetitions. This is what a PROC SORT with the NODUPKEY option does very well.
However, there is a stricter scenario: completely delete any group of data that has duplicates. If an identifier appears multiple times, the data is considered "contaminated" or ambiguous, and we want to keep no trace of these records. Only truly unique identifiers (those that appear only once in the original table) must survive.
Let's take the example of a table containing a customer identifier (ID) and a year (Year). An observation is defined by the combination ID + Year.
Input Data:
| Obs | ID | Year | Var1 | Status |
| 1 | 1 | 1999 | 5 | Unique (to keep) |
| 2 | 2 | 2000 | 10 | Duplicate |
| 3 | 2 | 2000 | 8 | Duplicate |
| 4 | 2 | 2000 | 6 | Duplicate |
| 5 | 3 | 2001 | 7 | Unique (to keep) |
| 6 | 4 | 2002 | 12 | Duplicate |
| 7 | 4 | 2002 | 15 | Duplicate |
The goal is to obtain a table that contains only observations 1 and 5. The groups for IDs 2 and 4 must be completely removed.
If you use PROC SORT NODUPKEY, SAS© would keep the first row of each group (row 2 and row 6 would remain), which is not the desired result here.
The most elegant and concise method to perform this operation uses PROC SQL. The idea is to group the data by the identification key, count the number of elements in each group, and filter only those whose count is strictly equal to 1.
The Code:
How does it work?
GROUP BY id, year: SAS© virtually groups the rows that share the same ID/Year pair.
HAVING count(*) = 1: This is the filtering condition applied after grouping.
For ID 1 (Year 1999), the count is 1. -> Kept.
For ID 2 (Year 2000), the count is 3. -> Rejected (all rows in the group are ignored).
This approach is very efficient because it avoids multiple steps of sorting and flagging in a classic Data step.
For DATA step purists, achieving the same result would require a "double-read" logic or using the automatic variables first. and last. after a sort, by checking if first.id is equal to last.id (which means there is only one row for this ID).