Delete all duplicates (without keeping a copy)

In data management, deduplication is a classic task. Usually, when we talk about "removing duplicates," we want to keep one unique row for each key and eliminate repetitions. This is what a PROC SORT with the NODUPKEY option does very well.

However, there is a stricter scenario: completely delete any group of data that has duplicates. If an identifier appears multiple times, the data is considered "contaminated" or ambiguous, and we want to keep no trace of these records. Only truly unique identifiers (those that appear only once in the original table) must survive.

Delete all duplicates (without keeping a copy) -

Inhaltsverzeichnis

The Problem

Let's take the example of a table containing a customer identifier (ID) and a year (Year). An observation is defined by the combination ID + Year.

Input Data:

Obs	ID	Year	Var1	Status
1	1	1999	5	Unique (to keep)
2	2	2000	10	Duplicate
3	2	2000	8	Duplicate
4	2	2000	6	Duplicate
5	3	2001	7	Unique (to keep)
6	4	2002	12	Duplicate
7	4	2002	15	Duplicate

The goal is to obtain a table that contains only observations 1 and 5. The groups for IDs 2 and 4 must be completely removed.

If you use PROC SORT NODUPKEY, SAS^© would keep the first row of each group (row 2 and row 6 would remain), which is not the desired result here.

The SQL Solution: GROUP BY and HAVING

The most elegant and concise method to perform this operation uses PROC SQL. The idea is to group the data by the identification key, count the number of elements in each group, and filter only those whose count is strictly equal to 1.

The Code:

1	PROC SQL;
2	create TABLE want as
3	select *
4	from have
5	group BY id, year
6	having count(*) = 1;
7	QUIT;

How does it work?

GROUP BY id, year: SAS^© virtually groups the rows that share the same ID/Year pair.
HAVING count(*) = 1: This is the filtering condition applied after grouping.
- For ID 1 (Year 1999), the count is 1. -> Kept.
- For ID 2 (Year 2000), the count is 3. -> Rejected (all rows in the group are ignored).

This approach is very efficient because it avoids multiple steps of sorting and flagging in a classic Data step.

Data Step Alternative (For information)

For DATA step purists, achieving the same result would require a "double-read" logic or using the automatic variables first. and last. after a sort, by checking if first.id is equal to last.id (which means there is only one row for this ID).

1	/* Nécessite un tri préalable */
2	PROC SORT DATA=have; BY id year; RUN;
3
4	DATA want;
5	SET have;
6	BY id year;
7	/* On ne garde que si c'est à la fois le premier et le dernier du groupe */
8	IF first.year and last.year THEN OUTPUT;
9	RUN;

Wichtiger Haftungsausschluss

Die auf WeAreCAS.eu bereitgestellten Codes und Beispiele dienen Lehrzwecken. Es ist zwingend erforderlich, sie nicht blind in Ihre Produktionsumgebungen zu kopieren. Der beste Ansatz besteht darin, die Logik zu verstehen, bevor sie angewendet wird. Wir empfehlen dringend, diese Skripte in einer Testumgebung (Sandbox/Dev) zu testen. WeAreCAS übernimmt keine Verantwortung für mögliche Auswirkungen oder Datenverluste auf Ihren Systemen.

Zurück zur Artikelliste