Delete all duplicates (without keeping a copy)

In data management, deduplication is a classic task. Usually, when we talk about "removing duplicates," we want to keep one unique row for each key and eliminate repetitions. This is what a PROC SORT with the NODUPKEY option does very well.

However, there is a stricter scenario: completely delete any group of data that has duplicates. If an identifier appears multiple times, the data is considered "contaminated" or ambiguous, and we want to keep no trace of these records. Only truly unique identifiers (those that appear only once in the original table) must survive.

The Problem

Let's take the example of a table containing a customer identifier (ID) and a year (Year). An observation is defined by the combination ID + Year.

Input Data:

Obs	ID	Year	Var1	Status
1	1	1999	5	Unique (to keep)
2	2	2000	10	Duplicate
3	2	2000	8	Duplicate
4	2	2000	6	Duplicate
5	3	2001	7	Unique (to keep)
6	4	2002	12	Duplicate
7	4	2002	15	Duplicate

The goal is to obtain a table that contains only observations 1 and 5. The groups for IDs 2 and 4 must be completely removed.
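To follow along, the input table can be recreated with a short DATA step. This is only a minimal sketch: the dataset name have matches the code used in the rest of the article, the Status column is left out because it is an annotation rather than data, and the lowercase variable names id, year and var1 are assumptions.

```sas
/* Sketch: rebuild the example table (variable names are assumed) */
data have;
    input id year var1;
    datalines;
1 1999 5
2 2000 10
2 2000 8
2 2000 6
3 2001 7
4 2002 12
4 2002 15
;
run;
```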

If you use PROC SORT with the NODUPKEY option, SAS keeps the first row of each group: rows 2 and 6 would remain alongside rows 1 and 5, which is not the desired result here.
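For comparison, here is what the NODUPKEY approach might look like on the example table. It is a sketch, assuming a dataset named have with the seven rows listed above; the output name nodup is hypothetical and is used so that have is not overwritten.

```sas
/* Sketch: classic deduplication, one row kept per ID/Year pair */
proc sort data=have out=nodup nodupkey;
    by id year;
run;

/* nodup holds one row for each of the four ID/Year pairs,
   so IDs 2 and 4 are still present */
proc print data=nodup;
run;
```

The duplicated groups are reduced to one row each instead of disappearing, which is why a different technique is needed.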

The SQL Solution: GROUP BY and HAVING

The most concise way to perform this operation uses PROC SQL. The idea is to group the data by the identification key, count the rows in each group, and keep only the groups whose count is exactly 1.

The Code:

PROC SQL;
    CREATE TABLE want AS
    SELECT *
    FROM have
    GROUP BY id, year
    HAVING COUNT(*) = 1;
QUIT;

How does it work?

GROUP BY id, year: SAS logically groups the rows that share the same ID/Year pair.
HAVING count(*) = 1: This is the filtering condition applied after grouping.
- For ID 1 (Year 1999), the count is 1. -> Kept.
- For ID 2 (Year 2000), the count is 3. -> Rejected (all rows in the group are ignored).
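
The counts that the HAVING clause tests can be inspected directly before filtering. A minimal sketch, assuming the same have table; the column alias n_rows is hypothetical.

```sas
/* Sketch: show how many rows each ID/Year pair has */
proc sql;
    select id, year, count(*) as n_rows
    from have
    group by id, year;
quit;
```

On the example data this lists a count of 1 for IDs 1 and 3, 3 for ID 2 and 2 for ID 4, so only the first two pass the count(*) = 1 test.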

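This approach is concise: a single step replaces the separate sorting and flagging steps of a classic DATA step. Because SELECT * returns columns that are not listed in the GROUP BY clause, SAS remerges the group count with the detail rows and writes a note about this remerging to the log.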
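
The query shown earlier relies on SAS remerging the group count with the detail rows. If you prefer a form that does not depend on that behaviour, one possible variant first selects the keys that occur once and then joins them back to the table. This is a sketch, not the article's method; the names want_join, h and u are hypothetical.

```sas
/* Sketch: same result with an explicit join on the unique keys */
proc sql;
    create table want_join as
    select h.*
    from have as h
    inner join
        (select id, year
         from have
         group by id, year
         having count(*) = 1) as u
    on h.id = u.id and h.year = u.year;
quit;
```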

Data Step Alternative (For information)

For DATA step purists, the same result requires either a "double-read" logic or, after a sort, the automatic variables first. and last. Because the key is the ID/Year pair, the test is made on the last BY variable: a row for which both first.year and last.year are true is the only row of its ID/Year group.

/* Requires a prior sort */
PROC SORT DATA=have; BY id year; RUN;

DATA want;
    SET have;
    BY id year;
    /* Keep the row only if it is both the first and the last of its group */
    IF first.year AND last.year THEN OUTPUT;
RUN;
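
The "double-read" logic mentioned above can be sketched with two SET statements on the same table: the first loop counts the rows of a group, the second re-reads those rows and outputs them only when the count is 1. This is a minimal sketch, assuming have is already sorted by id and year; the names want_dow, n_rows and _i are hypothetical.

```sas
/* Sketch: double-read of each ID/Year group (have must be sorted) */
data want_dow;
    /* First pass: count the rows of the current group */
    n_rows = 0;
    do until (last.year);
        set have;
        by id year;
        n_rows + 1;
    end;
    /* Second pass: re-read the same rows, keep them only if the group has one row */
    do _i = 1 to n_rows;
        set have;
        if n_rows = 1 then output;
    end;
    drop n_rows _i;
run;
```

For this particular task the first./last. test is shorter; the double-read pattern becomes useful when the condition depends on something other than the group size, such as a total or a maximum.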

Important Disclaimer

The code and examples provided on WeAreCAS.eu are for educational purposes. It is imperative not to blindly copy-paste them into your production environments. The best approach is to understand the logic before applying it. We strongly recommend testing these scripts in a test environment (Sandbox/Dev). WeAreCAS accepts no responsibility for any impact or data loss on your systems.

Published on: 29/04/2021
