Data Cleaning and Duplicate Removal for Consecutive Dates

When processing clinical or transactional data, "incomplete" duplicates often occur. A classic case is when an event (like a medical treatment) is recorded twice: once on the start date and once on the following day, creating two observations for the same event.

This article explores how to clean a dataset where each observation is repeated on two consecutive days, with the goal of removing the first observation (the oldest date) and keeping only the second (the most recent date).

The Problem

Let's imagine a dataset containing the variables id, group, and treatmentdate. Each treatment spans two consecutive days, creating two rows. Furthermore, a single identifier (id) can have multiple distinct treatment periods within the same group.

Example Raw Data:

id	group	treatmentdate	Note
A1	0	30Sep2017	To be deleted
A1	0	01Oct2017	To be kept
A2	1	06Nov2017	To be deleted
A2	1	07Nov2017	To be kept
A1	0	23Oct2017	To be deleted (New episode for A1)
A1	0	24Oct2017	To be kept
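To experiment with the example, the rows above can be loaded with a short DATA step. This is a minimal sketch: the dataset name have matches the one used by the PROC SORT later in the article, while the character type for id and the date9. informat and format are assumptions about how the data is stored.

```sas
/* Sample data from the table above (the Note column is omitted) */
data have;
    input id $ group treatmentdate :date9.;   /* assumed: id is character, dates read as date9. */
    format treatmentdate date9.;
    datalines;
A1 0 30Sep2017
A1 0 01Oct2017
A2 1 06Nov2017
A2 1 07Nov2017
A1 0 23Oct2017
A1 0 24Oct2017
;
run;
```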

A naive approach using PROC SQL with GROUP BY id, group and MAX(treatmentdate) would fail here because it collapses all treatment periods of an id/group pair into a single row (e.g., for A1, it would only keep October 24 and lose October 1).
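The failure is easy to reproduce. The query below is one plausible form of that naive approach, assuming have contains the six example rows; the output table name naive and the alias h are hypothetical.

```sas
proc sql;
    /* One row per (id, group): only the latest date survives */
    create table naive as
    select h.id,
           h.group,
           max(h.treatmentdate) as treatmentdate format=date9.
    from have as h
    group by h.id, h.group;
quit;

/* naive holds two rows: A1/0/24Oct2017 and A2/1/07Nov2017.
   The 01Oct2017 row for A1, which should be kept, is gone. */
```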

The Optimal Solution

A robust approach combines a carefully chosen sort (PROC SORT) with a DATA step. The idea is to use the DIF function to compare each row with the one before it, while guarding against comparisons across BY-group boundaries.

Step 1: The Sort (PROC SORT)

The trick is to sort the data in descending order by date. By placing the most recent date first, we transform the problem: instead of 'looking at the next row to see if it's the same,' we can simply compare the current row with the previous one.

PROC SORT DATA=have OUT=inter;
    BY id group DESCENDING treatmentdate;
RUN;

Why DESCENDING? If we have the dates 30Sep and 01Oct, the descending sort places 01Oct first and 30Sep second. Since we want to keep 01Oct, it gets processed first (and is kept by default), while 30Sep can be identified as the 'previous day' relative to the preceding row and be deleted.
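A quick way to confirm the order, assuming the sample rows shown earlier, is to print inter. The FORMAT statement is only there in case treatmentdate has no date format attached.

```sas
proc print data=inter noobs;
    format treatmentdate date9.;
run;

/* Expected row order for the sample data:
   A1 0 24Oct2017
   A1 0 23Oct2017
   A1 0 01Oct2017
   A1 0 30Sep2017
   A2 1 07Nov2017
   A2 1 06Nov2017 */
```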

Step 2: The Cleanup (DATA Step)

Here is the code to filter the data:

DATA want;
    SET inter;
    BY id group DESCENDING treatmentdate;

    /* The key condition: drop a row dated exactly one day before the row
       above it, unless it is the first row of a new id/group */
    IF dif1(treatmentdate) = -1 AND NOT first.group THEN DELETE;
RUN;

Detailed Code Analysis

dif1(treatmentdate): This function returns the value of treatmentdate in the current row minus the value from the previous call to the function. When it is called on every row, that is row N minus row N-1; on the very first row it returns a missing value.
- In our sorted case: Previous row = 01Oct, Current row = 30Sep.
- Calculation: 30Sep - 01Oct = -1.
- If the result is -1, this confirms that the current row is exactly one day before the previous row.
not first.group: This is a crucial safeguard.
- The DIF function does not "see" groups; it stubbornly compares row 10 with row 9, even if the ID changes.
- If the last row for patient A is 05Nov and the first row for patient B is 04Nov, DIF will return -1. Without this protection, you would accidentally delete the first row of patient B.
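- not first.group ensures that the deletion never occurs on the first row of a new group. Because group comes after id in the BY statement, first.group is also set to 1 whenever id changes.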
Order of Conditions: DIF must be executed on every row, because each call stores the current value for the next comparison. Put dif1(...) first in the condition, or otherwise make sure it always runs. In SAS, when a condition is written as condition1 and condition2 and condition1 is false, condition2 may not be evaluated; if the skipped operand is the DIF call, the next difference is computed against a stale value rather than the immediately preceding row.
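One way to make the evaluation order a non-issue is to call DIF in its own assignment statement, so that it runs on every row regardless of how the condition is written. The following is a sketch of that variant, not the article's original code; the variable name gap and the dataset name want_safe are hypothetical.

```sas
data want_safe;
    set inter;
    by id group descending treatmentdate;

    /* Unconditional call: the DIF queue is updated on every row */
    gap = dif1(treatmentdate);

    /* gap is missing on the very first row, so that row is never deleted here */
    if gap = -1 and not first.group then delete;

    drop gap;
run;
```

Removing the DROP statement keeps gap in the output, which can help when checking the logic on real data.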

Why Not Use Other Methods?

Mathematical Approach (MOD(_N_, 2)): One might be tempted to keep every second row (if mod(_n_,2)=0). This is very risky. It only works if every event has exactly two rows; if a single observation is missing (due to an input error, for example), every pair after it is shifted by one row and the wrong rows are kept for the rest of the data.
SQL Approach (HAVING MAX(date)): As mentioned earlier, grouping by id and group and keeping the maximum date returns a single row per group. If a patient has two different treatments in the same group, only the most recent one survives and the earlier episodes are lost.
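The row-shift problem with MOD(_N_, 2) can be seen on a small hypothetical input: the sample data sorted in ascending order, with the 30Sep2017 row for A1 missing. The dataset names below are made up for the illustration.

```sas
/* Sample data in ascending order, minus the A1 30Sep2017 row */
data gap_sorted;
    input id $ group treatmentdate :date9.;
    format treatmentdate date9.;
    datalines;
A1 0 01Oct2017
A1 0 23Oct2017
A1 0 24Oct2017
A2 1 06Nov2017
A2 1 07Nov2017
;
run;

data risky;
    set gap_sorted;
    if mod(_n_, 2) = 0;   /* subsetting IF: keeps rows 2 and 4 only */
run;

/* risky holds 23Oct2017 (A1) and 06Nov2017 (A2): both are rows that
   should have been deleted, and every row that should be kept is lost. */
```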

To delete the oldest observation of a pair of consecutive dates:

- Sort by id, group, and descending treatmentdate.
- Compute the difference from the previous row with DIF.
- Delete the row if the difference is -1 (it is dated the day before the row above it), except on the first row of each BY group (FIRST.group).

Important Notice

The code and examples provided on WeAreCAS.eu are for educational purposes. Do not copy and paste them blindly into your production environments. The best approach is to understand the logic before applying it. We strongly recommend testing these scripts in a test environment (Sandbox/Dev). WeAreCAS accepts no responsibility for any impact or data loss on your systems.
