Data Step

Data Cleaning and Duplicate Removal for Consecutive Dates

Simon 10 vues

When processing clinical or transactional data, "incomplete" duplicates often occur. A classic case is when an event (like a medical treatment) is recorded twice: once on the start date and once on the following day, creating two observations for the same event.

This article explores how to clean a dataset where each observation is repeated on two consecutive days, with the goal of removing the first observation (the oldest date) and keeping only the second (the most recent date).

The Problem

Let's imagine a dataset containing the variables id, group, and treatmentdate. Each treatment spans two consecutive days, creating two rows. Furthermore, a single identifier (id) can have multiple distinct treatment periods within the same group.

Example Raw Data:

idgrouptreatmentdateNote
A1030Sep2017To be deleted
A1001Oct2017To be kept
A2106Nov2017To be deleted
A2107Nov2017To be kept
A1023Oct2017To be deleted (New episode for A1)
A1024Oct2017To be kept

A naive approach using PROC SQL with a GROUP BY id, group and MAX(day) would fail here because it would overwrite the distinction between different treatment periods (e.g., for A1, it would only keep October 24 and lose October 1).

The Optimal Solution

The most robust method is based on using the DATA Step in combination with an intelligent sort (PROC SORT). The idea is to use the DIF function to compare data between rows while protecting group changes.

Step 1: The Sort (PROC SORT)

The trick is to sort the data in descending order by date. By placing the most recent date first, we transform the problem: instead of 'looking at the next row to see if it's the same,' we can simply compare the current row with the previous one.

1 
2PROC SORT
3DATA=have out=inter;
4 
5BY id group DESCENDING treatmentdate;
6 
7RUN;
8 
Why DESCENDING? If we have the dates 30Sep and 01Oct, the descending sort places 01Oct first and 30Sep second. Since we want to keep 01Oct, it gets processed first (and is kept by default), while 30Sep can be identified as the 'previous day' relative to the preceding row and be deleted.

Step 2: The Cleanup (DATA Step)

Here is the code to filter the data:

1DATA want;
2 SET inter;
3 BY id group DESCENDING treatmentdate;
4 
5 /* La condition magique */
6 IF dif1(treatmentdate) = -1 and not first.group THEN delete;
7RUN;

Detailed Code Analysis

  1. dif1(treatmentdate): This function calculates the difference between the value of treatmentdate in the current row and that of the previous row (row N - row N-1).

    • In our sorted case: Previous row = 01Oct, Current row = 30Sep.

    • Calculation: 30Sep - 01Oct = -1.

    • If the result is -1, this confirms that the current row is exactly one day before the previous row.

  2. not first.group: This is a crucial safeguard.

    • The DIF function does not "see" groups; it stubbornly compares row 10 with row 9, even if the ID changes.

    • If the last row for patient A is 05Nov and the first row for patient B is 04Nov, DIF will return -1. Without this protection, you would accidentally delete the first row of patient B.

    • not first.group ensures that the deletion never occurs on the first row of a new group.

  3. Order of Conditions: It is imperative to place dif(...) first or ensure it is executed for every row. In SAS©, if you use if condition1 and condition2 and condition1 is false, condition2 is sometimes not evaluated. However, for the DIF function, it is critical that it "reads" every row to keep its lagged memory correct.

Why Not Use Other Methods?

  • Mathematical Approach (MOD(_N_, 2)): One might be tempted to keep every second row (if mod(_n_,2)=0). This is very risky. If a single observation is missing from your dataset (due to an input error) or you have an odd number of rows, the entire offset will propagate and corrupt the rest of the data.

  • SQL Approach (HAVING MAX(date)): As mentioned earlier, SQL often aggregates too broadly. If a patient has two different treatments in one month, GROUP BY risks keeping only one (the last of the month), thereby losing the intermediate history.

To delete the oldest observation of a pair of consecutive dates:

  1. Sort by ID and date in descending order.

  2. Calculate the difference with the previous row (DIF).

  3. Delete if the difference is -1 (previous day), but protect the first row of each group (FIRST.group).