Data Step

Data Cleaning and De-duplication of Consecutive Dates

Simon 15 Aufrufe

In clinical or transactional data processing, it's common to encounter "imperfect" duplicates. A classic case is when an event (like a medical treatment) is recorded twice: once on the start date and again the following day, thus creating two observations for the same event.

This article explores how to clean a dataset where each observation is repeated over two consecutive days, with the goal of deleting the first observation (the oldest date) and keeping only the second (the most recent date).

The Problem

Imagine a dataset containing the variables id, group and treatmentdate. Each treatment spans two consecutive days, generating two rows. Furthermore, the same identifier (id) can have several distinct treatment episodes over time within the same group.

Example of raw data:

idgrouptreatmentdateNote
A1030Sep2017To be deleted
A1001Oct2017To be kept
A2106Nov2017To be deleted
A2107Nov2017To be kept
A1023Oct2017To be deleted (New episode for A1)
A1024Oct2017To be kept

A naive approach using PROC SQL with a GROUP BY id, group and MAX(day) would fail here, because it would overwrite the distinction between the different treatment episodes (for example, for A1, it would only keep October 24 and would lose October 1).

The Optimal Solution

The most robust method relies on using the DATA Step combined with smart sorting (PROC SORT). The idea is to use the DIF function to compare dates between rows, while protecting group changes.

Step 1: Sorting (PROC SORT)

The trick is to sort the data in descending order of date. By putting the most recent date first, we transform the problem: instead of "looking for the next row to see if it's the same," we can simply compare the current row to the previous one.

1 
2PROC SORT
3DATA=have out=inter;
4 
5BY id group DESCENDING treatmentdate;
6 
7RUN;
8 
Why DESCENDING? If we have the dates 30Sep and 01Oct, the descending sort places 01Oct first and 30Sep second. Since we want to keep 01Oct, it will be processed first (and kept by default), while 30Sep can be identified as "the day before" the previous row and deleted.

Step 2: Cleaning (DATA Step)

Here is the code to filter the data:

1DATA want;
2 SET inter;
3 BY id group DESCENDING treatmentdate;
4 
5 /* La condition magique */
6 IF dif1(treatmentdate) = -1 and not first.group THEN delete;
7RUN;

Detailed code analysis

  1. dif1(treatmentdate): This function calculates the difference between the value of treatmentdate for the current row and that of the previous row (Row N - Row N-1).

    • In our sorted case: Previous row = 01Oct, Current row = 30Sep.

    • Calculation: 30Sep - 01Oct = -1.

    • If the result is -1, it confirms that the current row is exactly one day before the previous row.

  2. not first.group: This is a crucial safeguard.

    • The DIF function does not "see" the groups; it foolishly compares row 10 to row 9, even if the ID changes.

    • If the last row for patient A is 05Nov and the first row for patient B is 04Nov, DIF will return -1. Without this protection, you would mistakenly delete the first row for patient B.

    • not first.group ensures that the deletion is never performed on the first row of a new group.

  3. Order of conditions: It is imperative to place dif(...) first or ensure it is executed for every row. In SAS©, if you use if condition1 and condition2, and condition1 is false, condition2 is sometimes not evaluated. However, for the DIF function, it is vital that it "reads" each row to maintain its correct lagged memory.

Why not use other methods?

  • Mathematical approach (MOD(_N_, 2)): One might be tempted to keep every other row (if mod(_n_,2)=0). This is very risky. If a single observation is missing from your dataset (due to a data entry error) or if you have an odd number of rows, the entire offset will propagate and corrupt the rest of the data.

  • SQL approach (HAVING MAX(date)): As mentioned earlier, SQL often aggregates too broadly. If a patient has two distinct treatments within the month, the GROUP BY is likely to keep only one (the last of the month), thus losing the intermediate history.

To delete the oldest observation from a pair of consecutive dates:

  1. Sort by ID and by date in descending order.

  2. Calculate the difference with the previous row (DIF).

  3. Delete if the difference is -1 (previous day), but protect the first row of each group (FIRST.group).