In clinical or transactional data processing, it's common to encounter "imperfect" duplicates. A classic case is when an event (like a medical treatment) is recorded twice: once on the start date and again the following day, thus creating two observations for the same event.
This article explores how to clean a dataset where each observation is repeated over two consecutive days, with the goal of deleting the first observation (the oldest date) and keeping only the second (the most recent date).
The Problem
Imagine a dataset containing the variables id, group and treatmentdate. Each treatment spans two consecutive days, generating two rows. Furthermore, the same identifier (id) can have several distinct treatment episodes over time within the same group.
Example of raw data:
| id | group | treatmentdate | Note |
| A1 | 0 | 30Sep2017 | To be deleted |
| A1 | 0 | 01Oct2017 | To be kept |
| A2 | 1 | 06Nov2017 | To be deleted |
| A2 | 1 | 07Nov2017 | To be kept |
| A1 | 0 | 23Oct2017 | To be deleted (New episode for A1) |
| A1 | 0 | 24Oct2017 | To be kept |
A naive approach using PROC SQL with a GROUP BY id, group and MAX(day) would fail here, because it would overwrite the distinction between the different treatment episodes (for example, for A1, it would only keep October 24 and would lose October 1).
The Optimal Solution
The most robust method relies on using the DATA Step combined with smart sorting (PROC SORT). The idea is to use the DIF function to compare dates between rows, while protecting group changes.
Step 1: Sorting (PROC SORT)
The trick is to sort the data in descending order of date. By putting the most recent date first, we transform the problem: instead of "looking for the next row to see if it's the same," we can simply compare the current row to the previous one.
