When processing clinical or transactional data, "incomplete" duplicates often occur. A classic case is when an event (like a medical treatment) is recorded twice: once on the start date and once on the following day, creating two observations for the same event.
This article explores how to clean a dataset where each observation is repeated on two consecutive days, with the goal of removing the first observation (the oldest date) and keeping only the second (the most recent date).
The Problem
Let's imagine a dataset containing the variables id, group, and treatmentdate. Each treatment spans two consecutive days, creating two rows. Furthermore, a single identifier (id) can have multiple distinct treatment periods within the same group.
Example Raw Data:
| id | group | treatmentdate | Note |
| A1 | 0 | 30Sep2017 | To be deleted |
| A1 | 0 | 01Oct2017 | To be kept |
| A2 | 1 | 06Nov2017 | To be deleted |
| A2 | 1 | 07Nov2017 | To be kept |
| A1 | 0 | 23Oct2017 | To be deleted (New episode for A1) |
| A1 | 0 | 24Oct2017 | To be kept |
A naive approach using PROC SQL with a GROUP BY id, group and MAX(day) would fail here because it would overwrite the distinction between different treatment periods (e.g., for A1, it would only keep October 24 and lose October 1).
The Optimal Solution
The most robust method is based on using the DATA Step in combination with an intelligent sort (PROC SORT). The idea is to use the DIF function to compare data between rows while protecting group changes.
Step 1: The Sort (PROC SORT)
The trick is to sort the data in descending order by date. By placing the most recent date first, we transform the problem: instead of 'looking at the next row to see if it's the same,' we can simply compare the current row with the previous one.
