SAS9

Matching Data on a Rolling Time Window

Simon 9 vues

One of the classic challenges in Data Management is to join two tables not on strict equality, but on a relative time window.

The typical scenario is as follows: you have a reference table containing key event dates (e.g., balance sheet date, marketing campaign date) and a second table containing a history of transactions. The objective is to retrieve, for each key event, all transactions that occurred within the preceding 12 months.

Unlike a classic merge (MERGE) which requires identical keys, this problem requires an inequality join.

The Scenario

Let's take two datasets to illustrate the problem:

  1. Reference Table (HAVE_BASE): Contains companies and their pivot dates. Note that company 'A' appears twice with two different dates.

  2. Event Table (HAVE_EVENT): Contains the history of financial results (EPS) by company at various dates.

The objective is to extract the rows from the EVENT table where the date is between [Reference Date - 12 months] and [Reference Date].

Data Preparation

Since dates are often imported as character strings (e.g., "199701"), the first critical step is to convert them into true numeric SAS© dates to allow for arithmetic calculations.

1DATA have_base;
2 INPUT company $ _date :$6. profit;
3 /* Conversion de 'YYYYMM' en date SAS (1er du mois) */
4 date = INPUT(_date !! '01', yymmdd8.);
5 FORMAT date monyy6.;
6 drop _date;
7 DATALINES;
8a 199701 5
9a 199606 9
10b 201404 6
11f 200004 78
12;
13RUN;
14 
15/* Même traitement pour la table historique */
16DATA have_event;
17 INPUT company $ _date :$6. EPS;
18 date = INPUT(_date !! '01', yymmdd8.);
19 FORMAT date monyy6.;
20 drop _date;
21 /* ... données ... */
22RUN;

Why is the DATA Step (Merge) Not Suitable?

An intuitive attempt would be to use the DATA step with a MERGE statement. However, MERGE matches rows sequentially based on a strict equality of the BY variables.

If you try to filter with conditions like if date >= intnx('month', date_b, -12), it will often fail in "Many-to-Many" cases (multiple reference dates for the same company) because the SAS© read pointer cannot "go back" to re-test the historical rows for the second reference date.

The Optimal Solution: PROC SQL

The SQL language is perfect for this case because it generates a Cartesian product on the join key (company), and then filters the results according to the time window. This allows a single event row to be associated with multiple reference dates if the windows overlap.

The Code

The key function here is INTNX, which allows you to shift a date by a specific number of intervals (here, -12 months).

1PROC SQL;
2 create TABLE want as
3 select
4 base.company,
5 base.profit,
6 base.date as Date_Reference,
7 evt.EPS,
8 evt.date as Date_Evenement
9 from
10 have_base as base,
11 have_event as evt
12 where
13 /* 1. Jointure sur l'identifiant */
14 base.company = evt.company
15 AND
16 /* 2. Jointure sur la fenêtre temporelle :
17 La date de l'événement doit être entre (Date Ref - 12 mois) et (Date Ref) */
18 intnx('month', base.date, -12) <= evt.date <= base.date;
19QUIT;

Analysis of the WHERE clause

  1. base.company = evt.company: Essential. Without this, you would be comparing the dates of company A with those of company B.

  2. intnx('month', base.date, -12): Calculates the lower bound of the window (the date exactly 12 months before the reference date).

  3. The operator <=: The framing Lower_Bound <= Event_Date <= Upper_Bound is the most efficient and readable method for defining the interval.

Result and Interpretation

If company 'A' has a reference date in January 1997 and another in June 1996:

  • SQL will first process January 1997: It will search for all of 'A's events between Jan 96 and Jan 97.

  • It will then process June 1996: It will search for all of 'A's events between June 95 and June 96.

  • If an event occurs in May 1996, it will appear twice in the final result (once associated with the '97 reference, once with the '96 reference), which is the correct expected behavior for this type of analysis.

To join data over time intervals (rolling windows):

  • Prefer PROC SQL over the DATA step.

  • Ensure your dates are in numeric format.

  • Use INTNX to dynamically calculate the bounds of your time window in the WHERE clause.