Published on :
Data Manipulation CREATION_INTERNE

Examples: Match-Merge Data

This code is also available in: Deutsch Español Français
Awaiting validation
Match-merging is a powerful technique for combining SAS© datasets. It allows joining observations from two or more datasets based on the values of one or more common variables (BY variables). Input datasets must be sorted or indexed by the BY variables before merging. This process can be compared to an INNER JOIN in SQL when all BY variable values match and there are no duplicates. SAS© retains the values of all variables in the program data vector, even if the value is missing or unmatched. When a final observation for a BY group is read from a dataset, SAS© retains its values for all variables unique to that dataset until all observations for that BY group have been read from all datasets. The total number of observations in the final dataset is the sum of the maximum number of observations in a BY group from either of the datasets.
Data Analysis

Type : CREATION_INTERNE


Examples use generated data (datalines) or SASHELP.

1 Code Block
DATA STEP / PROC SORT Data
Explanation :
This example illustrates a simple match-merge where the MERGE statement is used with the BY statement to join two datasets ('animal' and 'plant') based on the common variable 'common'. This is a one-to-one merge as there are no duplicate values for the BY variable in either input dataset. The datasets are first sorted by 'common' before merging.
Copied!
1DATA animal;
2 INPUT common $ animal$;
3 DATALINES;
4a Ant
5b Bird
6c Cat
7d Dog
8e Eagle
9f Frog
10;
11RUN;
12 
13DATA plant;
14 INPUT common $ plant$;
15 DATALINES;
16a Apple
17b Banana
18c Coconut
19d Dewberry
20e Eggplant
21f Fig
22;
23RUN;
24 
25PROC SORT DATA=animal; BY common; RUN;
26PROC SORT DATA=plant; BY common; RUN;
27 
28DATA matchmerge;
29 MERGE animal plant;
30 BY common;
31RUN;
32PROC PRINT DATA=matchmerge; RUN;
2 Code Block
DATA STEP / PROC PRINT Data
Explanation :
This example shows a one-to-many merge. Datasets 'one' and 'many' contain the variables 'ID' and 'state'. 'ID' is the BY variable. When 'state' is not a BY variable, its value from dataset 'one' (listed second) overwrites the first value from dataset 'many' for each BY group, but not subsequent ones. To replace all 'state' values within the BY group, the 'state' variable from dataset 'many' must be dropped or renamed before merging, as shown in the second code block ('data solution').
Copied!
1DATA one;
2 INPUT ID state$;
3 DATALINES;
41 AZ
52 MA
63 WA
74 WI
8;
9RUN;
10 
11DATA many;
12 INPUT ID city $ state$;
13 DATALINES;
141 Phoenix Ariz
152 Boston Mass
162 Foxboro Mass
173 Olympia Mass
183 Seattle Wash
193 Spokane Wash
204 Madison Wis
214 Milwaukee Wis
224 Madison Wis
234 Hurley Wis
24;
25RUN;
26 
27PROC SORT DATA=one; BY ID; RUN;
28PROC SORT DATA=many; BY ID; RUN;
29 
30DATA three;
31 MERGE many one;
32 BY ID;
33RUN;
34PROC PRINT DATA=three noobs; RUN;
35title;
36 
37DATA solution;
38 MERGE many(drop=state) one;
39 BY ID;
40RUN;
41PROC PRINT DATA=solution noobs; RUN;
3 Code Block
DATA STEP / PROC SORT Data
Explanation :
This example illustrates a many-to-one merge when the input datasets 'animalDupes' and 'plantDupes' contain duplicate values for the BY variable 'common'. SAS combines all observations within each BY group, retaining the values of all variables. The datasets are sorted by 'common' before merging.
Copied!
1DATA animalDupes;
2 INPUT common $ animal1$;
3 DATALINES;
4a Ant
5a Ape
6b Bird
7c Cat
8d Dog
9e Eagle
10;
11RUN;
12 
13DATA plantDupes;
14 INPUT common $ plant1$;
15 DATALINES;
16a Apple
17b Banana
18c Coconut
19c Celery
20d Dewberry
21e Eggplant
22;
23RUN;
24 
25PROC SORT DATA=animalDupes; BY common; RUN;
26PROC SORT DATA=plantDupes; BY common; RUN;
27 
28DATA matchmerge;
29 MERGE animalDupes plantDupes;
30 BY common;
31RUN;
32PROC PRINT DATA=matchmerge; RUN;
4 Code Block
DATA STEP / PROC SORT Data
Explanation :
This example shows merging two datasets ('animalMissing' and 'plantMissing2') that have different values for their common BY variable 'common', resulting in unmatched observations. SAS retains the values of all variables from both datasets in the final result, even if the value is missing in one of the datasets. The datasets are sorted by 'common' before merging.
Copied!
1DATA animalMissing;
2 INPUT common $ animal1$;
3 DATALINES;
4a Ant
5c Cat
6d Dog
7e Eagle
8;
9RUN;
10 
11DATA plantMissing2;
12 INPUT common $ plant$;
13 DATALINES;
14a Apple
15b Banana
16c Coconut
17e Eggplant
18f Fig
19;
20RUN;
21 
22PROC SORT DATA=animalMissing; BY common; RUN;
23PROC SORT DATA=plantMissing2; BY common; RUN;
24 
25DATA matchmerge;
26 MERGE animalMissing plantMissing2;
27 BY common;
28RUN;
29 
30PROC PRINT DATA=matchmerge; RUN;
5 Code Block
DATA STEP / PROC SORT Data
Explanation :
This example is similar to the previous one, but it uses the IN= dataset option to remove unmatched observations from the output dataset. The IN= variable is a Boolean variable that is set to 1 if the dataset contributes to the current observation and 0 otherwise. The IF (i=1) and (j=1) statement ensures that only observations present in both datasets (matched) are included in the result.
Copied!
1DATA animalMissing;
2 INPUT common $ animal$;
3 DATALINES;
4a Ant
5c Cat
6d Dog
7e Eagle
8;
9RUN;
10 
11DATA plantMissing2;
12 INPUT common $ plant$;
13 DATALINES;
14a Apple
15b Banana
16c Coconut
17e Eggplant
18f Fig
19;
20RUN;
21 
22PROC SORT DATA=animalMissing; BY common; RUN;
23PROC SORT DATA=plantMissing2; BY common; RUN;
24 
25DATA matchmerge2;
26 MERGE animalMissing(in=i) plantMissing2(in=j);
27 BY common;
28 IF (i=1) and (j=1);
29RUN;
30PROC PRINT DATA=matchmerge2; RUN;
6 Code Block
DATA STEP / PROC PRINT Data
Explanation :
This example shows a match-merge when duplicate values of the BY variable 'ID' exist in both datasets ('fruit' and 'color'). The datasets are first sorted by 'ID'. During the merge, the value of the 'color' variable from the 'color' dataset (listed second) overwrites the value in the PDV for the first observation of the BY group. Subsequent values within the BY group are then filled with this 'color' value until the BY group changes. This happens because variables do not reset to missing values until the BY group changes.
Copied!
1DATA fruit;
2 INPUT ID $ fruit$;
3 DATALINES;
4a apple
5c apricot
6d banana
7e blueberry
8c cantaloupe
9c coconut
10c cherry
11c crabapple
12c cranberry
13;
14RUN;
15 
16DATA color;
17 INPUT ID $ color$;
18 DATALINES;
19a amber
20b brown
21b blue
22b black
23b beige
24b bronze
25c cocoa
26c cream
27;
28RUN;
29 
30PROC SORT DATA=fruit; BY ID; RUN;
31PROC SORT DATA=color; BY ID; RUN;
32 
33DATA merged;
34 MERGE fruit color;
35 BY id;
36RUN;
37 
38PROC PRINT DATA=merged;
39 title 'Merged by ID';
40RUN;
7 Code Block
DATA STEP / PROC PRINT Data
Explanation :
This example shows how to handle missing values during a one-to-many merge with common variables. Initially, a simple merge of datasets 'one' and 'two' (which contain missing values for 'age' in 'one') results in missing values for 'age' in the merged dataset. To correct this, the second code block uses a temporary variable ('temp_age') and the RETAIN statement. For the first record of a BY group, 'temp_age' is reset to a missing value. If 'age' is missing, it is replaced by 'temp_age'. Otherwise, 'temp_age' is updated with the non-missing value of 'age', ensuring that the last non-missing value is propagated within the BY group.
Copied!
1DATA one;
2 INPUT ID age score;
3 DATALINES;
41 8 90
51 . 100
61 . 95
72 9 80
82 . 100
9;
10RUN;
11 
12DATA two;
13 INPUT ID name $ age;
14 DATALINES;
151 Sarah 11
162 John 10
17;
18RUN;
19 
20PROC SORT DATA=one; BY ID; RUN;
21PROC SORT DATA=two; BY ID; RUN;
22 
23DATA merge1;
24 MERGE one two;
25 BY id;
26RUN;
27PROC PRINT DATA=merge1; title 'Merged by ID'; RUN;
28 
29DATA merge2 (drop=temp_age);
30 MERGE one two;
31 BY id;
32 retain temp_age;
33 IF first.id THEN temp_age = .;
34 IF age = . THEN age = temp_age;
35 ELSE temp_age = age;
36RUN;
37PROC PRINT DATA=merge2; title 'Merged by ID with Age Retained'; RUN;
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
Copyright Info : Copyright © SAS Institute Inc. All Rights Reserved


Banner
Expert Advice
Expert
Simon
Expert SAS et fondateur.
« To master data merging in SAS, you must understand how the Program Data Vector (PDV) manages variable overwriting and observation counts. Unlike SQL joins, a SAS MERGE will overwrite variables with the same name across datasets; therefore, always drop or rename overlapping non-BY variables (like state) to prevent losing source data. To control the join type, use the (IN=) data set option—it allows you to mimic inner, left, or right joins by filtering for specific flags in your subsetting IF statements. Finally, remember that if datasets have an unequal number of observations per BY-group, SAS will "retain" the last value of the shorter dataset until the entire group is processed, which can lead to unintended data duplication if not managed carefully. »