Examples use generated data (datalines) or SASHELP.
1 Code Block
DATA STEP / PROC SORT Data
Explanation : This example illustrates a simple match-merge where the MERGE statement is used with the BY statement to join two datasets ('animal' and 'plant') based on the common variable 'common'. This is a one-to-one merge as there are no duplicate values for the BY variable in either input dataset. The datasets are first sorted by 'common' before merging.
Copied!
data animal;
input common $ animal$;
datalines;
a Ant
b Bird
c Cat
d Dog
e Eagle
f Frog
;
run;
data plant;
input common $ plant$;
datalines;
a Apple
b Banana
c Coconut
d Dewberry
e Eggplant
f Fig
;
run;
proc sort data=animal; by common; run;
proc sort data=plant; by common; run;
data matchmerge;
merge animal plant;
by common;
run;
proc print data=matchmerge; run;
1
DATA animal;
2
INPUT common $ animal$;
3
DATALINES;
4
a Ant
5
b Bird
6
c Cat
7
d Dog
8
e Eagle
9
f Frog
10
;
11
RUN;
12
13
DATA plant;
14
INPUT common $ plant$;
15
DATALINES;
16
a Apple
17
b Banana
18
c Coconut
19
d Dewberry
20
e Eggplant
21
f Fig
22
;
23
RUN;
24
25
PROC SORTDATA=animal; BY common; RUN;
26
PROC SORTDATA=plant; BY common; RUN;
27
28
DATA matchmerge;
29
MERGE animal plant;
30
BY common;
31
RUN;
32
PROC PRINTDATA=matchmerge; RUN;
2 Code Block
DATA STEP / PROC PRINT Data
Explanation : This example shows a one-to-many merge. Datasets 'one' and 'many' contain the variables 'ID' and 'state'. 'ID' is the BY variable. When 'state' is not a BY variable, its value from dataset 'one' (listed second) overwrites the first value from dataset 'many' for each BY group, but not subsequent ones. To replace all 'state' values within the BY group, the 'state' variable from dataset 'many' must be dropped or renamed before merging, as shown in the second code block ('data solution').
Copied!
data one;
input ID state$;
datalines;
1 AZ
2 MA
3 WA
4 WI
;
run;
data many;
input ID city $ state$;
datalines;
1 Phoenix Ariz
2 Boston Mass
2 Foxboro Mass
3 Olympia Mass
3 Seattle Wash
3 Spokane Wash
4 Madison Wis
4 Milwaukee Wis
4 Madison Wis
4 Hurley Wis
;
run;
proc sort data=one; by ID; run;
proc sort data=many; by ID; run;
data three;
merge many one;
by ID;
run;
proc print data=three noobs; run;
title;
data solution;
merge many(drop=state) one;
by ID;
run;
proc print data=solution noobs; run;
1
DATA one;
2
INPUT ID state$;
3
DATALINES;
4
1 AZ
5
2 MA
6
3 WA
7
4 WI
8
;
9
RUN;
10
11
DATA many;
12
INPUT ID city $ state$;
13
DATALINES;
14
1 Phoenix Ariz
15
2 Boston Mass
16
2 Foxboro Mass
17
3 Olympia Mass
18
3 Seattle Wash
19
3 Spokane Wash
20
4 Madison Wis
21
4 Milwaukee Wis
22
4 Madison Wis
23
4 Hurley Wis
24
;
25
RUN;
26
27
PROC SORTDATA=one; BY ID; RUN;
28
PROC SORTDATA=many; BY ID; RUN;
29
30
DATA three;
31
MERGE many one;
32
BY ID;
33
RUN;
34
PROC PRINTDATA=three noobs; RUN;
35
title;
36
37
DATA solution;
38
MERGE many(drop=state) one;
39
BY ID;
40
RUN;
41
PROC PRINTDATA=solution noobs; RUN;
3 Code Block
DATA STEP / PROC SORT Data
Explanation : This example illustrates a many-to-one merge when the input datasets 'animalDupes' and 'plantDupes' contain duplicate values for the BY variable 'common'. SAS combines all observations within each BY group, retaining the values of all variables. The datasets are sorted by 'common' before merging.
Copied!
data animalDupes;
input common $ animal1$;
datalines;
a Ant
a Ape
b Bird
c Cat
d Dog
e Eagle
;
run;
data plantDupes;
input common $ plant1$;
datalines;
a Apple
b Banana
c Coconut
c Celery
d Dewberry
e Eggplant
;
run;
proc sort data=animalDupes; by common; run;
proc sort data=plantDupes; by common; run;
data matchmerge;
merge animalDupes plantDupes;
by common;
run;
proc print data=matchmerge; run;
1
DATA animalDupes;
2
INPUT common $ animal1$;
3
DATALINES;
4
a Ant
5
a Ape
6
b Bird
7
c Cat
8
d Dog
9
e Eagle
10
;
11
RUN;
12
13
DATA plantDupes;
14
INPUT common $ plant1$;
15
DATALINES;
16
a Apple
17
b Banana
18
c Coconut
19
c Celery
20
d Dewberry
21
e Eggplant
22
;
23
RUN;
24
25
PROC SORTDATA=animalDupes; BY common; RUN;
26
PROC SORTDATA=plantDupes; BY common; RUN;
27
28
DATA matchmerge;
29
MERGE animalDupes plantDupes;
30
BY common;
31
RUN;
32
PROC PRINTDATA=matchmerge; RUN;
4 Code Block
DATA STEP / PROC SORT Data
Explanation : This example shows merging two datasets ('animalMissing' and 'plantMissing2') that have different values for their common BY variable 'common', resulting in unmatched observations. SAS retains the values of all variables from both datasets in the final result, even if the value is missing in one of the datasets. The datasets are sorted by 'common' before merging.
Copied!
data animalMissing;
input common $ animal1$;
datalines;
a Ant
c Cat
d Dog
e Eagle
;
run;
data plantMissing2;
input common $ plant$;
datalines;
a Apple
b Banana
c Coconut
e Eggplant
f Fig
;
run;
proc sort data=animalMissing; by common; run;
proc sort data=plantMissing2; by common; run;
data matchmerge;
merge animalMissing plantMissing2;
by common;
run;
proc print data=matchmerge; run;
1
DATA animalMissing;
2
INPUT common $ animal1$;
3
DATALINES;
4
a Ant
5
c Cat
6
d Dog
7
e Eagle
8
;
9
RUN;
10
11
DATA plantMissing2;
12
INPUT common $ plant$;
13
DATALINES;
14
a Apple
15
b Banana
16
c Coconut
17
e Eggplant
18
f Fig
19
;
20
RUN;
21
22
PROC SORTDATA=animalMissing; BY common; RUN;
23
PROC SORTDATA=plantMissing2; BY common; RUN;
24
25
DATA matchmerge;
26
MERGE animalMissing plantMissing2;
27
BY common;
28
RUN;
29
30
PROC PRINTDATA=matchmerge; RUN;
5 Code Block
DATA STEP / PROC SORT Data
Explanation : This example is similar to the previous one, but it uses the IN= dataset option to remove unmatched observations from the output dataset. The IN= variable is a Boolean variable that is set to 1 if the dataset contributes to the current observation and 0 otherwise. The IF (i=1) and (j=1) statement ensures that only observations present in both datasets (matched) are included in the result.
Copied!
data animalMissing;
input common $ animal$;
datalines;
a Ant
c Cat
d Dog
e Eagle
;
run;
data plantMissing2;
input common $ plant$;
datalines;
a Apple
b Banana
c Coconut
e Eggplant
f Fig
;
run;
proc sort data=animalMissing; by common; run;
proc sort data=plantMissing2; by common; run;
data matchmerge2;
merge animalMissing(in=i) plantMissing2(in=j);
by common;
if (i=1) and (j=1);
run;
proc print data=matchmerge2; run;
1
DATA animalMissing;
2
INPUT common $ animal$;
3
DATALINES;
4
a Ant
5
c Cat
6
d Dog
7
e Eagle
8
;
9
RUN;
10
11
DATA plantMissing2;
12
INPUT common $ plant$;
13
DATALINES;
14
a Apple
15
b Banana
16
c Coconut
17
e Eggplant
18
f Fig
19
;
20
RUN;
21
22
PROC SORTDATA=animalMissing; BY common; RUN;
23
PROC SORTDATA=plantMissing2; BY common; RUN;
24
25
DATA matchmerge2;
26
MERGE animalMissing(in=i) plantMissing2(in=j);
27
BY common;
28
IF (i=1) and (j=1);
29
RUN;
30
PROC PRINTDATA=matchmerge2; RUN;
6 Code Block
DATA STEP / PROC PRINT Data
Explanation : This example shows a match-merge when duplicate values of the BY variable 'ID' exist in both datasets ('fruit' and 'color'). The datasets are first sorted by 'ID'. During the merge, the value of the 'color' variable from the 'color' dataset (listed second) overwrites the value in the PDV for the first observation of the BY group. Subsequent values within the BY group are then filled with this 'color' value until the BY group changes. This happens because variables do not reset to missing values until the BY group changes.
Copied!
data fruit;
input ID $ fruit$;
datalines;
a apple
c apricot
d banana
e blueberry
c cantaloupe
c coconut
c cherry
c crabapple
c cranberry
;
run;
data color;
input ID $ color$;
datalines;
a amber
b brown
b blue
b black
b beige
b bronze
c cocoa
c cream
;
run;
proc sort data=fruit; by ID; run;
proc sort data=color; by ID; run;
data merged;
merge fruit color;
by id;
run;
proc print data=merged;
title 'Merged by ID';
run;
1
DATA fruit;
2
INPUT ID $ fruit$;
3
DATALINES;
4
a apple
5
c apricot
6
d banana
7
e blueberry
8
c cantaloupe
9
c coconut
10
c cherry
11
c crabapple
12
c cranberry
13
;
14
RUN;
15
16
DATA color;
17
INPUT ID $ color$;
18
DATALINES;
19
a amber
20
b brown
21
b blue
22
b black
23
b beige
24
b bronze
25
c cocoa
26
c cream
27
;
28
RUN;
29
30
PROC SORTDATA=fruit; BY ID; RUN;
31
PROC SORTDATA=color; BY ID; RUN;
32
33
DATA merged;
34
MERGE fruit color;
35
BY id;
36
RUN;
37
38
PROC PRINTDATA=merged;
39
title 'Merged by ID';
40
RUN;
7 Code Block
DATA STEP / PROC PRINT Data
Explanation : This example shows how to handle missing values during a one-to-many merge with common variables. Initially, a simple merge of datasets 'one' and 'two' (which contain missing values for 'age' in 'one') results in missing values for 'age' in the merged dataset. To correct this, the second code block uses a temporary variable ('temp_age') and the RETAIN statement. For the first record of a BY group, 'temp_age' is reset to a missing value. If 'age' is missing, it is replaced by 'temp_age'. Otherwise, 'temp_age' is updated with the non-missing value of 'age', ensuring that the last non-missing value is propagated within the BY group.
Copied!
data one;
input ID age score;
datalines;
1 8 90
1 . 100
1 . 95
2 9 80
2 . 100
;
run;
data two;
input ID name $ age;
datalines;
1 Sarah 11
2 John 10
;
run;
proc sort data=one; by ID; run;
proc sort data=two; by ID; run;
data merge1;
merge one two;
by id;
run;
proc print data=merge1; title 'Merged by ID'; run;
data merge2 (drop=temp_age);
merge one two;
by id;
retain temp_age;
if first.id then temp_age = .;
if age = . then age = temp_age;
else temp_age = age;
run;
proc print data=merge2; title 'Merged by ID with Age Retained'; run;
1
DATA one;
2
INPUT ID age score;
3
DATALINES;
4
1890
5
1 . 100
6
1 . 95
7
2980
8
2 . 100
9
;
10
RUN;
11
12
DATA two;
13
INPUT ID name $ age;
14
DATALINES;
15
1 Sarah 11
16
2 John 10
17
;
18
RUN;
19
20
PROC SORTDATA=one; BY ID; RUN;
21
PROC SORTDATA=two; BY ID; RUN;
22
23
DATA merge1;
24
MERGE one two;
25
BY id;
26
RUN;
27
PROC PRINTDATA=merge1; title 'Merged by ID'; RUN;
28
29
DATA merge2 (drop=temp_age);
30
MERGE one two;
31
BY id;
32
retain temp_age;
33
IF first.id THEN temp_age = .;
34
IF age = . THEN age = temp_age;
35
ELSE temp_age = age;
36
RUN;
37
PROC PRINTDATA=merge2; title 'Merged by ID with Age Retained'; RUN;
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
« To master data merging in SAS, you must understand how the Program Data Vector (PDV) manages variable overwriting and observation counts. Unlike SQL joins, a SAS MERGE will overwrite variables with the same name across datasets; therefore, always drop or rename overlapping non-BY variables (like state) to prevent losing source data. To control the join type, use the (IN=) data set option—it allows you to mimic inner, left, or right joins by filtering for specific flags in your subsetting IF statements. Finally, remember that if datasets have an unequal number of observations per BY-group, SAS will "retain" the last value of the shorter dataset until the entire group is processed, which can lead to unintended data duplication if not managed carefully. »
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. WeAreCAS is an independent community site and is not affiliated with SAS Institute Inc.
This site uses technical and analytical cookies to improve your experience.
Read more.