Published on :
Data Manipulation CREATION_INTERNE

Examples: One-to-One Data Merging

This code is also available in: Español Français
Awaiting validation
The functional analysis focuses on the behavior of the MERGE statement in the absence of the BY statement. In this one-to-one merging mode, SAS© does not compare the values of common variables, but implicitly associates observations based on their row number. When column names are identical in the input datasets, the values of common variables from the last dataset specified in the MERGE statement overwrite those from previous datasets. Non-shared columns are added as new columns. Observations are read sequentially from each dataset, and the output dataset will contain all observations, adjusted by the common variable overload rules. The document warns against using this method for data relationships other than strictly one-to-one, as it can produce unexpected and incorrect results.
Data Analysis

Type : CREATION_INTERNE


Examples use generated data (datalines) or SASHELP.

1 Code Block
DATA STEP Data
Explanation :
This example merges two datasets, 'animal' and 'plantG', which have an equal number of observations and a common variable named 'common'. The MERGE statement is used without the BY statement, meaning that observations are merged based on their position (row number). The value 'g' of the 'common' variable in 'plantG' replaces the value 'f' of 'common' in 'animal' for observation 6 of the output dataset, demonstrating that values from the last named dataset overwrite previous ones for common variables.
Copied!
1DATA animal;
2 INPUT common $ animal $;
3 DATALINES;
4a Ant
5b Bird
6c Cat
7d Dog
8e Eagle
9f Frog
10;
11 
12DATA plantG;
13 INPUT common $ plant $;
14 DATALINES;
15a Apple
16b Banana
17c Coconut
18d Dewberry
19e Eggplant
20g Fig
21;
22 
23DATA merged;
24 MERGE animal plantG;
25RUN;
26 
27PROC PRINT DATA=merged;
28RUN;
2 Code Block
DATA STEP Data
Explanation :
This example merges the 'animal' dataset (6 observations) and 'plantMissing' (3 observations). Without a BY statement, one-to-one merging associates observations by row number. The resulting 'merged' dataset will have 6 observations. For observations 4, 5, and 6, the variables from 'plantMissing' (plant, common) will contain missing values because 'plantMissing' only has 3 observations. The values of 'common' from the last dataset ('plantMissing') overwrite those from 'animal' for the corresponding observations.
Copied!
1DATA animal;
2INPUT common $ animal $;
3DATALINES;
4a Ant
5b Bird
6c Cat
7d Dog
8e Eagle
9f Frog
10;
11 
12DATA plantMissing;
13INPUT common $ plant $;
14DATALINES;
15a Apple
16b Banana
17c Coconut
18;
19 
20DATA merged;
21 MERGE animal plantmissing;
22RUN;
23PROC PRINT DATA=merged;
24RUN;
3 Code Block
DATA STEP Data
Explanation :
This example shows the difference in behavior between MERGE and SET when combining datasets of unequal sizes without a BY statement. With two consecutive SET statements, the DATA step stops selecting observations for output after reading the last observation from the dataset with the smallest number of observations ('plantMissing' with 3 observations). The resulting 'combine' dataset will therefore only have 3 observations, where the variables from 'animal' will be present but not combined by row with 'plantMissing' beyond the 3rd observation.
Copied!
1DATA animal;
2INPUT common $ animal $;
3DATALINES;
4a Ant
5b Bird
6c Cat
7d Dog
8e Eagle
9f Frog
10;
11 
12DATA plantMissing;
13INPUT common $ plant $;
14DATALINES;
15a Apple
16b Banana
17c Coconut
18;
19 
20DATA combine;
21 SET animal;
22 SET plantMissing;
23RUN;
24PROC PRINT DATA=combine;
25RUN;
4 Code Block
DATA STEP Data
Explanation :
This example illustrates the undesirable results of a one-to-one merge (without BY) when datasets contain duplicate values for common variables. Since the merge is based on row number and not on 'common' values, duplicates of 'a' in 'animalDupes' and 'c' in 'plantDupes' are not handled as one might expect. The second observation from 'animalDupes' (a Ape) is merged with the second from 'plantDupes' (b Banana), which does not correspond to the common variable values. The result is a dataset where the logical pairing of observations is compromised due to duplications and the absence of the BY statement.
Copied!
1DATA animalDupes;
2INPUT common $ animal $;
3DATALINES;
4a Ant
5a Ape
6b Bird
7c Cat
8d Dog
9e Eagle
10;
11 
12DATA plantDupes;
13INPUT common $ plant $;
14DATALINES;
15a Apple
16b Banana
17c Coconut
18c Celery
19d Dewberry
20e Eggplant
21;
22 
23DATA merged;
24 MERGE animalDupes plantDupes;
25RUN;
26PROC PRINT DATA=merged;
27RUN;
5 Code Block
DATA STEP Data
Explanation :
This example demonstrates the undesirable results of a one-to-one merge (without BY) when the input datasets ('animalMissing' and 'plantMissing2') have different values for the common variable ('common'). Since the merge is done by row number, observation 2 from 'animalMissing' (c Cat) is merged with observation 2 from 'plantMissing2' (b Banana), and so on. This leads to a dataset where the values of the 'common' variable and other variables do not logically match, as the pairing is not based on identifier concordance.
Copied!
1DATA animalMissing;
2INPUT common $ animal $;
3DATALINES;
4a Ant
5c Cat
6d Dog
7e Eagle
8;
9 
10DATA plantMissing2;
11INPUT common $ plant $;
12DATALINES;
13a Apple
14b Banana
15c Coconut
16e Eggplant
17f Fig
18;
19 
20DATA merged;
21 MERGE animalMissing plantMissing2;
22RUN;
23PROC PRINT DATA=merged;
24RUN;
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
Copyright Info : Copyright © SAS Institute Inc. All Rights Reserved