Examples: One-to-One Data Merging

The functional analysis focuses on the behavior of the MERGE statement in the absence of the BY statement. In this one-to-one merging mode, SAS^© does not compare the values of common variables, but implicitly associates observations based on their row number. When column names are identical in the input datasets, the values of common variables from the last dataset specified in the MERGE statement overwrite those from previous datasets. Non-shared columns are added as new columns. Observations are read sequentially from each dataset, and the output dataset will contain all observations, adjusted by the common variable overload rules. The document warns against using this method for data relationships other than strictly one-to-one, as it can produce unexpected and incorrect results.

Data Analysis

Type : CREATION_INTERNE

Examples use generated data (datalines) or SASHELP.

1 Code Block

DATA STEP Data

Explanation :
This example merges two datasets, 'animal' and 'plantG', which have an equal number of observations and a common variable named 'common'. The MERGE statement is used without the BY statement, meaning that observations are merged based on their position (row number). The value 'g' of the 'common' variable in 'plantG' replaces the value 'f' of 'common' in 'animal' for observation 6 of the output dataset, demonstrating that values from the last named dataset overwrite previous ones for common variables.

Copied!

1	DATA animal;
2	INPUT common $ animal $;
3	DATALINES;
4	a Ant
5	b Bird
6	c Cat
7	d Dog
8	e Eagle
9	f Frog
10	;
11
12	DATA plantG;
13	INPUT common $ plant $;
14	DATALINES;
15	a Apple
16	b Banana
17	c Coconut
18	d Dewberry
19	e Eggplant
20	g Fig
21	;
22
23	DATA merged;
24	MERGE animal plantG;
25	RUN;
26
27	PROC PRINT DATA=merged;
28	RUN;

2 Code Block

DATA STEP Data

Explanation :
This example merges the 'animal' dataset (6 observations) and 'plantMissing' (3 observations). Without a BY statement, one-to-one merging associates observations by row number. The resulting 'merged' dataset will have 6 observations. For observations 4, 5, and 6, the variables from 'plantMissing' (plant, common) will contain missing values because 'plantMissing' only has 3 observations. The values of 'common' from the last dataset ('plantMissing') overwrite those from 'animal' for the corresponding observations.

Copied!

1	DATA animal;
2	INPUT common $ animal $;
3	DATALINES;
4	a Ant
5	b Bird
6	c Cat
7	d Dog
8	e Eagle
9	f Frog
10	;
11
12	DATA plantMissing;
13	INPUT common $ plant $;
14	DATALINES;
15	a Apple
16	b Banana
17	c Coconut
18	;
19
20	DATA merged;
21	MERGE animal plantmissing;
22	RUN;
23	PROC PRINT DATA=merged;
24	RUN;

3 Code Block

DATA STEP Data

Explanation :
This example shows the difference in behavior between MERGE and SET when combining datasets of unequal sizes without a BY statement. With two consecutive SET statements, the DATA step stops selecting observations for output after reading the last observation from the dataset with the smallest number of observations ('plantMissing' with 3 observations). The resulting 'combine' dataset will therefore only have 3 observations, where the variables from 'animal' will be present but not combined by row with 'plantMissing' beyond the 3rd observation.

Copied!

1	DATA animal;
2	INPUT common $ animal $;
3	DATALINES;
4	a Ant
5	b Bird
6	c Cat
7	d Dog
8	e Eagle
9	f Frog
10	;
11
12	DATA plantMissing;
13	INPUT common $ plant $;
14	DATALINES;
15	a Apple
16	b Banana
17	c Coconut
18	;
19
20	DATA combine;
21	SET animal;
22	SET plantMissing;
23	RUN;
24	PROC PRINT DATA=combine;
25	RUN;

4 Code Block

DATA STEP Data

Explanation :
This example illustrates the undesirable results of a one-to-one merge (without BY) when datasets contain duplicate values for common variables. Since the merge is based on row number and not on 'common' values, duplicates of 'a' in 'animalDupes' and 'c' in 'plantDupes' are not handled as one might expect. The second observation from 'animalDupes' (a Ape) is merged with the second from 'plantDupes' (b Banana), which does not correspond to the common variable values. The result is a dataset where the logical pairing of observations is compromised due to duplications and the absence of the BY statement.

Copied!

1	DATA animalDupes;
2	INPUT common $ animal $;
3	DATALINES;
4	a Ant
5	a Ape
6	b Bird
7	c Cat
8	d Dog
9	e Eagle
10	;
11
12	DATA plantDupes;
13	INPUT common $ plant $;
14	DATALINES;
15	a Apple
16	b Banana
17	c Coconut
18	c Celery
19	d Dewberry
20	e Eggplant
21	;
22
23	DATA merged;
24	MERGE animalDupes plantDupes;
25	RUN;
26	PROC PRINT DATA=merged;
27	RUN;

5 Code Block

DATA STEP Data

Explanation :
This example demonstrates the undesirable results of a one-to-one merge (without BY) when the input datasets ('animalMissing' and 'plantMissing2') have different values for the common variable ('common'). Since the merge is done by row number, observation 2 from 'animalMissing' (c Cat) is merged with observation 2 from 'plantMissing2' (b Banana), and so on. This leads to a dataset where the values of the 'common' variable and other variables do not logically match, as the pairing is not based on identifier concordance.

Copied!

1	DATA animalMissing;
2	INPUT common $ animal $;
3	DATALINES;
4	a Ant
5	c Cat
6	d Dog
7	e Eagle
8	;
9
10	DATA plantMissing2;
11	INPUT common $ plant $;
12	DATALINES;
13	a Apple
14	b Banana
15	c Coconut
16	e Eggplant
17	f Fig
18	;
19
20	DATA merged;
21	MERGE animalMissing plantMissing2;
22	RUN;
23	PROC PRINT DATA=merged;
24	RUN;

This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.

Retour à la liste