Reading and Correcting Delimited Data with Missing Delimiters

The main objective of this script is to provide strategies for a common problem when importing flat files: missing delimiters. The script first creates a temporary text file ('C:\TEMP\osbbb.txt') intentionally containing lines with missing delimiters. Then, it attempts to read this file with a standard INFILE statement, demonstrating the resulting errors (column shift). Finally, it offers two corrective approaches within a DATA step: the first using SAS^© FIND and SUBSTR functions with the automatic variable _INFILE_, and a second, more concise one, using the PRXCHANGE function with a regular expression to insert missing delimiters before final reading.

Data Analysis

Type : CREATION_INTERNE

The data is generated directly in the script via a DATA _NULL_ step, which writes a delimited text file with intentional missing delimiters. This temporary file is then used as a source for the reading and correction DATA steps.

1 Code Block

DATA STEP Data

Explanation :
This block initializes a file reference (FILENAME) to a text file named 'osbbb.txt' in 'C:\TEMP'. Then, a DATA _NULL_ step is used to create and write data lines into this file. The content includes a header line and several data lines, some of which intentionally have missing delimiters (e.g., 'Gannet~22JUL2018~Y' is missing the second delimiter between 'Gannet' and '22JUL2018'), thereby simulating a corrupted data scenario for subsequent tests.

Copied!

1	FILENAME MISSING 'C:\TEMP\osbbb.txt';
2
3	DATA _NULL_;
4	file MISSING;
5	put 'Animal~Mode~Date~Bird';
6	put 'Beaver~~09OCT2019~N';
7	put 'Gannet~22JUL2018~Y';
8	* ^-- missing delimiter;
9	put 'Peacock~~17DEC2017~Y';
10	put 'Robin~23FEB2019~Y';
11	* ^-- missing delimiter;
12	put 'Nuthatch~19APR2019~Y';
13	* ^-- missing delimiter;
14	RUN;

2 Code Block

DATA STEP Data

Explanation :
This DATA step attempts to read the 'MISSING' file created previously. The INFILE statement uses the tilde (~) delimiter, the DSD option to treat consecutive delimiters as missing values, FIRSTOBS=2 to skip the header, and TRUNCOVER to prevent errors when lines are shorter than expected. However, due to missing delimiters that are not consecutive delimiters but rather offsets, this block demonstrates that reading will fail for corrupted records, leading to incorrectly assigned or truncated data.

Copied!

1	DATA missing;
2	INFILE MISSING dlm='~' DSD firstobs=2 truncover;
3	FORMAT Animal $8. Mode $5. Date DATE9. Bird $1.;
4	INPUT Animal Mode Date:DATE9. Bird;
5	RUN;

3 Code Block

DATA STEP Data

Explanation :
This DATA step offers a first solution to correct missing delimiters. The 'input @;' statement reads the complete line into the automatic variable _INFILE_ without parsing it. It then uses the SCAN function to check the supposed content of the fourth column ('field4'). If 'field4' is not 'Y' or 'N', it indicates a missing delimiter. In this case, the code finds the first delimiter (FIND) and inserts an additional tilde ('~') after it by manipulating the _INFILE_ string with SUBSTR. Finally, a second INPUT statement rereads the corrected line from _INFILE_ for correct parsing. The temporary variables DLM1AT and field4 are then dropped (DROP).

Copied!

1	DATA fixed (drop=DLM1AT field4);
2	INFILE MISSING dlm='~' DSD firstobs=2;
3	FORMAT Animal $8. Mode $5. Date DATE9. Bird $1.;
4	INPUT @;
5	LENGTH field4 $1;
6	field4 = scan(_INFILE_, 4, '~', 'M');
7	IF field4 not in ('Y', 'N') THEN DO;
8	DLM1AT = FIND(_INFILE_, '~');
9	_INFILE_ = SUBSTR(_INFILE_, 1, DLM1AT) \|\|
10	'~' \|\| SUBSTR(_INFILE_, DLM1AT + 1);
11	END;
12	INPUT Animal Mode Date:DATE9. Bird;
13	RUN;

4 Code Block

DATA STEP Data

Explanation :
This DATA step offers a more elegant and concise solution using regular expressions. Like the previous block, 'input @;' reads the complete line into _INFILE_. The PRXCHANGE function is then used to search for a specific pattern: the beginning of the string, followed by a sequence of 1 to 8 non-tilde characters, then a tilde, capturing everything in the first capture group (\1). It then looks for a non-tilde character immediately after the first capture group (\2). If this pattern is found (which signals a missing delimiter between the first and second fields), it inserts a tilde between the two capture groups (\1~\2), thus correcting _INFILE_. The corrected line is then read by the final INPUT statement.

Copied!

1	DATA everythings_better_with_regex;
2	INFILE MISSING dlm='~' DSD firstobs=2;
3	FORMAT Animal $8. Mode $5. Date DATE9. Bird $1.;
4	INPUT @;
5	_INFILE_ = PRXCHANGE('s/^([^~]{1,8}~)([^~])/, 1, _INFILE_);
6	INPUT Animal Mode Date:DATE9. Bird;
7	RUN;

This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.

Copyright Info : Paper 4839-2020 Read Before You Read: Reading, Rewriting & Re-Reading Difficult Delimited Data in a Data Step Appendix B - Complete Code for Missing Delimiters

Retour à la liste