Published on :
ETL INTERNAL_CREATION

Implementation and Comparison of the '_INFILE_' Technique for Data Delimitation

This code is also available in: Deutsch Español Français
Awaiting validation
The script initializes a dataset ('HAVE') with a character variable ('ROW') containing a long sequence of numbers delimited by vertical bars. Two approaches are then presented: the first ('SPLIT') uses a loop and the SCAN function to parse this string, a method considered slow. The second ('SPLIT2') aims to implement the 'Infile Trick' by overloading the _INFILE_ buffer with the content of the 'ROW' variable, allowing for fast and efficient reading of delimited elements directly into a variable array. Finally, a PROC COMPARE is used to verify that the results of the two methods are identical, thus validating the increased performance without loss of data integrity, subject to the correction of syntax errors in 'SPLIT2'.
Data Analysis

Type : INTERNAL_CREATION


All data is internally generated by the 'HAVE' DATA step, which creates a long character variable ('ROW') containing delimited values. No external dataset is used as a primary source.

1 Code Block
DATA STEP Data
Explanation :
This DATA step block creates a dataset named 'HAVE'. It initializes a character variable 'ROW' of 8000 bytes. Then, it constructs a complex character string by concatenating 700 formatted numbers (z3.) separated by the '|' delimiter. This process is repeated 10,000 times to generate a large number of observations, simulating a substantial dataset to process.
Copied!
1DATA HAVE;
2 LENGTH ROW $8000;
3 ROW='|||';
4 DO I=1 to 700;
5 ROW=catx('|',ROW,put(i,z3.));
6 END;
7 DO J = 1 to 1e4;
8 OUTPUT;
9 END;
10RUN;
2 Code Block
DATA STEP
Explanation :
This DATA step reads the 'HAVE' dataset. It declares an array of 700 character variables ('VAR'). A loop is then used to iterate over this array and sequentially extract each element from the 'ROW' variable using the SCAN function. This approach, although functional, is notoriously slow for large amounts of data or a large number of elements to scan, due to its O(N^2) algorithmic complexity.
Copied!
1DATA SPLIT; /* Slow */
2 SET HAVE;
3 array VAR(700) $10;
4 DO I = 1 to dim(VAR);
5 VAR[I]=scan(ROW,I,'|','m');
6 END;
7RUN;
3 Code Block
DATA STEP
Explanation :
This DATA step aims to implement the 'Infile Trick' for fast extraction of delimited data. It reads the 'HAVE' dataset. The INFILE statement is configured to read a virtual file with the '|' delimiter and a high record length. The crucial line `_infile_=ROW;` copies the content of the 'ROW' variable (from the 'HAVE' dataset) directly into the DATA step's input buffer.
**SYNTAX NOTE:** The following `input` statements contain file paths (` @code_sas_json/Mastering SAS Section3.json SAS Section7 (1).json` and ` @code_sas_json/hsdua2304@gmail.com_SAS_Assignment_1.json`) which are syntactically incorrect for specifying variables or column pointers in a SAS DATA step. To be executable and functional, the probable intention was to directly read the delimited values into a variable array, for example `input (VAR1-VAR700) (:$10.);`. As written, this SAS block will cause syntax errors and will not execute correctly.
Copied!
1DATA SPLIT2; /* Fast */
2 SET HAVE;
3 INFILE sasautos(verify.sas) dsd dlm='|' lrecl=8000;
4 INPUT @code_sas_json/Mastering SAS Section3.json SAS Section7 (1).json @;
5 _infile_=ROW;
6 INPUT @code_sas_json/Mastering SAS Section3.json SAS Section7 (1).json (VAR1-VAR700) (:$10.) @code_sas_json/hsdua2304@gmail.com_SAS_Assignment_1.json;
7RUN;
4 Code Block
PROC COMPARE
Explanation :
This procedure compares the 'SPLIT' (created with the slow method) and 'SPLIT2' (attempted 'Infile Trick' method) datasets to verify that both methods produce identical results. This validates the accuracy of the optimization technique despite its different approach. Due to the syntax error in 'SPLIT2', this PROC COMPARE will fail or report differences if 'SPLIT2' cannot be constructed correctly.
Copied!
1PROC COMPARE DATA=SPLIT compare=SPLIT2;
2RUN;
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
Copyright Info : SPDX-License-Identifier: Apache-2.0. Code original de ChrisNZ, discussion sur SAS Communities (https://communities.sas.com/t5/SAS-Programming/How-to-delimit-large-dataset-28-Million-rows-into-700-variables/m-p/487676).