SAS Performance Tuning: Supercharge Your Hadoop Uploads with BULKLOAD=YES

Difficulty Level: Beginner
Published on:
Michael

Expert Advice

Michael
Viya infrastructure manager.

The `sastrace` option enabled in the first code block is critical for debugging bulk-load operations: it exposes the exact temporary files being created on HDFS and the specific HiveQL `LOAD DATA` commands generated by SAS, which are otherwise hidden from the standard log.
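A minimal sketch of enabling the trace and turning it back off once debugging is done (standard SAS system options; the comments are assumptions about typical usage, not part of the original script):

```sas
/* Route detailed DBMS trace output to the SAS log */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

/* ... run the load tests here ... */

/* Disable tracing afterwards to keep production logs clean */
options sastrace=off;
```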

Attention: This code requires administrator privileges.
The script illustrates the use of `PROC APPEND` with the `SAS/ACCESS to Hadoop` interface. It executes three distinct tests. The first uses `BULKLOAD=YES` for a fast load of the `sashelp.cars` table. The second performs the same operation without `BULKLOAD` for comparison. The third test shows how to load data while specifying an underlying storage format, here Parquet, in Hadoop. Each test is followed by a cleanup step via `PROC SQL` to drop the created table, making the script re-executable.
Data Analysis

Type: MIXED


The data source is the `sashelp.cars` table, an example table shipped with SAS. The destination is an external Hadoop database, connected via a `LIBNAME` statement. The script does not read external data; it only writes to Hadoop.
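To inspect the source table before loading it, standard SAS procedures can be used (this snippet is illustrative and not part of the original script):

```sas
/* List the variables and observation count of the source table */
proc contents data=sashelp.cars;
run;

/* Preview the first five rows */
proc print data=sashelp.cars(obs=5);
run;
```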

Code Block 1: LIBNAME
Explanation:
Enables tracing options (`sastrace`) to record detailed information about the interaction with the database in the SAS log, then defines a connection to a Hadoop server via the SAS/ACCESS to Hadoop interface.

options sastrace=',,,d' sastraceloc=saslog nostsuffix;
LIBNAME mycdh hadoop server="quickstart.cloudera" user=cloudera password=cloudera;
Code Block 2: PROC APPEND Data
Explanation:
This block loads data from the `sashelp.cars` table into a new `cars` table on the Hadoop server. The `bulkload=yes` option activates bulk loading mode, optimized for large-volume data transfers. The table is then dropped with `PROC SQL` to clean up the environment.

PROC APPEND base=mycdh.cars (bulkload=yes)
            DATA=sashelp.cars;
RUN;

PROC SQL;
    drop TABLE mycdh.cars;
QUIT;
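To confirm the load landed before dropping the table, a row count can be run against the Hadoop library (a sketch assuming the `mycdh` libref defined above; not part of the original script):

```sas
/* Count the rows written to Hadoop before cleanup */
proc sql;
    select count(*) as nrows
        from mycdh.cars;
quit;
```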
Code Block 3: PROC APPEND Data
Explanation:
This block performs the same load as the previous one but without the `bulkload=yes` option. This allows comparison of the performance difference between a standard load (potentially row-by-row) and a bulk load. The table is then dropped.

PROC APPEND base=mycdh.cars
            DATA=sashelp.cars;
RUN;

PROC SQL;
    drop TABLE mycdh.cars;
QUIT;
Code Block 4: PROC APPEND Data
Explanation:
This block loads the data again, but uses the `dbcreate_table_opts` option to pass a specific instruction to Hadoop when creating the table. Here, it requests that the table be stored in the Parquet file format, a highly performant columnar storage format. The table is finally dropped.

PROC APPEND base=mycdh.cars (dbcreate_table_opts='stored as parquetfile')
            DATA=sashelp.cars;
RUN;

PROC SQL;
    drop TABLE mycdh.cars;
QUIT;
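To make the bulk-load vs. standard-load comparison measurable, the core SAS `FULLSTIMER` option prints real and CPU time for each step in the log (a sketch; this option is general SAS and not specific to this script):

```sas
/* Print detailed real/CPU timings for every step in the log */
options fullstimer;

/* Re-run the PROC APPEND tests above and compare the
   "real time" lines reported for each step in the log */
```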
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.