Unlocking ADLS Gen2: The Secret to Reading Distributed ORC Tables with SAS

Difficulty Level

Beginner

Published on : 02/03/2024

Expert Advice

Michael
Responsable de l'infrastructure Viya.

This is standard behavior for big data formats (ORC, Parquet) generated by distributed systems (Spark, Hive). Always check if your source "file" is actually a folder before coding your connection.

You must explicitly tell SAS when a directory is actually a table.

This example illustrates reading an ORC table stored in Azure Data Lake Storage Gen2. To access tables created by Hive and stored as directories, the DIRECTORIES_AS_DATA=YES option must be specified in the LIBNAME statement. The presented SAS^© code reads an 'inventory' table previously created in Hive.

Data Analysis

Type : EXTERNAL

The SAS example reads an 'inventory' table in ORC format which must be pre-created in Azure ADLS. Hive commands for creating this table are provided as prerequisites in the first code block. The SAS example does not create its own internal data.

1 Code Block

HiveQL Data

Explanation :
This HiveQL code block is a prerequisite for the SAS example. It creates a table named 'inventory' in ORC format in the '/mydata/inventory' path of Azure ADLS and inserts some records. This table will then be read by the SAS code. This code must be executed in an appropriate Hive environment before executing the SAS code.

Copied!

1	CREATE TABLE inventory(item INT, descr CHAR(32), price DOUBLE, day DATE) STORED AS ORC LOCATION '/mydata/inventory';
2	INSERT INTO inventory VALUES(4600, 'incandescent A19', 15.54, '2019-10-17');
3	INSERT INTO inventory VALUES(1023, 'led A19', 14.97, '2019-10-17');
4	INSERT INTO inventory VALUES(1023, 'incandescent B10', 5.49, '2019-10-16');
5	INSERT INTO inventory VALUES(4219, 'incandescent E12', 3.97, '2019-10-15');
6	SELECT * FROM inventory;

2 Code Block

LIBNAME ORC

Explanation :
This LIBNAME `mylib` statement connects SAS to an Azure Data Lake Storage (ADLS) location using the ORC engine. It specifies the storage account name (`storage_account_name`), the storage application ID (`storage_application_id`), and the storage file system (`storage_file_system`). The `directories_as_data=yes` option is essential for SAS to recognize and access Hive tables stored as directories in ADLS. Finally, the PROC PRINT procedure is used to display the content of the ORC table `mylib.inventory`.

Copied!

1	LIBNAME mylib orc '/data/user' /1/
2	storage_account_name = myacct
3	storage_application_id = 'b1fc955d5c-e0e2-45b3-a3cc-a1cf54120f'
4	storage_file_system = myorcdata
5	directories_as_data=yes /2/
6	;
7	PROC PRINT DATA=mylib.inventory noobs; /3/
8	RUN;

Pro Tip

Hive and Spark often save tables not as a single file, but as a directory containing multiple "part-files" (e.g., part-00000, part-00001). By default, the SAS ORC engine looks for a specific file. You must use DIRECTORIES_AS_DATA=YES to force SAS to treat the folder name as the table name and logically concatenate all the part-files inside it into a single SAS dataset.

This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.

Unlocking ADLS Gen2: The Secret to Reading Distributed ORC Tables with SAS

Difficulty Level

Published on : 02/03/2024

Expert Advice

You must explicitly tell SAS when a directory is actually a table.

Data Analysis

1 Code Block

2 Code Block

Pro Tip

CAS vs. Compute Server: Where Should You Run Your FedSQL Code in SAS Viya

Under the Hood: Mastering SAS Engines and the Multi-Engine Architecture

The Big Four: Decoding the Differences Between Work, User, Sashelp, and Sasuser

Beyond Physical Files: Unlocking the Logical Power of the SAS Dataset Mode

Beyond the Basics: Mastering SAS Data Access Models and Record-Level Locking

SAS Table Structure: Optimize Your Metadata and Sorting (V9/SPD)

Related Documentation

CAS vs. Compute Server: Where Should You Run Your FedSQL Code in SAS Viya

The Big Four: Decoding the Differences Between Work, User, Sashelp, and Sasuser

Beyond Physical Files: Unlocking the Logical Power of the SAS Dataset Mode

Beyond the Basics: Mastering SAS Data Access Models and Record-Level Locking

SAS Table Structure: Optimize Your Metadata and Sorting (V9/SPD)