Unlocking ADLS Gen2: The Secret to Reading Distributed ORC Tables with SAS

This code is also available in: Deutsch Español Français
Difficulty Level
Beginner
Published on :
Michael

Expert Advice

Michael
Responsable de l'infrastructure Viya.

This is standard behavior for big data formats (ORC, Parquet) generated by distributed systems (Spark, Hive). Always check if your source "file" is actually a folder before coding your connection.

You must explicitly tell SAS when a directory is actually a table.
This example illustrates reading an ORC table stored in Azure Data Lake Storage Gen2. To access tables created by Hive and stored as directories, the DIRECTORIES_AS_DATA=YES option must be specified in the LIBNAME statement. The presented SAS© code reads an 'inventory' table previously created in Hive.
Data Analysis

Type : EXTERNAL


The SAS example reads an 'inventory' table in ORC format which must be pre-created in Azure ADLS. Hive commands for creating this table are provided as prerequisites in the first code block. The SAS example does not create its own internal data.

1 Code Block
HiveQL Data
Explanation :
This HiveQL code block is a prerequisite for the SAS example. It creates a table named 'inventory' in ORC format in the '/mydata/inventory' path of Azure ADLS and inserts some records. This table will then be read by the SAS code. This code must be executed in an appropriate Hive environment before executing the SAS code.
Copied!
1CREATE TABLE inventory(item INT, descr CHAR(32), price DOUBLE, day DATE) STORED AS ORC LOCATION '/mydata/inventory';
2INSERT INTO inventory VALUES(4600, 'incandescent A19', 15.54, '2019-10-17');
3INSERT INTO inventory VALUES(1023, 'led A19', 14.97, '2019-10-17');
4INSERT INTO inventory VALUES(1023, 'incandescent B10', 5.49, '2019-10-16');
5INSERT INTO inventory VALUES(4219, 'incandescent E12', 3.97, '2019-10-15');
6SELECT * FROM inventory;
2 Code Block
LIBNAME ORC
Explanation :
This LIBNAME `mylib` statement connects SAS to an Azure Data Lake Storage (ADLS) location using the ORC engine. It specifies the storage account name (`storage_account_name`), the storage application ID (`storage_application_id`), and the storage file system (`storage_file_system`). The `directories_as_data=yes` option is essential for SAS to recognize and access Hive tables stored as directories in ADLS. Finally, the PROC PRINT procedure is used to display the content of the ORC table `mylib.inventory`.
Copied!
1LIBNAME mylib orc '/data/user' /*1*/
2 storage_account_name = myacct
3 storage_application_id = 'b1fc955d5c-e0e2-45b3-a3cc-a1cf54120f'
4 storage_file_system = myorcdata
5 directories_as_data=yes /*2*/
6 ;
7PROC PRINT DATA=mylib.inventory noobs; /*3*/
8RUN;
Pro Tip
Hive and Spark often save tables not as a single file, but as a directory containing multiple "part-files" (e.g., part-00000, part-00001). By default, the SAS ORC engine looks for a specific file. You must use DIRECTORIES_AS_DATA=YES to force SAS to treat the folder name as the table name and logically concatenate all the part-files inside it into a single SAS dataset.
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.