A common misconception persists among long-time SAS® users when faced with Big Data: "To analyze my Hadoop data with SAS®, I must first extract it from HDFS, convert it into local SAS® tables, and then run my PROC SQL."
This approach (classic ETL) is not only inefficient, but it also negates the main advantage of Hadoop: distributed parallel processing. If you move petabytes of data to a single SAS® server (Compute Server), you create a massive bottleneck.
The modern answer (since SAS® 9.4) is based on the principle of "In-Database Processing": move the code to the data, not the other way around. Here are the three key technologies for integrating SAS® and Hadoop without losing parallelization.
1. PROC FEDSQL: The New Generation SQL
If you are used to PROC SQL, you know it has limitations in a distributed environment: it tends to pull the data back to the SAS® server to perform joins or sorts locally.
The solution: PROC FEDSQL
Introduced with SAS® 9.4, FedSQL is a proprietary SAS® implementation of the ANSI SQL:1999 core standard, designed for scalable processing.
How it works: FedSQL acts as an intelligent translator. When targeting Hadoop (via Hive or Impala), it attempts to translate your query into the native dialect (HiveQL or Impala SQL) and to execute it directly on the cluster (see the sketch at the end of this section).
Advantage: Joins, filters, and aggregations are executed by the Hadoop nodes in parallel. Only the final result (often small) is returned to SAS®.
Connectivity: Supports Hive, HAWQ, and Impala.
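As an illustration, here is a minimal sketch of that pattern. The server, schema, and table names are hypothetical, and the LIBNAME assumes SAS/ACCESS Interface to Hadoop is licensed and configured:

/* Hive library via SAS/ACCESS Interface to Hadoop (names are hypothetical) */
libname hdp hadoop server="hive-node01" schema="sales";

proc fedsql;
   /* The join and the aggregation are translated to HiveQL and executed on   */
   /* the cluster when possible; only the aggregated result returns to SAS.   */
   create table work.revenue_by_region as
      select c.region, sum(t.amount) as total_amount
      from hdp.transactions t
           inner join hdp.customers c
              on t.customer_id = c.customer_id
      group by c.region;
quit;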
2. The DS2 Language and the "Code Accelerator"
The classic SAS® DATA step is sequential and single-threaded (it processes one row after another on a single processor), so it cannot exploit the distributed nature of Hadoop.
The solution: The DS2 Language
DS2 is the multithreaded successor to the DATA step. It includes object-oriented programming concepts and strong typing, but most importantly, it is designed for parallel execution.
SAS® In-Database Code Accelerator: This is the magic component. It lets you take a DS2 program written in your SAS® interface and "push" it to be executed inside the Hadoop data nodes (as MapReduce jobs or via Spark, depending on the version).
Data Program & Thread Program: The thread program processes rows in parallel on each node of the cluster, while the data program launches the threads and collects their output, leveraging all available CPU power on your Big Data infrastructure (see the sketch below).
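To make this split concrete, here is a minimal DS2 sketch. The library, table, and column names are hypothetical, and the DS2ACCEL=YES option requests in-database execution through the Code Accelerator where it is licensed:

proc ds2 ds2accel=yes;
   /* Thread program: each instance processes its share of the rows in parallel */
   thread score_th / overwrite=yes;
      dcl double risk_score;
      method run();
         set hdp.transactions;          /* hypothetical table stored in Hive */
         risk_score = amount * 0.01;    /* simple illustrative rule          */
         output;
      end;
   endthread;
   run;

   /* Data program: declares the thread and collects its output rows */
   data hdp.scored_transactions / overwrite=yes;
      dcl thread score_th t;
      method run();
         set from t;
         output;
      end;
   enddata;
   run;
quit;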
3. SAS® SPD Engine (SPDE) on HDFS
Sometimes, you need the physical structure of a SAS® table (indexing, compression, precise metadata) but you want to store this data on the distributed file system (HDFS) for reliability and I/O speed.
The solution: Scalable Performance Data Engine (SPDE)
The SPD Engine lets you store tables in SAS® format directly on HDFS.
It partitions the data (data files .dpf, index files .hbx, metadata .mdf).
This allows SAS® to read and write in parallel from multiple nodes, offering much higher I/O throughput than classic disk storage, while keeping the data "accessible" as if it were local (see the sketch below).
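Here is a minimal sketch of such a library, assuming the SAS session already points at the Hadoop client configuration (typically via SAS_HADOOP_CONFIG_PATH); the HDFS path and table names are hypothetical:

/* SPD Engine library whose primary path is a directory in HDFS */
libname spdehdfs spde '/user/sasdemo/spde' hdfshost=default;

/* Write a compressed SPDE table into HDFS */
data spdehdfs.transactions (compress=yes);
   set work.transactions_staging;
run;

/* Read it back like any other SAS library */
proc means data=spdehdfs.transactions;
   var amount;
run;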
Do not perform massive ETL to SAS®. Use FedSQL and DS2 to push the processing logic to the cluster. This way, you will retain the advantage of Hadoop parallelism while coding from your familiar SAS® environment.