A common misconception persists among long-time SAS® users when faced with Big Data: "To analyze my Hadoop data with SAS®, I must first extract it from HDFS, convert it into local SAS® tables, and then run my PROC SQL."
This approach (classic ETL) is not only inefficient, but it also negates the main advantage of Hadoop: distributed parallel processing. If you move petabytes of data to a single SAS® server (Compute Server), you create a massive bottleneck.
The modern answer (since SAS® 9.4) is based on the principle of "In-Database Processing": move the code to the data, not the other way around. Here are the three key technologies for integrating SAS® and Hadoop without losing parallelization.
1. PROC FEDSQL: The New Generation SQL
If you are used to PROC SQL, you know it has limitations in a distributed environment: it tends to pull the data back to the SAS® server to perform joins or sorts locally.
The solution: PROC FEDSQL
Introduced with SAS® 9.4, FedSQL is a proprietary SAS® implementation of the ANSI SQL:1999 core standard, designed for scalable processing.
How it works: FedSQL acts as an intelligent translator. When targeting Hadoop (via Hive or Impala), it attempts to translate your query into the native dialect (HiveQL or Impala SQL) and to execute it directly on the cluster (see the sketch at the end of this section).
Advantage: Joins, filters, and aggregations are executed by the Hadoop nodes in parallel. Only the final result (often small) is returned to SAS®.
Connectivity: Supports Hive, HAWQ, and Impala.
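As an illustration, here is a minimal sketch of that pattern. The server, schema, and table names are hypothetical, and the LIBNAME assumes SAS/ACCESS Interface to Hadoop is licensed and configured:

/* Hive library via SAS/ACCESS Interface to Hadoop (names are hypothetical) */
libname hdp hadoop server="hive-node01" schema="sales";

proc fedsql;
   /* The join and the aggregation are translated to HiveQL and executed on   */
   /* the cluster when possible; only the aggregated result returns to SAS.   */
   create table work.revenue_by_region as
      select c.region, sum(t.amount) as total_amount
      from hdp.transactions t
           inner join hdp.customers c
              on t.customer_id = c.customer_id
      group by c.region;
quit;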
2. The DS2 Language and the "Code Accelerator"
The classic SAS® DATA step is sequential and single-threaded (it processes one row after another on a single processor), so it cannot exploit the distributed nature of Hadoop.
The solution: The DS2 Language
DS2 is the multithreaded successor to the DATA step. It includes object-oriented programming concepts and strong typing, but most importantly, it is designed for parallel execution.
SAS® In-Database Code Accelerator: This is the magic component. It lets you take a DS2 program written in your SAS® interface and "push" it to be executed inside the Hadoop data nodes (as MapReduce jobs or via Spark, depending on the version).
Data Program & Thread Program: The thread program processes rows in parallel on each node of the cluster, while the data program launches the threads and collects their output, leveraging all available CPU power on your Big Data infrastructure (see the sketch below).
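To make this split concrete, here is a minimal DS2 sketch. The library, table, and column names are hypothetical, and the DS2ACCEL=YES option requests in-database execution through the Code Accelerator where it is licensed:

proc ds2 ds2accel=yes;
   /* Thread program: each instance processes its share of the rows in parallel */
   thread score_th / overwrite=yes;
      dcl double risk_score;
      method run();
         set hdp.transactions;          /* hypothetical table stored in Hive */
         risk_score = amount * 0.01;    /* simple illustrative rule          */
         output;
      end;
   endthread;
   run;

   /* Data program: declares the thread and collects its output rows */
   data hdp.scored_transactions / overwrite=yes;
      dcl thread score_th t;
      method run();
         set from t;
         output;
      end;
   enddata;
   run;
quit;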
3. SAS® SPD Engine (SPDE) on HDFS
Sometimes, you need the physical structure of a SAS® table (indexing, compression, precise metadata) but you want to store this data on the distributed file system (HDFS) for reliability and I/O speed.
The solution: Scalable Performance Data Engine (SPDE)
The SPD Engine lets you store tables in SAS® format directly on HDFS.
It partitions the data (data files .dpf, index files .hbx, metadata .mdf).
This allows SAS® to read and write in parallel from multiple nodes, offering much higher I/O throughput than classic disk storage, while keeping the data "accessible" as if it were local (see the sketch below).
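Here is a minimal sketch of such a library, assuming the SAS session already points at the Hadoop client configuration (typically via SAS_HADOOP_CONFIG_PATH); the HDFS path and table names are hypothetical:

/* SPD Engine library whose primary path is a directory in HDFS */
libname spdehdfs spde '/user/sasdemo/spde' hdfshost=default;

/* Write a compressed SPDE table into HDFS */
data spdehdfs.transactions (compress=yes);
   set work.transactions_staging;
run;

/* Read it back like any other SAS library */
proc means data=spdehdfs.transactions;
   var amount;
run;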
Do not perform massive ETL to SAS®. Use FedSQL and DS2 to push the processing logic to the cluster. This way, you will retain the advantage of Hadoop parallelism while coding from your familiar SAS® environment.