With the SAS© Viya™ architecture, Data Step execution evolves from sequential processing to massively parallel processing within the CAS (Cloud Analytic Services) server. To take full advantage of this power, you need to understand the SPMD (Single Program, Multiple Data) paradigm and master a few configuration settings.
1. The Execution Paradigm: SPMD
Unlike Base SAS©, which executes code on a single processor, CAS uses the SPMD approach: the Data Step code is duplicated and sent simultaneously to each "Worker Node" and each "Thread", where it runs against that thread's partition of the data.
To ensure execution in CAS (and not a costly transfer back to the SAS© client), two conditions are mandatory:
The DSACCEL=ANY option must be active (this is the default).
All tables (input and output) must reside in a CAS library (LIBNAME ... CAS).
A point of caution: if even one table in the process (for example, in a SET or MERGE statement) is local (e.g., SASHELP.CARS or WORK.DATA), the entire step falls back to the client side (the SAS© Workspace Server), negating the performance gains of the cluster.
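As a minimal sketch of both conditions (the session name mysess, the caslib CASUSER, the libref mycas, and the table names are placeholders), the following program runs entirely in CAS; with MSGLEVEL=I, the log states explicitly where each step executed:

```sas
cas mysess;                          /* start a CAS session                          */
options dsaccel=any msglevel=i;      /* allow CAS execution; log where the step ran  */

libname mycas cas caslib=casuser;    /* libref bound to the CAS engine               */

data mycas.cars_out;                 /* output table resides in CAS                  */
   set mycas.cars_in;                /* input table resides in CAS:                  */
   avg_mpg = mean(mpg_city,          /* the step executes on the worker nodes        */
                  mpg_highway);
run;
```

If the log reports that the step ran in the Workspace Server instead, check that every table referenced by the step is addressed through a CAS libref.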
2. The Performance Lever: The COPIES Parameter
In a distributed environment, redundancy management is crucial for fault tolerance, but it comes at a cost in terms of network traffic (Data Movement).
COPIES=1 (Default behavior): CAS creates a backup copy of data blocks on a neighboring node. This generates significant inter-node traffic during writing.
COPIES=0: Data is written locally on the node that processes it, without immediate replication.
Expert Recommendation:
For intermediate tables (which are not persistent final results), always set COPIES=0. On large volumes (e.g., 100 million rows), this can cut execution time by a factor of 5 by eliminating the network bottleneck.
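COPIES= can be specified as a data set option on the output table of the Data Step. A sketch (mycas and the table names are placeholders):

```sas
/* intermediate table: no redundant copy, blocks stay local to each worker */
data mycas.stage1 (copies=0);
   set mycas.transactions;
   where amount > 0;
run;

/* final, persistent result: keep one backup copy for fault tolerance */
data mycas.final_report (copies=1);
   set mycas.stage1;
run;
```

Keep in mind the trade-off: if a worker node fails, a COPIES=0 table cannot be recovered from replicas, so reserve this option for tables you can cheaply recreate.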
3. Memory and Disk Cache Management (MaxTableMem)
The balance between RAM (fast) and disk (slow) is the heart of performance in CAS. Although CAS is an "In-Memory" technology, it relies on a disk cache mechanism (CAS_DISK_CACHE) when RAM is saturated or for temporary persistence.
The MaxTableMem parameter defines the memory threshold beyond which CAS starts using memory-mapped files backed by the disk, rather than pure RAM.
To delve deeper into the internal mechanics of these allocations, two reference resources are essential:
Understanding CAS Storage: For a detailed view of the block structure and the interaction between memory and disk, consult Nicolas Housset's article on In-Memory data management in CAS. He details how CAS segments data to optimize parallel access.
The role of CAS_DISK_CACHE: Contrary to popular belief, the disk cache is not just an overflow space (swap). It plays an active role in managing loaded tables. The technical article When is CAS_DISK_CACHE used? on the SAS© community precisely explains the scenarios that trigger its use (initial loading, memory overflow, failover).
Technical Tip: If your storage infrastructure (hosting the CAS_DISK_CACHE) is slow (HDD vs SSD) or your network is limited (1GbE), increasing the value of MaxTableMem can force the use of RAM and reduce disk I/O, thereby significantly improving performance.
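MAXTABLEMEM is a CAS session option and can be raised for the current session with the CAS statement. A sketch, assuming a session named mysess (the 4G value is illustrative; the default is small, on the order of 16M per table):

```sas
/* raise the per-table memory threshold before blocks spill to CAS_DISK_CACHE */
cas mysess sessopts=(maxtablemem="4G");

/* display the session options currently in effect */
cas mysess listsessopts;
```

This setting applies per table and per session; raising it trades CAS_DISK_CACHE I/O for RAM pressure, so size it against the memory actually available on the worker nodes.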
4. Functional Specifics in CAS
Coding in Data Step for CAS involves some adaptations compared to traditional SAS©:
RETAIN Statement: As data is partitioned, the RETAIN statement only works within the boundaries of a thread. It is impossible to calculate a global cumulative sum across the entire table in a single, simple Data Step pass.
Data Types: CAS introduces the VARCHAR type (variable length), which is more memory-efficient than the traditional CHAR (fixed length) for long strings.
BY Statement: Group processing requires that the data be correctly partitioned on the nodes beforehand (via partition= or groupby=), otherwise the implicit sort can be costly.
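The per-thread scope of RETAIN can be made visible with the automatic variables _THREADID_ and _NTHREADS_ that the CAS Data Step provides. A sketch (mycas and the table names are placeholders):

```sas
/* each thread accumulates over its own partition of rows only */
data mycas.partials;
   set mycas.sales;
   retain running_total 0;
   running_total + amount;       /* per-thread running total, not global     */
   thread = _threadid_;          /* identifies which thread processed the row */
run;

/* a true global total needs a single-threaded pass (losing parallelism)... */
data mycas.global_total / single=yes;
   set mycas.sales end=last;
   retain total 0;
   total + amount;
   if last then output;
run;
/* ...or, preferably, an aggregation such as PROC MEANS/aggregate actions */
```

The SINGLE=YES option forces the step onto one thread of one worker, which restores classic sequential semantics at the cost of the cluster's parallelism; for large tables, an aggregation action is the better pattern.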