With the SAS© Viya™ architecture, Data Step execution evolves from sequential processing to massively parallel processing within the CAS (Cloud Analytic Services) server. To take full advantage of this power, you need to understand the SPMD (Single Program, Multiple Data) paradigm and master a few configuration settings.
1. The Execution Paradigm: SPMD
Unlike Base SAS©, which executes code on a single processor, CAS uses the SPMD approach: the Data Step code is duplicated and sent simultaneously to each "Worker Node" and each "Thread", where it runs against that thread's partition of the data.
To ensure execution in CAS (and not a costly transfer back to the SAS© client), two conditions are mandatory:
The DSACCEL=ANY option must be active (this is the default).
All tables (input and output) must reside in a CAS library (LIBNAME ... CAS).
A point of caution: if even one table in the process (for example, in a SET or MERGE statement) is local (e.g., SASHELP.CARS or WORK.DATA), the entire step falls back to the client side (the SAS© Workspace Server), negating the performance gains of the cluster.
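As a minimal sketch of both conditions (the session name mysess, the caslib CASUSER, the libref mycas, and the table names are placeholders), the following program runs entirely in CAS; with MSGLEVEL=I, the log states explicitly where each step executed:

```sas
cas mysess;                          /* start a CAS session                          */
options dsaccel=any msglevel=i;      /* allow CAS execution; log where the step ran  */

libname mycas cas caslib=casuser;    /* libref bound to the CAS engine               */

data mycas.cars_out;                 /* output table resides in CAS                  */
   set mycas.cars_in;                /* input table resides in CAS:                  */
   avg_mpg = mean(mpg_city,          /* the step executes on the worker nodes        */
                  mpg_highway);
run;
```

If the log reports that the step ran in the Workspace Server instead, check that every table referenced by the step is addressed through a CAS libref.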
2. The Performance Lever: The COPIES Parameter
In a distributed environment, redundancy management is crucial for fault tolerance, but it comes at a cost in terms of network traffic (Data Movement).
COPIES=1 (Default behavior): CAS creates a backup copy of data blocks on a neighboring node. This generates significant inter-node traffic during writing.
COPIES=0: Data is written locally on the node that processes it, without immediate replication.
Expert Recommendation:
For intermediate tables (which are not persistent final results), always set COPIES=0. On large volumes (e.g., 100 million rows), this can cut execution time by a factor of 5 by eliminating the network bottleneck.
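COPIES= can be specified as a data set option on the output table of the Data Step. A sketch (mycas and the table names are placeholders):

```sas
/* intermediate table: no redundant copy, blocks stay local to each worker */
data mycas.stage1 (copies=0);
   set mycas.transactions;
   where amount > 0;
run;

/* final, persistent result: keep one backup copy for fault tolerance */
data mycas.final_report (copies=1);
   set mycas.stage1;
run;
```

Keep in mind the trade-off: if a worker node fails, a COPIES=0 table cannot be recovered from replicas, so reserve this option for tables you can cheaply recreate.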
3. Memory and Disk Cache Management (MaxTableMem)
The balance between RAM (fast) and disk (slow) is the heart of performance in CAS. Although CAS is an "In-Memory" technology, it relies on a disk cache mechanism (CAS_DISK_CACHE) when RAM is saturated or for temporary persistence.
The MaxTableMem parameter defines the memory threshold beyond which CAS starts using memory-mapped files backed by the disk, rather than pure RAM.
To delve deeper into the internal mechanics of these allocations, two reference resources are essential:
Understanding CAS Storage: For a detailed view of the block structure and the interaction between memory and disk, consult Nicolas Housset's article on In-Memory data management in CAS. He details how CAS segments data to optimize parallel access.
The role of CAS_DISK_CACHE: Contrary to popular belief, the disk cache is not just an overflow space (swap). It plays an active role in managing loaded tables. The technical article When is CAS_DISK_CACHE used? on the SAS© community precisely explains the scenarios that trigger its use (initial loading, memory overflow, failover).
Technical Tip: If your storage infrastructure (hosting the CAS_DISK_CACHE) is slow (HDD vs SSD) or your network is limited (1GbE), increasing the value of MaxTableMem can force the use of RAM and reduce disk I/O, thereby significantly improving performance.
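MAXTABLEMEM is a CAS session option and can be raised for the current session with the CAS statement. A sketch, assuming a session named mysess (the 4G value is illustrative; the default is small, on the order of 16M per table):

```sas
/* raise the per-table memory threshold before blocks spill to CAS_DISK_CACHE */
cas mysess sessopts=(maxtablemem="4G");

/* display the session options currently in effect */
cas mysess listsessopts;
```

This setting applies per table and per session; raising it trades CAS_DISK_CACHE I/O for RAM pressure, so size it against the memory actually available on the worker nodes.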
4. Functional Specifics in CAS
Coding in Data Step for CAS involves some adaptations compared to traditional SAS©:
RETAIN Statement: As data is partitioned, the RETAIN statement only works within the boundaries of a thread. It is impossible to calculate a global cumulative sum across the entire table in a single, simple Data Step pass.
Data Types: CAS introduces the VARCHAR type (variable length), which is more memory-efficient than the traditional CHAR (fixed length) for long strings.
BY Statement: Group processing requires that the data be correctly partitioned on the nodes beforehand (via partition= or groupby=), otherwise the implicit sort can be costly.
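The per-thread scope of RETAIN can be made visible with the automatic variables _THREADID_ and _NTHREADS_ that the CAS Data Step provides. A sketch (mycas and the table names are placeholders):

```sas
/* each thread accumulates over its own partition of rows only */
data mycas.partials;
   set mycas.sales;
   retain running_total 0;
   running_total + amount;       /* per-thread running total, not global     */
   thread = _threadid_;          /* identifies which thread processed the row */
run;

/* a true global total needs a single-threaded pass (losing parallelism)... */
data mycas.global_total / single=yes;
   set mycas.sales end=last;
   retain total 0;
   total + amount;
   if last then output;
run;
/* ...or, preferably, an aggregation such as PROC MEANS/aggregate actions */
```

The SINGLE=YES option forces the step onto one thread of one worker, which restores classic sequential semantics at the cost of the cluster's parallelism; for large tables, an aggregation action is the better pattern.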