Data Step

Deduplication of CAS tables: Why your Hash Tables fail and how to correct it

Simon 10 vues

If you are a SAS© 9 expert, you probably use Hash Tables to efficiently deduplicate large volumes of data without sorting (PROC SORT). However, by migrating this code to SAS© Viya (CAS), you risk an unpleasant surprise: your output table still contains duplicates.

Why does this proven technique fail in the Cloud, and what are the effective alternatives?

The Trap: Hash Tables and Distributed Processing

In a classic SAS© 9 environment (SMP), the Data Step executes sequentially on a single processor. The hash table keeps track of unique keys seen since the beginning of the file.

In CAS (MPP - Massively Parallel Processing), execution is fundamentally different:

  1. Your data is partitioned across multiple nodes (workers) and threads.

  2. The Data Step executes in parallel on each thread.

  3. Each thread has its own local instance of the Hash Table.

The problem: If the duplicate of key "A" is on Thread 1 and the original is on Thread 2, their respective Hash Tables do not communicate. Each thread will think it has a unique key. Result: duplicates persist in the final consolidated table.


Solution 1: The native method (Recommended)

The most efficient way to deduplicate in Viya is to use the dedicated CAS action. It is optimized for the distributed engine and handles the data shuffling necessary to compare records between nodes.

Use the deduplication.deduplicate action via PROC CAS:

1PROC CAS;
2 deduplication.deduplicate /
3 TABLE={caslib="casuser", name="ma_table_source", groupBy={"var_cle1", "var_cle2"}}
4 noDuplicateKeys=true
5 casOut={name="ma_table_dedoublonnee", caslib="casuser", replace=true};
6QUIT;
Advantage: Maximum performance.

Note: The groupBy option defines the deduplication keys.

Solution 2: The PROC SORT surprise

Contrary to popular belief, the good old PROC SORT with the NODUPKEY option is very efficient in CAS.

The CAS engine intercepts the PROC SORT syntax and translates it into optimized distributed operations. It does not bring the data locally to sort it; everything happens in memory on the cluster.

1 
2PROC SORT
3DATA=casuser.ma_table_source out=casuser.ma_table_dedoublonnee nodupkey;
4 
5BY var_cle1 var_cle2;
6 
7RUN;
8 
Verdict: Tests show that this method is almost as fast as the deduplicate action. It is often the best choice for migrating existing code without complex re-writing.

Solution 3: The Data Step with BY group

If you absolutely want to use a Data Step, you must abandon the Hash Table logic and return to the sequential BY + FIRST. logic.

Why does this work without prior sorting? In CAS, the BY statement forces a data reorganization: the controller ensures that all rows with the same BY key are sent to the same thread (grouping).

1DATA casuser.ma_table_dedoublonnee;
2 SET casuser.ma_table_source;
3 BY var_cle1 var_cle2;
4 IF first.var_cle2; /* Garde la première occurrence */
5RUN;
Caution: Although functional, this method involves massive data movement (network shuffle) to group keys, which can be less performant than the dedicated action for very large volumes.

Single-threaded: Forcing execution on a single thread (/single=yes) would solve the Hash Table problem, but would kill performance by negating the entire purpose of Viya's parallel processing.

Summary

MethodPerformanceComplexityVerdict
Hash TablesHighMediumTo avoid for deduplication (risk of logical error in distributed processing).
CAS Action (deduplicate)⭐⭐⭐MediumThe Best for native CASL scripts.
PROC SORT (nodupkey)⭐⭐½LowIdeal for compatibility and simplicity.
Data Step (By Group)⭐⭐Low⚠️ Functional, but beware of network cost.