Your data is partitioned across multiple nodes (workers) and threads.
The Data Step executes in parallel on each thread.
Each thread has its own local instance of the Hash Table.
The problem: if the duplicate of key "A" lands on Thread 1 and the original on Thread 2, the two Hash Tables never communicate. Each thread sees the key only once and treats it as unique. Result: duplicates persist in the final consolidated table.
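To make the failure concrete, here is a minimal sketch of the hash-based pattern described above, reusing the table and key names from the examples in this article. The hash object name `seen` and the extra column `thread_id` are illustrative choices, not part of any original code; `_threadid_` is the automatic variable that identifies the thread running the step in CAS.

```sas
/* Anti-pattern in CAS: each thread builds its own private hash table */
data casuser.ma_table_dedoublonnee;
    /* _N_ restarts at 1 on every thread, so one hash object is created per thread */
    if _n_ = 1 then do;
        declare hash seen();
        seen.defineKey('var_cle1', 'var_cle2');
        seen.defineDone();
    end;

    set casuser.ma_table_source;

    thread_id = _threadid_;   /* illustrative: records which thread kept the row */

    /* check() returns 0 only when the key is already in THIS thread's table */
    if seen.check() ne 0 then do;
        seen.add();
        output;
    end;
run;
```

If the same key pair shows up in the output with two different `thread_id` values, that is the symptom described above: each thread kept its own "first" occurrence.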
Solution 1: The native method (Recommended)
The most efficient way to deduplicate in Viya™ is to use the dedicated CAS action, deduplicate. It is optimized for the distributed engine and handles the data shuffling needed to compare records across nodes.
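A call to that action could look roughly like the sketch below. It reuses the table and key names from the other examples in this article. The parameter names (`groupBy`, `casOut`, `noDuplicateKeys`) are written from memory of the deduplication action set, so treat them as assumptions and check them against the documentation for your Viya™ release.

```sas
proc cas;
    /* Load the action set first (assumed to be available in your deployment) */
    loadactionset "deduplication";

    deduplication.deduplicate /
        /* groupBy lists the key columns that define a duplicate */
        table={caslib="casuser", name="ma_table_source",
               groupBy={"var_cle1", "var_cle2"}}
        casOut={caslib="casuser", name="ma_table_dedoublonnee", replace=true}
        /* keep a single row per key combination */
        noDuplicateKeys=true;
quit;
```

Because the keys are passed as a grouping on the input table, the server can bring matching rows together before comparing them, which is exactly what the per-thread Hash Table cannot do.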
Contrary to popular belief, the good old PROC SORT with the NODUPKEY option is very efficient in CAS.
The CAS engine intercepts the PROC SORT syntax and translates it into optimized distributed operations. It does not bring the data locally to sort it; everything happens in memory on the cluster.
proc sort data=casuser.ma_table_source out=casuser.ma_table_dedoublonnee nodupkey;
    by var_cle1 var_cle2;
run;
Verdict: Tests show that this method is almost as fast as the deduplicate action. It is often the best choice for migrating existing code without complex re-writing.
If you absolutely want to use a Data Step, you must abandon the Hash Table logic and go back to classic BY-group processing with FIRST.variable.
Why does this work without prior sorting?
In CAS, the BY statement forces a data reorganization: the controller ensures that all rows with the same BY key are sent to the same thread (grouping).
data casuser.ma_table_dedoublonnee;
    set casuser.ma_table_source;
    by var_cle1 var_cle2;
    if first.var_cle2; /* Keep the first occurrence */
run;
Caution: Although functional, this method involves massive data movement (network shuffle) to group keys, which can be less performant than the dedicated action for very large volumes.
Single-threaded: Forcing the Data Step to run on a single thread (the / single=yes option on the DATA statement) would solve the Hash Table problem, but it would wreck performance and defeat the entire purpose of Viya™'s parallel processing.
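For completeness, this is roughly what the single-threaded variant looks like. It is a sketch, not a recommendation: the hash object name `seen` is illustrative, and the table and key names are the ones used in the other examples in this article.

```sas
/* Works, but the whole step runs on one thread: no parallelism */
data casuser.ma_table_dedoublonnee / single=yes;
    /* With a single thread there is only one hash table, so it sees every key */
    if _n_ = 1 then do;
        declare hash seen();
        seen.defineKey('var_cle1', 'var_cle2');
        seen.defineDone();
    end;

    set casuser.ma_table_source;

    /* check() returns a non-zero code when the key has not been stored yet */
    if seen.check() ne 0 then do;
        seen.add();
        output;
    end;
run;
```

The result is correct because every row passes through the same hash table, but all the data has to funnel through that one thread, which is the performance cost described above.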
The codes and examples provided on WeAreCAS.eu are for educational purposes. It is imperative not to blindly copy-paste them into your production environments. The best approach is to understand the logic before applying it. We strongly recommend testing these scripts in a test environment (Sandbox/Dev). WeAreCAS accepts no responsibility for any impact or data loss on your systems.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. WeAreCAS is an independent community site and is not affiliated with SAS Institute Inc.