Your data is partitioned across multiple nodes (workers) and threads.
The Data Step executes in parallel on each thread.
Each thread has its own local instance of the Hash Table.
The problem: If the duplicate of key "A" is on Thread 1 and the original is on Thread 2, their respective Hash Tables do not communicate. Each thread will think it has a unique key. Result: duplicates persist in the final consolidated table.
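For illustration, the pattern that breaks looks roughly like the sketch below. It reuses the table and key names from the examples later in this article; the hash object name `seen` is hypothetical, and `casuser` is assumed to be a libref bound to a CAS session.

```sas
/* Sketch only: hash-based deduplication running in CAS.            */
/* Each thread builds its own private copy of the hash table, so a  */
/* key seen on one thread is unknown to every other thread.         */
data casuser.ma_table_dedoublonnee;
    set casuser.ma_table_source;
    if _n_ = 1 then do;
        /* Executed once per thread, not once for the whole table */
        declare hash seen();
        seen.defineKey('var_cle1', 'var_cle2');
        seen.defineDone();
    end;
    /* check() returns 0 when the key is already in THIS thread's table */
    if seen.check() ne 0 then do;
        seen.add();
        output;
    end;
run;
```

Run on a single thread, this keeps one row per key. Run across several threads, a key whose rows are spread over two threads is output once by each of them.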
Solution 1: The native method (Recommended)
The most efficient way to deduplicate in Viya™ is to use the dedicated CAS action, deduplicate. It is optimized for the distributed engine and handles the data shuffling needed to compare records across nodes.
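A call to the action might look like the following sketch. It assumes the deduplication action set is available in your Viya release and reuses the table and key names from the other examples in this article; check the parameter names against the action's documentation for your version before relying on them.

```sas
/* Sketch only: deduplicate on var_cle1 / var_cle2 with the CAS action. */
/* Assumes an active CAS session and the deduplication action set.      */
proc cas;
    loadactionset "deduplication";

    deduplication.deduplicate /
        table={caslib="casuser", name="ma_table_source",
               groupBy={"var_cle1", "var_cle2"}},   /* keys that define a duplicate */
        casOut={caslib="casuser", name="ma_table_dedoublonnee",
                replace=true},                      /* overwrite the output if it exists */
        noDuplicateKeys=true;                       /* keep one row per key */
quit;
```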
Contrary to popular belief, the good old PROC SORT with the NODUPKEY option is also very efficient in CAS.
The CAS engine intercepts the PROC SORT syntax and translates it into optimized distributed operations. It does not bring the data locally to sort it; everything happens in memory on the cluster.
proc sort data=casuser.ma_table_source out=casuser.ma_table_dedoublonnee nodupkey;
by var_cle1 var_cle2;
run;
Verdict: Tests show that this method is almost as fast as the deduplicate action. It is often the best choice for migrating existing code without complex re-writing.
If you absolutely want to use a Data Step, you must abandon the Hash Table logic and return to the classic BY + FIRST.variable logic.
Why does this work without prior sorting?
In CAS, the BY statement forces a data reorganization: the controller ensures that all rows with the same BY key are sent to the same thread (grouping).
data casuser.ma_table_dedoublonnee;
set casuser.ma_table_source;
by var_cle1 var_cle2;
if first.var_cle2; /* Keeps the first occurrence of each key */
run;
Caution: Although functional, this method involves massive data movement (network shuffle) to group keys, which can be less performant than the dedicated action for very large volumes.
Single-threaded: Forcing execution on a single thread (/single=yes) would solve the Hash Table problem, but would kill performance by negating the entire purpose of Viya™'s parallel processing.
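For completeness, here is what the single-threaded variant could look like. This is a sketch only, reusing the names from the examples above (the hash object name `seen` is hypothetical), and for the reason just given it is only reasonable on small tables.

```sas
/* Sketch only: SINGLE=YES runs the whole step on one thread,      */
/* so a single hash table sees every row. No parallelism is used. */
data casuser.ma_table_dedoublonnee / single=yes;
    set casuser.ma_table_source;
    if _n_ = 1 then do;
        declare hash seen();
        seen.defineKey('var_cle1', 'var_cle2');
        seen.defineDone();
    end;
    /* Output a row only the first time its key is encountered */
    if seen.check() ne 0 then do;
        seen.add();
        output;
    end;
run;
```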