Your data is partitioned across multiple nodes (workers) and threads.
The Data Step executes in parallel on each thread.
Each thread has its own local instance of the Hash Table.
The problem: If the duplicate of key "A" is on Thread 1 and the original is on Thread 2, their respective Hash Tables do not communicate. Each thread will think it has a unique key. Result: duplicates persist in the final consolidated table.
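To make the failure mode concrete, here is a minimal sketch of the classic hash-based deduplication pattern as it might be written against a CAS table. The table and key names (ma_table_source, var_cle1, var_cle2) are borrowed from the examples later in this article, and the hash name seen is purely illustrative. Because the step runs once per thread, each thread builds its own private copy of the hash, so this code is shown as what not to rely on in CAS.

```sas
/* Anti-pattern in CAS: every thread runs this step on its own slice of rows */
data casuser.ma_table_dedoublonnee;
    set casuser.ma_table_source;

    /* _n_ = 1 is true once PER THREAD, so each thread declares its own hash */
    if _n_ = 1 then do;
        declare hash seen();                    /* "seen" is an illustrative name */
        seen.defineKey('var_cle1', 'var_cle2');
        seen.defineDone();
    end;

    /* check() returns 0 only if THIS thread has already seen the key */
    if seen.check() ne 0 then do;
        seen.add();
        output;  /* the key is new to this thread, but may exist on another */
    end;
run;
```

The output is deduplicated within each thread only: a key that appears on two different threads is written twice.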
Solution 1: The native method (Recommended)
The most efficient way to deduplicate in Viya™ is to use the dedicated CAS action. It is optimized for the distributed engine and handles the data shuffling necessary to compare records between nodes.
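As a hedged illustration, the call below sketches how the deduplicate action from the deduplication action set is typically invoked through PROC CAS. The table and key names are reused from the examples in this article. The parameter names (groupBy, casOut, noDuplicateKeys) reflect the action's documented interface as far as I know it, but check them against the documentation for your Viya release before relying on them.

```sas
proc cas;
    /* Keep one row per (var_cle1, var_cle2) combination */
    deduplication.deduplicate /
        table={caslib="casuser",
               name="ma_table_source",
               groupBy={"var_cle1", "var_cle2"}},   /* the deduplication keys */
        casOut={caslib="casuser",
                name="ma_table_dedoublonnee",
                replace=true},                      /* overwrite any previous output */
        noDuplicateKeys=true;
quit;
```

The keys are declared in groupBy on the input table, which lets the server group matching rows together before discarding duplicates.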
Solution 2: PROC SORT with NODUPKEY
Contrary to popular belief, the good old PROC SORT with the NODUPKEY option is very efficient in CAS.
The CAS engine intercepts the PROC SORT syntax and translates it into optimized distributed operations. It does not bring the data locally to sort it; everything happens in memory on the cluster.
proc sort data=casuser.ma_table_source out=casuser.ma_table_dedoublonnee nodupkey;
    by var_cle1 var_cle2;
run;
Verdict: Tests show that this method is almost as fast as the deduplicate action. It is often the best choice for migrating existing code without complex re-writing.
Solution 3: The Data Step with BY and FIRST.
If you absolutely want to use a Data Step, you must abandon the Hash Table logic and return to the classic BY statement combined with the FIRST.variable test.
Why does this work without prior sorting?
In CAS, the BY statement forces a data reorganization: the controller ensures that all rows with the same BY key are sent to the same thread (grouping).
data casuser.ma_table_dedoublonnee;
    set casuser.ma_table_source;
    by var_cle1 var_cle2;
    if first.var_cle2; /* Keep the first occurrence of each key combination */
run;
Caution: Although functional, this method involves massive data movement (network shuffle) to group keys, which can be less performant than the dedicated action for very large volumes.
Single-threaded: Forcing the Data Step to run on a single thread (the / single=yes option on the DATA statement) would solve the Hash Table problem, but it would kill performance by negating the entire purpose of Viya™'s parallel processing.