Deduplication of CAS tables: Why your Hash Tables fail and how to correct it

If you are a SAS^© 9 expert, you probably use Hash Tables to efficiently deduplicate large volumes of data without sorting (PROC SORT). However, by migrating this code to SAS^© Viya^™ (CAS), you risk an unpleasant surprise: your output table still contains duplicates.

Why does this proven technique fail in the Cloud, and what are the effective alternatives?

The Trap: Hash Tables and Distributed Processing

In a classic SAS^© 9 environment (SMP), the Data Step executes sequentially on a single processor. The hash table keeps track of unique keys seen since the beginning of the file.

In CAS (MPP - Massively Parallel Processing), execution is fundamentally different:

Your data is partitioned across multiple nodes (workers) and threads.
The Data Step executes in parallel on each thread.
Each thread has its own local instance of the Hash Table.

The problem: If the duplicate of key "A" is on Thread 1 and the original is on Thread 2, their respective Hash Tables do not communicate. Each thread will think it has a unique key. Result: duplicates persist in the final consolidated table.

Solution 1: The native method (Recommended)

The most efficient way to deduplicate in Viya^™ is to use the dedicated CAS action. It is optimized for the distributed engine and handles the data shuffling necessary to compare records between nodes.

Use the deduplication.deduplicate action via PROC CAS:

1	PROC CAS;
2	deduplication.deduplicate /
3	TABLE={caslib="casuser", name="ma_table_source", groupBy={"var_cle1", "var_cle2"}}
4	noDuplicateKeys=true
5	casOut={name="ma_table_dedoublonnee", caslib="casuser", replace=true};
6	QUIT;

Advantage: Maximum performance.

Note: The groupBy option defines the deduplication keys.

Solution 2: The `PROC SORT` surprise

Contrary to popular belief, the good old PROC SORT with the NODUPKEY option is very efficient in CAS.

The CAS engine intercepts the PROC SORT syntax and translates it into optimized distributed operations. It does not bring the data locally to sort it; everything happens in memory on the cluster.

1
2	PROC SORT
3	DATA=casuser.ma_table_source out=casuser.ma_table_dedoublonnee nodupkey;
4
5	BY var_cle1 var_cle2;
6
7	RUN;
8

Verdict: Tests show that this method is almost as fast as the deduplicate action. It is often the best choice for migrating existing code without complex re-writing.

Solution 3: The Data Step with `BY` group

If you absolutely want to use a Data Step, you must abandon the Hash Table logic and return to the sequential BY + FIRST. logic.

Why does this work without prior sorting? In CAS, the BY statement forces a data reorganization: the controller ensures that all rows with the same BY key are sent to the same thread (grouping).

1	DATA casuser.ma_table_dedoublonnee;
2	SET casuser.ma_table_source;
3	BY var_cle1 var_cle2;
4	IF first.var_cle2; /* Garde la première occurrence */
5	RUN;

Caution: Although functional, this method involves massive data movement (network shuffle) to group keys, which can be less performant than the dedicated action for very large volumes.

Single-threaded: Forcing execution on a single thread (/single=yes) would solve the Hash Table problem, but would kill performance by negating the entire purpose of Viya^™'s parallel processing.

Summary

Method	Performance	Complexity	Verdict
Hash Tables	High	Medium	❌ To avoid for deduplication (risk of logical error in distributed processing).
CAS Action (`deduplicate`)	⭐⭐⭐	Medium	✅ The Best for native CASL scripts.
PROC SORT (`nodupkey`)	⭐⭐½	Low	✅ Ideal for compatibility and simplicity.
Data Step (`By Group`)	⭐⭐	Low	⚠️ Functional, but beware of network cost.

Avertissement important

Les codes et exemples fournis sur WeAreCAS.eu sont à but pédagogique. Il est impératif de ne pas les copier-coller aveuglément sur vos environnements de production. La meilleure approche consiste à comprendre la logique avant de l'appliquer. Nous vous recommandons vivement de tester ces scripts dans un environnement de test (Sandbox/Dev). WeAreCAS décline toute responsabilité quant aux éventuels impacts ou pertes de données sur vos systèmes.

Retour à la liste des articles

Deduplication of CAS tables: Why your Hash Tables fail and how to correct it

Niveau de difficulté

Publié le : 24/06/2023

The Trap: Hash Tables and Distributed Processing

Solution 1: The native method (Recommended)

Solution 2: The `PROC SORT` surprise

Solution 3: The Data Step with `BY` group

Summary

Avertissement important

Niveau de difficulté

Publié le : 24/06/2023

The Trap: Hash Tables and Distributed Processing

Solution 1: The native method (Recommended)

Solution 2: The PROC SORT surprise

Solution 3: The Data Step with BY group

Summary

Avertissement important

Solution 2: The `PROC SORT` surprise

Solution 3: The Data Step with `BY` group