Fedsql

SAS Viya Tutorial: Pre-process Data in the Database Before CAS Loading

Simon 19 vues

In the SAS© Viya architecture, the standard data lifecycle often involves copying tables from a source (Oracle, Postgres, Hadoop) to the CAS server's memory for analysis. This is the "Load First, Analyze Later" approach.

However, when data sources are very large (several terabytes), loading the entire table only to use a summary or a subset is inefficient. It unnecessarily consumes network bandwidth and RAM.

Although PROC CASUTIL offers simple options (WHERE=, VAR=), there is a much more powerful method for performing complex aggregations or joins directly in the database before the data even reaches CAS: FedSQL Pass-Through.

This tutorial explores two methods for delegating work to the database: Implicit mode and Explicit mode.


Method 1: Implicit Pass-Through (The "Standard" Method)

Implicit Pass-Through allows you to write standard FedSQL code (similar to ANSI SQL). The CAS Data Connector handles translating your query into the native language of the database (Postgres SQL, T-SQL, PL/SQL, etc.).

Example: Simple Aggregation

Imagine a Postgres table with 2 million rows (orders_large). We want to load into CAS only the sum of sales by order type, filtered by date.

SAS© Code:

1cas mysession;
2/* Définition de la connexion à Postgres */
3caslib caspgr datasource=(srctype="postgres", database="DCON", server="...", ...) libref=caspgr;
4 
5PROC FEDSQL sessref=mysession;
6 /* Création de la table résultante directement dans la RAM de CAS */
7 create TABLE caspgr.revenuesummary{options replace=true} as
8 select order_type, sum(total_sale) as revenue
9 from caspgr."orders_large" /* Notez les guillemets pour préserver la casse */
10 where order_date < date'2015-01-01'
11 group BY order_type;
12QUIT;

What happens behind the scenes: If the orders_large table is not already loaded into CAS, the engine detects that it is an external table. It translates the query into Postgres SQL, sends it to the server, and loads only the aggregated result (a few rows) into CAS.

Performance Gain:

  • Classic Method (Full Load + CAS Aggregation): ~38 seconds.

  • Pass-Through Method (Database Aggregation + Result Load): ~0.32 seconds.

The difference is massive because it avoids transferring 2 million rows over the network. The SAS© log will confirm the optimization with the note:

NOTE: The SQL statement was fully offloaded to the underlying data source via full pass-through

Warning about Unsupported Functions (Partial Pass-Through)

Implicit Pass-Through has its limits. If you use a SAS© function that has no direct equivalent in the database (e.g., KURTOSIS to calculate kurtosis), the engine cannot delegate that part of the calculation.

In this case, SAS© will send the WHERE clause to the database (for filtering), but it will retrieve all the filtered rows to calculate KURTOSIS in CAS. The performance will be degraded compared to "Full Pass-Through".

Method 2: Explicit Pass-Through (The "Power User" Method)

Sometimes, you need to use features very specific to your database (JSON functions, geospatial functions, special indexes) that standard FedSQL is not aware of.

Explicit Pass-Through allows you to send a query in its native language (pure database SQL) directly to the engine, without translation by SAS©.

Syntax

The syntax relies on the CONNECTION TO clause:

1SELECT ... FROM CONNECTION TO nom_caslib ( votre_requête_native_ici )

Advanced Example: Parsing JSON in Postgres

Suppose your Postgres table jsonorders contains a JSON type column. You want to extract structured information (customer state, quantity) stored in this JSON and aggregate it before loading it into CAS.

FedSQL doesn't know how to parse JSON natively, but Postgres does.

SAS© Code with Explicit Pass-Through:

1PROC FEDSQL sessref=mysession;
2 create TABLE caspgr.StateSalesSummary as
3 select customerstate, minquantity, maxquantity, sumquantity
4 from connection to caspgr
5 (
6 /* DEBUT DU SQL NATIF POSTGRES */
7 SELECT customerstate,
8 min(quantity) as minquantity,
9 max(quantity) as maxquantity,
10 sum(quantity) as sumquantity
11 FROM
12 (
13 /* Utilisation des opérateurs JSON Postgres -> et ->> */
14 SELECT orderinfo -> 'customer' ->> 'state' as customerstate,
15 cast(orderinfo -> 'items' ->> 'qty' as integer) as quantity
16 FROM jsonorders
17 ) t
18 GROUP BY customerstate
19 /* FIN DU SQL NATIF POSTGRES */
20 );
21QUIT;
FROM CONNECTION TO caspgr (...) : Everything within the parentheses is invisible to the SAS©/FedSQL interpreter. It is sent as is to Postgres.

Operators -> and ->> : These are Postgres-specific operators for navigating JSON objects. SAS© would not understand them in implicit mode.

Result: Postgres performs the heavy JSON extraction and aggregation. CAS receives only a small, ready-to-use summary table.