ETL SAS VIYA CAS

Clean Data on Ingest: How to Transform Dates and Percentages Directly into CAS

This code is also available in: Deutsch Español
Difficulty Level
Beginner
Published on :
Simon

Expert Advice

Simon
Expert SAS et fondateur.

The vars parameter within importOptions acts very much like the INPUT statement in a traditional SAS Data Step. If you do not define these variables explicitly, the CAS server attempts to "guess" the data type based on the first few rows (similar to GUESSINGROWS). This often fails with dirty data (e.g., a "10%" string might force the whole column to VARCHAR). By explicitly defining {name="pct" type="double" informat="percent8."}, you override the guesser and force CAS to perform the mathematical conversion immediately, ensuring your numbers are ready for analysis the moment they hit memory.

The `table.loadTable` action is used to load a CSV file into an in-memory table on the CAS server. `importOptions` are configured to specify the file type (CSV) and to explicitly define variables via the `vars` parameter. For each variable, the name, data type (double), informat (`ANYDTDTE10.` for dates, `COMMA9.` for numbers with commas, and `PERCENT8.` for percentages), and format (`NLDATE20.`, `COMMA.`, `PERCENT.`) are defined. The `table.columnInfo` action displays table metadata, including applied data types and formats. Finally, `table.fetch` is used twice: once to display raw numeric values (without format) and once to display values formatted by CAS, illustrating the impact of informats and formats.
Data Analysis

Type : CREATION_INTERNE


Examples use CSV data generated directly in the code via `filename` statements, ensuring their autonomy and reproducibility without dependence on pre-existing external files on the CAS file system.

1 Code Block
PROC CAS / table.loadTable Data
Explanation :
This basic example shows how to load a CSV file into CAS using `table.loadTable`. The CSV file is dynamically created using a `filename` statement and a `DATA _NULL_` step to ensure script autonomy. The `importOptions` parameter is used to specify the file type, the absence of column names, and the definition of informats and formats for each variable (`dt`, `volume`, `pct`). The resulting table, `volume_basic`, is promoted to the global scope of the CAS session using `promote=True` to make it accessible to other sessions or users. The `table.columnInfo` action is used to check the applied data types and formats. Finally, `table.fetch` is used twice to show raw and formatted data.
Copied!
1/* Remplacez 'cloud.example.com' et '5570' par les valeurs de votre environnement */
2/*options cashost="cloud.example.com" casport=5570;*/
3 
4/* Démarre la session Casauto si ce n'est pas déjà fait. */
5/*cas casauto;*/
6 
7/* Création d'un fichier CSV temporaire dans une caslib accessible */
8filename volume_basic temp;
9DATA _null_;
10 file volume_basic;
11 put '12-06-2019,"1,245",10%';
12 put '01-03-2020,"3,400",27%';
13 put '02-07-2020,"2,341",18%';
14 put '03-06-2020,"5,700",45%';
15 put '04-03-2020,"12,686",100%';
16RUN;
17 
18PROC CAS;
19 SESSION casauto;
20 
21 /* Charge le fichier CSV temporaire en mémoire CAS */
22 TABLE.loadTable /
23 path="volume_basic",
24 caslib="casuser", /* Spécifie une caslib, ex: 'casuser' */
25 importOptions={
26 fileType="csv",
27 getNames=false,
28 vars={
29 {name="dt" type="double" informat="anydtdte10." FORMAT="nldate20."},
30 {name="volume" type="double" informat="comma9." FORMAT="comma."},
31 {name="pct" type="double" informat="percent8." FORMAT="percent."}
32 }
33 },
34 casout={name="volume_basic", replace=True, promote=True}; /* Promote rend la table globale */
35 
36 /* Affiche les informations sur les colonnes de la table chargée */
37 TABLE.columnInfo / TABLE="volume_basic";
38 
39 title "Valeurs numériques brutes (Exemple Basique)";
40 /* Récupère et affiche les données sans appliquer de formats */
41 TABLE.fetch /
42 TABLE="volume_basic",
43 fetchVars={
44 {name="dt", FORMAT=""},
45 {name="volume", FORMAT=""},
46 {name="pct", FORMAT=""}
47 };
48RUN;
49 
50 title "Formats appliqués par CAS (Exemple Basique)";
51 /* Récupère et affiche les données en appliquant les formats définis */
52 TABLE.fetch / TABLE="volume_basic", FORMAT=True;
53RUN;
54QUIT;
55 
56/* Nettoyage : supprime la table CAS en mémoire et le fichier temporaire */
57PROC CAS;
58 SESSION casauto;
59 TABLE.dropTable / caslib="casuser", name="volume_basic";
60RUN;
61QUIT;
62filename volume_basic clear;
2 Code Block
PROC CAS / table.loadTable Data
Explanation :
This example focuses on using common options when loading data. The CSV file is loaded into a CAS table named `volume_inter_global`. The `caslib='casuser'` option is explicitly used to specify the destination caslib. Additionally, `promote=True` in the `casout` parameter ensures that the loaded table is promoted from a session table to a global table, making it available to other users or sessions without needing to be reloaded. Variable names are also more explicit for better readability. `table.fetch` is used to display a preview of the formatted data.
Copied!
1/* Remplacez 'cloud.example.com' et '5570' par les valeurs de votre environnement */
2/*options cashost="cloud.example.com" casport=5570;*/
3 
4/* Démarre la session Casauto si ce n'est pas déjà fait. */
5/*cas casauto;*/
6 
7/* Création d'un fichier CSV temporaire avec plus de données */
8filename volume_inter temp;
9DATA _null_;
10 file volume_inter;
11 put '12-06-2019,"1,245",10%';
12 put '01-03-2020,"3,400",27%';
13 put '02-07-2020,"2,341",18%';
14 put '03-06-2020,"5,700",45%';
15 put '04-03-2020,"12,686",100%';
16 put '05-01-2020,"8,910",70%';
17 put '06-02-2020,"1,500",15%';
18RUN;
19 
20PROC CAS;
21 SESSION casauto;
22 
23 /* Charge le fichier CSV temporaire dans la caslib 'casuser' et le promeut */
24 TABLE.loadTable /
25 path="volume_inter",
26 caslib="casuser", /* Spécifie explicitement la caslib 'casuser' */
27 importOptions={
28 fileType="csv",
29 getNames=false,
30 vars={
31 {name="DateTransaction" type="double" informat="anydtdte10." FORMAT="nldate20."},
32 {name="VolumeVendu" type="double" informat="comma9." FORMAT="comma."},
33 {name="PourcentageCroissance" type="double" informat="percent8." FORMAT="percent."}
34 }
35 },
36 casout={name="volume_inter_global", replace=True, promote=True}; /* Promouvoir pour accès global */
37 
38 /* Affiche les 10 premières lignes de la table globale */
39 title "Données de Volume (Exemple Intermédiaire)";
40 TABLE.fetch / TABLE="volume_inter_global", to=10, FORMAT=True;
41RUN;
42QUIT;
43 
44/* Nettoyage : supprime la table CAS globale et le fichier temporaire */
45PROC CAS;
46 SESSION casauto;
47 TABLE.dropTable / caslib="casuser", name="volume_inter_global";
48RUN;
49QUIT;
50filename volume_inter clear;
3 Code Block
PROC CAS / table.loadTable, dataStep.runCode Data
Explanation :
This advanced example illustrates more complex data manipulation directly within the CAS environment. After the initial loading of the `volume_initial` table, a `dataStep.runCode` is used to execute a CASL DATA step. This DATA step creates a new variable `VolumeAjuste` by multiplying the `Volume` by the adjusted `CroissancePct`. Additionally, filtering (`if VolumeAjuste > 5000`) is applied to include only rows meeting this condition in the new `volume_processed` table. This demonstrates how to perform transformations and data selections in distributed memory.
Copied!
1/* Remplacez 'cloud.example.com' et '5570' par les valeurs de votre environnement */
2/*options cashost="cloud.example.com" casport=5570;*/
3 
4/* Démarre la session Casauto si ce n'est pas déjà fait. */
5/*cas casauto;*/
6 
7/* Création d'un fichier CSV temporaire pour l'exemple avancé */
8filename volume_adv temp;
9DATA _null_;
10 file volume_adv;
11 put '12-06-2019,"1,245",10%';
12 put '01-03-2020,"3,400",27%';
13 put '02-07-2020,"2,341",18%';
14 put '03-06-2020,"5,700",45%';
15 put '04-03-2020,"12,686",100%';
16 put '05-01-2020,"8,910",70%';
17 put '06-02-2020,"1,500",15%';
18 put '07-01-2020,"6,000",50%';
19 put '08-01-2020,"9,500",80%';
20RUN;
21 
22PROC CAS;
23 SESSION casauto;
24 
25 /* Charge le fichier CSV temporaire */
26 TABLE.loadTable /
27 path="volume_adv",
28 caslib="casuser",
29 importOptions={
30 fileType="csv",
31 getNames=false,
32 vars={
33 {name="Date" type="double" informat="anydtdte10." FORMAT="nldate20."},
34 {name="Volume" type="double" informat="comma9." FORMAT="comma."},
35 {name="CroissancePct" type="double" informat="percent8." FORMAT="percent."}
36 }
37 },
38 casout={name="volume_initial", replace=True, promote=False};
39 
40 /* Exécute un DATA step en CAS pour créer une nouvelle variable et filtrer les données */
41 dataStep.runCode /
42 code="
43 data casuser.volume_processed;
44 set casuser.volume_initial;
45 VolumeAjuste = Volume * (1 + CroissancePct);
46 if VolumeAjuste > 5000 then output;
47 run;",
48 single="yes";
49 
50 /* Affiche les informations sur la nouvelle table traitée */
51 TABLE.columnInfo / TABLE="volume_processed", caslib="casuser";
52 
53 title "Données de Volume Traitées (Exemple Avancé)";
54 /* Récupère et affiche les données de la table traitée */
55 TABLE.fetch / TABLE="volume_processed", caslib="casuser", FORMAT=True;
56RUN;
57QUIT;
58 
59/* Nettoyage : supprime les tables CAS et le fichier temporaire */
60PROC CAS;
61 SESSION casauto;
62 TABLE.dropTable / caslib="casuser", name="volume_initial";
63 TABLE.dropTable / caslib="casuser", name="volume_processed";
64RUN;
65QUIT;
66filename volume_adv clear;
4 Code Block
PROC CAS / table.loadTable, simple.freq Data
Explanation :
This example illustrates a deeper integration with CAS analytical capabilities. A CSV file is loaded, including a new categorical variable 'Region'. Then, the `simple.freq` action is used to calculate the frequency distribution of the 'Region' variable on the distributed in-memory table. Additionally, `simple.summary` is used to calculate descriptive statistics (mean) of the `Volume` by `Region`, demonstrating how CAS can efficiently perform aggregated analyses on large amounts of data. This highlights Viya's parallel computing power.
Copied!
1/* Remplacez 'cloud.example.com' et '5570' par les valeurs de votre environnement */
2/*options cashost="cloud.example.com" casport=5570;*/
3 
4/* Démarre la session Casauto si ce n'est pas déjà fait. */
5/*cas casauto;*/
6 
7/* Création d'un fichier CSV temporaire avec une variable catégorielle */
8filename volume_cas_analysis temp;
9DATA _null_;
10 file volume_cas_analysis;
11 put '12-06-2019,"1,245",10%,Est';
12 put '01-03-2020,"3,400",27%,Ouest';
13 put '02-07-2020,"2,341",18%,Nord';
14 put '03-06-2020,"5,700",45%,Sud';
15 put '04-03-2020,"12,686",100%,Est';
16 put '05-01-2020,"8,910",70%,Ouest';
17 put '06-02-2020,"1,500",15%,Nord';
18 put '07-01-2020,"6,000",50%,Sud';
19 put '08-01-2020,"9,500",80%,Est';
20RUN;
21 
22PROC CAS;
23 SESSION casauto;
24 
25 /* Charge le fichier CSV temporaire avec une nouvelle variable 'Region' */
26 TABLE.loadTable /
27 path="volume_cas_analysis",
28 caslib="casuser",
29 importOptions={
30 fileType="csv",
31 getNames=false,
32 vars={
33 {name="Date" type="double" informat="anydtdte10." FORMAT="nldate20."},
34 {name="Volume" type="double" informat="comma9." FORMAT="comma."},
35 {name="CroissancePct" type="double" informat="percent8." FORMAT="percent."},
36 {name="Region" type="varchar" LENGTH=50}
37 }
38 },
39 casout={name="volume_regions", replace=True, promote=False};
40 
41 /* Exécute une analyse de fréquence sur la variable 'Region' */
42 title "Analyse de Fréquence par Région (Exemple Viya/CAS)";
43 SIMPLE.freq /
44 TABLE={name="volume_regions", caslib="casuser"},
45 inputs={{name="Region"}};
46 
47 /* Calcule les moyennes du volume par région */
48 title "Statistiques Descriptives par Région (Exemple Viya/CAS)";
49 SIMPLE.summary /
50 TABLE={name="volume_regions", caslib="casuser"},
51 casout={name="summary_volume_regions", replace=True},
52 subSet={"mean"},
53 group={{
54 name="Region"}
55 },
56 inputs={{
57 name="Volume"}
58 };
59RUN;
60 
61QUIT;
62 
63/* Nettoyage : supprime les tables CAS et le fichier temporaire */
64PROC CAS;
65 SESSION casauto;
66 TABLE.dropTable / caslib="casuser", name="volume_regions";
67 TABLE.dropTable / caslib="casuser", name="summary_volume_regions";
68RUN;
69QUIT;
70filename volume_cas_analysis clear;
Pro Tip
If your CSV lacks headers, set getNames=false and provide a complete list of variable definitions in the vars parameter to manually build your table schema during the ingestion process.
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.

Related Documentation

Aucune documentation spécifique pour cette catégorie.