Load a CSV File from a GZ Archive

The `table.loadTable` action belongs to the `table` action set in SAS Viya and imports data from various sources into the CAS server. This example focuses on CSV files that have been individually compressed with gzip (.gz). The `archiveType='gz'` option tells the CAS server to decompress the file before parsing the CSV content, and `getNames=true` takes the column names from the first line of the CSV file. To run the examples, create a CSV file, compress it (e.g., `gzip my_file.csv`, which produces `my_file.csv.gz`), and place it in a file system path that the CAS server can reach through the specified caslib (e.g., `/tmp/cas_data_gz/`).

Data Analysis

Type : FICHIER_COMPRESSE_EXTERNE

The examples require prior creation of a CSV file, its compression to GZ, and its placement in a file system directory accessible by the CAS server.
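The preparation step can also be scripted. The following is a minimal sketch in Python, assuming the standard `csv` and `gzip` modules are an acceptable substitute for the `gzip` command line tool. The scratch directory is an assumption: on Linux it resolves to `/tmp/cas_data_gz`, and it should be changed to whatever path the CAS server can actually read.

```python
import csv
import gzip
import tempfile
from pathlib import Path

# Assumption: this directory stands in for the caslib path used in the examples.
# On Linux it resolves to /tmp/cas_data_gz; adjust it for your CAS server.
target_dir = Path(tempfile.gettempdir()) / "cas_data_gz"
target_dir.mkdir(parents=True, exist_ok=True)
gz_path = target_dir / "sample_data.csv.gz"

rows = [
    ["id", "name", "value"],
    [1, "Alice", 100],
    [2, "Bob", 150],
    [3, "Charlie", 200],
]

# Text mode ("wt") lets csv.writer write strings straight into the gzip stream.
with gzip.open(gz_path, "wt", encoding="utf-8", newline="") as handle:
    csv.writer(handle, lineterminator="\n").writerows(rows)

# Read the archive back to confirm the header line survived compression.
with gzip.open(gz_path, "rt", encoding="utf-8", newline="") as handle:
    header = next(csv.reader(handle))

print(header)
```

The resulting file has the same content as the `sample_data.csv.gz` that Example 1 expects.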

1 Code Block

PROC CAS / table.loadTable

Explanation:
This example illustrates the basic loading of a GZ compressed CSV file. It configures a temporary caslib pointing to a local path on the CAS server. Then, it uses `table.loadTable` with `fileType='csv'` and `archiveType='gz'` to decompress and load the file. `getNames=true` indicates that the first line of the CSV contains the column names. A quick check with `table.fetch` is performed, followed by a cleanup of CAS resources.

/* User preparation required: */
/* Create a file 'sample_data.csv' with the following content: */
/* id,name,value */
/* 1,Alice,100 */
/* 2,Bob,150 */
/* 3,Charlie,200 */
/* Compress it: `gzip sample_data.csv` produces 'sample_data.csv.gz'. */
/* Place 'sample_data.csv.gz' in the directory '/tmp/cas_data_gz/', accessible by the CAS server. */

PROC CAS;
  /* Add a temporary caslib for this example */
  /* Replace '/tmp/cas_data_gz' with a path the CAS server can read and write */
  TABLE.addCaslib /
    name="mycaslib",
    dataSource={srcType="path"},
    path="/tmp/cas_data_gz";

  /* Load the GZ file that contains a CSV */
  TABLE.loadTable /
    caslib="mycaslib",
    path="sample_data.csv.gz",
    importOptions={
      fileType="csv",
      archiveType="gz",
      getNames=true
    },
    casout={
      name="mydata_basic",
      replace=true
    };

  /* Check the load (displays the first 5 rows) */
  TABLE.fetch /
    TABLE={name="mydata_basic"},
    maxRows=5;

  /* Cleanup: drop the loaded table */
  TABLE.dropTable /
    caslib="mycaslib",
    name="mydata_basic";

  /* Cleanup: drop the temporary caslib */
  TABLE.dropCaslib /
    caslib="mycaslib";
RUN;
QUIT;

2 Code Block

PROC CAS / table.loadTable

Explanation:
This example demonstrates how to load a GZ compressed CSV file when the delimiter is not the standard comma. The `delimiter=';'` option is used to specify the semicolon. `UTF-8` encoding is also specified, which is good practice to ensure character compatibility. Verification and cleanup steps are included.

/* User preparation required: */
/* Create a file 'sample_data_semicolon.csv' with the following content: */
/* id;name;value */
/* 1;Alice;100 */
/* 2;Bob;150 */
/* 3;Charlie;200 */
/* Compress it: `gzip sample_data_semicolon.csv` produces 'sample_data_semicolon.csv.gz'. */
/* Place 'sample_data_semicolon.csv.gz' in the directory '/tmp/cas_data_gz/', accessible by the CAS server. */

PROC CAS;
  /* Add a temporary caslib for this example */
  TABLE.addCaslib /
    name="mycaslib",
    dataSource={srcType="path"},
    path="/tmp/cas_data_gz";

  /* Load the GZ file that contains a CSV with a custom delimiter */
  TABLE.loadTable /
    caslib="mycaslib",
    path="sample_data_semicolon.csv.gz",
    importOptions={
      fileType="csv",
      archiveType="gz",
      getNames=true,
      delimiter=";",     /* Use the semicolon as the delimiter */
      encoding="UTF-8"   /* Specify the file encoding */
    },
    casout={
      name="mydata_custom_delimiter",
      replace=true
    };

  /* Check the load (displays the first 5 rows) */
  TABLE.fetch /
    TABLE={name="mydata_custom_delimiter"},
    maxRows=5;

  /* Cleanup: drop the loaded table */
  TABLE.dropTable /
    caslib="mycaslib",
    name="mydata_custom_delimiter";

  /* Cleanup: drop the temporary caslib */
  TABLE.dropCaslib /
    caslib="mycaslib";
RUN;
QUIT;
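Before choosing a `delimiter` and `encoding` for a load like the one above, it can help to look at the header line of the archive. The following is a rough sketch in Python, not part of CAS: `guess_delimiter` is a hypothetical helper, the candidate list is an assumption, and the scratch directory is illustrative only.

```python
import gzip
import tempfile
from pathlib import Path


def guess_delimiter(gz_path, candidates=(",", ";", "\t", "|")):
    """Return the candidate delimiter that occurs most often in the header line."""
    # Strict decoding raises UnicodeDecodeError if the bytes read are not valid UTF-8.
    with gzip.open(gz_path, "rt", encoding="utf-8", errors="strict") as handle:
        header = handle.readline()
    return max(candidates, key=header.count)


# Build the semicolon-delimited sample from the example (hypothetical scratch location).
target_dir = Path(tempfile.gettempdir()) / "cas_data_gz"
target_dir.mkdir(parents=True, exist_ok=True)
gz_path = target_dir / "sample_data_semicolon.csv.gz"
with gzip.open(gz_path, "wt", encoding="utf-8", newline="") as handle:
    handle.write("id;name;value\n1;Alice;100\n2;Bob;150\n3;Charlie;200\n")

delimiter = guess_delimiter(gz_path)
print(repr(delimiter))  # prints ';'
```

Counting characters in the header is only a heuristic: a header with quoted fields that contain commas would mislead it, so treat the result as a hint rather than a guarantee.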

3 Code Block

PROC CAS / table.loadTable

Explanation:
This example uses advanced options to optimize the loading process and data schema detection. `guessingRows=500` forces CAS to analyze the first 500 rows to determine column types and lengths, useful for large files with heterogeneous data. `nRows=100000` can help with size estimation. `blocksize='1M'` adjusts the read block size for better performance. `table.columnInfo` is then used to inspect the detected schema.

/* User preparation required: */
/* Create a file 'sample_data_large.csv' with the same content as 'sample_data.csv', or more rows. */
/* Compress it: `gzip sample_data_large.csv` produces 'sample_data_large.csv.gz'. */
/* Place 'sample_data_large.csv.gz' in the directory '/tmp/cas_data_gz/', accessible by the CAS server. */

PROC CAS;
  /* Add a temporary caslib for this example */
  TABLE.addCaslib /
    name="mycaslib",
    dataSource={srcType="path"},
    path="/tmp/cas_data_gz";

  /* Load with advanced schema detection and performance options */
  TABLE.loadTable /
    caslib="mycaslib",
    path="sample_data_large.csv.gz",
    importOptions={
      fileType="csv",
      archiveType="gz",
      getNames=true,
      guessingRows=500,  /* Scan more rows for better column type detection */
      nRows=100000,      /* Number of rows to read if the file is very large (for estimation) */
      blocksize="1M"     /* Read block size, to tune performance */
    },
    casout={
      name="mydata_perf_options",
      replace=true
    };

  /* Inspect the column information of the loaded table */
  TABLE.columnInfo /
    TABLE={name="mydata_perf_options"};

  /* Cleanup: drop the loaded table */
  TABLE.dropTable /
    caslib="mycaslib",
    name="mydata_perf_options";

  /* Cleanup: drop the temporary caslib */
  TABLE.dropCaslib /
    caslib="mycaslib";
RUN;
QUIT;
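To see why the number of scanned rows matters for schema detection, consider a toy model in Python. This is an illustration only and does not reproduce the inference that CAS actually performs: `infer_types` is a hypothetical function, and `double` and `varchar` are used here simply as labels for numeric and character columns.

```python
import csv
import io


def infer_types(csv_text, guessing_rows):
    """Classify each column as 'double' or 'varchar' from the first data rows only."""
    reader = csv.reader(io.StringIO(csv_text))
    names = next(reader)
    kinds = {name: "double" for name in names}
    for index, row in enumerate(reader):
        if index >= guessing_rows:
            break  # rows beyond the scan window never influence the schema
        for name, cell in zip(names, row):
            try:
                float(cell)
            except ValueError:
                kinds[name] = "varchar"
    return kinds


sample = "id,code\n1,100\n2,200\n3,A300\n"
print(infer_types(sample, guessing_rows=2))  # {'id': 'double', 'code': 'double'}
print(infer_types(sample, guessing_rows=3))  # {'id': 'double', 'code': 'varchar'}
```

With a window of two rows the `code` column looks numeric, and the non-numeric value on the third row is only noticed once the window is widened. Heterogeneous files behave the same way at a larger scale.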

4 Code Block

PROC CAS / table.loadTable

Explanation:
This example explores error handling and verification in a CAS environment. It attempts to load a GZ compressed CSV file that might contain formatting errors, with options such as `maxRows` and `nthreads` included. After the loading attempt, `table.tableInfo` returns metadata about the table, such as its row and column counts, which can reveal missing or unexpected data. `table.fetch` then displays the data if the load succeeded. The goal is to show how the CAS server reacts to malformed data and how to retrieve information for debugging. Consult the CAS log for detailed error messages.

/* User preparation required: */
/* Create a file 'sample_data_error.csv' with an intentionally incorrect format (e.g., a missing or extra column) */
/* id,name,value */
/* 1,Alice,100 */
/* 2,Bob,150,extra */
/* 3,Charlie */
/* Compress it: `gzip sample_data_error.csv` produces 'sample_data_error.csv.gz'. */
/* Place 'sample_data_error.csv.gz' in the directory '/tmp/cas_data_gz/', accessible by the CAS server. */

PROC CAS;
  /* Add a temporary caslib for this example */
  TABLE.addCaslib /
    name="mycaslib",
    dataSource={srcType="path"},
    path="/tmp/cas_data_gz";

  /* Attempt to load a file with potential errors */
  /* Options such as `addrowid=true` or `maxerrs` can be useful for debugging */
  TABLE.loadTable /
    caslib="mycaslib",
    path="sample_data_error.csv.gz",
    importOptions={
      fileType="csv",
      archiveType="gz",
      getNames=true,
      maxRows=100,  /* Limit the rows read to avoid overload if something goes wrong */
      nthreads=4    /* Use several threads for the load */
    },
    casout={
      name="mydata_error_check",
      replace=true
    };

  /* Check the table state, and the error messages if the load failed */
  TABLE.tableInfo /
    caslib="mycaslib",
    name="mydata_error_check";

  /* If the load succeeded, display the data; otherwise inspect the CAS log */
  TABLE.fetch /
    TABLE={name="mydata_error_check"};

  /* Cleanup: drop the loaded table */
  TABLE.dropTable /
    caslib="mycaslib",
    name="mydata_error_check";

  /* Cleanup: drop the temporary caslib */
  TABLE.dropCaslib /
    caslib="mycaslib";
RUN;
QUIT;
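Malformed rows such as those in `sample_data_error.csv` can also be located before the file reaches CAS. The sketch below, in Python, assumes that a row is "malformed" when its field count differs from the header's. `find_ragged_rows` is a hypothetical helper and the scratch directory is illustrative only.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def find_ragged_rows(gz_path, delimiter=","):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    with gzip.open(gz_path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        expected = len(next(reader))
        return [
            (reader.line_num, len(row))
            for row in reader
            if len(row) != expected
        ]


# Recreate the intentionally broken sample (hypothetical scratch location).
target_dir = Path(tempfile.gettempdir()) / "cas_data_gz"
target_dir.mkdir(parents=True, exist_ok=True)
gz_path = target_dir / "sample_data_error.csv.gz"
with gzip.open(gz_path, "wt", encoding="utf-8", newline="") as handle:
    handle.write("id,name,value\n1,Alice,100\n2,Bob,150,extra\n3,Charlie\n")

problems = find_ragged_rows(gz_path)
print(problems)  # [(3, 4), (4, 2)]
```

Line 3 carries an extra field and line 4 is one field short, which matches the two defects planted in the sample file. The line numbers can then be compared with the messages in the CAS log.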

This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.

Expert Advice

Michael

Viya Infrastructure Manager.

"While Gzip is excellent for individual files, it is a non-splittable format. This means the decompression must happen on a single node (the CAS Controller) before the data is distributed to workers. If you have massive datasets (hundreds of GBs), consider using a splittable format like Parquet or ORC for better parallel loading performance."