
Hadoop Data Connector

The Hadoop data connector facilitates data transfer between Hive data warehouses and SAS® Cloud Analytic Services (CAS) on the SAS® Viya platform. It manages connections to Hive tables, enabling data to be loaded into CAS. Key features include connection with or without dynamic service discovery (via ZooKeeper), Kerberos authentication, and parallel data loading. Parallel loading requires the installation of SAS® In-Database Technologies and the SAS® Embedded Process. The connector supports Hive data types by converting them to appropriate CAS data types, with specific handling of VARCHAR lengths and integer precision. Since version 2023.10, SAS® uses the Simba Apache Hive JDBC connector for connectivity.

Type : INTERNAL_CREATION


Examples use generated data (datalines) or placeholders for connections to external sources (Hive).
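Because the connector converts Hive types to CAS types, it can help to verify the resulting column types after a load. Here is a minimal sketch, assuming a table named `my_hive_table_cas` (a hypothetical name) has already been loaded into the casuser caslib; the columnInfo action reports each column's CAS type and length:

/* Inspect the CAS column types produced by the Hive-to-CAS conversion */
/* 'my_hive_table_cas' is a hypothetical name; use one of your own loaded tables */
PROC CAS;
   SESSION casauto;
   table.columnInfo / table={name="my_hive_table_cas", caslib="casuser"};
RUN;
QUIT;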

Code Block 1 : CASLIB Data
Explanation :
This example shows the creation of a caslib to connect to a Hive server without using dynamic service discovery (ZooKeeper) or HDFS access. The `bulkload=false` option is specified because HDFS access is not configured. Replace `your_hive_server`, `your_username`, `your_password`, and `your_schema` with your own connection information.
/* Basic connection to Hive without direct HDFS access */
/* Replace the placeholder values with your own connection information */

caslib hvcaslib_basic desc='Basic Hadoop Caslib'
   dataSource=(srctype='hadoop',
               server='your_hive_server',
               username='your_username',
               password='your_password',
               schema='your_schema',
               bulkload=false);

/* Assign librefs for all caslibs in the session */
caslib _all_ assign;

/* Display information about the caslib */
caslib hvcaslib_basic list;
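Once the caslib is assigned, a Hive table from the schema can be pulled into CAS with a serial PROC CASUTIL load. A minimal sketch, assuming a hypothetical Hive table named `sales_data` in your schema:

/* Serial load of a Hive table through the basic caslib */
/* 'sales_data' is a hypothetical name; replace it with a table from your schema */
PROC CASUTIL;
   load casdata="sales_data" incaslib="hvcaslib_basic"
        outcaslib="casuser" casout="sales_data_cas";
   list tables incaslib="casuser";
RUN;
QUIT;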
Code Block 2 : CASLIB Data
Explanation :
This example configures a Hadoop caslib that uses ZooKeeper for dynamic service discovery. It is ideal for HiveServer2 environments in high availability mode. The options `zooKeeperQuorum`, `zooKeeperNamespace`, and `zooKeeperPort` are used to specify ZooKeeper connection information. Be sure to replace the placeholders with the actual values for your environment, including the paths to the Hadoop jar and configuration directories, as well as ZooKeeper and authentication information.
/* Connection to Hive with dynamic service discovery via ZooKeeper */
/* Replace the values in angle brackets (<>) with your own information */

caslib zoocaslib desc='ZooKeeper Hadoop Caslib'
   dataSource=(srctype='hadoop',
               username='your_username',
               password='your_password',
               dataTransferMode='serial',
               hadoopJarPath='<Hadoop_jar_path_directory>',
               hadoopConfigDir='<Hadoop_configuration_directory>',
               zooKeeperQuorum="<node1.company.com>,<node2.company.com>,<node3.company.com>",
               zooKeeperNamespace='hiveserver2',
               zooKeeperPort=2181,
               schema='your_schema');

/* Assign librefs for all caslibs in the session */
caslib _all_ assign;

/* Display information about the caslib */
caslib zoocaslib list;
Code Block 3 : CASLIB Data
Explanation :
This example illustrates connecting to a Hive server configured with Kerberos authentication. The options `krbAuthType`, `krbHostFQDN`, `krbRealm`, and `krbServiceName` are essential for passing Kerberos connection properties to the Simba driver. Replace placeholders such as `your_hive_server`, `your_schema`, `myhost.company.com`, and `example.com` with the appropriate values for your Kerberos environment.
/* Connection to Hive with Kerberos authentication */
/* Replace the values in angle brackets (<>) with your own information */

caslib hadoopkrb desc='Kerberos Hadoop Caslib'
   dataSource=(srctype='hadoop',
               server='your_hive_server',
               schema='your_schema',
               use_ssl=true,
               krbAuthType=2,
               krbHostFQDN='<myhost.company.com>',
               krbRealm='<example.com>',
               krbServiceName='hive',
               bulkload=false);

/* Assign librefs for all caslibs in the session */
caslib _all_ assign;

/* Display information about the caslib */
caslib hadoopkrb list;
Code Block 4 : PROC CASUTIL Data
Explanation :
This example illustrates configuring a caslib for parallel data loading from Hive to CAS, then uses `PROC CASUTIL` to perform this loading. To make this example self-contained, a CAS table named `myHDdata` is first created to simulate the data that would normally be read from Hive. The options `dataTransferMode="parallel"`, `jobManagementURL`, `hadoopConfigDir`, and `bulkLoad=false` are crucial for enabling parallel loading. Replace placeholders such as `your_hive_server`, `your_schema`, `your_username`, `your_password`, `Livy_URL`, and `path-to-EP-config-files` with the specific values for your Hadoop/Hive installation.
/* Create a demonstration CAS table to simulate a Hive source */
/* Assumes an active CAS session (casauto) with the CASUSER libref assigned */
DATA casuser.myHDdata;
   INPUT id $ name $ value;
   DATALINES;
1 Apple 10
2 Banana 20
3 Cherry 30
4 Date 40
;
RUN;

/* Connection to Hive with parallel loading enabled */
/* Replace the values in angle brackets (<>) with your own information */

caslib hvcaslib_parallel desc='Hadoop Caslib with Parallel Loading'
   dataSource=(srctype='hadoop',
               server='your_hive_server',
               schema='your_schema',
               username='your_username',
               password='your_password',
               dataTransferMode="parallel",
               jobManagementURL='<Livy_URL>',
               hadoopConfigDir='<path-to-EP-config-files>',
               bulkLoad=false);

/* Assign librefs for all caslibs in the session */
caslib _all_ assign;

/* Load the 'myHDdata' table from Hive (simulated here by the CAS table) into CAS */
/* In practice, 'myHDdata' would be an existing table in Hive */
PROC CASUTIL;
   list files incaslib="hvcaslib_parallel";
   load casdata="myHDdata" incaslib="hvcaslib_parallel" outcaslib="casuser"
        casout="HDdata_from_hvcaslib_parallel";
   list tables incaslib="casuser";
   contents casdata="HDdata_from_hvcaslib_parallel" incaslib="casuser";
RUN;
QUIT;
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.

Related Documentation : Data Access

Topic / Keyword : DOC FedSQL
Link to the resource : en/sampleCode/FEDSQL9D66
Expert Advice
Michael, Viya infrastructure manager:
"Before running a large parallel load, check your Livy URL and Hadoop configuration directories. A common cause of 'Parallel Load Failure' is a mismatch between the Hadoop site-xml files on the CAS controller and those actually running on the Hadoop cluster nodes."
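Following this advice, one way to isolate a parallel-load failure is to retry the same load in serial mode: if the serial transfer succeeds, the Hive connection itself is healthy and the problem most likely lies in the Embedded Process or Livy configuration. A minimal sketch, reusing the placeholder values from Code Block 4 (the caslib name hvcaslib_serialtest is illustrative):

/* Temporary serial-mode caslib used to isolate parallel-load failures */
caslib hvcaslib_serialtest desc='Serial Fallback Test'
   dataSource=(srctype='hadoop',
               server='your_hive_server',
               schema='your_schema',
               username='your_username',
               password='your_password',
               dataTransferMode='serial',
               bulkload=false);

/* If this serial load succeeds while the parallel load fails, suspect the */
/* Embedded Process / Livy configuration rather than the Hive connection */
PROC CASUTIL;
   load casdata="myHDdata" incaslib="hvcaslib_serialtest"
        outcaslib="casuser" casout="serial_test";
RUN;
QUIT;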