
Hadoop Data Connector

The Hadoop data connector facilitates data transfer between Hive data warehouses and SAS® Cloud Analytic Services (CAS) on the SAS® Viya platform. It manages connections to Hive tables, enabling data to be loaded into CAS. Key features include connection with or without dynamic service discovery (via ZooKeeper), Kerberos authentication, and parallel data loading. Parallel loading requires the installation of SAS® In-Database Technologies and the SAS® Embedded Process. The connector supports Hive data types by converting them to appropriate CAS data types, with specific handling of VARCHAR lengths and integer precision. Since version 2023.10, SAS® uses the Simba Apache Hive JDBC connector for connectivity.

Type : INTERNAL_CREATION


Examples use generated data (datalines) or placeholders for connections to external sources (Hive).
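Because the connector converts Hive types to CAS types, it can help to verify the resulting column types after a load. Here is a minimal sketch, assuming a table named `my_hive_table_cas` (a hypothetical name) has already been loaded into the casuser caslib; the columnInfo action reports each column's CAS type and length:

/* Inspect the CAS column types produced by the Hive-to-CAS conversion */
/* 'my_hive_table_cas' is a hypothetical name; use one of your own loaded tables */
PROC CAS;
   SESSION casauto;
   table.columnInfo / table={name="my_hive_table_cas", caslib="casuser"};
RUN;
QUIT;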

Code Block 1 : CASLIB Data
Explanation :
This example shows the creation of a caslib to connect to a Hive server without using dynamic service discovery (ZooKeeper) or HDFS access. The `bulkload=false` option is specified because HDFS access is not configured. Replace `your_hive_server`, `your_username`, `your_password`, and `your_schema` with your own connection information.
/* Basic connection to Hive without direct HDFS access */
/* Replace the placeholder values with your own connection information */

caslib hvcaslib_basic desc='Basic Hadoop Caslib'
   dataSource=(srctype='hadoop',
               server='your_hive_server',
               username='your_username',
               password='your_password',
               schema='your_schema',
               bulkload=false);

/* Assign librefs for all caslibs in the session */
caslib _all_ assign;

/* Display information about the caslib */
caslib hvcaslib_basic list;
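Once the caslib is assigned, a Hive table from the schema can be pulled into CAS with a serial PROC CASUTIL load. A minimal sketch, assuming a hypothetical Hive table named `sales_data` in your schema:

/* Serial load of a Hive table through the basic caslib */
/* 'sales_data' is a hypothetical name; replace it with a table from your schema */
PROC CASUTIL;
   load casdata="sales_data" incaslib="hvcaslib_basic"
        outcaslib="casuser" casout="sales_data_cas";
   list tables incaslib="casuser";
RUN;
QUIT;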
Code Block 2 : CASLIB Data
Explanation :
This example configures a Hadoop caslib that uses ZooKeeper for dynamic service discovery. It is ideal for HiveServer2 environments in high availability mode. The options `zooKeeperQuorum`, `zooKeeperNamespace`, and `zooKeeperPort` are used to specify ZooKeeper connection information. Be sure to replace the placeholders with the actual values for your environment, including the paths to the Hadoop jar and configuration directories, as well as ZooKeeper and authentication information.
/* Connection to Hive with dynamic service discovery via ZooKeeper */
/* Replace the values in angle brackets (<>) with your own information */

caslib zoocaslib desc='ZooKeeper Hadoop Caslib'
   dataSource=(srctype='hadoop',
               username='your_username',
               password='your_password',
               dataTransferMode='serial',
               hadoopJarPath='<Hadoop_jar_path_directory>',
               hadoopConfigDir='<Hadoop_configuration_directory>',
               zooKeeperQuorum="<node1.company.com>,<node2.company.com>,<node3.company.com>",
               zooKeeperNamespace='hiveserver2',
               zooKeeperPort=2181,
               schema='your_schema');

/* Assign librefs for all caslibs in the session */
caslib _all_ assign;

/* Display information about the caslib */
caslib zoocaslib list;
Code Block 3 : CASLIB Data
Explanation :
This example illustrates connecting to a Hive server configured with Kerberos authentication. The options `krbAuthType`, `krbHostFQDN`, `krbRealm`, and `krbServiceName` are essential for passing Kerberos connection properties to the Simba driver. Replace placeholders such as `your_hive_server`, `your_schema`, `myhost.company.com`, and `example.com` with the appropriate values for your Kerberos environment.
/* Connection to Hive with Kerberos authentication */
/* Replace the values in angle brackets (<>) with your own information */

caslib hadoopkrb desc='Kerberos Hadoop Caslib'
   dataSource=(srctype='hadoop',
               server='your_hive_server',
               schema='your_schema',
               use_ssl=true,
               krbAuthType=2,
               krbHostFQDN='<myhost.company.com>',
               krbRealm='<example.com>',
               krbServiceName='hive',
               bulkload=false);

/* Assign librefs for all caslibs in the session */
caslib _all_ assign;

/* Display information about the caslib */
caslib hadoopkrb list;
Code Block 4 : PROC CASUTIL Data
Explanation :
This example illustrates configuring a caslib for parallel data loading from Hive to CAS, then uses `PROC CASUTIL` to perform this loading. To make this example self-contained, a CAS table named `myHDdata` is first created to simulate the data that would normally be read from Hive. The options `dataTransferMode="parallel"`, `jobManagementURL`, `hadoopConfigDir`, and `bulkLoad=false` are crucial for enabling parallel loading. Replace placeholders such as `your_hive_server`, `your_schema`, `your_username`, `your_password`, `Livy_URL`, and `path-to-EP-config-files` with the specific values for your Hadoop/Hive installation.
/* Create a demonstration CAS table to simulate a Hive source */
/* Assumes an active CAS session (casauto) with the CASUSER libref assigned */
DATA casuser.myHDdata;
   INPUT id $ name $ value;
   DATALINES;
1 Apple 10
2 Banana 20
3 Cherry 30
4 Date 40
;
RUN;

/* Connection to Hive with parallel loading enabled */
/* Replace the values in angle brackets (<>) with your own information */

caslib hvcaslib_parallel desc='Hadoop Caslib with Parallel Loading'
   dataSource=(srctype='hadoop',
               server='your_hive_server',
               schema='your_schema',
               username='your_username',
               password='your_password',
               dataTransferMode="parallel",
               jobManagementURL='<Livy_URL>',
               hadoopConfigDir='<path-to-EP-config-files>',
               bulkLoad=false);

/* Assign librefs for all caslibs in the session */
caslib _all_ assign;

/* Load the 'myHDdata' table from Hive (simulated here by the CAS table) into CAS */
/* In practice, 'myHDdata' would be an existing table in Hive */
PROC CASUTIL;
   list files incaslib="hvcaslib_parallel";
   load casdata="myHDdata" incaslib="hvcaslib_parallel" outcaslib="casuser"
        casout="HDdata_from_hvcaslib_parallel";
   list tables incaslib="casuser";
   contents casdata="HDdata_from_hvcaslib_parallel" incaslib="casuser";
RUN;
QUIT;
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.

Related Documentation : Data Access

Topic / Keyword : DOC FedSQL
Link to the resource : en/sampleCode/FEDSQL9D66
Expert Advice
Michael, Viya infrastructure manager:
"Before running a large parallel load, check your Livy URL and Hadoop configuration directories. A common cause of 'Parallel Load Failure' is a mismatch between the Hadoop site-xml files on the CAS controller and those actually running on the Hadoop cluster nodes."
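Following this advice, one way to isolate a parallel-load failure is to retry the same load in serial mode: if the serial transfer succeeds, the Hive connection itself is healthy and the problem most likely lies in the Embedded Process or Livy configuration. A minimal sketch, reusing the placeholder values from Code Block 4 (the caslib name hvcaslib_serialtest is illustrative):

/* Temporary serial-mode caslib used to isolate parallel-load failures */
caslib hvcaslib_serialtest desc='Serial Fallback Test'
   dataSource=(srctype='hadoop',
               server='your_hive_server',
               schema='your_schema',
               username='your_username',
               password='your_password',
               dataTransferMode='serial',
               bulkload=false);

/* If this serial load succeeds while the parallel load fails, suspect the */
/* Embedded Process / Livy configuration rather than the Hive connection */
PROC CASUTIL;
   load casdata="myHDdata" incaslib="hvcaslib_serialtest"
        outcaslib="casuser" casout="serial_test";
RUN;
QUIT;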