JdbcStorageHandler supports reading from a JDBC data source in Hive. Writing to a JDBC data source is currently not supported. To use JdbcStorageHandler, you need to create an external table using JdbcStorageHandler. Here is a simple example:
CREATE EXTERNAL TABLE student_jdbc
(
  name string,
  age int,
  gpa double
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  "hive.sql.database.type" = "MYSQL",
  "hive.sql.jdbc.driver" = "com.mysql.jdbc.Driver",
  "hive.sql.jdbc.url" = "jdbc:mysql://localhost/sample",
  "hive.sql.dbcp.username" = "hive",
  "hive.sql.dbcp.password" = "hive",
  "hive.sql.table" = "STUDENT",
  "hive.sql.dbcp.maxActive" = "1"
);
You can also alter the table properties of the JDBC external table using an ALTER TABLE statement, just like any other non-native Hive table:
ALTER TABLE student_jdbc SET TBLPROPERTIES ("hive.sql.dbcp.password" = "passwd");
In the CREATE TABLE statement, you are required to specify the following table properties:
hive.sql.database.type: MYSQL, POSTGRES, ORACLE, DERBY, DB2
hive.sql.jdbc.url: JDBC connection string
hive.sql.jdbc.driver: JDBC driver class
hive.sql.dbcp.username: JDBC user name
hive.sql.dbcp.password: JDBC password in clear text. Using this property is strongly discouraged; the recommended way is to store the password in a keystore. See the section "Securing Password" for details.
hive.sql.table / hive.sql.query: You will need to specify either "hive.sql.table" or "hive.sql.query" to tell Hive how to get data from the JDBC database. "hive.sql.table" denotes a single table, and "hive.sql.query" denotes an arbitrary SQL query.
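As a sketch of the "hive.sql.query" variant, an external table can be backed by an arbitrary query instead of a single table. The table name, query text, and connection details below are illustrative, not taken from the example above:

```sql
-- Illustrative sketch: external table backed by "hive.sql.query"
-- instead of "hive.sql.table". Query and connection values are assumptions.
CREATE EXTERNAL TABLE student_honors_jdbc
(
  name string,
  gpa double
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  "hive.sql.database.type" = "MYSQL",
  "hive.sql.jdbc.driver" = "com.mysql.jdbc.Driver",
  "hive.sql.jdbc.url" = "jdbc:mysql://localhost/sample",
  "hive.sql.dbcp.username" = "hive",
  "hive.sql.dbcp.password" = "hive",
  "hive.sql.query" = "SELECT name, gpa FROM STUDENT WHERE gpa > 3.0"
);
```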
Besides the above required properties, you can also specify optional parameters to tune the connection details and performance:
hive.sql.catalog: JDBC catalog name (only valid if "hive.sql.table" is specified)
hive.sql.schema: JDBC schema name (only valid if "hive.sql.table" is specified)
hive.sql.jdbc.fetch.size: number of rows to fetch in a batch
hive.sql.dbcp.xxx: all dbcp parameters are passed through to commons-dbcp. See https://commons.apache.org/proper/commons-dbcp/configuration.html for the definition of the parameters. For example, if you specify hive.sql.dbcp.maxActive=1 in the table properties, Hive will pass maxActive=1 to commons-dbcp.
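Putting the optional parameters together, a table definition might include a TBLPROPERTIES fragment like the following sketch. The catalog, schema, and numeric values are hypothetical, chosen only to show the syntax:

```sql
-- Sketch: optional tuning properties (values are illustrative)
TBLPROPERTIES (
  . . . . . .
  "hive.sql.table" = "STUDENT",
  "hive.sql.catalog" = "sample",        -- hypothetical catalog name
  "hive.sql.schema" = "public",         -- hypothetical schema name
  "hive.sql.jdbc.fetch.size" = "1000",  -- fetch 1000 rows per batch
  "hive.sql.dbcp.maxActive" = "1",      -- passed to commons-dbcp as maxActive=1
  . . . . . .
);
```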
The column data types for a Hive JdbcStorageHandler table can be:
Numeric data type: byte, short, int, long, float, double
Decimal with scale and precision
String data type: string, char, varchar
Date
Timestamp
Note that complex data types (struct, map, array) are not supported.
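As a sketch, a JdbcStorageHandler table exercising several of the supported scalar types could be declared as follows. The table name is hypothetical, the Hive DDL type names (bigint, decimal, varchar, date, timestamp) correspond to the categories listed above, and the connection-related TBLPROPERTIES are elided:

```sql
-- Sketch: supported scalar column types (connection properties elided)
CREATE EXTERNAL TABLE typed_jdbc
(
  id bigint,
  ratio double,
  price decimal(10,2),
  label varchar(32),
  born date,
  updated timestamp
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  . . . . . .
);
```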
hive.sql.table / hive.sql.query defines tabular data with a schema. The schema definition has to be the same as the table schema definition. For example, the following CREATE TABLE statement will fail:
CREATE EXTERNAL TABLE student_jdbc
(
  name string,
  age int,
  gpa double
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  . . . . . .
  "hive.sql.query" = "SELECT name, age, gpa, gender FROM STUDENT"
);
However, the column names and column types of the hive.sql.table / hive.sql.query schema may differ from the table schema. In this case, database columns map to Hive columns by position. If the data types differ, Hive will try to convert them according to the Hive table schema. For example:
CREATE EXTERNAL TABLE student_jdbc
(
  sname string,
  age int,
  effective_gpa decimal(4,3)
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  . . . . . .
  "hive.sql.query" = "SELECT name, age, gpa FROM STUDENT"
);
Hive will try to convert the double "gpa" of the underlying table STUDENT to decimal(4,3) for the effective_gpa field of the student_jdbc table. In case the conversion is not possible, Hive will produce null for the field.
JdbcStorageHandler will automatically ship the required jars to the MR/Tez/LLAP backend if JdbcStorageHandler is used in the query; users don't need to add the jars manually. JdbcStorageHandler will also ship the required JDBC driver jar to the backend if it detects any JDBC driver jar in the classpath (including mysql, postgres, oracle, and mssql). However, users are still required to copy the JDBC driver jar to the Hive classpath (usually, the lib directory in Hive).
In most cases, we don't want to store the JDBC password in clear text in the table property "hive.sql.dbcp.password". Instead, users can store the password in a Java keystore file on HDFS using the following commands:
hadoop credential create host1.password -provider jceks://hdfs/user/foo/test.jceks -v passwd1
hadoop credential create host2.password -provider jceks://hdfs/user/foo/test.jceks -v passwd2
This will create a keystore file located on HDFS at /user/foo/test.jceks which contains two keys: host1.password and host2.password. When creating the table in Hive, you will need to specify "hive.sql.dbcp.password.keystore" and "hive.sql.dbcp.password.key" instead of "hive.sql.dbcp.password" in the CREATE TABLE statement:
CREATE EXTERNAL TABLE student_jdbc
(
  name string,
  age int,
  gpa double
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  . . . . . .
  "hive.sql.dbcp.password.keystore" = "jceks://hdfs/user/foo/test.jceks",
  "hive.sql.dbcp.password.key" = "host1.password",
  . . . . . .
);
You will need to protect the keystore file by authorizing only the targeted users to read it, using an authorizer (such as Ranger). Hive will check the permissions of the keystore file to make sure the user has read permission on it when creating or altering the table.
Hive is able to split the JDBC data source and process each split in parallel. Users can use the following table properties to decide whether to split, and how many splits to split into:
hive.sql.numPartitions: how many splits to generate for the data source; 1 if no split
hive.sql.partitionColumn: which column to split on. If this is specified, Hive will split the column into hive.sql.numPartitions equal intervals from hive.sql.lowerBound to hive.sql.upperBound. If partitionColumn is not defined but numPartitions > 1, Hive will split the data source using offsets. However, offsets are not always reliable for some databases, so it is highly recommended to define a partitionColumn if you want to split the data source. The partitionColumn must exist in the schema that "hive.sql.table"/"hive.sql.query" produces.
hive.sql.lowerBound / hive.sql.upperBound: lower/upper bound of the partitionColumn used to calculate the intervals. Both properties are optional. If undefined, Hive will do a MIN/MAX query against the data source to get the lower/upper bound. Note that neither hive.sql.lowerBound nor hive.sql.upperBound can be null. The first and last splits are open-ended, and all null values for the column will go to the first split.
For example:
TBLPROPERTIES (
  . . . . . .
  "hive.sql.table" = "DEMO",
  "hive.sql.partitionColumn" = "num",
  "hive.sql.numPartitions" = "3",
  "hive.sql.lowerBound" = "1",
  "hive.sql.upperBound" = "10",
  . . . . . .
);
This table will create 3 splits: num<4 or num is null, 4<=num<7, num>=7.
TBLPROPERTIES (
  . . . . . .
  "hive.sql.query" = "SELECT name, age, gpa/5.0*100 AS percentage FROM STUDENT",
  "hive.sql.partitionColumn" = "percentage",
  "hive.sql.numPartitions" = "4",
  . . . . . .
);
Hive will do a JDBC query to get the MIN/MAX of the percentage column of the query, which are 60 and 100. Then the table will create 4 splits: (,70), [70,80), [80,90), [90,). The first split also includes null values.
To see the splits generated by JdbcStorageHandler, look for the following messages in the HiveServer2 log or Tez AM log:
jdbc.JdbcInputFormat: Num input splits created 4
jdbc.JdbcInputFormat: split:interval:ikey[,70)
jdbc.JdbcInputFormat: split:interval:ikey[70,80)
jdbc.JdbcInputFormat: split:interval:ikey[80,90)
jdbc.JdbcInputFormat: split:interval:ikey[90,)
Hive pushes down computation to the JDBC table aggressively, so that we can make the best use of the native capabilities of the JDBC data source.
For example, if we have another table voter_jdbc:
CREATE EXTERNAL TABLE voter_jdbc
(
  name string,
  age int,
  registration string,
  contribution decimal(10,2)
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  "hive.sql.database.type" = "MYSQL",
  "hive.sql.jdbc.driver" = "com.mysql.jdbc.Driver",
  "hive.sql.jdbc.url" = "jdbc:mysql://localhost/sample",
  "hive.sql.dbcp.username" = "hive",
  "hive.sql.dbcp.password" = "hive",
  "hive.sql.table" = "VOTER"
);
Then the following join operation will be pushed down to MySQL:
select * from student_jdbc join voter_jdbc on student_jdbc.name = voter_jdbc.name;
This can be verified with EXPLAIN:
explain select * from student_jdbc join voter_jdbc on student_jdbc.name = voter_jdbc.name;
. . . . . .
TableScan
  alias: student_jdbc
  properties:
    hive.sql.query SELECT `t`.`name`, `t`.`age`, `t`.`gpa`, `t0`.`name` AS `name0`, `t0`.`age` AS `age0`, `t0`.`registration`, `t0`.`contribution`
    FROM (SELECT * FROM `STUDENT` WHERE `name` IS NOT NULL) AS `t`
    INNER JOIN (SELECT * FROM `VOTER` WHERE `name` IS NOT NULL) AS `t0` ON `t`.`name` = `t0`.`name`
. . . . . .
Computation pushdown will only happen when the JDBC table is defined by "hive.sql.table". Hive rewrites the data source with a "hive.sql.query" property that layers more computation on top of the table. In the above example, MySQL will run the query and return the join result, rather than Hive fetching both tables and doing the join itself.
The operators that can be pushed down include filter, transform, join, union, aggregation, and sort.
The derived MySQL query can be very complex, and in many cases we don't want to split the data source and thereby run the complex query multiple times, once per split. So if the computation is more than just filter and transform, Hive will not split the query result even if "hive.sql.numPartitions" is more than 1.
The notion of a schema differs from DBMS to DBMS, such as Oracle, MSSQL, MySQL, and PostgreSQL. Correct usage of the hive.sql.schema table property can prevent problems with client connections to external JDBC tables. For more information, see HIVE-25591. To create external tables based on a user-defined schema in a JDBC-compliant database, follow the examples below for the respective databases.
CREATE SCHEMA bob;
CREATE TABLE bob.country
(
  id int,
  name varchar(20)
);
insert into bob.country values (1, 'India');
insert into bob.country values (2, 'Russia');
insert into bob.country values (3, 'USA');

CREATE SCHEMA alice;
CREATE TABLE alice.country
(
  id int,
  name varchar(20)
);
insert into alice.country values (4, 'Italy');
insert into alice.country values (5, 'Greece');
insert into alice.country values (6, 'China');
insert into alice.country values (7, 'Japan');
CREATE DATABASE world;
USE world;
CREATE SCHEMA bob;
CREATE TABLE bob.country
(
  id int,
  name varchar(20)
);
insert into bob.country values (1, 'India');
insert into bob.country values (2, 'Russia');
insert into bob.country values (3, 'USA');

CREATE SCHEMA alice;
CREATE TABLE alice.country
(
  id int,
  name varchar(20)
);
insert into alice.country values (4, 'Italy');
insert into alice.country values (5, 'Greece');
insert into alice.country values (6, 'China');
insert into alice.country values (7, 'Japan');
Create a user and associate them with a default schema. For example:
CREATE LOGIN greg WITH PASSWORD = 'GregPass123!$';
CREATE USER greg FOR LOGIN greg WITH DEFAULT_SCHEMA=bob;
Allow the user to connect to the database and run queries. For example:
GRANT CONNECT, SELECT TO greg;
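On the Hive side, an external table pointing at bob's schema can then set hive.sql.schema. The following is a sketch: the Hive table name, SQL Server driver class, JDBC URL, and port are assumptions for a local SQL Server instance, reusing the greg credentials created above:

```sql
-- Sketch: Hive external table over the SQL Server table bob.country
-- (driver class, URL, and table name are illustrative assumptions)
CREATE EXTERNAL TABLE country_mssql
(
  id int,
  name varchar(20)
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  "hive.sql.database.type" = "MSSQL",
  "hive.sql.jdbc.driver" = "com.microsoft.sqlserver.jdbc.SQLServerDriver",
  "hive.sql.jdbc.url" = "jdbc:sqlserver://localhost:1433;databaseName=world",
  "hive.sql.dbcp.username" = "greg",
  "hive.sql.dbcp.password" = "GregPass123!$",
  "hive.sql.schema" = "bob",
  "hive.sql.table" = "country"
);
```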
In Oracle, dividing the tables into different namespaces/schemas is achieved through different users. The CREATE SCHEMA statement exists in Oracle, but it has different semantics from those defined by the SQL standard and those adopted in other DBMSs.
To create "local" users in Oracle you need to be connected to the Pluggable Database (PDB), not to the Container Database (CDB). The following example was tested in Oracle XE edition, using only PDB XEPDB1.
ALTER SESSION SET CONTAINER = XEPDB1;
Create the bob schema/user and grant the privileges needed to connect to the database. For example:
CREATE USER bob IDENTIFIED BY bobpass;
ALTER USER bob QUOTA UNLIMITED ON users;
GRANT CREATE SESSION TO bob;
CREATE TABLE bob.country
(
  id int,
  name varchar(20)
);
insert into bob.country values (1, 'India');
insert into bob.country values (2, 'Russia');
insert into bob.country values (3, 'USA');
Create the alice schema/user and grant the privileges needed to connect to the database. For example:
CREATE USER alice IDENTIFIED BY alicepass;
ALTER USER alice QUOTA UNLIMITED ON users;
GRANT CREATE SESSION TO alice;
CREATE TABLE alice.country
(
  id int,
  name varchar(20)
);
insert into alice.country values (4, 'Italy');
insert into alice.country values (5, 'Greece');
insert into alice.country values (6, 'China');
insert into alice.country values (7, 'Japan');
Without the SELECT ANY TABLE privilege, a user cannot see the tables/views of another user. When a user connects to the database using a specific user and schema, it is not possible to refer to tables in another user/schema namespace. You need to grant the SELECT ANY TABLE privilege. For example:
GRANT SELECT ANY TABLE TO bob;
GRANT SELECT ANY TABLE TO alice;
Allow the users to perform inserts on any table/view in the database, not only those present in their own schema. For example:
GRANT INSERT ANY TABLE TO bob;
GRANT INSERT ANY TABLE TO alice;
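On the Hive side, an external table can connect as bob and, thanks to SELECT ANY TABLE, read alice's table by setting hive.sql.schema. This is a sketch: the Hive table name, thin-driver class, JDBC URL, and service name are assumptions for the XE setup above, and ALICE/COUNTRY are upper-cased because Oracle stores unquoted identifiers in upper case:

```sql
-- Sketch: Hive external table over Oracle table alice.country,
-- connecting as bob (driver class and URL are illustrative assumptions)
CREATE EXTERNAL TABLE country_oracle
(
  id int,
  name varchar(20)
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  "hive.sql.database.type" = "ORACLE",
  "hive.sql.jdbc.driver" = "oracle.jdbc.OracleDriver",
  "hive.sql.jdbc.url" = "jdbc:oracle:thin:@localhost:1521/XEPDB1",
  "hive.sql.dbcp.username" = "bob",
  "hive.sql.dbcp.password" = "bobpass",
  "hive.sql.schema" = "ALICE",
  "hive.sql.table" = "COUNTRY"
);
```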
CREATE SCHEMA = bob; CREATE TABLE bob.country ( id int, name varchar(20) ); insert into bob.country values (1, 'India'); insert into bob.country values (2, 'Russia'); insert into bob.country values (3, 'USA'); CREATE SCHEMA alice; CREATE TABLE alice.country ( id int, name varchar(20) ); insert into alice.country values (4, 'Italy'); insert into alice.country values (5, 'Greece'); insert into alice.country values (6, 'China'); insert into alice.country values (7, 'Japan');
Create a user and associate them with a default schema (search_path). For example:
CREATE ROLE greg WITH LOGIN PASSWORD 'GregPass123!$';
ALTER ROLE greg SET search_path TO bob;
Grant the necessary permissions to access the schema. For example:
GRANT USAGE ON SCHEMA bob TO greg;
GRANT SELECT ON ALL TABLES IN SCHEMA bob TO greg;
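On the Hive side, an external table over bob's schema in PostgreSQL might then look like this sketch. The Hive table name, PostgreSQL driver class, JDBC URL, port, and database name are assumptions for a local instance, reusing the greg role created above:

```sql
-- Sketch: Hive external table over the PostgreSQL table bob.country
-- (driver class, URL, and database name are illustrative assumptions)
CREATE EXTERNAL TABLE country_postgres
(
  id int,
  name varchar(20)
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  "hive.sql.database.type" = "POSTGRES",
  "hive.sql.jdbc.driver" = "org.postgresql.Driver",
  "hive.sql.jdbc.url" = "jdbc:postgresql://localhost:5432/postgres",
  "hive.sql.dbcp.username" = "greg",
  "hive.sql.dbcp.password" = "GregPass123!$",
  "hive.sql.schema" = "bob",
  "hive.sql.table" = "country"
);
```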