
Objective

Configure a connection between Apache Drill and your file system, Hive, or HBase data source and query data.

Glossary

The following table provides some Drill terms and their definitions:


storage plugin

An interface that Drill uses to interact with a data source to get the following information:

  • Metadata available in the data source
  • Interfaces that Drill can use to read from and write to data sources
  • Location of data and a set of optimization rules to help with efficient and faster execution of Drill queries

storage plugin type

The type of interface that Drill uses to interact with data sources. Drill provides interfaces for the following types of data sources:

  • hive
  • file
  • hbase

storage plugin instance

A data source that you have registered with Drill using a distinct name. This configuration enables Drill to interact with the data source. You can register more than one storage plugin instance for a storage plugin type. For example, you can register multiple Hive metastores to one Hive storage plugin type.

schema

A schema is a database, and a database is a container for tables. Drill databases have the following structure:

<storage plugin instance name>.<storage plugin specific database structure>.<table>.<column>

metadata

The characteristics that describe data. Drill metadata comes from storage plugin instances and corresponds directly to those instances. Drill has a decentralized metadata architecture.

Overview

Apache Drill serves as a query layer that uses storage plugins to interact with data sources. You can think of a storage plugin as a connection between Drill and a data source.

The following image represents the storage plugin layer between Drill and a data source: 

Storage plugins provide the following information to Drill:

  • Metadata available in the underlying data source
  • Location of data
  • Interfaces that Drill can use to read from and write to data sources
  • A set of storage plugin optimization rules that assist with efficient and faster execution of Drill queries, such as pushdowns, statistics, and partition awareness

Storage plugins perform scanner and writer functions, and inform the metadata repository of any known metadata, such as:

  • Schema
  • File size
  • Data ordering
  • Secondary indices
  • Number of blocks

Storage plugins inform the execution engine of any native capabilities, such as predicate pushdown, joins, and SQL.

Drill provides storage plugins for files and HBase/M7.  Drill also integrates with Hive through a storage plugin. Hive provides a metadata abstraction layer on top of files and HBase/M7.

When you use Drill to query files or HBase/M7 tables, Drill can perform direct queries on the data or go through Hive if you have metadata defined there. Drill integrates with the Hive metastore for metadata and also uses a Hive SerDe for the deserialization of records. Drill does not invoke the Hive execution engine for any requests.

Storage Plugin Registration

You can connect Drill to a file system, Hive, or HBase data source. To connect Drill to a data source, you must register the data source as a storage plugin instance in the Drill Web UI. You register an instance of a data source as a file, hive, or hbase storage plugin type. You can register multiple storage plugin instances for each storage plugin type. 

Each node with a Drillbit installed serves the Drill Web UI, which you can access at http://localhost:8047; the default Drillbit Web UI port is 8047. The Drill Web UI includes cp, dfs, hive, and hbase storage plugin instances by default, but the hive and hbase instances are disabled. You can update the hive and hbase instances with configuration information to enable them. The cp instance points to Drill's classpath, the location where various JAR files are stored; it includes a JAR file with sample data that you can query. By default, the dfs instance points to the local file system on your machine, but you can configure this instance to point to any distributed file system, such as a Hadoop or S3 file system.

When you add or update storage plugin instances on one Drill node, Drill broadcasts the information to all of the other Drill nodes so they all have identical storage plugin configurations. You do not need to restart any of the Drillbits when you add or update a storage plugin instance.

Each storage plugin instance that you register with Drill must have a distinct name. For example, if you register two storage plugin instances for a Hadoop file system, you might name one storage plugin instance hdfstest and the other instance hdfsproduction.

The following example shows an HDFS data source registered in the Drill Web UI as a storage plugin instance of plugin type "file":
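(A minimal sketch; the namenode host, port, and workspace location are placeholders for your environment.)

{
  "type": "file",
  "enabled": true,
  "connection": "hdfs://<namenode host>:<port>/",
  "workspaces": {
    "default": {
      "location": "/",
      "writable": false
    }
  },
  "formats": {
    "json": { "type": "json" },
    "parquet": { "type": "parquet" }
  }
}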

Workspaces

When you register an instance of a file system data source, you can configure one or more workspaces for the instance. A workspace is a directory within the file system that you define. Drill searches the workspace to locate data when you run a query. Each workspace that you register defines a schema that you can connect to and query. Configuring workspaces is useful when you want to run multiple queries on files or tables in a specific directory. You cannot create workspaces for hive and hbase instances, though Hive databases show up as workspaces in Drill. 

The following example shows an instance of a file type storage plugin with a workspace named json configured to point Drill to the /users/max/drill/json/ directory in the local file system (dfs):
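(A minimal sketch; formats other than json are omitted.)

{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "workspaces": {
    "json": {
      "location": "/users/max/drill/json",
      "writable": false
    }
  },
  "formats": {
    "json": { "type": "json" }
  }
}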

Note: The connection parameter in the configuration above is "file:///", connecting Drill to the local file system (dfs). To connect to a Hadoop or MapR file system, the connection parameter would be "hdfs:///" or "maprfs:///", respectively.

To query a file in the example json workspace, you can issue the USE command to tell Drill to use the json workspace configured in the dfs instance for each query that you issue:

Example
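A sketch, assuming the donuts.json sample file resides in the json workspace:

USE dfs.json;
-- After the USE statement, queries resolve against the json workspace:
SELECT * FROM `donuts.json`;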

If the json workspace did not exist, the query would have to include the full path to the donuts.json file: 
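A sketch using the directory from the configuration above:

SELECT * FROM dfs.`/users/max/drill/json/donuts.json`;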

Using a workspace alleviates the need to repeatedly enter the directory path in subsequent queries on the directory.

Default Workspaces

By default, each file and hive instance includes a workspace named default. The default workspace points to the file system or to the Hive metastore. When you query a table in the hive default workspace, you can omit the workspace name from the query.

For example, you can issue a query on a Hive table in the default workspace using either of the following formats and get the same results:

Example
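A sketch, assuming a Hive table named customers (note the backticks around the reserved word default):

SELECT * FROM hive.`default`.customers;
SELECT * FROM hive.customers;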

Note: Default is a reserved word. You must enclose reserved words in backticks.

Because HBase instances do not have workspaces, you can use the following format to query a table in HBase:

Example
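A sketch using the students table created in the exercise later in this document:

SELECT * FROM hbase.`students`;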

After you register a data source as a storage plugin instance with Drill, and optionally configure workspaces, you can query the data source.  

Registering a File System

You can register a storage plugin instance that connects Drill to a local file system or a distributed file system registered in core-site.xml, such as S3 or HDFS. When you register a storage plugin instance for a file system, provide a unique name for the instance, and identify the type as “file”. By default, Drill includes an instance named dfs that points to the local file system on your machine. You can update this configuration to point to a distributed file system or you can create a new instance to point to a distributed file system.  

To register a local or a distributed file system with Apache Drill, complete the following steps:

  1. Navigate to http://localhost:8047, and select the Storage tab.
  2. In the New Storage Plugin window, enter a unique name and then click Create.
  3. In the Configuration window, provide the configuration information for the type of file system that you are configuring as a data source, as shown in the local and distributed file system examples after these steps.
  4. Click Enable.
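
    Local file system example (a minimal sketch; the workspace name and directory are placeholders):

    {
      "type": "file",
      "enabled": true,
      "connection": "file:///",
      "workspaces": {
        "root": {
          "location": "/users/max/drill",
          "writable": false
        }
      },
      "formats": {
        "json": { "type": "json" },
        "parquet": { "type": "parquet" }
      }
    }

    Distributed file system example (a minimal sketch; replace the namenode host and port with the values from your core-site.xml):

    {
      "type": "file",
      "enabled": true,
      "connection": "hdfs://<namenode host>:<port>/",
      "workspaces": {
        "root": {
          "location": "/user/drill",
          "writable": true
        }
      },
      "formats": {
        "json": { "type": "json" },
        "parquet": { "type": "parquet" }
      }
    }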

    Once you have configured a storage plugin instance for the file system, you can issue Drill queries against it. For information about querying a file system, refer to Querying a File System.

Registering Hive

You can register a storage plugin instance that connects Drill to a Hive data source that has a remote or embedded metastore service. When you register a storage plugin instance for a Hive data source, provide a unique name for the instance, and identify the type as “hive”. You must also provide the metastore connection information.

Hive Remote Metastore

In this configuration, the Hive metastore runs as a separate service outside of Hive. Drill communicates with the Hive metastore through Thrift. The metastore service communicates with the Hive database over JDBC. Point Drill to the Hive metastore service address, and provide the connection parameters in the Drill Web UI to configure a connection between Drill and the Hive metastore.

Note: Verify that the Hive metastore service is running before you register the Hive metastore.

To register a remote Hive metastore with Drill, complete the following steps:

  1. Issue the following command to start the Hive metastore service on the system specified in hive.metastore.uris:
    hive --service metastore
  2. Navigate to http://localhost:8047, and select the Storage tab.
  3. In the disabled storage plugins section, click Update next to the hive instance.

  4. In the configuration window, add the Thrift URI and port to hive.metastore.uris, as shown in the example after these steps.

  5. Click Enable.

  6. Verify that HADOOP_CLASSPATH is set in drill-env.sh. If you need to set the classpath, add the line shown after these steps to drill-env.sh.
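
    Example hive instance configuration (a minimal sketch; the metastore host is a placeholder, and 9083 is the conventional metastore Thrift port):

    {
      "type": "hive",
      "enabled": true,
      "configProps": {
        "hive.metastore.uris": "thrift://<metastore host>:9083",
        "hive.metastore.sasl.enabled": "false"
      }
    }

    Line to add to drill-env.sh if HADOOP_CLASSPATH is not set (the path is a placeholder for your Hadoop installation):

    export HADOOP_CLASSPATH=/<path to Hadoop>   # placeholder: your Hadoop installation path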

    Once you have configured a storage plugin instance for a Hive data source, you can issue Drill queries against it. For information about querying a Hive data source, refer to Querying Hive Tables.

Hive Embedded Metastore

In this configuration, the Hive metastore is embedded within the Drill process. Provide the metastore database configuration settings in the Drill Web UI. Before you register Hive, verify that the driver you use to connect to the Hive metastore is in the Drill classpath, located in /<drill installation directory>/lib/. If the driver is not there, copy the driver to /<drill installation directory>/lib on the Drill node. For more information about storage types and configurations, refer to AdminManual MetastoreAdmin.

To register an embedded Hive metastore with Drill, complete the following steps:

  1. Navigate to http://localhost:8047, and select the Storage tab.

  2. In the disabled storage plugins section, click Update next to the hive instance.

  3. In the configuration window, add the database configuration settings, as shown in the example after these steps.
  4. Click Enable.

  5. Verify that HADOOP_CLASSPATH is set in drill-env.sh. If you need to set the classpath, add the line shown after these steps to drill-env.sh.
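
    Example hive instance configuration for an embedded metastore (a minimal sketch using a local Derby metastore database; the database and warehouse paths are placeholders):

    {
      "type": "hive",
      "enabled": true,
      "configProps": {
        "hive.metastore.uris": "",
        "javax.jdo.option.ConnectionURL": "jdbc:derby:;databaseName=/<path>/drill_hive_db;create=true",
        "hive.metastore.warehouse.dir": "/<path>/drill_hive_wh",
        "fs.default.name": "file:///"
      }
    }

    Line to add to drill-env.sh if HADOOP_CLASSPATH is not set (the path is a placeholder for your Hadoop installation):

    export HADOOP_CLASSPATH=/<path to Hadoop>   # placeholder: your Hadoop installation path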

    Once you have configured a storage plugin instance for the Hive data source, you can issue Drill queries against it. For information about querying a Hive data source, refer to Querying Hive Tables.

     

Registering HBase

Register a storage plugin instance and specify a ZooKeeper quorum to connect Drill to an HBase data source. When you register a storage plugin instance for an HBase data source, provide a unique name for the instance, and identify the type as "hbase" in the Drill Web UI.

Note: Currently, Drill only works with HBase version 0.94.

To register HBase with Drill, complete the following steps:

  1. Navigate to http://localhost:8047, and select the Storage tab.
  2. In the disabled storage plugins section, click Update next to the hbase instance.
  3. In the Configuration window, specify the ZooKeeper quorum and port, as shown in the example after these steps.
  4. Click Enable.
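
    Example hbase instance configuration (a minimal sketch; the ZooKeeper host names are placeholders, and 2181 is the default ZooKeeper client port):

    {
      "type": "hbase",
      "enabled": true,
      "config": {
        "hbase.zookeeper.quorum": "<zk1host>,<zk2host>,<zk3host>",
        "hbase.zookeeper.property.clientPort": "2181"
      }
    }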

Once you have configured a storage plugin instance for HBase, you can issue Drill queries against it. For information about querying an HBase data source, refer to Querying HBase Tables.

Connect to Drill

You can start Drill after you register your data sources. You start a Drillbit and invoke SQLLine to start Drill. SQLLine serves as the Drill shell; it connects to relational databases and executes SQL commands. You invoke SQLLine for Drill in embedded or distributed mode, depending on your Drill installation and, if you registered Hive as a data source, your Hive metastore configuration. If you want to connect directly to a particular schema, you can indicate the schema name when you invoke SQLLine.

Starting a Drillbit

If you installed Drill in embedded mode, you do not need to start the Drillbit. The Drillbit starts automatically when you invoke SQLLine.

To start a Drillbit, cd to /opt/drill, and issue the following command:

bin/drillbit.sh restart

Invoking SQLLine/Connecting to Schema

To start SQLLine, issue the appropriate command for your Drill installation type:

Embedded: Drill installed locally (embedded mode); Hive with an embedded metastore.

To connect without specifying a schema, issue the following command:

$ bin/sqlline -u jdbc:drill:zk=local -n admin -p admin

Once you are in the prompt, you can issue USE <schema> or you can use absolute notation, such as schema.table.column.

To connect to a schema directly, issue the command with the schema name:

$ bin/sqlline -u jdbc:drill:schema=<database>;zk=local -n admin -p admin

Distributed: Drill installed in distributed mode; Hive with a remote metastore; HBase.

To connect without specifying a schema, issue the following command:

$ bin/sqlline -u jdbc:drill:zk=<zk1host>:<port>,<zk2host>:<port>,<zk3host>:<port> -n admin -p admin

Once you are in the prompt, you can issue USE <schema> or you can use absolute notation, such as schema.table.column.

To connect to a schema directly, issue the command with the schema name:

$ bin/sqlline -u jdbc:drill:schema=<database>;zk=<zk1host>:<port>,<zk2host>:<port>,<zk3host>:<port> -n admin -p admin

When SQLLine starts, the system displays the following prompt:

0: jdbc:drill:schema=<database>;zk=<zkhost>:<port>>

At this point, you can use Drill to query your data source or you can discover metadata.

To exit SQLLine, issue the following command:

!quit

Query Data Sources

You can query local and distributed file systems, Hive, and HBase data sources registered with Drill. If you connected directly to a particular schema when you invoked SQLLine, you can issue SQL queries against that schema. If you did not indicate a schema when you invoked SQLLine, you can issue the USE <schema> statement to run your queries against a particular schema. After you issue the USE statement, you can use absolute notation, such as schema.table.column.

The following sections explain how to query the schemas, tables, and files available in your environment:

Querying a File System 

Files and directories are like standard SQL tables to Drill. You can specify a file system "database" as a prefix in queries when you refer to objects across databases. In Drill, a file system database consists of a storage plugin name followed by an optional workspace name, for example <storage plugin>.<workspace> or hdfs.logs. 
The following example shows a query on a file system database in a Hadoop distributed file system:
SELECT * FROM hdfs.logs.`AppServerLogs/2014/Jan/01/part0001.txt`;

The default dfs storage plugin instance registered with Drill has a default workspace. If you query data in the default workspace, you do not need to include the workspace in the query. Refer to Workspaces for more information.

Drill supports the following file types:

  • Plain text files, including:
    • Comma-separated values (CSV, type: text)
    • Tab-separated values (TSV, type: text)
    • Pipe-separated values (PSV, type: text)
  • Structured data files:
    • JSON (type: json)
    • Parquet (type: parquet)

The extensions for these file types must match the configuration settings for your registered storage plugins. For example, PSV files may be defined with a .tbl extension, while CSV files are defined with a .csv extension.
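
These mappings come from the formats section of the file storage plugin configuration. A minimal sketch of such a section (the delimiters shown are the conventional ones):

"formats": {
  "psv": { "type": "text", "extensions": ["tbl"], "delimiter": "|" },
  "csv": { "type": "text", "extensions": ["csv"], "delimiter": "," },
  "tsv": { "type": "text", "extensions": ["tsv"], "delimiter": "\t" },
  "json": { "type": "json" },
  "parquet": { "type": "parquet" }
}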

Instructions for querying the different types of files are located in the corresponding sections of the Drill documentation.

The Drill installation includes some sample JSON and Parquet files that you can query.

Querying Hive Tables

This is a simple exercise that provides steps for creating a Hive table and inserting data that you can query using Drill. Before you perform the steps, download the customers.csv file.

To create a Hive table and query it with Drill, complete the following steps:

  1. Issue the following command to start the Hive shell:
    hive
  2. Issue a command like the following from the Hive shell to create a table for the CSV data (a minimal sketch, assuming customers.csv contains only the two columns queried below):
    CREATE TABLE customers(firstname STRING, lastname STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
  3. Issue the following command to load the customer data into the customers table:
    Hive> load data local inpath '/<directory path>/customers.csv' overwrite into table customers;
  4. Quit the Hive shell.
  5. Start Drill. Refer to Invoking SQLLine/Connecting To Schema for instructions.
  6. Issue the following query to Drill to get the first and last names of the first ten customers in the Hive table:
    0: jdbc:drill:schema=hiveremote> SELECT firstname,lastname FROM hiveremote.`customers` limit 10;

    The query returns the following results:

    +------------+------------+
    | firstname  |  lastname  |
    +------------+------------+
    | Essie      | Vaill      |
    | Cruz       | Roudabush  |
    | Billie     | Tinnes     |
    | Zackary    | Mockus     |
    | Rosemarie  | Fifield    |
    | Bernard    | Laboy      |
    | Sue        | Haakinson  |
    | Valerie    | Pou        |
    | Lashawn    | Hasty      |
    | Marianne   | Earman     |
    +------------+------------+
    10 rows selected (1.5 seconds)
    0: jdbc:drill:schema=hiveremote>

Querying HBase Tables

This is a simple exercise that provides steps for creating a “students” table and a “clicks” table in HBase that you can query using Drill.

To create the HBase tables and query them with Drill, complete the following steps:

  1. Issue the following commands to create a ‘students’ table and a ‘clicks’ table with column families in HBase:
    echo "create 'students','account','address'" | hbase shell
    echo "create 'clicks','clickinfo','iteminfo'" | hbase shell
  2. Issue the following command, and then enter the sample data (HBase shell commands that insert the student and click rows) to create a testdata.txt file:
    cat > testdata.txt
  3. Issue the following command to run the contents of testdata.txt through the HBase shell and load the data:
    cat testdata.txt | hbase shell
  4. Start Drill. Refer to Invoking SQLLine/Connecting To Schema for instructions.
  5. Use Drill to issue the following SQL queries on the “students” and “clicks” tables:
    1. Issue the following query to see the data in the “students” table:
      SELECT * FROM hbase.`students`;

      The query returns binary results:

      Query finished, fetching results ...
      +-------------+-------------+-------------+-------------+-------------+
      | id          | name        | state       | street      | zipcode     |
      +-------------+-------------+-------------+-------------+-------------+
      | [B@1ee37126 | [B@661985a1 | [B@15944165 | [B@385158f4 | [B@3e08d131 |
      | [B@64a7180e | [B@161c72c2 | [B@25b229e5 | [B@53dc8cb8 | [B@1d11c878 |
      | [B@349aaf0b | [B@175a1628 | [B@1b64a812 | [B@6d5643ca | [B@147db06f |
      | [B@3a7cbada | [B@52cf5c35 | [B@2baec60c | [B@5f4c543b | [B@2ec515d6 |
      +-------------+-------------+-------------+-------------+-------------+
      Note: Since Drill does not require metadata, you must use the SQL CAST function in some queries to get readable query results.

    2. Issue the following query, which includes the CAST function, to see the data in the “students” table:

      SELECT CAST(students.row_key as VarChar(20)) AS studentid, CAST(students.account.name as VarChar(20)) AS name, CAST(students.address.state as VarChar(20)) AS state, CAST(students.address.street as VarChar(20)) AS street, CAST(students.address.zipcode as VarChar(20)) AS zipcode FROM hbase.`students`;

      Note: Use the following format when you query a column in an HBase table:
      tablename.columnfamilyname.columnname
      For more information about column families, refer to 5.6. Column Family.

      The query returns the data:

      Query finished, fetching results ...
      +----------+-------+-------+------------------+---------+
      | studentid | name  | state | street           | zipcode |
      +----------+-------+-------+------------------+---------+
      | student1 | Alice | CA    | 123 Ballmer Av   | 12345   |
      | student2 | Bob   | CA    | 1 Infinite Loop  | 12345   |
      | student3 | Frank | CA    | 435 Walker Ct    | 12345   |
      | student4 | Mary  | CA    | 56 Southern Pkwy | 12345   |
      +----------+-------+-------+------------------+---------+


    3. Issue the following query on the “clicks” table to find out which students clicked on google.com:
      SELECT CAST(clicks.row_key as VarChar(20)) AS clickid, CAST(clicks.clickinfo.studentid as VarChar(20)) AS studentid, CAST(clicks.clickinfo.`time` as VarChar(40)) AS `time`, CAST(clicks.clickinfo.url as VarChar(40)) AS url, CAST(clicks.iteminfo.itemtype as VarChar(20)) AS itemtype, CAST(clicks.iteminfo.quantity as VarChar(20)) AS quantity FROM hbase.`clicks` WHERE clicks.clickinfo.url LIKE '%google%';

      The query returns the data:

      Query finished, fetching results ...
      +---------+-----------+-------------------------------+-----------------------+----------+----------+
      | clickid | studentid | time                          | url                   | itemtype | quantity |
      +---------+-----------+-------------------------------+-----------------------+----------+----------+
      | click1  | student1  | 2014-01-01 12:01:01.000100000 | http://www.google.com | image    | 1        |
      | click3  | student2  | 2014-01-01 01:02:01.000100000 | http://www.google.com | text     | 2        |
      | click6  | student3  | 2013-02-01 12:01:01.000100000 | http://www.google.com | image    | 1        |
      +---------+-----------+-------------------------------+-----------------------+----------+----------+

Cross Data Source Query

Because Drill metadata is decentralized, you can query across files, HBase, and Hive in a single query; Drill does not require a central metadata definition anywhere.
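
For example, a single query can join a Hive table with a JSON file. A hypothetical sketch, in which the orders.json file and its custname and amount columns are assumptions for illustration:

-- orders.json and its columns are hypothetical; hive.customers is the table
-- created in the Querying Hive Tables exercise.
SELECT c.firstname, c.lastname, o.amount
FROM hive.customers c
JOIN dfs.`/users/max/drill/json/orders.json` o
ON o.custname = c.lastname;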
