This document outlines the high-level ideas for the next major version of Sqoop.
Status: Proposal under discussion
JIRA issue: SQOOP-365
Sqoop is a tool for efficiently transferring bulk data between Hadoop and external structured datastores such as relational databases and enterprise data warehouses. Its implementation so far is based on a map-only job that partitions the overall dataset into slices and delegates the transfer of each slice to an individual map task. Under the covers, Sqoop uses connectors that modify the job configuration before it is submitted for execution, specifying different Input/Output formats, mappers, and so on. This ability to override job configuration and selectively extend built-in functionality allows connectors to be built and used for communicating with new systems. This design is the basis of the Sqoop 1.x system.
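To make the slicing idea concrete, here is a minimal sketch of how a numeric key range could be divided among map tasks. The class and method names (SliceSketch, partition) are hypothetical and chosen only for illustration; this is not Sqoop's actual splitting code, and it assumes a dense integer key.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of range slicing; not Sqoop's actual code. */
final class SliceSketch {

    /** A closed key range [lo, hi] that a single map task would transfer. */
    static final class Slice {
        final long lo;
        final long hi;

        Slice(long lo, long hi) {
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        public String toString() {
            return "[" + lo + ", " + hi + "]";
        }
    }

    /**
     * Splits [min, max] into at most numSlices contiguous, non-overlapping
     * ranges whose sizes differ by at most one key.
     */
    static List<Slice> partition(long min, long max, int numSlices) {
        if (numSlices < 1 || min > max) {
            throw new IllegalArgumentException("need numSlices >= 1 and min <= max");
        }
        long total = max - min + 1;               // number of keys; assumes no overflow
        long count = Math.min(numSlices, total);  // never create empty slices
        long base = total / count;
        long extra = total % count;               // the first 'extra' slices get one more key

        List<Slice> slices = new ArrayList<>();
        long lo = min;
        for (long i = 0; i < count; i++) {
            long size = base + (i < extra ? 1 : 0);
            slices.add(new Slice(lo, lo + size - 1));
            lo += size;
        }
        return slices;
    }

    public static void main(String[] args) {
        // Each printed range stands for the work of one map task.
        for (Slice s : partition(1, 10, 4)) {
            System.out.println(s);
        }
    }
}
```

With keys 1 through 10 and four slices, the sketch yields the ranges [1, 3], [4, 6], [7, 8] and [9, 10].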
With the healthy adoption of Sqoop in the enterprise, it has become clear that several areas must be improved in order to better serve data integration use-cases. These use-cases fall into two groups: those that deal with general usage of the system and those relating to the development of new connectors. This proposal summarizes design changes that will help Sqoop better cater to the data integration use-cases and become easier to manage and operate.
Ease of Use
The current version of Sqoop provides a large set of command line options which are not intuitive or easy to understand. Some options are overloaded in different contexts while others do not apply in certain contexts. Understanding these options depends on the user’s understanding of the different systems that Sqoop works with. This is not only confusing for the user but also misleading, as there is no clear way to ensure that the outcome of a user request is aligned with the user's intentions. A good example of such confusion is when a user is trying to use a specific connector but cannot easily tell whether that connector was actually used.
Ease of Extension
Sqoop connectors are the main extension points for Sqoop. The implementation of an individual connector impacts the overall functionality of Sqoop in the context of that connector. For example, some connectors may support a certain data format while others may not. This is primarily due to the tight coupling between data transfer and serialization format within the Sqoop framework. A connector that selectively overrides one part of this functionality is responsible for ensuring that the overall functionality remains intact. This puts a heavy burden on the connector developer, who may or may not understand the various niche features that Sqoop supports.
Ideally, a connector should focus only on connectivity, and all other aspects of the system, such as serialization, format conversion, and integration with Hive/HBase, should be uniformly available via the Sqoop framework. Connector developers can then build connectors without needing to understand the larger set of features and functionality offered by Sqoop.
Security and Separation of Concerns
Sqoop in its current form is command line based, and the user is expected to provide complete details relating to the job, including connection parameters that are often sensitive and require controlled access. There is also no provision for managing the total number of connections Sqoop makes, even though many different users may invoke it against the same external system.
To help alleviate these concerns, Sqoop should provide a secure mechanism by which administrators can create connection instances and specify necessary resource limits on them, such as the maximum number of connections. Operators can then use these predefined connection objects without requiring access to sensitive connection information. All operations performed by Sqoop should be logged to form an audit trail.
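As a rough sketch of the two requirements above (a connection limit and an audit trail), the following class caps the number of concurrently open connections and records every attempt. The class name, method names, and log wording are all assumptions made for illustration; the proposal does not define such an API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a connection object with a limit and an audit trail. */
final class LimitedConnectionSketch {
    private final String name;
    private final int maxConnections;
    private int open = 0;
    private final List<String> auditLog = new ArrayList<>();

    LimitedConnectionSketch(String name, int maxConnections) {
        if (maxConnections < 1) {
            throw new IllegalArgumentException("maxConnections must be at least 1");
        }
        this.name = name;
        this.maxConnections = maxConnections;
    }

    /** Reserves one connection slot if the limit allows; every attempt is audited. */
    synchronized boolean tryOpen(String user) {
        boolean granted = open < maxConnections;
        if (granted) {
            open++;
        }
        auditLog.add(user + (granted ? " opened " : " was refused ") + name);
        return granted;
    }

    /** Frees one previously reserved slot. */
    synchronized void close(String user) {
        if (open == 0) {
            throw new IllegalStateException("no open connection to close");
        }
        open--;
        auditLog.add(user + " closed " + name);
    }

    /** Returns a copy so that callers cannot alter the audit trail. */
    synchronized List<String> auditLog() {
        return new ArrayList<>(auditLog);
    }

    public static void main(String[] args) {
        LimitedConnectionSketch warehouse = new LimitedConnectionSketch("warehouse", 2);
        warehouse.tryOpen("alice");
        warehouse.tryOpen("bob");
        warehouse.tryOpen("carol"); // refused: the limit of 2 is reached
        warehouse.close("alice");
        for (String entry : warehouse.auditLog()) {
            System.out.println(entry);
        }
    }
}
```

A real implementation would need to enforce the limit across processes and persist the log; the in-memory counter here only illustrates the intended behavior.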
High Level Architecture
The following are the highlights of the architecture proposed for the next major revision of Sqoop.
Introducing a Reduce Phase
As stated above, the 1.x version of Sqoop uses map-only jobs where the mappers are responsible for both transporting data and transforming its format. This is limiting as not every connector is equally capable. This can be mitigated by introducing a reduce phase in Sqoop jobs which is solely responsible for data transformation as per user specification. This has many advantages:
- The connectors are only responsible for transporting data to and from a standard format. This makes the connector implementation simpler.
- Any Sqoop functionality such as Hive/HBase integration, data formats, etc., can be handled in the reduce phase, thereby making it available to all connectors uniformly. This will also provide a better user experience.
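The division of labor described above might look roughly like the sketch below, written in plain Java without Hadoop dependencies. The Connector and Formatter interfaces and the list-of-fields intermediate format are assumptions for illustration only; the proposal does not specify the standard format or any interface names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch: connectors transport, the framework transforms. */
final class ReducePhaseSketch {

    /** Map side: a connector only moves rows into a shared intermediate form. */
    interface Connector {
        List<List<Object>> extract();
    }

    /** Reduce side: the framework owns every output format. */
    interface Formatter {
        String format(List<Object> row);
    }

    static final Formatter CSV = row ->
            row.stream().map(o -> String.valueOf(o)).collect(Collectors.joining(","));

    static final Formatter TSV = row ->
            row.stream().map(o -> String.valueOf(o)).collect(Collectors.joining("\t"));

    /** Any connector can be combined with any formatter. */
    static List<String> run(Connector connector, Formatter formatter) {
        List<String> out = new ArrayList<>();
        for (List<Object> row : connector.extract()) {
            out.add(formatter.format(row));
        }
        return out;
    }

    public static void main(String[] args) {
        // A stand-in connector that returns two fixed rows.
        Connector demo = () -> Arrays.asList(
                Arrays.<Object>asList(1, "alice"),
                Arrays.<Object>asList(2, "bob"));
        System.out.println(run(demo, CSV));
        System.out.println(run(demo, TSV));
    }
}
```

Because the formatter is chosen independently of the connector, adding a new output format in this sketch requires no change to any connector, which is the uniformity the reduce phase is meant to provide.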
Adding an Interactive Web-based User Interface
Sqoop today provides cryptic and contextual command line arguments that are not easy to understand or utilize. Different connectors interpret these options differently. Moreover, some options that one connector accepts for an operation are not understood by another connector for the same operation, while some connectors have custom options that do not apply to others. This is very confusing for users and detrimental to effective use and adoption. We can remedy this by introducing an interactive user interface that is accessible through the web and the command line. Here is how it will work:
- Sqoop becomes a web-based application that exposes a simple user interface.
- Using this interface, a user can walk through an import/export setup, following UI cues and making choices that eliminate redundant options.
- Various connectors are installed in the application in one central place, and users are not tasked with installing or configuring connectors in their own sandbox. These connectors expose their necessary options to the Sqoop framework, which then translates them into the interactive user interface.
- The user interface is built on an underlying REST API, which can itself be used by a command line client that exposes similar functionality.
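One way to picture connectors exposing their options to the framework is sketched below: a connector declares the inputs it needs, and the framework renders a form and checks the user's answers. All names here (Option, Connector, DemoConnector, and the option names host, table, fetchSize) are invented for illustration and are not part of the proposal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of connector-declared options driving a user interface. */
final class ConnectorOptionsSketch {

    /** One input a connector needs from the user. */
    static final class Option {
        final String name;
        final String label;
        final boolean required;

        Option(String name, String label, boolean required) {
            this.name = name;
            this.label = label;
            this.required = required;
        }
    }

    /** Connectors describe their options instead of parsing arguments themselves. */
    interface Connector {
        String name();
        List<Option> options();
    }

    /** Framework side: renders a plain-text form; a web UI could use the same data. */
    static String renderForm(Connector connector) {
        StringBuilder form = new StringBuilder(connector.name()).append('\n');
        for (Option o : connector.options()) {
            form.append(o.label)
                .append(o.required ? " (required)" : " (optional)")
                .append(": ____\n");
        }
        return form.toString();
    }

    /** Framework side: names of required options the user has not filled in. */
    static List<String> missingRequired(Connector connector, Map<String, String> input) {
        List<String> missing = new ArrayList<>();
        for (Option o : connector.options()) {
            String value = input.get(o.name);
            if (o.required && (value == null || value.isEmpty())) {
                missing.add(o.name);
            }
        }
        return missing;
    }

    /** A made-up connector used only for this demo. */
    static final class DemoConnector implements Connector {
        public String name() {
            return "demo-connector";
        }

        public List<Option> options() {
            return Arrays.asList(
                    new Option("host", "Host", true),
                    new Option("table", "Table", true),
                    new Option("fetchSize", "Fetch size", false));
        }
    }

    public static void main(String[] args) {
        Connector connector = new DemoConnector();
        System.out.print(renderForm(connector));

        Map<String, String> input = new HashMap<>();
        input.put("table", "orders");
        System.out.println("Missing: " + missingRequired(connector, input));
    }
}
```

Since the form is generated from the declared options, the user only sees inputs that apply to the chosen connector.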
Introduce Connections as First Class Objects
When using Sqoop today, the user is expected to provide all the connection details, often including the username and password for connecting to the database. Many users have cited security concerns with this approach and have asked for a tighter implementation that can prevent unauthorized access to enterprise resources. To address this, we will introduce Connections as First-Class objects. This will benefit the user in the following ways:
- Connections could be created once and used many times for various import/export Jobs.
- Connections will be defined based on the underlying connector, which insulates the end user from having to worry about how to select the right connector.
Introduce Administrator and Operator Roles
With Connections as First-Class objects, the introduction of Admin and Operator roles would help restrict create/edit access for Connections to Admins only, whereas Operators could use (execute) these objects. This model will allow integration with platform security and thus provide a simple, intuitive implementation for the end users.
Rationale for Architecture
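The role model can be sketched as a small registry in which only Admins may create Connections and Operators refer to them by name. The Role enum, the registry class, and the job description string are assumptions for illustration; the proposal leaves the actual security integration open.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of Admin/Operator access to first-class Connections. */
final class ConnectionRegistrySketch {

    enum Role { ADMIN, OPERATOR }

    /** Connection details are held by the registry, not by the operator. */
    static final class Connection {
        final String connector;
        private final String url;
        private final String password;

        Connection(String connector, String url, String password) {
            this.connector = connector;
            this.url = url;
            this.password = password;
        }
    }

    private final Map<String, Connection> connections = new HashMap<>();

    /** Only Admins may create or replace a Connection. */
    void create(Role role, String name, Connection connection) {
        if (role != Role.ADMIN) {
            throw new SecurityException("only admins may create connections");
        }
        connections.put(name, connection);
    }

    /**
     * Admins and Operators may run a job against a named Connection. The
     * returned description never includes the URL or the password.
     */
    String runImport(Role role, String connectionName, String table) {
        Connection c = connections.get(connectionName);
        if (c == null) {
            throw new IllegalArgumentException("unknown connection: " + connectionName);
        }
        // A real system would open the connection here using c.url and c.password.
        return "import " + table + " via " + c.connector + " as " + role;
    }

    public static void main(String[] args) {
        ConnectionRegistrySketch registry = new ConnectionRegistrySketch();
        registry.create(Role.ADMIN, "warehouse",
                new Connection("demo-connector", "db://example", "secret"));
        System.out.println(registry.runImport(Role.OPERATOR, "warehouse", "orders"));
    }
}
```

The same named Connection can be reused by many jobs, and the choice of connector travels with the Connection rather than being made by the operator.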
- With an interactive user interface, the user only needs to provide the information relevant to their use-case. This will greatly improve the user experience and allow users to use Sqoop effectively with minimal understanding of its details.
- Having a web-application run Sqoop allows Sqoop to be installed once and used from anywhere. Installation of connectors will no longer be done on the client machine. Users will be able to operate Sqoop from a remote host using a web browser or the Sqoop command line client.
- Having a REST API for operation and management will help Sqoop integrate better with external systems such as Apache Oozie.
- Introducing a reduce phase allows connectors to focus only on connectivity and ensures that Sqoop functionality is uniformly available to all connectors. This makes connectors easier to develop.