Note

This information is relevant only for Sqoop releases before 1.99.4.

This page describes the current design of the Sqoop 2 metadata structures. It is divided into several sections, each describing a different layer, to help readers understand the design.

Top level structures

There are currently four top-level structures:

Connector

Connectors drive the entire data movement in Sqoop 2. Multiple connectors will be available in the system, as each data store might have its own connector (one specific to Oracle, another to MySQL). Some data stores might have several connectors for different purposes: for example, there might be a MySQL JDBC connector that uses Java's JDBC interface and a MySQL Fastpath connector that uses MySQL's native utilities (mysqldump and mysqlimport).

Because different connectors might have different configuration needs, we decided that, rather than embedding all configuration directives in the Sqoop 2 framework, each connector will supply the metadata structures that must be filled in for it to do its job. Each connector will supply two sets of metadata structures: one for the connection object and one for the job object (both explained below).

Framework

Just as the connector structure contains the metadata required to perform connector-specific actions, the Sqoop 2 framework also requires some extra configuration for each connection and job. The main difference between the connector and framework structures is that there will be multiple connectors, whereas there will always be exactly one framework structure.

Connection

A connection object contains the metadata needed by both the connector and the framework to manage a connection to a remote data store. This connection is not related in any way to the java.sql.Connection object. Because each connector might have different needs, each connection depends directly on the connector for which it was created. Connection objects will be created by administrators and saved in the Sqoop 2 metastore. They will later be reused by operators and by job objects (explained below).

Job

A job object depends directly on a connection and holds the configuration for a specific job from both the connector and the framework perspective. For example, it records whether to run an import or an export, and where on HDFS to store the data. Job objects will be filled in by operators. The job itself will be executable.

Corresponding classes

This description covers all related classes in a bottom-up fashion to give the reader a better understanding of the structures.

MInput

Corresponds to one configuration entity that is requested by a connector (for example "JDBC URL" or "Username").

MForm

Related MInputs are gathered together to create a set of connected options (for example, the MForm "Connection to database" would consist of the MInputs "JDBC URL", "Username" and "Password").

MJobForms

Represents the list of forms that are required for a job.

MConnectionForms

Represents the list of forms that are required for a connection.

MConnector

Top-level structure that contains one instance of MJobForms and one instance of MConnectionForms, specifying which metadata one particular connector needs. All MForms are blank and do not contain any configuration values. They serve only as a template that the connector supplies to the framework in order to obtain the required configuration options.

MFramework

Top-level structure that contains one instance of MJobForms and one instance of MConnectionForms, specifying which metadata is required from the framework's perspective. There is a single instance of this class across the whole of Sqoop 2.

MConnection

Top-level structure that contains one instance of MConnectionForms for one corresponding connector. The forms will contain filled-in values.

MJob

Top-level structure that contains one instance of MJobForms for one corresponding connector. The forms will contain filled-in values.

Class relationship
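The relationships among the classes described above may be easier to follow in code, so here is a minimal Java sketch of how they nest. It is an illustration under stated assumptions, not the real Sqoop 2 classes: the fields, the constructors and the FormList helper base class are hypothetical, and the split of a connection or job into a connector part and a framework part follows the example given on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Simplified sketch of the metadata hierarchy; not the actual Sqoop 2 classes. */
final class MetadataSketch {

    /** One configuration entity such as "JDBC URL"; the value is null in a blank template. */
    static final class MInput {
        final String name;
        final String value;
        MInput(String name, String value) { this.name = name; this.value = value; }
    }

    /** A named group of related inputs. */
    static final class MForm {
        final String name;
        final List<MInput> inputs;
        MForm(String name, MInput... inputs) {
            this.name = name;
            this.inputs = Collections.unmodifiableList(new ArrayList<>(Arrays.asList(inputs)));
        }
    }

    /** Hypothetical shared base class: an ordered, read-only list of forms. */
    static abstract class FormList {
        final List<MForm> forms;
        FormList(MForm... forms) {
            this.forms = Collections.unmodifiableList(new ArrayList<>(Arrays.asList(forms)));
        }
    }

    /** Forms required for a connection. */
    static final class MConnectionForms extends FormList {
        MConnectionForms(MForm... forms) { super(forms); }
    }

    /** Forms required for a job. */
    static final class MJobForms extends FormList {
        MJobForms(MForm... forms) { super(forms); }
    }

    /** Blank template supplied by one connector. */
    static final class MConnector {
        final String name;
        final MConnectionForms connectionForms;
        final MJobForms jobForms;
        MConnector(String name, MConnectionForms connectionForms, MJobForms jobForms) {
            this.name = name;
            this.connectionForms = connectionForms;
            this.jobForms = jobForms;
        }
    }

    /** Blank template for the framework; only one instance is expected to exist. */
    static final class MFramework {
        final MConnectionForms connectionForms;
        final MJobForms jobForms;
        MFramework(MConnectionForms connectionForms, MJobForms jobForms) {
            this.connectionForms = connectionForms;
            this.jobForms = jobForms;
        }
    }

    /** Filled connection forms; depends on the connector that supplied the template. */
    static final class MConnection {
        final MConnector connector;
        final MConnectionForms connectorPart;
        final MConnectionForms frameworkPart;
        MConnection(MConnector connector, MConnectionForms connectorPart,
                    MConnectionForms frameworkPart) {
            this.connector = connector;
            this.connectorPart = connectorPart;
            this.frameworkPart = frameworkPart;
        }
    }

    /** Filled job forms; depends on a connection, and through it on a connector. */
    static final class MJob {
        final MConnection connection;
        final MJobForms connectorPart;
        final MJobForms frameworkPart;
        MJob(MConnection connection, MJobForms connectorPart, MJobForms frameworkPart) {
            this.connection = connection;
            this.connectorPart = connectorPart;
            this.frameworkPart = frameworkPart;
        }
    }
}
```

In this sketch the dependency chain runs in one direction only: a job points at its connection, and a connection points at its connector, which matches the description of the top-level structures.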

 

Example

Let's assume a JDBC-based connector.

MConnector
  • For connection: Contains a list of one MForm called "Connection" containing three inputs: "JDBC URL", "Username" and "Password".
  • For job: Contains a list of one MForm called "Source" containing a single input, "Table".
MFramework
  • For connection: Contains an empty list, i.e. no values are required.
  • For job: Contains a list of one MForm called "Target" containing a single input, "HDFS Directory".
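
The two templates above can also be written down as plain data. The sketch below is a minimal illustration, not Sqoop 2 code: Form, Template, jdbcConnector() and framework() are hypothetical names, and a form is reduced to its name plus the names of its inputs.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical model of the example templates; the names are not Sqoop 2 API. */
final class TemplateExample {

    /** A blank form: its name and the names of the inputs it asks for. */
    static final class Form {
        final String name;
        final List<String> inputNames;
        Form(String name, String... inputNames) {
            this.name = name;
            this.inputNames = Collections.unmodifiableList(Arrays.asList(inputNames));
        }
    }

    /** The two form lists that a connector or the framework supplies. */
    static final class Template {
        final List<Form> connectionForms;
        final List<Form> jobForms;
        Template(List<Form> connectionForms, List<Form> jobForms) {
            this.connectionForms = connectionForms;
            this.jobForms = jobForms;
        }
    }

    /** Template of the example JDBC-based connector. */
    static Template jdbcConnector() {
        return new Template(
            Collections.singletonList(new Form("Connection", "JDBC URL", "Username", "Password")),
            Collections.singletonList(new Form("Source", "Table")));
    }

    /** Template of the framework: nothing for connections, one form for jobs. */
    static Template framework() {
        List<Form> noForms = Collections.emptyList();
        return new Template(
            noForms,
            Collections.singletonList(new Form("Target", "HDFS Directory")));
    }

    /** Renders a form list as "Form[input, input]" entries separated by "; ". */
    static String describe(List<Form> forms) {
        StringBuilder out = new StringBuilder();
        for (Form form : forms) {
            if (out.length() > 0) {
                out.append("; ");
            }
            out.append(form.name).append(form.inputNames);
        }
        return out.length() == 0 ? "(no forms)" : out.toString();
    }

    public static void main(String[] args) {
        System.out.println("connector, connection: " + describe(jdbcConnector().connectionForms));
        System.out.println("connector, job:        " + describe(jdbcConnector().jobForms));
        System.out.println("framework, connection: " + describe(framework().connectionForms));
        System.out.println("framework, job:        " + describe(framework().jobForms));
    }
}
```

Note that neither template carries any values: both only name the inputs that someone will have to fill in later.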

Administrators might create two different connections based on this example connector:

Connection 1:
  • Connector part: Contains values for the connector: "JDBC URL" contains "jdbc:mysql://development/test", "Username" contains "letest" and "Password" contains "letest".
  • Framework part: Contains values for the framework. As the framework did not specify any MForms for connections, this part is empty.
Connection 2:
  • Connector part: Contains values for the connector: "JDBC URL" contains "jdbc:mysql://production/test", "Username" contains "production-user" and "Password" contains "aosdf792r7asfhas8sd-9a7(&(&@#&$(Vosfs9fya9d7(&SD(F*&S(*F&SDF&VChsdfhsdf" (Damn good password).
  • Framework part: Contains values for the framework. As the framework did not specify any MForms for connections, this part is empty.

The operator has not yet had time to use Connection 2, as he is still experimenting with Connection 1. He has, however, already created a couple of jobs based on Connection 1:

Job 1:
  • Is based on Connection 1
  • Connector part: Contains values for the connector: "Table" contains "traffic_details".
  • Framework part: Contains values for the framework: "HDFS Directory" contains "/storage/traffic_details".
Job 2:
  • Is based on Connection 1
  • Connector part: Contains values for the connector: "Table" contains "log".
  • Framework part: Contains values for the framework: "HDFS Directory" contains "/storage/log".
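
To make the dependency chain concrete, the sketch below recreates Connection 1 and the two jobs that share it. It is only an illustration: Connection, Job and the values helper are hypothetical, filled forms are flattened to maps from input name to value, and nothing here is taken from the Sqoop 2 code base.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of the filled-in example objects; not Sqoop 2 API. */
final class FilledExample {

    /** Filled connection: values for the connector part and the framework part. */
    static final class Connection {
        final Map<String, String> connectorPart;
        final Map<String, String> frameworkPart;
        Connection(Map<String, String> connectorPart, Map<String, String> frameworkPart) {
            this.connectorPart = connectorPart;
            this.frameworkPart = frameworkPart;
        }
    }

    /** Filled job: always tied to exactly one connection. */
    static final class Job {
        final Connection connection;
        final Map<String, String> connectorPart;
        final Map<String, String> frameworkPart;
        Job(Connection connection, Map<String, String> connectorPart,
            Map<String, String> frameworkPart) {
            this.connection = connection;
            this.connectorPart = connectorPart;
            this.frameworkPart = frameworkPart;
        }
    }

    /** Builds an ordered, read-only map from alternating input names and values. */
    static Map<String, String> values(String... namesAndValues) {
        if (namesAndValues.length % 2 != 0) {
            throw new IllegalArgumentException("expected name/value pairs");
        }
        Map<String, String> map = new LinkedHashMap<>();
        for (int i = 0; i < namesAndValues.length; i += 2) {
            map.put(namesAndValues[i], namesAndValues[i + 1]);
        }
        return Collections.unmodifiableMap(map);
    }

    /** Connection 1 from the example; its framework part is empty. */
    static Connection connection1() {
        return new Connection(
            values("JDBC URL", "jdbc:mysql://development/test",
                   "Username", "letest",
                   "Password", "letest"),
            values());
    }

    /** A job on the given connection with one connector input and one framework input. */
    static Job job(Connection connection, String table, String hdfsDirectory) {
        return new Job(connection,
            values("Table", table),
            values("HDFS Directory", hdfsDirectory));
    }

    public static void main(String[] args) {
        Connection connection1 = connection1();
        Job job1 = job(connection1, "traffic_details", "/storage/traffic_details");
        Job job2 = job(connection1, "log", "/storage/log");
        // Both jobs hold a reference to the same connection instead of copying its values.
        System.out.println("shared connection: " + (job1.connection == job2.connection));
        System.out.println(job1.connectorPart.get("Table") + " -> "
            + job1.frameworkPart.get("HDFS Directory"));
        System.out.println(job2.connectorPart.get("Table") + " -> "
            + job2.frameworkPart.get("HDFS Directory"));
    }
}
```

Holding a reference to the connection, rather than a copy of its values, mirrors the idea that connections are created once by administrators and then reused by operators in their jobs.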