JIRA : SQOOP-1938
This document describes how the Sqoop MR Execution Engine works, its major components, and the internals of the implementation.
The SubmissionEngine uses the concrete APIs of YARN, Mesos, or Oozie, which handle resource management for job execution, and submits the job to the execution engine. The ExecutionEngine is the actual job executor; it uses the APIs of Apache Hadoop MR or Apache Spark to execute the Sqoop job.
JobManager does the following 3 things:
- Prepares the JobRequest object for the ExecutionEngine
- Submits the job via the SubmissionEngine submit API and waits for the submission engine to return
- Based on the result of the submit API, creates and saves the job submission record in its repository (Derby or Postgres, depending on the configured store) to keep the history across multiple job runs
- Has a handle to the concrete execution engine which is the org.apache.hadoop.mapred.JobClient in our case
- Initialize API to set up the submission engine
- Submit API returns a boolean for success or failure of the submission. It is blocking when using the hadoopLocalRunner and asynchronous when non-local. In the asynchronous case, the update API is used subsequently to track the progress of the job submission
- Update API can be invoked to query the status of the running job and update the Job submission record that holds the history information of a sqoop job across multiple runs
- Stop API to abort a running job
- Destroy API to mirror the initialize to clean up the submission engine on exit
- Has a handle to the JobRequest object populated by the JobManager
- PrepareJob API to set up the necessary information required by the org.apache.hadoop.mapred.JobClient in our case
NOTE: The ExecutionEngine API is very bare bones; most of the job execution, failure, and exception handling for the MR engine happens inside the MRSubmissionEngine.
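The division of work described above (the JobManager prepares a request, submits it, and records the outcome) can be illustrated with a minimal sketch. All type names, method signatures, and status strings below are hypothetical simplifications for illustration; the real Sqoop classes take richer arguments such as the JobRequest and the submission record.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified contract; the real Sqoop classes have richer signatures.
interface SubmissionEngineSketch {
    void initialize();                  // set up the engine
    boolean submit(String jobRequest);  // true if the submission succeeded
    String update(String jobName);      // query the current status of a job
    void stop(String jobName);          // abort a running job
    void destroy();                     // mirror of initialize(), clean up on exit
}

// Toy engine that behaves like a blocking local runner: submit() returns
// only once the "job" is done, so no later update() call is required.
final class BlockingEngineSketch implements SubmissionEngineSketch {
    private boolean ready;
    private final List<String> finished = new ArrayList<>();

    public void initialize() { ready = true; }

    public boolean submit(String jobRequest) {
        if (!ready || jobRequest.isEmpty()) {
            return false;               // submission failed
        }
        finished.add(jobRequest);
        return true;
    }

    public String update(String jobName) {
        return finished.contains(jobName) ? "SUCCEEDED" : "UNKNOWN";
    }

    public void stop(String jobName) { finished.remove(jobName); }

    public void destroy() { ready = false; finished.clear(); }
}

// Stand-in for the three JobManager steps: prepare, submit, record.
final class JobManagerSketch {
    private final SubmissionEngineSketch engine;
    // Stands in for the Derby/Postgres repository.
    private final List<String> repository = new ArrayList<>();

    JobManagerSketch(SubmissionEngineSketch engine) { this.engine = engine; }

    String start(String jobName) {
        String jobRequest = jobName;                    // 1. prepare the request
        boolean submitted = engine.submit(jobRequest);  // 2. submit and wait for the engine to return
        String record = jobName + ":" + (submitted ? engine.update(jobName) : "FAILED");
        repository.add(record);                         // 3. save the submission record
        return record;
    }

    List<String> history() { return repository; }
}
```

In this sketch a failed submission still produces a record, so the history covers every attempted run. An asynchronous engine would return from submit() immediately and rely on later update() calls to refresh the stored record.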
Components of Sqoop using MR
In Sqoop we want to read records from FROM and write them to TO, and we want to do this in parallel, so we use the MR engine. We spawn as many map tasks as indicated by numExtractors (a job config) and as many reduce tasks as indicated by numLoaders (a job config). This way we can read records, messages, or rows in parallel and write them in parallel.
By default a Sqoop job is a map-only job. It does not use the reducer unless the number of loaders is configured:
|# Extractors||# Loaders||Outcome|
|Default||Default||Map only job with 10 map tasks|
|Number X||Default||Map only job with X map tasks|
|Number X||Number Y||Map-reduce job with X map tasks and Y reduce tasks|
|Default||Number Y||Map-reduce job with 10 map tasks and Y reduce tasks|
The purpose has been to let the user throttle the number of loaders and extractors independently (e.g. have a different number of loaders than extractors), and to have default values that won't run the reduce phase unless necessary.
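The table above can be read as a small decision rule. The following sketch encodes it under the assumption that an unset value is represented by null; the JobShape class and its methods are hypothetical and exist only for illustration, not in Sqoop itself.

```java
// Hypothetical helper mirroring the table; not part of Sqoop's API.
final class JobShape {
    // Default number of map tasks, as stated in the table.
    static final int DEFAULT_EXTRACTORS = 10;

    final int mapTasks;
    final int reduceTasks; // 0 means a map-only job

    private JobShape(int mapTasks, int reduceTasks) {
        this.mapTasks = mapTasks;
        this.reduceTasks = reduceTasks;
    }

    // A null argument stands for "Default", i.e. the user did not set the value.
    static JobShape of(Integer numExtractors, Integer numLoaders) {
        int maps = (numExtractors != null) ? numExtractors : DEFAULT_EXTRACTORS;
        int reduces = (numLoaders != null) ? numLoaders : 0;
        return new JobShape(maps, reduces);
    }

    boolean isMapOnly() {
        return reduceTasks == 0;
    }

    String describe() {
        return isMapOnly()
            ? "Map only job with " + mapTasks + " map tasks"
            : "Map-reduce job with " + mapTasks + " map tasks and "
                + reduceTasks + " reduce tasks";
    }
}
```

For example, `JobShape.of(null, 2).describe()` corresponds to the last row of the table: the extractor count falls back to the default while the explicit loader count switches on the reduce phase.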
Passing data into the sqoop job ( via the mapper)
Various information required by the Extractor and Loader, such as the job configs, driver configs, the schema of the data read, and the schema of the data written, has to be passed via the SqoopMapper. It is currently passed securely via the credential store or via the configuration.
- Creates the ExtractorContext from the data stored in the configuration and credential store to pass to the connector's extract API
- Creates the SqoopSplit that holds the partition information for the data to be extracted
- Post extract call, records the Hadoop counters related to Extraction logic
- Passing data out of Mapper : DistributedCache can be used if we need to write any information from the extractor back to the sqoop repository
- Having a Writable class is required by Hadoop framework - we are using the current one as a wrapper for IntermediateDataFormat. Read more on IDF here
- We're not using a concrete implementation such as Text, so that we don't have to convert all records to String to transfer data between mappers and reducers.
- SqoopWritable delegates a lot of its functionality to the IntermediateDataFormat implementation used in the Sqoop job. For instance, the compareTo method on the Writable can use any custom logic of the underlying IDF to sort the extracted records, which are eventually written in the Load phase
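The delegation described above can be sketched without any Hadoop dependency. The example uses java.io.DataOutput and java.io.DataInput, which are the types that Hadoop's Writable contract is built on; every class and interface name below is a hypothetical stand-in, not the actual SqoopWritable or IntermediateDataFormat.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for an intermediate data format; not Sqoop's actual interface.
interface RecordFormatSketch extends Comparable<RecordFormatSketch> {
    void write(DataOutput out) throws IOException;
    void read(DataInput in) throws IOException;
    String text();
}

// Toy format that keeps each record as one line of text.
final class TextFormatSketch implements RecordFormatSketch {
    private String text;

    TextFormatSketch(String text) { this.text = text; }

    public void write(DataOutput out) throws IOException { out.writeUTF(text); }
    public void read(DataInput in) throws IOException { text = in.readUTF(); }
    public String text() { return text; }

    // Format-specific ordering; a real format could compare typed fields instead.
    public int compareTo(RecordFormatSketch other) { return text.compareTo(other.text()); }
}

// Wrapper in the spirit of SqoopWritable: serialization and comparison
// are both delegated to the wrapped format.
final class WritableSketch implements Comparable<WritableSketch> {
    private final RecordFormatSketch format;

    WritableSketch(RecordFormatSketch format) { this.format = format; }

    public void write(DataOutput out) throws IOException { format.write(out); }
    public void readFields(DataInput in) throws IOException { format.read(in); }
    public int compareTo(WritableSketch other) { return format.compareTo(other.format); }
    String text() { return format.text(); }

    // Serializes this record and reads it back into 'target', similar to a
    // record travelling from a mapper to a reducer.
    void copyTo(WritableSketch target) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            write(new DataOutputStream(bytes));
            target.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the wrapper never inspects the record itself, swapping in a different format changes both the wire representation and the sort order without touching the wrapper.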
- An InputSplit describes a unit of work that comprises a single map task in a MapReduce program; SqoopSplit extends InputSplit
- Instantiates a custom Partition class to be used for splitting the input, which in our case is the data extracted in the extract phase
- Delegates to the Partition object to read and write data
- It is the Key to the SqoopInputFormat
InputFormat defines how the data in FROM is split up and read. It also provides a factory for RecordReader objects that read the file.
SqoopInputFormat is a custom implementation of the MR InputFormat class, and SqoopRecordReader is the custom implementation of the RecordReader. The InputSplit defines a slice of work but does not describe how to access it. The SqoopRecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The SqoopRecordReader is invoked repeatedly on the input until the entire SqoopSplit has been consumed. Each invocation of the SqoopRecordReader leads to another call to the run() method of the Mapper.
The (key, value) pairs provided by the mapper are passed on to the Loader for the TO part. The way they are written is governed by the OutputFormat. SqoopNullOutputFormat extends the OutputFormat class. Hadoop's NullOutputFormat generates no output files on HDFS, since HDFS may not always be the destination. In our case too HDFS is not always the destination, so we use SqoopNullOutputFormat, a custom class that delegates writing to the Loader specified in the Sqoop job. It relies on the SqoopOutputFormatLoadExecutor to pass the data to the Loader via the SqoopRecordWriter. Much like how the SqoopInputFormat reads individual records through the SqoopRecordReader implementation, the SqoopNullOutputFormat class is a factory for SqoopRecordWriter objects; these are used to write the individual records to the final destination (in our case the TO part of the Sqoop job). Notice that the key to the SqoopNullOutputFormat is actually the SqoopWritable that the SqoopRecordWriter uses.
SqoopDestroyerOutputCommitter is a custom OutputCommitter that provides hooks for final cleanup or, in some cases, for one-time operations we want to invoke when the Sqoop job finishes, i.e. either fails or succeeds.
Extends the Reducer API and at this point only runs the progressService. It is invoked only when the numLoaders driver config is > 0. Its primary use case is throttling.
Why do we have the ability to run a reduce phase, and why is it part of throttling?
The original idea was that you want to throttle the “From” and “To” sides independently. For example, if I’m exporting data from HBase to a relational database, I might want to have one extractor (=mapper) per HBase region, but the number of regions will very likely be more than the number of pumping transactions that I want to have on my database, so I might want to specify a different number of loaders to throttle that down. But having a reduce phase means serializing all data and transferring it across the network, so we do not run the reduce phase unless the user explicitly sets a different number of loaders than extractors.
- The LoaderContext is set up in the ConsumerThread.run(..) method.
- Loader's load method is invoked passing the SqoopOutputFormatDataReader and the LoaderContext
- The load method invokes the SqoopOutputFormatDataReader to read the records from the SqoopRecordWriter associated with the SqoopNullOutputFormat
A ConsumerThread is used to parallelize the extraction and loading process, in addition to parallelizing the extract-only part using the configured numExtractors. More details are explained in SQOOP-1938.
TL;DR: Parallelize reads and writes rather than have them be sequential.
Most of the threading magic is there for a pretty simple reason: each mapper does I/O in 2 places. One is the write to HDFS, the other is the read from the DB (that was the case at the time; extended to the new from/to architecture, you would still have 2 I/O paths). With linear read-write code, you are essentially not reading anything while the write is happening, which is pretty inefficient. You could easily read while the write is happening by parallelizing the reads and writes, which is what is being done. In addition, the output format does some additional processing and handling, which can cost time and CPU; during that time you could instead be reading from the DB.
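The idea of overlapping reads and writes can be sketched with a plain producer and consumer. This is only an illustration under stated assumptions: it uses a bounded java.util.concurrent.BlockingQueue as the hand-off, which is not necessarily the mechanism Sqoop's ConsumerThread uses, and the class name, queue capacity, and end-of-data marker are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: the calling thread plays the extractor (reads) while a
// separate consumer thread plays the loader (writes).
final class ParallelTransferSketch {
    // Unique end-of-data marker, compared by identity rather than by value.
    private static final String END = new String("<end>");

    static List<String> transfer(List<String> source) {
        // Small bounded buffer: the reader can run ahead of the writer,
        // but only by a few records.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(4);
        List<String> destination = new ArrayList<>();

        Thread consumer = new Thread(() -> {
            try {
                // "Load" side: write records as they arrive.
                for (String r = queue.take(); r != END; r = queue.take()) {
                    destination.add(r);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "consumer");
        consumer.start();

        try {
            // "Extract" side: keep reading while the consumer is writing.
            for (String record : source) {
                queue.put(record);
            }
            queue.put(END);
            consumer.join(); // join() makes the consumer's writes visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("transfer interrupted", e);
        }
        return destination;
    }
}
```

The bounded queue also gives back-pressure: if the write side is slow, put() blocks and the read side pauses instead of buffering an unbounded number of records in memory.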
Few related tickets proposed for enhancement