...

Introduction

Gobblin as it exists today is a framework that can be used to build different data integration applications such as ingest and replication. Each of these applications is typically configured as a separate job and executed through a scheduler like Azkaban. This document outlines an approach to evolve Gobblin from a framework to a service. Specifically, our goal is to enable self-service so that different users can automatically provision and execute various supported Gobblin applications, limiting the need for development and operations teams to be involved during the provisioning process.

...

To provision these pipelines, the sequence of events generally looks like this:

  1. Engage various stakeholders to discuss the use-case, assess feasibility, and understand the process and associated timelines (~1 week)

  2. Internalize the different configuration options and learn how to construct different Gobblin flows. Write one job config for each of the tasks outlined above (~1 week)

  3. Review the work and set up a dark launch to test the system. Iterate on this process until everything works as expected (~1 week)

  4. Work towards productionizing the flows (~1 week)

  5. Set up monitoring for validation purposes (~1 week)

...

With a specification that contains these details, GaaS can easily execute the user’s request by leveraging just the existing technologies.

  

...

Problem Statement

The requirements for a successful deployment of Gobblin as a Service (GaaS) are:

...

  • User Request: This is a request to manage (create, delete, update) a logical Flow that can take the form of: “Ingest Salesforce data to NoSQL DB and then load the data onto HDFS”. 
    It can be represented as a triplet:

...

  • <Flow Config, Source, List<Destination>>

  • Spec Executor: A Spec Executor is a data processing pipeline (typically a Gobblin application, such as a Gobblin Standalone Cluster running in Y-Network-Fabric or a Gobblin job running on Azkaban in X-Network-Fabric) that can process physical Gobblin Jobs. A Spec Executor typically has associated capabilities (i.e., supported sources and sinks), which GaaS can use to determine whether a Gobblin Job can run on it.
    It can be represented as a triplet <Technology, Location, Communication Mechanism>. In our example above, one Spec Executor is <Gobblin-DB-Ingest, X-Network-Fabric, REST> while another is <Gobblin Standalone Cluster, Y-Network-Fabric, Kafka>. (A Spec here is either a Flow or a Job that can be executed on an executor instance.)

  • Gobblin Job: This can be thought of as all the configuration information required to actually execute a physical flow (also called a job) that ingests, manipulates, and moves data. In our example above, a DistcpNg job executing on Hadoop-1 that copies data between Hadoop-1 and Hadoop-2 is an example of a Gobblin Job.
    It can be represented as: Pair <Spec Executor, Job Specification <Job Config, Job Template>>

Summarizing: given these definitions, GaaS takes in a User Request (a logical flow), converts it into a series of Gobblin Jobs, and coordinates and monitors the execution of these Gobblin Jobs in a highly distributed manner. This entails several components and modules that are discussed in detail in the rest of this document.
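To make these abstractions concrete, the triplets above can be sketched as plain data holders. The following Java sketch uses hypothetical class and field names purely for illustration; it is not Gobblin's actual data model:

import java.util.List;
import java.util.Properties;

/** User Request: <Flow Config, Source, List<Destination>> */
class FlowSpec {
  Properties flowConfig;        // flow-level configuration
  String source;                // e.g. "Salesforce"
  List<String> destinations;    // e.g. "NoSQL DB", "HDFS"
}

/** Spec Executor: <Technology, Location, Communication Mechanism> */
class SpecExecutor {
  String technology;            // e.g. "Gobblin Standalone Cluster"
  String location;              // e.g. "Y-Network-Fabric"
  String communication;         // e.g. "Kafka" or "REST"
  List<String> capabilities;    // supported source/sink pairs, used to place jobs
}

/** Gobblin Job: Pair<Spec Executor, Job Specification<Job Config, Job Template>> */
class JobSpec {
  Properties jobConfig;         // job-level configuration
  String jobTemplate;           // e.g. "FS:///mytemplate.template"
}

class GobblinJob {
  SpecExecutor executor;        // where the physical job runs
  JobSpec jobSpec;              // what the physical job does
}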

  

...

Existing Solutions

At present, each Gobblin deployment runs independently and is not controlled or monitored by any central system.

Gobblin-as-a-Service therefore intends to be the central orchestration and provisioning system that can manage jobs for all data integration tasks across any of the Gobblin deployment modes.

Relevant Gobblin deployment modes are listed in the Appendix.

 

...


...

Design

Note: Different text backgrounds indicate sub-tasks that can be designed and developed in parallel

Gobblin as a Service (GaaS) Overall Architecture

...

In this section, we will zoom into a few major modules that the backend server on each machine/node will run. A comprehensive listing of all modules and their functionalities can be found in the Appendix under the sub-sections:

  • Logical Entities

  • Key Components Overview

  • Key Components Description 

...

Note: The Topology module will also eventually expose REST APIs for managing topologies, but that is not covered in this document at the moment.

Technology: REST APIs will be based on RestLi and will adhere to its standards.

...

The field flowName contains a flow name generated by the client that should be unique within the flow group specified by flowGroup. <flowGroup, flowName> is the composite key that identifies a FlowConfig. The schedule field contains the flow schedule in the unix cron format; this field can be empty for a run-once job. The templateNames field identifies the templates that the flow properties are overlaid on top of; it should contain URIs that reference templates in the flow catalog, such as FS:///usecase/templates/template1.template. The properties field is a map of string keys to string values. Most of the flow-specific key-value pairs are specified in properties to keep the specification generic, extensible, and in sync with the Job specification within Gobblin.
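For illustration, a complete FlowConfig might look like the following JSON document (all values here are hypothetical, and the property keys are placeholders whose actual names depend on the templates in use; the schedule shows unix cron syntax for a daily run at noon):

{
  "flowName" : "salesforce-to-hdfs",
  "flowGroup" : "crm-ingest",
  "schedule" : "0 12 * * *",
  "templateNames" : "FS:///usecase/templates/template1.template",
  "properties" : {
    "source.name" : "Salesforce",
    "destination.name" : "HDFS"
  }
}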

...

POST /flowconfigs

Curli example:

curli http://localhost:8080/flowconfigs -X POST -H 'X-RestLi-Method: create' -H 'X-RestLi-Protocol-Version: 2.0.0' --data '{"flowName" : "myflow1", "flowGroup" : "mygroup", "templateNames" : "FS:///mytemplate.template", "schedule" : "", "properties" : {"prop1" : "value1"}}'

...

PUT /flowconfigs/(flowGroup:<groupName>,flowName:<flowName>)

Curli example:

curli "http://localhost:8080/flowconfigs/(flowGroup:mygroup,flowName:myflow1)" -X PUT -H 'X-RestLi-Method: update' -H 'X-RestLi-Protocol-Version: 2.0.0' --data '{"flowName" : "myflow1", "flowGroup" : "mygroup", "templateName" : "FS:///mytemplate.template", "schedule" : "", "properties" : {"prop1" : "value2"}}'

...

DELETE /flowconfigs/(flowGroup:<groupName>,flowName:<flowName>)

Curli example:

curli "http://localhost:8080/flowconfigs/(flowGroup:mygroup,flowName:myflow1)" -X DELETE -H 'X-RestLi-Protocol-Version: 2.0.0'

...

GET /flowconfigs/(flowGroup:<groupName>,flowName:<flowName>)

Curli example:

curli "http://localhost:8080/flowconfigs/(flowGroup:mygroup,flowName:myflow)" -X GET -H 'X-RestLi-Protocol-Version: 2.0.0'

...

  • Authentication: Authentication will be handled via two mechanisms:

  1. RestLi: Implicitly handled by RestLi servlet filters, which will determine the user principal by talking to the Auth server.

  2. LDAP: Communicate with the LDAP server and determine the user principal.

...

  1. ACL Service: Extensible interfaces to authorize via different ACL services.

  2. Config Based ACLs: To start with, we can make use of ACLs hardcoded in config.
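As a purely hypothetical sketch of what ACLs hardcoded in config could look like (the key names below are invented for illustration and are not an actual GaaS configuration):

# Hypothetical: principals allowed to manage flows in the flow group "mygroup"
acl.flowGroup.mygroup.owners=alice,bob
# Hypothetical: fallback ACL when no group-specific entry exists
acl.default.owners=gaas-admins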

...

  1. Metrics Consumption (initially Kafka-based, with future possibilities of REST, etc.), which will consume GobblinMetricsEvents and dump them into a MetricsStore (a database such as MySQL or a NoSQL DB, or an OLAP store such as Pinot). This can be a simple Gobblin job that runs as a Gobblin application. The metrics emitted will include tags about the flow execution, debug information (such as a link to the execution and logs), and execution metrics.

  2. MetricAnalyzer, which will analyze these metrics on the fly and aggregate them by Flow.

  3. SLAMonitor (for Flows), which will run a producer that computes the expected Flow completions (using the list of available Flows and their schedules) AND a consumer that takes them out of the queue whenever the MetricAnalyzer sees that an expected Flow execution has happened and finished. If the SLA is breached, i.e., the expected Flow completion has not happened by its timeout or a Flow execution has failed, then SLAMonitor will trigger an alert (such as PagerDuty). A minimal sketch of this bookkeeping appears after this list.

  4. SLAMonitor (for Datasets) will also run as described in #3, but will further filter based on Datasets for more granular tracking.

  5. SpecExecutor Health Monitor: We will also run a SpecExecutor health monitor to check the current state of each SpecExecutor and notify via alerts if it is unhealthy. Eventually, when we make use of a mutable TopologyCatalog, we can also make corresponding topology changes if we have an alternative SpecExecutor that can execute the same JobSpec.
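To make the SLAMonitor's producer/consumer bookkeeping from step 3 concrete, here is a minimal Java sketch under hypothetical class and method names (an illustration only, not the actual implementation):

import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SlaMonitor {
  // Expected Flow completions, keyed by a flow id such as "<flowGroup>:<flowName>:<scheduledTime>".
  private final Map<String, Instant> expected = new ConcurrentHashMap<>();

  /** Producer: derive the next expected completion from each Flow's schedule. */
  void expectCompletion(String flowId, Instant deadline) {
    expected.put(flowId, deadline);
  }

  /** Consumer: MetricAnalyzer reports that an expected Flow execution finished. */
  void onFlowCompleted(String flowId) {
    expected.remove(flowId);
  }

  /** Consumer: a failed execution also breaches the SLA and alerts immediately. */
  void onFlowFailed(String flowId) {
    expected.remove(flowId);
    alert(flowId);
  }

  /** Periodic scan: any expected completion past its deadline triggers an alert. */
  void checkTimeouts(Instant now) {
    expected.forEach((flowId, deadline) -> {
      if (now.isAfter(deadline)) {
        expected.remove(flowId);
        alert(flowId);
      }
    });
  }

  private void alert(String flowId) {
    // Stand-in for a real alerting hook such as PagerDuty.
    System.err.println("SLA breached for flow: " + flowId);
  }
}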

...

  • FlowSpec: Data model representation that describes a Logical Data Flow.  

  • JobSpec: Data model representation that describes a physical Gobblin Job.   

  • TopologySpec: Data model representation that describes a topology, i.e., a tuple of a SpecExecutor and its capabilities (SpecExecutor is described later).

  • GobblinMetricsEvents: Standard events that gobblin-metric emits to help track various execution events and metrics.

  • Job Templates: Pre-configured files with properties that can be overridden for various possible physical Gobblin Jobs.

  • Template Packs: A LinkedIn-deployable multi-product that contains packaged archives of Gobblin Job Templates. Each use-case/team can have its own template pack. Template Packs are collections of ACL-controlled templates that are available only to the restricted list of principals whitelisted by the template pack owners.

...