This wiki documents the steps to get started with Gobblin-as-a-Service and a single-tenant Azkaban Orchestrator.

Setup

Download code (master)

 git clone https://github.com/apache/gobblin.git

Build distribution 

 cd gobblin; ./gradlew clean build

Locate and extract the distribution

cd build/gobblin-distribution/distributions; tar xvf gobblin-distribution-*.tar.gz 

Move into the extracted directory and start the service after modifying the configuration (as described in the next section)

cd gobblin-dist; bin/gobblin-service.sh start

Configuration

Topology Catalog

To enable Gobblin-as-a-Service to talk to an Azkaban deployment, modify application.conf:

vi conf/service/application.conf

The configuration for an Azkaban deployment looks like this:

topologySpecFactory.topologyNames=<azkaban-name>
topologySpecFactory.<azkaban-name>.gobblin.service.azkaban.username=<username>
topologySpecFactory.<azkaban-name>.gobblin.service.azkaban.password=<password>
topologySpecFactory.<azkaban-name>.gobblin.service.azkaban.server.url="<url>"
topologySpecFactory.<azkaban-name>.description="AzkabanTopology"
topologySpecFactory.<azkaban-name>.version="1"
topologySpecFactory.<azkaban-name>.uri="GobblinAzkaban"
topologySpecFactory.<azkaban-name>.specExecutorInstanceProducer.class="org.apache.gobblin.service.modules.orchestration.AzkabanSpecExecutorInstanceProducer"
topologySpecFactory.<azkaban-name>.specExecInstance.capabilities="any:any"
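Every key above repeats the topology name, so the block can be generated for a given deployment. The following is a minimal sketch, not part of Gobblin: the function name render_topology_conf and its parameters are hypothetical, and only the keys and fixed values are taken from the configuration above.

```python
def render_topology_conf(name, username, password, url):
    """Render the topologySpecFactory entries for one Azkaban deployment.

    Hypothetical helper: the keys mirror the configuration shown above,
    while the arguments are whatever your own deployment uses.
    """
    prefix = f"topologySpecFactory.{name}"
    lines = [
        f"topologySpecFactory.topologyNames={name}",
        f"{prefix}.gobblin.service.azkaban.username={username}",
        f"{prefix}.gobblin.service.azkaban.password={password}",
        f'{prefix}.gobblin.service.azkaban.server.url="{url}"',
        f'{prefix}.description="AzkabanTopology"',
        f'{prefix}.version="1"',
        f'{prefix}.uri="GobblinAzkaban"',
        f"{prefix}.specExecutorInstanceProducer.class="
        '"org.apache.gobblin.service.modules.orchestration.AzkabanSpecExecutorInstanceProducer"',
        f'{prefix}.specExecInstance.capabilities="any:any"',
    ]
    return "\n".join(lines)
```

The returned text can be pasted into conf/service/application.conf in place of the placeholder block.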

Flow Catalog

No change is required for the Flow Catalog or its setup.

Template Catalog

By default, the Template Catalog uses the directory /tmp/templateCatalog.

To change the template catalog location, edit conf/service/application.conf:

vi conf/service/application.conf

Then set the following value to the desired directory:

gobblin.service.templateCatalogs.fullyQualifiedPath="file:///tmp/templateCatalog"

Adding Template

You can add new templates for Gobblin-as-a-Service to use by placing them in the template catalog. An example template is below:

# Azkaban specific properties
type=java

# Mandatory input properties (to be supplied via REST API call property bag)
gobblin.template.required_attributes="titles"

# Pre-defined properties
job.name=PullFromWikipedia
job.group=Wikipedia
job.description=A getting started example for Gobblin

source.class=org.apache.gobblin.example.wikipedia.WikipediaSource
source.page.titles=${titles}
wikipedia.api.rooturl="https://en.wikipedia.org/w/api.php"
wikipedia.avro.schema="{\"namespace\": \"example.wikipedia.avro\",\"type\": \"record\",\"name\": \"WikipediaArticle\",\"fields\": [{\"name\": \"revid\", \"type\": [\"double\", \"null\"]},{\"name\": \"pageid\", \"type\": [\"double\", \"null\"]},{\"name\": \"title\", \"type\": [\"string\", \"null\"]},{\"name\": \"user\", \"type\": [\"string\", \"null\"]},{\"name\": \"anon\", \"type\": [\"string\", \"null\"]},{\"name\": \"userid\",  \"type\": [\"double\", \"null\"]},{\"name\": \"timestamp\", \"type\": [\"string\", \"null\"]},{\"name\": \"size\",  \"type\": [\"double\", \"null\"]},{\"name\": \"contentformat\",  \"type\": [\"string\", \"null\"]},{\"name\": \"contentmodel\",  \"type\": [\"string\", \"null\"]},{\"name\": \"content\", \"type\": [\"string\", \"null\"]}]}"
gobblin.wikipediaSource.maxRevisionsPerPage=10
extract.namespace=org.apache.gobblin.example.wikipedia
writer.builder.class=org.apache.gobblin.writer.ConsoleWriterBuilder
data.publisher.type=org.apache.gobblin.publisher.NoopPublisher
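The template declares "titles" as a required attribute and refers to it as ${titles}, so the property bag sent with a flow must supply that key. The sketch below illustrates the idea only and is not Gobblin's own resolution logic: the function names are hypothetical, the required-attributes value is assumed to be a comma-separated list, and Python's string.Template stands in for the real substitution mechanism.

```python
from string import Template


def missing_required_attributes(required_attributes, properties):
    """Return the required attribute names absent from the property bag.

    Assumes a comma-separated value, optionally wrapped in double quotes,
    as in gobblin.template.required_attributes="titles".
    """
    names = [n.strip() for n in required_attributes.strip('"').split(",")]
    return [n for n in names if n and n not in properties]


def resolve_line(template_line, properties):
    """Substitute ${name} placeholders in one template line (illustration only)."""
    return Template(template_line).substitute(properties)
```

For example, with the property bag {"titles": "Apache"} nothing is reported missing, and the line source.page.titles=${titles} resolves to source.page.titles=Apache.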

Execution

Creating a new Flow

You can create a new Flow using the above template through the following curl command: 

curl http://localhost:<port>/flowconfigs -X POST -H 'X-RestLi-Method: create' -H 'X-RestLi-Protocol-Version: 2.0.0' --data '{"id": {"flowName":"<flow name>", "flowGroup":"<flow group>"},"schedule":{"cronSchedule":"<cron schedule>", "runImmediately": true}, "templateUris" : "FS:///wikipedia.template", "properties" : {"gobblin.flow.sourceIdentifier" : "any", "gobblin.flow.destinationIdentifier" : "any",  "titles" : "Apache"}}'

Note

  1. The curl command specifies "titles" as a property, as required by the template.
  2. You should see an Azkaban project created when your flow executes.
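The same request can be issued from a script. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, while the endpoint path, headers, and payload fields mirror the curl command above. The base URL (including the port), flow name, flow group, and cron schedule remain values you must supply.

```python
import json
import urllib.request


def build_flow_config(flow_name, flow_group, cron_schedule, titles):
    """Build the JSON body used by the curl command above."""
    return {
        "id": {"flowName": flow_name, "flowGroup": flow_group},
        "schedule": {"cronSchedule": cron_schedule, "runImmediately": True},
        "templateUris": "FS:///wikipedia.template",
        "properties": {
            "gobblin.flow.sourceIdentifier": "any",
            "gobblin.flow.destinationIdentifier": "any",
            # Required by the template (gobblin.template.required_attributes).
            "titles": titles,
        },
    }


def build_create_request(base_url, flow_config):
    """Prepare the POST to <base_url>/flowconfigs with the Rest.li headers."""
    return urllib.request.Request(
        url=f"{base_url}/flowconfigs",
        data=json.dumps(flow_config).encode("utf-8"),
        method="POST",
        headers={
            "X-RestLi-Method": "create",
            "X-RestLi-Protocol-Version": "2.0.0",
        },
    )


def create_flow(base_url, flow_config):
    """Send the request and return the HTTP status code."""
    request = build_create_request(base_url, flow_config)
    with urllib.request.urlopen(request) as response:
        return response.status
```

To use it, call build_flow_config with your own flow name, flow group, and cron schedule, then pass the result to create_flow together with the service's base URL. Separating request construction from sending lets you inspect the payload before anything reaches the service.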


