...
- Set up a single-node Kafka broker as in the standalone mode
- Set up a single-node Hadoop cluster in pseudo-distributed mode as explained here, following the instructions to set up a YARN cluster
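Before running the job, the Kafka topic the writer targets and the HDFS input path the source reads from both need to exist. A minimal sketch, assuming the Kafka CLI scripts are on your `PATH` and HDFS is running at `localhost:9000` (the topic and path names match the `writer.kafka.topic` and `source.hadoop.file.input.paths` properties in the config below; adjust them for your install):

```shell
# Create the Kafka topic the job writes to.
# Newer Kafka releases take --bootstrap-server; older releases use
# --zookeeper localhost:2181 instead.
bin/kafka-topics.sh --create --topic MRTest \
  --bootstrap-server localhost:9092 \
  --partitions 1 --replication-factor 1

# Stage some sample text input in HDFS for the source to pick up.
hdfs dfs -mkdir -p /data/test
echo "hello gobblin" > /tmp/sample.txt
hdfs dfs -put /tmp/sample.txt /data/test/
```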
Create a job config file with the following properties:
```
job.name=GobblinHdfsMRQuickStart
job.group=GobblinHdfsMR
job.description=Gobblin quick start job for Hdfs
job.lock.enabled=false

launcher.type=MAPREDUCE

fs.uri=hdfs://localhost:9000

source.class=org.apache.gobblin.example.hadoop.HadoopTextFileSource
extract.namespace=org.apache.gobblin.example.hadoop
extract.table.name=test
extract.table.type=APPEND_ONLY

writer.fs.uri=hdfs://localhost:9000
state.store.fs.uri=hdfs://localhost:9000

source.hadoop.file.input.format.class=org.apache.hadoop.mapreduce.lib.input.TextInputFormat
source.hadoop.file.splits.desired=1
source.hadoop.file.input.paths=hdfs://localhost:9000/data/test

converter.classes=org.apache.gobblin.converter.string.ObjectToStringConverter

writer.builder.class=org.apache.gobblin.kafka.writer.KafkaDataWriterBuilder
writer.kafka.topic=MRTest
writer.kafka.producerConfig.bootstrap.servers=localhost:9092
writer.kafka.producerConfig.value.serializer=org.apache.kafka.common.serialization.StringSerializer

data.publisher.type=org.apache.gobblin.publisher.NoopPublisher

mr.job.max.mappers=1

metrics.reporting.file.enabled=true
metrics.log.dir=/tmp/suvasude/metrics
metrics.reporting.file.suffix=txt

mr.job.root.dir=/gobblin-kafka/working
state.store.dir=/gobblin-kafka/state-store
task.data.root.dir=/jobs/kafkaetl/gobblin/gobblin-kafka/task-data
```
Run `gobblin-mapreduce.sh`:

```
bin/gobblin-mapreduce.sh --conf <path-to-job-config-file>
```
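Once the job finishes, you can check that records landed in Kafka with the console consumer. A sketch, assuming the same broker and topic as in the config above (older Kafka releases use `--zookeeper localhost:2181` in place of `--bootstrap-server`):

```shell
# Read the topic from the beginning; each input line written by the
# job should appear as one message.
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic MRTest --from-beginning
```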