This example shows how to convert a CSV file on HDFS to Avro in standalone mode. The final output is written back to HDFS.
- First, create a CSV file and write it to the HDFS location hdfs://localhost:9000/source.
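A minimal sketch of this step, assuming the `hdfs` client is on the PATH and the NameNode listens on localhost:9000. The file name `sample.csv` and the five rows are hypothetical. They follow the ID,NAME layout of `source.schema` and carry no header line, because the job below sets `source.skip.first.record=false`.

```shell
set -eu

# Hypothetical local file name; override with CSV_FILE=... if needed.
CSV_FILE="${CSV_FILE:-sample.csv}"

# Five illustrative rows in ID,NAME order, with no header line.
printf '%s\n' \
  '1,Alice' \
  '2,Bob' \
  '3,Carol' \
  '4,Dave' \
  '5,Eve' > "$CSV_FILE"

# Upload only when the hdfs client is available.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -mkdir -p hdfs://localhost:9000/source
  hdfs dfs -put -f "$CSV_FILE" hdfs://localhost:9000/source/
else
  echo "hdfs client not found; created $CSV_FILE locally only" >&2
fi
```

Running `hdfs dfs -ls hdfs://localhost:9000/source` afterwards should list the uploaded file.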
- Create a job configuration file with the properties set as shown below.
job.name=CSVToAvroQuickStart
fs.uri=hdfs://localhost:9000
converter.classes=org.apache.gobblin.converter.csv.CsvToJsonConverter,org.apache.gobblin.converter.avro.JsonIntermediateToAvroConverter
writer.builder.class=org.apache.gobblin.writer.AvroDataWriterBuilder
source.class=org.apache.gobblin.source.extractor.filebased.TextFileBasedSource
source.filebased.data.directory=${fs.uri}/source
source.filebased.fs.uri=${fs.uri}
source.schema=[{"columnName":"ID","comment":"","isNullable":"true","dataType":{"type":"String"}},{"columnName":"NAME","comment":"","isNullable":"true","dataType":{"type":"String"}}]
source.skip.first.record=false
extract.table.name=CsvToAvro
extract.namespace=org.apache.gobblin.example
extract.table.type=APPEND_ONLY
converter.csv.to.json.delimiter=","
writer.output.format=AVRO
writer.destination.type=HDFS
writer.fs.uri=${fs.uri}
writer.staging.dir=/writer-staging
writer.output.dir=/output
state.store.dir=/state
state.store.fs.uri=hdfs://localhost:9000
data.publisher.final.dir=/final
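A key that appears twice in a long properties file is easy to overlook, so a quick check before starting the job may help. The sketch below lists every key defined more than once. The helper name `duplicate_keys` is hypothetical, and the default path assumes the file was saved as ~/gobblin/conf/ex.pull, the location passed to `--conf` in this example.

```shell
set -eu

# Assumed location of the job file; override with PULL_FILE=... if needed.
PULL_FILE="${PULL_FILE:-$HOME/gobblin/conf/ex.pull}"

# Print each property key that occurs more than once in the given file.
# Comment lines are skipped; everything before the first '=' is the key.
duplicate_keys() {
  grep -v '^[[:space:]]*#' "$1" | grep '=' | cut -d= -f1 | sort | uniq -d
}

if [ -f "$PULL_FILE" ]; then
  dups="$(duplicate_keys "$PULL_FILE")"
  if [ -n "$dups" ]; then
    echo "Duplicate keys in $PULL_FILE:" >&2
    echo "$dups" >&2
  fi
fi
```

An empty result means every key is defined exactly once.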
- Run gobblin-standalone.sh:
bin/gobblin-standalone.sh start --conf ~/gobblin/conf/ex.pull --workdir /tmp
- After the job finishes, the following messages should be in the job log:
INFO [TaskExecutor-0] org.apache.gobblin.runtime.Task 468 - Row quality checker finished with results:
INFO [ForkExecutor-0] org.apache.gobblin.runtime.fork.Fork 520 - Wrapping writer org.apache.gobblin.writer.PartitionedDataWriter@78e717a5
INFO [TaskExecutor-0] org.apache.gobblin.runtime.Task 486 - Task shutdown: Fork future reaped in 103 millis
INFO [TaskExecutor-0] org.apache.gobblin.runtime.Task 347 - Extracted 5 data records
INFO [TaskExecutor-0] org.apache.gobblin.runtime.Task 348 - Row quality checker finished with results:
INFO [JobScheduler-1] org.apache.gobblin.runtime.GobblinMultiTaskAttempt 144 - All assigned tasks of job job_CSVToAvroQuickStart_1512879419231 have completed in container
INFO [JobScheduler-1] org.apache.gobblin.runtime.GobblinMultiTaskAttempt 371 - Will commit tasks directly.
INFO [Task-committing-pool-0] org.apache.gobblin.publisher.TaskPublisher 48 - All components finished successfully, checking quality tests
- The Avro output (converted from the CSV input) should be available under the /final/org/apache/gobblin/example/CsvToAvro directory on HDFS.
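One way to confirm the result is to list the publish directory. This is a sketch that assumes the `hdfs` client is on the PATH and that the same NameNode address (localhost:9000) is in use. The names of the files it lists depend on the run.

```shell
set -eu

# Final directory: data.publisher.final.dir + namespace path + table name.
OUTPUT_DIR="/final/org/apache/gobblin/example/CsvToAvro"

if command -v hdfs >/dev/null 2>&1; then
  # Recursively list whatever the job published.
  hdfs dfs -ls -R "hdfs://localhost:9000${OUTPUT_DIR}"
else
  echo "hdfs client not found; skipping listing of ${OUTPUT_DIR}" >&2
fi
```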