
This example illustrates how a CSV file on HDFS can be converted to Avro with Gobblin running in standalone mode; the final output is written back to HDFS.

 

  • First, create a CSV file and write it to the HDFS location hdfs://localhost:9000/source
  • Create a job configuration file with the properties set as shown below.
  • Run the example as 

    bin/gobblin-standalone.sh start --conf /path/to/job/conf --workdir /tmp

  • The output Avro file should appear under the "/final" directory on HDFS (data.publisher.final.dir).

    job.name=CSVToAvroQuickStart
     
    fs.uri=hdfs://localhost:9000
     
    converter.classes=org.apache.gobblin.converter.csv.CsvToJsonConverter,org.apache.gobblin.converter.avro.JsonIntermediateToAvroConverter
    writer.builder.class=org.apache.gobblin.writer.AvroDataWriterBuilder
     
    source.class=org.apache.gobblin.source.extractor.filebased.TextFileBasedSource
    source.filebased.data.directory=${fs.uri}/source
    source.filebased.fs.uri=${fs.uri}
    source.schema=[{"columnName":"ID","comment":"","isNullable":"true","dataType":{"type":"String"}},{"columnName":"NAME","comment":"","isNullable":"true","dataType":{"type":"String"}}]
    source.skip.first.record=false
     
    extract.table.name=CsvToAvro
    extract.namespace=org.apache.gobblin.example
    extract.table.type=APPEND_ONLY
     
    converter.csv.to.json.delimiter=","
     
    writer.output.format=AVRO
    writer.destination.type=HDFS
    writer.fs.uri=${fs.uri}
    writer.staging.dir=/writer-staging
    writer.output.dir=/output
     
    state.store.dir=/state
    state.store.fs.uri=hdfs://localhost:9000
    data.publisher.final.dir=/final
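
The first step assumes an input CSV matching source.schema (two nullable string columns, ID and NAME). A minimal sketch of preparing such a file locally before copying it into the job's source directory; the filename sample.csv and the record values are illustrative:

```shell
# Write five header-less rows matching the ID,NAME schema
# (source.skip.first.record=false, so no header line is expected).
printf '1,alice\n2,bob\n3,carol\n4,dave\n5,eve\n' > sample.csv

# Copy the file into the directory the job reads from
# (source.filebased.data.directory); run these against a live HDFS:
#   hdfs dfs -mkdir -p hdfs://localhost:9000/source
#   hdfs dfs -put sample.csv hdfs://localhost:9000/source/
```

With this input, the job log should report "Extracted 5 data records", matching the sample log output below.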


  • Save the job configuration above as a .pull file and run gobblin-standalone.sh
    • bin/gobblin-standalone.sh start --conf ~/gobblin/conf/ex.pull --workdir /tmp

  • After the job finishes, the following messages should be in the job log:
    • INFO  [TaskExecutor-0] org.apache.gobblin.runtime.Task  468 - Row quality checker finished with results: 
      INFO  [ForkExecutor-0] org.apache.gobblin.runtime.fork.Fork  520 - Wrapping writer org.apache.gobblin.writer.PartitionedDataWriter@78e717a5
      INFO  [TaskExecutor-0] org.apache.gobblin.runtime.Task  486 - Task shutdown: Fork future reaped in 103 millis
      INFO  [TaskExecutor-0] org.apache.gobblin.runtime.Task  347 - Extracted 5 data records
      INFO  [TaskExecutor-0] org.apache.gobblin.runtime.Task  348 - Row quality checker finished with results: 
      INFO  [JobScheduler-1] org.apache.gobblin.runtime.GobblinMultiTaskAttempt  144 - All assigned tasks of job job_CSVToAvroQuickStart_1512879419231 have completed in container 
      INFO  [JobScheduler-1] org.apache.gobblin.runtime.GobblinMultiTaskAttempt  371 - Will commit tasks directly.
      INFO  [Task-committing-pool-0] org.apache.gobblin.publisher.TaskPublisher  48 - All components finished successfully, checking quality tests
  • The Avro output (converted from the CSV input) should be available under the /final/org/apache/gobblin/example/CsvToAvro directory on HDFS.
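
The published output can be inspected from the command line. A hedged sketch; it assumes the hdfs client is on the PATH and that Avro's avro-tools jar (the avro-tools.jar path here is illustrative) is available locally for dumping records as JSON:

```shell
# Guarded so the snippet is a no-op on machines without an HDFS client.
if command -v hdfs >/dev/null 2>&1; then
  # List the published Avro files; the path follows
  # extract.namespace/extract.table.name under data.publisher.final.dir.
  hdfs dfs -ls /final/org/apache/gobblin/example/CsvToAvro

  # Dump the records as JSON with Avro's command-line tools ("-" reads stdin).
  hdfs dfs -cat /final/org/apache/gobblin/example/CsvToAvro/*.avro \
    | java -jar avro-tools.jar tojson -
fi
```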

 
