It's common for duplicates to be introduced into the Kafka events your microservices emit, for reasons such as:

  • Client/Producer failures
  • Network errors
  • At-least-once delivery when moving data across Kafka clusters


This can lead to overcounting and accuracy problems: over-reporting impressions, billing ads incorrectly, and low-fidelity data in dashboards.

Here's how Hudi can help solve this problem with a single option passed to the DeltaStreamer tool (--filter-dupes, shown below).


// DeltaStreamer command to ingest Kafka events, dropping duplicates during ingestion
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  /path/to/hudi-utilities-bundle-*.jar \
  --props s3://path/to/kafka-source.properties \
  --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
  --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
  --source-ordering-field time \
  --target-base-path s3:///hudi-deltastreamer/impressions --target-table uber.impressions \
  --op BULK_INSERT \
  --filter-dupes


// kafka-source.properties (referenced by --props above)
include=base.properties
# Key fields, for kafka example
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.partitionpath.field=datestr
# schema provider configs
hoodie.deltastreamer.schemaprovider.registry.url=http://localhost:8081/subjects/impressions-value/versions/latest
# Kafka Source
hoodie.deltastreamer.source.kafka.topic=impressions
# Kafka props
metadata.broker.list=localhost:9092
auto.offset.reset=smallest
schema.registry.url=http://localhost:8081
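
If you write to the table through the Spark datasource instead of DeltaStreamer, the same de-duplication can be enabled with a write option. The sketch below is illustrative only: it assumes a spark-shell (or Spark job) with the Hudi bundle on the classpath, the input path and DataFrame name are hypothetical, and the record key, partition path, and ordering fields simply mirror the properties above.

// Spark datasource writer equivalent (illustrative sketch, not the DeltaStreamer path)
import org.apache.spark.sql.SaveMode

// Hypothetical raw impression events; replace with your actual source
val impressions = spark.read.format("json").load("s3://path/to/raw/impressions")

impressions.write
  .format("org.apache.hudi")
  .option("hoodie.table.name", "uber.impressions")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.partitionpath.field", "datestr")
  .option("hoodie.datasource.write.precombine.field", "time")
  .option("hoodie.datasource.write.operation", "insert")
  // drops incoming records whose keys already exist in the table, similar to --filter-dupes
  .option("hoodie.datasource.write.insert.drop.duplicates", "true")
  .mode(SaveMode.Append)
  .save("s3:///hudi-deltastreamer/impressions")

Note that hoodie.datasource.write.insert.drop.duplicates only applies to insert-style operations; with upsert, records sharing the same key are reconciled using the precombine (ordering) field instead.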
