Deletes are supported at the record level in Hudi with the 0.5.1 release. This is a "how to" blog on deleting records in Hudi. Deletes can be performed in 3 flavors: with the Hudi RDD APIs (via HoodieWriteClient), with the Spark datasource, and with DeltaStreamer.

Delete

...

Deletion using RDD Level APIs

If you have embedded HoodieWriteClient, then deletion is as simple as passing a JavaRDD<HoodieKey> to the delete API.

Code Block
// Fetch list of HoodieKeys from elsewhere that needs to be deleted
// convert to JavaRDD if required. JavaRDD<HoodieKey> toBeDeletedKeys

List<WriteStatus> statuses = writeClient.delete(toBeDeletedKeys, commitTime); 

Deletion with Datasource

Now we will walk through an example of how to perform deletes on a sample dataset using the Datasource API. The Quick Start has the same example as below; feel free to check it out.

...

Step 2: Import the required classes and set up the table name, etc., for the sample dataset

Code Block
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

val tableName = "hudi_cow_table"
val basePath = "file:///tmp/hudi_cow_table"
val dataGen = new DataGenerator
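The delete itself follows the same write path as an upsert, with the operation set to "delete". A minimal sketch, assuming a DataFrame `deleteDf` holding the records to be removed (the option keys come from the DataSourceWriteOptions and HoodieWriteConfig imports above; field names like "uuid" and "partitionpath" match the quickstart data generator):

```scala
// Sketch: issue a delete through the datasource by writing with the
// "delete" operation. `deleteDf` is an assumed DataFrame of records
// (or at least their key fields) to remove.
deleteDf.write.format("org.apache.hudi").
  options(getQuickstartWriteConfigs).
  option(OPERATION_OPT_KEY, "delete").
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)
```

Only the keys of the records in `deleteDf` matter for matching; the write is appended as a new commit on the table.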

...

Deletion with HoodieDeltaStreamer takes the same path as upsert, so it relies on a specific boolean field called "_hoodie_is_deleted" in each record.

  • If a record has the field set to false, or the field is not present, then it is considered a regular upsert.
  • If the value is set to true, then it is considered a deleted record.

This essentially means that the source schema has to be changed to add this field, and all incoming records are expected to have it set. We will be working to relax this in future releases, but for now this is what we have.
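The per-record rule above can be sketched in plain Scala. The field name "_hoodie_is_deleted" and the rule come from the text; the `Record` type and sample data here are illustrative assumptions, not Hudi classes:

```scala
// Sketch of the rule DeltaStreamer applies to each incoming record:
// "_hoodie_is_deleted" absent or false -> regular upsert; true -> delete.
case class Record(key: String, hoodieIsDeleted: Option[Boolean])

def isDeleted(r: Record): Boolean = r.hoodieIsDeleted.getOrElse(false)

val incoming = Seq(
  Record("uuid-1", None),        // field not present -> regular upsert
  Record("uuid-2", Some(false)), // explicitly false  -> regular upsert
  Record("uuid-3", Some(true))   // true -> treated as a delete
)

val (deletes, upserts) = incoming.partition(isDeleted)
```

With the sample data above, `deletes` holds only "uuid-3" and the other two records go down the upsert path.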

Let's say the original schema is:

...