


...

The Streaming Mutation API, although similar to the Streaming API, differs from it in several ways and is built to enable very different use cases. Superficially, the Streaming API can only write new data, whereas the mutation API can also modify existing data. However, the two APIs are also based on very different transaction models. The Streaming API focuses on surfacing a continuous stream of new data into a Hive table, and does so by batching small sets of writes into multiple short-lived transactions. Conversely, the mutation API is designed to infrequently apply large sets of mutations to a data set in an atomic fashion: either all or none of the mutations will be applied. This in turn mandates the use of a single long-lived transaction. This table summarises the attributes of each API:

| Attribute | Streaming API | Mutation API |
|---|---|---|
| Ingest type | Data arrives continuously. | Ingests are performed periodically and the mutations are applied in a single batch. |
| Transaction scope | Transactions are created for small batches of writes. | The entire set of mutations should be applied within a single transaction. |
| Data availability | Surfaces new data to users frequently and quickly. | Change sets should be applied atomically: either the effect of the delta is visible or it is not. |
| Sensitive to record order | No, records do not have pre-existing lastTxnIds or bucketIds. Records are likely being written into a single partition (today's date, for example). | Yes, all mutated records have existing RecordIdentifiers and must be grouped by [partitionValues, bucketId] and sorted by lastTxnId. These record coordinates initially arrive in an order that is effectively random. |
| Impact of a write failure | The transaction can be aborted and the producer can choose to resubmit failed records, as ordering is not important. | Ingest for the respective group (partitionValues + bucketId) must be halted and failed records resubmitted to preserve sequence. |
| User perception of missing data | Data has not arrived yet → "latency?" | "This data is inconsistent, some records have been updated, but other related records have not." Consider here the classic scenario of a transfer between bank accounts. |
| API end point scope | A given HiveEndPoint instance submits many transactions to a specific bucket, in a specific partition, of a specific table. | A set of MutationCoordinators writes changes to an unknown set of buckets, of an unknown set of partitions, of specific tables (can be more than one), within a single transaction. |
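
The record-order requirement for the mutation API (group by [partitionValues, bucketId], then sort by lastTxnId) can be illustrated with a small sketch. This is not the mutation API itself: the `Mutation` class, its fields, and the `groupAndSort` helper are hypothetical names introduced only for this example, and the code assumes the record coordinates are already known for each record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: these types are hypothetical stand-ins, not Hive classes.
class MutationOrdering {

    /** Hypothetical coordinates of an existing record that is about to be mutated. */
    static final class Mutation {
        final List<String> partitionValues;
        final int bucketId;
        final long lastTxnId;
        final String payload;

        Mutation(List<String> partitionValues, int bucketId, long lastTxnId, String payload) {
            this.partitionValues = partitionValues;
            this.bucketId = bucketId;
            this.lastTxnId = lastTxnId;
            this.payload = payload;
        }
    }

    /**
     * Groups mutations by [partitionValues, bucketId] and sorts each group by lastTxnId.
     * Groups are kept in the order in which they are first seen.
     */
    static Map<List<Object>, List<Mutation>> groupAndSort(List<Mutation> mutations) {
        Map<List<Object>, List<Mutation>> groups = new LinkedHashMap<>();
        for (Mutation m : mutations) {
            // A two-element list serves as a composite key; List.equals compares contents.
            List<Object> key = Arrays.<Object>asList(m.partitionValues, m.bucketId);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(m);
        }
        for (List<Mutation> group : groups.values()) {
            group.sort(Comparator.comparingLong((Mutation m) -> m.lastTxnId));
        }
        return groups;
    }

    public static void main(String[] args) {
        // Coordinates arrive in an effectively random order.
        List<Mutation> incoming = Arrays.asList(
            new Mutation(Arrays.asList("2015-01-02"), 1, 12L, "update C"),
            new Mutation(Arrays.asList("2015-01-01"), 0, 9L, "update A"),
            new Mutation(Arrays.asList("2015-01-02"), 1, 4L, "delete D"),
            new Mutation(Arrays.asList("2015-01-01"), 0, 3L, "update B"));

        for (Map.Entry<List<Object>, List<Mutation>> e : groupAndSort(incoming).entrySet()) {
            System.out.println("group " + e.getKey());
            for (Mutation m : e.getValue()) {
                System.out.println("  lastTxnId=" + m.lastTxnId + " " + m.payload);
            }
        }
    }
}
```

An in-memory sort like this only suits small change sets. For the large, infrequent batches that the mutation API targets, the same grouping and ordering would more plausibly be done by whatever distributed job prepares the mutations.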

...