Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

The problem we try to solve here is the cache consistency issue. We already build a HMS cache to cache the metadata. If we have multiple HMS in the cluster, the cache is not synchronized. That is, if metastore 1 changed a table/partition, metastore 2 won’t see the change immediately. There’s a background thread keep polling from the notification log and update changed entries, so the cache is eventually consistent. In this work, we want to make the cache full consistent. The idea is at read time, we will check if the cached entry is obsolete or not. However, we don’t want to penalize the read performance. We don’t want to introduce additional db call to compare the db version and cached version in order to do the check. The solution is we will use the transaction state of a query for the version check. A query will pull the transaction state of involved tables (ValidWriteIdList) from db (non-cached) anyway. So we don’t need additional db call to check the staleness of the cache.

.

Data structure change

The only data structure change is adding ValidWriteIdList into SharedCache.TableWrapper/SharedCache.PartitionWrapper, which represents the transaction state of the cached entrytable.

Image RemovedImage Added

Note there is no db table structure change, and we don’t store extra information in db. We don’t update TBLS.WRITE_ID field as we will use db as the fact of truth. We assume db always carry the latest copy and every time we fetch from db, we will tag it with the transaction state of the query.

...

  1. At the beginning of the query, Hive will retrieve the global transaction state and store in config (ValidTxnList.VALID_TXNS_KEY)
  2. Hive translate ValidTxnList to ValidWriteIdList of the table [1312:7,8,12] (The format for writeid is [hwm:exceptions], all writeids from 1 to hwm minus exceptions, are committed. In this example, writeid 1..6,9,10,11 are committed)
  3. Hive pass the ValidWriteIdList to HMS
  4. HMS compare ValidWriteIdList [1312:7,8,12] with the cached one [1211:7,8] using TxnIdUtils.compare, if it is fresh or newer (Fresh or newer means no transaction committed between two states), Metastore . In this example, [11:7,8] means writeid 1..6,9,10,11 are committed, the same as the requested writeid [12:7,8,12]), HMS return cached table entry
  5. If the cached ValidTxnList ValidWriteIdList is [12:7,12], the comparison fails because write id writeid 8 is committed since then. Metastore HMS will fetch the table from ObjectStore
  6. HMS will eventually catch up with the newer version from notification log. HMS will serve the request from cache since then

Here is another example of get_partitions_by_expr. In contrast with previous example, this The API is a list query not a point lookup:. There are a couple of similarities and differences to point out:

  1. HMS will still compare requested writeid and cached table writeid to decide if the request can serve from cache
  2. Every add/remove/alter/rename partition request will increment the table writeid. HMS will mark cached table entry invalid upon processing the first write message from notification log, and mark it valid and tag with the right writeid upon processing the commit message from notification log
  3. At the beginning of the query, Hive will retrieve the global transaction state and store in config (ValidTxnList.VALID_TXNS_KEY)
  4. Hive translate ValidTxnList to ValidWriteIdList of the table [13:7,8,12]
  5. Hive pass the ValidWriteIdList to HMS
  6. Metastore go through the cached partitions of the table, if any of the partition in cache does not compatible with the new ValidWriteIdList, Metastore will serve the list query from ObjectStore

Write

There is no change on HMS write side. HMS will write data into db and also put an entry in notification log.

Commit

When the transaction is committed, HMS will put writeid of the modified tables and partitions during the query into notification log. HMS will retrieve the writeid from db. As an optimization, HMS client may also pass a flag to indicate this is a read-only transaction, HMS would not pull writeid from db if it is read-only.

Cache update

In the previous discussion, we know if the cache is stale, metastore HMS will serve the request from ObjectStore. We need to catch up the cache with the latest change. This can be done by the existing notification log based cache update mechanism. A thread in HMS constantly poll from notification log, update the cache with the entries from notification log. The interesting entries in notification log are table/partition writes, and corresponding commit transaction message involving table/partition writes. When processing table/partition writes, HMS will put the table/partition entry in cache. However, the entry is not immediately usable until the commit message of the corresponding writes is processed, which contains and tag the ValidWriteIdList of for all the entries modified by the transaction.

Here is a complete flow for a cache update when write happen (and illustrated in the diagram):

  1. The ValidWriteIdList of cached table is initially [1311:7,8,12]
  2. HMS 1 get a alter_table request. HMS 1 puts alter_table message to notification log
  3. The transaction in HMS 1 get committed. HMS 1 puts commit message to notification log along with the writeid [1412:7,8,12]
  4. The cache update thread in HMS 2 will read the alter_table event from notification log, update the cache with the new version from notification log. However, the entry is not available for read as there’s no writeid associate with it yet
  5. A read for the entry on HMS 2 will fetch from db since the entry is not available for read
  6. The cache update thread will further read commit event from notification log, tag the entry with the writeid [1412:7,8,12]
  7. The next read from HMS 2 will server serve from cache


Bootstrap

The use cases discussed so far are driven by a query. However, during the HMS startup, there’s a cache prewarm. HMS will fetch everything from db to cache. There is no particular query drives the process, that means we don’t have ValidWriteIdList of the query. Prewarm needs to generate ValidWriteIdList by itself. To do that, for every table, HMS will query the current global transaction state ValidTxnList (HiveTxnManager.getValidTxns), and then convert it to table specific ValidWriteIdList (HiveTxnManager.getValidWriteIds). As an optimization, we don’t have to invoke HiveTxnManager.getValidTxns per table. We can invoke it every couple of minutes. If ValidTxnList is outdated, we will get an outdated ValidWriteIdList. Next time when Hive read this entry, Metastore will fetch from the db even though it is in fact fresh. There’s no correctness issue, only impact performance in some cases. The other possibility is the entry changes after we fetches ValidWriteIdList. This is not unlikely as fetching all partitions of the table may take some time. If that happens, the cached entry is actually newer than the ValidWriteIdList. The next time Hive reads it will trigger a db read though it is not necessary. Again, there’s no correctness issue, only impact performance in some cases.

...

Write id is not valid for external tables. And Hive won’t fetch ValidWriteIdList if the query source is external tables. Without ValidWriteIdList, HMS won’t able to check the staleness of the cache. To solve this problem, we will use the original eventually consistent model for external tables. That is, if the table is external table, Hive will pass null ValidWriteIdList to metastore API/CachedStore. Metastore cache won’t store ValidWriteIdList alongside the entry. When reading, CachedStore always retrieve the current copy. The original notification update will update the metadata of external tables, so we can eventually get updates from external changes.

Consistency Guarantee

Since the source of truth for cache is notification log, and the notification log is total ordered, the cache provides monotonic reads. The cost of that is we delay the update of cache until the notification log is caught up by the background thread. This guarantee the cache always move forward not backward. During the interim before HMS catch up the notification log, read will be served from db. The performance will suffer during this short period of time, but consider write operation is less often, the cost is minor.

...

Adding a serialized version of ValidWriteIdList to every read HMS API, and commit API.

hive_metastore.thriftOld API

New API

get_table(string dbname,string tbl_name)

get_table(string dbname,string tbl_name,string validWriteIdList)

commit_txn(CommitTxnRequest rqst)Adding new field txnWriteIds to rqst

Actually the reality is a bit more complex because of two reasonsActually we don’t need to add the new field into every read request because:

  1. Many APIs are using a request structure rather than taking individual parameters. So need to add ValidWriteIdList to the request structure instead
  2. Some APIs already take ValidWriteIdList to invalidate outdated transactional statistics. We don’t need to change the API signature, but will reuse the ValidWriteIdList to validate cached entries in CachedStore

For HMS write, if validWriteIdList=null, HMS won’t cache the entry at all if this is managed table, and will cache regardless of validWriteIdList if this is external table. For HMS read, if validWriteIdList=null, HMS will return null if it is managed table, and return the cached entry regardless if it is external table.

In commit_txn request, we will add an optional boolean field readonly, which indicate the query does not modify any entries in the cache, and HMS don’t need to fetch writeid for the transaction from db. Note this is a performance optimization which does not impact the correctness.

HMS read Thrift API will remain backward compatible for external table. That is, new server can deal with old client. If the old client issue a create_table call, server side will receive the request of create_table with validWriteIdList=null, and will cache or retrieve the entry regardless(with eventual consistency model). For managed table, validWriteIdList will be required and HMS server will throw an exception if validWriteIdList=null (for both read and write).

RawStore

ObjectStore will use the additional validWriteIdList field for all read methods to compare with cached ValidWriteIdList

...

All other components invoking HMS API directly (bypass Hive.java) will be changed to invoke the newer HMS API.   This includes HCatalog, Hive streaming, etc, and other projects using HMS client such as Impala.

For every read request and commit transaction request involving table/partitions, HMS client need to pass a validWriteIdList string in addition to the existing arguments. validWriteIdList can be null if it is external table, as HMS will return whatever in the cache for external table using eventual consistency. But if validWriteIdList=null for managed table, HMS will throw exception. validWriteIdList is a serialized form of [ValidReaderWriteIdList (|https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/common/ValidReaderWriteIdList.java#L119)]. Usually ValidReaderWriteIdList can be obtained from [HiveTxnManager(|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveTxnManager.java) using ] using the following code snippet:

ValidTxnList txnIds = txnMgr.getValidTxns(); // get global transaction state
txnWriteIds = txnMgr.getValidWriteIds(txnTables, txnString); // map global transaction state to table specific write id

Optionally, HMS client can set readonly flag in commit transaction request if this is a readonly transaction. This will save a db fetch for HMS for readonly query.