Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • Implement the AzureJobCoordinator on top of current JobCoordinator.

  • Change the current Latch implementation to a distributed lock based implementation, and implement it for Zookeeper.

  • Implement the distributed Lock and Leader Election functionality with Lease Blobs in Azure. These are pluggable components. 

    • A blob in Azure storage is used for storing large amounts of unstructured data. 

    • A Lease Blob is an operation that establishes and manages a lock on a blob for write and delete operations. 

    • Acquiring a lease on a blob that has already been leased by someone results in failure. 

    • We will use this service to elect the leader when running in Azure as follows:

      • All the processors will try to acquire a lease on the same shared blob in Azure storage.

      • The processor that acquires the lease becomes the leader and automatically renews the lease at a constant time interval.

      • When the leader dies, it results in release of the lease, following which one of the worker processes will acquire a new lease on the blob. 


  • Azure does not have a watch/subscribe functionality in Azure storage. Therefore, any change in the list of processors in the system will be detected by monitoring each processor's heartbeats. This will happen as follows:
    • We will maintain the list of active processors in Azure Table Storage. 
    • Every entry in a table is uniquely identified by two of its components - PARTITION KEY and ROW KEY. 
    • The PARTITION KEY of a table can be considered as a group id. The ROW KEY identifies every entry in a group uniquely. 
    • In this design:
      • PARTITION KEY = Job Model Version
      • ROW KEY = Processor ID
    • Entries in the table will correspond to a ProcessorEntity, which consists of a liveness value and a boolean field to indicate whether it is the leader or not.
    • The processor table will maintain the current state of the system by actively maintaining the list of live processors. 
    • Every processor will constantly update its liveness value periodically to assert that it is alive. The leader will assume that the processor has died if it has not been modified(written to) since an allowed time (30 sec).



    • The following components will be stored on the lease blob:
      • Job Model (previous and current)
      • Job Model Version (previous and current)
      • Barrier State and Version
      • List of active processors
    • Any new processor will have their job model version as "0UNASSIGNED". 
    • During rebalancing:
      • The leader will:
        • Generate new job model
        • Update job model, job model version, barrier state, list of active processors on the blob
        • Start barrier state listener
      • The worker will:
        • Detect change in job model version
        • Pause container, update its job model
        • Add a new row entry with the updated job model version after getting the new job model
        • Start barrier state listener
    • When barrier is reached:
      • The leader will:
        • Update the state on the blob
        • Stop barrier listener
      • The worker will:
        • Resume container
        • Stop barrier listener
        • Delete its unassigned row entry from the table
    • Polling functions:
      • Leader:
        • Renew lease
        • Check list of live processors
        • Check barrier state from table
      • Worker:
        • Heartbeat (5 sec)
        • Check leader aliveness
        • Check job model version updates
        • Check barrier state from blob
    • It should be noted that the leader is also a worker.
    • When leader dies:
      • Leader election triggered. Every processor tries to become the leader.
      • The elected leader will start rebalancing. 
    • TO DO: In case of any orphaned processors, the processor will detect it and take necessary action.

                             
  • Implement an Azure Client class that provides a client side reference for access to the Azure Storage Account, Azure blob storage and Azure table storage.

  • Implement the checkpointing mechanism with Azure Storage.
  • Integrate all of this with the EventHubSystemProducer and EventHubSystemConsumer.

...