Overview

  • The configuration file is parsed by DatabaseDescriptor (which also defines the defaults for entries that have them).
  • Thrift generates an API interface in Cassandra.java; the implementation is CassandraServer, and CassandraDaemon ties CassandraServer and the Thrift service together.
  • CassandraServer translates client requests into their internal equivalents; StorageProxy then does the actual work for those requests, and CassandraServer returns the results to the client.
  • StorageService handles requests and responses between nodes inside the Cassandra cluster; you can think of it as the cluster-internal counterpart of CassandraDaemon. It turns raw gossip into the corresponding internal state. (Translator's note: StorageProxy's methods ultimately call into StorageService to perform the cluster-internal access.)
  • AbstractReplicationStrategy controls which nodes get the secondary, tertiary, etc. replica of each key range. The node holding the primary replica is determined by the data's Token (among other variables). With a rack-unaware strategy, the replicas of the primary are placed on the next N-1 nodes along the Token ring (see the placement sketch after this list). With a rack-aware strategy, the first replica is placed on a node in a different rack, and the remaining N-2 replicas (assuming N > 2) are placed on the next N-2 nodes along the Token ring in the same rack as the primary.
  • MessagingService handles the internal connection pools and runs internal commands on the appropriate stage (via a multi-threaded ExecutorService). Stages are managed by StageManager; there are read, write, and stream stages, among others. (Streaming is what happens when large amounts of SSTable data have to be moved, e.g. during Cassandra bootstrap or when tokens are reassigned.) (Translator's note: a "stage" here is neither a phase nor a theatrical stage; it is a SEDA concept. Please consult the Cassandra source code or SEDA materials for the details.) The internal commands that run on each stage are defined in StorageService; look for "registerVerbHandlers". A minimal stage-dispatch sketch also follows this list.
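
To make the rack-unaware placement above concrete, here is a minimal, hypothetical sketch (not Cassandra's actual code). The ring is modeled as a sorted map from token to node; the primary replica is the owner of the first token at or after the key's token, and the remaining replicas are the next N-1 nodes, wrapping around the ring.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class RingPlacement {
    // Return the primary replica plus the next n-1 nodes on the token ring.
    static List<String> replicasFor(long keyToken, TreeMap<Long, String> ring, int n) {
        n = Math.min(n, ring.size());          // can't have more replicas than nodes
        List<String> replicas = new ArrayList<>();
        // Primary replica: owner of the first token >= the key's token.
        for (String node : ring.tailMap(keyToken).values()) {
            if (replicas.size() == n) return replicas;
            replicas.add(node);
        }
        // Wrap around to the start of the ring for the remaining replicas.
        for (String node : ring.values()) {
            if (replicas.size() == n) return replicas;
            replicas.add(node);
        }
        return replicas;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> ring = new TreeMap<>();
        ring.put(0L, "A"); ring.put(100L, "B"); ring.put(200L, "C"); ring.put(300L, "D");
        System.out.println(replicasFor(150L, ring, 3)); // [C, D, A]
    }
}
```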
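
The stage mechanism can likewise be sketched as a map from stage to its own thread pool, where dispatching a verb handler just means submitting it to the right pool. The Stages class and the pool sizes below are illustrative assumptions, not the real StageManager API.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Stages {
    enum Stage { READ, WRITE, STREAM }

    // One thread pool per stage, in the spirit of StageManager.
    private final Map<Stage, ExecutorService> stages = new EnumMap<>(Stage.class);

    Stages() {
        stages.put(Stage.READ, Executors.newFixedThreadPool(8));
        stages.put(Stage.WRITE, Executors.newFixedThreadPool(8));
        stages.put(Stage.STREAM, Executors.newFixedThreadPool(2));
    }

    // Dispatch a verb handler onto its stage's executor.
    void execute(Stage stage, Runnable verbHandler) {
        stages.get(stage).execute(verbHandler);
    }
}
```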

Write path

  • StorageProxy gets the nodes responsible for replicas of the keys from the ReplicationStrategy, then sends RowMutation messages to them
    • If nodes are changing position on the ring, the "pending ranges" in TokenMetadata are associated with their destination nodes, and these are also written to
    • If a node that should accept the write is down, but the remaining nodes can satisfy the requested ConsistencyLevel, the write for the down node is sent to another node instead, with a header (a "hint") saying that the data for that key should be forwarded to the replica node when it comes back up. This is called HintedHandoff, and it reduces the "eventual" in "eventual consistency." Note that HintedHandoff is only an optimization; ArchitectureAntiEntropy is responsible for restoring consistency more completely. (A hinted-handoff sketch follows this list.)
  • On the destination node, RowMutationVerbHandler uses Table.apply to hand the write first to CommitLog.java, then to the Memtable for the appropriate ColumnFamily.
  • When a Memtable is full, it gets sorted and written out as an SSTable asynchronously by ColumnFamilyStore.switchMemtable
    • When enough SSTables exist, they are merged by ColumnFamilyStore.doFileCompaction
      • Making this concurrency-safe without blocking writes or reads while we remove the old SSTables from the list and add the new one is tricky, because naive approaches require waiting for all readers of the old SSTables to finish before deleting them (since we can't know if they have actually started opening the file yet; if they have not and we delete the file first, they will error out). The approach we have settled on is to not actually delete old SSTables synchronously; instead we register a phantom reference with the garbage collector, so when no references to the SSTable exist it will be deleted. (We also write a compaction marker to the file system so if the server is restarted before that happens, we clean out the old SSTables at startup time.) A phantom-reference sketch follows this list.
  • See ArchitectureSSTable and ArchitectureCommitLog for more details
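
A rough sketch of the hinted-handoff decision described in the list above, using a made-up Cluster interface in place of Cassandra's real messaging layer: live replicas receive the mutation directly, and writes for down replicas are redirected to live nodes along with a hint naming the intended recipient, provided enough live replicas remain to satisfy the ConsistencyLevel.

```java
import java.util.ArrayList;
import java.util.List;

public class WritePathSketch {
    record Mutation(String key, String value) {}

    // Hypothetical stand-in for Cassandra's messaging layer.
    interface Cluster {
        boolean isAlive(String node);
        void send(String node, Mutation m);                              // normal write
        void sendWithHint(String liveNode, String downNode, Mutation m); // hinted write
    }

    // Send a mutation to all replicas, redirecting writes for down nodes
    // as hints, provided the consistency level can still be met.
    static void write(Cluster cluster, List<String> replicas, Mutation m, int consistencyLevel) {
        List<String> live = new ArrayList<>();
        List<String> down = new ArrayList<>();
        for (String r : replicas) (cluster.isAlive(r) ? live : down).add(r);

        if (live.size() < consistencyLevel)
            throw new IllegalStateException("cannot satisfy ConsistencyLevel " + consistencyLevel);

        for (String r : live) cluster.send(r, m);
        // Each down replica's write goes to some live node, tagged with a hint
        // naming the down replica, to be replayed when that node returns.
        for (int i = 0; i < down.size(); i++)
            cluster.sendWithHint(live.get(i % live.size()), down.get(i), m);
    }
}
```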
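
The deferred-deletion trick from the compaction bullet can be demonstrated with plain java.lang.ref, independent of Cassandra: register a PhantomReference on the in-memory reader object, and delete the file only once the garbage collector enqueues the reference, i.e. once no reader can still be using it. DeferredDeleter and its method names are hypothetical.

```java
import java.io.File;
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DeferredDeleter {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    // Keep the phantom references strongly reachable until they are enqueued.
    private final Map<Reference<?>, File> pending = new ConcurrentHashMap<>();

    // Register a file to be deleted once 'owner' (e.g. an SSTable reader
    // object) becomes unreachable.
    void deleteWhenUnreferenced(Object owner, File file) {
        pending.put(new PhantomReference<>(owner, queue), file);
    }

    // Drain the queue, deleting files whose owners have been collected.
    void processQueue() {
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            File f = pending.remove(ref);
            if (f != null) f.delete();
        }
    }
}
```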

Read path

  • StorageProxy gets the nodes responsible for replicas of the keys from the ReplicationStrategy, then sends read messages to them
    • This may be a SliceFromReadCommand, a SliceByNamesReadCommand, or a RangeSliceReadCommand, depending on the type of query
  • On the data node, ReadVerbHandler gets the data from CFS.getColumnFamily or CFS.getRangeSlice and sends it back as a ReadResponse
    • The row is located by doing a binary search on the index in SSTableReader.getPosition (a sketch of this lookup follows this list)
    • For single-row requests, we use a QueryFilter subclass to pick the data from the Memtable and SSTables that we are looking for. The Memtable read is straightforward. The SSTable read is a little different depending on which kind of request it is:
      • If we are reading a slice of columns, we use the row-level column index to find where to start reading, and deserialize block-at-a-time (where "block" is the group of columns covered by a single index entry) so we can handle the "reversed" case without reading vast amounts into memory
      • If we are reading a group of columns by name, we still use the column index to locate each column, but first we check the row-level bloom filter to see if we need to do anything at all
    • The column readers provide an Iterator interface, so the filter can easily stop when it's done, without reading more columns than necessary
      • Since we need to potentially merge columns from multiple SSTable versions, the reader iterators are combined through a ReducingIterator, which takes an iterator of uncombined columns as input, and yields combined versions as output (a merge sketch follows this list)
  • If a quorum read was requested, StorageProxy waits for a majority of nodes to reply and makes sure the answers match before returning. Otherwise, it returns the data reply as soon as it gets it, and checks the other replies for discrepancies in the background in StorageService.doConsistencyCheck. This is called "read repair," and also helps achieve consistency sooner.
    • As an optimization, StorageProxy only asks the closest replica for the actual data; the other replicas are asked only to compute a hash of the data. (A digest-comparison sketch follows this list.)
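
The index lookup in SSTableReader.getPosition amounts to a binary search over (key, file offset) pairs sorted by key. The IndexEntry record below is a hypothetical stand-in for the real index format.

```java
import java.util.List;

public class IndexSearch {
    record IndexEntry(String key, long position) {}

    // Binary search over index entries sorted by key; returns the row's
    // offset in the data file, or -1 if the key is not present.
    static long getPosition(List<IndexEntry> index, String key) {
        int lo = 0, hi = index.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = index.get(mid).key().compareTo(key);
            if (cmp == 0) return index.get(mid).position();
            if (cmp < 0) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1;
    }
}
```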
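
The ReducingIterator step can be sketched as a heap-based merge of name-sorted column iterators (one per Memtable or SSTable) that collapses same-named columns into one survivor. The Column record, the PeekingSource helper, and the newest-timestamp-wins rule are illustrative simplifications, not the real classes.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class MergeColumns {
    record Column(String name, String value, long timestamp) {}

    // Merge several name-sorted column iterators; for same-named columns,
    // keep only the version with the newest timestamp.
    static Iterator<Column> merged(List<Iterator<Column>> sources) {
        PriorityQueue<PeekingSource> heap =
                new PriorityQueue<>(Comparator.comparing((PeekingSource s) -> s.head.name()));
        for (Iterator<Column> it : sources)
            if (it.hasNext()) heap.add(new PeekingSource(it));

        return new Iterator<>() {
            public boolean hasNext() { return !heap.isEmpty(); }
            public Column next() {
                Column newest = heap.poll().advance(heap);
                // "Reduce": drain every source whose head has the same name.
                while (!heap.isEmpty() && heap.peek().head.name().equals(newest.name())) {
                    Column c = heap.poll().advance(heap);
                    if (c.timestamp() > newest.timestamp()) newest = c;
                }
                return newest;
            }
        };
    }

    // Wraps an iterator so its next element can be inspected for heap ordering.
    private static class PeekingSource {
        final Iterator<Column> it;
        Column head;
        PeekingSource(Iterator<Column> it) { this.it = it; this.head = it.next(); }
        Column advance(PriorityQueue<PeekingSource> heap) {
            Column current = head;
            if (it.hasNext()) { head = it.next(); heap.add(this); }
            return current;
        }
    }
}
```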
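
Finally, the digest-read optimization reduces to hashing the full row returned by the closest replica and comparing it against the digests from the other replicas; a mismatch means the replicas disagree and the full data must be fetched and repaired. The choice of MD5 below is an assumption for illustration.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class DigestCheck {
    // Compare the full row from the closest replica against the hash
    // returned by another replica; false means a repair is needed.
    static boolean matches(byte[] rowFromClosest, byte[] digestFromOther)
            throws NoSuchAlgorithmException {
        byte[] local = MessageDigest.getInstance("MD5").digest(rowFromClosest);
        return Arrays.equals(local, digestFromOther);
    }
}
```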

Deletes

Gossip

Failure detection

Further reading
