Top Level Packages
Current as of: v2.1 beta 2
Cassandra is written in Java, which uses packages to organize related functionality at the source code level. Understanding scope of individual packages can give insights into overall architecture.
Interpackage structure
Packages can be arranged in a sort of a stack representing their relationship to each other and how they fit into a whole.
(An editable .pptx source of this diagram is packageDiagram.pptx.)
The primary stack group includes packages that can be easily arranged in layers. The lower layers are closer to the operating system and hardware (more specifically, to their Java abstractions), the middle layers form the storage and distribution engine. The top layers are various interfaces exposing the service to external clients. A couple of omnipresent classes span all layers, and are deeply involved in all of them.
In terms of colors, red packages form the database substack. They are mainly responsible for handling data stored on the local node. Blue packages are the networking substack. These form what is commonly called the “Dynamo layer”. Their main responsibility is to distribute work and data among nodes forming the cluster. Even though these Dynamo packages are shown on the same level as the DB stack, most of them are aware of and depend on database classes. Green packages contain external-facing interfaces and tools through which a C* node communicates with the outside world. service
and config
packages get their own colors, because they are omnipackages that don’t fit into any single category.
Besides the primary stack, there are some support packages that don’t quite fit into the layer structure. They are best thought of as helper libraries that the primary stack calls into for certain tasks. Core support packages form a set of helpers that are essential to C* operation. Optional packages are responsible for nonessential features that are not always leveraged or enabled depending on cluster configuration. Clientside packages never run in the C* cluster at all; they are specialized client libraries or drivers that ship with Cassandra for convenience.
Finally, subordinate packages contain specialized helpers, that are really only used by a single top-level package. So they are best thought of as belonging to some other package, even though they appear on the top level for historical reasons.
Following is the list of all packages with along with a short description.
io
This large package talks to the file system on behalf of C*. The bulk of this work consists of creating and using SSTables
, which is the format that C* uses to store data. Other responsibilities include on-disk compression as well as some general-purpose I/O functionality for other code to use, including facilities for custom memory management. The main class here is SSTable
, which represents an abstract persistent container of sorted data. SSTableWriter
and SSTableReader
are derived from SSTable
and expose additional read and write functionality, and are used quite extensively by other packages (primarily db
).
db
This huge package takes up almost a quarter of the entire codebase and implements the database engine. It operates in familiar database terms including Cells
, Rows
, ColumnFamilies
(tables) and Keyspaces
(databases). db
heavily relies on io.SSTables
for data persistence, but also reaches into many other packages for various tasks. Internally, db
can be broken down into its own sublayers and also contains multiple subpackages for things such as data marshalling, commit log management, storage compaction and others. Overall, db
is large and complicated enough to deserve an architectural study of its own.
serializers
This small package is subordinate to db
, and contains utility methods for converting primitive types to byte buffers.
notifications
This tiny package contains interfaces and classes that allow other code to hook certain internal db
events with custom code. It can be used for things such as unit tests, but may also be hooked into with other external functionality.
cache
cache
is a smaller package containing primitives used to implement key and row caching. The class that actually orchestrates all that caching activity (CacheService
) lives in service, not cache
. cache
is primarily used for the benefit of db
, but db
almost never uses cache
directly, instead proxying through service.CacheService
. Without this extra indirection, cache
could easily be structured as a subpackage under db
.
cql3
Read and write APIs provided by db
are difficult to use directly, so C* provides a query language for easier access to the underlying data–Cassandra Query Language or CQL. The language is implemented in its own package cql3
(the third major release of the language). cql3
defines the language grammar and implements the QueryProcessor
as well as all the Statements
and related functionality available in CQL. It is interesting that cql3
is not a purely externally facing API; some internal code actually leverages it to store and retrieve system state information. In that respect, CQL is becoming a core component.
net
This package implements MessagingService
, which abstracts away most networking machinery from the rest of the codebase. Other packages can then set up communication protocols represented by different Verbs
and send custom Messages
carrying those verbs to remote nodes. Verbs
are handled on the receiving side with VerbHandlers
. net
provides only base types and functionality common to all messages; specialized implementation live in various other packages.
sink
Used by net
and service
, this tiny package is used to hook into messaging events. This is primarily useful for unit tests.
security
This tiny package, currently containing only one class SSLFactory
, is used to encrypt communication over the network.
gms
gms
(possibly standing for Gossip Message Service) implements the Gossiper
. Gossiper
is a peer-to-peer service that deals with disseminating cluster state information among member nodes. Gossiping consists of detecting unresponsive nodes using heart beat messages, and sharing liveness data among peers.
locator
locator
is responsible for two separate tasks. One is discovering cluster topology through a pluggable component called Snitch
. A few snitches are available out of the box, and some dynamic implementations heavily rely on Gossiper
to detect up-to-date cluster topology. The second responsibility is deciding how to optimally distribute replicas based on discovered topology (handled by a class called ReplicationStrategy
and its subclasses).
streaming
Another core networking package, streaming
is responsible for moving bulk data between the nodes in the cluster.
repair
This smaller package deals with running RepairSessions
, which redistribute data after a change in the cluster, or when corruption is detected in one of the existing nodes. Repair events are just one example where streaming
is used.
service
service
, although not the largest package, can be thought of as the skeleton upon which all other functionality builds. service
consists of an executable class CassandraDaemon
(this class contains the main()
function of the C* daemon), along with a set of core services, including StorageProxy
and StorageService
. A lot of, if not most, of inter-package communication within the codebase is brokered through one of those two. StorageService
is more involved in orchestrating Dynamo-level activities in the cluster, whereas StorageProxy
is more focused on handling data transfer.
service
is critical to most other modules, many of which are free to call into it from arbitrary places. A traditional weakness of such omnipresent uberpackages is that they attract all sort of miscellaneous functionality that doesn’t seem to belong anywhere else, and service
is no exception: expect to see a lot of random bits and pieces residing here.
config
Another omnipackage seemingly accessible from anywhere, config
is a repository for configurable settings as well as a static entry point into the data store (through a class Schema
which contains a reference to all keyspaces residing in the local cluster).
transport
This is one of external API providers. transport
implements a Server that listens for connecting clients that want to use C* Native protocol, which as of Cassandra 2.0 is the primary communication protocol both for external clients and within the cluster.
thrift
A logical peer to transport, this package contains a server implementation for Thrift-based communication. This functionality may be deprecated in future versions, but for now it remains fairly popular in legacy deployments.
scheduler
A small package that is used by the Thrift server to schedule incoming requests to attempt a level of QoS.
tools
This package contains the implementation of several administrative utilities shipped with C*, including the Node
tool, tools for import and export, as well as several utilities for SSTable
maintenance. All these tools are available under /bin
in a typical distribution.
dht
dht
(Distributed Hash Table) is a core support class that is responsible for partitioning data among the nodes in the cluster. It contains several pluggable implementations of AbstractPartitioner
class that handles the mechanics of data partitioning. In addition it defines Range
and Token
, which are primitives used by other packages to work with partition key ranges.
utils
This hefty package is a grab bag of miscellaneous classes typical to any software project, the proverbial "other" section. It is not the best place to look for architectural pillars, but it contains some clever code that is partially responsible for Cassandra’s impressive perf and reliability, including implementations of BloomFilter
and MerkleTree
among others.
concurrent
concurrent
deals with threading and thread pools. Interestingly, the few custom concurrency primitives that C* uses belong to a level 2 package under utils
, and not to this package.
exceptions
This tiny package consists of a set of custom exception classes used in C*. It is not comprehensive: many custom exceptions belong to other packages whence they are thrown.
triggers
This package implements support for triggers, which are optional user-defined actions invoked during writes. Trigger support is not as comprehensive as some of the relational database engines, so the package is quite small.
tracing
tracing
implements support for request tracing, whereupon some or all requests to Cassandra will cause verbose logging to be output for the purposes of debugging or performance tuning.
metrics
This package allows collecting quantitative data about various aspects of C* operation. Metric data can be accessed through Node
tool that ships with C*. Like tracing, this capability can be important for operational maintenance and troubleshooting.
auth
auth
enables support for authentication and authorization, providing a measure of access control for C* service.
cli
cli
implements a Command Line Interface client for interacting with a C* cluster from a remote node.
hadoop
hadoop
exposes C* in terms of MapReduce/Pig primitives, allowing integration with Hadoop clients. This package is expected to be used as an adapter in external Hadoop applications; there is no code here that actually runs server side.
client
A tiny package that provides helper functionality to code that runs against C* on the client side. Only hadoop
uses client
out of the box.
Packages listed by size
(Mar 2014)
db |
1.5M |
cql3 |
650K |
service |
570K |
io |
530K |
utils |
500K |
hadoop |
300K |
tools |
280K |
config |
230K |
streaming |
190K |
cli |
190K |
thrift |
180K |
transport |
180K |
locator |
140K |
gms |
130K |
dht |
100K |
net |
100K |
repair |
85K |
auth |
82K |
metrics |
74K |
cache |
55K |
serializers |
51K |
concurrent |
37K |
exceptions |
25K |
tracing |
15K |
scheduler |
13K |
triggers |
12K |
notifications |
8.0K |
security |
5.8K |
sink |
5.6K |
client |
4.4K |
This page was originally adapted from here.