Status

Current state: Adopted

Discussion thread: intro, cep proposal, demo

JIRACASSANDRA-18557

Released: 5.0.0

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).



Scope

The scope of this CEP is to add an extension to Storage Attached Index(CEP-7) to enable Vector Search natively in Cassandra.

Goals

  1. Implement approximate nearest neighbor (ANN) vector search capability in Apache Cassandra using storage-attached indexes (SAI).
  2. Support a vector of float32 embeddings as a new CQL type.
  3. Add ANN search to work with normal Cassandra data flow (insertion, updating, and deleting rows). The implementation should support adding a new vector in log(N) time, and ANN queries in M log(N) time where N is the number of vectors and M is the number of sstables.
  4. Compose with other SAI predicates.
  5. Scatter/gather across replicas, combining topK from each to get global topK.
  6. Enable Apache Cassandra to be the Vector Search component in ML platforms, and intuitive to use for Data Engineers new to Cassandra.

Non-Goals

  1. Supporting other data types besides float32.
  2. Search capabilities beyond ANN search; in particular, we are not attempting to build exact KNN search.
  3. Composability with old 2i or queries that involve ALLOW FILTERING.

Approach

The approach to implementing approximate nearest neighbor (ANN) vector search in Cassandra involves the following steps:

  1. Integrate Lucene's HNSW: The implementation will leverage Lucene's Hierarchical Navigable Small World (HNSW) library, which is the best ANN algorithm for Java and currently GA. HNSW provides a fast and efficient solution for finding approximate nearest neighbors in high-dimensional space, making it an ideal choice for this purpose.
  2. Create a new data type: To support the storage of high-dimensional vectors, a new data type will be introduced. This new data type will be represented internally as <TBD> and will enable the handling and storage of Float32 embeddings.
  3. Develop a new SAI index: A new storage-attached index (SAI) called VectorMemtableIndex will be developed to accommodate the ANN search functionality. This index will work in conjunction with the new data type and the HNSW library to enable efficient vector search capabilities within Apache Cassandra.
  4. Introduce a new CQL operator: A new CQL operator, ANN, will be introduced to perform ANN search queries using the newly implemented vector search functionality. This operator will allow users to easily perform ANN searches on their data with a simple and familiar query syntax.
  5. Distributed query:
    1. Coordinator queries all required replicas at once with the requested token range instead of vnode range to reduce round trips and num of local ANN searches;
      1. Required replicas are determined by requested token range and consistency level, just like normal range requests.
      2. Each replica will be queried with requested token range instead of vnode range to reduce number of round trips and num of local ANN searches.
      3. Coordinator will block for all contacted replicas to respond. This can cause memory issue in a large cluster.
    2. Replica collects rows with top-k scores and all tombstones based on local index results;
      1. Every sstable index produces top-k matches and SAI will union them to create result in primary key order.
      2. The processor uses PriorityQueue to store top-k rows and discard rows with lower scores. Note that it needs to collect all tombstones as well, it may hit TombstoneOverwhelmingException if there are too many tombstones.
      3. The processor will return top-k rows in primary order to coordinator
    3. Coordinator collects rows with top-k scores and discards rows with lower scores based on reconciled rows from replica responses.
      1. Data resolver reconciles responses from different replicas in primary key order. Short read protection, replica filtering protection and read repair are all disabled because mismatches may be caused by top-k rather than actual data inconsistency.
      2. The processor uses PriorityQueue to store top-k rows and discard rows with lower scores
      3. The processor will return top-k rows in primary order to client

About HNSW + Lucene

The Lucene implementation of HNSW (Hierarchical Navigable Small World) provides an efficient approximate nearest neighbor search for high-dimensional vectors in the context of the Lucene search library. At a high level, the Lucene implementation uses a Navigable Small-World graph, which is usually hierarchical but the current implementation has only a single layer. The decision to exclude the hierarchical structure was partly influenced by a paper claiming that there was little benefit from the hierarchy for high-dimensional vectors. However, later discussions noted that the implementation might benefit from the hierarchy, and it is worth considering adding the hierarchical layers to improve search performance and determinism later.

During the indexing process, each vector is added to the graph, connecting it to its nearest neighbors within the graph. The connections are determined using a distance metric (cosine similarity) that measures the similarity between vectors. The graph allows for efficient searches by navigating through the connections between nodes (vectors), moving closer to the query vector at each step. 

To retrieve data, the HNSW algorithm navigates through the connections between nodes (vectors) in the graph, moving closer to the query vector at each step. It exploits the graph structure to efficiently locate the most similar vectors to a given query vector.

Lucene generally works by creating an index that maps each word to its frequency count in documents, using term frequency and inverse document frequency. However, regarding HNSW implementation for high-dimensional vectors, Lucene focuses on constructing a navigable small-world graph to provide an efficient approximate nearest neighbor search rather than relying on the traditional term frequency-inverse document frequency approach.

Timeline

Vector Search will target the 5.0 release in conjunction with the release of SAI. 

  1. Add the new vector data type to trunk (already in progress at CASSANDRA-18504)
  2. Add vector search based on Lucene HNSW to a fork of the SAI feature branch
  3. Merge Vector Search feature branch into the main SAI feature branch
  4. Merge combined feature branches into 5.0 trunk and begin integration testing

Mailing list / Slack channels

ML Discussion threads: 

Related JIRA tickets

JIRA(s): 




Motivation

Vector search is a powerful technique for finding relevant content within large document collections and is particularly useful for AI applications. Adding vector search to Apache Cassandra will not only enhance its functionality but also attract new users from the AI space, expanding its reach and relevance.

A secondary motivation is to leverage the modularity of Storage-Attached Indexes(SAI) with new functionality. A large body of work has refactored how indexing works within Cassandra, as documented in CEP-7. The ability to quickly adopt new indexing techniques without significant code changes will help Cassandra adapt to the needs of developers as they evolve. This CEP will be the first extension to validate the extensibility of SAI.

Audience

This enhancement proposal targets existing Apache Cassandra users, AI developers, and organizations managing large datasets that can benefit from fast similarity search.

Proposed Changes

  1. Integrate Lucene's HNSW with the SAI framework.
  2. Create a new data type, VECTOR<type, dimension>.
  3. Add new functions cosine_similarity, dot_product_similarity, and euclidean_similarity.
  4. Develop a new SAI index, VectorMemtableIndex.
  5. Introduce a new CQL operator: ANN OF.

New or Changed Public Interfaces

The primary change is in CQL with a new VECTOR type and a new CQL operator, ANN. Storage Attached Indexing(SAI) will be extended to include a new indexing scheme, ‘ann_index,’ which will only index the VECTOR type. 

The following is an example of real-world usage with the new syntax in bold and italics. 


CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE test.foo(
i INT PRIMARY KEY,
j VECTOR<float, 3>
); 

CREATE CUSTOM INDEX ann_index ON foo(j) USING 'VectorMemtableIndex';

INSERT INTO test.foo (i, j) VALUES (1, [8, 2.3, 58]);
INSERT INTO test.foo (i, j) VALUES (2, [1.2, 3.4, 5.6]);
INSERT INTO test.foo (i, j) VALUES (5, [23, 18, 3.9]); 

SELECT * FROM test.foo WHERE j ANN OF [3.4, 7.8, 9.1] limit 1;

i  |j
---+---------------------------------------------------------
5  |[23, 18, 3.9] 

Compatibility, Deprecation, and Migration Plan

The proposed changes should maintain compatibility with existing Cassandra functionality. Further testing and evaluation will be needed to ensure the new features work smoothly with current setups.

Test Plan

  1. Verify that ANN search works with normal Cassandra data flow (insertion, updating, and deleting rows).
  2. Test the integration of Lucene's HNSW with the SAI framework.
    1. Verify correctness of ANN results across a wide range of dimensionality
    2. Verify single-partition search and validate ANN results
    3. Verify cross-partition search and validate ANN results
    4. Simulate corrupted stored data vs index data on disk
    5. Pass all existing SAI testing plans
  3. Test the new data type (VECTOR<type, dimension>) and CQL operator (ANN) with various use cases.
  4. Evaluate the performance of the new features and their impact on existing Cassandra setups.


It might also be beneficial to test against known ANN implementations using the ANN testing framework: https://github.com/erikbern/ann-benchmarks

Rejected Alternatives

  • Creating a different search engine outside of SAI specifically for ANN search. This would be prohibitively expensive and would compose less well with queries containing non-vector predicates.
  • Using a different tech instead of Lucene’s HNSW (faiss is a frequent suggestion).  Faiss is great technology, but there is no first-class support for Java bindings, only 3rd party over JNI, and most of its algorithms are optimized for batch processing.  Lucene’s HNSW allows incremental index updates, is already used in production by Elastic and Solr, and since we already plan to use Lucene for numeric SAI indexes it is a low-impact dependency to add.