ID: IEP-141
Author:
Sponsor:
Created:
Status: DRAFT


Motivation

Modern applications increasingly rely on vector representations of data (embeddings) to enable semantic similarity search and other AI-driven workloads: semantic search, recommendation systems, anomaly detection, and retrieval-augmented generation (RAG). Apache Ignite currently lacks native support for vector indexing and approximate nearest neighbor (ANN) search. By adding vector capabilities, Ignite can provide applications with scalable, transactionally consistent storage for both structured and vector data. This will help users:

  1. eliminate the need to synchronize multiple databases;
  2. optimize workloads by relying on Ignite's memory-first architecture;
  3. optimize workloads by using hybrid queries, where filtering by metadata (structured data) and ranking by vector happen in a single execution layer;
  4. combine AI workloads (semantic search, recommendations) with other application workloads.

Description

Ignite is typically used for read-heavy workloads, and the vector index and search implementation must follow this pattern. The requirements are:

  1. Low latency for read operations.
  2. Index is partitioned (as Ignite's data is partitioned).
  3. ACID guarantees for data and vectors.  

Data type

Vectors are arrays of floats. There are open questions about how to handle:

  1. Different float types (precisions): float32, float16, int8, bit.
  2. Vector dimension, which must be the same within a single index.

Proposed path:

  1. The user specifies a float[] column in a QueryEntity.
  2. The user defines a VectorIndex that describes the vector data type and dimension.
    1. Ignite provides transformers from float to the required type.
    2. Ignite validates the dimension when a vector is inserted into the index and rejects the insertion if the dimension differs.
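The validation and transformation steps above could look roughly like the following. This is a minimal sketch, not a proposed API: the class VectorTransformSketch and its methods are hypothetical, the INT8 conversion assumes components are already normalized to [-1, 1], and the binary conversion assumes a simple sign-based rule.

```java
/** Hypothetical helper, not part of the Ignite API: checks and converts a float[] value before indexing. */
final class VectorTransformSketch {
    /** Rejects vectors whose length differs from the dimension configured for the index. */
    static void validateDimension(float[] vec, int expDim) {
        if (vec == null || vec.length != expDim) {
            throw new IllegalArgumentException("Vector dimension mismatch [expected=" + expDim +
                ", actual=" + (vec == null ? "null" : vec.length) + ']');
        }
    }

    /** Scalar quantization to INT8. Assumption: components are normalized to [-1, 1]; others are clamped. */
    static byte[] toInt8(float[] vec) {
        byte[] res = new byte[vec.length];

        for (int i = 0; i < vec.length; i++) {
            float clamped = Math.max(-1f, Math.min(1f, vec[i]));

            res[i] = (byte)Math.round(clamped * 127f);
        }

        return res;
    }

    /** Binary quantization: one bit per component, set when the component is positive. */
    static byte[] toBinary(float[] vec) {
        byte[] res = new byte[(vec.length + 7) / 8];

        for (int i = 0; i < vec.length; i++) {
            if (vec[i] > 0f)
                res[i / 8] |= (byte)(1 << (i % 8));
        }

        return res;
    }
}
```

Whether quantization parameters should be fixed, configured per index, or learned from the data is left open here.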


QueryIndex
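public final class VectorIndex extends QueryIndex {
	/** Dimension. */
	private final int dimension;

	/** Data type. */
	private final VectorDataType type;

	/** QueryIndexType.VECTOR is a new index type proposed by this IEP. */
	public VectorIndex(String field, int dimension, VectorDataType type) {
		super(field, QueryIndexType.VECTOR);

		this.dimension = dimension;
		this.type = type;
	}
}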


public enum VectorDataType {
    FP32, FP16, INT8, BINARY // ...
}
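To make the trade-off between data types concrete, the sketch below estimates the raw payload size of one stored vector for each candidate type. The enum VectorStorageSketch is hypothetical and only mirrors the values listed above. It counts vector components only and ignores any index structure or page overhead.

```java
/** Hypothetical sketch: raw payload size per stored vector for each candidate data type. */
enum VectorStorageSketch {
    FP32(32), FP16(16), INT8(8), BINARY(1);

    /** Number of bits used to store one vector component. */
    private final int bitsPerComponent;

    VectorStorageSketch(int bitsPerComponent) {
        this.bitsPerComponent = bitsPerComponent;
    }

    /** Bytes needed for one vector of the given dimension, rounded up to a whole byte. */
    long bytesPerVector(int dimension) {
        return ((long)dimension * bitsPerComponent + 7) / 8;
    }

    public static void main(String[] args) {
        int dim = 1024;

        for (VectorStorageSketch t : values())
            System.out.println(t + ": " + t.bytesPerVector(dim) + " bytes per vector at dimension " + dim);
    }
}
```

At dimension 1024, for example, one million FP32 vectors take about 4.1 GB of raw payload, while the same vectors in binary form take 128 MB.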




Open questions

  1. Vectors usually have between 512 and 4096 dimensions (fixed for an index).
    1. How to store them efficiently in memory and on disk?
    2. What is an algorithm for evicting data to disk?
  2. How to link a vector to a corresponding cache key?
    1. How to implement pre- and post-filtering of data?
  3. How to provide ACID guarantees?
  4. Which algorithm to use for ANN: HNSW, IVF, PQ? All of them?
  5. Should we implement two-layer (fast + slow) index architecture?
  6. Should we implement ANN from scratch or integrate an existing implementation (Java library or JNI)?
  7. API for creating and selecting index (SQL, CacheQuery).
  8. How to rebalance index efficiently?
  9. Which metrics to expose to users?
  10. Which index configuration parameters to provide?
  11. How to reduce results from multiple shards?
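For the last question, one possible starting point is a plain top-k merge on the reducer: each partition returns its local nearest candidates and the reducer keeps the k globally closest. The sketch below is illustrative only. The names TopKReduceSketch and Hit are hypothetical, and it assumes that every partition reports distances computed with the same metric, where a smaller distance means a closer match.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical reducer sketch: merges per-partition ANN results into a global top-k. */
final class TopKReduceSketch {
    /** A candidate: cache key plus distance to the query vector (smaller is closer). */
    static final class Hit {
        final Object key;
        final float dist;

        Hit(Object key, float dist) {
            this.key = key;
            this.dist = dist;
        }
    }

    /** Returns the k closest hits across all partitions, ordered by ascending distance. */
    static List<Hit> reduce(List<List<Hit>> perPartition, int k) {
        if (k <= 0)
            return new ArrayList<>();

        // Max-heap by distance: the head is the worst of the best k hits seen so far.
        PriorityQueue<Hit> heap = new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.dist).reversed());

        for (List<Hit> part : perPartition) {
            for (Hit h : part) {
                if (heap.size() < k)
                    heap.add(h);
                else if (h.dist < heap.peek().dist) {
                    heap.poll();
                    heap.add(h);
                }
            }
        }

        List<Hit> res = new ArrayList<>(heap);

        res.sort(Comparator.comparingDouble((Hit h) -> h.dist));

        return res;
    }
}
```

For the merged result to match what a single global index would return, each partition has to return at least k candidates. With post-filtering, a partition may need to return more than k, which ties this question to the filtering question above.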

Risks and Assumptions

// Describe project risks, such as API or binary compatibility issues, major protocol changes, etc.

Discussion Links

// Links to discussions on the devlist, if applicable.

Reference Links

// Links to various reference documents, if applicable.

Tickets

// Links or report with relevant JIRA tickets.
