Document the state by adding a label to the FIP page with one of "discussion", "accepted", "released", "rejected".

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Lance is a powerful table format designed for performant AI workloads. This proposal aims to support tiering Fluss data to Lance, enabling integration between Fluss and the multimodal AI data lake ecosystem.

Public Interfaces

  • Add lance to DataLakeFormat
  • Introduce a fluss-lake-lance module
  • Introduce LanceLakeStorage into the fluss-lake-lance module

Proposed Changes

Bucketing Function

Since Lance datasets do not have a bucketing or partitioning concept, we simply use the default Fluss bucketing implementation during data tiering.

Please refer to this thread for more information about partitioning and bucketing support in Lance: https://github.com/lancedb/lance/discussions/4125
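As a rough illustration, the default bucketing can be thought of as hashing the serialized bucket key onto a fixed number of buckets. The sketch below is a hypothetical stand-in, not Fluss's actual hash implementation:

```python
import hashlib

def assign_bucket(bucket_key: bytes, num_buckets: int) -> int:
    """Illustrative stand-in for Fluss's default bucketing: hash the
    serialized bucket key and map it onto [0, num_buckets)."""
    digest = hashlib.md5(bucket_key).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets

# Records with the same bucket key always land in the same bucket.
assert assign_bucket(b"user-42", 8) == assign_bucket(b"user-42", 8)
```

Because Lance itself has no bucket awareness, this assignment only determines which tiering writer handles a record; the bucket id is then materialized in the __bucket system column described below.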

LanceLakeCatalog

Since Lance datasets do not have a bucketing or partitioning concept, we simply ignore the bucketing and partitioning properties of the Fluss table when creating the corresponding Lance dataset.


As of Lance v0.30.0, primary key functionality remains under active development. Additionally, the Java API currently supports only append operations; UPSERT and DELETE APIs are still in development. Consequently, Fluss v0.8 exclusively supports tiering log tables. During primary key table creation, if table.datalake.enabled is true, the system throws an unsupported exception.
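The rejection described above can be sketched as a simple validation step at table-creation time (the function name and error message are illustrative, not the actual Fluss code):

```python
def validate_lake_table(has_primary_key: bool, datalake_enabled: bool) -> None:
    """Illustrative check: Lance tiering currently supports log tables only,
    so primary key tables must not enable the data lake."""
    if has_primary_key and datalake_enabled:
        raise NotImplementedError(
            "table.datalake.enabled is not supported for primary key tables "
            "when the lake format is Lance")

validate_lake_table(has_primary_key=False, datalake_enabled=True)  # log table: OK
```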

Refer to the GitHub issues listed below for further details:

The dataset created in Lance requires three system columns appended at the end of the table: __bucket (int), __offset (bigint), and __timestamp (timestamp_ltz).
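A minimal sketch of how the tiered schema is derived, with the three system columns appended after the user columns (types are shown as plain strings for illustration; the real implementation would build an Arrow schema):

```python
# System columns appended to every tiered Lance dataset, per the proposal.
SYSTEM_COLUMNS = [
    ("__bucket", "int"),
    ("__offset", "bigint"),
    ("__timestamp", "timestamp_ltz"),
]

def lance_schema(fluss_columns):
    """User columns first, system columns last."""
    return list(fluss_columns) + SYSTEM_COLUMNS

schema = lance_schema([("id", "bigint"), ("body", "bytes")])
assert [name for name, _ in schema][-3:] == ["__bucket", "__offset", "__timestamp"]
```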

Special Data Types

Vector embeddings and large binary data are two core data types in AI. Vector embeddings convert words, sentences, and other data into numbers that capture their meaning and relationships. Unstructured large multimodal data, like video, is typically stored as large binary data. Currently, Fluss supports neither of these data types.


For large binary data, we should set the corresponding column's encoding strategy to blob. This allows Lance users to directly use the blob API: https://lancedb.github.io/lance/blob.html.


Note: the Arrow large_binary data type uses 64-bit offsets, so it can represent values larger than 2 GB.
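Following the Lance blob documentation linked above, a blob column is a large_binary field carrying the lance-encoding:blob metadata flag. A schema-as-dict sketch (the field-builder helper is illustrative; in practice this would be an Arrow field with that metadata):

```python
def blob_field(name: str) -> dict:
    """Illustrative blob column definition, per Lance's blob docs:
    an Arrow large_binary field with metadata lance-encoding:blob=true."""
    return {
        "name": name,
        "type": "large_binary",  # 64-bit offsets: values may exceed 2 GB
        "metadata": {"lance-encoding:blob": "true"},
    }

field = blob_field("video")
assert field["metadata"]["lance-encoding:blob"] == "true"
```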

LanceLakeTieringFactory

LanceLakeWriter

LanceLakeWriter writes data to Lance, which is straightforward; we can refer to com.lancedb.lance.spark.write.LanceArrowWriter in the Lance-Spark repository.

For log tables, LanceLakeWriter leverages the Lance Fragment API to append data. The size of a fragment is controlled by the configuration option lance.batch_size.
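The batching behavior can be sketched as chunking the buffered rows into groups of at most lance.batch_size rows, each of which would become one fragment (the helper below is illustrative, not the actual writer code):

```python
def split_into_fragments(rows, batch_size):
    """Illustrative chunking: each sublist would be written as one
    Lance fragment, with size bounded by lance.batch_size."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

fragments = split_into_fragments(list(range(10)), batch_size=4)
assert [len(f) for f in fragments] == [4, 4, 2]
```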

LanceLakeCommitter

The Lance lake committer commits the fragments written by the writers to the Lance dataset.

Note that a Lance dataset is initialized at version 1 upon creation. Whenever the lake committer requires the lake snapshot version, we return (lance_dataset_version - 1).
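A trivial sketch of that version mapping (the subtraction implies the first, empty snapshot is exposed as version 0):

```python
def lake_snapshot_version(lance_dataset_version: int) -> int:
    """A freshly created Lance dataset is at version 1, so the committer
    reports (lance_dataset_version - 1) as the lake snapshot version."""
    return lance_dataset_version - 1

assert lake_snapshot_version(1) == 0  # just-created, empty dataset
assert lake_snapshot_version(5) == 4
```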

When the lake committer needs to determine the bucket end offsets of a committed lake snapshot, it must reconstruct this information by reading the artificial bucket and offset columns across the entire dataset, because the Lance format lacks bucketing and partitioning support. Lance will support storing custom properties in a snapshot: https://github.com/lancedb/lance/pull/4078#discussion_r2165688242. With that feature, we can record the bucket end offsets in the snapshot information and save the read cost.
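The reconstruction can be sketched as a scan over the __bucket and __offset system columns, keeping the maximum offset seen per bucket (an illustrative helper, not the committer's actual code):

```python
def bucket_end_offsets(rows):
    """rows: iterable of (__bucket, __offset) pairs read from the
    Lance dataset. Returns {bucket: max offset} per bucket."""
    end = {}
    for bucket, offset in rows:
        if bucket not in end or offset > end[bucket]:
            end[bucket] = offset
    return end

offsets = bucket_end_offsets([(0, 10), (1, 7), (0, 12), (1, 3)])
assert offsets == {0: 12, 1: 7}
```

Recording these offsets directly in the snapshot properties, once Lance supports them, would turn this full scan into a metadata lookup.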

Compatibility, Deprecation, and Migration Plan

n/a

Test Plan

Integration tests (IT).

Rejected Alternatives