IDIEP-54

Author:
Sponsor:
Created:
Status: DRAFT

Motivation

The way Ignite works with data schemas is inconsistent:

  • The binary protocol creates schemas for anything that is serialized. These schemas are updated implicitly – the user doesn't have any control over them.
  • The SQL engine has its own schema, separate from the binary schema, even though SQL runs on top of binary objects. The SQL schema is created and updated explicitly by the user.
  • Caches themselves are basically schemaless – you're allowed to store multiple versions of multiple data types in a single cache.

This creates multiple usability issues:

  • The SQL schema can be inconsistent or even incompatible with the binary schema. If either of them is updated, the other is not affected.
  • SQL can't be used by default. The user has to explicitly create the SQL schema, listing all fields and making sure the list is consistent with the content of binary objects.
  • Binary schemas are decoupled from caches. So, if a cache is destroyed, the binary schema is not removed.
  • Etc.

Description

The general idea is to have a one-to-one mapping between data schemas and caches/tables. Every cache has a single unified schema, which is applied both to the data storage itself and to SQL.

When a cache is created, it is configured with a corresponding data schema. There must be an API and a tool to see the current version of the schema for any cache, as well as make updates to it. Schema updates are applied dynamically without downtime.

DDL should work on top of this API, providing similar functionality; e.g., a CREATE TABLE invocation translates to the creation of a cache with the schema described in the statement.

Anything stored in a cache/table must be compliant with the current schema. An attempt to store incompatible data should fail.

The binary protocol should be used only as the data storage format. All serialization that happens for communication only should be performed by a different protocol. The data storage format will be coupled with the schemas, while the communication format is independent of them. As a bonus, this will likely allow for multiple optimizations on both sides, as the serialization protocols will become more narrowly purposed.

BinaryObject API should be reworked, as it will not represent actual serialized objects anymore. It should be replaced with something like BinaryRecord or DataRecord representing a record in a cache or table. Similar to the current binary objects, records will provide access to individual fields. A record can also be deserialized into a class with any subset of fields represented in the record.
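
For illustration, here is a minimal sketch of what such a record API might look like. The interface name (DataRecord), the accessor methods, and the toObject mapping method are assumptions made for this example, not the final API:

    // Hypothetical record-access API; all names here are illustrative assumptions.
    public interface DataRecord {
        /** Version of the schema this record was written with. */
        int schemaVersion();

        /** Typed access to individual columns by name. */
        long longValue(String columnName);
        String stringValue(String columnName);
        Object value(String columnName);

        /**
         * Deserializes the record into a user class; the class may declare
         * any subset of the columns present in the record.
         */
        <T> T toObject(Class<T> targetClass);
    }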

Schema Definition API

There are several ways a schema can be defined. The initial entry point to schema definition is the SchemaBuilder Java API:

TBD (see SchemaBuilders class for details)

The schema builder calls are transparently mapped to DDL statements so that all operations possible via a builder are also possible via DDL and vice versa.
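
For illustration, a hedged sketch of how a table schema might be defined through the builder, together with the DDL it would map to. The class and method names (SchemaBuilders, tableBuilder, column, withPrimaryKey) are assumptions based on the intent described above, not the final API:

    // Illustrative sketch only; builder class and method names are assumptions.
    SchemaTable personTable = SchemaBuilders.tableBuilder("PUBLIC", "Person")
        .columns(
            SchemaBuilders.column("ID", ColumnType.INT64).build(),
            SchemaBuilders.column("NAME", ColumnType.string()).asNullable().build(),
            SchemaBuilders.column("SALARY", ColumnType.DOUBLE).asNullable().build())
        .withPrimaryKey("ID")
        .build();

    // Roughly the DDL statement the builder calls above would map to:
    // CREATE TABLE PUBLIC.Person (
    //     ID     BIGINT PRIMARY KEY,
    //     NAME   VARCHAR,
    //     SALARY DOUBLE
    // )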

Data restrictions

The schema-first approach imposes certain natural requirements that are stricter than those of the binary object serialization format:

  • The column type must be one of a predefined set of available 'primitives' (including strings, UUIDs, and date & time values).
  • Arbitrary nested objects and collections are not allowed as column values. Nested POJOs should either be inlined into the schema or stored as BLOBs.
  • Date & time values should be compressed while preserving natural ordering, and decompression should be a trivial operation (such as applying a bitmask).

The suggested set of supported built-in data types is shown in the table below:

Type | Size | Description
Bitmask(n) | n/8 bytes | A fixed-length bitmask of n bits
Int8 | 1 byte | 1-byte signed integer
Uint8 | 1 byte | 1-byte unsigned integer
Int16 | 2 bytes | 2-byte signed integer
Uint16 | 2 bytes | 2-byte unsigned integer
Int32 | 4 bytes | 4-byte signed integer
Uint32 | 4 bytes | 4-byte unsigned integer
Int64 | 8 bytes | 8-byte signed integer
Uint64 | 8 bytes | 8-byte unsigned integer
Float | 4 bytes | 4-byte floating-point number
Double | 8 bytes | 8-byte floating-point number
Number([n]) | Variable | Variable-length number (optionally bound by n bytes in size)
Decimal | Variable | Variable-length floating-point number
UUID | 16 bytes | UUID
String | Variable | A string encoded with a given Charset
Date | 3 bytes | A timezone-free date encoded as a year (1 sign bit + 14 bits), month (4 bits), day (5 bits); see the packing sketch after this table
Time | 5 bytes | A timezone-free time encoded as padding (3 bits), hour (5 bits), minute (6 bits), second (6 bits), microseconds (20 bits)
Datetime | 8 bytes | A timezone-free datetime encoded as (date, time)
Timestamp | 10 bytes | Number of microseconds since Jan 1, 1970 00:00:00.000000 (with no timezone)
Binary | Variable | Variable-size byte array

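To make the Date encoding above concrete, here is a minimal packing/unpacking sketch: the year (1 sign bit + 14 bits), month (4 bits), and day (5 bits) are packed from most to least significant into 24 bits (3 bytes). This is an illustration only, not the actual storage code:

    // Pack year/month/day into the 24-bit layout described in the table above.
    static int packDate(int year, int month, int day) {
        return ((year & 0x7FFF) << 9) | ((month & 0xF) << 5) | (day & 0x1F);
    }

    // Unpack the individual fields with shifts and masks.
    static int yearOf(int packed)  { return (packed << 8) >> 17; } // sign-extends the 15-bit year
    static int monthOf(int packed) { return (packed >>> 5) & 0xF; }
    static int dayOf(int packed)   { return packed & 0x1F; }
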
Data Layout

Given a set of user-defined columns, the set is rearranged so that fixed-size columns go first. This sorted set of columns forms a row. The row layout is as follows:

Field | Size | Comments
Schema version | 2 bytes | A short number. Possible values: positive - regular row, both key and value chunks are present; 0 - no value, the value chunk is omitted (e.g. the row is a tombstone or a key-only row used for key lookups); negative - invalid schema version (see the sketch after this table).
Key columns hash | 4 bytes |
Key chunk:
Key chunk size | 4 bytes |
Flags | 1 byte |
Variable-length columns offsets table size | 0-2 bytes | The vartable is skipped (zero size) when the chunk contains at most one varlen column; 1 byte for the TINY format, 2 bytes for the MEDIUM and LARGE formats (see the table below).
Variable-length columns offsets table | Variable: number of non-null varlen columns * <format_size> | <format_size> depends on the Flags field; see the table below.
Fixed-size column values | Variable |
Variable-length column values | Variable |
Value chunk:
Value chunk size | 4 bytes |
Flags | 1 byte |
Null-map | (number of columns / 8) or 0 bytes | Zero size if and only if the schema has no nullable columns.
Variable-length columns offsets table size | 0-2 bytes | Same rules as for the key chunk.
Variable-length columns offsets table | Variable: number of non-null varlen columns * <format_size> | <format_size> depends on the Flags field; see the table below.
Fixed-size column values | Variable |
Variable-length column values | Variable |

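As a small illustration of the schema version semantics above, a hedged sketch of how a reader might interpret the field (the method name and error handling are assumptions for the example):

    // Positive version: regular row with key and value chunks.
    // Zero: the value chunk is omitted (tombstone or key-only lookup row).
    // Negative: invalid schema version.
    static boolean hasValueChunk(short schemaVersion) {
        if (schemaVersion < 0)
            throw new IllegalStateException("Invalid schema version: " + schemaVersion);
        return schemaVersion > 0;
    }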

For small rows, the metadata sizes may introduce very noticeable overhead, so it is reasonable to write the metadata more compactly using different techniques:

  • VarInt - a variable-size integer for sizes;
  • different VarTable formats with byte/short/int offsets;
  • skipping the VarTable and/or Null-map when possible.

The flags field is used to detect the format. We propose three formats for the vartable: TINY, MEDIUM, and LARGE, with offset field sizes of byte, short, and int respectively.
The vartable length field is a byte for the TINY format and a short for the others.
The vartable length is calculated as <count_of_non_null_varlen_fields> - 1: the offset of the first varlen field is not stored in the table, since it coincides with the beginning of the varlen values block.

IMPORTANT: even with multiple formats, a key (as well as a value) chunk MUST always be written in exactly one possible way, so that chunks of rows with the same schema version can be compared as plain byte arrays.

The flags field is a bitmask, with each bit treated as a flag (flag 0 is the LSB, flag 7 the MSB). The available flags are listed below, and a small decoding sketch follows the table:

Flags bits | Description
0, 1 | VarTable format:
  • (0, 0) - SKIPPED: the VarTable for the chunk is omitted (all column values in the chunk are either of fixed-size type or null);
  • (0, 1) - TINY format (1 byte per offset), format_size = 1;
  • (1, 0) - MEDIUM format (2 bytes per offset), format_size = 2;
  • (1, 1) - LARGE format (4 bytes per offset), format_size = 4.
2-7 | Reserved

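A small sketch of how a reader might interpret the two vartable-format bits of the flags field (illustration only; the exact bit numbering follows the table above, and the method name is an assumption):

    // Returns the size in bytes of a single vartable offset entry,
    // or 0 if the vartable is omitted for the chunk.
    static int vartableEntrySize(byte flags) {
        switch (flags & 0b11) { // the two format bits of the flags byte
            case 0:  return 0; // SKIPPED: all columns in the chunk are fixed-size or null
            case 1:  return 1; // TINY:   1-byte offsets
            case 2:  return 2; // MEDIUM: 2-byte offsets
            default: return 4; // LARGE:  4-byte offsets
        }
    }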


Risks and Assumptions

n/a

Tickets


Discussion Links

http://apache-ignite-developers.2346864.n4.nabble.com/IEP-54-Schema-first-approach-for-3-0-td49017.html

Reference Links

n/a



15 Comments

  1. Alexey Goncharuk, Valentin Kulichenko

    Please note, 

    1. Date type 'fields' can be written in the order "year, month, day" to keep the natural order. Time type - hours, minutes, seconds, millis. Preserving the natural order will make comparison trivial.
    2. Time can be compressed to 4 bytes: 5 bits for hours (0-23) + 6 bits for minutes (0-59) + 6 bits for seconds (0-59) + 10 bits for millis (0-999) = 27 bits, which perfectly fits into a 4-byte Java 'int', and any 'field' can be extracted by applying a bitmask + shift operation.
    3. 366 days don't fit into a single byte. Do you mean you need 4 bytes for the Date type?
    4. Date can be compressed to 3 bytes as well. We need 5 bits for the day (1-31) + 4 bits for the month (1-12) + the remaining 15 bits for the year (0 - 32k). So we can still address the Hebrew calendar =).

    Are there any objections?

  2. Andrey Mashenkov, good points; please go ahead with the corrections in the document!

  3. When an object is inserted into a table, we attempt to 'fit' the object fields to the schema columns. If a Java object has some extra fields which are not present in the current schema, the schema is automatically updated to store the additional fields present in the object.

    On the other hand, if an object has fewer fields than the current schema, the schema is not updated automatically (such a scenario usually means that the update is executed from an outdated client which has not yet received the proper object class version). In other words, columns are never dropped during automatic schema evolution; a column can only be dropped by an explicit user command.

    What if a user manually drops a column, but an outdated client sends an object of the old version?
    Such an object will not fit the current schema, but we definitely shouldn't bump the schema version automatically.

    It seems some important details are missing here. It is not clear how the number of fields and the schema version are connected, and at what point we decide to bump the schema version.

    1. I do not think this issue is related to the schema management component.

      Schema version is updated on every column set change and is in no way related to the number of columns. Once the schema is updated, objects with the old schema must be rejected. Now, what a client does with such a rejection is defined by the liveness policy: we can either be in 'strict' mode and reject such an update altogether, or in 'live' mode, in which case the schema will be automatically expanded again. In the case you described there is a race condition between the column drop and the old-object write, so it should be perfectly fine to expand the schema again if the policy allows it.

  4. Assume we have an object with some fields of non-null types as well as nullable types.

    • What will be in "null-map"? Will we have unused bits for non-null types?
    • As I understand it, the "null-map" here is used as a kind of compression mechanism for sparse data. Maybe we could use the "null-map" as a kind of "default-value-map"?
      A raised flag in this map would mean the "default value" for the column in terms of the object schema, e.g. a user-defined default value if set via the SQL DEFAULT keyword, "NULL" for nullable columns, and '0' (zero) for primitives.
    1. Currently the null-map does not skip non-null columns. I do not think this optimization makes sense: we only save 1 bit per column, but such gaps complicate offset calculation even more, so it is just not worth it. As for the defaults map - this is a good point. I have already created a ticket for omitting the null-map when the schema has only non-null columns, so I will update that ticket and the IEP as well.

    1. Would it be worth reserving 2 bytes for SchemaVersion?
    2. What about 'null-value' or tombstones? Do we need a flag for an empty value/tombstone?
    3. I found no information about schema evolution limitations.
      It seems the 'key' can't be changed at all. Regardless of whether live schema allows converting a key to the proper version, any change in the 'key' structure can affect the hash code.

    1. Updated the doc to address these comments.

  5. Looking at the implementation, this seems to be untrue:

    Decimal | Variable | Variable-length floating-point number

    As far as I can see, our Decimals have a fixed scale and therefore should be called fixed-point numbers.

    1. A fixed scale doesn't mean we should store the decimal value in a fixed-length byte array.
      We use a type with fixed precision/scale to round numbers and limit the footprint of the serialized value.

      I think we can support 'unlimited' values as well, but I'm not sure values of 2^831 (~250 digits) are ever useful.
      E.g. in .NET, the maximum decimal value has ~29 digits.

        1. Got it.
          A floating-point number may have more than one representation at different scales, and therefore can't be used as a key or be indexed.
          So we need a way to get a single representation.

          SQL implies Decimals have a high limit for precision/scale.
          The simple way is to forcibly convert decimals to these limits, as we do.

          1. It seems that there is still some confusion.

            A floating-point number has a mantissa and an exponent that can be positive or negative.

            A fixed-point number has a fixed number of decimal places after the decimal point. The fixed scale never exceeds the total precision. Contrast this with floating-point numbers, where the exponent may be much larger than the precision of the mantissa.

            Some languages and libraries also offer rational numbers that are a pair of integers -- numerator/denominator. So you can work precisely with numbers like 1/3,  -5/6, etc. But this is relatively exotic.

            Floating and fixed point numbers (and rational too) can also be single-precision, double-precision, multiple-precision, arbitrary-precision (aka bignum). This is an orthogonal thing.

            For the DECIMAL type, the technically correct term is probably multiple-precision fixed-point number. But in this document it is called a floating-point number, which is simply technically incorrect.

  6. Looking at the implementation, this is not what we have now:

    Time | 5 bytes | A timezone-free time encoded as padding (3 bits), hour (5 bits), minute (6 bits), second (6 bits), microseconds (20 bits)

    Now it takes either 4 or 6 bytes, for millisecond and nanosecond precision respectively. The 5-byte variant is not actually supported.

    And consequently Datetime takes not 8 bytes but either 7 or 9.

    1. Seems we forgot to change the IEP.