IDIEP-54

Author:
Sponsor:
Created:
Status: DRAFT

Motivation

The way Ignite works with data schemas is inconsistent:

  • The binary protocol creates schemas for anything that is serialized. These schemas are updated implicitly – the user doesn't have any control over them.
  • The SQL engine has its own schema, separate from the binary schema, even though SQL runs on top of binary objects. The SQL schema is created and updated explicitly by the user.
  • Caches themselves are basically schemaless – you're allowed to store multiple versions of multiple data types in a single cache.

This creates multiple usability issues:

  • The SQL schema can be inconsistent or even incompatible with the binary schema. If either of them is updated, the other is not affected.
  • SQL can't be used by default. The user has to explicitly create the SQL schema, listing all fields and making sure the list is consistent with the content of binary objects.
  • Binary schemas are decoupled from caches. So, if a cache is destroyed, the binary schema is not removed.
  • Etc.

Description

The general idea is to have a one-to-one mapping between data schemas and caches/tables. Every cache has a single unified schema, which is applied both to the data storage itself and to SQL.

When a cache is created, it is configured with a corresponding data schema. There must be an API and a tool to see the current version of the schema for any cache, as well as make updates to it. Schema updates are applied dynamically without downtime.

DDL should work on top of this API, providing similar functionality; e.g., a CREATE TABLE invocation translates to the creation of a cache with the schema described in the statement.

Anything stored in a cache/table must be compliant with the current schema. An attempt to store incompatible data should fail.

The binary protocol should be used only as the data storage format. All serialization that happens for communication only should be performed by a different protocol. The data storage format will be coupled with the schemas, while the communication format is independent of them. As a bonus, this will likely allow for multiple optimizations on both sides, as the serialization protocols will become more narrowly purposed.

BinaryObject API should be reworked, as it will not represent actual serialized objects anymore. It should be replaced with something like BinaryRecord or DataRecord representing a record in a cache or table. Similar to the current binary objects, records will provide access to individual fields. A record can also be deserialized into a class with any subset of fields represented in the record.
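
For illustration, here is a minimal sketch of what such a record API might look like. The interface name (DataRecord), the accessor methods, and the toObject mapping method are assumptions made for this example, not the final API:

    // Hypothetical record-access API; all names here are illustrative assumptions.
    public interface DataRecord {
        /** Version of the schema this record was written with. */
        int schemaVersion();

        /** Typed access to individual columns by name. */
        long longValue(String columnName);
        String stringValue(String columnName);
        Object value(String columnName);

        /**
         * Deserializes the record into a user class; the class may declare
         * any subset of the columns present in the record.
         */
        <T> T toObject(Class<T> targetClass);
    }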

Schema Definition API

There are several ways a schema can be defined. The initial entry point to schema definition is the SchemaBuilder Java API:

TBD (see SchemaBuilders class for details)

The schema builder calls are transparently mapped to DDL statements so that all operations possible via a builder are also possible via DDL and vice versa.
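
For illustration, a hedged sketch of how a table schema might be defined through the builder, together with the DDL it would map to. The class and method names (SchemaBuilders, tableBuilder, column, withPrimaryKey) are assumptions based on the intent described above, not the final API:

    // Illustrative sketch only; builder class and method names are assumptions.
    SchemaTable personTable = SchemaBuilders.tableBuilder("PUBLIC", "Person")
        .columns(
            SchemaBuilders.column("ID", ColumnType.INT64).build(),
            SchemaBuilders.column("NAME", ColumnType.string()).asNullable().build(),
            SchemaBuilders.column("SALARY", ColumnType.DOUBLE).asNullable().build())
        .withPrimaryKey("ID")
        .build();

    // Roughly the DDL statement the builder calls above would map to:
    // CREATE TABLE PUBLIC.Person (
    //     ID     BIGINT PRIMARY KEY,
    //     NAME   VARCHAR,
    //     SALARY DOUBLE
    // )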

Data restrictions

The schema-first approach imposes certain natural requirements that are stricter than those of the binary object serialization format:

  • The column type must be one of a predefined set of available 'primitives' (including strings, UUIDs, and date & time values).
  • Arbitrary nested objects and collections are not allowed as column values. Nested POJOs should either be inlined into the schema or stored as BLOBs.
  • Date & time values should be compressed while preserving natural ordering, and decompression should be a trivial operation (such as applying a bitmask).

The suggested set of supported built-in data types is shown in the table below:

Type | Size | Description
Bitmask(n) | n/8 bytes | A fixed-length bitmask of n bits
Int8 | 1 byte | 1-byte signed integer
Uint8 | 1 byte | 1-byte unsigned integer
Int16 | 2 bytes | 2-byte signed integer
Uint16 | 2 bytes | 2-byte unsigned integer
Int32 | 4 bytes | 4-byte signed integer
Uint32 | 4 bytes | 4-byte unsigned integer
Int64 | 8 bytes | 8-byte signed integer
Uint64 | 8 bytes | 8-byte unsigned integer
Float | 4 bytes | 4-byte floating-point number
Double | 8 bytes | 8-byte floating-point number
Number([n]) | Variable | Variable-length number (optionally bound by n bytes in size)
Decimal | Variable | Variable-length floating-point number
UUID | 16 bytes | UUID
String | Variable | A string encoded with a given Charset
Date | 3 bytes | A timezone-free date encoded as a year (1 sign bit + 14 bits), month (4 bits), day (5 bits); see the packing sketch after this table
Time | 5 bytes | A timezone-free time encoded as padding (3 bits), hour (5 bits), minute (6 bits), second (6 bits), microseconds (20 bits)
Datetime | 8 bytes | A timezone-free datetime encoded as (date, time)
Timestamp | 10 bytes | Number of microseconds since Jan 1, 1970 00:00:00.000000 (with no timezone)
Binary | Variable | Variable-size byte array

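To make the Date encoding above concrete, here is a minimal packing/unpacking sketch: the year (1 sign bit + 14 bits), month (4 bits), and day (5 bits) are packed from most to least significant into 24 bits (3 bytes). This is an illustration only, not the actual storage code:

    // Pack year/month/day into the 24-bit layout described in the table above.
    static int packDate(int year, int month, int day) {
        return ((year & 0x7FFF) << 9) | ((month & 0xF) << 5) | (day & 0x1F);
    }

    // Unpack the individual fields with shifts and masks.
    static int yearOf(int packed)  { return (packed << 8) >> 17; } // sign-extends the 15-bit year
    static int monthOf(int packed) { return (packed >>> 5) & 0xF; }
    static int dayOf(int packed)   { return packed & 0x1F; }
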
Data Layout

Given a set of user-defined columns, the set is rearranged so that fixed-size columns go first. This sorted set of columns forms a row. The row layout is as follows:

Field | Size | Comments
Schema version | 2 bytes | A short number. Possible values: positive - regular row, both key and value chunks are present; 0 - no value, the value chunk is omitted (e.g. the row is a tombstone or a key-only row used for key lookups); negative - invalid schema version (see the sketch after this table).
Key columns hash | 4 bytes |
Key chunk:
Key chunk size | 4 bytes |
Flags | 1 byte |
Variable-length columns offsets table size | 0-2 bytes | The vartable is skipped (zero size) when the chunk contains at most one varlen column; 1 byte for the TINY format, 2 bytes for the MEDIUM and LARGE formats (see the table below).
Variable-length columns offsets table | Variable: number of non-null varlen columns * <format_size> | <format_size> depends on the Flags field; see the table below.
Fixed-size column values | Variable |
Variable-length column values | Variable |
Value chunk:
Value chunk size | 4 bytes |
Flags | 1 byte |
Null-map | (number of columns / 8) or 0 bytes | Zero size if and only if the schema has no nullable columns.
Variable-length columns offsets table size | 0-2 bytes | Same rules as for the key chunk.
Variable-length columns offsets table | Variable: number of non-null varlen columns * <format_size> | <format_size> depends on the Flags field; see the table below.
Fixed-size column values | Variable |
Variable-length column values | Variable |

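As a small illustration of the schema version semantics above, a hedged sketch of how a reader might interpret the field (the method name and error handling are assumptions for the example):

    // Positive version: regular row with key and value chunks.
    // Zero: the value chunk is omitted (tombstone or key-only lookup row).
    // Negative: invalid schema version.
    static boolean hasValueChunk(short schemaVersion) {
        if (schemaVersion < 0)
            throw new IllegalStateException("Invalid schema version: " + schemaVersion);
        return schemaVersion > 0;
    }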

For small rows, the metadata sizes may introduce very noticeable overhead, so it is reasonable to write the metadata more compactly using different techniques:

  • VarInt - a variable-size integer for sizes;
  • different VarTable formats with byte/short/int offsets;
  • skipping the VarTable and/or Null-map when possible.

The flags field is used to detect the format. We propose three formats for the vartable: TINY, MEDIUM, and LARGE, with offset field sizes of byte, short, and int respectively.
The vartable length field is a byte for the TINY format and a short for the others.
The vartable length is calculated as <count_of_non_null_varlen_fields> - 1: the offset of the first varlen field is not stored in the table, since it coincides with the beginning of the varlen values block.

IMPORTANT: even with multiple formats, a key (as well as a value) chunk MUST always be written in exactly one possible way, so that chunks of rows with the same schema version can be compared as plain byte arrays.

The flags field is a bitmask, with each bit treated as a flag (flag 0 is the LSB, flag 7 the MSB). The available flags are listed below, and a small decoding sketch follows the table:

Flags bits | Description
0, 1 | VarTable format:
  • (0, 0) - SKIPPED: the VarTable for the chunk is omitted (all column values in the chunk are either of fixed-size type or null);
  • (0, 1) - TINY format (1 byte per offset), format_size = 1;
  • (1, 0) - MEDIUM format (2 bytes per offset), format_size = 2;
  • (1, 1) - LARGE format (4 bytes per offset), format_size = 4.
2-7 | Reserved

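A small sketch of how a reader might interpret the two vartable-format bits of the flags field (illustration only; the exact bit numbering follows the table above, and the method name is an assumption):

    // Returns the size in bytes of a single vartable offset entry,
    // or 0 if the vartable is omitted for the chunk.
    static int vartableEntrySize(byte flags) {
        switch (flags & 0b11) { // the two format bits of the flags byte
            case 0:  return 0; // SKIPPED: all columns in the chunk are fixed-size or null
            case 1:  return 1; // TINY:   1-byte offsets
            case 2:  return 2; // MEDIUM: 2-byte offsets
            default: return 4; // LARGE:  4-byte offsets
        }
    }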


Risks and Assumptions

n/a

Tickets


Discussion Links

http://apache-ignite-developers.2346864.n4.nabble.com/IEP-54-Schema-first-approach-for-3-0-td49017.html

Reference Links

n/a



15 Comments

  1. Alexey Goncharuk, Valentin Kulichenko

    Please note, 

    1. Date type 'fields' can be written in the order "year, month, day" to keep the natural order. Time type - hours, minutes, seconds, millis. Preserving the natural order will make comparison trivial.
    2. Time can be compressed to 4 bytes: 5 bits for hours (0-23) + 6 bits for minutes (0-59) + 6 bits for seconds (0-59) + 10 bits for millis (0-999) = 27 bits, which perfectly fits into a 4-byte Java 'int', and any 'field' can be extracted by applying a bitmask + shift operation.
    3. 366 days don't fit into a single byte. Do you mean you need 4 bytes for the Date type?
    4. Date can be compressed to 3 bytes as well. We need 5 bits for the day (1-31) + 4 bits for the month (1-12) + the remaining 15 bits for the year (0 - 32k). So we can still address the Hebrew calendar =).

    Are there any objections?

  2. Andrey Mashenkov, good points; please go ahead with the corrections in the document!

  3. When an object is inserted into a table, we attempt to 'fit' the object fields to the schema columns. If a Java object has some extra fields which are not present in the current schema, the schema is automatically updated to store the additional fields present in the object.

    On the other hand, if an object has fewer fields than the current schema, the schema is not updated automatically (such a scenario usually means that the update is executed from an outdated client which has not yet received the proper object class version). In other words, columns are never dropped during automatic schema evolution; a column can only be dropped by an explicit user command.

    What if a user manually drops a column, but an outdated client sends an object of the old version?
    Such an object will not fit the current schema, but we definitely shouldn't bump the schema version automatically.

    It seems some important details are missing here. It is not clear how the number of fields and the schema version are connected, and at what point we decide to bump the schema version.

    1. I do not think this issue is related to the schema management component.

      Schema version is updated on every column set change and is in no way related to the number of columns. Once the schema is updated, objects with the old schema must be rejected. Now, what a client does with such a rejection is defined by the liveness policy: we can either be in 'strict' mode and reject such an update altogether, or in 'live' mode, in which case the schema will be automatically expanded again. In the case you described there is a race condition between the column drop and the old-object write, so it should be perfectly fine to expand the schema again if the policy allows it.

  4. Assume we have an object with some fields of non-null types as well as nullable types.

    • What will be in "null-map"? Will we have unused bits for non-null types?
    • As I understand it, the "null-map" here is used as a kind of compression mechanism for sparse data. Maybe we could use the "null-map" as a kind of "default-value-map"?
      A raised flag in this map would mean the "default value" for the column in terms of the object schema, e.g. a user-defined default value if set via the SQL DEFAULT keyword, "NULL" for nullable columns, and '0' (zero) for primitives.
    1. Currently the null-map does not skip non-null columns. I do not think this optimization makes sense: we only save 1 bit per column, but such gaps complicate offset calculation even more, so it is just not worth it. As for the defaults map - this is a good point. I have already created a ticket for omitting the null-map when the schema has only non-null columns, so I will update that ticket and the IEP as well.

    1. Would it be worth reserving 2 bytes for SchemaVersion?
    2. What about 'null-value' or tombstones? Do we need a flag for an empty value/tombstone?
    3. I found no information about schema evolution limitations.
      It seems the 'key' can't be changed at all. Regardless of whether live schema allows converting a key to the proper version, any change in the 'key' structure can affect the hash code.

    1. Updated the doc to address these comments.

  5. Looking at the implementation, this seems to be untrue:

    Decimal | Variable | Variable-length floating-point number

    As far as I can see, our Decimals have a fixed scale and therefore should be called fixed-point numbers.

    1. A fixed scale doesn't mean we should store the decimal value in a fixed-length byte array.
      We use a type with fixed precision/scale to round numbers and limit the footprint of the serialized value.

      I think we can support 'unlimited' values as well, but I'm not sure values of 2^831 (~250 digits) are ever useful.
      E.g. in .NET, the maximum decimal value has ~29 digits.

        1. Got it.
          A floating-point number may have more than one representation at different scales, and therefore can't be used as a key or be indexed.
          So we need a way to get a single representation.

          SQL implies Decimals have a high limit for precision/scale.
          The simple way is to forcibly convert decimals to these limits, as we do.

          1. It seems that there is still some confusion.

            A floating-point number has a mantissa and an exponent that can be positive or negative.

            A fixed-point number has a fixed number of decimal places after the decimal point. The fixed scale never exceeds the total precision. Contrast this with floating-point numbers, where the exponent may be much larger than the precision of the mantissa.

            Some languages and libraries also offer rational numbers that are a pair of integers -- numerator/denominator. So you can work precisely with numbers like 1/3,  -5/6, etc. But this is relatively exotic.

            Floating and fixed point numbers (and rational too) can also be single-precision, double-precision, multiple-precision, arbitrary-precision (aka bignum). This is an orthogonal thing.

            For the DECIMAL type, the technically correct term is probably multiple-precision fixed-point number. But in this document it is called a floating-point number, which is simply technically incorrect.

  6. Looking at the implementation, this is not what we have now:

    Time | 5 bytes | A timezone-free time encoded as padding (3 bits), hour (5 bits), minute (6 bits), second (6 bits), microseconds (20 bits)

    Now it takes either 4 or 6 bytes, for millisecond and nanosecond precision respectively. The 5-byte variant is not actually supported.

    And consequently Datetime takes not 8 bytes but either 7 or 9.

    1. Seems we forgot to change the IEP.