ID | IEP-54 |
Author | |
Sponsor | |
Created | |
Status | DRAFT |
The way Ignite works with data schemas is inconsistent:
This creates multiple usability issues:
The general idea is to have a one-to-one mapping between data schemas and caches/tables. Each cache has a single unified schema, which is applied both to the data storage itself and to SQL.
When a cache is created, it is configured with a corresponding data schema. There must be an API and a tool to see the current version of the schema for any cache, as well as to make updates to it. Schema updates are applied dynamically without downtime.
DDL should work on top of this API, providing similar functionality. E.g., a CREATE TABLE invocation translates to the creation of a cache with the schema described in the statement.
Anything stored in a cache/table must be compliant with the current schema. An attempt to store incompatible data should fail.
The binary protocol should be used only as the data storage format. All serialization that happens for communication only should be performed by a different protocol. The data storage format will be coupled with the schemas, while the communication format is independent of them. As a bonus, this will likely allow for multiple optimizations on both sides, as the serialization protocols will become more narrowly purposed.
The BinaryObject API should be reworked, as it will not represent actual serialized objects anymore. It should be replaced with something like BinaryRecord or DataRecord representing a record in a cache or table. Similar to the current binary objects, records will provide access to individual fields. A record can also be deserialized into a class with any subset of the fields represented in the record.
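A minimal sketch of what such a record interface might look like, assuming the BinaryRecord/DataRecord naming suggested above; the method names value and deserialize are illustrative only, not a committed contract:

```java
/**
 * Hypothetical shape of the record API. The interface and method names are
 * assumptions based on the BinaryRecord/DataRecord suggestion above.
 */
public interface DataRecord {
    /** Typed access to an individual field by column name. */
    <T> T value(String columnName);

    /** Deserializes the record into a user class mapping any subset of the record's fields. */
    <T> T deserialize(Class<T> targetClass);
}
```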
There are several ways a schema can be defined. The initial entry point to the schema definition is the SchemaBuilder Java API:

TBD (see the SchemaBuilders class for details)
The schema builder calls are transparently mapped to DDL statements so that all operations possible via a builder are also possible via DDL and vice versa.
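For illustration only, a builder invocation and its DDL counterpart might look like the sketch below. The exact method names (tableBuilder, column, withPrimaryKey) and the ColumnType helper are assumptions, not the final SchemaBuilders contract:

```java
// Hypothetical builder usage; names are illustrative assumptions.
SchemaTable personTable = SchemaBuilders.tableBuilder("PUBLIC", "Person")
    .columns(
        SchemaBuilders.column("id", ColumnType.INT64).asNonNull().build(),
        SchemaBuilders.column("name", ColumnType.string()).asNullable().build())
    .withPrimaryKey("id")
    .build();

// The equivalent DDL statement under the builder-to-DDL mapping described above:
// CREATE TABLE PUBLIC.Person (id BIGINT NOT NULL, name VARCHAR, PRIMARY KEY (id));
```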
The Schema-first approach imposes certain natural requirements which are stricter than those of the binary object serialization format:
The suggested set of supported built-in data types is listed in the table below:
Type | Size | Description |
---|---|---|
Bitmask(n) | ⌈n/8⌉ bytes | A fixed-length bitmask of n bits |
Int8 | 1 byte | 1-byte signed integer |
Uint8 | 1 byte | 1-byte unsigned integer |
Int16 | 2 bytes | 2-byte signed integer |
Uint16 | 2 bytes | 2-byte unsigned integer |
Int32 | 4 bytes | 4-byte signed integer |
Uint32 | 4 bytes | 4-byte unsigned integer |
Int64 | 8 bytes | 8-byte signed integer |
Uint64 | 8 bytes | 8-byte unsigned integer |
Float | 4 bytes | 4-byte floating-point number |
Double | 8 bytes | 8-byte floating-point number |
Number([n]) | Variable | Variable-length number (optionally bound by n bytes in size) |
Decimal | Variable | Variable-length floating-point number |
UUID | 16 bytes | UUID |
String | Variable | A string encoded with a given Charset |
Date | 3 bytes | A timezone-free date encoded as a year (1 sign bit + 14 bits), month (4 bits), day (5 bits) |
Time | 5 bytes | A timezone-free time encoded as padding (3 bits), hour (5 bits), minute (6 bits), second (6 bits), microseconds (20 bits) |
Datetime | 8 bytes | A timezone-free datetime encoded as (date, time) |
Timestamp | 10 bytes | Number of microseconds since Jan 1, 1970 00:00:00.000000 (with no timezone) |
Binary | Variable | Variable-size byte array |
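As a sanity check of the bit budget for the Date type above (1 sign bit + 14 bits for the year, 4 bits for the month, 5 bits for the day = 24 bits), here is a small sketch. The bit order chosen here is an assumption; it only illustrates that the three components round-trip within 3 bytes:

```java
/** A sketch (not the actual row serializer) of the 3-byte Date packing described above. */
final class DateEncodingSketch {
    /** Packs year (1 sign bit + 14 bits), month (4 bits), day (5 bits) into the low 24 bits of an int. */
    static int packDate(int year, int month, int day) {
        return ((year & 0x7FFF) << 9) | ((month & 0xF) << 5) | (day & 0x1F);
    }

    /** Sign-extends the 15-bit year stored in bits 9..23. */
    static int yearOf(int packed)  { return (packed << 8) >> 17; }

    static int monthOf(int packed) { return (packed >> 5) & 0xF; }

    static int dayOf(int packed)   { return packed & 0x1F; }

    public static void main(String[] args) {
        int packed = packDate(2021, 2, 19);
        // Prints "2021-2-19", demonstrating the round trip within 3 bytes.
        System.out.println(yearOf(packed) + "-" + monthOf(packed) + "-" + dayOf(packed));
    }
}
```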
Given a set of user-defined columns, the set is rearranged so that fixed-size columns go first. This sorted set of columns is used to form a row. The row layout is as follows:
Field | Size | Comments |
---|---|---|
Schema version | 2 bytes | Short number. The possible values are: |
Key columns hash | 4 bytes | |
Key chunk: | | |
Key chunk size | 4 bytes | |
Flags | 1 byte | |
Variable-length columns offsets table size | 0-2 bytes | |
Variable-length columns offsets table | Variable (number of non-null varlen columns * <format_size>) | <format_size> depends on the Flags field. See the table below |
Fixed-size columns values | Variable | |
Variable-length columns values | Variable | |
Value chunk: | | |
Value chunk size | 4 bytes | |
Flags | 1 byte | |
Null-map | (number of columns / 8) or 0 bytes | Zero size if and only if the schema has no nullable columns |
Variable-length columns offsets table size | 2 or 0 bytes | |
Variable-length columns offsets table | Variable (number of non-null varlen columns * <format_size>) | <format_size> depends on the Flags field. See the table below |
Fixed-size columns values | Variable | |
Variable-length columns values | Variable |
For small rows, the metadata sizes may introduce very noticeable overhead, so it looks reasonable to write them in a more compact way using different techniques.

The flags field is used to detect the format. We propose 3 formats for a vartable: tiny, medium, and large, with offset field sizes of byte, short, and int respectively. The vartable length field is the size of a byte for the tiny format and the size of a short for the others. The vartable length is calculated as <count_of_not_null_varlen_fields> - 1: the offset of the first varlen field is not stored in the table, since it is simply the beginning of the varlen values block.

IMPORTANT: even with multiple formats, it MUST be guaranteed that the key (as well as the value) chunk is always written in a single possible way, so that chunks of rows of the same schema version can be compared as plain byte arrays.
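To make the sizing concrete, here is a small sketch of how the chunk metadata sizes described above could be computed. The criterion for choosing a format (based on the maximum offset), the helper names, and the rounding-up of the null map size are assumptions, not the actual implementation:

```java
/** Sketch of the metadata sizing rules described above; not the actual row writer. */
final class RowMetadataSizingSketch {
    /** Offset field width in bytes for the tiny, medium and large vartable formats. */
    static int offsetFieldSize(long maxOffset) {
        if (maxOffset <= 0xFF)   return Byte.BYTES;    // tiny: 1-byte offsets
        if (maxOffset <= 0xFFFF) return Short.BYTES;   // medium: 2-byte offsets
        return Integer.BYTES;                          // large: 4-byte offsets
    }

    /**
     * Vartable size: a length field (1 byte for the tiny format, 2 bytes otherwise)
     * plus (non-null varlen columns - 1) offsets, since the first varlen value
     * starts right at the beginning of the varlen values block.
     */
    static int vartableSize(int nonNullVarlenCols, int offsetFieldSize) {
        if (nonNullVarlenCols == 0)
            return 0;

        int lengthFieldSize = offsetFieldSize == Byte.BYTES ? Byte.BYTES : Short.BYTES;

        return lengthFieldSize + (nonNullVarlenCols - 1) * offsetFieldSize;
    }

    /** Null map size: one bit per column (rounded up), or nothing if the schema has no nullable columns. */
    static int nullMapSize(int columnCount, boolean hasNullableColumns) {
        return hasNullableColumns ? (columnCount + 7) / 8 : 0;
    }
}
```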
The flags field is a bitmask with each bit treated as a flag, with the following flags available (from flag 0 being the LSB to flag 7 being MSB):
Flags Bits | Description |
---|---|
0, 1 | VarTable formats: |
2-7 | Reserved |
15 Comments
Andrey Mashenkov
Alexey Goncharuk , Valentin Kulichenko
Please note,
Are there any objections?
Alexey Goncharuk
Andrey Mashenkov good points, please go ahead with the corrections in the document!
Andrey Mashenkov
What if a user manually drops a column, but an outdated client sends an object of the old version?
Such an object will not fit the current schema, but we definitely shouldn't bump the schema version automatically.
It seems some important details are missing here. It is not clear how the number of fields and the schema version are connected, and at what point we decide to bump the schema version.
Alexey Goncharuk
I do not think this issue is related to the schema management component.
The schema version is updated on every column set change and is in no way related to the number of columns. Once the schema is updated, objects with the old schema must be rejected. Now, what a client does with such a rejection is defined by the liveness policy: we can either be in 'strict' mode and reject such an update altogether, or we can be in 'live' mode, in which case the schema will be automatically expanded again. In the case you described there is a race condition between the column drop and the old object write, so it should be perfectly fine to expand the schema again if it is allowed by the policy.
Andrey Mashenkov
Assume we have an object with some fields of non-null types as well as nullable types.
A raised flag in this map would mean a "default value" for the column in terms of the object schema, e.g. a user-defined default value if one is set via the SQL DEFAULT keyword, NULL for nullable columns, and 0 (zero) for primitives.
Alexey Goncharuk
Currently the null map does not skip non-null columns. I do not think this optimization makes sense, because we only save 1 bit per column, while such gaps complicate offset calculation even more, so I think it is just not worth it. As for the defaults map, this is a good point. I already created a ticket for omitting the null map when the schema only has non-null columns, so I will update this ticket and the IEP as well.
Andrey Mashenkov
It seems the 'key' can't be changed at all. Regardless of whether live schema allows converting the key to the proper version, any change in the 'key' structure can affect the hash code.
Alexey Goncharuk
Updated the doc to address these comments.
Aleksey Demakov
Looking at the implementation, this seems not to be true: as far as I can see, our Decimals have a fixed scale and therefore should be called fixed-point numbers.
Andrey Mashenkov
A fixed scale doesn't mean we should store the decimal value in a fixed-length byte array.
We use a type with fixed precision/scale for rounding numbers, to limit the footprint of the serialized value.
I think we can support 'unlimited' values as well, but I'm not sure a value of 2^831 (~250 digits) is ever useful.
E.g. in .NET, the maximum decimal value has ~29 digits.
Aleksey Demakov
Fixed-point and fixed-length are different terms. https://en.wikipedia.org/wiki/Fixed-point_arithmetic
Andrey Mashenkov
Got it.
A floating-point number may have more than one representation at different scales, and therefore can't be used as a key or be indexed.
So we need a way to get a single representation.
SQL implies Decimals have an upper limit for precision/scale.
The simple way is to forcibly convert decimals to these limits, as we do.
Aleksey Demakov
It seems that there is still some confusion.
A floating point number has mantissa and exponent that can be positive or negative.
A fixed point number has a fixed number of decimal places after the decimal point. Fixed scale never exceeds the total precision. Contrast this to floating point numbers where exponent may be much larger than the precision of mantissa.
Some languages and libraries also offer rational numbers that are a pair of integers -- numerator/denominator. So you can work precisely with numbers like 1/3, -5/6, etc. But this is relatively exotic.
Floating and fixed point numbers (and rational too) can also be single-precision, double-precision, multiple-precision, arbitrary-precision (aka bignum). This is an orthogonal thing.
For the DECIMAL type the technically correct term is probably multiple-precision fixed-point number. But in this document it is called a floating-point number, which is simply technically incorrect.
Aleksey Demakov
Looking at the implementation, this is not what we have now:
the Time type now takes either 4 or 6 bytes, for milliseconds and nanoseconds respectively. The 5-byte variant is not actually supported.
Consequently, Datetime takes not 8 bytes but either 7 or 9.
Andrey Mashenkov
Seems we forgot to change the IEP.