ID | IEP-85 |
Author | |
Sponsor | |
Created | |
Status | DRAFT |
Introducing a versioned schema (IEP-54) makes it possible to upgrade rows to the latest version on the fly and even to update the schema automatically in some simple cases, e.g. when adding a new column.
Thus, a user may choose between two modes: Strict, for manual schema management, and Live, for dynamic schema expansion.
We may introduce an API that infers the schema from a key-value pair using class fields and annotations. The inference happens at the call site, on the node invoking the table modification operation.
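For illustration only, the inference could be driven by annotations like the ones sketched below; the annotation names and attributes here are assumptions for this example, not a finalized API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotations driving schema inference; the real API may differ.
@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD)
@interface Id { }

@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD)
@interface Column { String name(); int length() default -1; }

// From this class, the schema (id INT, name VARCHAR(32)) would be inferred.
class Person {
    @Id
    int id;

    @Column(name = "name", length = 32)
    String name;
}
```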
The table schema should be automatically exposed to the table configuration subtree, so that simple schema changes are available via the ignite CLI and the schema can be defined during table creation via the ignite CLI.
Unlike the Ignite 2.x approach, where the binary object schema ID is defined by the set of fields present in a binary object, the schema-first approach assigns a monotonically growing identifier to each version of the cache schema. The ordering guarantees should be provided by the underlying metadata storage layer (for example, the current distributed metastorage implementation or a consensus-based metadata storage). The schema identifier should be stored together with the data rows (but not necessarily with each row individually: we can store the schema ID along with a page or a larger chunk of data). The history of schema versions must be stored for long enough to allow upgrading all existing data stored in a given cache.
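As a minimal sketch, a versioned schema descriptor could have roughly the following shape; the class and field names are assumptions for this illustration, not the actual Ignite internals.

```java
// Illustrative shape of a versioned schema descriptor.
final class SchemaDescriptor {
    final int version;           // monotonically growing schema version
    final String[] keyColumns;   // fixed across versions (see the restriction below)
    final String[] valueColumns; // may change from version to version

    SchemaDescriptor(int version, String[] keyColumns, String[] valueColumns) {
        this.version = version;
        this.keyColumns = keyColumns;
        this.valueColumns = valueColumns;
    }
}
```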
Given schema evolution history, a row migration from version N-k to version N is a straightforward operation. We identify fields that were dropped during the last k schema operations and fields that were added (taking into account default field values) and update the row based on the field modifications. Afterward, the updated row is written in the schema version N layout format. The row upgrade may happen on reading with an optional writeback or on the next update. Additionally, a row upgrade in the background is possible.
Since the row key hashcode is inlined into the row data for quick key lookups, we require that the set of key columns does not change during schema evolution. In the future, we may remove this restriction, but this will require careful hashcode calculation adjustments, since the hash code value must not change after a new column with a default value is added. Removing a column from the key columns does not seem possible, since it may produce duplicates, and checking for duplicates may require a full scan.
In addition to adding and removing columns, it will be possible to allow column type migrations when the type change is unambiguous (a type upcast, e.g. Int8 → Int16, or a conversion by means of a certain expression, e.g. Int8 → String using a CAST expression). Type conversions that narrow the column range (e.g. Int16 → Int8) must only be allowed via explicit expressions that let Ignite validate that no RangeOutOfBoundsException is possible during the conversion.
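A minimal sketch of such validation, with illustrative method names (not the actual Ignite API):

```java
// Upcast Int8 -> Int16 always succeeds: every byte value fits into a short.
static short upcastInt8ToInt16(byte value) {
    return value;
}

// Narrowing Int16 -> Int8 must be validated so that no out-of-range value
// slips through (the text above calls this RangeOutOfBoundsException).
static byte narrowInt16ToInt8(short value) {
    if (value < Byte.MIN_VALUE || value > Byte.MAX_VALUE)
        throw new ArithmeticException("Value out of Int8 range: " + value);
    return (byte) value;
}
```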
For example, consider the following sequence of schema modifications expressed in SQL-like terms:
CREATE TABLE Person (id INT, name VARCHAR(32), lastname VARCHAR(32), taxid INT);
ALTER TABLE Person ADD COLUMN residence VARCHAR(2) DEFAULT 'GB';
ALTER TABLE Person DROP COLUMN lastname, taxid;
ALTER TABLE Person ADD COLUMN lastname VARCHAR(32) DEFAULT 'N/A';
This sequence of modifications results in the following schema history:
ID | Columns | Delta |
---|---|---|
1 | id, name, lastname, taxid | N/A |
2 | id, name, lastname, taxid, residence | +residence ("GB") |
3 | id, name, residence | -lastname, -taxid |
4 | id, name, residence, lastname | +lastname ("N/A") |
With this history, upgrading a row (1, "John", "Doe") of version 1 to version 4 means erasing columns lastname and taxid, then adding column residence with default "GB" and column lastname (the column is brought back) with default "N/A", resulting in the row (1, "John", "GB", "N/A").
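The upgrade above can be sketched as replaying the recorded deltas; the classes and method below are illustrative only, not the actual Ignite implementation.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// The per-version delta recorded in the schema history.
final class SchemaDelta {
    final List<String> droppedColumns;       // columns removed in this version
    final Map<String, Object> addedDefaults; // added columns -> default values

    SchemaDelta(List<String> droppedColumns, Map<String, Object> addedDefaults) {
        this.droppedColumns = droppedColumns;
        this.addedDefaults = addedDefaults;
    }
}

final class RowUpgrader {
    // Applies the deltas between the row's schema version and the latest one.
    static Map<String, Object> upgrade(Map<String, Object> row, List<SchemaDelta> deltas) {
        Map<String, Object> upgraded = new LinkedHashMap<>(row);
        for (SchemaDelta delta : deltas) {
            delta.droppedColumns.forEach(upgraded::remove);
            upgraded.putAll(delta.addedDefaults); // missing columns get defaults
        }
        return upgraded;
    }
}
```

Applying the deltas for versions 2-4 to the version 1 row {id=1, name="John", lastname="Doe"} yields {id=1, name="John", residence="GB", lastname="N/A"}, matching the history table above.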
Clearly, given a fixed schema, we can generate an infinite number of classes that match the columns of this schema. This observation can be used to simplify ORM for end users. For APIs that return Java objects, the mapping from schema columns to object fields can be constructed dynamically, allowing a single row to be deserialized into instances of different classes.
For example, let's say we have a schema PERSON (id INT, name VARCHAR(32), lastname VARCHAR(32), residence VARCHAR(2), taxid INT). Each row of this schema can be deserialized into the following classes:
class Person { int id; String name; String lastName; }
class RichPerson { int id; String name; String lastName; String residence; int taxId; }
For each table, a user may specify a default Java class binding, and for each individual operation a user may provide a target class for deserialization:
Person p = table.get(key, Person.class);
Given the set of fields in the target class, Ignite may optimize the amount of data sent over the network by skipping fields that would be ignored during deserialization.
An update operation with an object of a truncated class is also possible, but missing fields will be treated as "not set", as if the operation were an SQL INSERT statement with some PERSON table fields omitted. Missing field values will be implicitly set to the DEFAULT column value according to the row schema version:
table.insert(person);
It may be impossible to insert an object/row with a missing field if the field is declared with a NOT NULL constraint and without a (non-null) DEFAULT value specified.
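For illustration, under schema version 4 from the example above (the class shape and the insert call mirror the snippets in this section and are assumptions):

```java
// Only id and name are mapped; the remaining columns get their defaults.
class ShortPerson {
    int id;
    String name;
}

ShortPerson p = new ShortPerson();
p.id = 2;
p.name = "Jane";

table.insert(p);
// residence -> DEFAULT 'GB', lastname -> DEFAULT 'N/A' (schema version 4).
// If a column were NOT NULL without a DEFAULT, this insert would fail.
```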
Ignite will provide an out-of-the-box mapping from standard platform types (Java, C#, C++) to the built-in primitives. A user will be able to alter this mapping using an external mechanism (e.g. annotations to map long values to Number). The standard mapping is listed in the table below:
Built-in | Java | .NET | C++ |
---|---|---|---|
Bitmask(n) | BitSet | BitArray | std::bitset |
Int8 | byte (Byte if nullable) | sbyte | int8_t |
Uint8 | short with range constraints | byte | uint8_t |
Int16 | short (Short if nullable) | short | int16_t |
Uint16 | int with range constraints | ushort | uint16_t |
Int32 | int (Integer if nullable) | int | int32_t |
Uint32 | long with range constraints | uint | uint32_t |
Int64 | long (Long if nullable) | long | int64_t |
Uint64 | BigInteger with range constraints | ulong | uint64_t |
Float | float (Float if nullable) | float | usually float |
Double | double (Double if nullable) | double | usually double |
Number([n]) | BigInteger | BigInteger | no analogue in standard |
Decimal | BigDecimal | decimal | no analogue in standard |
UUID | UUID | Guid | no analogue in standard |
String | String | string | std::string |
Date | LocalDate | NodaTime.LocalDate | no analogue in standard |
Time | LocalTime | NodaTime.LocalTime | no analogue in standard |
Datetime | LocalDateTime | NodaTime.LocalDateTime | no analogue in standard |
Timestamp | Instant | NodaTime.Instant | no analogue in standard |
Binary | byte[] | byte[] | std::array<int8_t> |
Java has no native support for unsigned types. We can still introduce an 'unsigned' flag in the schema type, or separate binary type codes, and allow mapping to the closest type of a wider range, e.g. map Uint8 → short and recheck the constraints during serialization.
If one tries to serialize an object with a short value outside the Uint8 range, serialization will fail with an exception (ColumnValueIsOutOfRangeException).
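A sketch of that recheck, with an illustrative method name (the exception name follows the text above):

```java
// Maps a Java short onto the Uint8 column type, enforcing the [0, 255] range
// at serialization time; the exception stands in for
// ColumnValueIsOutOfRangeException.
static byte serializeUint8(short value) {
    if (value < 0 || value > 255)
        throw new IllegalArgumentException("Value out of Uint8 range: " + value);
    return (byte) value; // the low 8 bits are stored and read back as unsigned
}
```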
One of the important benefits of binary objects was the ability to store objects with different sets of fields in a single cache. We can accommodate very similar behavior in the schema-first approach.
When a tuple is inserted into a table, we attempt to 'fit' the tuple fields to the schema columns. If the tuple has extra fields that are not present in the current schema, the schema is automatically updated to include them.
This will work in the same way for objects that are first-class citizens: e.g. Java objects, or objects of other languages that have a corresponding implementation.
On the other hand, if an object has fewer fields than the current schema, the schema is not updated automatically (such a scenario usually means that the update is executed from an outdated client that has not yet received the proper object class version). In other words, columns are never dropped during automatic schema evolution; a column can only be dropped by an explicit user command.
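For example (a sketch assuming a tuple-builder API; the exact shape of that API is an assumption):

```java
// Live schema mode: the tuple carries a field unknown to the current schema.
Tuple tuple = Tuple.create()
    .set("id", 5)
    .set("name", "Mary")
    .set("nickname", "May"); // not present in the current schema

table.insert(tuple);
// The schema is automatically extended with a 'nickname' column.
// A tuple with fewer fields than the schema never drops columns.
```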