KIP-481: SerDe Improvements for Connect Decimal type in JSON

Status

Current state: Under Discussion

Discussion thread: KIP-481 Discussion

JIRA: here

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Most JSON data that utilizes precise decimal data represents it as a decimal number. Connect, on the other hand, only supports a binary BASE64 string encoding (see example below). This KIP intends to support both representations so that it can better integrate with legacy systems (and make the internal topic data easier to read/debug):

  • serialize the decimal field "foo" with value "10.2345" with the BASE64 setting: {"foo": "D3J5"}
  • serialize the decimal field "foo" with value "10.2345" with the NUMERIC setting: {"foo": 10.2345}
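
To make the two representations concrete, here is a minimal sketch in Python (not Connect code). It assumes that the BASE64 form is the big-endian two's-complement bytes of the unscaled integer and that the schema carries a scale of 4. Both are assumptions made for illustration, and the helper names `encode_base64` and `serialize_field` are hypothetical. The BASE64 text it produces depends on the assumed scale, so treat it as illustrative rather than as a match for the string shown above.

```python
import base64
import json
from decimal import Decimal


def encode_base64(value: Decimal, scale: int) -> str:
    """Encode the unscaled integer as big-endian two's-complement bytes, then BASE64."""
    unscaled = int(value.scaleb(scale))
    # Enough bytes to hold the magnitude plus a sign bit.
    length = (unscaled.bit_length() + 8) // 8
    raw = unscaled.to_bytes(length, byteorder="big", signed=True)
    return base64.b64encode(raw).decode("ascii")


def serialize_field(name: str, value: Decimal, scale: int, fmt: str = "BASE64") -> str:
    """Render a one-field JSON object under the chosen decimal format."""
    if fmt == "NUMERIC":
        body = format(value, "f")  # bare JSON number, no exponent
    elif fmt == "BASE64":
        body = json.dumps(encode_base64(value, scale))  # quoted JSON string
    else:
        raise ValueError(f"unknown format: {fmt}")
    return "{" + json.dumps(name) + ": " + body + "}"


# Assumed scale of 4 for the value 10.2345.
print(serialize_field("foo", Decimal("10.2345"), scale=4, fmt="NUMERIC"))
print(serialize_field("foo", Decimal("10.2345"), scale=4, fmt="BASE64"))
```

The NUMERIC output is readable as-is, while the BASE64 output can only be interpreted by a reader that also knows the scale.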

Public Interfaces

A new configuration, json.decimal.serialization.format, will be added to the JsonConverter to control whether source converters serialize decimals in numeric or binary format. The valid values will be "BASE64" (the default, to maintain compatibility) and "NUMERIC".

Proposed Changes

The changes will be scoped almost entirely to the JsonConverter, which will be able to deserialize a DecimalNode when the schema is defined as a decimal. Namely, the converter will no longer throw an exception if the incoming data is a numeric node but the schema specifies the Decimal logical type. If json.decimal.serialization.format is set to BASE64, the serialization path will remain the same. If it is set to NUMERIC, the JSON value being serialized will be a number instead of a text value.
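
The deserialization behavior described above can be sketched as follows. This is a Python illustration rather than the JsonConverter implementation: `decode_decimal` and `read_field` are hypothetical names, the scale is assumed to come from the schema, and the BASE64 payload is assumed to hold the big-endian two's-complement bytes of the unscaled value.

```python
import base64
import json
from decimal import Decimal


def decode_decimal(node, scale: int) -> Decimal:
    """Accept either a BASE64 text node or a numeric node for a decimal field."""
    if isinstance(node, str):
        # Text node: BASE64 of the unscaled integer (assumed layout).
        raw = base64.b64decode(node, validate=True)
        unscaled = int.from_bytes(raw, byteorder="big", signed=True)
        return Decimal(unscaled).scaleb(-scale)
    if isinstance(node, (int, Decimal)) and not isinstance(node, bool):
        # Numeric node: previously an error, now read directly.
        return Decimal(node)
    raise ValueError(f"cannot read {node!r} as a decimal")


def read_field(payload: str, name: str, scale: int) -> Decimal:
    # parse_float=Decimal keeps the digits exactly as written in the JSON text.
    doc = json.loads(payload, parse_float=Decimal)
    return decode_decimal(doc[name], scale)


# Both payloads are read into a Decimal, regardless of how they were written.
print(read_field('{"foo": 10.2345}', "foo", scale=4))
print(read_field('{"foo": "AY/J"}', "foo", scale=4))
```

The reader does not need to know which setting the writer used, which is what allows an upgraded sink converter to consume both formats.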

Furthermore, the JsonDeserializer will now default floating point deserialization to BigDecimal to avoid losing precision. This may impact performance when deserializing doubles: a JMH microbenchmark on my local MBP estimated about a 3x degradation for deserializing JSON floating points.
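
The precision concern can be illustrated with a short Python sketch, where `float` plays the role of a Java double and `Decimal` the role of BigDecimal. The payload is made up for the example.

```python
import json
from decimal import Decimal

# A made-up value with more digits than a 64-bit double can hold.
payload = '{"foo": 1234567890.12345678901234567890}'

as_float = json.loads(payload)["foo"]                         # binary double
as_decimal = json.loads(payload, parse_float=Decimal)["foo"]  # arbitrary precision

# The decimal keeps every digit from the JSON text; the double does not.
print(as_decimal)
print(Decimal(as_float) == as_decimal)
```

Parsing into an arbitrary-precision type is slower than parsing into a double, which is the trade-off noted above.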

Compatibility, Deprecation, and Migration Plan

There are the following combinations that could occur during migration:

  • Legacy Source Converter, Upgraded Sink Converter: this scenario is okay, as the upgraded sink converter will be able to read the implicit BASE64 (binary) format
  • Upgraded Source Converter with NUMERIC serialization, Upgraded Sink Converter: this scenario is okay, as the upgraded sink converter will be able to read (deserialize) the numeric serialization
  • Upgraded Source Converter with BASE64 serialization, Legacy Sink Converter: this scenario is okay, as the upgraded source converter will serialize binary data in BASE64 format exactly as it does today
  • Upgraded Source Converter with NUMERIC serialization, Legacy Sink Converter: this is the only scenario that is not okay and will cause issues, since the legacy sink converter cannot deserialize NUMERIC data

Because of this, users must first ensure that all sink connectors have been upgraded to the new converter code before upgrading source connectors to use the NUMERIC serialization format in JsonConverter.
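
The four combinations above reduce to a single rule, sketched here in Python. The function name `sink_can_read` is hypothetical, and a legacy source converter is modeled as one that always writes BASE64.

```python
def sink_can_read(source_format: str, sink_upgraded: bool) -> bool:
    """Return True if the sink converter can deserialize what the source wrote."""
    if source_format == "BASE64":
        return True           # every sink converter, legacy or upgraded, reads BASE64
    if source_format == "NUMERIC":
        return sink_upgraded  # only upgraded sink converters read numeric nodes
    raise ValueError(f"unknown format: {source_format}")


scenarios = [
    ("legacy source (BASE64), upgraded sink", "BASE64", True),
    ("upgraded source (NUMERIC), upgraded sink", "NUMERIC", True),
    ("upgraded source (BASE64), legacy sink", "BASE64", False),
    ("upgraded source (NUMERIC), legacy sink", "NUMERIC", False),
]
for label, fmt, upgraded in scenarios:
    print(label, "->", "ok" if sink_can_read(fmt, upgraded) else "NOT ok")
```

Only the last scenario fails, which is why sink converters need to be upgraded before any source converter switches to NUMERIC.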

There is also the concern of the data format changing in the middle of the stream:

  • Legacy → Upgraded BASE64: this will not cause any change in the data in the topic
  • Legacy → Upgraded NUMERIC: this will cause all new values to be serialized using the NUMERIC format and will cause issues unless sink converters are upgraded
  • Upgraded BASE64 → Upgraded NUMERIC: this is identical to above
  • Upgraded NUMERIC → Upgraded BASE64: this will not cause a new issue, since if the NUMERIC format was already working, all sink converters have been upgraded and can read the BASE64 format as well
  • Upgraded NUMERIC → (Rollback) Legacy: this is identical to above

Rejected Alternatives

  • The original KIP suggested supporting an additional representation: base10 encoded text (e.g. `{"asText":"10.2345"}`). While it is possible to automatically differentiate NUMERIC from BASE10 and BASE64, it is not always possible to differentiate BASE10 from BASE64. Take, for example, the string "12": it is both a valid decimal (12) and a valid hex string which represents a different decimal (0x12 = 18, i.e. 1.8 with a scale of 1). It is therefore impossible to disambiguate BASE10 from BASE64 without an additional config. Furthermore, this makes the migration from one to the other nearly impossible: it would require either that all consumers stop consuming and all producers stop producing while the config is atomically updated on all of them after deploying the new code, or waiting for the full retention period to pass. Neither option is viable. The suggestion in the KIP is strictly an improvement over the existing behavior, even if it doesn't support all combinations.
  • Encoding the serialization format in the schema for the Decimal LogicalType. This is attractive because the deserializer would be able to decode based on the schema, and one converter could handle topics encoded differently as long as the schema is in line. The problem is that this is specific to JSON, and the LogicalType is not the right place for a JSON-specific change.
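
The ambiguity behind the first rejected alternative can be reproduced with a short Python sketch. It uses the string "1234" rather than "12" because Python's strict BASE64 decoder expects input in multiples of four characters. The scale of 4 and the two's-complement layout are assumptions made for illustration, and the helper names are hypothetical.

```python
import base64
from decimal import Decimal


def read_as_base10(text: str) -> Decimal:
    """Interpret the text as a plain base10 decimal."""
    return Decimal(text)


def read_as_base64(text: str, scale: int) -> Decimal:
    """Interpret the text as BASE64 of the unscaled two's-complement integer."""
    raw = base64.b64decode(text, validate=True)
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    return Decimal(unscaled).scaleb(-scale)


text = "1234"  # valid under both interpretations
print(read_as_base10(text))
print(read_as_base64(text, scale=4))
```

Both calls succeed and return different values, so a reader has no way to tell from the text alone which encoding the writer intended.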