1 Motivation

CAST(bytes AS STRING) silently replaces any invalid UTF-8 byte with U+FFFD (?). The substitution is irreversible and produces no warning - the pipeline keeps running while data is permanently corrupted downstream.

A correct cast chain BYTES → STRING → BYTES should round-trip losslessly and return the original bytes. The current silent substitution breaks this property and, with it, rules out some engine optimizations. For example, propagating upsert keys through byte-to-string casts requires the cast to be injective (see FLINK-39125).
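The loss can be reproduced outside Flink with a lenient JDK decode. The sketch below is only an illustration using standard Java APIs, not Flink's cast code path; the class and method names are hypothetical. It shows two different invalid inputs collapsing to the same string, which is why the cast is not injective.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LossyRoundTrip {
    /** Decodes leniently, then re-encodes to UTF-8 (BYTES -> STRING -> BYTES). */
    static byte[] roundTrip(byte[] input) {
        // new String(byte[], Charset) replaces malformed input with U+FFFD.
        String decoded = new String(input, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] a = {(byte) 0x80};
        byte[] b = {(byte) 0xFF};
        // Each invalid byte becomes U+FFFD, which re-encodes as EF BF BD.
        System.out.println(Arrays.equals(roundTrip(a), a));            // false
        System.out.println(Arrays.equals(roundTrip(a), roundTrip(b))); // true
    }
}
```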

The fix has two parts:

  1. Add two helper functions so users can handle invalid bytes explicitly before the behavior changes.
  2. Change CAST(bytes AS STRING) to raise an error on invalid UTF-8.

Apache Spark 4.0 took the same approach, introducing functions like is_valid_utf8 and make_valid_utf8 (SPARK-49957). The community is now also actively discussing making CAST(binary AS STRING) strict by default (SPARK-54586).


2 Public Interfaces

Two new built-in scalar SQL functions:

IS_VALID_UTF8(bytes) → BOOLEAN

Returns TRUE if the byte sequence is valid UTF-8, FALSE otherwise. Returns NULL for NULL input.

IS_VALID_UTF8(x'48656C6C6F')  -- TRUE  ("Hello")
IS_VALID_UTF8(x'80')          -- FALSE (invalid start byte)
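
The semantics above can be approximated with the JDK's strict decoder. This is a minimal sketch under the assumption that validity means "well-formed UTF-8 as judged by java.nio's UTF-8 decoder"; the helper name is hypothetical and the actual Flink implementation may differ.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Validity {
    /** Hypothetical helper mirroring IS_VALID_UTF8: null in, null out. */
    static Boolean isValidUtf8(byte[] bytes) {
        if (bytes == null) {
            return null;
        }
        try {
            // REPORT makes the decoder throw instead of substituting U+FFFD.
            StandardCharsets.UTF_8
                    .newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8(new byte[] {0x48, 0x65, 0x6C, 0x6C, 0x6F})); // true
        System.out.println(isValidUtf8(new byte[] {(byte) 0x80}));                   // false
        System.out.println(isValidUtf8(null));                                       // null
    }
}
```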

MAKE_VALID_UTF8(bytes) → STRING

Decodes to string, replacing invalid bytes with U+FFFD. This makes the current implicit CAST behavior an explicit, opt-in choice.

MAKE_VALID_UTF8(x'48656C6C6F')  -- 'Hello'
MAKE_VALID_UTF8(x'80')          -- '?' (replacement character, intentional)

Both return NULL for NULL input.
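
For comparison, a lenient decode in plain Java behaves the way MAKE_VALID_UTF8 is described. The sketch assumes JDK replacement semantics and uses a hypothetical helper name. How many U+FFFD characters a longer malformed run produces depends on the decoder, and this document does not specify it for Flink.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class Utf8Replace {
    /** Hypothetical helper mirroring MAKE_VALID_UTF8: null in, null out. */
    static String makeValidUtf8(byte[] bytes) {
        if (bytes == null) {
            return null;
        }
        // Charset.decode always replaces malformed input with the charset's
        // replacement string, which is U+FFFD for UTF-8.
        return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        String hello = makeValidUtf8(new byte[] {0x48, 0x65, 0x6C, 0x6C, 0x6F});
        String mixed = makeValidUtf8(new byte[] {0x48, (byte) 0x80, 0x69});
        System.out.println(hello.equals("Hello"));     // true
        System.out.println(mixed.equals("H\uFFFDi"));  // true
    }
}
```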

StringData.fromUtf8Bytes(byte[]) - new connector API

A new method on StringData (@PublicEvolving) for connector authors who want to validate UTF-8 at ingestion time:

  • StringData.fromUtf8Bytes(bytes) - validates and throws on invalid UTF-8
  • StringData.fromBytes(bytes) - unchanged, no validation (backward compatible)

3 Proposed Changes

3.1 New functions

The two functions above are added as built-in scalar functions, accepting BYTES / VARBINARY input.

3.2 Change CAST(bytes AS STRING) default behavior

CAST(bytes AS STRING) will raise a runtime error when the input contains invalid UTF-8 sequences, eliminating this class of silent-corruption bugs.

A direct consequence: since TRY_CAST catches runtime errors and returns NULL, TRY_CAST(bytes AS STRING) will return NULL for invalid inputs after this change.
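
The relationship between the two casts can be sketched in plain Java: the strict variant throws, and the TRY variant wraps it and maps the error to null. The method names are hypothetical and the exception type is a JDK stand-in; the error Flink will actually raise is not specified here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictCastSketch {
    /** Stand-in for the proposed CAST: throws on invalid UTF-8. */
    static String castStrict(byte[] bytes) throws CharacterCodingException {
        if (bytes == null) {
            return null;
        }
        return StandardCharsets.UTF_8
                .newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    /** Stand-in for TRY_CAST: same decode, but errors become null. */
    static String tryCast(byte[] bytes) {
        try {
            return castStrict(bytes);
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(tryCast(new byte[] {0x48, 0x69}));  // Hi
        System.out.println(tryCast(new byte[] {(byte) 0x80})); // null
        try {
            castStrict(new byte[] {(byte) 0x80});
        } catch (CharacterCodingException e) {
            System.out.println("strict cast rejected the input");
        }
    }
}
```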

Full behavior summary after this FLIP:

| Expression | Invalid UTF-8 input | Spark equivalent |
|---|---|---|
| CAST(bytes AS STRING) | throws error | validate_utf8 |
| TRY_CAST(bytes AS STRING) | returns NULL | try_validate_utf8 |
| IS_VALID_UTF8(bytes) | returns FALSE (utility predicate) | is_valid_utf8 |
| MAKE_VALID_UTF8(bytes) | returns string with U+FFFD (explicit opt-in to old behavior) | make_valid_utf8 |

Users migrating from the current behavior:

Desired behaviorReplace with
Null on invalid (safe pipeline)TRY_CAST(bytes AS STRING)
Replace with ? (current CAST behavior, opt-in)MAKE_VALID_UTF8(bytes)
Route bad records to DLQIS_VALID_UTF8 in a WHERE clause

DLQ routing example:

-- valid records
INSERT INTO sink
SELECT CAST(raw_bytes AS STRING) FROM source
WHERE IS_VALID_UTF8(raw_bytes);

-- invalid records routed separately
INSERT INTO dead_letter_queue
SELECT raw_bytes FROM source
WHERE NOT IS_VALID_UTF8(raw_bytes);

3.3 Connector API (StringData)

StringData.fromBytes() is @PublicEvolving and used directly by many connectors - it bypasses the SQL CAST path entirely, so the CAST change does not affect it. It is left unchanged.

A new method StringData.fromUtf8Bytes(byte[]) is added as a strict alternative: it validates the bytes and throws on invalid UTF-8. Connector authors can migrate to it at their own pace. StringByteArrayConverter is @Internal, so a strict variant can be added there in a follow-up change.


4 Compatibility, Deprecation, and Migration Plan

The change to CAST(bytes AS STRING) is a breaking change. Existing pipelines that rely on silent U+FFFD substitution will start throwing errors.

Migration path - SQL users:

  • CAST(bytes AS STRING) → MAKE_VALID_UTF8(bytes) to preserve current behavior
  • CAST(bytes AS STRING) → TRY_CAST(bytes AS STRING) for null-on-invalid semantics
  • CAST(bytes AS STRING) → CAST(bytes AS STRING) (no rewrite) to surface errors and fix the data upstream

Migration path - connector authors:

  • StringData.fromBytes(bytes) is unchanged and keeps working
  • StringData.fromUtf8Bytes(bytes) is available as an opt-in strict alternative; no forced migration

Rollout:

  1. Ship the two functions first (additive, no impact).
  2. Then, change CAST to raise an error. A configuration option restores the old silent-substitution behavior for users that need time to migrate:
table.exec.legacy-bytes-to-string-cast = true  (default: false)

5 Test Plan

Tests cover IS_VALID_UTF8, MAKE_VALID_UTF8, the new CAST error behavior, and StringData.fromUtf8Bytes() for the following cases:

  • NULL input
  • Valid ASCII
  • Valid multi-byte UTF-8 (emoji, CJK)
  • Invalid start byte (0x80)
  • Truncated multi-byte sequence
  • Overlong encoding (0xC0 0x80)
  • UTF-8-encoded surrogate code points (e.g. 0xED 0xA0 0x80)
  • Empty byte array
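
The non-NULL cases above can be pinned down as concrete byte vectors. The sketch below picks one illustrative input per case and checks it against the JDK's strict UTF-8 decoder; the specific byte values and the class name are assumptions for illustration, not the FLIP's actual test fixtures.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class Utf8TestVectors {
    static byte[] bytes(int... values) {
        byte[] out = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (byte) values[i];
        }
        return out;
    }

    /** Strict check: true only if the JDK decoder accepts every byte. */
    static boolean isValid(byte[] input) {
        try {
            StandardCharsets.UTF_8
                    .newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    static Map<String, byte[]> vectors() {
        Map<String, byte[]> m = new LinkedHashMap<>();
        m.put("valid ASCII", bytes(0x48, 0x65, 0x6C, 0x6C, 0x6F));             // "Hello"
        m.put("valid emoji", bytes(0xF0, 0x9F, 0x98, 0x80));                   // U+1F600
        m.put("valid CJK", bytes(0xE4, 0xB8, 0xAD));                           // U+4E2D
        m.put("invalid start byte", bytes(0x80));
        m.put("truncated sequence", bytes(0xE4, 0xB8));                        // missing last byte
        m.put("overlong encoding", bytes(0xC0, 0x80));
        m.put("encoded surrogates", bytes(0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x80));
        m.put("empty array", bytes());
        return m;
    }

    public static void main(String[] args) {
        // The three "valid ..." cases and the empty array print true; the rest print false.
        vectors().forEach((name, input) -> System.out.println(name + " -> " + isValid(input)));
    }
}
```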

6 Rejected Alternatives

Adding TRY_VALIDATE_UTF8 as a dedicated function

We considered adding a TRY_VALIDATE_UTF8 function (null-on-invalid) to mirror Spark's API. Since TRY_CAST(bytes AS STRING) already returns NULL for invalid inputs after the CAST behavior change, a dedicated function would be redundant. TRY_CAST is the idiomatic path.

Single function with a mode argument (SAFE_CAST_UTF8(bytes, 'NULL' | 'REPLACE'))

Mode string arguments are not type-safe and make queries harder to read. Separate, single-purpose functions also match the Spark 4.0 naming, which lowers the adoption barrier for users coming from that ecosystem.

Naming alternatives for the two functions

IS_VALID_UTF8 - no strong alternative found; the name is self-evident.

MAKE_VALID_UTF8 - alternatives considered:

  • FROM_UTF8 (Trino-style) - short and clean, but replacement behavior is implicit
  • DECODE_UTF8 (aligns with Flink's URL_DECODE) - signals conversion, but replacement is still implicit
  • SANITIZE_UTF8 - "sanitize" is widely understood in data/security as "clean up bad data"

MAKE_VALID_UTF8 was kept for alignment with Spark's naming, which makes the intent (explicitly choosing lossy replacement) most apparent.

