DUE TO SPAM, SIGN-UP IS DISABLED. Goto Selfserve wiki signup and request an account.
1 Motivation
CAST(bytes AS STRING) silently replaces any invalid UTF-8 byte with U+FFFD (?). The substitution is irreversible and produces no warning - the pipeline keeps running while data is permanently corrupted downstream.
A correct cast from BYTES → STRING → BYTES should be idempotent. The current silent substitution breaks this property, and with it, some engine optimizations aren't possible. For example, the injective guarantees needed for upsert key propagation through byte-to-string casts, see FLINK-39125.
The fix has two parts:
- Add two helper functions so users can handle invalid bytes explicitly before the behavior changes.
- Change
CAST(bytes AS STRING)to raise an error on invalid UTF-8.
Apache Spark 4.0 took the same approach, introducing functions like is_valid_utf8 and make_valid_utf8 (SPARK-49957). The community is now also actively discussing making CAST(binary AS STRING) strict by default (SPARK-54586).
2 Public Interfaces
Two new built-in scalar SQL functions:
IS_VALID_UTF8(bytes) → BOOLEAN
Returns TRUE if the byte sequence is valid UTF-8, FALSE otherwise. Returns NULL for NULL input.
IS_VALID_UTF8(x'48656C6C6F') -- TRUE ("Hello")
IS_VALID_UTF8(x'80') -- FALSE (invalid start byte)
MAKE_VALID_UTF8(bytes) → STRING
Decodes to string, replacing invalid bytes with U+FFFD. This makes the current implicit CAST behavior an explicit, opt-in choice.
MAKE_VALID_UTF8(x'48656C6C6F') -- 'Hello'
MAKE_VALID_UTF8(x'80') -- '?' (replacement character, intentional)
Both return NULL for NULL input.
StringData.fromUtf8Bytes(byte[]) - new connector API
A new method on StringData (@PublicEvolving) for connector authors who want to validate UTF-8 at ingestion time:
StringData.fromUtf8Bytes(bytes)- validates and throws on invalid UTF-8StringData.fromBytes(bytes)- unchanged, no validation (backward compatible)
3 Proposed Changes
3.1 New functions
The two functions above are added as built-in scalar functions, accepting BYTES / VARBINARY input.
3.2 Change CAST(bytes AS STRING) default behavior
CAST(bytes AS STRING) will raise a runtime error when the input contains invalid UTF-8 sequences, eliminating the silent corruption class of bugs entirely.
A direct consequence: since TRY_CAST catches runtime errors and returns NULL, TRY_CAST(bytes AS STRING) will return NULL for invalid inputs after this change.
Full behavior summary after this FLIP:
| Expression | Invalid UTF-8 input | Spark equivalent |
|---|---|---|
CAST(bytes AS STRING) | throws error | validate_utf8 |
TRY_CAST(bytes AS STRING) | returns NULL | try_validate_utf8 |
IS_VALID_UTF8(bytes) | returns FALSE (utility predicate) | is_valid_utf8 |
MAKE_VALID_UTF8(bytes) | returns string with U+FFFD (explicit opt-in to old behavior) | make_valid_utf8 |
Users migrating from the current behavior:
| Desired behavior | Replace with |
|---|---|
| Null on invalid (safe pipeline) | TRY_CAST(bytes AS STRING) |
Replace with ? (current CAST behavior, opt-in) | MAKE_VALID_UTF8(bytes) |
| Route bad records to DLQ | IS_VALID_UTF8 in a WHERE clause |
DLQ routing example:
-- valid records
INSERT INTO sink
SELECT CAST(raw_bytes AS STRING) FROM source
WHERE IS_VALID_UTF8(raw_bytes);
-- invalid records routed separately
INSERT INTO dead_letter_queue
SELECT raw_bytes FROM source
WHERE NOT IS_VALID_UTF8(raw_bytes);
3.3 Connector API (StringData)
StringData.fromBytes() is @PublicEvolving and used directly by many connectors - it bypasses the SQL CAST path entirely, so the CAST change does not affect it. It is left unchanged.
A new method StringData.fromUtf8Bytes(byte[]) is added as a strict alternative: it validates the bytes and throws on invalid UTF-8. Connector authors can migrate to it at their own pace. StringByteArrayConverter is @Internal and a change can follow adding a strict variant as well.
4 Compatibility, Deprecation, and Migration Plan
The change to CAST(bytes AS STRING) is a breaking change. Existing pipelines that rely on silent U+FFFD substitution will start throwing errors.
Migration path - SQL users:
CAST(bytes AS STRING)→MAKE_VALID_UTF8(bytes)to preserve current behaviorCAST(bytes AS STRING)→TRY_CAST(bytes AS STRING)for null-on-invalid semanticsCAST(bytes AS STRING)→CAST(bytes AS STRING)new functionality to now catch errors and fix data upstream
Migration path - connector authors:
StringData.fromBytes(bytes)is unchanged and keeps workingStringData.fromUtf8Bytes(bytes)is available as an opt-in strict alternative; no forced migration
Rollout:
- Ship the two functions first (additive, no impact).
- Then, change
CASTto raise an error. A configuration option restores the old silent-substitution behavior for users that need time to migrate:
table.exec.legacy-bytes-to-string-cast = true (default: false)
5 Test Plan
Tests cover IS_VALID_UTF8, MAKE_VALID_UTF8, the new CAST error behavior, and StringData.fromUtf8Bytes() for the following cases:
NULLinput- Valid ASCII
- Valid multi-byte UTF-8 (emoji, CJK)
- Invalid start byte (
0x80) - Truncated multi-byte sequence
- Overlong encoding (
0xC0 0x80) - Surrogate pair bytes
- Empty byte array
6 Rejected Alternatives
Adding TRY_VALIDATE_UTF8 as a dedicated function
We considered adding a TRY_VALIDATE_UTF8 function (null-on-invalid) to mirror Spark's API. Since TRY_CAST(bytes AS STRING) already returns NULL for invalid inputs after the CAST behavior change, a dedicated function would be redundant. TRY_CAST is the idiomatic path.
Single function with a mode argument (SAFE_CAST_UTF8(bytes, 'NULL' | 'REPLACE'))
Mode string arguments are not type-safe and make queries harder to read. The three-function approach also matches the Spark 4.0 API, which lowers the adoption barrier for users coming from that ecosystem.
Naming alternatives for the two functions
IS_VALID_UTF8 - no strong alternative found; the name is self-evident.
MAKE_VALID_UTF8 - alternatives considered:
FROM_UTF8(Trino-style) - short and clean, but replacement behavior is implicitDECODE_UTF8(aligns with Flink'sURL_DECODE) - signals conversion, but replacement is still implicitSANITIZE_UTF8- "sanitize" is widely understood in data/security as "clean up bad data"
MAKE_VALID_UTF8 was kept for alignment with Spark's naming, which makes the intent (explicitly choosing lossy replacement) most apparent.