Clickhouse sink: ArrowStream format fails schema validation on MATERIALIZED/DEFAULT/ALIAS columns #24667

@vinzee

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Summary

The ArrowStream format for the ClickHouse sink (introduced in PR #24373) fetches the
full table schema from system.columns but does not filter out MATERIALIZED,
DEFAULT, or ALIAS columns. These server-computed columns are included in the
Arrow schema and validated as if the client must provide them. When events arrive
without values for these columns, the encoder raises SchemaConstraintViolation
and drops the events.

Root Cause

In the clickhouse sink, src/sinks/clickhouse/arrow/schema.rs uses the following schema fetch query:

SELECT name, type
FROM system.columns
WHERE database = {db:String} AND table = {tbl:String}
ORDER BY position

This returns ALL columns, including MATERIALIZED/DEFAULT/ALIAS. The ColumnInfo
struct only deserializes name and type -- there is no default_kind field.
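
For illustration, here is a minimal Rust sketch of what fetching default_kind alongside name and type could look like. SCHEMA_QUERY, DefaultKind, and DefaultKind::parse are hypothetical names, not Vector's actual code. The recognized values are the ones listed in this issue; anything else is treated as unknown.

```rust
/// Hypothetical variant of the schema fetch query that also selects default_kind.
pub const SCHEMA_QUERY: &str = "SELECT name, type, default_kind \
FROM system.columns \
WHERE database = {db:String} AND table = {tbl:String} \
ORDER BY position";

/// How a column gets its value, based on system.columns.default_kind.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DefaultKind {
    /// Empty default_kind: a regular column the client must provide.
    Regular,
    Default,
    Materialized,
    Alias,
}

impl DefaultKind {
    /// Parses the raw default_kind string; returns None for values not covered here.
    pub fn parse(raw: &str) -> Option<Self> {
        match raw {
            "" => Some(Self::Regular),
            "DEFAULT" => Some(Self::Default),
            "MATERIALIZED" => Some(Self::Materialized),
            "ALIAS" => Some(Self::Alias),
            _ => None,
        }
    }
}
```

Returning None for unrecognized values is a design choice in this sketch: it leaves it to the caller to decide whether an unknown kind should be validated strictly or skipped.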

When the Arrow encoder serializes events, any column that is present in the schema
but absent from the event is encoded as null. For non-nullable columns (which
MATERIALIZED columns are unless their type is declared Nullable), that null
triggers SchemaConstraintViolation.
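
The failure mode can be modelled in a few lines of Rust. This is only an illustrative model, assuming validation amounts to "every non-nullable schema field must be present in the event". Field and missing_required are hypothetical names and do not mirror Vector's encoder.

```rust
use std::collections::HashMap;

/// One field of the fetched table schema (hypothetical shape).
#[derive(Debug, Clone)]
pub struct Field {
    pub name: String,
    pub nullable: bool,
}

/// Returns the names of non-nullable schema fields that the event does not supply.
/// In this model, each returned name would be reported as a constraint violation.
pub fn missing_required<'a>(
    schema: &'a [Field],
    event: &HashMap<String, String>,
) -> Vec<&'a str> {
    schema
        .iter()
        .filter(|f| !f.nullable && !event.contains_key(&f.name))
        .map(|f| f.name.as_str())
        .collect()
}
```

With a schema that contains non-nullable Timestamp and TimestampTime fields and an event that only carries Timestamp, this model flags TimestampTime, even though the server would have computed it.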

Reproduction

  1. Create a ClickHouse table with MATERIALIZED and DEFAULT columns:
CREATE TABLE test.logs (
    Timestamp DateTime64(9),
    TimestampTime DateTime DEFAULT toDateTime(Timestamp),
    ServiceName String,
    Body String,
    ResourceAttributes Map(String, String),
    `k8s.cluster.name` String MATERIALIZED ResourceAttributes['k8s.cluster.name']
) ENGINE = MergeTree ORDER BY (ServiceName, Timestamp);
  2. Configure Vector's ClickHouse sink to write to this table.

  3. Send events that only include the non-computed columns (Timestamp, ServiceName,
    Body, ResourceAttributes). The client does NOT send TimestampTime or
    k8s.cluster.name because the server computes them.

  4. Vector rejects the events with SchemaConstraintViolation before they reach
    ClickHouse.

Expected Behavior

Vector should skip null-constraint validation for columns whose default_kind is
MATERIALIZED, DEFAULT, or ALIAS. The server computes these columns, so they
are not expected in the client payload.

The ClickHouse system table system.columns has a default_kind field that
indicates whether a column is MATERIALIZED, DEFAULT, ALIAS, or empty
(regular column). Vector's schema fetcher should use this to exclude
server-computed columns from null validation.
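
A possible shape for that exclusion, as a hedged Rust sketch. Column, is_server_computed, and columns_to_validate are hypothetical names, and the sketch assumes default_kind has already been read from system.columns as a plain string. It treats MATERIALIZED, DEFAULT, and ALIAS as server-computed, following the title of this issue.

```rust
/// Column metadata including default_kind (hypothetical struct).
#[derive(Debug, Clone, PartialEq)]
pub struct Column {
    pub name: String,
    pub nullable: bool,
    /// Raw value of system.columns.default_kind; empty for regular columns.
    pub default_kind: String,
}

/// True when the server fills the column in, so the client may omit it.
pub fn is_server_computed(default_kind: &str) -> bool {
    matches!(default_kind, "MATERIALIZED" | "DEFAULT" | "ALIAS")
}

/// Returns only the columns that should take part in null validation.
pub fn columns_to_validate(columns: &[Column]) -> Vec<&Column> {
    columns
        .iter()
        .filter(|c| !is_server_computed(&c.default_kind))
        .collect()
}
```

Whether DEFAULT columns should be dropped from the Arrow schema or merely exempted from the null check is left open here, since a client may still want to send an explicit value for them.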

Workaround

  1. Set batch_encoding.allow_nullable_fields: true (confirmed working).
    This forces ALL Arrow fields to nullable, bypassing every non-null constraint. The trade-off is that Vector no longer catches genuine null-value errors client-side for legitimately non-nullable columns.

  2. Use format: json_each_row (the default) instead of arrow_stream.

  3. Use Vector 0.52.0-debian, which does not have the ArrowStream format
    (and therefore performs no client-side schema validation).

Configuration

sinks:
  ...
  clickhouse_cloud:
    type: clickhouse
    batch_encoding:
      codec: arrow_stream
      allow_nullable_fields: true
    format: arrow_stream
    ...

Version

nightly-2026-02-14-debian

Debug Output

SchemaConstraintViolation(Null value for non-nullable field 'TimestampTime')
SchemaConstraintViolation(Null value for non-nullable field 'k8s.cluster.name')
