A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
Summary
The ArrowStream format for the ClickHouse sink (introduced in PR #24373) fetches the
full table schema from system.columns but does not filter out MATERIALIZED,
DEFAULT, or ALIAS columns. These server-computed columns are included in the
Arrow schema and validated as if the client must provide them. When events arrive
without values for these columns, the encoder raises SchemaConstraintViolation
and drops the events.
Root Cause
In the clickhouse sink, src/sinks/clickhouse/arrow/schema.rs uses the following schema fetch query:
SELECT name, type
FROM system.columns
WHERE database = {db:String} AND table = {tbl:String}
ORDER BY position
This returns ALL columns, including MATERIALIZED/DEFAULT/ALIAS. The ColumnInfo
struct only deserializes name and type -- there is no default_kind field.
When the Arrow encoder serializes events, any column present in the schema but
absent from the event is encoded as null. Non-nullable columns (such as the
DEFAULT and MATERIALIZED columns in the reproduction below) then trigger
SchemaConstraintViolation.
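To make the gap concrete, here is a minimal sketch of how default_kind could be carried alongside name and type. This is not Vector's actual code: the SCHEMA_QUERY constant, the DefaultKind enum, the column_type and default_kind field names, and the parse_default_kind and from_row helpers are all hypothetical, and serde deserialization is left out so the example has no dependencies. The only ClickHouse behavior assumed is that system.columns.default_kind is an empty string for regular columns and DEFAULT, MATERIALIZED, or ALIAS otherwise.

```rust
/// Hypothetical amended query: identical to the current one, plus `default_kind`.
pub const SCHEMA_QUERY: &str = "SELECT name, type, default_kind \
FROM system.columns \
WHERE database = {db:String} AND table = {tbl:String} \
ORDER BY position";

/// How ClickHouse fills a column when the client does not send it.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DefaultKind {
    /// Empty `default_kind`: a regular column with no default expression.
    Regular,
    Default,
    Materialized,
    Alias,
    /// Any value this sketch does not model, kept verbatim.
    Other(String),
}

/// Maps the raw `default_kind` string from `system.columns` to the enum.
pub fn parse_default_kind(raw: &str) -> DefaultKind {
    match raw {
        "" => DefaultKind::Regular,
        "DEFAULT" => DefaultKind::Default,
        "MATERIALIZED" => DefaultKind::Materialized,
        "ALIAS" => DefaultKind::Alias,
        other => DefaultKind::Other(other.to_string()),
    }
}

/// Illustrative stand-in for the sink's `ColumnInfo`, extended with `default_kind`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ColumnInfo {
    pub name: String,
    pub column_type: String,
    pub default_kind: DefaultKind,
}

impl ColumnInfo {
    /// Builds a `ColumnInfo` from one (name, type, default_kind) result row.
    pub fn from_row(name: &str, column_type: &str, default_kind: &str) -> Self {
        ColumnInfo {
            name: name.to_string(),
            column_type: column_type.to_string(),
            default_kind: parse_default_kind(default_kind),
        }
    }
}
```

With something like this in place, the encoder would at least be able to tell a regular column apart from one the server computes.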
Reproduction
- Create a ClickHouse table with MATERIALIZED and DEFAULT columns:
CREATE TABLE test.logs (
    Timestamp DateTime64(9),
    TimestampTime DateTime DEFAULT toDateTime(Timestamp),
    ServiceName String,
    Body String,
    ResourceAttributes Map(String, String),
    `k8s.cluster.name` String MATERIALIZED ResourceAttributes['k8s.cluster.name']
) ENGINE = MergeTree ORDER BY (ServiceName, Timestamp);
- Configure Vector's ClickHouse sink to write to this table.
- Send events that only include the non-computed columns (Timestamp, ServiceName,
Body, ResourceAttributes). The client does NOT send TimestampTime or
k8s.cluster.name because the server computes them.
- Vector rejects the events with SchemaConstraintViolation before they reach
ClickHouse.
Expected Behavior
When checking null constraints, Vector should skip columns whose default_kind
is MATERIALIZED or DEFAULT. These columns are server-computed and are not
expected in the client payload.
The ClickHouse system table system.columns has a default_kind field that
indicates whether a column is MATERIALIZED, DEFAULT, ALIAS, or empty
(regular column). Vector's schema fetcher should use this to exclude
server-computed columns from null validation.
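The check being asked for could look roughly like the sketch below. It is only an illustration under simplifying assumptions, not the sink's real encoder: Column, is_server_computed, and missing_required are hypothetical names, events are modeled as a plain map of field name to value, and ALIAS is treated as server-computed alongside MATERIALIZED and DEFAULT (an assumption that goes slightly beyond the two kinds named above).

```rust
use std::collections::HashMap;

/// Simplified stand-in for one schema column; not Vector's real type.
pub struct Column {
    pub name: &'static str,
    pub nullable: bool,
    /// Raw `default_kind` from `system.columns` ("" for regular columns).
    pub default_kind: &'static str,
}

/// True when ClickHouse computes the column itself if the client omits it.
pub fn is_server_computed(col: &Column) -> bool {
    matches!(col.default_kind, "DEFAULT" | "MATERIALIZED" | "ALIAS")
}

/// Returns the non-nullable, client-supplied columns that `event` is missing.
/// Server-computed columns are skipped, so their absence is not a violation.
pub fn missing_required(schema: &[Column], event: &HashMap<&str, &str>) -> Vec<&'static str> {
    schema
        .iter()
        .filter(|c| !c.nullable && !is_server_computed(c))
        .filter(|c| !event.contains_key(c.name))
        .map(|c| c.name)
        .collect()
}
```

For the test.logs table from the reproduction, an event carrying only Timestamp, ServiceName, Body, and ResourceAttributes would produce an empty list here, while an event that also lacks Body would still be reported.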
Workaround
- Set batch_encoding.allow_nullable_fields: true (confirmed working):
This forces ALL Arrow fields to nullable, bypassing any non-null constraints. Vector will no longer catch genuine null-value errors client-side for legitimately non-nullable columns.
- Use format: json_each_row (the default) instead of arrow_stream.
- Use Vector 0.52.0-debian which does not have the ArrowStream format
(and therefore no client-side schema validation).
Configuration
sinks:
  ...
  clickhouse_cloud:
    type: clickhouse
    batch_encoding:
      codec: arrow_stream
      allow_nullable_fields: true
    format: arrow_stream
    ...
Version
nightly-2026-02-14-debian
Debug Output
SchemaConstraintViolation(Null value for non-nullable field 'TimestampTime')
SchemaConstraintViolation(Null value for non-nullable field 'k8s.cluster.name')
Example Data
No response
Additional Context
No response
References
- #24373 (ArrowStream format): enhancement(clickhouse sink): Add ArrowStream format