Affected test: /swarms/feature/node failure/check restart clickhouse on swarm node
Affected files:
swarms/tests/node_failure.py
swarms/tests/steps/swarm_node_actions.py
Description
The check restart clickhouse on swarm node test fails consistently on ClickHouse Antalya 26.1 (100% fail rate) while passing on Antalya 25.8. The test verifies that a swarm cluster query fails with exitcode=32 (Attempt to read after eof) when a ClickHouse process is killed on a swarm node during query execution.
On 26.1, the query either succeeds when it shouldn't (with SEGV signal) or hangs indefinitely (with KILL signal), instead of failing quickly with an EOF error as it does on 25.8.
Analysis
Test behavior by version and signal

| Version | Signal | Behavior | Test result |
| --- | --- | --- | --- |
| 25.8 | KILL | Query fails in ~5s with `Code: 32. DB::Exception: Attempt to read after eof` | OK (consistent) |
| 26.1 (current main branch) | SEGV | Query completes successfully — tasks are redistributed to the surviving node, returns `100 clickhouse2` with exitcode 0 | Fail (AssertionError — exitcode 0 ≠ 32) |
| 26.1 (PR 1520 / newer regression commits) | KILL | Query hangs for 600s until bash timeout | Error (ExpectTimeoutError) |
Two separate issues
1. SEGV on 26.1: query survives node failure (AssertionError) — historical, test already updated
With SEGV, the ClickHouse process takes time to die (core dump generation). During this window, the TaskDistributor detects the node going down and redistributes its pending tasks to the surviving node. The query completes successfully with only one node's results. The test code was updated in commit f1827080b to use KILL instead of SEGV, so this failure mode will no longer occur once the regression commit hash is updated on the main branch.
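The redistribution behavior can be sketched with a toy model (plain Python with hypothetical names; this only illustrates the mechanism described above, it is not ClickHouse code): when a replica is lost, its pending tasks are handed to the survivors, so the query can still finish with one node's results.

```python
# Toy model of pending-task redistribution on replica loss.
# All names are hypothetical; this is not ClickHouse code.

def redistribute(pending_by_replica, lost_replica):
    """Move the lost replica's pending tasks to the surviving replicas."""
    orphaned = pending_by_replica.pop(lost_replica, [])
    survivors = list(pending_by_replica)
    for i, task in enumerate(orphaned):
        # Round-robin the orphaned tasks over the survivors.
        pending_by_replica[survivors[i % len(survivors)]].append(task)
    return pending_by_replica

pending = {
    "clickhouse1": ["file_a", "file_b"],   # node that will be killed
    "clickhouse2": ["file_c"],             # surviving node
}
after = redistribute(pending, "clickhouse1")
# All tasks now belong to clickhouse2, so the query can complete with
# only that node's results (the SEGV case on 26.1).
```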
2. KILL on 26.1: query hangs instead of failing (ExpectTimeoutError) — potential ClickHouse bug
With KILL, the process dies instantly (no cleanup). On 25.8, this causes an immediate Attempt to read after eof error (~5s). On 26.1, the initiator does not detect the broken connection and the query hangs for the full 600s bash timeout. This is a behavioral regression in 26.1.
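The fast-failure path on 25.8 relies on ordinary socket semantics: when the peer process is SIGKILLed, the kernel closes its sockets and the surviving side's next read observes end-of-stream immediately. A minimal sketch with a local socket pair (plain Python, unrelated to ClickHouse's actual connection classes):

```python
import socket

# When a process dies, the kernel closes its sockets. The surviving
# peer's next read then returns b"" (EOF) right away, which is what
# 25.8 surfaces as "Code: 32 ... Attempt to read after eof".
a, b = socket.socketpair()
b.close()              # simulate the killed swarm node's side going away
data = a.recv(1024)    # returns immediately instead of blocking
assert data == b""     # EOF: the initiator can fail fast
a.close()
```

The hang on 26.1 suggests the initiator ends up waiting on task bookkeeping rather than on the socket itself, since the socket-level EOF signal is available either way.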
What changed in 26.1
All Altinity swarm-specific PRs (#780, #866, #1014, #1042, #1201, etc.) are labeled antalya-25.8 or earlier — the swarm code is already present in 25.8, where the test passes. PRs #1395 and #1414, which forward-ported this code to 26.1, contain only code that was already in 25.8.
The difference is in the upstream ClickHouse base code (26.1 vs. ~25.5). Key files that differ between the two branches:
src/Storages/ObjectStorage/StorageObjectStorageCluster.cpp — new parallel_replicas_for_cluster_engines and cluster_table_function_split_granularity settings, changed distributed_processing derivation
src/Storages/ObjectStorage/StorageObjectStorageStableTaskDistributor.cpp — removed has_concurrent_next() check, reordered handleReplicaLoss() operations, changed file identifier resolution
src/Storages/IStorageCluster.cpp — metadata handling, join detection changes
src/Storages/ObjectStorage/StorageObjectStorageSource.cpp/.h — structural changes
Core connection handling code is identical between versions: RemoteQueryExecutor.cpp, MultiplexedConnections.cpp, ReadBufferFromPocoSocket.cpp, PacketReceiver.cpp, ConnectionTimeouts.cpp/.h
Database evidence

```sql
-- 26.1: 100% failure rate
SELECT result, count()
FROM `gh-data`.clickhouse_regression_results
WHERE test_name = '/swarms/feature/node failure/check restart clickhouse on swarm node'
  AND clickhouse_version LIKE '26.1%'
GROUP BY result
-- Fail: ~25, Error: ~8
```

```sql
-- 25.8: passes consistently (rare flaky failure)
SELECT result, count()
FROM `gh-data`.clickhouse_regression_results
WHERE test_name = '/swarms/feature/node failure/check restart clickhouse on swarm node'
  AND clickhouse_version LIKE '25.8%'
GROUP BY result
-- OK: ~15, Fail: 1
```
Commit context
The test code was updated in commit f1827080b (2026-03-10, "Update node failure tests to use kill instead segv") to use signal="KILL", delay=30, delay_before_execution=5. The main branch antalya-26.1 still uses the old regression commit a54216bbc which has signal="SEGV", delay=0. PR 1520 is updating the regression commit hash for antalya-26.1, and the failures with KILL (ExpectTimeoutError) can be seen in that PR's CI runs.
Next Steps
- Validate manually that KILL on 26.1 causes a hang (reproduce the ExpectTimeoutError scenario)
- Investigate why the upstream 26.1 base code doesn't propagate EOF from a killed node's connection
- File a bug against Altinity/ClickHouse if the KILL hang is confirmed as a regression
Confirmed defects
Impact: Long-running object-storage cluster queries can stall until client timeout (300s in swarms harness) after a swarm node is killed/restarted, instead of failing fast with EOF (Code 32).
Anchor: src/Storages/ObjectStorage/StorageObjectStorageStableTaskDistributor.cpp, function StorageObjectStorageStableTaskDistributor::rescheduleTasksFromReplica.
Trigger: Run /swarms/feature/node failure/check restart clickhouse on swarm node with signal=KILL, delay_before_execution=5, delay=30 on a 26.1 build.
Why defect: The reschedule path computes and enqueues with file->getPath(), while normal task assignment and dequeue paths use a different identity, getAbsolutePathFromObjectInfo(...).value_or(getIdentifier()). This breaks queue identity invariants after replica loss.
Transition: ConnectionLost -> RemoteQueryExecutor::processPacket -> task_iterator->rescheduleTasksFromReplica() -> requeue under non-canonical key -> downstream task distribution cannot reliably reconcile the same object identity -> initiator-side query waits and times out.
Proof sketch: In 26.1 code, rescheduleTasksFromReplica uses file->getPath() for both getReplicaForFile(...) and unprocessed_files.emplace(...), while getPreQueuedFile and getMatchingFileFromIterator use absolute and identifier-based keys. In the 25.8 working baseline, reschedule used absolute-path-aware identity, getAbsolutePath().value_or(getPath()), preserving key consistency.
Root-cause PR: Introduced by PR #1414 ("26.1 Antalya port - improvements for cluster requests", commit cc2dea7...) in the new 26.1 reschedule implementation. Not fixed by PR #1568 ("Antalya 26.1: Fix rescheduleTasksFromReplica"), which only addresses erase-order and use-after-free (UAF) safety.
Bug code location:
File: src/Storages/ObjectStorage/StorageObjectStorageStableTaskDistributor.cpp
Function: StorageObjectStorageStableTaskDistributor::rescheduleTasksFromReplica
Problematic lines:

```cpp
for (const auto & file : processed_file_list_ptr->second)
{
    // Both the replica lookup and the re-enqueue key use the relative
    // path, not the canonical identifier used by the dequeue paths.
    auto file_replica_idx = getReplicaForFile(file->getPath());
    unprocessed_files.emplace(file->getPath(), std::make_pair(file, file_replica_idx));
    connection_to_files[file_replica_idx].push_back(file);
}
```
Conflicting canonical key logic in the same class:

```cpp
auto file_identifier = send_over_whole_archive
    ? next_file->getPathOrPathToArchiveIfArchive()
    : getAbsolutePathFromObjectInfo(next_file).value_or(next_file->getIdentifier());
```
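The identity mismatch can be reduced to a toy queue model (illustrative Python with hypothetical names, not the actual C++ classes): reschedule re-enqueues under the relative path while the dequeue side looks tasks up by their canonical (absolute) key, so rescheduled work is never matched and the initiator waits.

```python
# Toy model of the key-identity mismatch described above.
# All names are hypothetical; this is not ClickHouse code.

def absolute_key(file):
    # Canonical identity used by the normal assignment/dequeue paths
    # (modeled on getAbsolutePathFromObjectInfo(...).value_or(getIdentifier())).
    return file.get("absolute_path") or file["identifier"]

def relative_key(file):
    # Identity used by the buggy 26.1 reschedule path (file->getPath()).
    return file["path"]

unprocessed = {}
file = {"path": "data/part1.parquet",
        "absolute_path": "s3://bucket/data/part1.parquet",
        "identifier": "part1"}

# Reschedule after replica loss enqueues under the relative path...
unprocessed[relative_key(file)] = file

# ...but the dequeue path looks up the canonical key and misses:
found = unprocessed.get(absolute_key(file))
assert found is None   # the rescheduled task is never picked up
assert unprocessed     # it sits in the queue forever; the query waits
```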
Smallest logical repro: run the swarm cluster query on a 26.1 build and SIGKILL the ClickHouse process on one swarm node mid-query; the lost node's tasks are re-enqueued under the non-canonical key and the initiator hangs instead of failing with Code 32.
Fix direction (short): Backport the identifier-consistency changes from PR #1493 ("Fix file identifier in rescheduleTasksFromReplica"), i.e. a getFileIdentifier-style unified key derivation, in addition to PR #1568 ("Antalya 26.1: Fix rescheduleTasksFromReplica").
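In that spirit, the fix is to derive one canonical key in a single helper and use it at every enqueue and dequeue site. A hedged sketch of the idea in Python (the real change is in C++; the getFileIdentifier name here is modeled on the PR title, not copied from the patch):

```python
# Sketch of identifier-consistency: one helper shared by assignment,
# dequeue, and reschedule. All names are hypothetical.

def get_file_identifier(file, send_over_whole_archive=False):
    """Single canonical key, mirroring the logic quoted above."""
    if send_over_whole_archive:
        return file["path_or_archive_path"]
    return file.get("absolute_path") or file["identifier"]

file = {"path": "data/part1.parquet",
        "absolute_path": "s3://bucket/data/part1.parquet",
        "identifier": "part1",
        "path_or_archive_path": "data/part1.parquet"}

unprocessed = {}
# Reschedule now enqueues under the same canonical key...
unprocessed[get_file_identifier(file)] = file
# ...that the dequeue path uses, so the lookup succeeds.
assert unprocessed.get(get_file_identifier(file)) is file
```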