AMBARI-26614: Config and config-group changes are slow to propagate to agents on large clusters #4135
Merged
I think I found some bugs. Please hold off on merging this PR. Thanks for the review.
6c2a30e to ecd4c36
…o agents on large clusters

Reduce the work done on each config / config-group change and fix a config staleness issue uncovered while doing so.

- Scope the agent-config push: when a config type changes, recompute/push only to hosts whose installed components depend on (or whose service owns) that type (ConfigHelper.getHostsAffectedByConfigTypes + AmbariManagementControllerImpl). Types not owned by any installed service (e.g. cluster-env) still fan out to all hosts. Host-set scoping only: each host's config payload content is unchanged.
- Pre-resolve the changed cluster's desired configs on the request thread before the parallel per-host recompute. Resolving them lazily inside the ForkJoinPool worker (cachedClustersDesiredConfigs.computeIfAbsent(clusterId, id -> cl.getDesiredConfigs(false))) runs outside the request's transaction/persistence context, so it reads the pre-change committed snapshot and pushes the previous config value to agents. This is a deterministic off-by-one that leaves a host one change behind until an ambari-server/agent restart. Seeding the cache on the request thread (read-your-writes) keeps the parallel recompute while ensuring the workers never call getDesiredConfigs.
- Scope config-group changes to the group's member hosts instead of recomputing every cluster host (ConfigGroupResourceProvider + updateAgentConfigs(Cluster, List<Long>)), with hostId-based group lookup. Covered by ConfigGroupResourceProviderTest.
- Avoid redundant work per change: cache each cluster's desired configs once per batch, drop unnecessary agent-payload sorting (ClusterConfigs) and orphaned unescape logic.
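The pre-resolve step in the commit message can be illustrated with a small, self-contained sketch. It is not Ambari's code: `DesiredConfigSeedingSketch`, `seed`, `recomputeAll`, and the `loader` function are hypothetical names, the loader stands in for `Cluster#getDesiredConfigs`, and a plain map stands in for the real per-cluster cache. The only point shown is that seeding the cache on the calling thread means the pool workers hit the cache and never invoke the loader themselves.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.stream.Collectors;

final class DesiredConfigSeedingSketch {
  // Hypothetical stand-in for the per-cluster cache: clusterId -> (config type -> version tag).
  private final Map<Long, Map<String, String>> cache = new ConcurrentHashMap<>();
  // Hypothetical stand-in for Cluster#getDesiredConfigs.
  private final Function<Long, Map<String, String>> loader;
  // Counts loader invocations so the effect of seeding is observable.
  final AtomicInteger loaderCalls = new AtomicInteger();

  DesiredConfigSeedingSketch(Function<Long, Map<String, String>> loader) {
    this.loader = loader;
  }

  private Map<String, String> load(Long clusterId) {
    loaderCalls.incrementAndGet();
    return loader.apply(clusterId);
  }

  /** Runs on the calling (request) thread, before any worker is started. */
  void seed(long clusterId) {
    cache.put(clusterId, load(clusterId));
  }

  /** Used by workers; with a seeded cache this never calls the loader. */
  Map<String, String> desiredConfigs(long clusterId) {
    return cache.computeIfAbsent(clusterId, this::load);
  }

  /** Seeds first, then recomputes every host in parallel on the given pool. */
  List<String> recomputeAll(long clusterId, List<Long> hostIds, ForkJoinPool pool) {
    seed(clusterId);
    return pool.submit(() -> hostIds.parallelStream()
        .map(hostId -> hostId + ":" + desiredConfigs(clusterId).get("hdfs-site"))
        .collect(Collectors.toList())).join();
  }

  public static void main(String[] args) {
    DesiredConfigSeedingSketch sketch =
        new DesiredConfigSeedingSketch(id -> Map.of("hdfs-site", "version2"));
    ForkJoinPool pool = new ForkJoinPool(4);
    System.out.println(sketch.recomputeAll(1L, List.of(10L, 11L, 12L), pool));
    System.out.println("loader calls: " + sketch.loaderCalls.get());
    pool.shutdown();
  }
}
```

In this sketch the loader runs exactly once per batch, on the thread that called `recomputeAll`, which mirrors the read-your-writes argument above; whether the real fix is structured this way is not something the sketch claims.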
eubnara
commented
Jun 9, 2026
What changes were proposed in this pull request?
Speeds up propagation of config / config-group changes to agents on large
clusters (AMBARI-26614). Previously every change recomputed the per-host agent
config for all cluster hosts, sequentially and redundantly, and a config-group
change touching only a few member hosts still pushed config to every host.
- … previous and new members). New
  ConfigHelper.updateAgentConfigs(Cluster, List<Long> hostIds) overload; the
  cluster-wide updateAgentConfigs(Set<String>) is kept as a wrapper.
- … Cluster#getDesiredConfigs per batch instead of recomputing per host.
- … ThreadPools ForkJoinPool (no new/leaked pool); wait for completion and
  surface per-host failures.
- … ClusterConfigs SortedMap → Map; agents don't depend on key order).
- … escape/unescape pair for config names; AMBARI-23538 removed the escaping
  half, leaving the unescaping half with nothing to undo. It still ran on every
  key for every host on every push (wasted work on the hot path), and since
  nothing escapes the keys anymore it corrupted any key containing a backslash
  sequence. Output is unchanged for normal keys.
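As a rough illustration of the scoping and wait-for-completion points in the list above, the sketch below computes the affected hosts for a config-group membership change as the union of previous and new members, runs a per-host recompute on a pool passed in by the caller, waits for every task, and returns the failures keyed by host ID. All names here (`ScopedGroupPushSketch`, `affectedHostIds`, `recompute`, the `perHost` callback) are hypothetical and are not Ambari APIs; the real change goes through ConfigHelper.updateAgentConfigs(Cluster, List<Long> hostIds).

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.function.LongConsumer;

final class ScopedGroupPushSketch {
  /** Hosts touched by a membership change: union of previous and new members. */
  static Set<Long> affectedHostIds(Collection<Long> previous, Collection<Long> current) {
    Set<Long> ids = new TreeSet<>(previous);
    ids.addAll(current);
    return ids;
  }

  /**
   * Submits one task per host to an existing pool (no pool is created here),
   * waits for all of them, and returns the failures keyed by host ID.
   */
  static Map<Long, Throwable> recompute(Set<Long> hostIds, ForkJoinPool pool, LongConsumer perHost) {
    Map<Long, ForkJoinTask<?>> tasks = new LinkedHashMap<>();
    for (Long id : hostIds) {
      tasks.put(id, pool.submit(() -> perHost.accept(id)));
    }
    Map<Long, Throwable> failures = new TreeMap<>();
    tasks.forEach((id, task) -> {
      try {
        task.get(); // blocks until this host's recompute has finished
      } catch (ExecutionException e) {
        failures.put(id, e.getCause());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        failures.put(id, e);
      }
    });
    return failures;
  }

  public static void main(String[] args) {
    Set<Long> ids = affectedHostIds(List.of(1L, 2L), List.of(2L, 3L));
    ForkJoinPool pool = new ForkJoinPool(2); // stands in for a shared, long-lived pool
    Map<Long, Throwable> failures = recompute(ids, pool, id -> {
      if (id == 2L) {
        throw new IllegalStateException("recompute failed for host " + id);
      }
    });
    System.out.println("affected: " + ids + ", failed: " + failures.keySet());
    pool.shutdown();
  }
}
```

Collecting failures per host instead of stopping at the first exception lets the remaining hosts still receive their update, which is one plausible reading of "surface per-host failures"; the actual error-handling policy in the patch may differ.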
Cluster-wide config changes still target all hosts; the hash-based
"send only if changed" mechanism is unchanged.
How was this patch tested?
Existing unit tests pass:
ConfigHelperTest, ConfigGroupResourceProviderTest. ConfigHelperTest was updated
to stub ServiceComponentHost#getHost() for the config-group-by-hostId lookup
path. Manually verified on a large cluster that a config-group change pushes
config only to its member hosts.