AMBARI-26614: Config and config-group changes are slow to propagate to agents on large clusters#4135

Merged
JiaLiangC merged 1 commit into apache:trunk from eubnara:AMBARI-26614 on Jun 10, 2026

Conversation

@eubnara eubnara commented Jun 5, 2026

What changes were proposed in this pull request?

Speeds up propagation of config / config-group changes to agents on large
clusters (AMBARI-26614). Previously every change recomputed the per-host agent
config for all cluster hosts, sequentially and redundantly, and a config-group
change touching only a few member hosts still pushed config to every host.

  • Scope config-group create/update to the affected member hosts (union of
    previous and new members). New ConfigHelper.updateAgentConfigs(Cluster,
    List<Long> hostIds) overload; the cluster-wide
    updateAgentConfigs(Set<String>) is kept as a wrapper.
  • Cache Cluster#getDesiredConfigs per batch instead of recomputing per host.
  • Parallelize per-host recomputation on the existing ThreadPools ForkJoinPool
    (no new/leaked pool); wait for completion and surface per-host failures.
  • Drop the unnecessary deep sorting of the agent payload (ClusterConfigs
    SortedMap → Map; agents don't depend on key order).
  • Look up config groups by host id instead of scanning by hostname.
  • Remove the leftover config-name unescaping. AMBARI-23463 added an
    escape/unescape pair for config names; AMBARI-23538 removed the escaping
    half, leaving the unescaping half with nothing to undo. It still ran on every
    key for every host on every push (wasted work on the hot path), and since
    nothing escapes the keys anymore it corrupted any key containing a backslash
    sequence. Output is unchanged for normal keys.
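
A minimal sketch of the first and third bullets above, not the code in this PR: it computes the affected host set as the union of previous and new group members, recomputes each host on a shared ForkJoinPool, waits for completion, and collects per-host failures. The class name `ScopedAgentConfigPush`, both method names, and the `LongConsumer` callback are hypothetical stand-ins for the real ConfigHelper logic, and the pool is created locally only so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.function.LongConsumer;

// Hypothetical illustration; names do not come from the Ambari code base.
class ScopedAgentConfigPush {

  /** Hosts to refresh after a config-group update: union of old and new members. */
  public static Set<Long> affectedHostIds(List<Long> previousMembers, List<Long> newMembers) {
    Set<Long> affected = new LinkedHashSet<>(previousMembers);
    affected.addAll(newMembers); // hosts removed from the group must be refreshed too
    return affected;
  }

  /**
   * Runs the per-host recompute on the given pool, waits for every task,
   * and returns the failures keyed by host id (empty map means all succeeded).
   */
  public static Map<Long, Throwable> recomputeInParallel(
      ForkJoinPool pool, Set<Long> hostIds, LongConsumer recomputeForHost) {
    Map<Long, Throwable> failures = new ConcurrentHashMap<>();
    List<ForkJoinTask<?>> tasks = new ArrayList<>();
    for (Long hostId : hostIds) {
      Runnable task = () -> {
        try {
          recomputeForHost.accept(hostId);
        } catch (RuntimeException e) {
          failures.put(hostId, e); // record the failure instead of losing it in the worker
        }
      };
      tasks.add(pool.submit(task));
    }
    for (ForkJoinTask<?> task : tasks) {
      task.join(); // block until every host has been processed
    }
    return failures;
  }

  public static void main(String[] args) {
    Set<Long> affected = affectedHostIds(List.of(1L, 2L), List.of(2L, 3L));
    ForkJoinPool pool = new ForkJoinPool(4); // the PR reuses an existing pool instead
    try {
      Map<Long, Throwable> failures = recomputeInParallel(pool, affected, hostId -> {
        if (hostId == 3L) {
          throw new IllegalStateException("simulated failure for host " + hostId);
        }
      });
      System.out.println("affected=" + affected + " failed=" + failures.keySet());
    } finally {
      pool.shutdown();
    }
  }
}
```

Taking the union rather than only the new member list matters because a host that was just removed from the group also needs its config recomputed.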

Cluster-wide config changes still target all hosts; the hash-based
"send only if changed" mechanism is unchanged.

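For readers unfamiliar with the "send only if changed" idea mentioned above, here is a rough, single-threaded sketch of such a gate. It is an assumption-laden illustration, not Ambari's mechanism: the class `ConfigHashGate`, the SHA-256 choice, and the flat `Map<String, String>` payload are all hypothetical, and the PR does not describe how the real hash is computed. The sketch orders keys only while hashing so that equal payloads hash equally regardless of map iteration order.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not the hashing used by ambari-server.
class ConfigHashGate {
  // Last hash sent to each host (not thread-safe; this is a single-threaded sketch).
  private final Map<Long, String> lastSentHashByHost = new HashMap<>();

  /** Order-independent SHA-256 over the payload's key/value pairs, as lowercase hex. */
  public static String hashOf(Map<String, String> payload) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      for (Map.Entry<String, String> entry : new TreeMap<>(payload).entrySet()) {
        digest.update(entry.getKey().getBytes(StandardCharsets.UTF_8));
        digest.update((byte) 0); // separator so ("ab","c") and ("a","bc") differ
        digest.update(entry.getValue().getBytes(StandardCharsets.UTF_8));
        digest.update((byte) 0);
      }
      StringBuilder hex = new StringBuilder();
      for (byte b : digest.digest()) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-256 is required on every Java platform", e);
    }
  }

  /** Returns true (and remembers the hash) only when the payload differs from the last one sent. */
  public boolean shouldSend(long hostId, Map<String, String> payload) {
    String current = hashOf(payload);
    String previous = lastSentHashByHost.put(hostId, current);
    return !current.equals(previous);
  }

  public static void main(String[] args) {
    ConfigHashGate gate = new ConfigHashGate();
    Map<String, String> payload = new LinkedHashMap<>();
    payload.put("dfs.replication", "3");
    System.out.println("first push:  " + gate.shouldSend(1L, payload));
    System.out.println("same again:  " + gate.shouldSend(1L, payload));
    payload.put("dfs.replication", "2");
    System.out.println("after edit:  " + gate.shouldSend(1L, payload));
  }
}
```

With a gate like this, narrowing the host set reduces how many payloads are recomputed and hashed, while the gate itself still decides whether anything is actually sent.
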
How was this patch tested?

Existing unit tests pass: ConfigHelperTest, ConfigGroupResourceProviderTest.
ConfigHelperTest was updated to stub ServiceComponentHost#getHost() for the
config-group-by-hostId lookup path. Manually verified on a large cluster that a
config-group change pushes config only to its member hosts.

@JiaLiangC JiaLiangC left a comment

LGTM

eubnara commented Jun 9, 2026

I think I found some bugs. Please hold off on merging this PR. Thanks for the review.

@eubnara force-pushed the AMBARI-26614 branch 2 times, most recently from 6c2a30e to ecd4c36 on June 9, 2026 14:36
…o agents on large clusters

Reduce the work done on each config / config-group change and fix a config
staleness issue uncovered while doing so.

- Scope the agent-config push: when a config type changes, recompute/push only to
  hosts whose installed components depend on (or whose service owns) that type
  (ConfigHelper.getHostsAffectedByConfigTypes + AmbariManagementControllerImpl).
  Types not owned by any installed service (e.g. cluster-env) still fan out to all
  hosts. Host-set scoping only — each host's config payload content is unchanged.

- Pre-resolve the changed cluster's desired configs on the request thread before the
  parallel per-host recompute. Resolving them lazily inside the ForkJoinPool worker
  (cachedClustersDesiredConfigs.computeIfAbsent(clusterId, id -> cl.getDesiredConfigs(false)))
  runs outside the request's transaction/persistence context, so it reads the
  pre-change committed snapshot and pushes the previous config value to agents — a
  deterministic off-by-one that leaves a host one change behind until an
  ambari-server/agent restart. Seeding the cache on the request thread (read-your-writes)
  keeps the recompute parallel while ensuring the workers never call getDesiredConfigs.

- Scope config-group changes to the group's member hosts instead of recomputing
  every cluster host (ConfigGroupResourceProvider + updateAgentConfigs(Cluster, List<Long>)),
  with hostId-based group lookup. Covered by ConfigGroupResourceProviderTest.

- Avoid redundant work per change: cache each cluster's desired configs once per batch,
  drop unnecessary agent-payload sorting (ClusterConfigs) and orphaned unescape logic.
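
The staleness issue in the second bullet can be reproduced in miniature. The sketch below is a toy model, not Ambari code: a `ThreadLocal` stands in for the request's transaction/persistence context, `readDesiredConfig()` stands in for `cl.getDesiredConfigs(false)`, and the class `SeededDesiredConfigCache` is hypothetical. It shows that a `computeIfAbsent` loader running on a pool worker sees only the committed value, while seeding the cache on the request thread lets the worker return the new value without invoking the loader.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;

// Hypothetical illustration; the ThreadLocal is only a stand-in for a transaction context.
class SeededDesiredConfigCache {
  // State already committed and visible to every thread.
  static volatile String committed = "v1";
  // Uncommitted write, visible only to the thread that made it.
  static final ThreadLocal<String> pending = new ThreadLocal<>();

  /** Reads the caller's own pending write if there is one, else the committed value. */
  static String readDesiredConfig() {
    String own = pending.get();
    return own != null ? own : committed;
  }

  /** Returns the config value a pool worker ends up using for cluster 1. */
  public static String valueSeenByWorker(boolean seedOnRequestThread) {
    Map<Long, String> cache = new ConcurrentHashMap<>();
    long clusterId = 1L;
    pending.set("v2"); // the request thread has changed the config but not committed yet
    ForkJoinPool pool = new ForkJoinPool(2);
    try {
      if (seedOnRequestThread) {
        cache.put(clusterId, readDesiredConfig()); // read-your-writes: resolves to "v2"
      }
      // On a worker thread the loader runs only if the cache was not seeded,
      // and there it cannot see the request thread's pending write.
      Callable<String> task =
          () -> cache.computeIfAbsent(clusterId, id -> readDesiredConfig());
      return pool.submit(task).join();
    } finally {
      pool.shutdown();
      pending.remove();
    }
  }

  public static void main(String[] args) {
    System.out.println("lazy in worker: " + valueSeenByWorker(false));
    System.out.println("pre-seeded:     " + valueSeenByWorker(true));
  }
}
```

The point of the design is that the workers stay parallel and only the single read that depends on the request's context moves to the request thread.
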
@JiaLiangC JiaLiangC merged commit 2f794b1 into apache:trunk Jun 10, 2026
1 check failed