…ned Deployments
Introduces a new `TemporalWorkerOwnedResource` (TWOR) CRD that lets users attach
arbitrary namespaced Kubernetes resources (HPA, PDB, WPA, custom CRDs, etc.) to
each per-Build-ID versioned Deployment managed by a TemporalWorkerDeployment.
Key design points:
- One copy of the attached resource is created per active Build ID, owned by the
corresponding versioned Deployment — Kubernetes GC deletes it automatically when
the Deployment is removed, requiring no explicit cleanup logic.
- Resources are applied via Server-Side Apply (create-or-update), so the controller
is idempotent and coexists safely with other field managers (e.g. the HPA controller).
- Two-layer auto-population for well-known fields:
Layer 1: `scaleTargetRef: null` and `matchLabels: null` in spec.object are
auto-injected with the versioned Deployment's identity and selector labels.
Layer 2: Go template expressions (`{{ .DeploymentName }}`, `{{ .BuildID }}`,
`{{ .Namespace }}`) are rendered in all string values before apply.
- Generated resource names use a hash-suffix scheme (`{prefix}-{8-char-hash}`) to
guarantee uniqueness per (twdName, tworName, buildID) triple even when the prefix
is truncated; the buildID is always represented in the hash regardless of name length.
- `ComputeSelectorLabels` is now the single source of truth for selector labels used
both in Deployment creation and in owned-resource matchLabels injection.
- Partial-failure isolation: all owned resources are attempted on each reconcile even
if some fail; errors are collected and surfaced together.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Extract getOwnedResourceApplies into planner package so it can be tested without a live API client
- Add OwnedResourceApply type and OwnedResourceApplies slice to Plan
- Thread twors []TemporalWorkerOwnedResource through GeneratePlan
- Add TestGetOwnedResourceApplies (8 cases: nil/empty inputs, N×M cartesian, nil Raw skipped, invalid template skipped)
- Add TestGetOwnedResourceApplies_ApplyContents (field manager, kind, owner reference, deterministic name)
- Add TestGetOwnedResourceApplies_FieldManagerDistinctPerTWOR
- Add two TWOR cases to TestGeneratePlan for end-to-end count check
- Add helpers: createTestTWOR, createDeploymentWithUID, createTestTWORWithInvalidTemplate
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both the controller plan field and the planner Plan field now share the same name, making the copy-assignment self-documenting:
plan.ApplyOwnedResources = planResult.ApplyOwnedResources
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Users don't need to template the k8s namespace (they already know it
when creating their TWOR in that namespace). The Temporal namespace is
more useful since it configures where the worker connects to.
- TemplateData.Namespace → TemplateData.TemporalNamespace
- RenderOwnedResource gains a temporalNamespace string parameter
- getOwnedResourceApplies threads the value from
spec.WorkerOptions.TemporalNamespace down to RenderOwnedResource
- Update all tests: {{ .Namespace }} → {{ .TemporalNamespace }}
- GoTemplateRendering test now uses distinct k8s ns ("k8s-production")
and Temporal ns ("temporal-production") to make the difference clear
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
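The Layer 2 rendering step, with the renamed `TemporalNamespace` field, can be sketched with Go's standard `text/template` package. This is an illustrative stand-in, not the PR's `RenderOwnedResource`: `TemplateData`, `renderString` and `renderValue` are hypothetical names, and the walk assumes `spec.object` has already been decoded into plain `map[string]interface{}` / `[]interface{}` values. Because the data is a struct, a template that still references the old `{{ .Namespace }}` field fails at execution time instead of silently rendering an empty string.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// TemplateData mirrors the fields named in this commit; the PR's type may differ.
type TemplateData struct {
	DeploymentName    string
	TemporalNamespace string
	BuildID           string
}

// renderString executes s as a Go template against data. Referencing a field
// that TemplateData does not have is an execution error.
func renderString(s string, data TemplateData) (string, error) {
	if !strings.Contains(s, "{{") {
		return s, nil // fast path: nothing to render
	}
	tmpl, err := template.New("value").Parse(s)
	if err != nil {
		return "", err
	}
	var b strings.Builder
	if err := tmpl.Execute(&b, data); err != nil {
		return "", err
	}
	return b.String(), nil
}

// renderValue walks decoded JSON (maps, slices, strings) and renders every
// string in place, returning the first error with the path element that failed.
func renderValue(v interface{}, data TemplateData) (interface{}, error) {
	switch t := v.(type) {
	case string:
		return renderString(t, data)
	case map[string]interface{}:
		for k, child := range t {
			r, err := renderValue(child, data)
			if err != nil {
				return nil, fmt.Errorf("%s: %w", k, err)
			}
			t[k] = r // overwriting an existing key while ranging is allowed
		}
		return t, nil
	case []interface{}:
		for i, child := range t {
			r, err := renderValue(child, data)
			if err != nil {
				return nil, fmt.Errorf("[%d]: %w", i, err)
			}
			t[i] = r
		}
		return t, nil
	default:
		return v, nil // numbers, bools and nil are left untouched
	}
}

func main() {
	obj := map[string]interface{}{
		"spec": map[string]interface{}{
			"minReplicas": float64(1),
			// Hypothetical field and label names; only the template syntax matters.
			"metricLabels": map[string]interface{}{
				"temporal_namespace": "{{ .TemporalNamespace }}",
				"build_id":           "{{ .BuildID }}",
			},
		},
	}
	data := TemplateData{DeploymentName: "helloworld-v1", TemporalNamespace: "temporal-production", BuildID: "v1"}
	out, err := renderValue(obj, data)
	fmt.Println(out, err)
}
```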
Implements the admission webhook for TemporalWorkerOwnedResource with:
- Pure spec validation: apiVersion/kind required, metadata.name/namespace forbidden, banned kinds (Deployment/StatefulSet/Job/Pod/CronJob by default), minReplicas≠0, scaleTargetRef/matchLabels absent-or-null enforcement
- API checks: RESTMapper namespace-scope assertion, SubjectAccessReview for the requesting user and controller SA (with correct SA group memberships)
- ValidateUpdate enforces workerRef.name immutability and uses verb="update"
- ValidateDelete checks delete permissions on the underlying resource
- Helm chart: injects POD_NAMESPACE and SERVICE_ACCOUNT_NAME via downward API, BANNED_KINDS from ownedResources.bannedKinds values
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- cmd/main.go: register TemporalWorkerOwnedResourceValidator unconditionally
- webhook.yaml: rewrite to always create the webhook Service and TWOR ValidatingWebhookConfiguration; TWD validating webhook remains optional behind webhook.enabled
- certmanager.yaml: fix service DNS names, remove fail guard, default enabled
- manager.yaml: move cert volume mount and webhook port outside the webhook.enabled gate so the webhook server always starts
- values.yaml: default certmanager.enabled to true, clarify that webhook.enabled only controls the optional TWD webhook
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add helm/crds/temporal.io_temporalworkerownedresources.yaml so Helm installs the CRD before the controller starts
- Add temporalworkerownedresources get/list/watch/patch/update rules to the manager ClusterRole so the controller can watch and update status
- Add authorization.k8s.io/subjectaccessreviews create permission for the validating webhook's SubjectAccessReview checks
- Add editor and viewer ClusterRoles for end-user RBAC on TWOR objects
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TemporalWorkerOwnedResource supports arbitrary user-defined resource types (HPA, PDB, custom CRDs) that are not known at install time. Add a wildcard rule to the manager ClusterRole so the controller can create/get/patch/update/delete any namespaced resource on behalf of TWOR objects.
Security note: the TWOR validating webhook is a required admission control that verifies the requesting user has permission on the embedded resource type before the TWOR is admitted, so the controller's broad permissions act as executor, not gatekeeper.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the wildcard ClusterRole rule with a configurable list of explicit resource type rules. Default to HPA and PDB, the two primary documented TWOR use cases. Wildcard mode is still available as an opt-in via ownedResources.rbac.wildcard=true for development clusters or when users attach many different custom CRD types. Operators add entries to ownedResources.rbac.rules for each additional API group their TWOR objects will use (e.g. keda.sh/scaledobjects).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Consolidates bannedKinds and rbac under a single top-level key for clarity. Updates all template references accordingly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add TWORName, TWORNamespace, BuildID to OwnedResourceApply so the executor knows which status entry to update after each apply attempt. Refactor the apply loop in execplan.go to collect per-(TWOR, BuildID) results (success or error) and then, after all applies complete, write OwnedResourceVersionStatus entries back to each TWOR's status subresource. This means:
- Applied=true + ResourceName set on success
- Applied=false + Message set on failure
- All Build IDs for a TWOR are written atomically in one status update
- Apply errors and status write errors are both returned via errors.Join so the reconcile loop retries on either kind of failure
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the double-nested errors.Join with a single call over the concatenated slice, which is equivalent and more readable.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a TemporalWorkerDeployment is reconciled, ensure each TemporalWorkerOwnedResource referencing it has an owner reference pointing back to the TWD (controller: true). This lets Kubernetes garbage-collect TWOR objects automatically when the TWD is deleted. The patch is skipped when the reference is already present (checked via metav1.IsControlledBy) to avoid a write on every reconcile loop. Uses client.MergeFrom to avoid conflicts with concurrent modifications.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
genplan should only read state and build a plan, not perform writes. Instead of patching TWORs directly in generatePlan, build (base, patched) pairs in genplan.go (pure computation) and let executePlan apply them, consistent with how all other writes are structured.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add TWOROwnerRefPatch type and EnsureTWOROwnerRefs to planner.Plan
- Add getTWOROwnerRefPatches to planner package, unit-tested in planner_test.go (TestGetTWOROwnerRefPatches)
- GeneratePlan now accepts a twdOwnerRef and populates EnsureTWOROwnerRefs; genplan.go builds the OwnerReference from the TWD object and passes it through
- Owner ref patch failures in execplan.go now log-and-continue so that a deleted TWOR (race between list and patch) cannot block the more important owned-resource apply step
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
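The pure planning step described in the last three commits (build `(base, patched)` pairs, skip TWORs already controlled by the TWD) can be sketched with simplified stand-in types. `ownerRef`, `object`, `ownerRefPatch`, `isControlledBy` and `planOwnerRefPatches` are hypothetical; the real code uses `metav1.OwnerReference`, `metav1.IsControlledBy` and `client.MergeFrom`. `isControlledBy` follows the same rule as `metav1.IsControlledBy`: find the owner reference marked `controller: true` and compare its UID.

```go
package main

import "fmt"

// ownerRef and object are simplified stand-ins for metav1.OwnerReference and
// a TWOR object.
type ownerRef struct {
	APIVersion, Kind, Name, UID string
	Controller                  bool
}

type object struct {
	Name      string
	OwnerRefs []ownerRef
}

// ownerRefPatch pairs the unmodified object with the desired one, which is the
// shape a merge-patch helper such as client.MergeFrom works from.
type ownerRefPatch struct {
	Base, Patched object
}

// isControlledBy reports whether o's controller owner reference has this UID.
func isControlledBy(o object, uid string) bool {
	for _, r := range o.OwnerRefs {
		if r.Controller {
			return r.UID == uid
		}
	}
	return false
}

// planOwnerRefPatches only reads its inputs and returns patches; the executor
// performs the writes. TWORs already controlled by the TWD are skipped, so a
// steady-state reconcile plans no writes. A TWOR with a different controller
// would be rejected by the API server on patch; this sketch does not special-case it.
func planOwnerRefPatches(twors []object, twdRef ownerRef) []ownerRefPatch {
	var patches []ownerRefPatch
	for _, t := range twors {
		if isControlledBy(t, twdRef.UID) {
			continue
		}
		patched := t
		// Copy the slice so appending to Patched never aliases Base.
		patched.OwnerRefs = append(append([]ownerRef(nil), t.OwnerRefs...), twdRef)
		patches = append(patches, ownerRefPatch{Base: t, Patched: patched})
	}
	return patches
}

func main() {
	twd := ownerRef{APIVersion: "temporal.io/v1alpha1", Kind: "TemporalWorkerDeployment",
		Name: "helloworld", UID: "uid-1", Controller: true}
	twors := []object{{Name: "hpa"}, {Name: "pdb", OwnerRefs: []ownerRef{twd}}}
	for _, p := range planOwnerRefPatches(twors, twd) {
		fmt.Println("patch", p.Base.Name, "->", len(p.Patched.OwnerRefs), "owner ref(s)")
	}
}
```

Keeping this function free of client calls is what makes it unit-testable in the planner package, and it lets the executor log-and-continue on a patch failure without affecting plan construction.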
…ing it pre-built
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…bhook validator
- Add +kubebuilder:object:generate=false to TemporalWorkerOwnedResourceValidator (client.Client interface field was blocking controller-gen)
- Regenerate zz_generated.deepcopy.go: adds GateInputSource, GateWorkflowConfig, OwnedResourceVersionStatus deepcopy functions that were missing from the manual edit
- Regenerate CRD manifests: adds type:object to spec.object in TWOR CRD, field ordering change in TWD CRD
- Remove now-unused metav1 import from genplan.go (was missed in prior commit)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests the full reconciliation loop: create TWOR with HPA spec → controller applies one HPA per active Build ID via SSA → asserts scaleTargetRef is auto-injected with the correct versioned Deployment name → asserts TWOR.Status.Versions shows Applied: true → asserts TWD controller owner reference is set on the TWOR.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…functions
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…etup
- docs/owned-resources.md: full TWOR reference (auto-injection, RBAC, webhook TLS, examples)
- examples/twor-hpa.yaml: ready-to-apply HPA example for the helloworld demo
- helm/webhook.yaml + values.yaml: add certmanager.caBundle for BYO TLS without cert-manager
- internal/demo/README.md: add cert-manager install step and TWOR demo walkthrough
- README.md + docs/README.md: add cert-manager prerequisite, TWOR feature bullet, and doc link
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Truncate owned resource names to 47 chars (safe for Deployments; avoids per-resource-type special cases if Deployment is ever un-banned)
- Fix docs: replace "active Build ID" with "worker version with running workers" throughout; "active" is reserved for Ramping/Current versions
- Fix docs: owned-resource deletion is due to versioned Deployment sunset, not a separate "version delete" operation
- Fix docs: scaleTargetRef injection applies to any resource type with that field, not just HPA; clarify webhook rejects non-null values because controller owns them
- Fix docs: remove undocumented/untested BYO TLS path; cert-manager is required
- Fix docs: expand TWOR abbreviation to full name throughout; remove ⏳ autoscaling bullet from README and clarify TWOR is the path for metric/backlog-based autoscaling
- Add note on how to inspect the banned kinds list (BANNED_KINDS env var)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ission
Implements all envtest-capable test scenarios identified in docs/test-coverage-analysis.md:

**TWOR integration tests (tests 1–7)** in internal/tests/internal/twor_integration_test.go:
- Owner reference on versioned Deployment points to TWOR's owning Deployment
- matchLabels auto-injection (null sentinel → selector labels)
- Multiple TWORs on the same TWD each produce independent resources
- Go template variables (DeploymentName, BuildID, TemporalNamespace)
- Multiple active build IDs each get their own owned resource instance
- Partial SSA failure is isolated per resource (other versions still apply)
- SSA idempotency: repeated reconciles produce no spurious updates

**Rollout integration tests (tests 8, 9, 10, 13)** in internal/tests/internal/rollout_integration_test.go:
- Progressive rollout auto-promotes to Current after the 30s pause expires
- Controller repairs stale ConnectionSpecHashAnnotation on a versioned Deployment
- Gate input from ConfigMap: blocks Deployment creation until ConfigMap exists
- Three successive rollouts accumulate two deprecated versions in status

**Webhook integration tests (tests 14–18)** in api/v1alpha1/temporalworkerownedresource_webhook_integration_test.go:
- Banned kind rejected via the real HTTP admission path
- SAR pass: admin user + controller SA with HPA RBAC → creation allowed
- SAR fail: impersonated user without HPA permission → rejected
- SAR fail: controller SA without HPA RBAC → rejected
- workerRef.name immutability enforced via real HTTP update

Supporting changes:
- config/webhook/manifests.yaml: add TWOR ValidatingWebhookConfiguration
- api/v1alpha1/webhook_suite_test.go: register TWOR webhook; add corev1/rbacv1/authorizationv1 to scheme; set controller SA env vars
- docs/test-coverage-analysis.md: corrections to webhook and autoscaling sections
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…de review feedback
- Convert runTWORTests and runRolloutTests into tworTestCases()/rolloutTestCases() table slices that run through the standard testTemporalWorkerDeploymentCreation runner, eliminating duplicate validation paths for TWD status and Temporal state
- Add WithTWDMutatorFunc and WithPostTWDCreateFunc hooks to TestCaseBuilder to support gate-ConfigMap blocking and other pre/post-create mutations without polluting the builder API
- Remove all test-number references (tests 1–7, Test 8, etc.) from comments; replace with self-contained scenario descriptions that don't depend on the coverage doc
- Improve WithValidatorFunction doc comment to precisely state execution order: runs after both verifyTemporalWorkerDeploymentStatusEventually and verifyTemporalStateMatchesStatusEventually have confirmed the expected TWD state
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mark TWOR gaps 1-7, rollout gaps 8-10/13, and webhook tests 14-18 as implemented. Update subtest counts (28→39 integration, 0→5 webhook suite). Update priority recommendations to reflect remaining gaps.
Mirror of gate-input-from-configmap using SecretKeyRef. Controller blocks Deployment creation while the Secret is absent; creating the Secret unblocks the reconcile loop and the version promotes to Current. Also updates the test-coverage-analysis.md to mark Gate input from Secret as covered (was the last open envtest gap).
```diff
@@ -0,0 +1,372 @@
+# Test Coverage Analysis
```
I will be deleting this!
Also, it is AI generated, and I have only read the top half carefully. Use it for your information during review ("The Controller's Contract" might be an especially helpful section for us all to understand and agree on), but don't spend too much time on the details here. Not intended to be a part of the main repo.
Without this, 'go test ./...' in CI (where etcd is not installed) caused BeforeSuite to crash immediately instead of skipping gracefully.
- BeforeSuite: Skip() early when KUBEBUILDER_ASSETS is unset
- AfterSuite: return early when testEnv is nil (BeforeSuite was skipped)
- test-unit: add envtest prerequisite and set KUBEBUILDER_ASSETS so the webhook integration tests actually run (not skip) in the unit test job
- use-errors-new: replace fmt.Errorf with errors.New for static error
string in worker_controller.go TWOR field indexer
- unchecked-type-assertion (ownedresources.go):
* rendered.(map[string]interface{}) → two-value form + error return
* raw["metadata"].(map[string]interface{}) → check ok, skip nil branch
* meta["labels"].(map[string]interface{}) → check ok, skip nil branch
- cognitive-complexity: split autoInjectFields (complexity 33) into
three focused helpers:
* buildScaleTargetRef — constructs the scaleTargetRef map
* labelsAsInterface — converts string labels to interface map
* autoInjectInValue — handles recursion into nested maps and slices
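The three helpers named above can be sketched as follows to show the Layer 1 "explicit null" contract: a `scaleTargetRef` or `matchLabels` key that is present with a `null` value is filled in, while absent keys and non-null values are left alone. This is an illustrative reconstruction from the commit message, not the PR's code: the helper bodies, the recursion into every nested map and slice, and the label key used in `main` are assumptions.

```go
package main

import "fmt"

// buildScaleTargetRef returns a scaleTargetRef pointing at a versioned Deployment.
func buildScaleTargetRef(deploymentName string) map[string]interface{} {
	return map[string]interface{}{
		"apiVersion": "apps/v1",
		"kind":       "Deployment",
		"name":       deploymentName,
	}
}

// labelsAsInterface converts map[string]string into the map[string]interface{}
// shape used by decoded (unstructured) JSON.
func labelsAsInterface(labels map[string]string) map[string]interface{} {
	out := make(map[string]interface{}, len(labels))
	for k, v := range labels {
		out[k] = v
	}
	return out
}

// autoInjectInValue recurses into nested maps and slices and fills only keys
// that are present with an explicit null.
func autoInjectInValue(v interface{}, deploymentName string, selector map[string]string) {
	switch t := v.(type) {
	case map[string]interface{}:
		for k, child := range t {
			switch {
			case k == "scaleTargetRef" && child == nil:
				t[k] = buildScaleTargetRef(deploymentName)
			case k == "matchLabels" && child == nil:
				t[k] = labelsAsInterface(selector)
			default:
				autoInjectInValue(child, deploymentName, selector)
			}
		}
	case []interface{}:
		for _, child := range t {
			autoInjectInValue(child, deploymentName, selector)
		}
	}
}

func main() {
	pdbSpec := map[string]interface{}{
		"minAvailable": float64(1),
		"selector":     map[string]interface{}{"matchLabels": nil},
	}
	// Hypothetical label key; the real selector labels come from ComputeSelectorLabels.
	autoInjectInValue(pdbSpec, "helloworld-v1", map[string]string{"example.com/build-id": "v1"})
	fmt.Println(pdbSpec["selector"])
}
```

Using an explicit `null` as the opt-in sentinel (rather than "absent") means a resource type that merely happens to have an optional `matchLabels` field is not rewritten unless the user asked for it.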
…controller into temporal-worker-owned-resource
```go
	"k8s.io/apimachinery/pkg/runtime"
)

// WorkerDeploymentReference references a TemporalWorkerDeployment in the same namespace.
```
Temporal namespace or Kubernetes namespace?
```go
type TemporalWorkerOwnedResourceSpec struct {
	// WorkerRef references the TemporalWorkerDeployment to attach this resource to.
	// +kubebuilder:validation:Required
	WorkerRef WorkerDeploymentReference `json:"workerRef"`
```
Should be `WorkerDeploymentRef`, since this refers to the Deployment, not the worker.
```go
)

// WorkerDeploymentReference references a TemporalWorkerDeployment in the same namespace.
type WorkerDeploymentReference struct {
```
The naming of objects in these new files is inconsistent in a couple of ways. Some of the resources are prefixed with Temporal while others are not. Other resources are prefixed with TemporalWorkerOwned but actually refer to things that are owned by the TemporalWorkerDeployment, not the Worker.
```go
	// Object is the Kubernetes resource template to attach to each versioned Deployment.
	// One copy of this resource is created per active Build ID, owned by the corresponding
	// versioned Deployment (so it is garbage collected when the Deployment is deleted).
```
If it's a template, this field really should be called Template :)
```go
	// String values in the spec may contain Go template expressions:
	//   {{ .DeploymentName }}    - the controller-generated versioned Deployment name
	//   {{ .TemporalNamespace }} - the Temporal namespace the worker connects to
	//   {{ .BuildID }}           - the Build ID for this version
```
Allowing template variable interpolation is a dangerous game to get into because you now need to guard on the controller side (as opposed to the API machinery/schema-side) against malformed or malicious content. I'd recommend not supporting template variable interpolation for this first PR.
````markdown
Certain resource kinds are blocked by default to prevent misuse (e.g., using `TemporalWorkerOwnedResource` to spin up arbitrary workloads). The default banned list is:

```
Deployment, StatefulSet, Job, Pod, CronJob
````
Should DaemonSet be excluded as well? What about Service? At some point, you might find that the exclusion list basically includes a bunch of things and that specifying particular supported resource kinds (plus, perhaps CRDs) is a better approach...
```yaml
# Self-signed issuer and certificate for the webhook server TLS.
# Requires cert-manager to be installed in the cluster.
# See https://cert-manager.io/docs/installation/ for installation instructions.
```
Note that you can add cert-manager as a conditional Helm chart dependency.
https://helm.sh/docs/chart_best_practices/dependencies/#conditions-and-tags
```go
// OwnedResourceFieldManager returns the SSA field manager string for a given TWOR.
//
// The field manager identity has two requirements:
//  1. Stable across reconcile loops — the API server uses it to track which fields
//     this controller "owns". Changing it abandons the old ownership records and
//     causes spurious field conflicts until the old entries expire.
//  2. Unique per TWOR instance — if two different TWORs both render resources into
//     the same namespace, their field managers must differ so each can own its own
//     set of fields without conflicting with the other.
//
// Using "twc/{namespace}/{name}" satisfies both: it never changes for a given TWOR
// and is globally unique within the cluster (namespace+name is a unique identifier).
// The string is capped at 128 characters to stay within the Kubernetes API limit.
func OwnedResourceFieldManager(twor *temporaliov1alpha1.TemporalWorkerOwnedResource) string {
	fm := "twc/" + twor.Namespace + "/" + twor.Name
	return TruncateString(fm, 128)
}
```
Why not just have the field manager be a const temporal-worker-controller? The general pattern is for the field manager to be a constant string named after the controller itself.
Examples:
```go
	for k, v := range selectorLabels {
		existingLabels[k] = v
	}
	meta["labels"] = existingLabels
```
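I recommend using the unstructured library instead of manually manipulating map[string]interface{}.
For example, things like this:
```go
if spec, ok := raw["spec"].(map[string]interface{}); ok {
	autoInjectFields(spec, data.DeploymentName, selectorLabels)
}
```
can be this:
```go
// NestedMap returns a deep copy, so the mutated map must be written back.
spec, ok, err := unstructured.NestedMap(raw, "spec")
if err != nil {
	return nil, err
}
if ok {
	autoInjectFields(spec, data.DeploymentName, selectorLabels)
	if err := unstructured.SetNestedMap(raw, spec, "spec"); err != nil {
		return nil, err
	}
}
```
things like this:
```go
raw["metadata"] = meta
```
can be this:
```go
// SetNestedMap takes the value first, then the field path.
if err := unstructured.SetNestedMap(raw, meta, "metadata"); err != nil {
	return nil, err
}
```
etc...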
```go
	meta["ownerReferences"] = []interface{}{
		map[string]interface{}{
			"apiVersion":         appsv1.SchemeGroupVersion.String(),
			"kind":               "Deployment",
			"name":               deployment.Name,
			"uid":                string(deployment.UID),
			"blockOwnerDeletion": blockOwnerDeletion,
			"controller":         isController,
		},
	}

	raw["metadata"] = meta

	obj := &unstructured.Unstructured{Object: raw}
	return obj, nil
```
The unstructured.Unstructured struct already has methods for things like SetNamespace() or SetName(), SetLabels() along with things like SetOwnerReferences(). So you might consider creating obj at the top of this function and using those object methods instead of the manual construction of various fields.
What was changed
New CRD: `TemporalWorkerOwnedResource` (TWOR)
A new `TemporalWorkerOwnedResource` CRD lets users attach arbitrary namespaced Kubernetes resources (HPAs, PDBs, KEDA ScaledObjects, etc.) to a `TemporalWorkerDeployment`. The controller creates one copy of the attached resource per worker version with running workers, with auto-injection of `scaleTargetRef` and `selector.matchLabels` to point at the correct versioned Deployment.
Key behaviors:
- Resource names follow `{twdName}-{tworName}-{buildID}` (uniquely truncated to 47 chars, DNS-safe)
- `spec.scaleTargetRef` (when explicitly null) is injected to reference the versioned Deployment -> enables per-version HPA autoscaling
- `selector.matchLabels` (when explicitly null) is injected with the correct per-version labels -> enables per-version PDB targeting, and arbitrary CRDs that use `selector.matchLabels` to target versioned Deployments
- Per-version apply status is reported in `TemporalWorkerOwnedResource.status.versions[*]` (Applied, Message, BuildID)
- Go template variables such as `{{ .DeploymentName }}` and `{{ .TemporalNamespace }}` are rendered in any string field

Validating Webhook
A `TemporalWorkerOwnedResourceValidator` webhook enforces:
- `metadata.name`/`metadata.namespace` forbidden (controller sets these)
- Banned kinds (`Deployment`, `StatefulSet`, `Job`, `Pod`, `CronJob` by default; configurable via `BANNED_KINDS` env var)
- `minReplicas` ≠ 0 (required for `approximate_task_queue_backlog` metric-based autoscaling to work when the queue is idle)
- `scaleTargetRef` and `selector.matchLabels` must be absent or null (controller owns injection)
- `workerRef.name` is immutable after creation

Helm chart updates
- Webhook configuration passed to the controller (`BANNED_KINDS`, `POD_NAMESPACE`, `SERVICE_ACCOUNT_NAME` env vars)
- RBAC (TWOR rules in the manager `ClusterRole`; editor/viewer roles; configurable attached-resource RBAC)
- New values (`ownedResourceConfig.bannedKinds`; `ownedResourceConfig.rbac.wildcard`, default false; `ownedResourceConfig.rbac.rules`, default: HPA + PDB)

Integration tests
New integration test subtests added to the existing envtest suite, all running through the shared `testTemporalWorkerDeploymentCreation` table-test runner:
- `TemporalWorkerOwnedResource` (7 tests): Deployment owner ref, matchLabels injection, multiple TWORs on the same `TemporalWorkerDeployment`, template variable rendering, multiple active versions, apply failure → Applied:false, SSA idempotency
- … `workerRef.name` immutability
Why?
HPA autoscaling for versioned Temporal workers requires a separate HPA per worker version, each targeting only that version's Deployment with the correct scaleTargetRef and label selectors. Without this CRD, users have no way to create per-version resources that the controller lifecycle-manages alongside the versioned Deployments.
Checklist
Closes #207: Enable CRUD of controller-managed scaling objects (and other custom scalers)
How was this tested:
Any docs updates needed?
- `docs/test-coverage-analysis.md` updated to reflect all newly covered test scenarios*
- `docs/twor.md` added: concept overview, HPA example with cert-manager setup, RBAC configuration guide