
Commit 7d548af

Enhancement: Add drift detection and automatic reconciliation
Proposal for drift detection feature.
1 parent eeb37af commit 7d548af

1 file changed

Lines changed: 277 additions & 0 deletions

enhancements/drift-detection.md

@@ -0,0 +1,277 @@
1+
# Enhancement: Drift Detection and Automatic Reconciliation
2+
3+
| Field | Value |
|-------|-------|
| **Status** | implementable |
| **Author(s)** | @eshulman |
| **Created** | 2026-02-03 |
| **Last Updated** | 2026-02-03 |
| **Tracking Issue** | TBD |
10+
11+
## Summary
12+
13+
This enhancement introduces drift detection and automatic reconciliation for ORC-managed resources. The feature enables ORC to periodically check OpenStack resources for changes made outside of ORC (via CLI, dashboard, or other tools) and automatically restore them to match the desired state defined in the Kubernetes specification.
14+
15+
Additionally, managed resources that are deleted externally from OpenStack will be automatically recreated by ORC, ensuring the declared state is maintained.
16+
17+
## Motivation
18+
19+
In production environments, OpenStack resources may be modified outside of ORC through various means:
20+
21+
- Direct OpenStack CLI/SDK operations
22+
- OpenStack Horizon dashboard
23+
- Other automation tools or controllers
24+
- Manual emergency interventions
25+
- Third-party integrations
26+
27+
Without drift detection, these changes go unnoticed until they cause issues, leading to configuration drift between the declared Kubernetes state and the actual OpenStack state. This undermines the declarative model that ORC provides.
28+
29+
Similar Kubernetes controllers for cloud resources have implemented drift detection:
30+
31+
- **AWS Controllers for Kubernetes (ACK)**: Drift detection is **enabled by default** with a 10-hour resync period. Uses a detect-then-correct approach: periodically describes the AWS resource and only updates if drift is found. Configuration is set per-controller by authors, not configurable per-resource by users. No per-resource opt-out mechanism documented. ([ACK Drift Recovery docs](https://aws-controllers-k8s.github.io/community/docs/user-docs/drift-recovery/))
32+
33+
- **Azure Service Operator (ASO)**: Drift detection is **enabled by default** with a 1-hour resync period. Uses a PUT-on-every-reconcile approach rather than detect-then-correct. Provides **per-resource opt-out** via `reconcile-policy` annotation for adopted resources users don't want fully managed. **Global configuration** via `AZURE_SYNC_PERIOD` environment variable. Rate limiting via token-bucket algorithm and `MAX_CONCURRENT_RECONCILES` for parallelism control. ([ASO Controller Settings](https://azure.github.io/azure-service-operator/guide/aso-controller-settings-options/), [ASO Change Detection ADR](https://azure.github.io/azure-service-operator/design/adr-2022-11-change-detection/))
34+
35+
**Key design observations:**
36+
- Both projects enable drift detection by default
37+
- ASO provides more user-facing configuration options (global and per-resource)
38+
- Neither project documents behavior for externally deleted resources
39+
40+
## Goals
41+
42+
- **Ensure state consistency**: Managed resources in OpenStack should match the desired state declared in Kubernetes
43+
- **Detect external modifications**: Identify when OpenStack resources are modified outside of ORC
44+
- **Automatic correction**: Restore drifted resources to their desired state without manual intervention
45+
- **Resource recreation**: Recreate managed resources that are deleted externally from OpenStack
46+
- **Configurable frequency**: Allow operators to tune the resync interval based on their requirements
47+
- **Hierarchical configuration**: Support configuration at ORC-wide and per-resource levels, at minimum
48+
- **Minimal API impact**: Avoid excessive OpenStack API calls that could trigger rate limiting
49+
50+
## Non-Goals
51+
52+
- **Real-time drift detection**: Event-driven detection of changes (would require OpenStack webhooks or very short polling intervals)
53+
- **Drift reporting without correction**: Alerting on drift without taking corrective action. This applies to both mutable fields (which are corrected, not just reported) and immutable fields (which are ignored, not reported). May be considered as a future enhancement.
54+
- **Selective field reconciliation**: Allowing some fields to drift while correcting others
55+
- **Conflict resolution with merge semantics**: Merging external changes with desired state
56+
- **Drift correction for unmanaged resources**: Unmanaged resources are not modified by ORC; however, periodic resync will refresh their status to reflect the current OpenStack state
57+
58+
## Proposal
59+
60+
### Periodic Resync Mechanism
61+
62+
The drift detection mechanism works by periodically triggering reconciliation of resources. Unlike event-driven reconciles (triggered by Kubernetes spec/status changes), drift detection uses a time-based trigger to catch changes made directly in OpenStack. For managed resources, this includes drift correction; for unmanaged resources, this refreshes the status only.
63+
64+
1. **Trigger**: After a resource reaches a stable state (`Progressing=False`), ORC schedules a resync after the `resyncPeriod` duration
65+
2. **Fetch**: On resync, ORC fetches the current state of the OpenStack resource
66+
3. **Compare**: The current state is compared against the desired state in the Kubernetes spec
67+
4. **Update**: If drift is detected, ORC updates the OpenStack resource to match the desired state
68+
5. **Reschedule**: After successful reconciliation, the next resync is scheduled
69+
70+
#### Implementation Details
71+
72+
At the end of a successful reconciliation (when no other reschedule is pending), the controller schedules the next resync:
73+
74+
```go
// If periodic resync is enabled and we're not already rescheduling for
// another reason, schedule the next resync to detect drift.
if resyncPeriod > 0 {
	needsReschedule, _ := reconcileStatus.NeedsReschedule()
	if !needsReschedule {
		reconcileStatus = reconcileStatus.WithRequeue(resyncPeriod)
	}
}
```
84+
85+
This ensures the controller automatically triggers reconciliation after the configured period.
86+
87+
Additionally, `shouldReconcile` must be updated to allow periodic resync. Currently it returns `false` when `Progressing=False` and generation is current, which would discard resync requests. The updated logic checks the last sync timestamp:
88+
89+
```go
func shouldReconcile(obj orcv1alpha1.ObjectWithConditions, resyncPeriod time.Duration) bool {
	// ... existing checks ...

	// At this point, Progressing is False and generation is up to date.
	// For periodic resync, check if enough time has passed since the last sync.
	if resyncPeriod > 0 {
		if lastSync := obj.GetLastSyncTime(); lastSync != nil {
			return time.Since(lastSync.Time) >= resyncPeriod
		}
		return true // First sync after feature enablement
	}
	return false
}
```
104+
105+
**Note**: Using `Progressing.LastTransitionTime` is not suitable because it only updates when the condition value changes, not on every reconcile. A dedicated `LastSyncTime` status field is required (see Status Changes below).
106+
107+
**Resources in terminal error are not resynced**: When a resource is in a terminal error state (e.g., invalid configuration, unrecoverable OpenStack error), periodic resync is not scheduled. Terminal errors indicate issues that cannot be resolved through automatic retry and require manual intervention to fix the underlying problem. This prevents wasted reconciliation cycles on resources that are known to be in an unrecoverable state.
108+
109+
### API Changes
110+
111+
A `resyncPeriod` field is added at the spec level, making it available to both managed and unmanaged resources:
112+
113+
```yaml
apiVersion: openstack.k-orc.cloud/v1alpha1
kind: Network
metadata:
  name: critical-network
spec:
  cloudCredentialsRef:
    secretName: openstack-clouds
    cloudName: openstack
  managementPolicy: managed
  resyncPeriod: 1h # Periodic resync every hour
  resource:
    description: Critical application network
```
127+
128+
**Default**: Disabled (`0`). Set a positive duration like `10h` to enable.
129+
130+
### Status Changes
131+
132+
A new `lastSyncTime` field is added to the status of all ORC resources:
133+
134+
```yaml
status:
  lastSyncTime: "2026-02-03T10:30:00Z" # Last successful reconciliation with OpenStack
  id: "abc123"
  # ... other status fields
```
140+
141+
This field is updated at the end of every successful reconciliation that fetches the resource from OpenStack. It is required because:
142+
143+
1. **Controller restarts**: Without persisted state, the controller would lose track of when resources were last synced, potentially causing a thundering herd of reconciliations on restart.
144+
2. **Accurate timing**: The `Progressing.LastTransitionTime` only updates when the condition value changes, not on every reconcile, making it unsuitable for tracking sync intervals.
145+
146+
The `shouldReconcile` function uses this field to determine if enough time has passed since the last sync to trigger a periodic resync.
147+
148+
### Behavior by Management Policy
149+
150+
The periodic resync behavior differs based on `managementPolicy`:
151+
152+
| Policy | On Resync |
|--------|-----------|
| `managed` | Fetch from OpenStack → correct drift → update status |
| `unmanaged` | Fetch from OpenStack → update status only (no writes to OpenStack) |
156+
157+
This allows unmanaged/imported resources to keep their `status.resource` in sync with the actual OpenStack state without ORC modifying the resource.
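
The difference between the two policies can be sketched with a toy model. The `policy` and `resource` types below are hypothetical simplifications, not ORC's API; a single string stands in for the whole resource state.

```go
package main

import "fmt"

// Hypothetical, simplified types for illustration; they are not ORC's API.
type policy string

const (
	managed   policy = "managed"
	unmanaged policy = "unmanaged"
)

type resource struct {
	desired string // value declared in the Kubernetes spec
	actual  string // value currently held by OpenStack
	status  string // value last recorded in status.resource
	writes  int    // number of writes issued to OpenStack
}

// resync performs one periodic resync pass for the given policy.
func resync(p policy, r *resource) {
	observed := r.actual // fetch from OpenStack
	if p == managed && observed != r.desired {
		r.actual = r.desired // correct drift (managed only)
		r.writes++
		observed = r.actual
	}
	r.status = observed // status is refreshed for both policies
}

func main() {
	m := &resource{desired: "a", actual: "b"}
	resync(managed, m)
	u := &resource{desired: "a", actual: "b"}
	resync(unmanaged, u)
	fmt.Printf("managed:   %+v\nunmanaged: %+v\n", *m, *u)
}
```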
158+
159+
### Configuration Hierarchy
160+
161+
Drift detection supports a two-level configuration hierarchy:
162+
163+
| Level | Scope | Configuration Location | Precedence |
|-------|-------|----------------------|------------|
| ORC-wide | All resources across all types | CLI flag | Lowest |
| Per-resource | Individual resource instance | `spec.resyncPeriod` on the CR | Highest |
167+
168+
**Resolution order**: Per-resource → ORC-wide → Built-in default (disabled)
169+
170+
#### ORC-wide Configuration Options
171+
172+
A CLI flag sets the global default:
173+
174+
```
--default-resync-period=10h
```
177+
178+
For per-resource-type configuration, platform teams can use [kro (Kube Resource Orchestrator)](https://kro.run/) to wrap ORC resources with organizational defaults without changes to ORC itself.
179+
180+
### Resource Recreation on External Deletion
181+
182+
When a resource with `managementPolicy=managed` is deleted from OpenStack but the ORC object still exists:
183+
184+
1. On the next reconciliation, ORC attempts to fetch the resource by the ID stored in `status.id`
185+
2. If not found and the resource was originally created by ORC (not imported), ORC recreates it
186+
3. The new resource ID is stored in `status.id`
187+
188+
#### Implementation Changes
189+
190+
Currently, `GetOrCreateOSResource` returns a terminal error when fetching a resource by `status.id` results in a 404. To support resource recreation, this logic must be updated to:
191+
192+
1. Check if `managementPolicy == managed` and the resource was not imported (no `importID` or `importFilter`)
193+
2. If both conditions are met, clear `status.id` and proceed to the creation path instead of returning an error
194+
3. If the resource was imported or is unmanaged, retain the existing terminal error behavior
195+
196+
This ensures that managed resources created by ORC are automatically recreated, while imported or unmanaged resources correctly fail with a terminal error when deleted externally.
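
The decision described above can be summarised in a few lines. This is an illustrative sketch only: `onStatusIDNotFound` and the `notFoundAction` type are hypothetical names, and the `imported` argument stands for "the spec sets `importID` or `importFilter`".

```go
package main

import "fmt"

// notFoundAction is a hypothetical type naming the two possible outcomes.
type notFoundAction string

const (
	recreate      notFoundAction = "recreate"
	terminalError notFoundAction = "terminal error"
)

// onStatusIDNotFound decides what to do when fetching by status.id returns 404.
// Only managed resources that ORC itself created are recreated.
func onStatusIDNotFound(managementPolicy string, imported bool) notFoundAction {
	if managementPolicy == "managed" && !imported {
		return recreate // clear status.id and take the creation path
	}
	return terminalError // imported or unmanaged: keep existing behavior
}

func main() {
	for _, policy := range []string{"managed", "unmanaged"} {
		for _, imported := range []bool{false, true} {
			fmt.Printf("policy=%s imported=%t -> %s\n",
				policy, imported, onStatusIDNotFound(policy, imported))
		}
	}
}
```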
197+
198+
**Behavior when drift detection is disabled** (`resyncPeriod: 0`): Periodic resyncs do not occur, so discovery of external deletion depends on other triggers (spec change, controller restart). When discovered, ORC will still recreate managed resources (not a terminal error). The difference is timing of discovery, not the recreation behavior itself.
199+
200+
For **imported resources** that are deleted externally, this is always a terminal error regardless of drift detection settings, because the resource was not created by ORC and recreating it would not restore the original resource.
201+
202+
**Note on dependent resources**: OpenStack enforces referential integrity for most resources (e.g., Networks cannot be deleted while Subnets exist). If resources are deleted through means that bypass these checks (direct database manipulation, OpenStack bugs), drift detection preserves ORC's existing reconciliation behavior:
203+
204+
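- **Parent resource (e.g., Network)**: On next reconciliation, `GetOSResourceByID` returns 404. Today this is a terminal error ("resource has been deleted from OpenStack"); with the recreation change described in Implementation Changes above, a managed parent that ORC created would instead be recreated.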
205+
- **Dependent resource update path (e.g., Subnet update)**: The controller doesn't check if its parent dependency is in terminal error. It fetches the resource by `status.id`, and if successful, proceeds with the update. The result depends on what OpenStack returns for that specific operation and would preserve the existing error handling behavior.
206+
- **Dependent resource create/recreate path**: The controller checks `IsAvailable(parent)` before proceeding. If the parent is in terminal error, the dependent waits on the dependency (not terminal, just waiting).
207+
208+
These behaviors exist regardless of drift detection—drift detection only changes scheduling, not reconciliation logic. Resolving such inconsistencies requires manual intervention.
209+
210+
### Field Coverage
211+
212+
Drift detection covers all **mutable fields** that ORC actuators implement update operations for. Before this feature is considered stable, all actuator implementations must be audited to ensure they cover all mutable fields.
213+
214+
## Risks and Edge Cases
215+
216+
### Split-Brain Scenarios
217+
218+
**Risk**: Multiple controllers or systems may be managing the same OpenStack resources, leading to conflicts where changes are repeatedly overwritten.
219+
220+
**Mitigation**:
221+
- Document that ORC should be the sole manager of resources it creates
222+
- Report conflicts in resource conditions for observability
223+
224+
### API Rate Limiting
225+
226+
**Risk**: Frequent resync across many resources could trigger OpenStack API rate limiting.
227+
228+
**Mitigation**:
229+
- Disabled by default; when enabled, recommend conservative intervals (e.g., 10 hours)
230+
- Add random jitter to resync times to avoid thundering herd: since reconciliation already uses "requeue after X duration", jitter simply adds a random offset (e.g., ±10%) to the resync period, spreading resyncs over time rather than having them fire simultaneously
231+
- Allow operators to disable or lengthen resync for stable resources
232+
233+
### Controller Resource Consumption
234+
235+
**Risk**: Frequent reconciliation increases CPU and memory usage on the ORC controller.
236+
237+
**Mitigation**:
238+
- Disabled by default; when enabled, conservative intervals limit reconciliation frequency
239+
240+
### Conflicts with External Systems
241+
242+
**Risk**: If resources are intentionally managed by external systems (e.g., autoscalers, other controllers), drift correction can cause unexpected behavior.
243+
244+
**Mitigation**:
245+
- Allow `resyncPeriod: 0` to disable drift detection
246+
- Use `managementPolicy: unmanaged` for externally managed resources
247+
- Document the implications clearly in the user guide
248+
249+
### Upgrade/Downgrade Considerations
250+
251+
**Risk**: Users upgrading to a version with drift detection may experience unexpected reconciliations.
252+
253+
**Mitigation**: Drift detection is disabled by default (opt-in), so users upgrading will not experience any behavior change unless they explicitly enable it. Document the new feature in release notes.
254+
255+
## Alternatives Considered
256+
257+
### Event-Driven Drift Detection
258+
259+
Use OpenStack notifications (Oslo messaging) to detect changes in real-time.
260+
261+
**Rejected because**: It requires OpenStack notification infrastructure, is complex to implement, and not all deployments have notifications enabled.
262+
263+
### Drift Detection Without Correction
264+
265+
Detect and report drift without automatically correcting it.
266+
267+
**Out of scope for this enhancement**: While drift notification has value for observability, it is better addressed as a separate alerting effort. This enhancement focuses on drift correction; reporting-only mode could be added as a future management policy option.
268+
269+
### Watch-Based Detection
270+
271+
Implement a watcher that periodically lists all resources from OpenStack and compares.
272+
273+
**Rejected because**: List operations can be expensive and are harder to implement with proper filtering, whereas per-resource reconciliation integrates naturally with controller-runtime.
274+
275+
## Implementation History
276+
277+
- 2026-02-03: Enhancement proposed
