
Nodeset rabbitmquser finalizer management and status tracking via configmap #1781

Open

lmiccini wants to merge 1 commit into openstack-k8s-operators:main from lmiccini:nodeset_rmqu_finalizer_configmap

Conversation

@lmiccini (Contributor) commented Jan 27, 2026

@openshift-ci bot commented Jan 27, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lmiccini
Once this PR has been reviewed and has the lgtm label, please assign rabi for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/56ac80bd0e7547ad88350eb0206886b5

✔️ openstack-k8s-operators-content-provider SUCCESS in 3h 18m 47s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 23m 38s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 37m 31s
❌ adoption-standalone-to-crc-ceph-provider FAILURE in 3h 01m 55s
✔️ openstack-operator-tempest-multinode SUCCESS in 1h 51m 23s
❌ openstack-operator-docs-preview POST_FAILURE in 2m 32s

@stuggi requested a review from slagle on January 28, 2026 08:13
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/db62c9cd33b34a538c7eccf243769b6a

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 02m 26s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 20m 56s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 36m 03s
❌ adoption-standalone-to-crc-ceph-provider FAILURE in 1h 46m 57s
✔️ openstack-operator-tempest-multinode SUCCESS in 1h 34m 08s
❌ openstack-operator-docs-preview POST_FAILURE in 3m 15s

@lmiccini force-pushed the nodeset_rmqu_finalizer_configmap branch 2 times, most recently from 3885c4a to c1fe8f8 on February 7, 2026 18:56
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/b5d3972863e64857b2da5055f867ef55

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 20m 43s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 21m 41s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 36m 22s
❌ adoption-standalone-to-crc-ceph-provider FAILURE in 2h 05m 30s
✔️ openstack-operator-tempest-multinode SUCCESS in 1h 43m 01s
✔️ openstack-operator-docs-preview SUCCESS in 3m 14s

@lmiccini (Contributor, Author) commented Feb 8, 2026

/retest

@lmiccini (Contributor, Author) commented Feb 8, 2026

recheck

@lmiccini (Contributor, Author) commented Feb 8, 2026

/test openstack-operator-build-deploy-kuttl-4-18

@lmiccini force-pushed the nodeset_rmqu_finalizer_configmap branch 2 times, most recently from cbfbb7c to f52529a on February 8, 2026 15:01
@lmiccini force-pushed the nodeset_rmqu_finalizer_configmap branch from f52529a to 017d2ca on February 10, 2026 06:41
@lmiccini (Contributor, Author) commented:

/test functional

@lmiccini force-pushed the nodeset_rmqu_finalizer_configmap branch from 017d2ca to 97bb482 on February 12, 2026 09:50
Add dataplane-specific logic to track and manage RabbitMQ user finalizers
for OpenStackDataPlaneNodeSet services, enabling safe credential rotation
across multi-cluster deployments.

Key features:
- Per-nodeset finalizers on shared RabbitMQ users
- Incremental deployment support with proper finalizer timing
- Nova-operator rabbitmq_user_name field integration for simplified tracking
- Automatic cleanup of temporary cleanup-blocked finalizers
- Comprehensive test coverage for rotation and multi-cluster scenarios

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
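
For readers skimming the commit message, a minimal sketch of the per-nodeset finalizer idea (hypothetical helper names and finalizer naming scheme, built on controller-runtime's controllerutil; not the PR's actual code) could look like this:

```go
// Hypothetical sketch of the per-nodeset finalizer idea described in the
// commit message above; helper names and the finalizer naming scheme are
// assumptions, not code taken from this PR.
package dataplane

import (
	"context"
	"fmt"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// nodesetFinalizer builds a finalizer name unique to one NodeSet
// (naming scheme assumed for illustration).
func nodesetFinalizer(nodeSetName string) string {
	return fmt.Sprintf("openstackdataplanenodeset.openstack.org/%s", nodeSetName)
}

// EnsureUserFinalizer adds the per-nodeset finalizer to a shared RabbitMQ
// user object (passed generically as client.Object) and persists the change
// only if the finalizer was not already present.
func EnsureUserFinalizer(ctx context.Context, c client.Client, user client.Object, nodeSetName string) error {
	if controllerutil.AddFinalizer(user, nodesetFinalizer(nodeSetName)) {
		return c.Update(ctx, user)
	}
	return nil
}

// RemoveUserFinalizer drops the per-nodeset finalizer once the NodeSet no
// longer references the user, letting garbage collection of the credential
// proceed when no other NodeSet still holds a finalizer.
func RemoveUserFinalizer(ctx context.Context, c client.Client, user client.Object, nodeSetName string) error {
	if controllerutil.RemoveFinalizer(user, nodesetFinalizer(nodeSetName)) {
		return c.Update(ctx, user)
	}
	return nil
}
```

The key point is that each NodeSet contributes its own finalizer to the shared RabbitMQ user, so the user can only be deleted once every NodeSet has dropped its reference.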
@lmiccini force-pushed the nodeset_rmqu_finalizer_configmap branch from 97bb482 to b1d9350 on February 12, 2026 15:46
@lmiccini (Contributor, Author) commented:

/test openstack-operator-build-deploy-kuttl-4-18

1 similar comment
@lmiccini (Contributor, Author) commented:

/test openstack-operator-build-deploy-kuttl-4-18

@openshift-ci bot commented Feb 13, 2026

@lmiccini: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/openstack-operator-build-deploy-kuttl | 1698305 | link | true | /test openstack-operator-build-deploy-kuttl |
| ci/prow/openstack-operator-build-deploy-kuttl-4-18 | b1d9350 | link | true | /test openstack-operator-build-deploy-kuttl-4-18 |

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@slagle (Contributor) commented Feb 17, 2026

Is preventing the deletion of in use rabbitmq users the point of this PR? Why do we need these finalizers to enable "safe rotation"?

I'm concerned about the size and complexity of this PR. Personally, this is difficult to review. We might want to come up with a simpler design that we code without AI, and then let AI build on top of that. I'm having a hard time reasoning about all the different changes here.

This also adds some service specific code to the dataplane (nova, neutron, ironic). While we have some instances of that, we have really tried to avoid that in the past, and do things generically and let CRD fields drive the generic code.

I'm just brainstorming, but a simpler solution might be:

  • We know the Secret/ConfigMaps in use at service deployment time.
  • Services have a field whose value we use to inspect the Secret/ConfigMap and we save the value found (such as transportURL) on the NodeSet or Deployment Status when the Deployment succeeds
  • rabbitmq user deletion checks NodeSet or Deployment Status and, if it finds that user in use, blocks the deletion.

For example, the nova Service has in the spec:

serviceTrackingFields:
  - dataSource: # ConfigMapRef or SecretRef
    fieldPattern: "nova-transport-url-pattern"

Then during Service Deployment, there is logic similar to GetNovaCellRabbitMqUserFromSecret: we get the value of the user and save it on the NodeSet and/or Deployment Status. If we attempt to rotate or delete the user, and that user is still set on a Status, the operation is blocked.

I would also delay solving the problem of enforcing that all nodes in the nodeset have been updated by a Deployment. This is a wider problem that should be solved separately from the user rotation problem.
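
A rough sketch of the status-based gate proposed above (all types and the RabbitMqUsersInUse field are stand-ins for the real NodeSet/Deployment status, purely illustrative; real code would list OpenStackDataPlaneNodeSet objects and read their status):

```go
// Purely illustrative sketch of the status-based deletion gate; the types
// and the RabbitMqUsersInUse field are hypothetical stand-ins.
package sketch

import "fmt"

// nodeSetStatusView is a stand-in for the relevant part of a NodeSet status.
type nodeSetStatusView struct {
	Name               string
	RabbitMqUsersInUse []string // assumed field: users recorded at deploy time
}

// blockUserDeletionIfInUse returns an error when any NodeSet status still
// records the given RabbitMQ user name, so rotation/deletion is blocked.
func blockUserDeletionIfInUse(nodeSets []nodeSetStatusView, userName string) error {
	for _, ns := range nodeSets {
		for _, inUse := range ns.RabbitMqUsersInUse {
			if inUse == userName {
				return fmt.Errorf("rabbitmq user %q still in use by nodeset %q", userName, ns.Name)
			}
		}
	}
	return nil
}
```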

@slagle (Contributor) left a review comment:

See previous comment

@slagle (Contributor) commented Feb 17, 2026

Or even simpler: we already have the Secret and ConfigMap hashes saved in the Deployment statuses. If the rabbitmq user rotation sees that those hashes are out of date, the rotation, or at least the old-user deletion part of it, is blocked.
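
A compact sketch of this hash-freshness gate (illustrative only; names assumed, not from the PR):

```go
// Sketch of the hash-freshness gate: the old-user deletion step stays
// blocked while any Deployment status still carries a Secret/ConfigMap hash
// that differs from the currently computed one.
package sketch

// rotationBlockedByStaleHashes compares the hashes recorded in a Deployment
// status against freshly computed ones and reports whether any entry is
// missing or out of date.
func rotationBlockedByStaleHashes(deployedHashes, currentHashes map[string]string) bool {
	for name, current := range currentHashes {
		if deployed, ok := deployedHashes[name]; !ok || deployed != current {
			return true // this deployment has not picked up the new Secret/ConfigMap yet
		}
	}
	return false
}
```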

@lmiccini (Contributor, Author) commented Feb 18, 2026

> Is preventing the deletion of in use rabbitmq users the point of this PR? Why do we need these finalizers to enable "safe rotation"?
>
> [... full comment quoted above ...]

Thanks @slagle , appreciate you taking the time.
The logic is more or less what you are proposing here.
We add finalizers to the rabbitmq users so that each service can "signal" that they are in use, and we garbage-collect a user only when no finalizer is present, following the same pattern we use elsewhere, to avoid leftover credentials that could pose a security risk.

The additional stuff "on top" is required because we could have different rabbitmq users for the nova_compute, neutron and ironic agents running in the dataplane. So I try to track which node in a nodeset ran a deployment for the aforementioned services and store that in a configmap, which we keep updating until all nodes have reconciled to the hashes you mention in the last comment. Here is how it could look:

[zuul@localhost ~]$ oc get configmap openstack-edpm-ipam-service-tracking -o yaml
apiVersion: v1
data:
  neutron.secretHash: 6e657574726f6e2d646863702d6167656e742d6e657574726f6e2d636f6e6669673a313737303632353235383b6e657574726f6e2d7372696f762d6167656e742d6e657574726f6e2d636f6e6669673a313737303632353235383b
  neutron.updatedNodes: '[]'
  nova.secretHash: 6e6f76612d63656c6c312d636f6d707574652d636f6e6669673a313737303634333733313b
  nova.updatedNodes: '["edpm-compute-0","edpm-compute-1"]'
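
As an aside, the secretHash values in that ConfigMap are hex-encoded "secret-name:hash;" pairs; a trivial decode (illustrative, not part of the PR) shows, for example, that the nova entry above decodes to nova-cell1-compute-config:1770643731;

```go
// Decode one of the hex-encoded secretHash values shown above.
package main

import (
	"encoding/hex"
	"fmt"
)

func main() {
	raw := "6e6f76612d63656c6c312d636f6d707574652d636f6e6669673a313737303634333733313b"
	decoded, err := hex.DecodeString(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(decoded)) // prints: nova-cell1-compute-config:1770643731;
}
```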

If I understand correctly, you would like to flip this around and have infra-operator track each nodeset's rabbitmq usage instead? I'm not sure having infra-operator introspect dataplane objects is my preferred approach, especially because we have no way of knowing whether an additional service that could use rabbitmq will be added tomorrow, so we would have to play catch-up with the dataplane. That said, I can try to prototype something and see how ugly it gets.
Thanks again.

Labels: none yet
Projects: none yet
4 participants