Skip to content

HDDS-15208. OM should learn to finalize from SCM#10236

Open
sodonnel wants to merge 12 commits into
apache:HDDS-14496-zdufrom
sodonnel:HDDS-15208
Open

HDDS-15208. OM should learn to finalize from SCM#10236
sodonnel wants to merge 12 commits into
apache:HDDS-14496-zdufrom
sodonnel:HDDS-15208

Conversation

@sodonnel
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

When OM is started, it checks if finalization is needed. If so, it spawns a background service which polls SCM to see if it is time to finalize or not. If SCM indicates it should finalize, then it will finalize the OMs via Ratis and then shutdown the background service to avoid any further polling.

The poll interval defaults to 60 seconds, but its configurable via ozone.om.upgrade.finalization.check.interval.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15208

How was this patch tested?

Existing integration tests were modified to remove the calls to the old OM finalize command, and instead trigger finalize via SCM. New unit tests have been added for the new polling service.

@github-actions github-actions Bot added the zdu Pull requests for Zero Downtime Upgrade (ZDU) https://issues.apache.org/jira/browse/HDDS-14496 label May 11, 2026
@errose28 errose28 self-requested a review May 11, 2026 15:44
Copy link
Copy Markdown
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this. Overall looks good, just left some minor comments inline.

Comment on lines +115 to +118
conf.setInt(OMStorage.TESTING_INIT_APPARENT_VERSION_KEY, OzoneManagerVersion.ATOMIC_REWRITE_KEY.serialize());
conf.setInt(SCMStorageConfig.TESTING_INIT_LAYOUT_VERSION_KEY,
HDDSLayoutFeature.HADOOP_PRC_PORTS_IN_DATANODEDETAILS.layoutVersion());
conf.set(OMConfigKeys.OZONE_OM_UPGRADE_FINALIZATION_CHECK_INTERVAL, "10ms");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why were these versions changed? The OM version used here is older than the unified versioning framework but not an OMLayoutFeature so it is not meant to be written to the version file.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I was making changes to try to get this to work locally, but was failing due to the annotation / aspectj stuff and then forgot to change it back. I've reset it and will see if it passes the CI now.

Comment on lines +237 to +238
conf.setInt(SCMStorageConfig.TESTING_INIT_LAYOUT_VERSION_KEY,
HDDSLayoutFeature.HADOOP_PRC_PORTS_IN_DATANODEDETAILS.layoutVersion());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to change the HDDS version in this and other OM tests?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If SCM does not need finalized then OM will immediately finalize on the first poll of SCM and it makes it difficult to control the flow in the test. Therefore we need to make it so that SCM is unfinalized, then we finalize it and that triggers the OM finalize.

Comment on lines +218 to +219
cluster.getStorageContainerLocationClient().finalizeUpgrade();
OMUpgradeTestUtils.waitForFinalization(omClient);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure these will always be called together and only in an integration test scenario. Should we combine them into a single method in MiniOzoneCluster to finalize and wait, similar to waitForClusterToBeReady?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know - the discoverability of these sort of methods isn't great and there may be places we want to have something in between.

omClient.finalizeUpgrade("finalize-test");
System.out.println("Finalization Messages : " + response.msgs());
// Send the finalize command to SCM which triggers the OM finalize when SCM reports it is complete.
cluster.getStorageContainerLocationClient().finalizeUpgrade();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The audit log check on the next line should be after we have waited for OM to finish finalizing. Also I think we want to switch this to the system audit log instead of the user audit log since it's triggered by a background process, not a user call to the OM.

Copy link
Copy Markdown
Contributor Author

@sodonnel sodonnel May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this work (now and in preexisting tests) because the audit log check retries until it finds what it needs. I swapped the lines around.

I didn't know there was a system audit log. I guess its still kind of triggered by the use, as someone has to send a finalize command which then propagates through the system.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should change to the system audit in this PR, or in the one that makes the OM command a noop. If we change it here, it may break things in the existing command which we want to remove. I am open to either way.

.build();
OMClientRequest omClientRequest = OzoneManagerRatisUtils.createClientRequest(omRequest, ozoneManager);
omRequest = omClientRequest.preExecute(ozoneManager);
OzoneManagerRatisUtils.submitRequest(ozoneManager, omRequest, ClientId.randomId(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably get the response and log if it fails.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea that makes sense. I kind of assumed failure would result in an exception, but that probably isn't always the case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

zdu Pull requests for Zero Downtime Upgrade (ZDU) https://issues.apache.org/jira/browse/HDDS-14496

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants