HDDS-15208. OM should learn to finalize from SCM#10236
Conversation
errose28
left a comment
There was a problem hiding this comment.
Thanks for adding this. Overall looks good, just left some minor comments inline.
| conf.setInt(OMStorage.TESTING_INIT_APPARENT_VERSION_KEY, OzoneManagerVersion.ATOMIC_REWRITE_KEY.serialize()); | ||
| conf.setInt(SCMStorageConfig.TESTING_INIT_LAYOUT_VERSION_KEY, | ||
| HDDSLayoutFeature.HADOOP_PRC_PORTS_IN_DATANODEDETAILS.layoutVersion()); | ||
| conf.set(OMConfigKeys.OZONE_OM_UPGRADE_FINALIZATION_CHECK_INTERVAL, "10ms"); |
There was a problem hiding this comment.
Why were these versions changed? The OM version used here is older than the unified versioning framework but not an OMLayoutFeature so it is not meant to be written to the version file.
There was a problem hiding this comment.
I think I was making changes to try to get this to work locally, but was failing due to the annotation / aspectj stuff and then forgot to change it back. I've reset it and will see if it passes the CI now.
| conf.setInt(SCMStorageConfig.TESTING_INIT_LAYOUT_VERSION_KEY, | ||
| HDDSLayoutFeature.HADOOP_PRC_PORTS_IN_DATANODEDETAILS.layoutVersion()); |
There was a problem hiding this comment.
Why do we need to change the HDDS version in this and other OM tests?
There was a problem hiding this comment.
If SCM does not need finalized then OM will immediately finalize on the first poll of SCM and it makes it difficult to control the flow in the test. Therefore we need to make it so that SCM is unfinalized, then we finalize it and that triggers the OM finalize.
| cluster.getStorageContainerLocationClient().finalizeUpgrade(); | ||
| OMUpgradeTestUtils.waitForFinalization(omClient); |
There was a problem hiding this comment.
I'm pretty sure these will always be called together and only in an integration test scenario. Should we combine them into a single method in MiniOzoneCluster to finalize and wait, similar to waitForClusterToBeReady?
There was a problem hiding this comment.
I don't know - the discoverability of these sort of methods isn't great and there may be places we want to have something in between.
| omClient.finalizeUpgrade("finalize-test"); | ||
| System.out.println("Finalization Messages : " + response.msgs()); | ||
| // Send the finalize command to SCM which triggers the OM finalize when SCM reports it is complete. | ||
| cluster.getStorageContainerLocationClient().finalizeUpgrade(); |
There was a problem hiding this comment.
The audit log check on the next line should be after we have waited for OM to finish finalizing. Also I think we want to switch this to the system audit log instead of the user audit log since it's triggered by a background process, not a user call to the OM.
There was a problem hiding this comment.
I guess this work (now and in preexisting tests) because the audit log check retries until it finds what it needs. I swapped the lines around.
I didn't know there was a system audit log. I guess its still kind of triggered by the use, as someone has to send a finalize command which then propagates through the system.
There was a problem hiding this comment.
I wonder if we should change to the system audit in this PR, or in the one that makes the OM command a noop. If we change it here, it may break things in the existing command which we want to remove. I am open to either way.
| .build(); | ||
| OMClientRequest omClientRequest = OzoneManagerRatisUtils.createClientRequest(omRequest, ozoneManager); | ||
| omRequest = omClientRequest.preExecute(ozoneManager); | ||
| OzoneManagerRatisUtils.submitRequest(ozoneManager, omRequest, ClientId.randomId(), |
There was a problem hiding this comment.
We should probably get the response and log if it fails.
There was a problem hiding this comment.
Yea that makes sense. I kind of assumed failure would result in an exception, but that probably isn't always the case.
What changes were proposed in this pull request?
When OM is started, it checks if finalization is needed. If so, it spawns a background service which polls SCM to see if it is time to finalize or not. If SCM indicates it should finalize, then it will finalize the OMs via Ratis and then shutdown the background service to avoid any further polling.
The poll interval defaults to 60 seconds, but its configurable via ozone.om.upgrade.finalization.check.interval.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-15208
How was this patch tested?
Existing integration tests were modified to remove the calls to the old OM finalize command, and instead trigger finalize via SCM. New unit tests have been added for the new polling service.