HDDS-14927. Add Quasi-Closed Container Tracking in Recon.#10198
HDDS-14927. Add Quasi-Closed Container Tracking in Recon.#10198ArafatKhan2198 wants to merge 7 commits intoapache:masterfrom
Conversation
devmadhuu
left a comment
There was a problem hiding this comment.
Thanks @ArafatKhan2198 for the patch. Please see comments in line.
| */ | ||
| @GET | ||
| @Path("/quasiClosed") | ||
| public Response getQuasiClosedContainers( |
There was a problem hiding this comment.
Can we not use existing /unhealthy API endpoint, why need new API end point ?
There was a problem hiding this comment.
My Idea was that The /unhealthy endpoint reads from the Derby UNHEALTHY_CONTAINERS table which tracks replication health. QUASI_CLOSED is a lifecycle state, not a replication health problem, so I feel It doesn't belong in that table or that flow.
| @Path("/quasiClosed") | ||
| public Response getQuasiClosedContainers( | ||
| @DefaultValue(DEFAULT_FETCH_COUNT) @QueryParam(RECON_QUERY_LIMIT) int limit, | ||
| @DefaultValue(PREV_CONTAINER_ID_DEFAULT_VALUE) @QueryParam(RECON_QUERY_PREVKEY) long prevKey) { |
There was a problem hiding this comment.
Also this new API endpoint uses prevKey as parameter, but frontend is sending minContainerId, so API will always fallback to default value of prevKey as 0 and pagination is broke here completely. Every "next page" click will return the first page again.
There was a problem hiding this comment.
Fixed. Changed @QueryParam(RECON_QUERY_PREVKEY) to @QueryParam(RECON_QUERY_MIN_CONTAINER_ID) so the backend now correctly reads the minContainerId parameter that the frontend sends. Pagination should work correctly now.
| long lastKey = metaList.isEmpty() ? prevKey : metaList.get(metaList.size() - 1).getContainerID(); | ||
| long total = containerManager.getContainerStateCount(HddsProtos.LifeCycleState.QUASI_CLOSED); | ||
|
|
||
| return Response.ok(new QuasiClosedContainersResponse(total, firstKey, lastKey, metaList)).build(); |
There was a problem hiding this comment.
Here, this QuasiClosedContainersResponse object is used to wrap the response sent to frontend, but frontend uses ContainersPaginationResponse and two new fields added quasiClosedCount and totalCount there, but don't see totalCount field got added in backend response object - QuasiClosedContainersResponse
| } | ||
| List<ContainerHistory> replicas = containerManager.getLatestContainerHistory(containerID, requiredNodes); | ||
|
|
||
| UnhealthyContainerMetadata metadata = new UnhealthyContainerMetadata( |
There was a problem hiding this comment.
I would recommend to create a dedicated QuasiClosedContainerMetadata that has fields like stateEnterTime, replicaCount, replicaDetails, else use of existing UnhealthyContainerMetadata fields doesn't feel semantically correct and confusing.
There was a problem hiding this comment.
Fixed. Created a dedicated QuasiClosedContainerMetadata DTO with semantically correct fields (stateEnterTime, expectedReplicaCount, actualReplicaCount, pipelineID, keys, replicas). Updated QuasiClosedContainersResponse to use List<QuasiClosedContainerMetadata> and removed the temporary extra constructor that was added to UnhealthyContainerMetadata. The frontend types are also updated with a separate QuasiClosedContainer type.
| } catch (Exception e) { | ||
| LOG.warn("Could not get required nodes for container {}", containerID, e); | ||
| } | ||
| List<ContainerHistory> replicas = containerManager.getLatestContainerHistory(containerID, requiredNodes); |
There was a problem hiding this comment.
We are calling this method inside a stream lambda and has no exception handling for this method when called here. If it throws a runtime exception, then entire response will fail. Better extract whole logic into a helper method (like toQuasiClosedMetadata) with proper try/catch and WebApplicationException wrapping. You can see how below is done:
private UnhealthyContainerMetadata toUnhealthyMetadata(
ContainerHealthSchemaManager.UnhealthyContainerRecord record) {
try {
...
} catch (IOException ioEx) {
throw new UncheckedIOException(ioEx);
}
}
There was a problem hiding this comment.
Fixed. Extracted a toQuasiClosedMetadata(ContainerInfo ci) helper method with a proper try/catch block that wraps exceptions in WebApplicationException with INTERNAL_SERVER_ERROR, mirroring the existing toUnhealthyMetadata pattern.
| } | ||
|
|
||
| // Also include the quasi-closed count in the summary for the frontend Highlights tab | ||
| long quasiClosedCount = containerManager.getContainerStateCount(HddsProtos.LifeCycleState.QUASI_CLOSED); |
There was a problem hiding this comment.
Actually this is not good to inject here, because above all calls are fetching data from DB, so they are making DB call. But here, its a in-memory data to be retrieved and if user just wants to see QUASI CLOSED containers, we are simply making DB call also and retrieving all other unhealthy containers.
There was a problem hiding this comment.
Fixed. Removed quasiClosedCount from UnhealthyContainersResponse and the injection from getUnhealthyContainersFromSchema. On the frontend, the Highlights card now gets the quasi-closed count via an independent background call to /api/v1/containers/quasiClosed?limit=1 on page load and on full refresh, keeping the unhealthy DB query path completely separate from the in-memory quasi-closed count.
devmadhuu
left a comment
There was a problem hiding this comment.
@ArafatKhan2198 Pls check some issues.
| replicaMismatchCount: number; | ||
| quasiClosedCount: number; | ||
| } | ||
|
|
There was a problem hiding this comment.
Pls check , in latest commit, quasi closed tab is removed from UI code.
| overReplicatedCount: number; | ||
| misReplicatedCount: number; | ||
| replicaMismatchCount: number; | ||
| quasiClosedCount: number; |
There was a problem hiding this comment.
This is required field, but from containers.tsx, its removed
devmadhuu
left a comment
There was a problem hiding this comment.
@ArafatKhan2198 Pls check few comments. Also pls add the JIRA link and Test plan in your PR description.
| '3': 'OVER_REPLICATED', | ||
| '4': 'MIS_REPLICATED', | ||
| '5': 'REPLICA_MISMATCH', | ||
| '6': 'QUASI_CLOSED', |
There was a problem hiding this comment.
Why this entry needs to be added. We are not using it anyways and deciding the URL to call. Because for other tabs, its "/api/v1/containers/unhealthy", but for tab 6, we have different URL /quasiClosed. Better to remove from this map and explicitly handle using 6 as key something like:
if (tabKey === '6') { /* quasi-closed path */ return; }
if (!containerStateName) return; // skips tab '7' (Export)
There was a problem hiding this comment.
TAB_STATE_MAP: The '6': 'QUASI_CLOSED' entry is removed. That map is only for building /api/v1/containers/unhealthy/{state} URLs for tabs 1–5. Tab 6 never used that value in practice because it always calls /quasiClosed instead. Keeping it in the map was misleading. The flow is now: handle tabKey === '6' first (quasi-closed path), then if (!containerStateName) return for the Export tab and any unknown keys, then the unhealthy fetch for tabs 1–5.
| const mapped = pageContainers.map(c => ({ | ||
| ...c, | ||
| containerState: 'QUASI_CLOSED', | ||
| unhealthySince: c.stateEnterTime, |
There was a problem hiding this comment.
Instead of doing like this, better define a mapper function, so in future it can catch any type safety issues. If tomorrow someone renames containerID to id in QuasiClosedContainer, TypeScript won't warn you that c.containerID is now undefined
function toContainer(qc: QuasiClosedContainer): Container {
return {
containerID: qc.containerID,
pipelineID: qc.pipelineID,
keys: qc.keys,
containerState: 'QUASI_CLOSED',
unhealthySince: qc.stateEnterTime, // clear rename
expectedReplicaCount: qc.expectedReplicaCount,
actualReplicaCount: qc.actualReplicaCount,
replicaDeltaCount: qc.actualReplicaCount - qc.expectedReplicaCount,
reason: '',
replicas: qc.replicas,
};
}
const mapped: Container[] = pageContainers.map(toContainer); // fully typed, no 'any'
There was a problem hiding this comment.
The inline spread + cast is replaced with a small toContainer(qc: QuasiClosedContainer): Container helper that maps every field explicitly, then const mapped: Container[] = pageContainers.map(toContainer). That way renames or shape changes on QuasiClosedContainer surface at compile time instead of silently breaking at runtime.
| } | ||
|
|
||
| export type QuasiClosedTabState = { | ||
| data: QuasiClosedContainer[]; |
There was a problem hiding this comment.
This is not used because TabPaginationState has data: Container[], not data: QuasiClosedContainer[].
There was a problem hiding this comment.
Removed it. The quasi-closed tab still uses TabPaginationState with data: Container[]; we map each QuasiClosedContainer from the API through toContainer() before putting rows in state, so a separate QuasiClosedTabState was never needed.
| const response = await fetchData<QuasiClosedContainersResponse>( | ||
| '/api/v1/containers/quasiClosed', | ||
| 'GET', | ||
| { limit: 1, minContainerId: 0 } |
There was a problem hiding this comment.
Here, we are fetching full QuasiClosedContainerMetadata object, but we just need count. Frontend throws away everything except quasiClosedCount. Can we check if we pass limit as 0, the backend should return an empty container list but still compute and return quasiClosedCount
There was a problem hiding this comment.
Count-only fetch (limit=0): Updated fetchQuasiClosedCount to call /quasiClosed with limit=0 and minContainerId=0. On the backend, limit=0 means getContainers(..., 0, QUASI_CLOSED) returns an empty list, so we don’t build any row metadata or hit replica history for a throwaway first row we only return quasiClosedCount from getContainerStateCount plus empty containers and the usual pagination keys.
| @DefaultValue(PREV_CONTAINER_ID_DEFAULT_VALUE) | ||
| @QueryParam(RECON_QUERY_MIN_CONTAINER_ID) long minContainerId) { | ||
|
|
||
| List<ContainerInfo> containers = containerManager.getContainers( |
There was a problem hiding this comment.
Have some validation check here before using on minContainerId and limit.
There was a problem hiding this comment.
Validation on limit / minContainerId: Added checks at the start of getQuasiClosedContainers: reject minContainerId < 0 or limit < 0 with 400 BAD_REQUEST. limit=0 is explicitly allowed so the count-only path above stays valid.
|
|
||
| long firstKey = metaList.isEmpty() ? minContainerId : metaList.get(0).getContainerID(); | ||
| long lastKey = metaList.isEmpty() ? minContainerId : metaList.get(metaList.size() - 1).getContainerID(); | ||
| long total = containerManager.getContainerStateCount(HddsProtos.LifeCycleState.QUASI_CLOSED); |
There was a problem hiding this comment.
Keep the return type same as underlying API return type.
There was a problem hiding this comment.
Hey @devmadhuu are you suggesting that the quasi-closed endpoint should also return UnhealthyContainersResponse for consistency across container endpoints
What changes were proposed in this pull request?
Please describe your PR in detail: What changes are proposed in the PR? and Why? This PR introduces a new feature in Apache Ozone Recon to track and display containers that are in the
QUASI_CLOSEDlifecycle state.A container enters the
QUASI_CLOSEDstate when it is locally closed by a DataNode (often due to a pipeline failure or interruption) but SCM has not yet finalized it asCLOSEDdue to a lack of quorum. While these containers may not always trigger existing unhealthy replication alerts (like missing or under-replicated), they are critical for debugging because a container stuck inQUASI_CLOSEDfor an extended period indicates a stalled pipeline finalization or lifecycle issue.Previously, Recon only tracked replication health issues. This PR adds a dedicated, in-memory tracking path for
QUASI_CLOSEDcontainers, separating lifecycle state tracking from replication health tracking.Approach used to solve the issue: To ensure minimal overhead and avoid unnecessary database writes, this feature is implemented entirely in-memory without introducing new Derby DB tables or background persistence tasks:
ContainerEndpoint.java): Added a newGET /api/v1/containers/quasiClosedendpoint. This endpoint utilizes cursor-based pagination and fetchesQUASI_CLOSEDcontainers directly from the in-memoryReconContainerManager(ContainerStateMap), which is an efficientO(limit)lookup.QuasiClosedContainersResponse.java): Created a lightweight response object that maps the in-memoryContainerInfotoUnhealthyContainerMetadata, allowing seamless integration with the existing frontend components.ContainerTablecomponent, adds the quasi-closed count to the Highlights card, and overrides the column title to display "State Enter Time".What is the link to the Apache JIRA
How was this patch tested?