Skip to content

HDDS-14927. Add Quasi-Closed Container Tracking in Recon.#10198

Draft
ArafatKhan2198 wants to merge 7 commits intoapache:masterfrom
ArafatKhan2198:HDDS-14927-new
Draft

HDDS-14927. Add Quasi-Closed Container Tracking in Recon.#10198
ArafatKhan2198 wants to merge 7 commits intoapache:masterfrom
ArafatKhan2198:HDDS-14927-new

Conversation

@ArafatKhan2198
Copy link
Copy Markdown
Contributor

@ArafatKhan2198 ArafatKhan2198 commented May 6, 2026

What changes were proposed in this pull request?

Please describe your PR in detail: What changes are proposed in the PR? and Why? This PR introduces a new feature in Apache Ozone Recon to track and display containers that are in the QUASI_CLOSED lifecycle state.

A container enters the QUASI_CLOSED state when it is locally closed by a DataNode (often due to a pipeline failure or interruption) but SCM has not yet finalized it as CLOSED due to a lack of quorum. While these containers may not always trigger existing unhealthy replication alerts (like missing or under-replicated), they are critical for debugging because a container stuck in QUASI_CLOSED for an extended period indicates a stalled pipeline finalization or lifecycle issue.

Previously, Recon only tracked replication health issues. This PR adds a dedicated, in-memory tracking path for QUASI_CLOSED containers, separating lifecycle state tracking from replication health tracking.

Approach used to solve the issue: To ensure minimal overhead and avoid unnecessary database writes, this feature is implemented entirely in-memory without introducing new Derby DB tables or background persistence tasks:

  1. Backend API (ContainerEndpoint.java): Added a new GET /api/v1/containers/quasiClosed endpoint. This endpoint utilizes cursor-based pagination and fetches QUASI_CLOSED containers directly from the in-memory ReconContainerManager (ContainerStateMap), which is an efficient O(limit) lookup.
  2. DTO (QuasiClosedContainersResponse.java): Created a lightweight response object that maps the in-memory ContainerInfo to UnhealthyContainerMetadata, allowing seamless integration with the existing frontend components.
  3. Frontend UI: Integrated a new "Quasi Closed" tab into the existing Unhealthy Containers page. It reuses the existing ContainerTable component, adds the quasi-closed count to the Highlights card, and overrides the column title to display "State Enter Time".

What is the link to the Apache JIRA

How was this patch tested?

@devmadhuu devmadhuu self-requested a review May 7, 2026 07:57
Copy link
Copy Markdown
Contributor

@devmadhuu devmadhuu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ArafatKhan2198 for the patch. Please see comments in line.

*/
@GET
@Path("/quasiClosed")
public Response getQuasiClosedContainers(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not use existing /unhealthy API endpoint, why need new API end point ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My Idea was that The /unhealthy endpoint reads from the Derby UNHEALTHY_CONTAINERS table which tracks replication health. QUASI_CLOSED is a lifecycle state, not a replication health problem, so I feel It doesn't belong in that table or that flow.

@Path("/quasiClosed")
public Response getQuasiClosedContainers(
@DefaultValue(DEFAULT_FETCH_COUNT) @QueryParam(RECON_QUERY_LIMIT) int limit,
@DefaultValue(PREV_CONTAINER_ID_DEFAULT_VALUE) @QueryParam(RECON_QUERY_PREVKEY) long prevKey) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this new API endpoint uses prevKey as parameter, but frontend is sending minContainerId, so API will always fallback to default value of prevKey as 0 and pagination is broke here completely. Every "next page" click will return the first page again.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Changed @QueryParam(RECON_QUERY_PREVKEY) to @QueryParam(RECON_QUERY_MIN_CONTAINER_ID) so the backend now correctly reads the minContainerId parameter that the frontend sends. Pagination should work correctly now.

long lastKey = metaList.isEmpty() ? prevKey : metaList.get(metaList.size() - 1).getContainerID();
long total = containerManager.getContainerStateCount(HddsProtos.LifeCycleState.QUASI_CLOSED);

return Response.ok(new QuasiClosedContainersResponse(total, firstKey, lastKey, metaList)).build();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, this QuasiClosedContainersResponse object is used to wrap the response sent to frontend, but frontend uses ContainersPaginationResponse and two new fields added quasiClosedCount and totalCount there, but don't see totalCount field got added in backend response object - QuasiClosedContainersResponse

}
List<ContainerHistory> replicas = containerManager.getLatestContainerHistory(containerID, requiredNodes);

UnhealthyContainerMetadata metadata = new UnhealthyContainerMetadata(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend to create a dedicated QuasiClosedContainerMetadata that has fields like stateEnterTime, replicaCount, replicaDetails, else use of existing UnhealthyContainerMetadata fields doesn't feel semantically correct and confusing.

Copy link
Copy Markdown
Contributor Author

@ArafatKhan2198 ArafatKhan2198 May 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Created a dedicated QuasiClosedContainerMetadata DTO with semantically correct fields (stateEnterTimeexpectedReplicaCountactualReplicaCountpipelineIDkeysreplicas). Updated QuasiClosedContainersResponse to use List<QuasiClosedContainerMetadata> and removed the temporary extra constructor that was added to UnhealthyContainerMetadata. The frontend types are also updated with a separate QuasiClosedContainer type.

} catch (Exception e) {
LOG.warn("Could not get required nodes for container {}", containerID, e);
}
List<ContainerHistory> replicas = containerManager.getLatestContainerHistory(containerID, requiredNodes);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are calling this method inside a stream lambda and has no exception handling for this method when called here. If it throws a runtime exception, then entire response will fail. Better extract whole logic into a helper method (like toQuasiClosedMetadata) with proper try/catch and WebApplicationException wrapping. You can see how below is done:

private UnhealthyContainerMetadata toUnhealthyMetadata(
    ContainerHealthSchemaManager.UnhealthyContainerRecord record) {
  try {
    ...
  } catch (IOException ioEx) {
    throw new UncheckedIOException(ioEx);
  }
}

Copy link
Copy Markdown
Contributor Author

@ArafatKhan2198 ArafatKhan2198 May 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Extracted a toQuasiClosedMetadata(ContainerInfo ci) helper method with a proper try/catch block that wraps exceptions in WebApplicationException with INTERNAL_SERVER_ERROR, mirroring the existing toUnhealthyMetadata pattern.

}

// Also include the quasi-closed count in the summary for the frontend Highlights tab
long quasiClosedCount = containerManager.getContainerStateCount(HddsProtos.LifeCycleState.QUASI_CLOSED);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this is not good to inject here, because above all calls are fetching data from DB, so they are making DB call. But here, its a in-memory data to be retrieved and if user just wants to see QUASI CLOSED containers, we are simply making DB call also and retrieving all other unhealthy containers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Removed quasiClosedCount from UnhealthyContainersResponse and the injection from getUnhealthyContainersFromSchema. On the frontend, the Highlights card now gets the quasi-closed count via an independent background call to /api/v1/containers/quasiClosed?limit=1 on page load and on full refresh, keeping the unhealthy DB query path completely separate from the in-memory quasi-closed count.

@ArafatKhan2198 ArafatKhan2198 requested a review from devmadhuu May 11, 2026 05:12
Copy link
Copy Markdown
Contributor

@devmadhuu devmadhuu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ArafatKhan2198 Pls check some issues.

replicaMismatchCount: number;
quasiClosedCount: number;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls check , in latest commit, quasi closed tab is removed from UI code.

overReplicatedCount: number;
misReplicatedCount: number;
replicaMismatchCount: number;
quasiClosedCount: number;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required field, but from containers.tsx, its removed

Copy link
Copy Markdown
Contributor

@devmadhuu devmadhuu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ArafatKhan2198 Pls check few comments. Also pls add the JIRA link and Test plan in your PR description.

'3': 'OVER_REPLICATED',
'4': 'MIS_REPLICATED',
'5': 'REPLICA_MISMATCH',
'6': 'QUASI_CLOSED',
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this entry needs to be added. We are not using it anyways and deciding the URL to call. Because for other tabs, its "/api/v1/containers/unhealthy", but for tab 6, we have different URL /quasiClosed. Better to remove from this map and explicitly handle using 6 as key something like:


if (tabKey === '6') { /* quasi-closed path */ return; }
if (!containerStateName) return; // skips tab '7' (Export)

Copy link
Copy Markdown
Contributor Author

@ArafatKhan2198 ArafatKhan2198 May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TAB_STATE_MAP: The '6': 'QUASI_CLOSED' entry is removed. That map is only for building /api/v1/containers/unhealthy/{state} URLs for tabs 1–5. Tab 6 never used that value in practice because it always calls /quasiClosed instead. Keeping it in the map was misleading. The flow is now: handle tabKey === '6' first (quasi-closed path), then if (!containerStateName) return for the Export tab and any unknown keys, then the unhealthy fetch for tabs 1–5.

const mapped = pageContainers.map(c => ({
...c,
containerState: 'QUASI_CLOSED',
unhealthySince: c.stateEnterTime,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of doing like this, better define a mapper function, so in future it can catch any type safety issues. If tomorrow someone renames containerID to id in QuasiClosedContainer, TypeScript won't warn you that c.containerID is now undefined

function toContainer(qc: QuasiClosedContainer): Container {
  return {
    containerID: qc.containerID,
    pipelineID: qc.pipelineID,
    keys: qc.keys,
    containerState: 'QUASI_CLOSED',
    unhealthySince: qc.stateEnterTime,   // clear rename
    expectedReplicaCount: qc.expectedReplicaCount,
    actualReplicaCount: qc.actualReplicaCount,
    replicaDeltaCount: qc.actualReplicaCount - qc.expectedReplicaCount,
    reason: '',
    replicas: qc.replicas,
  };
}

const mapped: Container[] = pageContainers.map(toContainer);  // fully typed, no 'any'

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inline spread + cast is replaced with a small toContainer(qc: QuasiClosedContainer): Container helper that maps every field explicitly, then const mapped: Container[] = pageContainers.map(toContainer). That way renames or shape changes on QuasiClosedContainer surface at compile time instead of silently breaking at runtime.

}

export type QuasiClosedTabState = {
data: QuasiClosedContainer[];
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not used because TabPaginationState has data: Container[], not data: QuasiClosedContainer[].

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed it. The quasi-closed tab still uses TabPaginationState with data: Container[]; we map each QuasiClosedContainer from the API through toContainer() before putting rows in state, so a separate QuasiClosedTabState was never needed.

const response = await fetchData<QuasiClosedContainersResponse>(
'/api/v1/containers/quasiClosed',
'GET',
{ limit: 1, minContainerId: 0 }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, we are fetching full QuasiClosedContainerMetadata object, but we just need count. Frontend throws away everything except quasiClosedCount. Can we check if we pass limit as 0, the backend should return an empty container list but still compute and return quasiClosedCount

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Count-only fetch (limit=0): Updated fetchQuasiClosedCount to call /quasiClosed with limit=0 and minContainerId=0. On the backend, limit=0 means getContainers(..., 0, QUASI_CLOSED) returns an empty list, so we don’t build any row metadata or hit replica history for a throwaway first row we only return quasiClosedCount from getContainerStateCount plus empty containers and the usual pagination keys.

@DefaultValue(PREV_CONTAINER_ID_DEFAULT_VALUE)
@QueryParam(RECON_QUERY_MIN_CONTAINER_ID) long minContainerId) {

List<ContainerInfo> containers = containerManager.getContainers(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have some validation check here before using on minContainerId and limit.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validation on limit / minContainerId: Added checks at the start of getQuasiClosedContainers: reject minContainerId < 0 or limit < 0 with 400 BAD_REQUESTlimit=0 is explicitly allowed so the count-only path above stays valid.


long firstKey = metaList.isEmpty() ? minContainerId : metaList.get(0).getContainerID();
long lastKey = metaList.isEmpty() ? minContainerId : metaList.get(metaList.size() - 1).getContainerID();
long total = containerManager.getContainerStateCount(HddsProtos.LifeCycleState.QUASI_CLOSED);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep the return type same as underlying API return type.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @devmadhuu are you suggesting that the quasi-closed endpoint should also return UnhealthyContainersResponse for consistency across container endpoints

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants