NO-JIRA: Add onboarding guide for new HCP team members #8132

Open
jparrill wants to merge 1 commit into openshift:main from jparrill:code-onboarding-guide

Conversation

@jparrill
Contributor

@jparrill jparrill commented Mar 31, 2026

Summary

  • Adds a comprehensive onboarding guide (docs/content/reference/onboarding-guide.md) covering HyperShift architecture, control plane and data plane internals, supported platforms, APIs, development workflow, and a recommended learning path
  • Includes Mermaid diagrams for visual understanding of component interactions, lifecycle flows, and dependency graphs
  • Provides direct code file references throughout so newcomers can self-guide their exploration of the codebase
  • Adds the guide to the mkdocs nav under the Reference section and regenerates aggregated docs

Test plan

  • Verify the onboarding guide renders correctly in mkdocs (cd docs && mkdocs serve)
  • Verify all Mermaid diagrams render properly
  • Verify file path references in the guide point to existing files
  • Review content accuracy with team members
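
The test-plan item about file path references could be partly automated. The following is a minimal sketch, not something taken from the PR: the function name `find_missing_paths`, the regex heuristic, and the assumption that the guide writes paths in backticks relative to the repository root are all illustrative.

```python
import re
from pathlib import Path

# Heuristic (an assumption): a backticked token that contains at least one
# slash and ends in a file extension is treated as a repo-relative file path.
PATH_RE = re.compile(r"`([\w.\-]+(?:/[\w.\-]+)+\.\w+)`")


def find_missing_paths(markdown_text: str, repo_root: Path) -> list[str]:
    """Return backticked file paths that do not exist under repo_root."""
    missing: list[str] = []
    for match in PATH_RE.finditer(markdown_text):
        rel = match.group(1)
        # Report each missing path once, in order of first appearance.
        if not (repo_root / rel).is_file() and rel not in missing:
            missing.append(rel)
    return missing
```

Directory references (ending in `/`) and bare file names without a slash are deliberately ignored by this heuristic, so a manual pass would still be needed for those.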

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive Onboarding Guide covering HyperShift/Hosted Control Plane architecture, core concepts, control-plane vs data-plane lifecycles, component model, and platform-specific notes (including KubeVirt).
    • Documented NodePool/data-plane lifecycle, upgrade/delete flows, and operator responsibilities.
    • Included development workflows, recommended reading order, API/compatibility guidelines, and updated site navigation to include the new guide.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 31, 2026
@openshift-ci-robot

@jparrill: This pull request explicitly references no jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. do-not-merge/needs-area labels Mar 31, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Mar 31, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds a new onboarding guide at docs/content/reference/onboarding-guide.md that documents HyperShift’s control-plane/data-plane decoupling, core CRDs and namespace topology (HostedCluster, HostedControlPlane, NodePool, control-plane namespace, operators), control-plane component model and status propagation, NodePool → CAPI → cloud provisioning and ignition/token flows, platform interfaces and per-cloud roles (credentials/encryption, CAPI/infra reconciliation, KubeVirt/CCM notes), development workflow and API-change rules, common controller patterns, architectural invariants, key file references, and a staged learning path. Also updates docs/mkdocs.yml to add the guide to Reference navigation.

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added the area/documentation Indicates the PR includes changes for documentation label Mar 31, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Mar 31, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jparrill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot requested review from cblecker and sjenning March 31, 2026 10:33
@openshift-ci openshift-ci Bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-area labels Mar 31, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/content/reference/onboarding-guide.md`:
- Around line 293-295: The blockquote sections (the one that starts with
"Explore yourself" referencing registerComponents() and the later similar
blockquote) contain blank lines inside the quoted blocks which triggers
markdownlint MD028; remove the empty lines inside those blockquotes so every
quoted line directly follows the '>' prefix (no blank line between '> ...'
lines), ensure the text about registerComponents() and the kube-scheduler
example remains contiguous, and re-run markdownlint to confirm MD028 is
resolved.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: d52f91d9-a5d5-44de-89e0-f6f2fb18bc09

📥 Commits

Reviewing files that changed from the base of the PR and between 7ce6015 and 81d3a3c.

⛔ Files ignored due to path filters (1)
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
📒 Files selected for processing (2)
  • docs/content/reference/onboarding-guide.md
  • docs/mkdocs.yml

Comment thread docs/content/reference/onboarding-guide.md Outdated
@jparrill jparrill force-pushed the code-onboarding-guide branch from 81d3a3c to 922f217 Compare March 31, 2026 10:38
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 31, 2026
@codecov

codecov Bot commented Mar 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 29.74%. Comparing base (51af991) to head (af7b12a).
⚠️ Report is 72 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8132      +/-   ##
==========================================
+ Coverage   27.50%   29.74%   +2.24%     
==========================================
  Files        1096     1099       +3     
  Lines      107277   108949    +1672     
==========================================
+ Hits        29503    32409    +2906     
+ Misses      75240    73853    -1387     
- Partials     2534     2687     +153     

see 68 files with indirect coverage changes

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 2, 2026
@jparrill jparrill force-pushed the code-onboarding-guide branch from 922f217 to 82865e5 Compare April 6, 2026 10:20
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 6, 2026
@jparrill jparrill force-pushed the code-onboarding-guide branch from 82865e5 to ad5edc1 Compare April 6, 2026 17:00
@jparrill
Contributor Author

jparrill commented Apr 6, 2026

@bryan-cox Thanks for the feedback! I've pushed changes to address the points you raised on Slack:

Content duplication concern:

  • Added an introductory note at the top clarifying that this guide is a curated learning path — it provides a structured narrative for newcomers, not a replacement for existing reference docs.
  • Added "See also" cross-reference links in 6 sections pointing to the authoritative docs where topics are covered in more detail:
    • Section 2 (Key Concepts) → concepts-and-personas.md
    • Section 4 (Main Components) → controller-architecture.md
    • Section 7 (Data Plane) → nodepool-rollouts.md
    • Section 8 (Platforms) → multi-platform-support.md + how-to guides
    • Section 10 (Dev Workflow) → run-tests.md, run-hypershift-operator-locally.md, develop_in_cluster.md
    • Section 12 (Invariants) → goals-and-design-invariants.md

This way the guide serves as a single entry point for onboarding while deferring to existing docs for deep dives, so we don't have to maintain the same content in two places.

Let me know if you'd like further adjustments!

Contributor

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
docs/content/reference/onboarding-guide.md (1)

299-301: ⚠️ Potential issue | 🟡 Minor

Remove blank lines inside blockquotes to fix MD028.

Line 300 and Line 893 break blockquote continuity, triggering markdownlint MD028. Keep quoted lines contiguous.

Proposed fix
 > Light blue components have no KAS dependency. KAS (orange) is an implicit dependency for everything else.
-
 > **Explore yourself**: Look at the `registerComponents()` function (~line 236 in `hostedcontrolplane_controller.go`) to see the full list of registered components. Then pick one simple component like `kube-scheduler` at `control-plane-operator/controllers/hostedcontrolplane/v2/kube_scheduler/` to understand the pattern.
 > **GOLDEN RULE**: After any change in `api/`, run `make update`. This runs: `api-deps` -> `workspace-sync` -> `deps` -> `api` -> `api-docs` -> `clients` -> `docs-aggregate`.
-
 > **Explore yourself**: 
 > - `api/go.mod` - the separate module definition
 > - `api/CLAUDE.md` - API backward compatibility rules (critical reading!)

Also applies to: 892-894
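
The MD028 condition described in this comment (a blank line separating two blockquote lines) can be approximated with a few lines of Python. This is a simplified sketch rather than the markdownlint implementation: the function name `find_md028` is hypothetical, and it only flags a single blank line sandwiched between two lines that start with `>`.

```python
def find_md028(lines: list[str]) -> list[int]:
    """Return 1-based numbers of blank lines that sit between two blockquote lines."""
    hits: list[int] = []
    for i in range(1, len(lines) - 1):
        is_blank = lines[i].strip() == ""
        prev_is_quote = lines[i - 1].lstrip().startswith(">")
        next_is_quote = lines[i + 1].lstrip().startswith(">")
        if is_blank and prev_is_quote and next_is_quote:
            hits.append(i + 1)  # convert the 0-based index to a line number
    return hits
```

A line containing only `>` is not blank in this sense, so a quote whose paragraphs are separated by a bare `>` line passes the check.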

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/content/reference/onboarding-guide.md` around lines 299 - 301, The
blockquote in the onboarding guide contains empty lines that break continuity
and trigger markdownlint MD028; edit the blockquote around the mention of
registerComponents() and the kube-scheduler example to remove the blank lines so
the quoted lines are contiguous (i.e., collapse the blank lines at the
boundaries noted and ensure the two quoted paragraphs are immediately adjacent),
preserving the text order and formatting so the reference to
registerComponents() and the kube-scheduler example remains intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: f1872967-6edd-477c-8c45-1fb47ae2f322

📥 Commits

Reviewing files that changed from the base of the PR and between 82865e5 and ad5edc1.

⛔ Files ignored due to path filters (1)
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
📒 Files selected for processing (2)
  • docs/content/reference/onboarding-guide.md
  • docs/mkdocs.yml
✅ Files skipped from review due to trivial changes (1)
  • docs/mkdocs.yml

@@ -0,0 +1,1315 @@
# HyperShift / Hosted Control Planes (HCP) - Onboarding Guide

> Ramp-up guide for new engineers joining the HyperShift/HCP team.
Member

Maybe we can drop this? It seems like the note sufficiently explains this?

Contributor Author

Done — removed the blockquote. The admonition note covers it.

- **Resource overhead**: 3+ master nodes per cluster just for the control plane
- **Provisioning time**: 30-45 minutes including bootstrap
- **Distributed operations**: each control plane is independently operated

Member

Maybe it would be worth it to have a chart here on how standalone looks just to compare/contrast with the next chart?

Contributor Author

Done — split into two separate diagrams: standalone showing masters embedded per cluster, and HyperShift showing the shared management cluster.


subgraph "HyperShift Model"
D[Management Cluster]
D --> E[CP 1 - Pods]
Member

Should these be renamed to HCPs instead?

Contributor Author

Done — renamed to HCP 1/2/3.

D --> E[CP 1 - Pods]
D --> F[CP 2 - Pods]
D --> G[CP 3 - Pods]
E -.-> H[Workers 1]
Member

I think these should be labeled 1..N workers

Contributor Author

Done — labeled as Workers 1..N.


```mermaid
graph LR
subgraph "Standalone OpenShift"
Member

Member

I think it would still be good to change the chart a bit and show how you need a management cluster for each standalone

Contributor Author

Done — the HyperShift diagram is now right below the standalone one for direct comparison.

Contributor Author

Done — standalone diagram now shows each cluster with its own Management / Masters x3 embedded.

**Rollout detection**: `ConfigGenerator.Hash()` produces a new hash when config or version changes. New hash = new Secrets = new `DataSecretName` on MachineDeployment = CAPI rolling update.

> **Explore yourself**: Platform-specific machine template builders:
> - `hypershift-operator/controllers/nodepool/aws.go` - `awsMachineTemplateSpec()`: AMI resolution, instance type, root volume, security groups
Member

List is not displaying correctly

Contributor Author

Fixed — added blank > line before the bullet list.
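
The rollout-detection note quoted in this thread's context (a new hash yields new Secrets, a new `DataSecretName`, and therefore a CAPI rolling update) can be illustrated with a toy model. This is only a sketch of the general hash-in-name idea, not HyperShift's code: the functions `config_hash`, `data_secret_name`, and `needs_rollout`, the `user-data-` prefix, and the use of SHA-256 over JSON are all assumptions made for illustration.

```python
import hashlib
import json


def config_hash(config: dict, release_version: str) -> str:
    """Short, deterministic digest of the inputs that should trigger a rollout."""
    # sort_keys makes the digest independent of dict insertion order.
    payload = json.dumps({"config": config, "version": release_version}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]


def data_secret_name(nodepool: str, config: dict, release_version: str) -> str:
    """Hypothetical naming scheme: the secret name embeds the hash."""
    return f"user-data-{nodepool}-{config_hash(config, release_version)}"


def needs_rollout(current_secret_name: str, nodepool: str,
                  config: dict, release_version: str) -> bool:
    """A rollout is needed when the desired secret name differs from the current one."""
    return current_secret_name != data_secret_name(nodepool, config, release_version)
```

Because the name changes whenever the config or version changes, anything that watches the referenced secret name sees a spec change and can roll machines, with no separate "dirty" flag to maintain.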

> - `hypershift-operator/controllers/nodepool/gcp.go` - GCP machine config
> - `hypershift-operator/controllers/nodepool/openstack.go` - OpenStack config

### 7.4 Auto-scaling
Member

Contributor Author

Done — added See Also link to the resource-based-control-plane-autoscaling page.


---

## 8. Supported Cloud Platforms
Member

IMO we should distinguish between self managed AWS and ROSA HCP

Contributor Author

Done — diagram and table now distinguish AWS Self-Managed vs AWS Managed (ROSA HCP).


---

## 8. Supported Cloud Platforms
Member

Azure is not managed only, there is self-managed Azure as well

Contributor Author

Good catch — fixed. Diagram and table now show Azure Self-Managed and Azure Managed (ARO HCP) as separate entries.

| Infra provisioning | EC2, VPC, ELB | Azure VMs, VNet | KubeVirt VMs | Pre-provisioned |
| Cloud Controller Manager | aws-ccm | azure-ccm | kubevirt-ccm | none |

> **Explore yourself**: Each platform implementation is in its own directory:
Member

List is not displaying well

Contributor Author

Fixed — added blank > line before the bullet list in the blockquote.
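
The fix described in this reply (a bare `>` line between the introductory sentence and the bullet list inside a blockquote) could be applied mechanically. The sketch below is one possible way to do it and is not part of the PR; the function name `add_blank_quote_line_before_lists` and the bullet regex are assumptions.

```python
import re

# A quoted bullet: '>' followed by optional spaces, a list marker, and a space.
BULLET_RE = re.compile(r"^>\s*[-*+]\s+")


def add_blank_quote_line_before_lists(lines: list[str]) -> list[str]:
    """Insert a bare '>' line before a quoted bullet list that directly follows quoted text."""
    out: list[str] = []
    for line in lines:
        prev = out[-1] if out else ""
        starts_bullet = BULLET_RE.match(line) is not None
        prev_is_quoted_text = (
            prev.startswith(">")
            and prev.strip() != ">"          # already separated by a bare '>' line
            and BULLET_RE.match(prev) is None  # previous line is itself a bullet
        )
        if starts_bullet and prev_is_quoted_text:
            out.append(">")
        out.append(line)
    return out
```

The function is idempotent: running it on already-fixed text adds nothing, because a bare `>` line before the list suppresses the insertion.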

Add a comprehensive onboarding guide covering HyperShift architecture,
control plane and data plane internals, supported platforms, APIs,
development workflow, and a recommended learning path with direct
code references for self-guided exploration.

Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
@cblecker
Member

/uncc

@openshift-ci openshift-ci Bot removed the request for review from cblecker April 13, 2026 17:54
@openshift-ci
Contributor

openshift-ci Bot commented Apr 17, 2026

@jparrill: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/verify-workflows | af7b12a | link | true | /test verify-workflows |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hypershift-jira-solve-ci

I now have a complete picture. Here is the analysis:

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-verify-workflows
  • Build ID: 2045110413854511104
  • Target: verify-workflows
  • Cluster: build01
  • PR: #8132 (NO-JIRA: Add onboarding guide for new HCP team members)
  • Duration: 12:01:21Z → 12:09:03Z (~8 minutes)

Test Failure Analysis

Error

step src failed: inspect of image "image-registry.openshift-image-registry.svc:5000/ci-op-qcct24lp/pipeline:src-amd64"
failed with error: failed to do request: Head "https://image-registry.openshift-image-registry.svc:5000/v2/ci-op-qcct24lp/pipeline/manifests/src-amd64":
dial tcp: lookup image-registry.openshift-image-registry.svc on 172.30.0.10:53: no such host

Summary

This is a CI infrastructure flake unrelated to the PR content. The verify-workflows step never executed — the failure occurred in the preceding src step during CI's internal image management. The source code was cloned and the container image was built and pushed successfully (confirmed at 12:08:34Z), but 2 seconds later ci-operator's post-push image inspect failed because DNS resolution for the internal OpenShift image registry (image-registry.openshift-image-registry.svc) returned "no such host" on the build01 cluster. The PR itself only adds documentation files (onboarding-guide.md) and cannot have caused this failure.

Root Cause

Transient DNS resolution failure on the build01 CI cluster.

The complete timeline shows:

  1. 12:05:34Z–12:08:34Z: The src build ran successfully: the source code for PR #8132 was cloned, the container image was built, and it was pushed to image-registry.openshift-image-registry.svc:5000/ci-op-qcct24lp/pipeline:src-amd64 (log confirms "Push successful").

  2. 12:08:36Z: Immediately after the push, ci-operator attempted to inspect the image by issuing a HEAD request to the same registry hostname. DNS resolution for image-registry.openshift-image-registry.svc against the cluster DNS server at 172.30.0.10:53 returned "no such host".

  3. 12:08:46Z: After a 10-second retry window, ci-operator gave up and reported the step as failed with reason executing_graph:step_failed:cloning_source.

  4. The verify-workflows step (the actual test target) was never started because it depends on the src step.
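
The intervals quoted in this timeline can be checked with simple timestamp arithmetic. The snippet below is a small illustrative sketch; the helper name `seconds_between` and the dictionary labels are made up here, while the timestamps themselves are copied from the analysis.

```python
from datetime import datetime

FMT = "%H:%M:%SZ"  # the analysis gives UTC times of day with a trailing 'Z'


def seconds_between(start: str, end: str) -> int:
    """Whole seconds elapsed between two same-day timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


# Timestamps copied from the analysis; labels are illustrative.
timeline = {
    "src build": ("12:05:34Z", "12:08:34Z"),
    "push to failed inspect": ("12:08:34Z", "12:08:36Z"),
    "retry window": ("12:08:36Z", "12:08:46Z"),
    "whole job": ("12:01:21Z", "12:09:03Z"),
}

for label, (start, end) in timeline.items():
    print(f"{label}: {seconds_between(start, end)}s")
```

The results (180 s, 2 s, 10 s, and 462 s) are consistent with the "2 seconds later", "10-second retry window", and "~8 minutes" figures stated in the analysis.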

The DNS failure is an infrastructure-level issue on the build01 cluster. The cluster was also experiencing node pressure during this window — events show multiple FailedScheduling warnings with nodes in not-ready and unschedulable states, along with Insufficient cpu errors. These node issues likely contributed to instability in the openshift-image-registry service or the cluster DNS (CoreDNS) pods, causing the transient DNS resolution failure.

This is not caused by the PR (which only adds markdown documentation files) and is not a product bug.

Recommendations
  1. Retest the PR — This is a transient CI infrastructure flake. Simply rerun the job with /retest or /test verify-workflows on the PR.

  2. No code changes needed — The PR only adds docs/content/reference/onboarding-guide.md and updates docs/mkdocs.yml and docs/content/reference/aggregated-docs.md. These documentation-only changes cannot cause CI infrastructure DNS failures.

  3. If the failure persists on retry — File an issue against the OpenShift CI infrastructure team (#forum-ocp-testplatform on Slack) referencing the build01 cluster DNS instability and the node pressure observed in the events.

Evidence

| Evidence | Detail |
|---|---|
| Failed step | src (source cloning/image build; an infrastructure step, not a test step) |
| Actual test step | verify-workflows never executed (null start/finish times in step graph) |
| Error type | DNS resolution failure: lookup image-registry.openshift-image-registry.svc on 172.30.0.10:53: no such host |
| Image push | Succeeded at 12:08:34Z (Push successful in src-amd64.log) |
| Image inspect | Failed at 12:08:36Z, 2 seconds after the successful push; DNS could no longer resolve the same hostname |
| CI cluster | build01 |
| Node pressure | Multiple FailedScheduling events: nodes in not-ready, unschedulable states; Insufficient cpu |
| Failure reason | executing_graph:step_failed:cloning_source |
| PR content | Documentation-only: onboarding-guide.md, aggregated-docs.md, mkdocs.yml |
| Job namespace | ci-op-qcct24lp |

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/documentation Indicates the PR includes changes for documentation jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
