Skip to content

Conversation

@Gym-Gary
Copy link

@Gym-Gary Gym-Gary commented Dec 11, 2025

What does this PR do?

Currently, whether in end-to-end mode or discrete mode, all roles are fully collected. As the sequence length continues to increase, the volume of collected data becomes large, leading to slow parsing. Therefore, we introduce a new feature in the NPU Profiler that allows optional role selection in discrete mode, enabling quick collection of specific roles.

We have added a new roles parameter in npu_profile.yaml to specify the roles to be collected. The currently supported options are: all, rollout_generate, actor_compute_log_prob, actor_update and ref_compute_log_prob. Setting roles to ["all"] means all roles will be collected. Other options can be freely combined, for example: ["actor_update", "ref_compute_log_prob"]

to close issue #4400

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@CLAassistant
Copy link

CLAassistant commented Dec 11, 2025

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a valuable performance feature by allowing selective profiling of different stages ('roles') in the NPU profiler's discrete mode. This will help reduce data volume and parsing time. The implementation is mostly well-done, with corresponding updates to documentation and worker annotations. However, there is a critical inconsistency in the configuration naming (stages vs. roles) that will prevent the feature from working. I've also included a suggestion to improve the maintainability of the configuration validation logic. Addressing these points will make this a solid contribution.

@Gym-Gary Gym-Gary force-pushed the 1211.profile branch 2 times, most recently from 0aff344 to ecaf261 Compare December 12, 2025 08:26
npu:
discrete: False
# rollout & ref follow actor settings
# rollout & ref follow actor settings
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove unnecessary diff

discrete: True
# rollout & ref follow actor settings
roles: ["all"]
# rollout & ref follow actor settings
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

@Gym-Gary Gym-Gary requested a review from tardis-key December 18, 2025 01:30
Copy link
Contributor

@tardis-key tardis-key left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. rename 'role'
  2. explain the necessity of stage filtering on the premise that role filtering is already available
  3. gpu and npu test

Copy link
Contributor

@tardis-key tardis-key left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. rename 'role'
  2. explain the necessity of stage filtering on the premise that role filtering is already available
  3. gpu and npu test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants