[LLM] Simplify IFEval reward aggregator#3543

Closed
vmoens wants to merge 2 commits into gh/vmoens/235/base from gh/vmoens/235/head

@vmoens
Collaborator

@vmoens vmoens commented Mar 5, 2026

Stack from ghstack (oldest at bottom):

Replace the complex tiered multiplicative reward (structure multiplier,
quality bonus thresholds, complexity scaling) with a simple weighted
average of IFEval metrics plus a small additive format bonus.

The new reward is: weighted_avg(strict/loose metrics) + format_bonus,
where format_bonus is 0.1 for a single answer block and 0.05 for a
single think block. Reward range: ~[0, 1.15].
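A minimal sketch of the aggregation described above, not the code from this PR. The metric names, the weights, and the `<answer>`/`<think>` tag syntax are assumptions made for illustration; only the bonus values (0.1 and 0.05) and the overall shape (weighted average plus additive bonus) come from the description.

```python
import re

# Hypothetical metric names and weights (strict weighted above loose).
# The PR description does not state the actual weights.
WEIGHTS = {
    "prompt_level_strict_acc": 2.0,
    "inst_level_strict_acc": 2.0,
    "prompt_level_loose_acc": 1.0,
    "inst_level_loose_acc": 1.0,
}

ANSWER_BONUS = 0.1   # from the PR description: single answer block
THINK_BONUS = 0.05   # from the PR description: single think block


def count_blocks(text: str, tag: str) -> int:
    """Count non-overlapping <tag>...</tag> blocks (tag syntax is assumed)."""
    pattern = rf"<{tag}>.*?</{tag}>"
    return len(re.findall(pattern, text, flags=re.DOTALL))


def format_bonus(text: str) -> float:
    """Additive bonus: granted only when exactly one block of each kind exists."""
    bonus = 0.0
    if count_blocks(text, "answer") == 1:
        bonus += ANSWER_BONUS
    if count_blocks(text, "think") == 1:
        bonus += THINK_BONUS
    return bonus


def simple_reward(metrics: dict, text: str) -> float:
    """Weighted average of metrics in [0, 1] plus the format bonus."""
    total_weight = sum(WEIGHTS.values())
    weighted_avg = sum(WEIGHTS[k] * metrics[k] for k in WEIGHTS) / total_weight
    return weighted_avg + format_bonus(text)
```

With every metric at 1.0 and exactly one block of each kind, this sketch reaches the stated upper bound of about 1.15; with every metric at 0.0 and no well-formed blocks it returns 0.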

Made-with: Cursor

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Mar 5, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/3543

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures, 1 Unrelated Failure

As of commit 9e257f9 with merge base 4e2e787 (image):

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 5, 2026
@github-actions
Contributor

github-actions bot commented Mar 5, 2026

⚠️ PR Title Label Error

Unknown or invalid prefix [LLM].

Current title: [LLM] Simplify IFEval reward aggregator

Supported Prefixes (case-sensitive)

Your PR title must start with exactly one of these prefixes:

| Prefix | Label Applied | Example |
| --- | --- | --- |
| [BugFix] | BugFix | [BugFix] Fix memory leak in collector |
| [Feature] | Feature | [Feature] Add new optimizer |
| [Doc] or [Docs] | Documentation | [Doc] Update installation guide |
| [Refactor] | Refactoring | [Refactor] Clean up module imports |
| [CI] | CI | [CI] Fix workflow permissions |
| [Test] or [Tests] | Tests | [Tests] Add unit tests for buffer |
| [Environment] or [Environments] | Environments | [Environments] Add Gymnasium support |
| [Data] | Data | [Data] Fix replay buffer sampling |
| [Performance] or [Perf] | Performance | [Performance] Optimize tensor ops |
| [BC-Breaking] | bc breaking | [BC-Breaking] Remove deprecated API |
| [Deprecation] | Deprecation | [Deprecation] Mark old function |
| [Quality] | Quality | [Quality] Fix typos and add codespell |

Note: Common variations like singular/plural are supported (e.g., [Doc] or [Docs]).
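The rule above can be sketched as a case-sensitive prefix lookup. This is an illustration only: the function name `label_for_title` and the dictionary are hypothetical, and the actual workflow's implementation is not shown in this conversation. The prefix-to-label pairs are copied from the table.

```python
from typing import Optional

# Prefix -> label pairs as listed in the bot comment.
PREFIX_TO_LABEL = {
    "[BugFix]": "BugFix",
    "[Feature]": "Feature",
    "[Doc]": "Documentation",
    "[Docs]": "Documentation",
    "[Refactor]": "Refactoring",
    "[CI]": "CI",
    "[Test]": "Tests",
    "[Tests]": "Tests",
    "[Environment]": "Environments",
    "[Environments]": "Environments",
    "[Data]": "Data",
    "[Performance]": "Performance",
    "[Perf]": "Performance",
    "[BC-Breaking]": "bc breaking",
    "[Deprecation]": "Deprecation",
    "[Quality]": "Quality",
}


def label_for_title(title: str) -> Optional[str]:
    """Return the label for a PR title, or None if the prefix is unsupported.

    Matching is case-sensitive, as the bot comment states.
    """
    for prefix, label in PREFIX_TO_LABEL.items():
        if title.startswith(prefix):
            return label
    return None
```

Under this sketch, the current title `[LLM] Simplify IFEval reward aggregator` maps to no label, which matches the error reported by the bot.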

[ghstack-poisoned]
@github-actions
Contributor

⚠️ Warning: Result of GPU Benchmark Tests

Total Benchmarks: 172. Improved: 16. Worsened: 7.

Name Max Mean Ops Ops on Repo HEAD Change
test_tensor_to_bytestream_speed[pickle] 79.4161μs 78.3957μs 12.7558 KOps/s 12.4687 KOps/s $\color{#35bf28}+2.30\%$
test_tensor_to_bytestream_speed[torch.save] 0.1367ms 0.1356ms 7.3763 KOps/s 7.2662 KOps/s $\color{#35bf28}+1.52\%$
test_tensor_to_bytestream_speed[untyped_storage] 98.7174ms 98.2088ms 10.1824 Ops/s 9.8596 Ops/s $\color{#35bf28}+3.27\%$
test_tensor_to_bytestream_speed[numpy] 2.4856μs 2.4827μs 402.7859 KOps/s 401.7830 KOps/s $\color{#35bf28}+0.25\%$
test_tensor_to_bytestream_speed[safetensors] 36.3587μs 36.2400μs 27.5938 KOps/s 26.6103 KOps/s $\color{#35bf28}+3.70\%$
test_simple 0.7740s 0.7729s 1.2939 Ops/s 1.2522 Ops/s $\color{#35bf28}+3.33\%$
test_transformed 1.3640s 1.3617s 0.7344 Ops/s 0.7244 Ops/s $\color{#35bf28}+1.38\%$
test_serial 2.2954s 2.2708s 0.4404 Ops/s 0.4353 Ops/s $\color{#35bf28}+1.16\%$
test_parallel 1.8988s 1.8083s 0.5530 Ops/s 0.5640 Ops/s $\color{#d91a1a}-1.95\%$
test_step_mdp_speed[True-True-True-True-True] 0.4357ms 40.0488μs 24.9695 KOps/s 25.1486 KOps/s $\color{#d91a1a}-0.71\%$
test_step_mdp_speed[True-True-True-True-False] 50.1110μs 22.1486μs 45.1496 KOps/s 44.5849 KOps/s $\color{#35bf28}+1.27\%$
test_step_mdp_speed[True-True-True-False-True] 50.2610μs 22.6755μs 44.1004 KOps/s 44.0618 KOps/s $\color{#35bf28}+0.09\%$
test_step_mdp_speed[True-True-True-False-False] 35.1210μs 12.3015μs 81.2909 KOps/s 80.1024 KOps/s $\color{#35bf28}+1.48\%$
test_step_mdp_speed[True-True-False-True-True] 80.9920μs 42.8149μs 23.3564 KOps/s 22.7971 KOps/s $\color{#35bf28}+2.45\%$
test_step_mdp_speed[True-True-False-True-False] 58.4710μs 24.3549μs 41.0595 KOps/s 40.6015 KOps/s $\color{#35bf28}+1.13\%$
test_step_mdp_speed[True-True-False-False-True] 90.5220μs 25.1667μs 39.7350 KOps/s 37.6309 KOps/s $\textbf{\color{#35bf28}+5.59\%}$
test_step_mdp_speed[True-True-False-False-False] 38.9910μs 14.8114μs 67.5154 KOps/s 66.4719 KOps/s $\color{#35bf28}+1.57\%$
test_step_mdp_speed[True-False-True-True-True] 70.9510μs 44.9178μs 22.2629 KOps/s 22.3180 KOps/s $\color{#d91a1a}-0.25\%$
test_step_mdp_speed[True-False-True-True-False] 52.3710μs 27.3528μs 36.5594 KOps/s 36.7846 KOps/s $\color{#d91a1a}-0.61\%$
test_step_mdp_speed[True-False-True-False-True] 56.5510μs 24.9803μs 40.0315 KOps/s 39.0160 KOps/s $\color{#35bf28}+2.60\%$
test_step_mdp_speed[True-False-True-False-False] 40.0710μs 14.9966μs 66.6817 KOps/s 66.7574 KOps/s $\color{#d91a1a}-0.11\%$
test_step_mdp_speed[True-False-False-True-True] 0.1150ms 47.5348μs 21.0372 KOps/s 20.4732 KOps/s $\color{#35bf28}+2.75\%$
test_step_mdp_speed[True-False-False-True-False] 57.7210μs 29.5447μs 33.8470 KOps/s 33.5018 KOps/s $\color{#35bf28}+1.03\%$
test_step_mdp_speed[True-False-False-False-True] 56.9820μs 28.6873μs 34.8586 KOps/s 34.5437 KOps/s $\color{#35bf28}+0.91\%$
test_step_mdp_speed[True-False-False-False-False] 46.3120μs 17.5721μs 56.9083 KOps/s 57.0869 KOps/s $\color{#d91a1a}-0.31\%$
test_step_mdp_speed[False-True-True-True-True] 92.4220μs 45.6986μs 21.8825 KOps/s 21.6403 KOps/s $\color{#35bf28}+1.12\%$
test_step_mdp_speed[False-True-True-True-False] 99.7420μs 27.2553μs 36.6902 KOps/s 37.3508 KOps/s $\color{#d91a1a}-1.77\%$
test_step_mdp_speed[False-True-True-False-True] 2.5318ms 29.4916μs 33.9080 KOps/s 34.6697 KOps/s $\color{#d91a1a}-2.20\%$
test_step_mdp_speed[False-True-True-False-False] 47.6320μs 16.5035μs 60.5930 KOps/s 60.9044 KOps/s $\color{#d91a1a}-0.51\%$
test_step_mdp_speed[False-True-False-True-True] 0.1201ms 47.7249μs 20.9534 KOps/s 20.8714 KOps/s $\color{#35bf28}+0.39\%$
test_step_mdp_speed[False-True-False-True-False] 59.9110μs 29.2546μs 34.1827 KOps/s 34.2042 KOps/s $\color{#d91a1a}-0.06\%$
test_step_mdp_speed[False-True-False-False-True] 62.1520μs 30.7934μs 32.4745 KOps/s 31.9181 KOps/s $\color{#35bf28}+1.74\%$
test_step_mdp_speed[False-True-False-False-False] 88.6620μs 18.5744μs 53.8375 KOps/s 52.8615 KOps/s $\color{#35bf28}+1.85\%$
test_step_mdp_speed[False-False-True-True-True] 78.7020μs 49.6323μs 20.1482 KOps/s 19.6326 KOps/s $\color{#35bf28}+2.63\%$
test_step_mdp_speed[False-False-True-True-False] 61.0820μs 31.6975μs 31.5482 KOps/s 31.0240 KOps/s $\color{#35bf28}+1.69\%$
test_step_mdp_speed[False-False-True-False-True] 62.9020μs 30.7688μs 32.5005 KOps/s 32.5083 KOps/s $\color{#d91a1a}-0.02\%$
test_step_mdp_speed[False-False-True-False-False] 44.9410μs 18.7394μs 53.3635 KOps/s 53.6137 KOps/s $\color{#d91a1a}-0.47\%$
test_step_mdp_speed[False-False-False-True-True] 84.6120μs 51.9155μs 19.2621 KOps/s 18.4770 KOps/s $\color{#35bf28}+4.25\%$
test_step_mdp_speed[False-False-False-True-False] 65.9520μs 34.3502μs 29.1119 KOps/s 28.7049 KOps/s $\color{#35bf28}+1.42\%$
test_step_mdp_speed[False-False-False-False-True] 63.3610μs 32.9557μs 30.3438 KOps/s 30.5624 KOps/s $\color{#d91a1a}-0.72\%$
test_step_mdp_speed[False-False-False-False-False] 52.2810μs 21.3311μs 46.8798 KOps/s 47.1476 KOps/s $\color{#d91a1a}-0.57\%$
test_non_tensor_env_rollout_speed[1000-single-True] 0.7085s 0.7035s 1.4214 Ops/s 1.3637 Ops/s $\color{#35bf28}+4.23\%$
test_non_tensor_env_rollout_speed[1000-single-False] 0.6932s 0.5932s 1.6856 Ops/s 1.6725 Ops/s $\color{#35bf28}+0.78\%$
test_non_tensor_env_rollout_speed[1000-serial-no-buffers-True] 1.6937s 1.6085s 0.6217 Ops/s 0.6156 Ops/s $\color{#35bf28}+1.00\%$
test_non_tensor_env_rollout_speed[1000-serial-no-buffers-False] 1.4715s 1.3895s 0.7197 Ops/s 0.7152 Ops/s $\color{#35bf28}+0.63\%$
test_non_tensor_env_rollout_speed[1000-serial-buffers-True] 1.9383s 1.8534s 0.5396 Ops/s 0.5366 Ops/s $\color{#35bf28}+0.55\%$
test_non_tensor_env_rollout_speed[1000-serial-buffers-False] 1.7160s 1.6331s 0.6123 Ops/s 0.6092 Ops/s $\color{#35bf28}+0.51\%$
test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-True] 4.6561s 4.5546s 0.2196 Ops/s 0.2216 Ops/s $\color{#d91a1a}-0.92\%$
test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-False] 4.5280s 4.4029s 0.2271 Ops/s 0.2307 Ops/s $\color{#d91a1a}-1.55\%$
test_non_tensor_env_rollout_speed[1000-parallel-buffers-True] 1.9092s 1.8243s 0.5481 Ops/s 0.5410 Ops/s $\color{#35bf28}+1.32\%$
test_non_tensor_env_rollout_speed[1000-parallel-buffers-False] 1.6965s 1.5689s 0.6374 Ops/s 0.6461 Ops/s $\color{#d91a1a}-1.36\%$
test_values[generalized_advantage_estimate-True-True] 21.2148ms 20.2914ms 49.2820 Ops/s 50.2526 Ops/s $\color{#d91a1a}-1.93\%$
test_values[vec_generalized_advantage_estimate-True-True] 0.1316s 3.5510ms 281.6099 Ops/s 277.7337 Ops/s $\color{#35bf28}+1.40\%$
test_values[td0_return_estimate-False-False] 0.1087ms 82.6262μs 12.1027 KOps/s 12.2500 KOps/s $\color{#d91a1a}-1.20\%$
test_values[td1_return_estimate-False-False] 48.3465ms 47.9644ms 20.8488 Ops/s 21.2523 Ops/s $\color{#d91a1a}-1.90\%$
test_values[vec_td1_return_estimate-False-False] 1.3297ms 1.0842ms 922.3679 Ops/s 926.8046 Ops/s $\color{#d91a1a}-0.48\%$
test_values[td_lambda_return_estimate-True-False] 84.5313ms 79.6488ms 12.5551 Ops/s 12.9728 Ops/s $\color{#d91a1a}-3.22\%$
test_values[vec_td_lambda_return_estimate-True-False] 1.2843ms 1.0865ms 920.3838 Ops/s 929.8435 Ops/s $\color{#d91a1a}-1.02\%$
test_gae_speed[generalized_advantage_estimate-False-1-512] 21.6776ms 20.3754ms 49.0789 Ops/s 50.4853 Ops/s $\color{#d91a1a}-2.79\%$
test_gae_speed[vec_generalized_advantage_estimate-True-1-512] 1.0345ms 0.7544ms 1.3256 KOps/s 1.3306 KOps/s $\color{#d91a1a}-0.38\%$
test_gae_speed[vec_generalized_advantage_estimate-False-1-512] 0.7199ms 0.6737ms 1.4843 KOps/s 1.5008 KOps/s $\color{#d91a1a}-1.09\%$
test_gae_speed[vec_generalized_advantage_estimate-True-32-512] 1.6154ms 1.5157ms 659.7434 Ops/s 672.8061 Ops/s $\color{#d91a1a}-1.94\%$
test_gae_speed[vec_generalized_advantage_estimate-False-32-512] 0.7413ms 0.6861ms 1.4575 KOps/s 1.4635 KOps/s $\color{#d91a1a}-0.41\%$
test_dqn_speed[False-None] 1.8510ms 1.5251ms 655.6768 Ops/s 675.7994 Ops/s $\color{#d91a1a}-2.98\%$
test_dqn_speed[False-backward] 2.3108ms 2.1616ms 462.6150 Ops/s 471.0057 Ops/s $\color{#d91a1a}-1.78\%$
test_dqn_speed[True-None] 0.7391ms 0.5800ms 1.7243 KOps/s 1.7304 KOps/s $\color{#d91a1a}-0.35\%$
test_dqn_speed[True-backward] 1.2893ms 1.2396ms 806.6809 Ops/s 811.2419 Ops/s $\color{#d91a1a}-0.56\%$
test_dqn_speed[reduce-overhead-None] 0.6698ms 0.5990ms 1.6695 KOps/s 1.6416 KOps/s $\color{#35bf28}+1.70\%$
test_ddpg_speed[False-None] 3.2004ms 2.8311ms 353.2175 Ops/s 356.7675 Ops/s $\color{#d91a1a}-1.00\%$
test_ddpg_speed[False-backward] 4.7088ms 4.2793ms 233.6821 Ops/s 237.7335 Ops/s $\color{#d91a1a}-1.70\%$
test_ddpg_speed[True-None] 1.5105ms 1.3779ms 725.7543 Ops/s 730.0601 Ops/s $\color{#d91a1a}-0.59\%$
test_ddpg_speed[True-backward] 2.6084ms 2.5639ms 390.0251 Ops/s 391.4092 Ops/s $\color{#d91a1a}-0.35\%$
test_ddpg_speed[reduce-overhead-None] 1.4780ms 1.4026ms 712.9500 Ops/s 720.9210 Ops/s $\color{#d91a1a}-1.11\%$
test_sac_speed[False-None] 8.9594ms 8.3185ms 120.2137 Ops/s 120.8201 Ops/s $\color{#d91a1a}-0.50\%$
test_sac_speed[False-backward] 11.9718ms 11.5836ms 86.3287 Ops/s 86.9959 Ops/s $\color{#d91a1a}-0.77\%$
test_sac_speed[True-None] 2.2423ms 1.9034ms 525.3734 Ops/s 534.7726 Ops/s $\color{#d91a1a}-1.76\%$
test_sac_speed[True-backward] 3.7403ms 3.6919ms 270.8619 Ops/s 270.3296 Ops/s $\color{#35bf28}+0.20\%$
test_sac_speed[reduce-overhead-None] 16.1150ms 9.8145ms 101.8904 Ops/s 100.6279 Ops/s $\color{#35bf28}+1.25\%$
test_redq_deprec_speed[False-None] 10.2274ms 9.3746ms 106.6713 Ops/s 107.6482 Ops/s $\color{#d91a1a}-0.91\%$
test_redq_deprec_speed[False-backward] 13.5442ms 12.6806ms 78.8604 Ops/s 78.8616 Ops/s $-0.00\%$
test_redq_deprec_speed[True-None] 2.7738ms 2.6084ms 383.3837 Ops/s 385.1620 Ops/s $\color{#d91a1a}-0.46\%$
test_redq_deprec_speed[True-backward] 4.6043ms 4.1382ms 241.6522 Ops/s 243.3308 Ops/s $\color{#d91a1a}-0.69\%$
test_redq_deprec_speed[reduce-overhead-None] 14.5219ms 9.5308ms 104.9232 Ops/s 104.7432 Ops/s $\color{#35bf28}+0.17\%$
test_td3_speed[False-None] 8.4606ms 8.1829ms 122.2066 Ops/s 122.2809 Ops/s $\color{#d91a1a}-0.06\%$
test_td3_speed[False-backward] 11.1052ms 10.6061ms 94.2858 Ops/s 47.0589 Ops/s $\textbf{\color{#35bf28}+100.36\%}$
test_td3_speed[True-None] 1.7716ms 1.7315ms 577.5494 Ops/s 603.0313 Ops/s $\color{#d91a1a}-4.23\%$
test_td3_speed[True-backward] 3.1275ms 3.0743ms 325.2804 Ops/s 324.7876 Ops/s $\color{#35bf28}+0.15\%$
test_td3_speed[reduce-overhead-None] 83.9646ms 25.2956ms 39.5326 Ops/s 40.2036 Ops/s $\color{#d91a1a}-1.67\%$
test_cql_speed[False-None] 17.7078ms 17.3144ms 57.7554 Ops/s 58.3759 Ops/s $\color{#d91a1a}-1.06\%$
test_cql_speed[False-backward] 23.0525ms 22.5591ms 44.3281 Ops/s 44.7404 Ops/s $\color{#d91a1a}-0.92\%$
test_cql_speed[True-None] 3.5006ms 3.3778ms 296.0538 Ops/s 295.9705 Ops/s $\color{#35bf28}+0.03\%$
test_cql_speed[True-backward] 6.1164ms 5.7125ms 175.0561 Ops/s 176.3982 Ops/s $\color{#d91a1a}-0.76\%$
test_cql_speed[reduce-overhead-None] 0.8407s 17.1205ms 58.4096 Ops/s 83.8052 Ops/s $\textbf{\color{#d91a1a}-30.30\%}$
test_a2c_speed[False-None] 3.5178ms 3.2244ms 310.1362 Ops/s 312.3408 Ops/s $\color{#d91a1a}-0.71\%$
test_a2c_speed[False-backward] 6.8597ms 6.3734ms 156.9026 Ops/s 164.6208 Ops/s $\color{#d91a1a}-4.69\%$
test_a2c_speed[True-None] 1.4652ms 1.3665ms 731.8232 Ops/s 736.4737 Ops/s $\color{#d91a1a}-0.63\%$
test_a2c_speed[True-backward] 3.2587ms 3.2073ms 311.7891 Ops/s 310.1845 Ops/s $\color{#35bf28}+0.52\%$
test_a2c_speed[reduce-overhead-None] 1.0874ms 1.0127ms 987.4830 Ops/s 981.2232 Ops/s $\color{#35bf28}+0.64\%$
test_ppo_speed[False-None] 3.9359ms 3.8250ms 261.4349 Ops/s 260.6932 Ops/s $\color{#35bf28}+0.28\%$
test_ppo_speed[False-backward] 7.7762ms 7.1812ms 139.2523 Ops/s 137.5615 Ops/s $\color{#35bf28}+1.23\%$
test_ppo_speed[True-None] 1.6153ms 1.4743ms 678.2817 Ops/s 674.7192 Ops/s $\color{#35bf28}+0.53\%$
test_ppo_speed[True-backward] 3.4181ms 3.1731ms 315.1494 Ops/s 296.6266 Ops/s $\textbf{\color{#35bf28}+6.24\%}$
test_ppo_speed[reduce-overhead-None] 1.1843ms 1.0556ms 947.3296 Ops/s 920.5002 Ops/s $\color{#35bf28}+2.91\%$
test_reinforce_speed[False-None] 2.3781ms 2.2552ms 443.4160 Ops/s 444.8851 Ops/s $\color{#d91a1a}-0.33\%$
test_reinforce_speed[False-backward] 3.5124ms 3.4328ms 291.3036 Ops/s 295.1485 Ops/s $\color{#d91a1a}-1.30\%$
test_reinforce_speed[True-None] 1.4663ms 1.3229ms 755.9339 Ops/s 749.6711 Ops/s $\color{#35bf28}+0.84\%$
test_reinforce_speed[True-backward] 3.2249ms 3.1387ms 318.6060 Ops/s 316.9226 Ops/s $\color{#35bf28}+0.53\%$
test_reinforce_speed[reduce-overhead-None] 15.5045ms 8.8124ms 113.4763 Ops/s 112.5877 Ops/s $\color{#35bf28}+0.79\%$
test_iql_speed[False-None] 10.4761ms 9.4648ms 105.6545 Ops/s 106.5425 Ops/s $\color{#d91a1a}-0.83\%$
test_iql_speed[False-backward] 13.6199ms 13.2270ms 75.6028 Ops/s 74.8526 Ops/s $\color{#35bf28}+1.00\%$
test_iql_speed[True-None] 2.3947ms 2.2598ms 442.5235 Ops/s 442.7527 Ops/s $\color{#d91a1a}-0.05\%$
test_iql_speed[True-backward] 5.0307ms 4.9322ms 202.7503 Ops/s 199.9765 Ops/s $\color{#35bf28}+1.39\%$
test_iql_speed[reduce-overhead-None] 15.8369ms 9.8739ms 101.2767 Ops/s 100.8535 Ops/s $\color{#35bf28}+0.42\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 6.1729ms 5.7848ms 172.8676 Ops/s 170.8776 Ops/s $\color{#35bf28}+1.16\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 1.0111ms 0.3247ms 3.0799 KOps/s 3.1161 KOps/s $\color{#d91a1a}-1.16\%$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.5928ms 0.3329ms 3.0035 KOps/s 3.1993 KOps/s $\textbf{\color{#d91a1a}-6.12\%}$
test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 5.8512ms 5.6032ms 178.4686 Ops/s 177.0096 Ops/s $\color{#35bf28}+0.82\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 0.7925ms 0.3182ms 3.1428 KOps/s 2.6235 KOps/s $\textbf{\color{#35bf28}+19.79\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.6141ms 0.3045ms 3.2843 KOps/s 2.7487 KOps/s $\textbf{\color{#35bf28}+19.49\%}$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] 1.4818ms 1.2594ms 794.0514 Ops/s 680.7101 Ops/s $\textbf{\color{#35bf28}+16.65\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] 1.3905ms 1.1733ms 852.2686 Ops/s 735.5898 Ops/s $\textbf{\color{#35bf28}+15.86\%}$
test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 12.7205ms 5.8718ms 170.3061 Ops/s 172.2861 Ops/s $\color{#d91a1a}-1.15\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 1.9350ms 0.4714ms 2.1212 KOps/s 1.8659 KOps/s $\textbf{\color{#35bf28}+13.68\%}$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.8572ms 0.4697ms 2.1289 KOps/s 2.0630 KOps/s $\color{#35bf28}+3.19\%$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 5.7032ms 5.5686ms 179.5790 Ops/s 176.4551 Ops/s $\color{#35bf28}+1.77\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 0.7087ms 0.2866ms 3.4894 KOps/s 2.7059 KOps/s $\textbf{\color{#35bf28}+28.96\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.5490ms 0.2702ms 3.7012 KOps/s 3.0478 KOps/s $\textbf{\color{#35bf28}+21.44\%}$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 5.7338ms 5.5534ms 180.0687 Ops/s 179.0451 Ops/s $\color{#35bf28}+0.57\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 1.9388ms 0.2969ms 3.3681 KOps/s 2.6214 KOps/s $\textbf{\color{#35bf28}+28.49\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.4984ms 0.2674ms 3.7397 KOps/s 2.7985 KOps/s $\textbf{\color{#35bf28}+33.63\%}$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 6.1365ms 5.7133ms 175.0290 Ops/s 170.7726 Ops/s $\color{#35bf28}+2.49\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 1.8500ms 0.4450ms 2.2471 KOps/s 1.9590 KOps/s $\textbf{\color{#35bf28}+14.71\%}$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.7174ms 0.4529ms 2.2081 KOps/s 2.0388 KOps/s $\textbf{\color{#35bf28}+8.30\%}$
test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] 0.9614s 24.1059ms 41.4836 Ops/s 197.7809 Ops/s $\textbf{\color{#d91a1a}-79.03\%}$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] 10.0424ms 1.9293ms 518.3312 Ops/s 561.9199 Ops/s $\textbf{\color{#d91a1a}-7.76\%}$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] 53.4649ms 2.1944ms 455.6991 Ops/s 1.0199 KOps/s $\textbf{\color{#d91a1a}-55.32\%}$
test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] 6.8405ms 4.9959ms 200.1641 Ops/s 194.6609 Ops/s $\color{#35bf28}+2.83\%$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] 3.9555ms 1.7909ms 558.3807 Ops/s 538.9639 Ops/s $\color{#35bf28}+3.60\%$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] 1.1359ms 0.9355ms 1.0689 KOps/s 1.0345 KOps/s $\color{#35bf28}+3.32\%$
test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] 7.5341ms 5.1905ms 192.6609 Ops/s 45.4567 Ops/s $\textbf{\color{#35bf28}+323.83\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] 11.1464ms 2.1214ms 471.3947 Ops/s 502.5052 Ops/s $\textbf{\color{#d91a1a}-6.19\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] 3.7016ms 1.1892ms 840.8935 Ops/s 873.0440 Ops/s $\color{#d91a1a}-3.68\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-True] 39.9043ms 37.6180ms 26.5830 Ops/s 25.7439 Ops/s $\color{#35bf28}+3.26\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-False] 19.4268ms 17.7521ms 56.3312 Ops/s 54.0038 Ops/s $\color{#35bf28}+4.31\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-True] 42.6200ms 39.3359ms 25.4221 Ops/s 25.0367 Ops/s $\color{#35bf28}+1.54\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-False] 20.0884ms 18.2338ms 54.8432 Ops/s 53.9529 Ops/s $\color{#35bf28}+1.65\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-True] 42.3557ms 40.7805ms 24.5215 Ops/s 23.9928 Ops/s $\color{#35bf28}+2.20\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-False] 20.9460ms 19.6006ms 51.0189 Ops/s 49.3514 Ops/s $\color{#35bf28}+3.38\%$
test_storage_write_lazystack[50-img_shape0-small] 0.9106ms 0.2147ms 4.6586 KOps/s 4.4323 KOps/s $\textbf{\color{#35bf28}+5.11\%}$
test_storage_write_lazystack[100-img_shape1-atari] 1.6520ms 1.3948ms 716.9575 Ops/s 726.3995 Ops/s $\color{#d91a1a}-1.30\%$
test_storage_write_lazystack[100-img_shape2-large_img] 2.6140ms 2.3450ms 426.4445 Ops/s 426.0980 Ops/s $\color{#35bf28}+0.08\%$
test_storage_write_lazystack[200-img_shape3-large_batch] 3.1539ms 2.9467ms 339.3646 Ops/s 341.8142 Ops/s $\color{#d91a1a}-0.72\%$
test_storage_write_contiguous[50-img_shape0-small] 0.2405ms 0.1623ms 6.1600 KOps/s 6.0737 KOps/s $\color{#35bf28}+1.42\%$
test_storage_write_contiguous[100-img_shape1-atari] 0.3837ms 0.2223ms 4.4980 KOps/s 4.5067 KOps/s $\color{#d91a1a}-0.19\%$
test_storage_write_contiguous[100-img_shape2-large_img] 2.0592ms 1.8273ms 547.2517 Ops/s 529.7240 Ops/s $\color{#35bf28}+3.31\%$
test_storage_write_contiguous[200-img_shape3-large_batch] 1.7188ms 1.4409ms 693.9905 Ops/s 718.6203 Ops/s $\color{#d91a1a}-3.43\%$
test_collector_stack_then_write[50-img_shape0-small] 1.4259ms 1.1246ms 889.2136 Ops/s 889.2022 Ops/s $+0.00\%$
test_collector_stack_then_write[100-img_shape1-atari] 7.5560ms 3.5742ms 279.7868 Ops/s 268.7438 Ops/s $\color{#35bf28}+4.11\%$
test_collector_stack_then_write[100-img_shape2-large_img] 11.2727ms 5.8084ms 172.1641 Ops/s 173.1760 Ops/s $\color{#d91a1a}-0.58\%$
test_collector_stack_then_write[200-img_shape3-large_batch] 7.7009ms 7.2431ms 138.0631 Ops/s 142.7606 Ops/s $\color{#d91a1a}-3.29\%$
test_collector_lazystack_then_write[50-img_shape0-small] 0.4382ms 0.2736ms 3.6556 KOps/s 3.7019 KOps/s $\color{#d91a1a}-1.25\%$
test_collector_lazystack_then_write[100-img_shape1-atari] 1.7430ms 1.5095ms 662.4526 Ops/s 683.1131 Ops/s $\color{#d91a1a}-3.02\%$
test_collector_lazystack_then_write[100-img_shape2-large_img] 2.8390ms 2.5075ms 398.8016 Ops/s 408.9205 Ops/s $\color{#d91a1a}-2.47\%$
test_collector_lazystack_then_write[200-img_shape3-large_batch] 3.5478ms 3.1529ms 317.1664 Ops/s 319.8388 Ops/s $\color{#d91a1a}-0.84\%$
test_collector_without_rb[100-img_shape0-atari] 33.2977ms 32.3662ms 30.8964 Ops/s 31.1529 Ops/s $\color{#d91a1a}-0.82\%$
test_collector_without_rb[200-img_shape1-large_batch] 63.2353ms 62.8162ms 15.9194 Ops/s 15.7511 Ops/s $\color{#35bf28}+1.07\%$
test_collector_with_rb[100-img_shape0-atari] 37.1945ms 36.4602ms 27.4272 Ops/s 26.9839 Ops/s $\color{#35bf28}+1.64\%$
test_collector_with_rb[200-img_shape1-large_batch] 94.7857ms 75.6199ms 13.2240 Ops/s 13.9985 Ops/s $\textbf{\color{#d91a1a}-5.53\%}$
test_collector_without_rb_cuda[100-img_shape0-atari] 55.2475ms 54.5147ms 18.3437 Ops/s 18.4519 Ops/s $\color{#d91a1a}-0.59\%$
test_collector_without_rb_cuda[200-img_shape1-large_batch] 0.1087s 0.1084s 9.2263 Ops/s 9.2742 Ops/s $\color{#d91a1a}-0.52\%$
test_collector_with_rb_cuda[100-img_shape0-atari] 56.5716ms 56.2815ms 17.7678 Ops/s 17.8741 Ops/s $\color{#d91a1a}-0.59\%$
test_collector_with_rb_cuda[200-img_shape1-large_batch] 0.1127s 0.1123s 8.9068 Ops/s 8.9366 Ops/s $\color{#d91a1a}-0.33\%$
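The "Change" column in the table above is consistent with the relative difference between the PR's throughput ("Ops") and the throughput on repo HEAD, after converting both to the same unit. A minimal sketch of that arithmetic follows; the helper names are hypothetical and this is not the benchmark workflow's own code.

```python
def parse_ops(value: str) -> float:
    """Convert a string such as '1.0199 KOps/s' or '455.6991 Ops/s' to ops per second."""
    number, unit = value.split()
    scale = {"Ops/s": 1.0, "KOps/s": 1_000.0}[unit]
    return float(number) * scale


def relative_change(ops_pr: str, ops_head: str) -> float:
    """Percentage change of the PR's throughput relative to repo HEAD."""
    pr = parse_ops(ops_pr)
    head = parse_ops(ops_head)
    return (pr - head) / head * 100.0
```

For example, `relative_change("455.6991 Ops/s", "1.0199 KOps/s")` reproduces the value reported for `test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400]` once rounded to two decimals. Rows where the two columns use different units are the ones most easily misread by eye.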

@vmoens vmoens closed this Mar 26, 2026
@vmoens vmoens deleted the gh/vmoens/235/head branch March 26, 2026 16:54
vmoens added a commit that referenced this pull request Mar 26, 2026
Replace the complex tiered multiplicative reward (structure multiplier,
quality bonus thresholds, complexity scaling) with a simple weighted
average of IFEval metrics plus a small additive format bonus.

The new reward is: weighted_avg(strict/loose metrics) + format_bonus,
where format_bonus is 0.1 for a single answer block and 0.05 for a
single think block. Reward range: ~[0, 1.15].

Made-with: Cursor
ghstack-source-id: 559c504
Pull-Request: #3543

ghstack-source-id: 559c504
Pull Request resolved: #3569

Labels

CLA Signed: This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
llm/: LLM-related PR, triggers LLM CI tests