
Consolidate lm-eval scripts: merge AnyModel auto-detection into lm_eval_hf.py#1206

Merged
kevalmorabia97 merged 2 commits into feature/puzzletron from
jrausch/lm-eval-consolidation
Apr 9, 2026

Conversation

@j-rausch
Contributor

@j-rausch j-rausch commented Apr 8, 2026

Summary

  • Merge examples/puzzletron/evaluation/lm_eval_anymodel.py into the existing
    examples/llm_eval/lm_eval_hf.py so there is a single evaluation entry point
    for both standard HF and AnyModel/Puzzletron checkpoints.
  • AnyModel support is auto-detected at load time via resolve_descriptor_from_pretrained;
    the puzzletron extra is optional.

Notes

AnyModel auto-detection uses resolve_descriptor_from_pretrained, which currently
relies on a hardcoded _MODEL_TYPE_TO_DESCRIPTOR dict that must be kept in sync
manually with descriptor registrations. This should be addressed in the future.
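One hypothetical way to remove that manual-sync burden would be to have descriptors register themselves, so the mapping is populated at registration time rather than maintained by hand. This is illustrative only; apart from `_MODEL_TYPE_TO_DESCRIPTOR`, none of these names are from the codebase.

```python
# Hypothetical registry sketch: descriptor classes self-register by model_type,
# so _MODEL_TYPE_TO_DESCRIPTOR no longer has to be kept in sync manually.
_MODEL_TYPE_TO_DESCRIPTOR = {}

def register_descriptor(model_type):
    """Class decorator that records a descriptor under its HF model_type."""
    def wrap(cls):
        _MODEL_TYPE_TO_DESCRIPTOR[model_type] = cls
        return cls
    return wrap

@register_descriptor("llama")
class LlamaDescriptor:  # stand-in for a real AnyModel descriptor
    pass

def resolve_descriptor(model_type):
    """Look up a registered descriptor, mirroring the ValueError-on-miss behavior."""
    try:
        return _MODEL_TYPE_TO_DESCRIPTOR[model_type]
    except KeyError:
        raise ValueError(f"unknown model_type: {model_type!r}") from None
```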

Summary by CodeRabbit

  • New Features

    • Evaluation now supports Puzzletron heterogeneous pruned checkpoints through the main lm-eval entrypoint.
  • Documentation

    • Added “Heterogeneous Pruned Checkpoints (Puzzletron)” subsection with installation notes, example evaluation commands, and a smoke-test tip.
  • Chores

    • Consolidated Puzzletron evaluation into the primary workflow and removed the separate Puzzletron entrypoint.
    • License-header insertion now applies to the previously excluded Puzzletron evaluation script.

@j-rausch j-rausch requested review from a team as code owners April 8, 2026 15:19
@j-rausch j-rausch requested review from Edwardf0t1 and kevalmorabia97 and removed request for a team April 8, 2026 15:19
@coderabbitai
Contributor

coderabbitai bot commented Apr 8, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 490869c3-8a8a-47fa-9af5-d344bb0517d4

📥 Commits

Reviewing files that changed from the base of the PR and between d378155 and bf5cbb1.

📒 Files selected for processing (5)
  • .pre-commit-config.yaml
  • examples/llm_eval/README.md
  • examples/llm_eval/lm_eval_hf.py
  • examples/puzzletron/README.md
  • examples/puzzletron/evaluation/lm_eval_anymodel.py
💤 Files with no reviewable changes (2)
  • .pre-commit-config.yaml
  • examples/puzzletron/evaluation/lm_eval_anymodel.py
✅ Files skipped from review due to trivial changes (1)
  • examples/llm_eval/README.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/llm_eval/lm_eval_hf.py

📝 Walkthrough

Walkthrough

Consolidates Puzzletron AnyModel support into the main HuggingFace lm-eval wrapper: removes the Puzzletron-specific entrypoint, adds conditional Puzzletron patching inside examples/llm_eval/lm_eval_hf.py, updates docs, and adjusts pre-commit exclusions.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Pre-commit config<br>`.pre-commit-config.yaml` | Removed exclude entry for `examples/puzzletron/evaluation/lm_eval_anymodel.py`, so that file is no longer exempt from the insert-license hook (file subsequently deleted). |
| Documentation<br>`examples/llm_eval/README.md`, `examples/puzzletron/README.md` | Added Puzzletron usage instructions to `examples/llm_eval/README.md` (example `lm_eval_hf.py` invocation and install note); updated `examples/puzzletron/README.md` to point to `examples/llm_eval/lm_eval_hf.py`. |
| HuggingFace lm-eval wrapper<br>`examples/llm_eval/lm_eval_hf.py` | Added guarded Puzzletron imports and `_ANYMODEL_AVAILABLE` detection; introduced `_anymodel_patcher_context(pretrained, trust_remote_code=False)`; wrapped model construction in that context; added `create_from_arg_string` and monkey-patched `HFLM.create_from_arg_string`. |
| Removed Puzzletron entrypoint<br>`examples/puzzletron/evaluation/lm_eval_anymodel.py` | Deleted standalone Puzzletron lm-eval entrypoint that provided patched `create_from_arg_obj`/`create_from_arg_string` and direct `cli_evaluate()` execution. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant LM_HF as lm_eval_hf.py
    participant Puzzletron
    participant HF as HuggingFace Loader

    User->>LM_HF: call create_from_arg_obj(arg_dict) / create_from_arg_string(arg_string)
    LM_HF->>LM_HF: extract `pretrained`, `trust_remote_code`
    LM_HF->>Puzzletron: resolve_descriptor_from_pretrained(pretrained)
    alt descriptor found
        Puzzletron-->>LM_HF: descriptor
        LM_HF->>LM_HF: enter deci_x_patcher context
        LM_HF->>HF: load model (patched)
        HF-->>LM_HF: patched model instance
        LM_HF->>LM_HF: exit patcher context
    else descriptor not found or Puzzletron unavailable
        Puzzletron-->>LM_HF: error / none
        LM_HF->>HF: load model normally
        HF-->>LM_HF: standard model instance
    end
    LM_HF-->>User: return configured model
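The flow in the diagram can be sketched as a `create_from_arg_string` wrapper that parses lm_eval's `k=v` argument string and builds the model inside the patcher context. This is a sketch with stand-in `HFLM` and patcher stubs; the real code lives in `examples/llm_eval/lm_eval_hf.py` and lm_eval's `HFLM` class.

```python
import contextlib

def _anymodel_patcher_context(pretrained, trust_remote_code=False):
    # Stand-in: the PR's helper returns the Puzzletron patcher for AnyModel
    # checkpoints and a no-op context for plain HF checkpoints.
    return contextlib.nullcontext()

class HFLM:  # stand-in for lm_eval.models.huggingface.HFLM
    def __init__(self, **kwargs):
        self.kwargs = kwargs

def create_from_arg_string(cls, arg_string, additional_config=None):
    """Parse 'k=v,k=v' args, then construct the model inside the patcher context."""
    args = dict(kv.split("=", 1) for kv in arg_string.split(",") if kv)
    args2 = additional_config or {}
    with _anymodel_patcher_context(
        args.get("pretrained"), args.get("trust_remote_code", False)
    ):
        return cls(**args, **args2)

# Monkey-patch so lm_eval's standard CLI also routes through the patcher.
HFLM.create_from_arg_string = classmethod(create_from_arg_string)
```

Because the patch is applied to `HFLM` itself, the context is entered even when users invoke lm_eval's own CLI rather than this script's `__main__`.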

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main objective: consolidating lm-eval scripts by merging AnyModel auto-detection functionality into lm_eval_hf.py, which is the primary change across all modified files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Security Anti-Patterns | ✅ Passed | Code review against SECURITY.md reveals no security anti-patterns: no torch.load() with weights_only=False, no numpy.load() with allow_pickle=True, trust_remote_code properly exposed as a caller-configurable parameter defaulting to False, no eval()/exec() on external input, no nosec comments, and dead code removal without non-permissive license dependencies. |




@github-actions
Contributor

github-actions bot commented Apr 8, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-09 14:15 UTC

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
examples/llm_eval/lm_eval_hf.py (1)

142-155: Feature gap: create_from_arg_string lacks quantization/sparsity support.

Unlike create_from_arg_obj (which applies quantize_model and sparsify_model), this method only enables HuggingFace checkpointing and Puzzletron patching. If users invoke lm_eval through its standard CLI (bypassing this script's __main__), quantization arguments would be silently ignored.

If this is intentional (string-based path is for simpler use cases), consider adding a docstring note. Otherwise, consider extracting the quantization/sparsity logic into a shared helper.
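The shared-helper route the comment suggests could look like the sketch below. The stand-in `quantize_model`/`sparsify_model` bodies are illustrative only; the real implementations live in this repo's example code, and `apply_post_load_steps` is a hypothetical name for the extracted helper.

```python
def quantize_model(model, quant_cfg):
    # Stand-in for the real quantization step in lm_eval_hf.py.
    model.setdefault("applied", []).append(("quantize", quant_cfg))
    return model

def sparsify_model(model, sparse_cfg):
    # Stand-in for the real sparsification step.
    model.setdefault("applied", []).append(("sparsify", sparse_cfg))
    return model

def apply_post_load_steps(model, quant_cfg=None, sparse_cfg=None):
    """Shared post-load helper so create_from_arg_obj and create_from_arg_string
    both honor quantization/sparsity settings instead of only one path."""
    if quant_cfg is not None:
        model = quantize_model(model, quant_cfg)
    if sparse_cfg is not None:
        model = sparsify_model(model, sparse_cfg)
    return model
```

Calling the same helper from both constructors removes the silent-ignore behavior the review flags for the string-based CLI path.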

Inline comments:

In `examples/llm_eval/lm_eval_hf.py`:

  • Around lines 74-78: The `descriptor is None` guard is unreachable because resolve_descriptor_from_pretrained raises ValueError on failure (which is already caught). Remove the redundant guard and return `deci_x_patcher(model_descriptor=descriptor)` directly.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 90dcb473-d82b-43c9-86b9-031ab5743b41

📥 Commits

Reviewing files that changed from the base of the PR and between 25266b8 and d862c4e.

📒 Files selected for processing (5)
  • .pre-commit-config.yaml
  • examples/llm_eval/README.md
  • examples/llm_eval/lm_eval_hf.py
  • examples/puzzletron/README.md
  • examples/puzzletron/evaluation/lm_eval_anymodel.py
💤 Files with no reviewable changes (2)
  • .pre-commit-config.yaml
  • examples/puzzletron/evaluation/lm_eval_anymodel.py

Collaborator

@kevalmorabia97 kevalmorabia97 left a comment


Great cleanup

Collaborator


What is examples/puzzletron/evaluation/hf_deployable_anymodel.py used for? Is it for evaluation?

Contributor Author


This file enables a secondary evaluation path with NeMo Evaluator. NeMo Evaluator, as is, builds on Ray deployment via nemo export-deploy, and the export-deploy deployment script doesn't have the AnyModel patcher built in at the moment.

j-rausch added 2 commits April 9, 2026 06:49
…al_hf.py

Signed-off-by: jrausch <jrausch@nvidia.com>
Signed-off-by: jrausch <jrausch@nvidia.com>
@j-rausch j-rausch force-pushed the jrausch/lm-eval-consolidation branch from d378155 to bf5cbb1 Compare April 9, 2026 13:50
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
examples/llm_eval/lm_eval_hf.py (2)

70-75: Consider broadening exception handling for robustness.

Per the relevant code snippet, resolve_descriptor_from_pretrained calls AutoConfig.from_pretrained(), which can raise OSError, FileNotFoundError, or network-related exceptions beyond ValueError/AttributeError. If the intent is to silently fall back to standard HF loading when Puzzletron detection fails for any reason (including transient issues), consider catching Exception or at least OSError. If the intent is to fail loudly on I/O errors, the current behavior is fine.

♻️ Optional: broader exception handling
     try:
         descriptor = resolve_descriptor_from_pretrained(
             pretrained, trust_remote_code=trust_remote_code
         )
-    except (ValueError, AttributeError):
+    except (ValueError, AttributeError, OSError):
         return contextlib.nullcontext()
     return deci_x_patcher(model_descriptor=descriptor)

140-154: Asymmetric behavior: missing padding_side and quantization/sparsity logic.

Unlike create_from_arg_obj, this function:

  1. Does not set model_obj.tokenizer.padding_side = "left"
  2. Does not apply quantization (quant_cfg) or sparsity (sparse_cfg)

If this is intentional (e.g., string-based creation is only for Puzzletron checkpoints that don't need post-load processing), please add a docstring clarification. Otherwise, consider aligning the behavior.

♻️ Minimal fix: add padding_side for consistency
     with _anymodel_patcher_context(args.get("pretrained"), args.get("trust_remote_code", False)):
         model_obj = cls(**args, **args2)
 
+    model_obj.tokenizer.padding_side = "left"
     return model_obj

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 48f2ed1d-c9e8-493d-b3cd-e10780c0c4b4

📥 Commits

Reviewing files that changed from the base of the PR and between d862c4e and d378155.

📒 Files selected for processing (2)
  • examples/llm_eval/README.md
  • examples/llm_eval/lm_eval_hf.py
✅ Files skipped from review due to trivial changes (1)
  • examples/llm_eval/README.md

@kevalmorabia97 kevalmorabia97 enabled auto-merge (squash) April 9, 2026 13:51
@codecov

codecov bot commented Apr 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.71%. Comparing base (25266b8) to head (bf5cbb1).
⚠️ Report is 1 commits behind head on feature/puzzletron.

Additional details and impacted files
@@                  Coverage Diff                   @@
##           feature/puzzletron    #1206      +/-   ##
======================================================
- Coverage               75.78%   75.71%   -0.08%     
======================================================
  Files                     446      446              
  Lines                   47684    47684              
======================================================
- Hits                    36139    36102      -37     
- Misses                  11545    11582      +37     
| Flag | Coverage Δ |
| --- | --- |
| examples | 43.30% <ø> (-0.16%) ⬇️ |
| unit | 52.13% <ø> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.


@kevalmorabia97 kevalmorabia97 merged commit fd5694d into feature/puzzletron Apr 9, 2026
42 of 43 checks passed
@kevalmorabia97 kevalmorabia97 deleted the jrausch/lm-eval-consolidation branch April 9, 2026 14:15