
Consolidate lm-eval scripts: merge AnyModel auto-detection into lm_eval_hf.py#1206

Merged
kevalmorabia97 merged 2 commits into feature/puzzletron from
jrausch/lm-eval-consolidation
Apr 9, 2026

Conversation

@j-rausch
Contributor

@j-rausch j-rausch commented Apr 8, 2026

Summary

  • Merge examples/puzzletron/evaluation/lm_eval_anymodel.py into the existing
    examples/llm_eval/lm_eval_hf.py so there is a single evaluation entry point
    for both standard HF and AnyModel/Puzzletron checkpoints.
  • AnyModel support is auto-detected at load time via resolve_descriptor_from_pretrained;
    the puzzletron extra is optional.

Notes

AnyModel auto-detection uses resolve_descriptor_from_pretrained, which currently
relies on a hardcoded _MODEL_TYPE_TO_DESCRIPTOR dict that must be kept in sync
manually with descriptor registrations. This should be addressed in the future.
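One hypothetical way to remove that manual-sync burden would be to have descriptors register themselves, so the mapping is populated at registration time rather than maintained by hand. This is illustrative only; apart from `_MODEL_TYPE_TO_DESCRIPTOR`, none of these names are from the codebase.

```python
# Hypothetical registry sketch: descriptor classes self-register by model_type,
# so _MODEL_TYPE_TO_DESCRIPTOR no longer has to be kept in sync manually.
_MODEL_TYPE_TO_DESCRIPTOR = {}

def register_descriptor(model_type):
    """Class decorator that records a descriptor under its HF model_type."""
    def wrap(cls):
        _MODEL_TYPE_TO_DESCRIPTOR[model_type] = cls
        return cls
    return wrap

@register_descriptor("llama")
class LlamaDescriptor:  # stand-in for a real AnyModel descriptor
    pass

def resolve_descriptor(model_type):
    """Look up a registered descriptor, mirroring the ValueError-on-miss behavior."""
    try:
        return _MODEL_TYPE_TO_DESCRIPTOR[model_type]
    except KeyError:
        raise ValueError(f"unknown model_type: {model_type!r}") from None
```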

Summary by CodeRabbit

  • New Features

    • Evaluation now supports Puzzletron heterogeneous pruned checkpoints through the main lm-eval entrypoint.
  • Documentation

    • Added “Heterogeneous Pruned Checkpoints (Puzzletron)” subsection with installation notes, example evaluation commands, and a smoke-test tip.
  • Chores

    • Consolidated Puzzletron evaluation into the primary workflow and removed the separate Puzzletron entrypoint.
    • License-header insertion now applies to the previously excluded Puzzletron evaluation script.

@j-rausch j-rausch requested review from a team as code owners April 8, 2026 15:19
@j-rausch j-rausch requested review from Edwardf0t1 and kevalmorabia97 and removed request for a team April 8, 2026 15:19
@coderabbitai
Contributor

coderabbitai bot commented Apr 8, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 490869c3-8a8a-47fa-9af5-d344bb0517d4

📥 Commits

Reviewing files that changed from the base of the PR and between d378155 and bf5cbb1.

📒 Files selected for processing (5)
  • .pre-commit-config.yaml
  • examples/llm_eval/README.md
  • examples/llm_eval/lm_eval_hf.py
  • examples/puzzletron/README.md
  • examples/puzzletron/evaluation/lm_eval_anymodel.py
💤 Files with no reviewable changes (2)
  • .pre-commit-config.yaml
  • examples/puzzletron/evaluation/lm_eval_anymodel.py
✅ Files skipped from review due to trivial changes (1)
  • examples/llm_eval/README.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/llm_eval/lm_eval_hf.py

📝 Walkthrough

Walkthrough

Consolidates Puzzletron AnyModel support into the main HuggingFace lm-eval wrapper: removes the Puzzletron-specific entrypoint, adds conditional Puzzletron patching inside examples/llm_eval/lm_eval_hf.py, updates docs, and adjusts pre-commit exclusions.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Pre-commit config<br>`.pre-commit-config.yaml` | Removed exclude entry for `examples/puzzletron/evaluation/lm_eval_anymodel.py`, so that file is no longer exempt from the insert-license hook (file subsequently deleted). |
| Documentation<br>`examples/llm_eval/README.md`, `examples/puzzletron/README.md` | Added Puzzletron usage instructions to `examples/llm_eval/README.md` (example `lm_eval_hf.py` invocation and install note); updated `examples/puzzletron/README.md` to point to `examples/llm_eval/lm_eval_hf.py`. |
| HuggingFace lm-eval wrapper<br>`examples/llm_eval/lm_eval_hf.py` | Added guarded Puzzletron imports and `_ANYMODEL_AVAILABLE` detection; introduced `_anymodel_patcher_context(pretrained, trust_remote_code=False)`; wrapped model construction in that context; added `create_from_arg_string` and monkey-patched `HFLM.create_from_arg_string`. |
| Removed Puzzletron entrypoint<br>`examples/puzzletron/evaluation/lm_eval_anymodel.py` | Deleted standalone Puzzletron lm-eval entrypoint that provided patched `create_from_arg_obj`/`create_from_arg_string` and direct `cli_evaluate()` execution. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant LM_HF as lm_eval_hf.py
    participant Puzzletron
    participant HF as HuggingFace Loader

    User->>LM_HF: call create_from_arg_obj(arg_dict) / create_from_arg_string(arg_string)
    LM_HF->>LM_HF: extract `pretrained`, `trust_remote_code`
    LM_HF->>Puzzletron: resolve_descriptor_from_pretrained(pretrained)
    alt descriptor found
        Puzzletron-->>LM_HF: descriptor
        LM_HF->>LM_HF: enter deci_x_patcher context
        LM_HF->>HF: load model (patched)
        HF-->>LM_HF: patched model instance
        LM_HF->>LM_HF: exit patcher context
    else descriptor not found or Puzzletron unavailable
        Puzzletron-->>LM_HF: error / none
        LM_HF->>HF: load model normally
        HF-->>LM_HF: standard model instance
    end
    LM_HF-->>User: return configured model
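The flow in the diagram can be sketched as a `create_from_arg_string` wrapper that parses lm_eval's `k=v` argument string and builds the model inside the patcher context. This is a sketch with stand-in `HFLM` and patcher stubs; the real code lives in `examples/llm_eval/lm_eval_hf.py` and lm_eval's `HFLM` class.

```python
import contextlib

def _anymodel_patcher_context(pretrained, trust_remote_code=False):
    # Stand-in: the PR's helper returns the Puzzletron patcher for AnyModel
    # checkpoints and a no-op context for plain HF checkpoints.
    return contextlib.nullcontext()

class HFLM:  # stand-in for lm_eval.models.huggingface.HFLM
    def __init__(self, **kwargs):
        self.kwargs = kwargs

def create_from_arg_string(cls, arg_string, additional_config=None):
    """Parse 'k=v,k=v' args, then construct the model inside the patcher context."""
    args = dict(kv.split("=", 1) for kv in arg_string.split(",") if kv)
    args2 = additional_config or {}
    with _anymodel_patcher_context(
        args.get("pretrained"), args.get("trust_remote_code", False)
    ):
        return cls(**args, **args2)

# Monkey-patch so lm_eval's standard CLI also routes through the patcher.
HFLM.create_from_arg_string = classmethod(create_from_arg_string)
```

Because the patch is applied to `HFLM` itself, the context is entered even when users invoke lm_eval's own CLI rather than this script's `__main__`.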

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main objective: consolidating lm-eval scripts by merging AnyModel auto-detection functionality into lm_eval_hf.py, which is the primary change across all modified files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Security Anti-Patterns | ✅ Passed | Code review against SECURITY.md reveals no security anti-patterns: no torch.load() with weights_only=False, no numpy.load() with allow_pickle=True, trust_remote_code properly exposed as a caller-configurable parameter defaulting to False, no eval()/exec() on external input, no nosec comments, and dead code removal without non-permissive license dependencies. |




@github-actions
Contributor

github-actions bot commented Apr 8, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-09 14:15 UTC

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
examples/llm_eval/lm_eval_hf.py (1)

142-155: Feature gap: create_from_arg_string lacks quantization/sparsity support.

Unlike create_from_arg_obj (which applies quantize_model and sparsify_model), this method only enables HuggingFace checkpointing and Puzzletron patching. If users invoke lm_eval through its standard CLI (bypassing this script's __main__), quantization arguments would be silently ignored.

If this is intentional (string-based path is for simpler use cases), consider adding a docstring note. Otherwise, consider extracting the quantization/sparsity logic into a shared helper.
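The shared-helper route the comment suggests could look like the sketch below. The stand-in `quantize_model`/`sparsify_model` bodies are illustrative only; the real implementations live in this repo's example code, and `apply_post_load_steps` is a hypothetical name for the extracted helper.

```python
def quantize_model(model, quant_cfg):
    # Stand-in for the real quantization step in lm_eval_hf.py.
    model.setdefault("applied", []).append(("quantize", quant_cfg))
    return model

def sparsify_model(model, sparse_cfg):
    # Stand-in for the real sparsification step.
    model.setdefault("applied", []).append(("sparsify", sparse_cfg))
    return model

def apply_post_load_steps(model, quant_cfg=None, sparse_cfg=None):
    """Shared post-load helper so create_from_arg_obj and create_from_arg_string
    both honor quantization/sparsity settings instead of only one path."""
    if quant_cfg is not None:
        model = quantize_model(model, quant_cfg)
    if sparse_cfg is not None:
        model = sparsify_model(model, sparse_cfg)
    return model
```

Calling the same helper from both constructors removes the silent-ignore behavior the review flags for the string-based CLI path.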

Inline comments:

In `examples/llm_eval/lm_eval_hf.py`:

  • Around lines 74-78: The `descriptor is None` guard is unreachable because resolve_descriptor_from_pretrained raises ValueError on failure (which is already caught). Remove the redundant guard and return `deci_x_patcher(model_descriptor=descriptor)` directly.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 90dcb473-d82b-43c9-86b9-031ab5743b41

📥 Commits

Reviewing files that changed from the base of the PR and between 25266b8 and d862c4e.

📒 Files selected for processing (5)
  • .pre-commit-config.yaml
  • examples/llm_eval/README.md
  • examples/llm_eval/lm_eval_hf.py
  • examples/puzzletron/README.md
  • examples/puzzletron/evaluation/lm_eval_anymodel.py
💤 Files with no reviewable changes (2)
  • .pre-commit-config.yaml
  • examples/puzzletron/evaluation/lm_eval_anymodel.py

Collaborator

@kevalmorabia97 kevalmorabia97 left a comment


Great cleanup

Collaborator


What is examples/puzzletron/evaluation/hf_deployable_anymodel.py used for? Is it for evaluation?

Contributor Author


This file enables a secondary evaluation path with NeMo Evaluator. NeMo Evaluator, as is, builds on Ray deployment via nemo export-deploy, and the export-deploy deployment script doesn't have the AnyModel patcher built in at the moment.

j-rausch added 2 commits April 9, 2026 06:49
…al_hf.py

Signed-off-by: jrausch <jrausch@nvidia.com>
Signed-off-by: jrausch <jrausch@nvidia.com>
@j-rausch j-rausch force-pushed the jrausch/lm-eval-consolidation branch from d378155 to bf5cbb1 Compare April 9, 2026 13:50
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
examples/llm_eval/lm_eval_hf.py (2)

70-75: Consider broadening exception handling for robustness.

Per the relevant code snippet, resolve_descriptor_from_pretrained calls AutoConfig.from_pretrained(), which can raise OSError, FileNotFoundError, or network-related exceptions beyond ValueError/AttributeError. If the intent is to silently fall back to standard HF loading when Puzzletron detection fails for any reason (including transient issues), consider catching Exception or at least OSError. If the intent is to fail loudly on I/O errors, the current behavior is fine.

♻️ Optional: broader exception handling
     try:
         descriptor = resolve_descriptor_from_pretrained(
             pretrained, trust_remote_code=trust_remote_code
         )
-    except (ValueError, AttributeError):
+    except (ValueError, AttributeError, OSError):
         return contextlib.nullcontext()
     return deci_x_patcher(model_descriptor=descriptor)

140-154: Asymmetric behavior: missing padding_side and quantization/sparsity logic.

Unlike create_from_arg_obj, this function:

  1. Does not set model_obj.tokenizer.padding_side = "left"
  2. Does not apply quantization (quant_cfg) or sparsity (sparse_cfg)

If this is intentional (e.g., string-based creation is only for Puzzletron checkpoints that don't need post-load processing), please add a docstring clarification. Otherwise, consider aligning the behavior.

♻️ Minimal fix: add padding_side for consistency
     with _anymodel_patcher_context(args.get("pretrained"), args.get("trust_remote_code", False)):
         model_obj = cls(**args, **args2)
 
+    model_obj.tokenizer.padding_side = "left"
     return model_obj

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 48f2ed1d-c9e8-493d-b3cd-e10780c0c4b4

📥 Commits

Reviewing files that changed from the base of the PR and between d862c4e and d378155.

📒 Files selected for processing (2)
  • examples/llm_eval/README.md
  • examples/llm_eval/lm_eval_hf.py
✅ Files skipped from review due to trivial changes (1)
  • examples/llm_eval/README.md

@kevalmorabia97 kevalmorabia97 enabled auto-merge (squash) April 9, 2026 13:51
@codecov

codecov bot commented Apr 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.71%. Comparing base (25266b8) to head (bf5cbb1).
⚠️ Report is 1 commits behind head on feature/puzzletron.

Additional details and impacted files
@@                  Coverage Diff                   @@
##           feature/puzzletron    #1206      +/-   ##
======================================================
- Coverage               75.78%   75.71%   -0.08%     
======================================================
  Files                     446      446              
  Lines                   47684    47684              
======================================================
- Hits                    36139    36102      -37     
- Misses                  11545    11582      +37     
| Flag | Coverage Δ |
| --- | --- |
| examples | 43.30% <ø> (-0.16%) ⬇️ |
| unit | 52.13% <ø> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.


@kevalmorabia97 kevalmorabia97 merged commit fd5694d into feature/puzzletron Apr 9, 2026
42 of 43 checks passed
@kevalmorabia97 kevalmorabia97 deleted the jrausch/lm-eval-consolidation branch April 9, 2026 14:15