skills: add evals/evals.json smoke suite (group A)#1308

Merged
rgsl888prabhu merged 2 commits into
mainfrom
skills-add-evals-suite-a
May 27, 2026

Conversation

@rgsl888prabhu
Collaborator

@rgsl888prabhu rgsl888prabhu commented May 27, 2026

Summary

  • Split from #1302 (skills: add evals/evals.json smoke suite); adds evals/evals.json for 4 skills (group A).
  • Each skill gets one happy-path Q&A entry parallel to its existing benchmark/ directory.
  • API-skill questions ask for an ordered method-name list rather than runnable code, so scoring is pure text-pattern matching (no cuopt/cudf install or execution required).
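
A minimal sketch of what "pure text-pattern matching" on an ordered method-name list could look like. The helper `names_in_order` and the sample answer are hypothetical and are not taken from the repository's scoring harness; the method names come from the eval expectations quoted in this PR.

```python
import re


def names_in_order(answer: str, expected: list[str]) -> bool:
    """Return True if every expected name occurs in `answer`, in the given order."""
    pos = 0
    for name in expected:
        # Search only the text after the previous match so order is enforced.
        match = re.search(r"\b" + re.escape(name) + r"\b", answer[pos:])
        if match is None:
            return False
        pos += match.end()
    return True


# Hypothetical model answer: an ordered list of API names, no runnable code.
answer = "1. Problem  2. addVariable  3. addConstraint  4. setObjective  5. solve"
expected_order = ["Problem", "addVariable", "addConstraint", "setObjective", "solve"]

in_order = names_in_order(answer, expected_order)
out_of_order = names_in_order("solve, then Problem", ["Problem", "solve"])
print(in_order, out_of_order)
```

Because the check only inspects text, it needs neither a cuopt/cudf install nor a GPU, which matches the motivation given in the summary.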

Adds one happy-path Q&A entry per skill, parallel to the existing
benchmark/ directories. Split from #1302 to keep CI runtime per PR
under the job timeout.

This PR covers:
- cuopt-user-rules
- cuopt-routing-api-python
- cuopt-numerical-optimization-api-python
- cuopt-numerical-optimization-api-cli

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
@rgsl888prabhu rgsl888prabhu self-assigned this May 27, 2026
@rgsl888prabhu rgsl888prabhu added the improvement Improves an existing functionality label May 27, 2026
@rgsl888prabhu
Collaborator Author

/nvskills-ci

@rgsl888prabhu rgsl888prabhu added the non-breaking Introduces a non-breaking change label May 27, 2026
@coderabbitai

coderabbitai Bot commented May 27, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 45e00ea6-faf9-4164-8bbd-4b338e42755a

📥 Commits

Reviewing files that changed from the base of the PR and between b3f7c8f and cedb505.

📒 Files selected for processing (8)
  • skills/cuopt-numerical-optimization-api-cli/skill-card.md
  • skills/cuopt-numerical-optimization-api-cli/skill.oms.sig
  • skills/cuopt-numerical-optimization-api-python/skill-card.md
  • skills/cuopt-numerical-optimization-api-python/skill.oms.sig
  • skills/cuopt-routing-api-python/skill-card.md
  • skills/cuopt-routing-api-python/skill.oms.sig
  • skills/cuopt-user-rules/skill-card.md
  • skills/cuopt-user-rules/skill.oms.sig
✅ Files skipped from review due to trivial changes (1)
  • skills/cuopt-numerical-optimization-api-cli/skill-card.md

📝 Walkthrough

Walkthrough

Adds four evaluation cases (CLI MPS, Python LP API, Python routing VRPTW, user-rules clarification), updates several skill-card frontmatter texts, and regenerates Sigstore DSSE signature bundles for the affected skills.

Changes

cuOpt Skill Evaluation Specifications and Metadata

| Layer / File(s) | Summary |
|---|---|
| **Evaluation specs**<br>`skills/*/evals/evals.json` | Adds four evaluation JSON entries: CLI MPS/cli-command expectations, LP API call-sequence expectations, routing VRPTW API call-sequence expectations, and a user-rules “ask clarifying questions” evaluation. |
| **Skill-card edits (CLI)**<br>`skills/cuopt-numerical-optimization-api-cli/skill-card.md` | Reworded Use Case to reference LP/MILP/QP via cuopt_cli with MPS-format and added “Code” to Output Type(s). |
| **Skill-card edits (Numerical Python)**<br>`skills/cuopt-numerical-optimization-api-python/skill-card.md` | Adjusted top-level description to explicitly reference cuOpt Python API, expanded Use Case examples, and revised Reference(s) list. |
| **Skill-card edits (Routing Python)**<br>`skills/cuopt-routing-api-python/skill-card.md` | Refined Use Case wording and removed “Configuration instructions” from Output Type(s), leaving Code and API Calls. |
| **Skill-card edits (User rules)**<br>`skills/cuopt-user-rules/skill-card.md` | Rephrased Use Case for safe interaction patterns and removed “Configuration instructions” from Output Type(s). |
| **Sigstore DSSE bundles**<br>`skills/*/skill.oms.sig` | Replaced dsseEnvelope.payload and updated signatures[0].sig in four skill sig bundles to match new payloads; verificationMaterial and tlogEntries unchanged. |
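
The signature-bundle row can be illustrated with a small comparison sketch. It assumes the bundle JSON nests `signatures` under `dsseEnvelope` and that `tlogEntries` lives inside `verificationMaterial`; the two dictionaries below are placeholders, not real bundle contents, and `changed_fields` is a hypothetical helper.

```python
def changed_fields(old: dict, new: dict) -> list[str]:
    """Report which of the probed bundle fields differ between two bundles."""
    probes = {
        "dsseEnvelope.payload": lambda b: b["dsseEnvelope"]["payload"],
        "dsseEnvelope.signatures[0].sig": lambda b: b["dsseEnvelope"]["signatures"][0]["sig"],
        "verificationMaterial": lambda b: b["verificationMaterial"],
    }
    return [name for name, get in probes.items() if get(old) != get(new)]


# Placeholder bundles (assumed layout, dummy values).
old_bundle = {
    "verificationMaterial": {"tlogEntries": [{"logIndex": "1"}]},
    "dsseEnvelope": {"payload": "b2xk", "signatures": [{"sig": "c2lnMQ=="}]},
}
new_bundle = {
    "verificationMaterial": {"tlogEntries": [{"logIndex": "1"}]},
    "dsseEnvelope": {"payload": "bmV3", "signatures": [{"sig": "c2lnMg=="}]},
}

diff = changed_fields(old_bundle, new_bundle)
print(diff)
```

With these placeholder inputs only the payload and signature are reported as changed, mirroring the pattern the walkthrough describes.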

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA/cuopt#1287: Related sigstore/eval updates that also touch skill signature bundles and evaluation artifacts.

Suggested reviewers

  • Iroy30
  • tmckayus
  • bdice
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: adding evaluation specifications (evals.json) for four cuopt skills, indicating this is part A of a larger effort. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining the purpose of adding evals.json entries, the grouping as 'group A', and the rationale for text-pattern matching scoring. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@skills/cuopt-numerical-optimization-api-python/evals/evals.json`:
- Line 13: The evals.json entry listing PascalCase problem.Status.name values
incorrectly includes "FeasibleFound"; remove "FeasibleFound" from the expected
status list so the LP/continuous check only expects "Optimal" and
"PrimalFeasible" (i.e., align with LPTerminationStatus rather than
MILPTerminationStatus) and ensure any ground_truth checks that compare
problem.Status.name use ['Optimal','PrimalFeasible'] only.

In `@skills/cuopt-user-rules/evals/evals.json`:
- Line 7: The eval's ground_truth string in
skills/cuopt-user-rules/evals/evals.json currently lists "Python / C / REST" but
SKILL.md's Interface includes "CLI"; update the ground_truth entry to include
"CLI" among the interface choices (e.g., "Python / C / REST / CLI") so the eval
questions match SKILL.md, and ensure any downstream expected_behavior text that
enumerates interfaces also reflects CLI (or, if omitting CLI intentionally, add
a brief justification string in the ground_truth referencing SKILL.md's
interface list).
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5374cae1-b836-4352-98a8-d33b6c9db304

📥 Commits

Reviewing files that changed from the base of the PR and between 16276d2 and b3f7c8f.

📒 Files selected for processing (4)
  • skills/cuopt-numerical-optimization-api-cli/evals/evals.json
  • skills/cuopt-numerical-optimization-api-python/evals/evals.json
  • skills/cuopt-routing-api-python/evals/evals.json
  • skills/cuopt-user-rules/evals/evals.json

"Lists the API calls in order without producing a full runnable script",
"Names Problem, addVariable (with vtype=CONTINUOUS), addConstraint, setObjective (sense=MAXIMIZE)",
"Names SolverSettings, set_parameter('time_limit', ...), and problem.solve(settings)",
"Names problem.Status.name and the PascalCase status values (Optimal / PrimalFeasible / FeasibleFound)",

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Search for Status enum/class definition and all valid status values.

# Search for Status class or enum definition
ast-grep --pattern 'class Status'

# Search for status value references in examples and documentation
rg -n 'Status\.name|Status\.' skills/cuopt-numerical-optimization-api-python/ -g '*.{py,md}' -C2

# Search specifically for FeasibleFound
rg -n 'FeasibleFound|Feasible' skills/cuopt-numerical-optimization-api-python/ -g '*.{py,md}'

Repository: NVIDIA/cuopt

Length of output: 17947


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show evals.json around the referenced line
echo "---- evals.json (lines 1-40) ----"
nl -ba skills/cuopt-numerical-optimization-api-python/evals/evals.json | sed -n '1,60p'

# Locate likely model.py referenced by the evals and inspect around line 29
echo "---- candidate model.py files under evals/ ----"
find skills/cuopt-numerical-optimization-api-python -maxdepth 3 -type f -name 'model.py' -path '*evals*' -print

echo "---- show any model.py under evals/ (best-effort around line 20-45) ----"
for f in $(find skills/cuopt-numerical-optimization-api-python -maxdepth 3 -type f -name 'model.py' -path '*evals*' -print); do
  echo "==== $f ===="
  nl -ba "$f" | sed -n '15,60p'
done

# Also search for FeasibleFound usage in evals.json region if present
echo "---- FeasibleFound references in evals/ ----"
rg -n 'FeasibleFound' skills/cuopt-numerical-optimization-api-python/evals -S || true

Repository: NVIDIA/cuopt

Length of output: 151


🏁 Script executed:

#!/bin/bash
set -euo pipefail

EVALS_JSON="skills/cuopt-numerical-optimization-api-python/evals/evals.json"

echo "---- evals.json (first 60 lines) ----"
cat -n "$EVALS_JSON" | sed -n '1,60p'

echo "---- evals.json (around line 13) ----"
cat -n "$EVALS_JSON" | sed -n '8,18p'

echo "---- model.py files under evals/ (if any) ----"
find "skills/cuopt-numerical-optimization-api-python/evals" -type f -name "model.py" -print

echo "---- FeasibleFound references in evals/ ----"
rg -n "FeasibleFound" "skills/cuopt-numerical-optimization-api-python/evals" -S || true

echo "---- FeasibleFound references anywhere in skill examples/docs (context) ----"
rg -n "FeasibleFound" "skills/cuopt-numerical-optimization-api-python" -g "*.{py,md,json}" -S | head -n 50

Repository: NVIDIA/cuopt

Length of output: 4774


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "---- repo-wide occurrences of FeasibleFound ----"
rg -n "FeasibleFound" -S . | head -n 50

echo "---- repo-wide occurrences of PrimalFeasible ----"
rg -n "PrimalFeasible" -S . | head -n 50

echo "---- search for Status enum/class definitions ----"
rg -n "enum\s+.*Status|class\s+Status|Status\s*=\s*enum|Status.*(Optimal|PrimalFeasible|FeasibleFound)" -S . | head -n 50

echo "---- search for PascalCase status name mapping (common patterns) ----"
rg -n "\"Optimal\"|\"PrimalFeasible\"|\"FeasibleFound\"|\"PrimalInfeasible\"|\"DualInfeasible\"|\"IterationLimit\"|\"NoTermination\"|\"TimeLimit\"" -S . | head -n 50

Repository: NVIDIA/cuopt

Length of output: 25671


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect Python bindings for termination status enums
echo "---- solver_wrapper.pyx: termination status enum mappings ----"
sed -n '1,140p' python/cuopt/cuopt/linear_programming/solver/solver_wrapper.pyx

echo "---- solver.pxd: declarations for termination statuses ----"
sed -n '1,120p' python/cuopt/cuopt/linear_programming/solver/solver.pxd

# Inspect server-side mapping / status lists
echo "---- data_definition.py: status lists (around PrimalFeasible / FeasibleFound) ----"
sed -n '740,880p' python/cuopt_server/cuopt_server/utils/linear_programming/data_definition.py

echo "---- solver.py: where MILPTerminationStatus.FeasibleFound is used ----"
sed -n '240,320p' python/cuopt_server/cuopt_server/utils/linear_programming/solver.py

# Show any LPTerminationStatus / pdlp termination status values
echo "---- Search for LPTerminationStatus definition in python/cuopt_server ----"
rg -n "LPTerminationStatus|pdlp_termination_status_t|PDLP.*TerminationStatus" python/cuopt_server/cuopt_server/utils/linear_programming -S || true

echo "---- Search for LPTerminationStatus definition in python/cuopt ----"
rg -n "LPTerminationStatus|pdlp_termination_status_t|PDLP.*TerminationStatus" python/cuopt/cuopt -S || true

Repository: NVIDIA/cuopt

Length of output: 25054


Clarify FeasibleFound status naming for the LP eval.
FeasibleFound exists as a MILP/IP termination status (MILPTerminationStatus) but is not part of the LP/continuous termination statuses (LPTerminationStatus); LP uses Optimal / PrimalFeasible (plus other PDLP statuses). Remove FeasibleFound from the expected problem.Status.name PascalCase values in skills/cuopt-numerical-optimization-api-python/evals/evals.json (line 13), since it also conflicts with the eval’s LP ground_truth check for ['Optimal', 'PrimalFeasible'].
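
A rough sketch of the consistency check this finding implies. The two name sets contain only the values mentioned in this comment; the real `LPTerminationStatus` and `MILPTerminationStatus` enums have more members, and `unexpected_lp_statuses` is a hypothetical helper, not part of cuOpt.

```python
# Names taken from the review comment only (assumption: incomplete subsets).
LP_STATUS_NAMES = {"Optimal", "PrimalFeasible"}
MILP_ONLY_STATUS_NAMES = {"FeasibleFound"}


def unexpected_lp_statuses(expectation: str) -> list[str]:
    """Return MILP-only status names that appear in an LP eval expectation."""
    return sorted(name for name in MILP_ONLY_STATUS_NAMES if name in expectation)


# Expectation string as quoted from evals.json in this review.
current = (
    "Names problem.Status.name and the PascalCase status values "
    "(Optimal / PrimalFeasible / FeasibleFound)"
)
# Hypothetical wording after applying the suggested fix.
suggested = (
    "Names problem.Status.name and the PascalCase status values "
    "(Optimal / PrimalFeasible)"
)

flagged_current = unexpected_lp_statuses(current)
flagged_suggested = unexpected_lp_statuses(suggested)
print(flagged_current, flagged_suggested)
```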

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@skills/cuopt-numerical-optimization-api-python/evals/evals.json` at line 13,
The evals.json entry listing PascalCase problem.Status.name values incorrectly
includes "FeasibleFound"; remove "FeasibleFound" from the expected status list
so the LP/continuous check only expects "Optimal" and "PrimalFeasible" (i.e.,
align with LPTerminationStatus rather than MILPTerminationStatus) and ensure any
ground_truth checks that compare problem.Status.name use
['Optimal','PrimalFeasible'] only.

"question": "Help me optimize my routing.",
"expected_skill": "cuopt-user-rules",
"expected_script": null,
"ground_truth": "The prompt is incomplete on every dimension. Per the user-rules skill, the agent must ask before assuming. It asks: (a) Language / interface — Python, C, or REST server? (b) Problem type — TSP, VRP, or PDP? (c) Data — does the user have a cost / distance matrix, order locations, fleet definition, or should the agent generate a small sample dataset for demonstration? (d) Constraints — time windows, vehicle capacities, precedence, service times? (e) Problem size — number of locations, vehicles, orders? (f) Performance — time limit, solution-quality target? It does not produce code, does not silently choose Python+VRP and emit a starter script, and does not invent constraint values. If the user later says 'just create a sample dataset', the agent will state clearly what it synthesized (size, depot assumption, time windows used) before producing code.",

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check if CLI is listed as a cuOpt interface in user-rules or related documentation

# Search for interface lists in the user-rules skill documentation
rg -n -C3 -i 'interface.*python.*rest|python.*c.*rest|python.*c.*cli' skills/cuopt-user-rules/

# Check if CLI is mentioned alongside Python/C/REST in any cuopt skills
rg -n -C3 'CLI.*Python.*REST|Python.*REST.*CLI' skills/

Repository: NVIDIA/cuopt

Length of output: 2668


Include CLI in the cuopt-user-rules eval interface questions.
skills/cuopt-user-rules/evals/evals.json (ground_truth/expected_behavior) asks only “Python / C / REST”, but skills/cuopt-user-rules/SKILL.md lists Interface: [Python/REST/C/CLI]; update the eval expectations to include CLI (or explicitly justify why routing omits it).
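
The mismatch can be checked mechanically. The sketch below assumes the interface list is exactly the four names quoted from SKILL.md and uses a short excerpt of the ground_truth text; `missing_interfaces` is a hypothetical helper.

```python
import re

# From "Interface: [Python/REST/C/CLI]" as quoted from SKILL.md in this comment.
SKILL_INTERFACES = ["Python", "REST", "C", "CLI"]


def missing_interfaces(text: str, interfaces: list[str]) -> list[str]:
    """Return the interfaces that never appear as whole words in `text`."""
    return [
        name
        for name in interfaces
        if not re.search(rf"\b{re.escape(name)}\b", text)
    ]


# Excerpt of the interface question in the eval's ground_truth.
excerpt = "Python, C, or REST server?"

missing = missing_interfaces(excerpt, SKILL_INTERFACES)
print(missing)
```

Whole-word matching matters here: a plain substring test for "C" would also match the "C" in "CLI" and hide the gap.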

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@skills/cuopt-user-rules/evals/evals.json` at line 7, The eval's ground_truth
string in skills/cuopt-user-rules/evals/evals.json currently lists "Python / C /
REST" but SKILL.md's Interface includes "CLI"; update the ground_truth entry to
include "CLI" among the interface choices (e.g., "Python / C / REST / CLI") so
the eval questions match SKILL.md, and ensure any downstream expected_behavior
text that enumerates interfaces also reflects CLI (or, if omitting CLI
intentionally, add a brief justification string in the ground_truth referencing
SKILL.md's interface list).

@copy-pr-bot

copy-pr-bot Bot commented May 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@rgsl888prabhu
Collaborator Author

/ok to test cedb505

@rgsl888prabhu rgsl888prabhu merged commit d415bb6 into main May 27, 2026
23 checks passed

Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change

3 participants