skills: add evals/evals.json smoke suite (group A) #1308
Adds one happy-path Q&A entry per skill, parallel to the existing benchmark/ directories. Split from #1302 to keep CI runtime per PR under the job timeout.

This PR covers:
- cuopt-user-rules
- cuopt-routing-api-python
- cuopt-numerical-optimization-api-python
- cuopt-numerical-optimization-api-cli

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
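For readers unfamiliar with the entry shape, here is a minimal sketch of a structural check on one eval entry. The field names come from the cuopt-user-rules entry quoted later in this review; treating all four as required, and the helper name `missing_keys`, are assumptions for illustration, not the documented schema.

```python
import json

# Field names taken from the eval entry quoted later in this review.
# Treating all four as required is an assumption, not the documented schema.
REQUIRED_KEYS = ("question", "expected_skill", "expected_script", "ground_truth")


def missing_keys(entry: dict) -> list[str]:
    """Return the required keys that are absent from one eval entry."""
    return [key for key in REQUIRED_KEYS if key not in entry]


# Hypothetical entry modeled on the cuopt-user-rules case in this PR;
# the ground_truth text is shortened for the example.
sample = json.loads(
    """
    {
      "question": "Help me optimize my routing.",
      "expected_skill": "cuopt-user-rules",
      "expected_script": null,
      "ground_truth": "The agent must ask before assuming."
    }
    """
)

print(missing_keys(sample))
```

A check like this only covers structure. It says nothing about whether the ground_truth text is accurate, which is what the review comments below address.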
/nvskills-ci
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID:
📒 Files selected for processing (8)
✅ Files skipped from review due to trivial changes (1)
📝 Walkthrough
Adds four evaluation cases (CLI MPS, Python LP API, Python routing VRPTW, user-rules clarification), updates several skill-card frontmatter texts, and regenerates Sigstore DSSE signature bundles for the affected skills.
Changes
cuOpt Skill Evaluation Specifications and Metadata
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@skills/cuopt-numerical-optimization-api-python/evals/evals.json`:
- Line 13: The evals.json entry listing PascalCase problem.Status.name values
incorrectly includes "FeasibleFound"; remove "FeasibleFound" from the expected
status list so the LP/continuous check only expects "Optimal" and
"PrimalFeasible" (i.e., align with LPTerminationStatus rather than
MILPTerminationStatus) and ensure any ground_truth checks that compare
problem.Status.name use ['Optimal','PrimalFeasible'] only.
In `@skills/cuopt-user-rules/evals/evals.json`:
- Line 7: The eval's ground_truth string in
skills/cuopt-user-rules/evals/evals.json currently lists "Python / C / REST" but
SKILL.md's Interface includes "CLI"; update the ground_truth entry to include
"CLI" among the interface choices (e.g., "Python / C / REST / CLI") so the eval
questions match SKILL.md, and ensure any downstream expected_behavior text that
enumerates interfaces also reflects CLI (or, if omitting CLI intentionally, add
a brief justification string in the ground_truth referencing SKILL.md's
interface list).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5374cae1-b836-4352-98a8-d33b6c9db304
📒 Files selected for processing (4)
skills/cuopt-numerical-optimization-api-cli/evals/evals.json
skills/cuopt-numerical-optimization-api-python/evals/evals.json
skills/cuopt-routing-api-python/evals/evals.json
skills/cuopt-user-rules/evals/evals.json
"Lists the API calls in order without producing a full runnable script",
"Names Problem, addVariable (with vtype=CONTINUOUS), addConstraint, setObjective (sense=MAXIMIZE)",
"Names SolverSettings, set_parameter('time_limit', ...), and problem.solve(settings)",
"Names problem.Status.name and the PascalCase status values (Optimal / PrimalFeasible / FeasibleFound)",
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Search for Status enum/class definition and all valid status values.
# Search for Status class or enum definition
ast-grep --pattern 'class Status'
# Search for status value references in examples and documentation
rg -n 'Status\.name|Status\.' skills/cuopt-numerical-optimization-api-python/ -g '*.{py,md}' -C2
# Search specifically for FeasibleFound
rg -n 'FeasibleFound|Feasible' skills/cuopt-numerical-optimization-api-python/ -g '*.{py,md}'
Repository: NVIDIA/cuopt
Length of output: 17947
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show evals.json around the referenced line
echo "---- evals.json (lines 1-40) ----"
nl -ba skills/cuopt-numerical-optimization-api-python/evals/evals.json | sed -n '1,60p'
# Locate likely model.py referenced by the evals and inspect around line 29
echo "---- candidate model.py files under evals/ ----"
find skills/cuopt-numerical-optimization-api-python -maxdepth 3 -type f -name 'model.py' -path '*evals*' -print
echo "---- show any model.py under evals/ (best-effort around line 20-45) ----"
for f in $(find skills/cuopt-numerical-optimization-api-python -maxdepth 3 -type f -name 'model.py' -path '*evals*' -print); do
echo "==== $f ===="
nl -ba "$f" | sed -n '15,60p'
done
# Also search for FeasibleFound usage in evals.json region if present
echo "---- FeasibleFound references in evals/ ----"
rg -n 'FeasibleFound' skills/cuopt-numerical-optimization-api-python/evals -S || true
Repository: NVIDIA/cuopt
Length of output: 151
🏁 Script executed:
#!/bin/bash
set -euo pipefail
EVALS_JSON="skills/cuopt-numerical-optimization-api-python/evals/evals.json"
echo "---- evals.json (first 60 lines) ----"
cat -n "$EVALS_JSON" | sed -n '1,60p'
echo "---- evals.json (around line 13) ----"
cat -n "$EVALS_JSON" | sed -n '8,18p'
echo "---- model.py files under evals/ (if any) ----"
find "skills/cuopt-numerical-optimization-api-python/evals" -type f -name "model.py" -print
echo "---- FeasibleFound references in evals/ ----"
rg -n "FeasibleFound" "skills/cuopt-numerical-optimization-api-python/evals" -S || true
echo "---- FeasibleFound references anywhere in skill examples/docs (context) ----"
rg -n "FeasibleFound" "skills/cuopt-numerical-optimization-api-python" -g "*.{py,md,json}" -S | head -n 50
Repository: NVIDIA/cuopt
Length of output: 4774
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "---- repo-wide occurrences of FeasibleFound ----"
rg -n "FeasibleFound" -S . | head -n 50
echo "---- repo-wide occurrences of PrimalFeasible ----"
rg -n "PrimalFeasible" -S . | head -n 50
echo "---- search for Status enum/class definitions ----"
rg -n "enum\s+.*Status|class\s+Status|Status\s*=\s*enum|Status.*(Optimal|PrimalFeasible|FeasibleFound)" -S . | head -n 50
echo "---- search for PascalCase status name mapping (common patterns) ----"
rg -n "\"Optimal\"|\"PrimalFeasible\"|\"FeasibleFound\"|\"PrimalInfeasible\"|\"DualInfeasible\"|\"IterationLimit\"|\"NoTermination\"|\"TimeLimit\"" -S . | head -n 50
Repository: NVIDIA/cuopt
Length of output: 25671
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect Python bindings for termination status enums
echo "---- solver_wrapper.pyx: termination status enum mappings ----"
sed -n '1,140p' python/cuopt/cuopt/linear_programming/solver/solver_wrapper.pyx
echo "---- solver.pxd: declarations for termination statuses ----"
sed -n '1,120p' python/cuopt/cuopt/linear_programming/solver/solver.pxd
# Inspect server-side mapping / status lists
echo "---- data_definition.py: status lists (around PrimalFeasible / FeasibleFound) ----"
sed -n '740,880p' python/cuopt_server/cuopt_server/utils/linear_programming/data_definition.py
echo "---- solver.py: where MILPTerminationStatus.FeasibleFound is used ----"
sed -n '240,320p' python/cuopt_server/cuopt_server/utils/linear_programming/solver.py
# Show any LPTerminationStatus / pdlp termination status values
echo "---- Search for LPTerminationStatus definition in python/cuopt_server ----"
rg -n "LPTerminationStatus|pdlp_termination_status_t|PDLP.*TerminationStatus" python/cuopt_server/cuopt_server/utils/linear_programming -S || true
echo "---- Search for LPTerminationStatus definition in python/cuopt ----"
rg -n "LPTerminationStatus|pdlp_termination_status_t|PDLP.*TerminationStatus" python/cuopt/cuopt -S || true
Repository: NVIDIA/cuopt
Length of output: 25054
Clarify FeasibleFound status naming for the LP eval.
FeasibleFound exists as a MILP/IP termination status (MILPTerminationStatus) but is not part of the LP/continuous termination statuses (LPTerminationStatus); LP uses Optimal / PrimalFeasible (plus other PDLP statuses). Remove FeasibleFound from the expected problem.Status.name PascalCase values in skills/cuopt-numerical-optimization-api-python/evals/evals.json (line 13), since it also conflicts with the eval’s LP ground_truth check for ['Optimal', 'PrimalFeasible'].
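The mismatch described above can be caught mechanically. The following is a minimal sketch, assuming only the status names mentioned in this comment; the set is deliberately partial, and the helper name `milp_only_mentions` is hypothetical. It does not import cuOpt or read the real enums.

```python
# Status names as described in the review comment above. Only the MILP-only
# name that the comment flags is listed, so this set is intentionally partial.
MILP_ONLY_STATUSES = {"FeasibleFound"}


def milp_only_mentions(expectation: str) -> list[str]:
    """Return MILP-only status names that appear in an LP eval expectation."""
    return sorted(name for name in MILP_ONLY_STATUSES if name in expectation)


# The expectation string as it stands in the PR, and the suggested fix.
before = (
    "Names problem.Status.name and the PascalCase status values "
    "(Optimal / PrimalFeasible / FeasibleFound)"
)
after = (
    "Names problem.Status.name and the PascalCase status values "
    "(Optimal / PrimalFeasible)"
)

print(milp_only_mentions(before))  # ['FeasibleFound']
print(milp_only_mentions(after))   # []
```

A plain substring test is enough here because "PrimalFeasible" does not contain "FeasibleFound"; a fuller check would take the status lists from the solver bindings instead of hard-coding them.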
"question": "Help me optimize my routing.",
"expected_skill": "cuopt-user-rules",
"expected_script": null,
"ground_truth": "The prompt is incomplete on every dimension. Per the user-rules skill, the agent must ask before assuming. It asks: (a) Language / interface — Python, C, or REST server? (b) Problem type — TSP, VRP, or PDP? (c) Data — does the user have a cost / distance matrix, order locations, fleet definition, or should the agent generate a small sample dataset for demonstration? (d) Constraints — time windows, vehicle capacities, precedence, service times? (e) Problem size — number of locations, vehicles, orders? (f) Performance — time limit, solution-quality target? It does not produce code, does not silently choose Python+VRP and emit a starter script, and does not invent constraint values. If the user later says 'just create a sample dataset', the agent will state clearly what it synthesized (size, depot assumption, time windows used) before producing code.",
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Check if CLI is listed as a cuOpt interface in user-rules or related documentation
# Search for interface lists in the user-rules skill documentation
rg -n -C3 -i 'interface.*python.*rest|python.*c.*rest|python.*c.*cli' skills/cuopt-user-rules/
# Check if CLI is mentioned alongside Python/C/REST in any cuopt skills
rg -n -C3 'CLI.*Python.*REST|Python.*REST.*CLI' skills/
Repository: NVIDIA/cuopt
Length of output: 2668
Include CLI in the cuopt-user-rules eval interface questions.
skills/cuopt-user-rules/evals/evals.json (ground_truth/expected_behavior) asks only “Python / C / REST”, but skills/cuopt-user-rules/SKILL.md lists Interface: [Python/REST/C/CLI]; update the eval expectations to include CLI (or explicitly justify why routing omits it).
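A similar consistency check can compare the interfaces named in the eval text against the list in SKILL.md. This is a minimal sketch: the interface list is copied from the comment above, the eval text is shortened to the relevant question, and the helper name `missing_interfaces` is hypothetical.

```python
import re

# Interface names as the review comment reports them from SKILL.md.
SKILL_INTERFACES = ["Python", "REST", "C", "CLI"]


def missing_interfaces(eval_text: str, interfaces: list[str]) -> list[str]:
    """Return interfaces that the eval text never names as a whole word."""
    return [
        name
        for name in interfaces
        if not re.search(rf"\b{re.escape(name)}\b", eval_text)
    ]


# Shortened from the ground_truth string in the eval entry.
current = "Python, C, or REST server?"
# One possible fix; the exact wording is up to the PR author.
proposed = "Python, C, REST server, or CLI?"

print(missing_interfaces(current, SKILL_INTERFACES))   # ['CLI']
print(missing_interfaces(proposed, SKILL_INTERFACES))  # []
```

Whole-word matching matters because "C" would otherwise match inside "CLI" and hide the gap.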
/ok to test cedb505
Summary
- Adds evals/evals.json for 4 skills (group A).
- Parallel to the existing benchmark/ directory.