Spaces:
Running
ML Debug Env β Teaching AI to Debug PyTorch Like an Engineer
Meta Γ PyTorch Γ Scaler OpenEnv Hackathon β April 2026
π₯ Watch the Demo Video | π» GitHub | π€ HF Space | π Training Notebook
The Problem
Every ML engineer knows this moment. Your training job fails. You open the terminal and see:
Training job crashed. No epochs completed. Exit code 1.
No code. No traceback. Just that one line.
Current AI debugging benchmarks hand the agent the full broken script and say "fix this." That is not how debugging works in the real world. Real engineers have to investigate β run the code, read the output, check the gradients, form a hypothesis, and only then commit to a fix.
We asked: what if we trained an AI to debug the same way?
What We Built
ML Debug Env is a reinforcement learning environment where an AI agent learns to debug broken PyTorch training scripts β using diagnostic tools, not oracles.
The agent starts every episode completely blind. It receives one line. It has five tools and five steps to figure out what went wrong and fix it.
On reset, the agent sees only this:
{
"alert": "Training job failed. Final loss: nan.",
"available_tools": ["run_code", "get_traceback", "inspect_gradients", "print_shapes", "view_source"],
"step_budget": 5
}
No source code. No traceback. No hints.
This is called partial observability β the agent cannot see the full picture, just like a real engineer getting paged at 3am with no context. It has to decide: do I run the code first? Check the gradients? Every tool call costs a step. Fix it in two steps and earn a 1.2Γ efficiency bonus.
The Five Tools
The agent investigates using five diagnostic tools:
- run_code β runs the buggy script and returns the output or crash
- get_traceback β returns the full Python error traceback
- inspect_gradients β injects gradient logging and returns per-layer gradient norms after one batch
- print_shapes β injects forward hooks and returns tensor shapes at each layer
- view_source β reveals the full buggy source code
The clever part: inspect_gradients and print_shapes inject diagnostic code into the script before running it β deep introspection without revealing the source. The agent decides which tools to call and in what order. That strategy is what gets learned.
The 8 Tasks
Eight broken PyTorch scripts, easy to expert:
| Task | Difficulty | What's broken |
|---|---|---|
shape_mismatch |
π’ Easy | Wrong nn.Linear dimensions β explicit crash |
training_collapse |
π‘ Medium | Bad learning rate β NaN loss |
wrong_device |
π‘ Medium | Model on GPU, data on CPU β explicit crash |
gradient_not_zeroed |
π Medium-Hard | Missing zero_grad() β loss explodes silently |
data_leakage |
π΄ Hard | Normalized before split β inflated metrics, no crash |
missing_eval_mode |
π΄ Hard | No model.eval() β non-deterministic metrics |
compound_shape_device |
π Medium-Hard | Two bugs: shape + device |
compound_leakage_eval |
π£ Expert | Two bugs: data leakage + missing eval mode |
The compound tasks are the hardest β two completely silent bugs in one script. Fix one and miss the other: 0.60. Fix both: 0.99.
How Scoring Works
Scoring is a staircase, not binary:
0.01 β Wrong bug type
0.20 β Right type, fixed code crashes
0.40 β Code runs, training incomplete
0.60 β Training completes, root cause not fixed
0.80 β Root cause fixed, success signal missing
0.99 β Perfect fix β
The grader actually runs the fixed code in a subprocess. No pattern matching. No string similarity. The code has to work.
The Training Story
Run 1 β The Exploit
We trained Qwen2.5-1.5B with GRPO on T4 for 200 steps. Reward went down.
We investigated. Our grader had a bug β it was giving partial credit to fundamentally wrong fixes. The agent found the exploit before we did. It learned to game the reward function instead of actually debugging.
The agent was right. Our environment was wrong.
Run 1: reward trending down as agent exploits broken grader
Run 2 β The Breakout
We fixed the grader. The agent had no shortcut anymore β it had to actually learn to debug.
Result: 0.024 β 0.190. 690% improvement in 200 steps on a free T4 GPU.
Run 2: 0.024 β 0.190, +690% improvement after grader fix
Run 3 β The Self-Improvement Loop
We fixed the training loop β added short-output filtering so the model couldn't game reward by outputting garbage, and implemented proper GRPO over all completions instead of just the best one.
The baseline reward at step zero lifted from 2.4% to 15.2%. The floor raised β proof the grader fixes held across runs.
Run 3: baseline lifted from 2.4% to 15.2% β training loop fixed
Every run taught us something. The environment improved itself because of what training revealed.
Before vs After Training
| Average Reward | |
|---|---|
| Untrained baseline (partial obs, blind start) | 0.024 |
| After GRPO training (200 steps, T4) | 0.190 |
| Improvement | +690% |
What the Agent Learned
These behaviors emerged from reward signal alone β never explicitly programmed:
- Investigate before fixing β learned to call
run_codeorinspect_gradientsbefore attempting a fix - Tool selection by bug type β gradient issues β
inspect_gradientsfirst. Crashes βget_tracebackfirst - Avoid
view_sourceβ learned that the traceback alone is usually enough and reading full source wastes a step - Efficiency matters β learned to fix in fewer steps to maximize the 1.2Γ efficiency bonus
The Engine Under the Hood
Adversarial Scheduler β tracks which bug types the agent struggles with and serves those 70% of the time with novel random seeds. The environment gets harder as the agent improves.
LLM Judge β after every fix, a Groq LLM scores the agent's diagnosis quality β root cause correctness and mechanistic explanation β adding up to 0.15 reasoning reward.
Subprocess Grader β fixed code is written to a temp file and executed with a 40-second timeout. AST checks verify structure. Output parsed for success signals. No shortcuts.
Try It
import requests
session = requests.Session()
BASE = "https://rak2315-ml-debug-env.hf.space"
obs = session.post(f"{BASE}/reset", json={"task_id": "shape_mismatch"}).json()["observation"]
print(obs["alert"]) # "Training job crashed immediately..."
result = session.post(f"{BASE}/step", json={"action": {
"action_type": "inspect", "tool_name": "run_code"
}}).json()
result = session.post(f"{BASE}/step", json={"action": {
"action_type": "fix",
"bug_type": "shape_mismatch",
"diagnosis": "nn.Linear input dim wrong",
"fixed_code": "... complete fixed script ..."
}}).json()
print(result["observation"]["grader_score"]) # 0.99
Or visit rak2315-ml-debug-env.hf.space/ui to try it interactively.
Links
- π€ HF Space: rak2315-ml-debug-env.hf.space
- π» GitHub: github.com/RAK2315/ml-debug-env
- π Training Notebook: ml_debug_env_grpo_fixed.ipynb
- π₯ YouTube: youtu.be/TjEavKODTQQ
Meta Γ PyTorch Γ Scaler OpenEnv Hackathon β April 2026