Instructions to use UCSB-SURFI/TermiGen-32B with libraries, inference providers, notebooks, and local apps. Follow the sections below to get started.
- Libraries
- Transformers
How to use UCSB-SURFI/TermiGen-32B with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="UCSB-SURFI/TermiGen-32B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("UCSB-SURFI/TermiGen-32B")
model = AutoModelForCausalLM.from_pretrained("UCSB-SURFI/TermiGen-32B")

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use UCSB-SURFI/TermiGen-32B with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "UCSB-SURFI/TermiGen-32B"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "UCSB-SURFI/TermiGen-32B",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker:

```shell
docker model run hf.co/UCSB-SURFI/TermiGen-32B
```
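The curl call above can also be made from Python with only the standard library. A minimal sketch, assuming a vLLM server is already running on localhost:8000 (`build_chat_request` and `call_server` are illustrative helper names, not part of vLLM):

```python
import json
from urllib import request

def build_chat_request(model: str, user_message: str) -> dict:
    # Mirrors the curl payload: a single-turn chat completion request.
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

def call_server(payload: dict,
                url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    # POST the JSON payload to the OpenAI-compatible endpoint.
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    payload = build_chat_request("UCSB-SURFI/TermiGen-32B",
                                 "What is the capital of France?")
    # Requires the running server from the step above:
    # print(call_server(payload)["choices"][0]["message"]["content"])
```

The same request works against the SGLang server below after changing the port to 30000.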
- SGLang
How to use UCSB-SURFI/TermiGen-32B with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "UCSB-SURFI/TermiGen-32B" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "UCSB-SURFI/TermiGen-32B",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker images:

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "UCSB-SURFI/TermiGen-32B" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "UCSB-SURFI/TermiGen-32B",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

- Docker Model Runner
How to use UCSB-SURFI/TermiGen-32B with Docker Model Runner:
```shell
docker model run hf.co/UCSB-SURFI/TermiGen-32B
```
TermiGen-32B
TermiGen-32B achieves 31.3% pass@1 on TerminalBench 1.0, establishing a new open-weight state-of-the-art and surpassing proprietary models like o4-mini with Codex CLI (20.0%).
📄 Paper: TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents
💻 Environments: https://github.com/ucsb-mlsec/terminal-bench-env
🧪 Benchmark: https://github.com/laude-institute/terminal-bench
Model Description
This model is fine-tuned from Qwen2.5-Coder-32B-Instruct using the TermiGen pipeline, which synthesizes high-fidelity training data through two phases:
Phase I: Environment Synthesis
- Multi-agent system generates 3,500+ verified Docker environments
- Tasks span 11 categories: system administration, security forensics, scientific computing, MLOps, etc.
- 420 unique command-line tools across 16 functional domains
- Automated unit test validation ensures task solvability
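The validation step in the last bullet can be pictured as a run-the-solution-then-run-the-tests gate: a synthesized task is kept only if its unit tests pass after the reference solution executes. A hypothetical sketch (`is_solvable` is an invented name; the actual checker lives in the linked terminal-bench-env repo and runs inside each task's Docker image, not on the host):

```python
import subprocess
import sys

def is_solvable(solution_cmd: list, test_cmd: list, timeout: int = 60) -> bool:
    """Run the reference solution, then the task's unit tests.

    The task counts as solvable only if the solution exits cleanly
    and the tests then pass. Illustrative only.
    """
    solved = subprocess.run(solution_cmd, timeout=timeout)
    if solved.returncode != 0:
        return False
    tests = subprocess.run(test_cmd, timeout=timeout)
    return tests.returncode == 0

if __name__ == "__main__":
    # Trivial stand-ins for a real solution script and test suite:
    print(is_solvable([sys.executable, "-c", "pass"],
                      [sys.executable, "-c", "pass"]))
```

Tasks whose tests fail even on the reference solution are discarded rather than shipped as unsolvable environments.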
Phase II: Error-Correction Trajectory Collection
- Generator-Critic framework with 20% error injection rate
- Teaches error → diagnosis → recovery cycles
- 3,291 trajectories (avg. 25.5 turns, 8,722 tokens each)
- Teacher model: Claude-4.5-Sonnet
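The 20% error-injection rate above can be sketched as corrupting roughly one in five commands so the teacher must demonstrate recovery instead of an always-clean transcript. A hypothetical sketch (the paper's Generator-Critic framework decides where and how to inject errors; the append-a-bad-flag corruption and `inject_errors` name here are invented stand-ins):

```python
import random

def inject_errors(commands: list, rate: float = 0.2, seed: int = 0) -> list:
    """Corrupt a fraction of commands to create error turns.

    Returns (command, was_corrupted) pairs; corrupted commands fail,
    forcing an error -> diagnosis -> recovery cycle in the trajectory.
    """
    rng = random.Random(seed)
    out = []
    for cmd in commands:
        if rng.random() < rate:
            out.append((cmd + " --no-such-flag", True))  # guaranteed-to-fail variant
        else:
            out.append((cmd, False))
    return out
```

With `rate=0.2`, about one turn in five becomes an error turn that the teacher must diagnose and repair.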
Training Details
Training Hyperparameters:
- Base Model: Qwen2.5-Coder-32B-Instruct
- Learning Rate: 5e-6 (cosine schedule, 10% warmup)
- Batch Size: 32 (8 GPUs × 4 gradient accumulation)
- Sequence Length: 20,000 tokens
- Epochs: 5
- Precision: BF16 with DeepSpeed ZeRO-3
- Hardware: 8× AMD MI325X GPUs
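The hyperparameters above can be restated as a single config container; note that a per-device batch size of 1 is implied by 8 GPUs × 4 accumulation steps = 32. A minimal sketch (`TrainConfig` is an illustrative container, not the authors' training code):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    # Values copied from the hyperparameter list above.
    base_model: str = "Qwen/Qwen2.5-Coder-32B-Instruct"
    learning_rate: float = 5e-6
    lr_schedule: str = "cosine"
    warmup_ratio: float = 0.10
    num_gpus: int = 8
    per_device_batch_size: int = 1  # implied by 8 x 4 = 32
    gradient_accumulation: int = 4
    max_seq_len: int = 20_000
    epochs: int = 5
    precision: str = "bf16"  # with DeepSpeed ZeRO-3

    @property
    def effective_batch_size(self) -> int:
        return self.num_gpus * self.per_device_batch_size * self.gradient_accumulation
```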
Dataset Statistics:
- 3,500+ verified environments across 11 task categories
- 3,291 training trajectories
- Tool diversity: 420 unique CLI tools
- Average trajectory: 25.5 turns, 8,722 tokens
Evaluation Results
TerminalBench Performance
| Benchmark | Pass@1 |
|---|---|
| TerminalBench 1.0 | 31.3% |
| TerminalBench 2.0 | 18.0% |
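Pass@1 in the table is the standard pass@k metric at k = 1: the probability that a randomly chosen attempt solves the task, which for k = 1 reduces to the fraction of successful attempts. A minimal sketch of the usual unbiased estimator (Chen et al., 2021):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts (c correct), passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 this is simply c/n, so 31.3% pass@1 means the agent solves a TerminalBench 1.0 task on its first attempt about 31% of the time.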
Usage
We implemented a minimal BashAgent framework based on TerminalBench for agentic terminal execution. The agent interacts with Docker containers through a bash shell, generating a ReAct-style response at each turn.
For detailed usage and integration examples, please refer to our GitHub repository.
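One turn of such an agent loop can be sketched in a few lines. A hypothetical sketch, assuming the model emits its action inside a bash-fenced code block (the real BashAgent and its response format live in the GitHub repository; `extract_command` and `run_step` are invented names, and the real agent executes inside the task's Docker container, not on the host):

```python
import re
import subprocess
from typing import Optional

COMMAND_RE = re.compile(r"```bash\n(.*?)\n```", re.DOTALL)

def extract_command(response: str) -> Optional[str]:
    """Pull the bash action out of a ReAct-style model response."""
    m = COMMAND_RE.search(response)
    return m.group(1).strip() if m else None

def run_step(response: str) -> str:
    """One agent turn: extract the action, execute it, return the
    observation (stdout + stderr) to feed back to the model."""
    cmd = extract_command(response)
    if cmd is None:
        return "(no command found)"
    result = subprocess.run(cmd, shell=True, capture_output=True,
                            text=True, timeout=60)
    return result.stdout + result.stderr
```

For example, a response like `Thought: greet.` followed by a bash block containing `echo hello` would return the observation `hello`, which becomes the next turn's input.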
Citation
```bibtex
@article{zhu2026termigen,
  title={TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents},
  author={Zhu, Kaijie and Nie, Yuzhou and Li, Yijiang and Huang, Yiming and Wu, Jialian and Liu, Jiang and Sun, Ximeng and Yin, Zhenfei and Wang, Lun and Liu, Zicheng and Barsoum, Emad and Wang, William Yang and Guo, Wenbo},
  journal={arXiv preprint arXiv:2602.07274},
  url={https://arxiv.org/abs/2602.07274},
  year={2026}
}
```
License
Apache 2.0 (inherited from Qwen2.5-Coder base model)
Contact: Kaijie Zhu (kaijiezhu@ucsb.edu)