Instructions to use WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF",
    filename="GPT5.1-high-reasoning-codex-0.4B.Q4_K_M.gguf",
)

output = llm(
    "Once upon a time,",
    max_tokens=512,
    echo=True,
)
print(output)
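For conversational use, llama-cpp-python also exposes a chat-completion interface. The sketch below is illustrative: the message contents are invented, the model download is deferred into a helper so nothing is fetched on import, and you should verify that this model's chat template behaves as expected.

```python
# Chat-style messages in the OpenAI format used by create_chat_completion().
messages = [
    {"role": "system", "content": "You are a concise coding assistant."},
    {"role": "user", "content": "Write a Python one-liner that squares the numbers 1..5."},
]

def chat(messages, max_tokens=256):
    """Load the quantized model and run one chat turn.

    Downloads the GGUF file from the Hub on first call, so the import
    and load are deferred into this function.
    """
    from llama_cpp import Llama

    llm = Llama.from_pretrained(
        repo_id="WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF",
        filename="GPT5.1-high-reasoning-codex-0.4B.Q4_K_M.gguf",
    )
    response = llm.create_chat_completion(messages=messages, max_tokens=max_tokens)
    return response["choices"][0]["message"]["content"]

# chat(messages)  # uncomment to run; triggers the Q4_K_M model download
```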
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF with llama.cpp:
Install from brew
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M

# Run inference directly in the terminal:
./llama-cli -hf WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M

# Run inference directly in the terminal:
./build/bin/llama-cli -hf WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M
Use Docker
docker model run hf.co/WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M
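Once llama-server is running (see the commands above), it exposes an OpenAI-compatible HTTP API. The standard-library sketch below builds a chat request against llama-server's default port 8080; the port, endpoint path, and the assumption that the `model` field can simply echo the repo name should be verified against your llama.cpp version.

```python
import json
import urllib.request

def build_chat_request(prompt, host="http://localhost:8080"):
    """Build an OpenAI-style chat request for a running llama-server."""
    body = json.dumps({
        # llama-server serves a single loaded model; this field is informational.
        "model": "GPT5.1-High.Reasoning.Codex-0.4B-GGUF",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Write a haiku about code review.")

# With llama-server running locally, uncomment to send the request:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```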
- LM Studio
- Jan
- vLLM
How to use WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'

Use Docker
docker model run hf.co/WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M
- Ollama
How to use WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF with Ollama:
ollama run hf.co/WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M
- Unsloth Studio
How to use WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF to start chatting
- Docker Model Runner
How to use WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF with Docker Model Runner:
docker model run hf.co/WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M
- Lemonade
How to use WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.GPT5.1-High.Reasoning.Codex-0.4B-GGUF-Q4_K_M
List all available models
lemonade list
GPT5.1-high-reasoning-codex-0.4B-GGUF
GPT5.1-high-reasoning-codex-0.4B-GGUF is a compact GGUF language model release from WithIn Us AI, intended for local inference and lightweight coding or reasoning-oriented experiments.
This repository provides quantized GGUF builds for efficient use with llama.cpp and compatible runtimes.
Model Summary
This model is designed for:
- lightweight local inference
- coding and prompt-based development assistance
- compact reasoning-style experiments
- offline chat and text generation workflows
- small-footprint deployments
Because this is a 0.4B-parameter-class model, it is best suited to fast iteration, simple coding tasks, prompt experiments, structured text generation, and lightweight assistant workflows, rather than heavy long-context reasoning or complex, production-grade coding autonomy.
Repository Contents
This repository currently includes the following files:
- GPT5.1-high-reasoning-codex-0.4B.Q4_K_M.gguf
- GPT5.1-high-reasoning-codex-0.4B.Q5_K_M.gguf
- GPT5.1-high-reasoning-codex-0.4B.f16.gguf
Quantization Variants
Q4_K_M
A smaller and more memory-efficient quantization for lower RAM usage and faster local inference.
Q5_K_M
A slightly larger quantization that may provide somewhat better output quality while remaining efficient.
F16
A higher-precision GGUF variant intended for users who want the least quantization loss and have more memory available.
Architecture
The repository metadata currently identifies the architecture as:
- gpt2
Intended Use
Recommended use cases include:
- local coding assistant experiments
- toy and lightweight software-help workflows
- code completion and code drafting
- debugging ideas and implementation suggestions
- instruction-following tests
- prompt engineering experiments
- low-resource local deployments
Out-of-Scope Use
This model should not be relied on for:
- legal advice
- medical advice
- financial advice
- safety-critical automation
- production code generation without review
- security-sensitive decisions without human verification
All generated code should be reviewed, tested, and validated before use.
Performance Expectations
As a compact 0.4B model, this release trades raw capability for speed, portability, and lower hardware requirements. It may perform well for:
- short code snippets
- compact prompts
- structured assistant replies
- lightweight reasoning-style tasks
It may struggle with:
- long and complex codebases
- deep multi-step reasoning
- strict factual reliability
- advanced tool orchestration
- heavy instruction retention over long prompts
Prompting Tips
For best results, use prompts that are:
- specific
- short to medium length
- explicit about the desired language or format
- clear about constraints
- direct about whether you want code, explanation, or both
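The tips above can be sketched as a tiny prompt-builder helper. This is purely illustrative; the function name and fields are not part of any API, just one way to keep prompts specific, constrained, and explicit about the desired output.

```python
def build_prompt(task, language=None, output="code", constraints=()):
    """Compose a short, explicit prompt following the tips above."""
    parts = [task.strip()]
    if language:
        parts.append(f"Use {language}.")
    if constraints:
        parts.append("Constraints: " + "; ".join(constraints) + ".")
    # Be direct about whether you want code, an explanation, or both.
    parts.append({
        "code": "Return only code.",
        "explanation": "Explain, but do not write code.",
        "both": "Return code followed by a brief explanation.",
    }[output])
    return " ".join(parts)

prompt = build_prompt(
    "Write a function that parses ISO-8601 dates",
    language="Python",
    output="both",
    constraints=["standard library only", "raise ValueError on bad input"],
)
```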
Example prompts
Code generation
Write a Python function that reads a JSON file, validates required fields, and returns a cleaned list of records.
Refactoring
Refactor this JavaScript function to be more readable and add basic error handling.
Debugging
Explain why this Python code raises a KeyError and show a corrected version.
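For reference, the code-generation prompt above might elicit something like the sketch below. The required-field list and cleaning behavior are invented for illustration, not prescribed by the model.

```python
import json
import os
import tempfile

REQUIRED_FIELDS = ("id", "name")  # hypothetical required fields

def load_clean_records(path):
    """Read a JSON file containing a list of dicts and keep valid records.

    A record is kept only if every required field is present and non-empty;
    extra keys are dropped so the output schema is uniform.
    """
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    cleaned = []
    for record in raw:
        if all(record.get(field) for field in REQUIRED_FIELDS):
            cleaned.append({field: record[field] for field in REQUIRED_FIELDS})
    return cleaned

# Small self-contained demonstration using a temporary file:
sample = [{"id": 1, "name": "a", "extra": True}, {"id": 2}, {"name": "b"}]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(sample, tmp)
    sample_path = tmp.name
records = load_clean_records(sample_path)
os.unlink(sample_path)
```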
Hardware and Runtime Notes
This model is packaged in GGUF format, which is suitable for llama.cpp-style local inference stacks and related frontends / runtimes that support GGUF models.
Typical choices:
- use Q4_K_M for smaller memory usage
- use Q5_K_M for a quality / size balance
- use F16 when memory allows and you want higher precision
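A rough way to compare the variants is bits-per-weight arithmetic. The bit widths below are approximations (K-quants are mixed-precision, so effective bits per weight vary), and real GGUF files add metadata overhead, so treat these as ballpark figures only.

```python
PARAMS = 0.4e9  # ~0.4B parameters

# Approximate effective bits per weight; rough averages, not exact:
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "F16": 16.0}

def approx_size_gb(quant):
    """Ballpark GGUF weight size in GB, ignoring metadata overhead."""
    return PARAMS * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"{quant}: ~{approx_size_gb(quant):.2f} GB")
```

By this estimate the Q4_K_M file is roughly a quarter of a gigabyte and F16 roughly 0.8 GB; runtime RAM usage is higher once the KV cache and compute buffers are added.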
Limitations
Like other small language models, this model may:
- hallucinate APIs, functions, or package behavior
- generate incorrect code
- produce insecure code patterns
- make reasoning mistakes
- lose instruction fidelity on longer prompts
- require prompt retries for acceptable output quality
Human oversight is strongly recommended.
Training / Lineage
This repository is presented as a WithIn Us AI model release and GGUF packaging distribution.
If you want, this section can be expanded later with:
- base model lineage
- fine-tuning details
- merge methodology
- dataset attribution
- training objective
- chat template recommendations
License
This model card draft currently uses a non-standard value in its license field:
license: other
You can replace this section with your exact WithIn Us AI custom license terms. If this model is derived from upstream weights or datasets, include:
- attribution to the original base model creators
- attribution to any third-party datasets used
- clear statement that WithIn Us AI claims authorship of the fine-tuning / merging / packaging process, not ownership of third-party source materials unless applicable
Acknowledgments
Thanks to:
- the open-source local inference ecosystem
- GGUF and llama.cpp tooling contributors
- the broader Hugging Face community
- all upstream creators whose work may have contributed to the model’s lineage
Disclaimer
This model may produce inaccurate, biased, insecure, or incomplete outputs.
Use responsibly, and verify important results before real-world use.