Instructions to use WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF",
    filename="GPT5.1-high-reasoning-codex-0.4B.Q4_K_M.gguf",
)

output = llm(
    "Once upon a time,",
    max_tokens=512,
    echo=True,
)
print(output)
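For conversational use, llama-cpp-python also exposes a chat-completion interface. The sketch below is illustrative: the message contents are invented, the model download is deferred into a helper so nothing is fetched on import, and you should verify that this model's chat template behaves as expected.

```python
# Chat-style messages in the OpenAI format used by create_chat_completion().
messages = [
    {"role": "system", "content": "You are a concise coding assistant."},
    {"role": "user", "content": "Write a Python one-liner that squares the numbers 1..5."},
]

def chat(messages, max_tokens=256):
    """Load the quantized model and run one chat turn.

    Downloads the GGUF file from the Hub on first call, so the import
    and load are deferred into this function.
    """
    from llama_cpp import Llama

    llm = Llama.from_pretrained(
        repo_id="WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF",
        filename="GPT5.1-high-reasoning-codex-0.4B.Q4_K_M.gguf",
    )
    response = llm.create_chat_completion(messages=messages, max_tokens=max_tokens)
    return response["choices"][0]["message"]["content"]

# chat(messages)  # uncomment to run; triggers the Q4_K_M model download
```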
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF with llama.cpp:
Install from brew
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M

# Run inference directly in the terminal:
./llama-cli -hf WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M

# Run inference directly in the terminal:
./build/bin/llama-cli -hf WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M
Use Docker
docker model run hf.co/WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M
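Once llama-server is running (see the commands above), it exposes an OpenAI-compatible HTTP API. The standard-library sketch below builds a chat request against llama-server's default port 8080; the port, endpoint path, and the assumption that the `model` field can simply echo the repo name should be verified against your llama.cpp version.

```python
import json
import urllib.request

def build_chat_request(prompt, host="http://localhost:8080"):
    """Build an OpenAI-style chat request for a running llama-server."""
    body = json.dumps({
        # llama-server serves a single loaded model; this field is informational.
        "model": "GPT5.1-High.Reasoning.Codex-0.4B-GGUF",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Write a haiku about code review.")

# With llama-server running locally, uncomment to send the request:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```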
- LM Studio
- Jan
- vLLM
How to use WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'

Use Docker
docker model run hf.co/WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M
- Ollama
How to use WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF with Ollama:
ollama run hf.co/WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M
- Unsloth Studio
How to use WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF to start chatting
- Docker Model Runner
How to use WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF with Docker Model Runner:
docker model run hf.co/WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M
- Lemonade
How to use WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull WithinUsAI/GPT5.1-High.Reasoning.Codex-0.4B-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.GPT5.1-High.Reasoning.Codex-0.4B-GGUF-Q4_K_M
List all available models
lemonade list
GPT5.1-high-reasoning-codex-0.4B-GGUF
GPT5.1-high-reasoning-codex-0.4B-GGUF is a compact GGUF language model release from WithIn Us AI, intended for local inference and lightweight coding or reasoning-oriented experiments.
This repository provides quantized GGUF builds for efficient use with llama.cpp and compatible runtimes.
Model Summary
This model is designed for:
- lightweight local inference
- coding and prompt-based development assistance
- compact reasoning-style experiments
- offline chat and text generation workflows
- small-footprint deployments
Because this is a 0.4B-parameter-class model, it is best suited to fast iteration, simple coding tasks, prompt experiments, structured text generation, and lightweight assistant workflows, rather than heavy long-context reasoning or complex, production-grade coding autonomy.
Repository Contents
This repository currently includes the following files:
- GPT5.1-high-reasoning-codex-0.4B.Q4_K_M.gguf
- GPT5.1-high-reasoning-codex-0.4B.Q5_K_M.gguf
- GPT5.1-high-reasoning-codex-0.4B.f16.gguf
Quantization Variants
Q4_K_M
A smaller and more memory-efficient quantization for lower RAM usage and faster local inference.
Q5_K_M
A slightly larger quantization that may provide somewhat better output quality while remaining efficient.
F16
A higher-precision GGUF variant intended for users who want the least quantization loss and have more memory available.
Architecture
The repository metadata currently identifies the architecture as:
- gpt2
Intended Use
Recommended use cases include:
- local coding assistant experiments
- toy and lightweight software-help workflows
- code completion and code drafting
- debugging ideas and implementation suggestions
- instruction-following tests
- prompt engineering experiments
- low-resource local deployments
Out-of-Scope Use
This model should not be relied on for:
- legal advice
- medical advice
- financial advice
- safety-critical automation
- production code generation without review
- security-sensitive decisions without human verification
All generated code should be reviewed, tested, and validated before use.
Performance Expectations
As a compact 0.4B model, this release trades raw capability for speed, portability, and lower hardware requirements. It may perform well for:
- short code snippets
- compact prompts
- structured assistant replies
- lightweight reasoning-style tasks
It may struggle with:
- long and complex codebases
- deep multi-step reasoning
- strict factual reliability
- advanced tool orchestration
- heavy instruction retention over long prompts
Prompting Tips
For best results, use prompts that are:
- specific
- short to medium length
- explicit about the desired language or format
- clear about constraints
- direct about whether you want code, explanation, or both
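The tips above can be sketched as a tiny prompt-builder helper. This is purely illustrative; the function name and fields are not part of any API, just one way to keep prompts specific, constrained, and explicit about the desired output.

```python
def build_prompt(task, language=None, output="code", constraints=()):
    """Compose a short, explicit prompt following the tips above."""
    parts = [task.strip()]
    if language:
        parts.append(f"Use {language}.")
    if constraints:
        parts.append("Constraints: " + "; ".join(constraints) + ".")
    # Be direct about whether you want code, an explanation, or both.
    parts.append({
        "code": "Return only code.",
        "explanation": "Explain, but do not write code.",
        "both": "Return code followed by a brief explanation.",
    }[output])
    return " ".join(parts)

prompt = build_prompt(
    "Write a function that parses ISO-8601 dates",
    language="Python",
    output="both",
    constraints=["standard library only", "raise ValueError on bad input"],
)
```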
Example prompts
Code generation
Write a Python function that reads a JSON file, validates required fields, and returns a cleaned list of records.
Refactoring
Refactor this JavaScript function to be more readable and add basic error handling.
Debugging
Explain why this Python code raises a KeyError and show a corrected version.
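For reference, the code-generation prompt above might elicit something like the sketch below. The required-field list and cleaning behavior are invented for illustration, not prescribed by the model.

```python
import json
import os
import tempfile

REQUIRED_FIELDS = ("id", "name")  # hypothetical required fields

def load_clean_records(path):
    """Read a JSON file containing a list of dicts and keep valid records.

    A record is kept only if every required field is present and non-empty;
    extra keys are dropped so the output schema is uniform.
    """
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    cleaned = []
    for record in raw:
        if all(record.get(field) for field in REQUIRED_FIELDS):
            cleaned.append({field: record[field] for field in REQUIRED_FIELDS})
    return cleaned

# Small self-contained demonstration using a temporary file:
sample = [{"id": 1, "name": "a", "extra": True}, {"id": 2}, {"name": "b"}]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(sample, tmp)
    sample_path = tmp.name
records = load_clean_records(sample_path)
os.unlink(sample_path)
```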
Hardware and Runtime Notes
This model is packaged in GGUF format, which is suitable for llama.cpp-style local inference stacks and related frontends / runtimes that support GGUF models.
Typical choices:
- use Q4_K_M for smaller memory usage
- use Q5_K_M for a quality / size balance
- use F16 when memory allows and you want higher precision
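A rough way to compare the variants is bits-per-weight arithmetic. The bit widths below are approximations (K-quants are mixed-precision, so effective bits per weight vary), and real GGUF files add metadata overhead, so treat these as ballpark figures only.

```python
PARAMS = 0.4e9  # ~0.4B parameters

# Approximate effective bits per weight; rough averages, not exact:
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "F16": 16.0}

def approx_size_gb(quant):
    """Ballpark GGUF weight size in GB, ignoring metadata overhead."""
    return PARAMS * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"{quant}: ~{approx_size_gb(quant):.2f} GB")
```

By this estimate the Q4_K_M file is roughly a quarter of a gigabyte and F16 roughly 0.8 GB; runtime RAM usage is higher once the KV cache and compute buffers are added.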
Limitations
Like other small language models, this model may:
- hallucinate APIs, functions, or package behavior
- generate incorrect code
- produce insecure code patterns
- make reasoning mistakes
- lose instruction fidelity on longer prompts
- require prompt retries for acceptable output quality
Human oversight is strongly recommended.
Training / Lineage
This repository is presented as a WithIn Us AI model release and GGUF packaging distribution.
If you want, this section can be expanded later with:
- base model lineage
- fine-tuning details
- merge methodology
- dataset attribution
- training objective
- chat template recommendations
License
This model card draft currently uses a non-standard value in its license field:
license: other
You can replace this section with your exact WithIn Us AI custom license terms. If this model is derived from upstream weights or datasets, include:
- attribution to the original base model creators
- attribution to any third-party datasets used
- clear statement that WithIn Us AI claims authorship of the fine-tuning / merging / packaging process, not ownership of third-party source materials unless applicable
Acknowledgments
Thanks to:
- the open-source local inference ecosystem
- GGUF and llama.cpp tooling contributors
- the broader Hugging Face community
- all upstream creators whose work may have contributed to the model’s lineage
Disclaimer
This model may produce inaccurate, biased, insecure, or incomplete outputs.
Use responsibly, and verify important results before real-world use.