Instructions to use WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF", filename="Qwen3-0.6B-Qrazy-Qoder.i1-Q4_K_M.gguf", )
output = llm( "Once upon a time,", max_tokens=512, echo=True ) print(output)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M
Use Docker
docker model run hf.co/WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M
- LM Studio
- Jan
- vLLM
How to use WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M
- Ollama
How to use WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF with Ollama:
ollama run hf.co/WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M
- Unsloth Studio new
How to use WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF to start chatting
- Docker Model Runner
How to use WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF with Docker Model Runner:
docker model run hf.co/WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M
- Lemonade
How to use WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.Qwen3-Qrazy.Qoder-0.6B-GGUF-Q4_K_M
List all available models
lemonade list
Qwen3-0.6B-Qrazy-Qoder-i1-GGUF
Qwen3-0.6B-Qrazy-Qoder-i1-GGUF is a compact GGUF release from WithIn Us AI, designed for local inference and lightweight coding-oriented text generation.
This repository packages a 0.6B-parameter Qwen3-family model in GGUF format for efficient use with llama.cpp and compatible local inference runtimes.
Model Summary
This model is intended for:
- lightweight local coding assistance
- code drafting and code completion
- short prompt engineering workflows
- offline experimentation
- compact reasoning-style assistant tasks
- low-resource deployments
Because this is a 0.6B-class model, it is best used for small, fast, practical tasks rather than deep multi-step reasoning or large-scale production code generation.
Repository Contents
This repository currently includes the following GGUF files:
Qwen3-0.6B-Qrazy-Qoder.i1-Q4_K_M.ggufQwen3-0.6B-Qrazy-Qoder.i1-Q5_K_M.ggufQwen3-0.6B-Qrazy-Qoder.i1-Q6_K.gguf
Architecture
The repository metadata identifies the architecture as:
- qwen3
Quantization Variants
Q4_K_M
A smaller quantization for lower memory use and faster inference on limited hardware.
Q5_K_M
A balanced option for users who want a stronger quality-to-size tradeoff.
Q6_K
A heavier quantization with potentially better output quality when memory budget allows.
Intended Use
Recommended use cases include:
- local coding assistant experiments
- offline chatbot or helper tools
- code explanation and refactoring drafts
- compact prompt-response applications
- embedded or low-resource AI workflows
- rapid testing of small coding models
Suggested Use Cases
This model can be useful for:
- generating short utility functions
- explaining simple code snippets
- drafting boilerplate
- rewriting small functions for readability
- proposing debugging ideas
- producing structured text outputs for developer workflows
Out-of-Scope Use
This model should not be relied on for:
- legal advice
- medical advice
- financial advice
- safety-critical automation
- unsupervised production code generation
- security-sensitive engineering without human review
All generated code should be reviewed and tested before deployment.
Performance Expectations
As a compact 0.6B model, this release prioritizes:
- portability
- low memory use
- quick local inference
- simple coding workflows
It may struggle with:
- long-context tasks
- highly complex debugging
- strict factual accuracy
- advanced architectural planning
- deep multi-step reasoning
- large multi-file codebase understanding
Prompting Tips
For best results, use prompts that are:
- specific
- direct
- limited in scope
- explicit about the language
- clear about the desired output format
Example prompt styles
Code generation
Write a Python function that removes duplicate email addresses from a CSV file and saves the cleaned output.
Debugging
Explain why this JavaScript function throws
undefinedand provide a corrected version.
Refactoring
Refactor this Python function to improve readability and add error handling.
Runtime Notes
This model is distributed in GGUF format and is intended for use with runtimes that support GGUF, such as:
- llama.cpp
- compatible local desktop frontends
- supported lightweight inference backends
Choose your quantization based on your hardware:
- use Q4_K_M for smaller RAM usage
- use Q5_K_M for a quality / efficiency balance
- use Q6_K when you want a stronger output-quality tilt and can afford the extra memory
Limitations
Like other small language models, this model may:
- hallucinate APIs or library behavior
- generate incorrect or incomplete code
- lose instruction fidelity on longer prompts
- produce repetitive responses
- make reasoning mistakes
- require prompt iteration to get clean outputs
Human review is strongly recommended.
Creator
WithIn Us AI is the creator of this model release, including the packaging, naming, quantized GGUF distribution, and any fine-tuning / merging process associated with this release.
License
This model card uses:
license: other
You can replace this with your exact WithIn Us AI custom license terms.
If this release is derived from upstream models, merged checkpoints, or third-party datasets, include:
- attribution to the original base model creators
- attribution to any third-party datasets used
- a clear statement that WithIn Us AI claims authorship of the fine-tuning / merging / packaging process, not ownership of third-party source materials unless applicable
Acknowledgments
Thanks to:
- the original Qwen creators
- the GGUF and llama.cpp ecosystem
- Hugging Face hosting infrastructure
- the broader open-source AI community
Disclaimer
This model may produce inaccurate, biased, insecure, or incomplete outputs.
Use responsibly, and verify all important results before real-world use.
- Downloads last month
- 134
4-bit
5-bit
6-bit