Instructions for using WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF with libraries, notebooks, and local apps.

Libraries

How to use WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

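# Download the Q4_K_M file from the Hugging Face Hub (cached after the first run) and load it.
llm = Llama.from_pretrained(
	repo_id="WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF",
	filename="Qwen3-0.6B-Qrazy-Qoder.i1-Q4_K_M.gguf",
)

# Generate up to 512 tokens; echo=True includes the prompt in the returned text.
output = llm(
	"Once upon a time,",
	max_tokens=512,
	echo=True
)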
print(output)
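
The call above returns an OpenAI-style completion dictionary rather than a plain string, so printing `output` shows the whole structure. A minimal sketch of pulling out just the generated text follows. The helper name `completion_text` is hypothetical, and `sample_output` is a hand-written stand-in for a real response, not actual model output.

```python
def completion_text(output: dict) -> str:
    """Return the text of the first choice in an OpenAI-style completion dict."""
    choices = output.get("choices") or []
    if not choices:
        raise ValueError("completion contains no choices")
    return choices[0]["text"]


# Hand-written stand-in for what llm(...) returns; not real model output.
sample_output = {
    "choices": [{"text": "Once upon a time, there was a parser.", "index": 0}],
}
print(completion_text(sample_output))
```

With `echo=True`, as in the snippet above, the returned text begins with the prompt itself; pass `echo=False` to get only the continuation.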

Local Apps

llama.cpp

How to use WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M

Use Docker

docker model run hf.co/WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M

vLLM

How to use WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
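
The same request can be issued from Python using only the standard library. The sketch below mirrors the curl call above. The helper names `build_completion_request` and `send` are hypothetical, and the default URL assumes the server is listening on localhost:8000, as in the curl example.

```python
import json
import urllib.request

MODEL_ID = "WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF"


def build_completion_request(
    prompt: str,
    base_url: str = "http://localhost:8000",
    max_tokens: int = 512,
    temperature: float = 0.5,
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_completion_request("Once upon a time,")
print(request.full_url)
# To query a running server: print(send(request)["choices"][0]["text"])
```

Building the request separately from sending it makes it possible to inspect the payload without a server running.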

Use Docker

docker model run hf.co/WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M

Ollama

How to use WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF with Ollama:

```
ollama run hf.co/WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M
```

Unsloth Studio

How to use WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

No setup is required. Open https://huggingface.co/spaces/unsloth/studio in your browser and search for WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF to start chatting.

Docker Model Runner

How to use WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF with Docker Model Runner:

```
docker model run hf.co/WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M
```

Lemonade

How to use WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Qwen3-Qrazy.Qoder-0.6B-GGUF-Q4_K_M

List all available models

lemonade list

Qwen3-0.6B-Qrazy-Qoder-i1-GGUF

Qwen3-0.6B-Qrazy-Qoder-i1-GGUF is a compact GGUF release from WithIn Us AI, designed for local inference and lightweight coding-oriented text generation.

This repository packages a 0.6B-parameter Qwen3-family model in GGUF format for efficient use with llama.cpp and compatible local inference runtimes.

Model Summary

This model is intended for:

lightweight local coding assistance
code drafting and code completion
short prompt engineering workflows
offline experimentation
compact reasoning-style assistant tasks
low-resource deployments

Because this is a 0.6B-class model, it is best used for small, fast, practical tasks rather than deep multi-step reasoning or large-scale production code generation.

Repository Contents

This repository currently includes the following GGUF files:

Qwen3-0.6B-Qrazy-Qoder.i1-Q4_K_M.gguf
Qwen3-0.6B-Qrazy-Qoder.i1-Q5_K_M.gguf
Qwen3-0.6B-Qrazy-Qoder.i1-Q6_K.gguf

Architecture

The repository metadata identifies the architecture as:

qwen3

Quantization Variants

Q4_K_M

A smaller quantization for lower memory use and faster inference on limited hardware.

Q5_K_M

A middle option that balances file size against output quality.

Q6_K

A larger, higher-precision quantization with potentially better output quality when the memory budget allows.

Intended Use

Recommended use cases include:

local coding assistant experiments
offline chatbot or helper tools
code explanation and refactoring drafts
compact prompt-response applications
embedded or low-resource AI workflows
rapid testing of small coding models

Suggested Use Cases

This model can be useful for:

generating short utility functions
explaining simple code snippets
drafting boilerplate
rewriting small functions for readability
proposing debugging ideas
producing structured text outputs for developer workflows

Out-of-Scope Use

This model should not be relied on for:

legal advice
medical advice
financial advice
safety-critical automation
unsupervised production code generation
security-sensitive engineering without human review

All generated code should be reviewed and tested before deployment.

Performance Expectations

As a compact 0.6B model, this release prioritizes:

portability
low memory use
quick local inference
simple coding workflows

It may struggle with:

long-context tasks
highly complex debugging
strict factual accuracy
advanced architectural planning
deep multi-step reasoning
large multi-file codebase understanding

Prompting Tips

For best results, use prompts that are:

specific
direct
limited in scope
explicit about the language
clear about the desired output format

Example prompt styles

Code generation

Write a Python function that removes duplicate email addresses from a CSV file and saves the cleaned output.

Debugging

Explain why this JavaScript function returns undefined and provide a corrected version.

Refactoring

Refactor this Python function to improve readability and add error handling.
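
The tips above can be folded into a small prompt template. This is a minimal sketch, not a format the model requires: the function name `build_prompt` and its fields are hypothetical, chosen so that every prompt names the language, the task, and the output format.

```python
def build_prompt(task: str, language: str, output_format: str, code: str = "") -> str:
    """Assemble a specific, narrowly scoped prompt for a small coding model."""
    if not task.strip() or not language.strip() or not output_format.strip():
        raise ValueError("task, language, and output_format must all be non-empty")
    parts = [
        f"Language: {language}",
        f"Task: {task}",
        f"Output format: {output_format}",
    ]
    if code.strip():
        # Include the snippet under review only when one is supplied.
        parts.append("Code:\n" + code.strip())
    return "\n".join(parts)


prompt = build_prompt(
    task="Refactor this function to improve readability and add error handling.",
    language="Python",
    output_format="A single code block, no commentary.",
    code="def f(x): return 1/x",
)
print(prompt)
```

Rejecting empty fields up front is a deliberate choice here: a prompt that omits the language or the output format is exactly the kind of underspecified request a 0.6B model tends to handle poorly.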

Runtime Notes

This model is distributed in GGUF format and is intended for use with runtimes that support GGUF, such as:

llama.cpp
compatible local desktop frontends
supported lightweight inference backends

Choose your quantization based on your hardware:

use Q4_K_M for the lowest RAM usage
use Q5_K_M for a quality / efficiency balance
use Q6_K when you want a stronger output-quality tilt and can afford the extra memory
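
Once a quantization is chosen, the matching file name follows the pattern of the three files listed under Repository Contents. The sketch below maps a quantization tag to that file name; the helper name `gguf_filename` is hypothetical, and the pattern is inferred from the listed files rather than guaranteed for future uploads.

```python
REPO_ID = "WithinUsAI/Qwen3-Qrazy.Qoder-0.6B-GGUF"
# Quantization tags of the three files listed under "Repository Contents".
AVAILABLE_QUANTS = ("Q4_K_M", "Q5_K_M", "Q6_K")


def gguf_filename(quant: str) -> str:
    """Map a quantization tag to the corresponding GGUF file name."""
    tag = quant.upper()
    if tag not in AVAILABLE_QUANTS:
        raise ValueError(f"unknown quantization {quant!r}; choose from {AVAILABLE_QUANTS}")
    return f"Qwen3-0.6B-Qrazy-Qoder.i1-{tag}.gguf"


print(gguf_filename("Q5_K_M"))  # Qwen3-0.6B-Qrazy-Qoder.i1-Q5_K_M.gguf
```

The returned name can be passed as `filename=` to `Llama.from_pretrained`, together with `repo_id=REPO_ID`.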

Limitations

Like other small language models, this model may:

hallucinate APIs or library behavior
generate incorrect or incomplete code
lose instruction fidelity on longer prompts
produce repetitive responses
make reasoning mistakes
require prompt iteration to get clean outputs

Human review is strongly recommended.
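
Part of that review can be automated cheaply. As a minimal sketch, assuming the model was asked for Python, the snippet below rejects output that does not even parse. The function name `parses_as_python` is hypothetical, and passing this check says nothing about whether the code is correct.

```python
import ast


def parses_as_python(source: str) -> bool:
    """Return True if the text is syntactically valid Python, False otherwise."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


complete = "def add(a, b):\n    return a + b\n"
truncated = "def add(a, b):\n    return a +"  # generation cut off mid-expression
print(parses_as_python(complete), parses_as_python(truncated))  # True False
```

A check like this only catches incomplete or malformed output; hallucinated APIs and logic errors still require tests and a human reader.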

Creator

WithIn Us AI is the creator of this model release, including the packaging, naming, quantized GGUF distribution, and any fine-tuning / merging process associated with this release.

License

This model card uses:

license: other

You can replace this with your exact WithIn Us AI custom license terms.

If this release is derived from upstream models, merged checkpoints, or third-party datasets, include:

attribution to the original base model creators
attribution to any third-party datasets used
a clear statement that WithIn Us AI claims authorship of the fine-tuning / merging / packaging process, not ownership of third-party source materials unless applicable