| language: | |
| - en | |
| license: mit | |
| library_name: transformers | |
| tags: | |
| - text-generation | |
| - pytorch | |
| - custom-architecture | |
| - rope | |
| - rmsnorm | |
| - swiglu | |
| - flash-attention | |
| - 16k-context | |
| pipeline_tag: text-generation | |
| widget: | |
| - text: "The future of artificial intelligence is" | |
| example_title: "AI Future" | |
| - text: "Write a short story about" | |
| example_title: "Story Generation" | |
| - text: "Explain quantum computing in simple terms:" | |
| example_title: "Technical Explanation" | |
| datasets: | |
| - tiiuae/falcon-refinedweb | |
| metrics: | |
| - perplexity | |
| model-index: | |
| - name: MAP-NEO Mini | |
| results: | |
| - task: | |
| type: text-generation | |
| name: Text Generation | |
| dataset: | |
| name: RefinedWeb (100K subset) | |
| type: tiiuae/falcon-refinedweb | |
| metrics: | |
| - type: perplexity | |
| value: 3.9 | |
| name: Final Training Loss | |
| # MAP-NEO Mini | |
| ## Model Description | |
| **MAP-NEO Mini** is a 253M-parameter autoregressive language model built from scratch with modern architectural components. It aims to demonstrate that a capable language model can be trained efficiently on modest consumer hardware through careful data curation and architectural choices. | |
| - **Developed by**: Antony Austin | |
| - **Model type**: Autoregressive Language Model | |
| - **Language(s)**: English | |
| - **License**: MIT | |
| - **Architecture**: Custom transformer with RoPE, RMSNorm, SwiGLU, and Flash Attention | |
| ## Key Features | |
| - **Efficient Training**: Trained on a single RTX 5070 Laptop GPU (8GB VRAM) in ~4 hours | |
| - **Extended Context**: 16,384-token context window by design (8x the 2,048-token base context; validated up to 3,600 tokens) | |
| - **Memory Efficient**: ~1.3GB VRAM for inference at a 1,800-token context | |
| - **Fast Inference**: ~150+ tokens/second on a consumer GPU (RTX 5070) | |
| - **High Quality Data**: Trained on a curated 100,000-document RefinedWeb subset | |
| ## Architecture Details | |
| ### Model Architecture | |
| - **Parameters**: 253,085,696 (253M) | |
| - **Layers**: 16 transformer blocks | |
| - **Hidden Size**: 1,024 | |
| - **Attention Heads**: 16 | |
| - **Head Dimension**: 64 | |
| - **FFN Hidden Size**: 2,736 (2.67x hidden size) | |
| - **Vocabulary Size**: 50,257 (GPT-2 tokenizer) | |
| - **Max Sequence Length**: 16,384 tokens | |
| ### Architectural Innovations | |
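| - **RMSNorm**: Root Mean Square Layer Normalization, which rescales activations without mean-centering, for training stability | |
| - **RoPE**: Rotary Positional Embeddings, which encode relative position by rotating query and key vectors | |
| - **SwiGLU**: Swish-Gated Linear Units in the feed-forward block (three projection matrices instead of two) | |
| - **Flash Attention**: Memory-efficient exact attention that avoids materializing the full attention matrix | |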
| - **Weight Tying**: Input/output embeddings shared for parameter efficiency | |
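The parameter count listed above can be reproduced from the stated dimensions. The sketch below assumes bias-free linear layers, two RMSNorm weight vectors per block, one final RMSNorm, and a single embedding matrix shared with the output head. These layout details are assumptions, not something this card confirms, although under them the total matches the stated figure.

```python
# Assumed layout (not confirmed by the card): bias-free projections,
# two RMSNorms per block, one final RMSNorm, tied input/output embeddings.
VOCAB, HIDDEN, LAYERS, FFN = 50_257, 1_024, 16, 2_736


def count_parameters(vocab=VOCAB, hidden=HIDDEN, layers=LAYERS, ffn=FFN):
    embedding = vocab * hidden        # shared with the output head (weight tying)
    attention = 4 * hidden * hidden   # Q, K, V and output projections
    swiglu = 3 * hidden * ffn         # gate, up and down projections
    norms = 2 * hidden                # pre-attention and pre-FFN RMSNorm weights
    per_block = attention + swiglu + norms
    final_norm = hidden
    return embedding + layers * per_block + final_norm


total = count_parameters()
print(f"{total:,}")  # 253,085,696
```

Without weight tying, a separate output head would add another 50,257 x 1,024 (about 51.5M) parameters.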
| ## Training Data | |
| ### Dataset | |
| - **Source**: `tiiuae/falcon-refinedweb` (curated subset) | |
| - **Size**: 100,000 high-quality web documents | |
| - **Tokens**: ~41 million tokens | |
| - **Sequence Length**: 1,024 tokens per sequence | |
| - **Sequences**: 40,965 packed sequences | |
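As an illustration of what "packed sequences" usually means, the sketch below concatenates tokenized documents, separated by an end-of-text token, and cuts the stream into fixed-length chunks. The function name `pack_sequences` and the choice to drop the trailing partial chunk are assumptions for illustration; the card does not describe the actual packing code. The ID 50256 is the GPT-2 `<|endoftext|>` token.

```python
def pack_sequences(token_streams, seq_len=1024, eos_id=50256):
    """Concatenate tokenized documents (EOS-separated) and cut fixed-length chunks."""
    buffer, packed = [], []
    for tokens in token_streams:
        buffer.extend(tokens)
        buffer.append(eos_id)          # mark the document boundary
        while len(buffer) >= seq_len:
            packed.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    return packed                      # a trailing partial chunk is dropped


# Toy example: three short "documents" packed into chunks of 4 tokens.
toy = pack_sequences([[1] * 5, [2] * 3, [3] * 4], seq_len=4, eos_id=0)
print(len(toy))  # 3

# Token positions covered by the listed number of packed sequences.
print(f"{40_965 * 1_024:,}")  # 41,948,160
```

The product of 40,965 sequences and 1,024 tokens is roughly in line with the token count listed above.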
| ### Data Quality | |
| - Length filtering: 200-10,000 characters | |
| - Language detection: English only | |
| - Quality scoring: High-quality web content | |
| - Deduplication: Exact and near-duplicate removal | |
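Two of the filters above are simple enough to sketch with the standard library: the character-length window and exact-duplicate removal. The function name `filter_documents` and the whitespace/case normalization before hashing are assumptions for illustration. Language detection, quality scoring, and near-duplicate removal are left out because the card does not say which tools were used.

```python
import hashlib


def filter_documents(docs, min_chars=200, max_chars=10_000):
    """Keep documents inside the length window and drop exact duplicates."""
    seen, kept = set(), []
    for text in docs:
        if not (min_chars <= len(text) <= max_chars):
            continue                                   # length filter
        normalized = " ".join(text.split()).lower()    # assumed normalization
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue                                   # exact duplicate
        seen.add(digest)
        kept.append(text)
    return kept


docs = ["a" * 100, "b" * 300, "b" * 300, "c" * 20_000]
print(len(filter_documents(docs)))  # 1
```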
| ## Training Procedure | |
| ### Training Configuration | |
| - **Hardware**: NVIDIA RTX 5070 Laptop GPU (8GB VRAM) | |
| - **Precision**: bfloat16 mixed precision | |
| - **Batch Size**: 1 per device | |
| - **Gradient Accumulation**: 32 steps | |
| - **Effective Batch Size**: 32 | |
| - **Learning Rate**: 3e-4 | |
| - **Scheduler**: Cosine with linear warmup | |
| - **Warmup Steps**: 3,750 | |
| - **Total Steps**: 150,000 | |
| - **Training Time**: ~4 hours | |
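The schedule in this configuration can be written as a small function. This is a minimal sketch, assuming the learning rate decays to zero at the final step and that warmup and total steps are counted in the same unit; the card specifies neither, and `learning_rate` is a hypothetical helper name.

```python
import math


def learning_rate(step, peak_lr=3e-4, warmup_steps=3_750,
                  total_steps=150_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed 0)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 3_749, 76_875, 150_000):
    print(step, f"{learning_rate(step):.2e}")
```

With these values the warmup covers 2.5% of the run (3,750 of 150,000 steps).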
| ### Optimization Details | |
| - **Optimizer**: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.01) | |
| - **Gradient Clipping**: 1.0 | |
| - **Gradient Checkpointing**: Enabled for memory efficiency | |
| - **Loss Function**: Cross-entropy loss | |
| ### Context Extension | |
| - **Base Context**: 2,048 tokens | |
| - **Extended Context**: 16,384 tokens | |
| - **Method**: Linear interpolation of positional embeddings | |
| - **Validation**: Successfully tested up to 3,600 tokens | |
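For a RoPE model, linear interpolation is commonly implemented by scaling position indices so that the extended range maps back onto the range seen in training. The sketch below illustrates that idea with the context sizes listed above. It assumes the conventional RoPE base of 10,000 and an interleaved pair layout; whether MAP-NEO Mini uses exactly this formulation is not stated in the card.

```python
import math

BASE_CONTEXT, EXTENDED_CONTEXT = 2_048, 16_384
SCALE = BASE_CONTEXT / EXTENDED_CONTEXT  # 0.125


def rope_angles(position, head_dim=64, base=10_000.0, scale=1.0):
    """Rotation angles for one position; scale < 1 compresses positions."""
    pos = position * scale
    return [pos / base ** (2 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(vec, angles):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by the given angles."""
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out


# With interpolation, position 8 in the extended model gets the same
# angles as position 1 in the base model.
print(rope_angles(8, scale=SCALE) == rope_angles(1))  # True
```

Under this scaling the last extended position (16,383) maps to 2,047.875, which stays inside the base range of 2,048 positions.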
| ## Performance | |
| ### Training Metrics | |
| - **Final Loss**: 3.907 | |
| - **Training Speed**: ~10 iterations/second | |
| - **Peak Memory**: ~8GB VRAM | |
| - **Convergence**: Smooth loss curve, no overfitting | |
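The figures above can be cross-checked with a little arithmetic. The sketch assumes the final loss is a per-token cross-entropy in nats (the PyTorch default), that "Total Steps" counts individual iterations at the stated speed, and that training sequences are 1,024 tokens long as listed in the dataset section. Note that 3.907 is a loss value; under these assumptions the corresponding perplexity is about 50.

```python
import math

final_loss = 3.907                       # cross-entropy, assumed nats per token
perplexity = math.exp(final_loss)
print(f"perplexity ~ {perplexity:.1f}")  # perplexity ~ 49.7

iterations, iterations_per_sec = 150_000, 10
hours = iterations / iterations_per_sec / 3600
print(f"wall clock ~ {hours:.2f} h")     # wall clock ~ 4.17 h

# Tokens contributing to one optimizer update:
# batch size 1 x 32 accumulation steps x 1,024-token sequences.
tokens_per_update = 1 * 32 * 1_024
print(tokens_per_update)                 # 32768
```

The wall-clock estimate is consistent with the ~4 hours of training time reported in the training configuration.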
| ### Inference Performance | |
| - **Speed**: ~150+ tokens/second (RTX 5070) | |
| - **Memory Usage**: 1.3GB for 1,800 token context | |
| - **Context Limit**: 3,600 tokens practical limit | |
| - **Temperature**: Recommended 0.7-0.9 for creative tasks | |
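A rough lower bound on inference memory can be estimated from the architecture: weight storage plus the key/value cache. The sketch below assumes standard multi-head attention (keys and values of width 1,024 per layer) and ignores activations and framework overhead; the precision used for the reported measurement is not stated, so both 4-byte and 2-byte storage are shown. The helper name `inference_memory_gb` is hypothetical.

```python
PARAMS, LAYERS, HIDDEN = 253_085_696, 16, 1_024


def inference_memory_gb(tokens, bytes_per_value):
    """Weights + KV cache in GB (10^9 bytes); excludes activations and overhead."""
    weights = PARAMS * bytes_per_value
    kv_cache = 2 * LAYERS * HIDDEN * tokens * bytes_per_value  # keys and values
    return (weights + kv_cache) / 1e9


print(f"fp32: {inference_memory_gb(1_800, 4):.2f} GB")  # fp32: 1.25 GB
print(f"bf16: {inference_memory_gb(1_800, 2):.2f} GB")  # bf16: 0.62 GB
```

This is an estimate under the stated assumptions, not a measurement; actual usage depends on the runtime and attention implementation.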
| ## Usage | |
| ### Quick Start | |
| ``` | |
| import torch | |
| from transformers import AutoTokenizer | |
| from model_neo import NeoMini, NeoMiniConfig  # local module; must be on the import path | |
| # Load the model and the extended-context checkpoint | |
| config = NeoMiniConfig() | |
| model = NeoMini(config) | |
| checkpoint = torch.load("extended_context_model.pt", map_location="cpu") | |
| model.load_state_dict(checkpoint['model_state_dict']) | |
| model.eval() | |
| # Load the GPT-2 tokenizer | |
| tokenizer = AutoTokenizer.from_pretrained("gpt2") | |
| # Generate text | |
| prompt = "The future of AI is" | |
| input_ids = tokenizer.encode(prompt, return_tensors="pt") | |
| with torch.no_grad(): | |
|     output = model.generate(input_ids, max_length=100, temperature=0.8) | |
| # Assumes generate() returns a batch of token IDs; decode the first sequence | |
| print(tokenizer.decode(output[0])) | |
| ``` | |
| ### Interactive Chat | |
| ``` | |
| python interactive_chat.py | |
| ``` | |
| ### Generation Parameters | |
| - **Temperature**: 0.7-0.9 for creative tasks, 0.3-0.5 for factual | |
| - **Top-k**: 40-50 | |
| - **Top-p**: 0.8-0.9 | |
| - **Repetition Penalty**: 1.1-1.3 | |
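To show how these four parameters interact, here is a minimal pure-Python sketch of one sampling step over a list of logits. It is not the model's own `generate` implementation, whose internals this card does not show; `sample_next_token` is a hypothetical helper, and the repetition penalty follows the common convention of dividing positive logits and multiplying negative ones.

```python
import math
import random


def sample_next_token(logits, previous_ids=(), temperature=0.8, top_k=40,
                      top_p=0.9, repetition_penalty=1.2, rng=random):
    logits = list(logits)
    # Repetition penalty: make already-generated tokens less likely.
    for tok in set(previous_ids):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    ranked = ranked[:max(1, top_k)]
    # Softmax over the surviving tokens (shifted by the max for stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept_ids, kept_probs, cumulative = [], [], 0.0
    for idx, w in zip(ranked, weights):
        kept_ids.append(idx)
        kept_probs.append(w / total)
        cumulative += w / total
        if cumulative >= top_p:
            break
    return rng.choices(kept_ids, weights=kept_probs, k=1)[0]


rng = random.Random(0)
print(sample_next_token([2.0, 1.5, 0.1, -3.0], previous_ids=[0], rng=rng))
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible.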
| ## Limitations | |
| ### Current Limitations | |
| - **Base Model Only**: Not instruction-tuned (requires fine-tuning for chat) | |
| - **Context Window**: Practical limit of ~3,600 tokens despite 16K architecture | |
| - **Hardware Requirements**: Requires CUDA-capable GPU for optimal performance | |
| - **Knowledge Cutoff**: No explicit cutoff date; knowledge is limited to what appears in the RefinedWeb training subset | |
| ### Known Issues | |
| - Occasionally generates repetitive patterns (fixable with fine-tuning) | |
| - May not follow instructions well (base model behavior) | |
| - Sometimes produces formatting artifacts from web data | |
| ## Ethical Considerations | |
| ### Bias and Fairness | |
| - Trained on web data which may contain societal biases | |
| - No explicit bias mitigation applied during training | |
| - Users should be aware of potential biased outputs | |
| ### Use Cases | |
| **Intended Uses:** | |
| - Research and experimentation | |
| - Text generation and completion | |
| - Creative writing assistance | |
| - Educational purposes | |
| **Out-of-Scope Uses:** | |
| - Medical or legal advice | |
| - High-stakes decision making | |
| - Content that could cause harm | |
| ## Environmental Impact | |
| ### Carbon Footprint | |
| - **Training Hardware**: Single RTX 5070 Laptop GPU (100W) | |
| - **Training Time**: ~4 hours | |
| - **Estimated CO₂**: ~0.3 kg CO₂ equivalent | |
| - **Efficiency**: 253M parameters per 0.3 kg CO₂ | |
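The estimate above follows from power draw, training time, and a grid carbon intensity. The card does not state which intensity was used, so the sketch below works backwards: it computes the energy from the listed power and time and derives the intensity implied by the reported figure. The helper name `estimate_co2_kg` is hypothetical.

```python
def estimate_co2_kg(power_watts, hours, kg_co2_per_kwh):
    """CO2-equivalent emissions for a constant power draw."""
    energy_kwh = power_watts / 1000 * hours
    return energy_kwh * kg_co2_per_kwh


energy_kwh = 100 / 1000 * 4            # 100 W for 4 hours
implied_intensity = 0.3 / energy_kwh   # kg CO2e per kWh implied by the card
print(f"{energy_kwh:.1f} kWh, {implied_intensity:.2f} kg CO2e/kWh")
# 0.4 kWh, 0.75 kg CO2e/kWh
```

This counts GPU power only; the rest of the laptop and any data preparation are not included.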
| ## Model Card Authors | |
| Antony Austin - Model development and training | |
| August 30, 2025 - Model card creation | |
| ## Citation | |
| ``` | |
| @misc{mapneo_mini_2025, | |
| title={MAP-NEO Mini: An Efficient 253M Parameter Language Model}, | |
| author={Antony Austin}, | |
| year={2025}, | |
| howpublished={\url{https://huggingface.co/Austin207/Map-NEO}}, | |
| note={Trained on NVIDIA RTX 5070 Laptop GPU with RefinedWeb data} | |
| } | |
| ``` | |
| ## Technical Details | |
| ### Hardware Requirements | |
| - **Minimum**: 4GB VRAM for inference | |
| - **Recommended**: 8GB VRAM for extended context | |
| - **Training**: 8GB+ VRAM with mixed precision | |
| - **CPU**: Any modern CPU (inference possible but slow) | |
| ## Future Work | |
| ### Planned Improvements | |
| - [ ] Conversational fine-tuning with UltraChat dataset | |
| - [ ] Instruction following capabilities | |
| - [ ] Multi-language support | |
| - [ ] Quantized versions (4-bit, 8-bit) | |
| - [ ] ONNX export for edge deployment | |
| ### Research Directions | |
| - Context window optimization beyond 16K | |
| - More efficient attention mechanisms | |
| - Improved training data curation | |
| - Specialized domain fine-tuning | |
| ## Acknowledgments | |
| - **Falcon RefinedWeb**: High-quality training data | |
| - **Hugging Face**: Transformers library and infrastructure | |
| - **Community**: Open-source ML community for architectural insights | |
| --- | |
| **Last Updated**: August 30, 2025 | |
| **Model Version**: 1.0.0 | |
| **Status**: Base model (pre-conversational fine-tuning) |