	---
	language:
	- en
	license: mit
	library_name: transformers
	tags:
	- text-generation
	- pytorch
	- custom-architecture
	- rope
	- rmsnorm
	- swiglu
	- flash-attention
	- 16k-context
	pipeline_tag: text-generation
	widget:
	- text: "The future of artificial intelligence is"
	  example_title: "AI Future"
	- text: "Write a short story about"
	  example_title: "Story Generation"
	- text: "Explain quantum computing in simple terms:"
	  example_title: "Technical Explanation"
	datasets:
	- tiiuae/falcon-refinedweb
	metrics:
	- perplexity
	model-index:
	- name: MAP-NEO Mini
	  results:
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: RefinedWeb (100K subset)
	      type: tiiuae/falcon-refinedweb
	    metrics:
	    - type: perplexity
	      value: 3.9
	      name: Final Training Loss
	---

	# MAP-NEO Mini

	## Model Description

	MAP-NEO Mini is a 253M-parameter autoregressive language model trained from scratch with a modern transformer architecture. It shows that a small language model can be trained on a single consumer laptop GPU in about four hours, using a curated data subset and memory-efficient architectural choices.

	- Developed by: Antony Austin
	- Model type: Autoregressive Language Model
	- Language(s): English
	- License: MIT
	- Architecture: Custom transformer with RoPE, RMSNorm, SwiGLU, and Flash Attention

	## Key Features

	- Efficient Training: Trained on an RTX 5070 Laptop GPU (8GB VRAM) in ~4 hours
	- Extended Context: 16,384-token context window, extended from a 2,048-token base (validated up to 3,600 tokens)
	- Memory Efficient: About 1.3GB VRAM for inference with a 1,800-token context
	- Fast Inference: 150+ tokens/second on a consumer GPU
	- High Quality Data: Trained on a curated RefinedWeb subset

	## Architecture Details

	### Model Architecture
	- Parameters: 253,085,696 (253M)
	- Layers: 16 transformer blocks
	- Hidden Size: 1,024
	- Attention Heads: 16
	- Head Dimension: 64
	- FFN Hidden Size: 2,736 (2.67x hidden size)
	- Vocabulary Size: 50,257 (GPT-2 tokenizer)
	- Max Sequence Length: 16,384 tokens
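
The parameter count can be reproduced from the dimensions listed above. The sketch below assumes bias-free linear layers, tied input/output embeddings, two RMSNorm gain vectors per block, and one final RMSNorm. These assumptions are not stated explicitly in this card, but under them the total matches the reported figure.

```python
# Architecture values copied from the list above.
VOCAB_SIZE = 50_257
HIDDEN = 1_024
LAYERS = 16
FFN_HIDDEN = 2_736

def count_parameters(vocab=VOCAB_SIZE, hidden=HIDDEN, layers=LAYERS, ffn=FFN_HIDDEN):
    """Count weights assuming bias-free linears and tied embeddings (assumptions)."""
    embedding = vocab * hidden        # shared with the output head (weight tying)
    attention = 4 * hidden * hidden   # Q, K, V and output projections
    swiglu = 3 * hidden * ffn         # gate, up and down projections
    norms = 2 * hidden                # two RMSNorm gain vectors per block
    per_block = attention + swiglu + norms
    final_norm = hidden               # one RMSNorm before the output head
    return embedding + layers * per_block + final_norm

print(f"{count_parameters():,}")  # 253,085,696
```

About 51M of these parameters sit in the embedding table, so weight tying saves a second table of the same size.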

	### Architectural Innovations
	- RMSNorm: Root Mean Square Layer Normalization for training stability
	- RoPE: Rotary Positional Embeddings for better positional understanding
	- SwiGLU: Swish-Gated Linear Units for improved FFN performance
	- Flash Attention: Memory-efficient attention computation
	- Weight Tying: Input/output embeddings shared for parameter efficiency
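
As a rough illustration of two of the components above, here is a dependency-free sketch of RMSNorm and a SwiGLU feed-forward layer acting on a single vector. It is not taken from `model_neo`; the function names, the `eps` value, and the list-of-rows weight layout are assumptions chosen for readability.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root mean square, then apply a learned per-channel gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def silu(v):
    """Swish / SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    gated = [silu(g) * u for g, u in zip(matvec(w_gate, x), matvec(w_up, x))]
    return matvec(w_down, gated)

normed = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
print(normed)  # mean of squares is ~1 after normalization
```

Unlike LayerNorm, RMSNorm does not subtract the mean and has no bias term. SwiGLU uses three weight matrices instead of two, which is why the FFN hidden size is about 2.67x rather than 4x the hidden size.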

	## Training Data

	### Dataset
	- Source: `tiiuae/falcon-refinedweb` (curated subset)
	- Size: 100,000 high-quality web documents
	- Tokens: ~42 million tokens (40,965 sequences x 1,024 tokens)
	- Sequence Length: 1,024 tokens per sequence
	- Sequences: 40,965 packed sequences
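
Packing here means concatenating tokenized documents into one stream and cutting it into fixed-length rows. The sketch below shows one common way to do this. The function name, the use of the GPT-2 `<|endoftext|>` id (50256) as a separator, and dropping the trailing partial row are assumptions, not details confirmed by this card.

```python
def pack_sequences(token_docs, seq_len=1024, eos_id=50256):
    """Concatenate tokenized documents and cut the stream into fixed-length sequences."""
    stream = []
    for doc in token_docs:
        stream.extend(doc)
        stream.append(eos_id)  # document separator (assumed: GPT-2 <|endoftext|>)
    n_full = len(stream) // seq_len
    # Drop the trailing partial sequence so every row has exactly seq_len tokens
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

# Tiny example with seq_len=3 and 0 as the separator
print(pack_sequences([[1, 2, 3], [4, 5]], seq_len=3, eos_id=0))  # [[1, 2, 3], [0, 4, 5]]
```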

	### Data Quality
	- Length filtering: 200-10,000 characters
	- Language detection: English only
	- Quality scoring: High-quality web content
	- Deduplication: Exact and near-duplicate removal
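
A minimal sketch of the length filter and exact deduplication steps is shown below. The function name and the normalization (lowercasing and collapsing whitespace before hashing) are assumptions. Language detection, quality scoring, and near-duplicate removal are not covered here.

```python
import hashlib

def filter_and_dedup(docs, min_chars=200, max_chars=10_000):
    """Keep documents within the length bounds and drop exact duplicates."""
    seen = set()
    kept = []
    for text in docs:
        if not (min_chars <= len(text) <= max_chars):
            continue
        # Hash whitespace-normalized, lowercased text so trivial variants collide
        normalized = " ".join(text.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept
```

Storing hashes rather than full texts keeps memory use modest for a 100,000-document subset.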

	## Training Procedure

	### Training Configuration
	- Hardware: NVIDIA RTX 5070 Laptop GPU (8GB VRAM)
	- Precision: bfloat16 mixed precision
	- Batch Size: 1 sequence per device
	- Gradient Accumulation: 32 steps
	- Effective Batch Size: 32 sequences (32,768 tokens) per optimizer update
	- Learning Rate: 3e-4
	- Scheduler: Cosine with linear warmup
	- Warmup Steps: 3,750
	- Total Steps: 150,000
	- Training Time: ~4 hours
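
The schedule above can be sketched as a plain function of the step index. The defaults come from the configuration list. The final learning rate (`min_lr=0.0`) and the exact warmup formula are assumptions, since the card does not state them.

```python
import math

def learning_rate(step, peak_lr=3e-4, warmup_steps=3_750, total_steps=150_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (min_lr is an assumption)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 3_750, 76_875, 150_000):
    print(step, f"{learning_rate(step):.2e}")
```

Warmup covers 2.5% of the run (3,750 of 150,000 steps). At the reported ~10 iterations/second, 150,000 steps take roughly 4.2 hours, which is consistent with the stated training time if each step is one micro-batch. The card does not say whether steps count micro-batches or optimizer updates.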

	### Optimization Details
	- Optimizer: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.01)
	- Gradient Clipping: 1.0
	- Gradient Checkpointing: Enabled for memory efficiency
	- Loss Function: Cross-entropy loss

	### Context Extension
	- Base Context: 2,048 tokens
	- Extended Context: 16,384 tokens
	- Method: Linear interpolation of positional embeddings
	- Validation: Successfully tested up to 3,600 tokens
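
For a RoPE model, linear interpolation is usually applied to the position index rather than to a learned embedding table. The sketch below assumes that reading: positions are multiplied by 2,048 / 16,384 = 1/8 so that every position in the extended window maps back into the range seen during training. The RoPE base of 10,000 and the function names are assumptions, not values taken from `model_neo`.

```python
import math

def rope_angles(position, head_dim=64, base=10_000.0, train_ctx=2_048, target_ctx=16_384):
    """Rotation angles for one position under linear position interpolation."""
    scale = train_ctx / target_ctx   # 2,048 / 16,384 = 0.125
    scaled_pos = position * scale    # squeeze new positions into the trained range
    return [scaled_pos * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def apply_rope(vec, angles):
    """Rotate consecutive (even, odd) pairs of vec by the matching angle."""
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

# The last position of the extended window stays below the trained limit of 2,048
print(rope_angles(16_383)[0])  # 2047.875
```

Interpolation compresses the spacing between neighbouring positions, which typically needs some fine-tuning at the longer length to work well. That may be one reason the practical limit reported here is below 16,384 tokens.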

	## Performance

	### Training Metrics
	- Final Loss: 3.907
	- Training Speed: ~10 iterations/second
	- Peak Memory: ~8GB VRAM
	- Convergence: Smooth loss curve, no signs of overfitting observed
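
The final loss and perplexity are related by an exponential. Assuming the reported loss is the mean cross-entropy per token in nats (the PyTorch default), the corresponding training perplexity is roughly 50:

```python
import math

final_loss = 3.907                 # mean cross-entropy per token, as reported above
perplexity = math.exp(final_loss)  # perplexity = exp(loss) when the loss uses natural log
print(f"perplexity = {perplexity:.2f}")
```

The value 3.9 therefore reads as a loss rather than a perplexity.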

	### Inference Performance
	- Speed: 150+ tokens/second (RTX 5070 Laptop GPU)
	- Memory Usage: ~1.3GB VRAM for a 1,800-token context
	- Practical Context Limit: ~3,600 tokens
	- Temperature: Recommended 0.7-0.9 for creative tasks
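
A back-of-the-envelope check on the memory figure: the sketch below adds the weight storage to a key/value cache for standard multi-head attention (16 heads x 64 dimensions per layer). It ignores activations and framework overhead, so it is a lower bound. The inference precision is not stated in this card, so `bytes_per_value` is a parameter.

```python
def inference_memory_gb(n_params=253_085_696, layers=16, hidden=1_024,
                        context_tokens=1_800, bytes_per_value=2):
    """Lower-bound estimate: weights plus KV cache, ignoring activations and overhead."""
    weights = n_params * bytes_per_value
    kv_cache = 2 * layers * context_tokens * hidden * bytes_per_value  # keys and values
    return (weights + kv_cache) / 1e9

print(f"bf16: {inference_memory_gb(bytes_per_value=2):.2f} GB")  # bf16: 0.62 GB
print(f"fp32: {inference_memory_gb(bytes_per_value=4):.2f} GB")  # fp32: 1.25 GB
```

The 4-byte estimate is close to the reported 1.3GB, which would be consistent with fp32 inference, though other overheads could explain the gap as well.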

	## Usage

	### Quick Start
	```python
	import torch
	from transformers import AutoTokenizer
	from model_neo import NeoMini, NeoMiniConfig

	# Build the model and load the extended-context checkpoint
	config = NeoMiniConfig()
	model = NeoMini(config)
	checkpoint = torch.load("extended_context_model.pt", map_location="cpu")
	model.load_state_dict(checkpoint["model_state_dict"])
	model.eval()

	# Load the GPT-2 tokenizer used during training
	tokenizer = AutoTokenizer.from_pretrained("gpt2")

	# Generate text
	prompt = "The future of AI is"
	input_ids = tokenizer.encode(prompt, return_tensors="pt")
	with torch.no_grad():
	    output = model.generate(input_ids, max_length=100, temperature=0.8)
	# Assumes generate() returns a batch of token IDs; decode the first sequence
	print(tokenizer.decode(output[0]))
	```

	### Interactive Chat
	```
	python interactive_chat.py
	```

	### Generation Parameters
	- Temperature: 0.7-0.9 for creative tasks, 0.3-0.5 for factual
	- Top-k: 40-50
	- Top-p: 0.8-0.9
	- Repetition Penalty: 1.1-1.3
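
To show how these four settings interact, here is a dependency-free sketch of one sampling step over raw logits. It is not the implementation behind `model.generate`. The function name, the order of operations, and the sign-aware form of the repetition penalty are assumptions based on common practice.

```python
import math
import random

def sample_next_token(logits, generated, temperature=0.8, top_k=40, top_p=0.9,
                      repetition_penalty=1.2, rng=random):
    """Pick one token id from raw logits using the four settings listed above."""
    logits = list(logits)
    # Repetition penalty: make already-generated tokens less likely
    for tok in set(generated):
        score = logits[tok]
        logits[tok] = score / repetition_penalty if score > 0 else score * repetition_penalty
    # Keep only the top_k highest-scoring tokens, then apply temperature
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in ranked]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # numerically stable softmax
    total = sum(weights)
    probs = [w / total for w in weights]
    # Nucleus (top-p): keep the smallest prefix whose cumulative probability reaches top_p
    cumulative, cutoff = 0.0, len(probs)
    for n, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= top_p:
            cutoff = n
            break
    return rng.choices(ranked[:cutoff], weights=probs[:cutoff], k=1)[0]

print(sample_next_token([10.0, 0.0, 0.0, 0.0], generated=[]))  # 0
```

Lower temperatures sharpen the distribution, so fewer tokens survive the top-p cut. This is why the factual range (0.3-0.5) produces more deterministic output than the creative range (0.7-0.9).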

	## Limitations

	### Current Limitations
	- Base Model Only: Not instruction-tuned (requires fine-tuning for chat)
	- Context Window: Practical limit of ~3,600 tokens despite the 16K architecture
	- Hardware Requirements: A CUDA-capable GPU is recommended; CPU inference is possible but slow
	- Knowledge Cutoff: No defined cutoff; knowledge is limited to what appears in the RefinedWeb subset

	### Known Issues
	- Occasionally generates repetitive patterns (may be reduced with a repetition penalty or fine-tuning)
	- May not follow instructions well (base model behavior)
	- Sometimes produces formatting artifacts from web data

	## Ethical Considerations

	### Bias and Fairness
	- Trained on web data which may contain societal biases
	- No explicit bias mitigation applied during training
	- Users should be aware of potential biased outputs

	### Use Cases
	Intended Uses:
	- Research and experimentation
	- Text generation and completion
	- Creative writing assistance
	- Educational purposes

	Out-of-Scope Uses:
	- Medical or legal advice
	- High-stakes decision making
	- Content that could cause harm

	## Environmental Impact

	### Carbon Footprint
	- Training Hardware: Single RTX 5070 Laptop GPU (100W)
	- Training Time: ~4 hours
	- Estimated CO₂: ~0.3 kg CO₂ equivalent
	- Efficiency: 253M parameters per 0.3 kg CO₂
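
The estimate above can be checked with simple arithmetic. The sketch below derives the energy use from the listed power and time, then computes the grid carbon intensity that the reported CO₂ figure implies. It counts only the GPU at its listed 100W and ignores the rest of the laptop, which is an assumption.

```python
power_kw = 0.100        # 100 W GPU power, as listed above
hours = 4.0             # reported training time
reported_co2_kg = 0.3   # reported estimate

energy_kwh = power_kw * hours                     # energy drawn by the GPU
implied_intensity = reported_co2_kg / energy_kwh  # kg CO2e per kWh implied by the figures
print(f"{energy_kwh:.1f} kWh, {implied_intensity:.2f} kg CO2e/kWh")  # 0.4 kWh, 0.75 kg CO2e/kWh
```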

	## Model Card Authors

	- Antony Austin: Model development and training
	- Model card created: August 30, 2025

	## Citation

	```
	@misc{mapneo_mini_2025,
	  title={MAP-NEO Mini: An Efficient 253M Parameter Language Model},
	  author={Antony Austin},
	  year={2025},
	  howpublished={\url{https://huggingface.co/Austin207/Map-NEO}},
	  note={Trained on NVIDIA RTX 5070 Laptop GPU with RefinedWeb data}
	}
	```

	## Technical Details

	### Hardware Requirements
	- Minimum: 4GB VRAM for inference
	- Recommended: 8GB VRAM for extended context
	- Training: 8GB+ VRAM with mixed precision
	- CPU: Any modern CPU (inference possible but slow)

	## Future Work

	### Planned Improvements
	- [ ] Conversational fine-tuning with UltraChat dataset
	- [ ] Instruction following capabilities
	- [ ] Multi-language support
	- [ ] Quantized versions (4-bit, 8-bit)
	- [ ] ONNX export for edge deployment

	### Research Directions
	- Context window optimization beyond 16K
	- More efficient attention mechanisms
	- Improved training data curation
	- Specialized domain fine-tuning

	## Acknowledgments

	- Falcon RefinedWeb: High-quality training data
	- Hugging Face: Transformers library and infrastructure
	- Community: Open-source ML community for architectural insights

	---

	Last Updated: August 30, 2025
	Model Version: 1.0.0
	Status: Base model (pre-conversational fine-tuning)