# DeepSeekV3

Implementation of the DeepSeek-V3 architecture (8x2 MoE: 8 routed experts with Top-2 routing).
## Architecture Details
- Architecture: Mixture of Experts (MoE) + Multi-Head Latent Attention (MLA)
- Parameters: ~196 million
- Layers: 6
- Hidden dimension: 512
- Experts: 8 routed experts (Top-2 routing) + 1 shared expert
- Attention: MLA with 8 heads and a 64-dimensional latent space
- Position embeddings: Sinusoidal
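The expert layout above (8 routed experts selected Top-2, plus one always-active shared expert) can be sketched in plain Python. This is a minimal illustration of the routing math only, not code from this repository; names like `top2_route` and `moe_forward` are hypothetical, and experts are shown as simple callables on a scalar for clarity.

```python
import math

def top2_route(logits):
    """Pick the two highest-scoring experts and softmax-normalize their gates."""
    top2 = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    exps = [math.exp(logits[i]) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]

def moe_forward(x, experts, shared_expert, router_logits):
    # The shared expert always contributes; routed experts are gated Top-2.
    out = shared_expert(x)
    for idx, gate in top2_route(router_logits):
        out += gate * experts[idx](x)
    return out
```

In the real model the gates weight full expert MLP outputs per token; the shared expert lets commonly useful features bypass the router entirely.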
## Training
- Dataset: TinyStories
- Platform: Kaggle (2x T4 GPUs)
- Optimizer: AdamW
## How to use

This model uses a custom architecture implementation, so it cannot be loaded with the standard `transformers` auto classes. Instead, load the state_dict from `model.safetensors` and pass it to your own instantiation of the architecture:

```python
from safetensors.torch import load_file

weights = load_file("model.safetensors")
# model.load_state_dict(weights)  # 'model' is your instantiation of the custom architecture
```
## Dataset
Trained on the TinyStories dataset.