DeepSeekV3

Implementation of the DeepSeek-V3 architecture (MoE with 8 routed experts, top-2 routing).

Architecture Details

  • Architecture: Mixture of Experts (MoE) + Multi-Head Latent Attention (MLA)
  • Parameters: ~196 million
  • Layers: 6
  • Hidden Dimension: 512
  • Experts: 8 routed experts (Top-2) + 1 Shared Expert
  • Attention: Multi-Head Latent Attention (MLA) with 8 heads and a 64-dimensional latent space
  • Position Embeddings: Sinusoidal
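The routing scheme above (top-2 over 8 routed experts, plus one always-active shared expert) can be sketched as follows. This is an illustrative, framework-agnostic sketch in NumPy, not the model's actual implementation; all function and variable names here are assumptions.

```python
import numpy as np

def moe_forward(x, router_w, experts, shared_expert, top_k=2):
    """Route each token to its top-k experts and add the shared expert.

    x: (tokens, hidden); router_w: (hidden, n_experts);
    experts / shared_expert: callables mapping (hidden,) -> (hidden,).
    Illustrative only -- names do not match the released checkpoint.
    """
    logits = x @ router_w                               # (tokens, n_experts)
    # Softmax over expert logits to get routing probabilities.
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(probs[t])[-top_k:]             # top-k expert indices
        gate = probs[t, top] / probs[t, top].sum()      # renormalized gates
        for g, e in zip(gate, top):
            out[t] += g * experts[e](x[t])              # weighted routed experts
        out[t] += shared_expert(x[t])                   # shared expert, always on
    return out
```

Only `top_k` of the 8 routed expert MLPs run per token, which is what keeps the active parameter count well below the total parameter count.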

Training

  • Dataset: TinyStories
  • Platform: Kaggle (2x T4 GPUs)
  • Optimizer: AdamW
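For reference, AdamW differs from Adam in applying weight decay directly to the weights rather than folding it into the gradient. A single update step, sketched in NumPy with typical default hyperparameters (the values actually used in training are not published here):

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=3e-4, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One AdamW update: Adam moment estimates plus decoupled weight decay.

    w: weights; grad: gradient; m, v: first/second moment buffers;
    t: 1-indexed step count. Hyperparameters are illustrative defaults.
    """
    m = betas[0] * m + (1 - betas[0]) * grad            # first moment (EMA of grad)
    v = betas[1] * v + (1 - betas[1]) * grad ** 2       # second moment (EMA of grad^2)
    m_hat = m / (1 - betas[0] ** t)                     # bias correction
    v_hat = v / (1 - betas[1] ** t)
    # Decoupled decay: weight_decay * w is added outside the adaptive term.
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```

In practice one would use `torch.optim.AdamW` rather than hand-rolling this; the sketch just shows what the optimizer computes per parameter.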

How to use

This model uses a custom architecture implementation, so it cannot be loaded with AutoModel. To load it, use the state_dict in model.safetensors together with a matching model definition.

from safetensors.torch import load_file

# Loads the raw state_dict; keys correspond to the custom architecture's parameters.
weights = load_file("model.safetensors")

Dataset

Trained on the TinyStories dataset.
