# DeepSeekV3

Implementation of the DeepSeek-V3 architecture (8x2 MoE: 8 routed experts with Top-2 routing).
## Architecture Details
- Architecture: Mixture of Experts (MoE) + Multi-Head Latent Attention (MLA)
- Parameters: ~196 million
- Layers: 6
- Hidden dimension: 512
- Experts: 8 routed experts (Top-2 routing) + 1 shared expert
- Attention: MLA with 8 heads and a 64-dimensional latent space
- Position embeddings: Sinusoidal
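The expert layout above (8 routed experts selected Top-2, plus one always-active shared expert) can be sketched in plain Python. This is a minimal illustration of the routing math only, not code from this repository; names like `top2_route` and `moe_forward` are hypothetical, and experts are shown as simple callables on a scalar for clarity.

```python
import math

def top2_route(logits):
    """Pick the two highest-scoring experts and softmax-normalize their gates."""
    top2 = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    exps = [math.exp(logits[i]) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]

def moe_forward(x, experts, shared_expert, router_logits):
    # The shared expert always contributes; routed experts are gated Top-2.
    out = shared_expert(x)
    for idx, gate in top2_route(router_logits):
        out += gate * experts[idx](x)
    return out
```

In the real model the gates weight full expert MLP outputs per token; the shared expert lets commonly useful features bypass the router entirely.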
## Training
- Dataset: TinyStories
- Platform: Kaggle (2x T4 GPUs)
- Optimizer: AdamW
## How to use

This model uses a custom architecture implementation, so it cannot be loaded with the standard `transformers` auto classes. Instead, load the state_dict from `model.safetensors` and pass it to your own instantiation of the architecture:

```python
from safetensors.torch import load_file

weights = load_file("model.safetensors")
# model.load_state_dict(weights)  # 'model' is your instantiation of the custom architecture
```
## Dataset
Trained on the TinyStories dataset.