RichardForests's Collections: Transformers & MoE
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Paper • arXiv:2312.07987 • Published • 41 upvotes
Interfacing Foundation Models' Embeddings
Paper • arXiv:2312.07532 • Published • 12 upvotes
Point Transformer V3: Simpler, Faster, Stronger
Paper • arXiv:2312.10035 • Published • 22 upvotes
TheBloke/quantum-v0.01-GPTQ
Text Generation • 7B • Updated • 3 • 2
Text Generation • 36B • Updated • 3 • 1
mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-HQQ
Text Generation • Updated • 14 • 38
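As a minimal sketch (not part of the collection itself), a quantized checkpoint such as the GPTQ model listed above can be loaded with the Hugging Face transformers library; the model ID is taken from the entry above, while the installed backends (accelerate, optimum, auto-gptq), device placement, and prompt are assumptions.

# Minimal sketch, assuming transformers, accelerate, optimum, and auto-gptq
# are installed and a GPU is available; prompt and generation settings are
# illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/quantum-v0.01-GPTQ"  # model ID from the entry above

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # let accelerate place the quantized weights
)

prompt = "Mixture-of-experts attention speeds up transformers by"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))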
Denoising Vision Transformers
Paper • arXiv:2401.02957 • Published • 31 upvotes
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Paper • arXiv:2401.06066 • Published • 59 upvotes
Buffer Overflow in Mixture of Experts
Paper • arXiv:2402.05526 • Published • 9 upvotes
Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory
Paper • arXiv:2405.08707 • Published • 34 upvotes