Chess Transformer 200M v2

A 204M-parameter chess transformer trained on Stockfish-labeled positions from Lichess games.

Current Results

Best Accuracy: 16.3% (step 0, i.e. the inherited exp074 checkpoint evaluated before any exp075 training)
Total Positions Trained: 0 across 4 GPUs
Last Updated: 2026-03-28T15:20:15.471373+00:00

Training

Experiment: exp075_ddp_4gpu (Local SGD, 4x NVIDIA A40)
Dataset: avewright/chess-positions-lichess-sf (~832M positions, 3275 source parquets)
Architecture: FusedBoardEncoder 256d → 1024d transformer, 16 layers, 16 heads, FFN 4×, SpatialPolicyHead
Strategy: 4 independent workers, each training on 1/4 of the data; weights are averaged every 500 optimizer steps
Batch: 256 × 4 gradient-accumulation steps = effective 1024 per worker
LR: 1e-4 peak, cosine decay to a floor of 5% of peak (5e-6), 1% warmup
Parent: Continued from exp074 best checkpoint
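
The batch arithmetic, learning-rate schedule, and weight averaging listed above can be sketched roughly as follows. This is a minimal illustration, not the training code from this repository: the linear shape of the warmup, the `total_steps` argument, and the function names `lr_at_step` and `average_state_dicts` are assumptions made for the example.

```python
import math

# Values taken from the Training list above.
MICRO_BATCH = 256   # positions per forward/backward pass
ACCUM_STEPS = 4     # gradient-accumulation steps per optimizer step
NUM_WORKERS = 4     # independent Local SGD workers, one per GPU
SYNC_EVERY = 500    # optimizer steps between weight averages
PEAK_LR = 1e-4

EFFECTIVE_BATCH = MICRO_BATCH * ACCUM_STEPS                      # 1024 per worker
POSITIONS_PER_SYNC = EFFECTIVE_BATCH * SYNC_EVERY * NUM_WORKERS  # 2,048,000


def lr_at_step(step, total_steps, peak_lr=PEAK_LR,
               warmup_frac=0.01, floor_frac=0.05):
    """Warmup over the first 1% of steps, then cosine decay to 5% of peak.

    Assumption: the warmup is linear; the card only says "1% warmup".
    """
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    floor = peak_lr * floor_frac
    return floor + (peak_lr - floor) * cosine


def average_state_dicts(worker_states):
    """Element-wise mean of each named parameter across workers.

    Works for any values that support + and / (floats, NumPy arrays,
    floating-point torch tensors).
    """
    n = len(worker_states)
    return {name: sum(state[name] for state in worker_states) / n
            for name in worker_states[0]}
```

In a Local SGD loop, each worker would run 500 optimizer steps on its own data shard, after which every worker would load the result of `average_state_dicts` before continuing.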

Eval History

Step	Positions	Accuracy	Top-3	SF Rank	Value Acc
0	0	16.3%	41.8%	66.6	78.5%
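
The card does not define its metrics. As a hedged illustration, the sketch below assumes that Accuracy is top-1 agreement with the Stockfish-labeled move and that Top-3 counts a hit when that move is among the model's three highest-scoring moves. The function name `top_k_accuracy` and the toy inputs are hypothetical.

```python
def top_k_accuracy(logits, targets, k):
    """Fraction of positions whose target move index is in the top-k logits.

    logits:  list of per-position score lists (one score per candidate move)
    targets: list of target move indices (assumed: Stockfish's chosen move)
    """
    hits = 0
    for scores, target in zip(logits, targets):
        # Indices of the k highest-scoring moves for this position.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        hits += target in ranked[:k]
    return hits / len(targets)


# Toy example with 3 positions and 3 candidate moves each.
toy_logits = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
toy_targets = [1, 1, 0]
top1 = top_k_accuracy(toy_logits, toy_targets, k=1)  # 1 of 3 positions correct
top2 = top_k_accuracy(toy_logits, toy_targets, k=2)  # 2 of 3 positions correct
```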

Architecture

ChessTransformer200M (~204M params)
├── FusedBoardEncoder (embed_dim=256)
├── Linear projection (256 → 1024)
├── CLS token + positional embeddings (68 positions)
├── TransformerEncoder (16 layers, 16 heads, FFN 4096, GELU, norm_first)
├── LayerNorm
├── SpatialPolicyHead (head_dim=512) → 1968 moves
└── Value head (1024 → 512 → 3 WDL)
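
As a rough sanity check on the ~204M figure, the parameter count of the encoder stack can be estimated from the dimensions in the tree. The sketch assumes a standard pre-norm encoder layer with biases on every linear layer, and a value head made of two plain linear layers; neither detail is confirmed by the card. The FusedBoardEncoder and SpatialPolicyHead are left out because their internal sizes are not stated.

```python
D_MODEL, FFN_DIM, NUM_LAYERS = 1024, 4096, 16


def encoder_layer_params(d=D_MODEL, ff=FFN_DIM):
    """Parameters in one encoder layer (assumed standard layout with biases)."""
    attention = 4 * d * d + 4 * d        # Q, K, V and output projections
    feed_forward = 2 * d * ff + ff + d   # two linear layers
    layer_norms = 2 * (2 * d)            # two LayerNorms, weight + bias each
    return attention + feed_forward + layer_norms


STACK_PARAMS = NUM_LAYERS * encoder_layer_params()  # 201,539,584

# Other components whose sizes follow directly from the tree.
OTHER_KNOWN = {
    "projection_256_to_1024": 256 * 1024 + 1024,
    "cls_token": 1024,
    "positional_embeddings": 68 * 1024,
    "final_layer_norm": 2 * 1024,
    "value_head": (1024 * 512 + 512) + (512 * 3 + 3),
}

print(f"encoder stack:   {STACK_PARAMS:,}")
print(f"other (known):   {sum(OTHER_KNOWN.values()):,}")
```

Under these assumptions the 16-layer stack alone accounts for about 201.5M parameters, so nearly all of the model's capacity sits in the transformer layers.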

Files

best_model.pt — best checkpoint (state_dict only)
training_log.json — full eval history
config.json — training configuration
train.log — aggregated worker logs

Usage

from huggingface_hub import hf_hub_download
import torch

path = hf_hub_download("avewright/chess-transformer-200m-v2", "best_model.pt")
state_dict = torch.load(path, map_location="cpu", weights_only=True)
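# best_model.pt holds only a state_dict, so build the ChessTransformer200M
# module first, then load the weights with model.load_state_dict(state_dict).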
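
Because the checkpoint is a bare state_dict, it can be inspected before any model class is available. The helper below is a small sketch for that purpose; the name `summarize_state_dict` is hypothetical, and it only assumes that each value exposes `numel()`, as PyTorch tensors do.

```python
from collections import Counter


def summarize_state_dict(state_dict):
    """Return (total parameter count, counts grouped by top-level module name)."""
    by_module = Counter()
    for name, tensor in state_dict.items():
        # "encoder.layers.0.linear1.weight" is grouped under "encoder".
        by_module[name.split(".")[0]] += tensor.numel()
    return sum(by_module.values()), dict(by_module)
```

Calling `summarize_state_dict(state_dict)` on the loaded checkpoint should report a total close to 204M if the file matches this card, and the grouped counts show which key prefixes a compatible ChessTransformer200M definition needs to provide.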
