A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents Paper • 2602.08964 • Published 22 days ago • 1
Faithful Persona-based Conversational Dataset Generation with Large Language Models Paper • 2312.10007 • Published Dec 15, 2023 • 11
Language Models Change Facts Based on the Way You Talk Paper • 2507.14238 • Published Jul 17, 2025 • 1
Demographic Probing of Large Language Models Lacks Construct Validity Paper • 2601.18486 • Published Jan 26 • 1
Article 🪄 Interpreto: A Unified Toolkit for Interpretability of Transformer Models Jan 20 • 37
GIM: Improved Interpretability for Large Language Models Paper • 2505.17630 • Published May 23, 2025 • 1
Sparse Auto-Encoders (SAEs) for Mechanistic Interpretability Collection A compilation of sparse auto-encoders trained on large language models. • 37 items • Updated Dec 16, 2025 • 24
Accumulating Context Changes the Beliefs of Language Models Paper • 2511.01805 • Published Nov 3, 2025 • 2
🧩 Word games Collection A collection of resources for word games in various languages • 16 items • Updated Sep 24, 2025 • 2
Latent Reasoning in LLMs as a Vocabulary-Space Superposition Paper • 2510.15522 • Published Oct 17, 2025 • 3
Interpreting Language Models Through Concept Descriptions: A Survey Paper • 2510.01048 • Published Oct 1, 2025 • 2
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? Paper • 2507.08802 • Published Jul 11, 2025 • 1
Hallucination Probes Collection https://arxiv.org/abs/2509.03531 • 5 items • Updated Oct 15, 2025 • 2