AI & ML interests

None defined yet.

Recent Activity

KingNish
posted an update 20 days ago
Muon vs MuonClip vs Muon+AdamW

Muon has gone from an experiment to a mainstream optimizer, but does it hold up for fine-tuning? We ran head-to-head tests on Qwen3-4B (10k+ high-quality instruction rows) to find out.

Short story: Pure Muon converged fastest at the start, but its gradient-norm spikes made training unstable. MuonClip (Kimi K2's clipping) stabilizes long pretraining runs, yet in our small-scale fine-tune it underperformed: lower token accuracy and slower convergence. The winner was the hybrid: Muon for 2D layers + AdamW for 1D layers. It delivered the best balance of stability and final performance and even beat vanilla AdamW.

Takeaway: for small-scale fine-tuning, hybrid = practical and reliable.
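The hybrid recipe boils down to partitioning parameters by dimensionality: 2D weight matrices go to Muon, 1D tensors (biases, norms) go to AdamW. A minimal sketch of that split, using hypothetical parameter names and plain Python shapes (an actual Muon implementation would consume the resulting groups):

```python
def split_param_groups(named_shapes):
    """Partition parameters for a hybrid optimizer:
    2D+ weight matrices -> Muon, 1D params (biases, norms) -> AdamW."""
    muon, adamw = [], []
    for name, shape in named_shapes:
        (muon if len(shape) >= 2 else adamw).append(name)
    return muon, adamw

# Hypothetical layer names/shapes, for illustration only.
params = [
    ("mlp.up_proj.weight", (11008, 4096)),   # 2D -> Muon
    ("mlp.up_proj.bias", (11008,)),          # 1D -> AdamW
    ("input_layernorm.weight", (4096,)),     # 1D -> AdamW
]
muon_params, adamw_params = split_param_groups(params)
```

Each list would then be passed as a separate parameter group to its optimizer.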

Next Step: scale to larger models/datasets to see if Muon's spikes become catastrophic or if clipping wins out.

Full Blog Link: https://huggingface.co/blog/KingNish/optimizer-part1
KingNish
posted an update 22 days ago
mrfakename
posted an update 26 days ago
Excited to share that I've joined the Hugging Face Fellows program! 🤗

Looking forward to contributing to & working more closely with the open-source ecosystem - huge thanks to everyone who's supported me on this journey! 🚀
nroggendorff
posted an update about 1 month ago
I am now being charged for paused and unstarted spaces out of the blue.
I think this is it, folks. o7


The unstarted spaces I can get behind. I would've appreciated a warning email first, but whatever. However, every time I restart, the active usage goes up, despite all of my spaces having been moved to CPU (free) and paused.
nroggendorff
posted an update about 2 months ago
Developing with ZeroGPU without a PRO account is painful. They give you so many requests at once, but then have like a 24-hour cooldown. I vote for fewer requests per batch, but a shorter cooldown.


or just less of a cooldown, but I understand if that is not allowed
lunarflu
posted an update about 2 months ago
lunarflu
posted an update about 2 months ago
The new King 👑 has arrived!

Moonshot AI is now the top model on Hugging Face 🔥
moonshotai/Kimi-K2-Thinking
lunarflu
posted an update about 2 months ago
💸🤑 You don't need 100 GPUs to train something amazing!

Our Smol Training Playbook teaches you a better path to world-class LLMs, for free!

Check out the #1 trending space on 🤗:
HuggingFaceTB/smol-training-playbook
mrfakename
posted an update 2 months ago
Trained a model for emotion-controllable TTS based on MiMo audio on LAION's dataset.

Still very early, and it does have an issue with hallucinating, but results seem pretty good so far given how early it is in the training run.

Will probably kick off a new run later with some settings tweaked.

Put up a demo here: https://huggingface.co/spaces/mrfakename/EmoAct-MiMo

(Turn 🔊 on to hear audio samples)
nroggendorff
posted an update 2 months ago
Is it hot in here, or is it just me?
merve
posted an update 2 months ago
deepseek-ai/DeepSeek-OCR is out! 🔥 my take ⤵️
> pretty insane it can parse and re-render charts in HTML
> it uses CLIP and SAM features concatenated, so better grounding
> very efficient vision-tokens-to-performance ratio
> covers 100 languages
nroggendorff
posted an update 2 months ago
I love getting emails telling me when there's somebody else's active access token in one of my commit SHAs. HF should really only tell you if it is your token, otherwise I could just make a dataset with a bunch of random strings and wait for a valid token.
user,permission,token
nroggendorff,write,hf_...
pepper13,finegrained,hf_...
...,...,...
...

Also, don't comment about how unlikely this is. I've gotten a warning email about a token I 'leaked' at least four times.
In all cases, it has been in the digest hash.
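The scan that triggers those emails presumably just pattern-matches token-shaped strings anywhere in commit content, which is exactly why random strings in a dataset would get flagged. A minimal sketch of that kind of scan (the regex is a guess at the general shape of `hf_` tokens, not Hugging Face's actual scanner):

```python
import re

# Hypothetical pattern: 'hf_' followed by a long run of alphanumerics.
TOKEN_RE = re.compile(r"hf_[A-Za-z0-9]{20,}")

def find_token_like_strings(text):
    """Return every substring that merely *looks* like an access token."""
    return TOKEN_RE.findall(text)

# Any token-shaped string in a commit gets flagged, valid or not.
blob = "commit abc123\nuser,permission,token\npepper13,finegrained,hf_" + "x" * 34
hits = find_token_like_strings(blob)
```

Without checking validity or ownership server-side, a scanner like this cannot distinguish a real leaked token from filler data, which is the complaint above.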
m-ric
posted an update 3 months ago
Tokenization is one of the most important processes in AI - yet many would like to kill it 💀

What's tokenization? The neural networks inside LLMs actually only process numbers, not text: tokenization is the process that makes text readable for them, by converting sentences into lists of numbers.

โžก๏ธ For instance, "This is tokenization" would be split into "This | is | token | ization", then each of the parts (tokens) are converted to IDs according to a predefined mapping: for instance "ization" could map to id 2438.
Thus "This is tokenization" can become 1335 | 135 | 2980 | 2438 => now the model can process the sentence!

Most tokenizers today use pre-specified mappings called "vocabularies", generally built with the compression algorithm Byte-Pair Encoding (BPE), which learns from a large corpus of text an optimized split that efficiently encodes any text from the same distribution into a list of token IDs.
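One BPE training step, counting adjacent-pair frequencies and merging the most frequent pair into a new symbol, can be sketched on a toy corpus (this is the core idea only, not a production tokenizer):

```python
from collections import Counter

def most_common_pair(tokens):
    """Count adjacent symbol pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge(tokens, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

corpus = list("aababab")
pair = most_common_pair(corpus)  # ('a', 'b') is the most frequent pair
corpus = merge(corpus, pair)     # 'a'+'b' becomes one symbol 'ab'
```

Real BPE repeats this merge step thousands of times, and the learned merge list becomes the vocabulary.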

🤨 Now, these current tokenizers have flaws.
For instance, the rigidity of their mapping creates losses; the prime example being that a tokenizer designed for English (thus optimized for tokens like "has", "been", "clock", etc.) will not have the right tokens to handle Burmese, and will be terribly inefficient at it.

Many alternative approaches have emerged as a result: for instance, "tokenizer-free tokenizers". One that I really liked is "entropy-based": it monitors the stream of text and triggers a split whenever the entropy increases too much, i.e. when something "surprising" happens.
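The entropy-based idea can be sketched with a toy character model: estimate per-character surprisal from unigram frequencies and start a new chunk wherever surprisal jumps above a threshold (the model and threshold here are illustrative stand-ins for a real learned language model):

```python
import math
from collections import Counter

def surprisal(text):
    """Per-character surprisal -log2 p(c) under a unigram frequency model."""
    freq = Counter(text)
    total = len(text)
    return [-math.log2(freq[c] / total) for c in text]

def entropy_split(text, threshold):
    """Start a new chunk whenever a character's surprisal exceeds threshold."""
    chunks, current = [], ""
    for c, s in zip(text, surprisal(text)):
        if current and s > threshold:
            chunks.append(current)
            current = ""
        current += c
    chunks.append(current)
    return chunks

chunks = entropy_split("aaaabaaaab", 2.0)  # the rare 'b' triggers each split
```

In the real method the probabilities come from a small language model over the byte stream rather than raw frequencies, but the splitting rule is the same.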

But this great article argues that tokenizers are a lesser evil. Read and decide for yourself!
https://huggingface.co/blog/catherinearnett/in-defense-of-tokenizers
Severian
posted an update 3 months ago
New Technique to Deeply Poison AI on Images and Prove Creative Provenance

I've developed a new method to protect creative work from unauthorized AI training. My Poisonous Shield for Images algorithm embeds a deep, removal-resistant poison into the mathematical structure of your images. It's designed to be toxic to machine learning models, achieving 20-348% disruption in AI training convergence in benchmark tests.

Unlike traditional watermarks, this protection survives compression and resizing and is not removed by standard tools. The technique also embeds cryptographic proof of provenance directly into the image, verifying ownership and detecting tampering.

You can see examples and learn more about how and WHY it works better than current methods:

https://severian-poisonous-shield-for-images.static.hf.space

If you are interested in using this technology to protect your work from AI training and unauthorized use, please reach out to me. It is currently in the prototype phase but fully functioning and effective. Still working on expanding it to a production-grade usable app.

This is not intended as a pure self-promotion post. I genuinely want to help creators and to gauge interest from different communities. I've spent the past year and a half building this from scratch, with new math and code, to try to solve this massive problem.
m-ric
posted an update 3 months ago
STOP EVERYTHING NOW - we might finally have a radical architecture improvement over Transformers!!! 🚨

A lone scientist just proposed Tiny Recursive Model (TRM), and it is literally the most impressive model that I've seen this year.

โžก๏ธ Tiny Recursive Model is 7M parameters
โžก๏ธ On ARC-AGI, it beats flagship models like Gemini-2.5-pro

Consider how wild this is: Gemini-2.5-pro must be over 10,000x bigger
and had 1,000x as many authors 😂 (Alexia is alone on the paper)

What's this sorcery?
In short: it's a very tiny Transformer, but it loops over itself at two different frequencies, updating two latent variables: one for the proposed answer and one for the reasoning.

@AlexiaJM started from the paper Hierarchical Reasoning Model, published a few months ago, which already showed breakthrough improvements on ARC-AGI for its small size (27M)

Hierarchical Reasoning Model had introduced one main feature:
🔎 Deep supervision
In their model, one part (here one layer) would run at high frequency, and another would be lower frequency, running only every n steps.

They had used a recurrent architecture, where these layers would repeat many times; but to make it work they had to make many approximations, including not fully backpropagating the loss through all layers.

Alexia studied what was useful and what wasn't, and cleaned the architecture as follows:
Why use a recurrent architecture, when you can just make it a loop?
➡️ She made the network recursive, looping over itself

Why use 2 latent variables?
➡️ She provides a crystal clear explanation: the one that changes frequently is the reasoning, the one that changes at low frequency is the proposed answer.
➡️ She runs ablation studies to validate that 2 is indeed optimal.

This new setup is a much more elegant way to process reasoning than generating huge chains of tokens as all flagship models currently do.

This might be the breakthrough we've been awaiting for so long!