I hope all 9511 of you had a great week 🔥
Welcome to the 25 new members of the TOC community.
If you're enjoying my writing, please share Terminally Onchain with your friends in Crypto!
Hi TOC fam, happy Friday!
Great news. I'm really starting to understand this decentralized AI stuff.
Not saying that I'm a subject matter expert just yet, but I've been locked in the last two weeks on all the material I can possibly find on this vertical and have been reading non-stop.
I was talking to my friend (and TOC subscriber!) Derek from Collab Currency about how there are probably fewer than 50 people in the world who actually care about DeAI. He said something that hooked me even more: "there's a huge asymmetry between attention and fundamentals. Maybe one of the largest I've seen in my career."
We also discussed how it's only a matter of time before the traditional open source AI community realizes they'll have to depend on crypto for scaling their models. In my Crypto enabled accelerationism bubble post from November, I mentioned the following:
Crypto needs its "poster children".
The real fun will start when the cream of the crop developers and researchers from other tech verticals take note that their chances of success are higher through this new crypto-agentic enabled model.
DeAI will need its "poster children" to showcase to the rest of the AI world that these experiments actually work at some reasonable capacity. Projects like Nous and Prime Intellect can "de-risk DeAI quickly if others can see these ideas in production."
And it's clear that we're getting close.
For example, this is a timeline of released models kept up to date by Hugging Face. It was cool to see that they added in Hermes 3 by Nous (check the August 2024 column).
And in a recent interview, Dylan Patel (founder of Semianalysis) mentioned the following:
Also, it's worth noting that the speed of shipping from these DeAI teams is increasing a lot more quickly than people realize. Just in the last week...there were announcements from:
Prime launching SYNTHETIC-1 dataset
Bittensor going through its huge dTAO upgrade today
Nous launching DeepHermes-3
I won't get into the details of these today since I don't think any of them are time sensitive (though expect extreme volatility and sell pressure on the Bittensor subnet tokens at the beginning), so we'll save them for next week.
With that being said, let's dive into today's post.
Last Friday, I gave all of you a framework on how to think about crypto x AI. Here's a refresher:
Today, we're going to zoom in on the bottom-left quadrant and break down the decentralized AI efforts happening.
There were 3 main reads that were game changers in terms of getting me up to speed on DeAI. They're technical and dense but I'll do my best to help all of you understand the overarching narrative and the key points to keep in mind.
Here are the posts I'm referring to:
Decentralized Training by Naman Kapasi
Scaling through Decentralization by Ronan
Frontier Training by Sam Lehman
Let's dive in.
At its core, decentralized AI is not a new concept. The idea is basically to build a world computer (sound familiar?) with no central node, fault tolerant, that can train a state-of-the-art (SOTA) model.
It's worth remembering that, measured by raw hashing power, Bitcoin is the world's largest compute network by multiple orders of magnitude. We've technically proved it's possible to build a decentralized network that combines compute + economic incentives + utility. If we can do it for "digital gold", why not AI?
Of course, there will be a different set of challenges just like Ethereum vs Bitcoin. But these ideas are all variants of each other serving different real world assets (money, programmability, intelligence).
The big unlock that slowly formed from 2017 (release of transformers paper) to 2019 (gpt-2 launch) was that it's possible to make these "assistants smarter" by adding more compute and scaling up the underlying models.
This marked the start of the GPU gold rush, and the AI space started shifting toward closed source as pre-training became the competitive moat.
And since ChatGPT launched in 2022, we've seen unfathomable numbers on how much the largest tech companies are spending on data centers (e.g. Meta, xAI, Microsoft, Google).
This Dylan Patel piece breaks down how much goes into these data centers on the backend and how the hundreds of billions of dollars will be spent.
Now the key part to understand is that it's not just about who has the most GPUs. Rather, the north star is how well these teams set up their data centers to maximize efficiency. This can be roughly calculated by:
MFU (model flop utilization) / PUE (power usage effectiveness) = total cluster efficiency
This accounts for GPU failures, continuity of systems, etc. For reference, Llama 3 had an MFU of ~40%.
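To make that formula concrete, here's a tiny back-of-envelope sketch in Python. The 40% MFU is the Llama 3 number from above; the PUE value is just an assumption I picked for illustration, not a reported figure.

```python
# Rough back-of-envelope sketch of the cluster-efficiency framing above.
# The 40% MFU figure is the Llama 3 reference from the text; the PUE value
# is an illustrative assumption, not a reported number.

def cluster_efficiency(mfu: float, pue: float) -> float:
    """Share of the facility's total power that ends up as useful model FLOPs.

    mfu: model FLOP utilization (0-1), fraction of peak GPU FLOPs actually used
    pue: power usage effectiveness (>= 1), total facility power / IT power
    """
    return mfu / pue

if __name__ == "__main__":
    mfu = 0.40   # ~Llama 3, per the text above
    pue = 1.25   # hypothetical facility overhead
    print(f"Total cluster efficiency: {cluster_efficiency(mfu, pue):.0%}")
```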
As these models get exponentially larger, these optimizations become crucial. This leads into the topic of parallelism, which splits the training workload across many GPUs at once.
There are three main types: data, tensor, and pipeline. I won't go into detail here, but just know that most of these big tech data center setups use a combo of all three, known as 3D parallelism.
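If it helps to visualize, here's a rough sketch (my own illustration, not any framework's real API) of how a fixed GPU budget gets factored across the three axes:

```python
# Minimal sketch of how a fixed GPU budget gets factored across the three
# parallelism axes ("3D parallelism"). Purely illustrative numbers; real
# frameworks tune these degrees per model and per cluster.
import itertools

TOTAL_GPUS = 1024  # hypothetical cluster size

def valid_3d_layouts(total_gpus):
    """Yield (data, tensor, pipeline) degrees whose product uses every GPU."""
    for tensor in (1, 2, 4, 8):            # tensor parallelism: split each layer's matrices
        for pipeline in (1, 2, 4, 8, 16):  # pipeline parallelism: split the stack of layers
            if total_gpus % (tensor * pipeline) == 0:
                data = total_gpus // (tensor * pipeline)  # data parallelism: replicate the model, split the batch
                yield data, tensor, pipeline

for dp, tp, pp in itertools.islice(valid_3d_layouts(TOTAL_GPUS), 5):
    print(f"data={dp:4d}  tensor={tp}  pipeline={pp:2d}  -> {dp*tp*pp} GPUs")
```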
This brings us to DeAI's first challenge which is distributed training. Centralized data centers have the advantage that their hardware is co-located so GPU communication is very fast.
Nvidia NVLink connects GPUs at speeds of 1,800 GB/sec, while a normal internet connection is ~500 MB/sec.
If you don't know how pre-training works, here's an eli5 of why GPUs need to communicate.
Let's say you have 10 batches of data to train the model on. After each batch, the model needs to update its weights based on how poorly it did (the loss function). You may have heard of gradient descent before? Basically, the goal is to iteratively follow the gradient downhill so the loss keeps dropping after each data pass.
Now, remember, there are tens of thousands of GPUs that all need to sync up after each pass...you can imagine how that gets heavy quickly.
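Here's a toy numpy version of that loop (linear regression standing in for an LLM, purely my simplification) to show where the syncing actually happens:

```python
# Toy numpy simulation of why data-parallel GPUs must sync after every pass:
# each worker computes a gradient on its own shard, then all gradients are
# averaged (the "all-reduce") before anyone updates weights. The model here
# is just linear regression so the gradient math stays readable.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])
X = rng.normal(size=(1000, 2))
y = X @ true_w + 0.1 * rng.normal(size=1000)

NUM_WORKERS = 4                      # stand-ins for GPUs
shards = np.array_split(np.arange(len(X)), NUM_WORKERS)

w = np.zeros(2)                      # every worker holds the same copy
lr = 0.1
for step in range(50):
    # 1) each worker computes a local gradient on its shard (cheap, parallel)
    local_grads = []
    for idx in shards:
        pred = X[idx] @ w
        grad = 2 * X[idx].T @ (pred - y[idx]) / len(idx)
        local_grads.append(grad)
    # 2) the all-reduce: average gradients across ALL workers -- this is the
    #    communication step that gets heavy with tens of thousands of GPUs
    avg_grad = np.mean(local_grads, axis=0)
    # 3) everyone applies the same update so the copies stay identical
    w -= lr * avg_grad

print("recovered weights:", np.round(w, 2))   # ~[2.0, -3.0]
```

Step 2 is the part that hurts at scale: every worker has to exchange gradients before anyone can move on.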
So, the main question for distributed teams becomes: how do we reduce the communication needs?
Is there a way to not have every single GPU sync up in the system after each run?
Four parts in this section.
Prime & DiLoCo
In my opinion, the key breakthrough for distributed training was when Google released their DiLoCo paper in November 2023. @Ar_Douillard is a fantastic follow to keep up with updates here.
The key learning from that paper is inner-outer optimization which reduces GPU communication needs by 500x!
Inner refers to a single node of GPUs making quick, local updates. Outer refers to less frequent, system-wide updates between nodes. Think: intra-node vs inter-node.
From here, the Prime Intellect team decided to implement this paper at scale. They used the Hivemind library and published OpenDiLoCo, which also included a fault tolerance mechanism known as the "heartbeat". Basically, if any hardware doesn't send a beat, it gets removed from the system. This enables hardware to come in and out of the network.
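Here's a minimal numpy sketch of the inner-outer idea plus a crude heartbeat check. To be clear, this is not the OpenDiLoCo/Hivemind code; the toy loss, node count, and learning rates are all assumptions I made for illustration.

```python
# Minimal numpy sketch of the DiLoCo-style inner-outer idea plus a
# heartbeat-style liveness check. NOT the OpenDiLoCo/Hivemind code;
# the loss, node count, and hyperparameters are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(1)
dim = 8
global_w = rng.normal(size=dim)          # shared "outer" weights
target = np.zeros(dim)                   # toy objective: pull weights toward zero

INNER_STEPS = 50      # many cheap local steps between syncs
OUTER_STEPS = 10      # few expensive cross-node syncs
NODES = 4
inner_lr, outer_lr = 0.05, 0.7

def local_grad(w):
    return 2 * (w - target)              # gradient of ||w - target||^2

for outer in range(OUTER_STEPS):
    # heartbeat: nodes that miss a beat are simply dropped from this round
    alive = [n for n in range(NODES) if rng.random() > 0.1]
    if not alive:
        continue
    deltas = []
    for _ in alive:
        w = global_w.copy()
        for _ in range(INNER_STEPS):     # inner loop: local-only, no network traffic
            w -= inner_lr * local_grad(w)
        deltas.append(global_w - w)      # "pseudo-gradient" = how far this node moved
    # outer loop: one averaged update per sync, ~INNER_STEPS x less communication
    global_w -= outer_lr * np.mean(deltas, axis=0)

print("distance to target:", np.linalg.norm(global_w - target))
```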
They launched INTELLECT-1 in November 2024, a 10B parameter model trained with OpenDiLoCo. It's amazing to see that @vincentweisser and team actually applied this Google research at scale. This model release is a "poster-child" moment for DeAI.
Nous & DeMo
The second approach is led by @NousResearch and their DeMo & DisTrO work. This method focuses on decoupled momentum optimization. Remember how we discussed the eli5 on changing gradients after every data pass?
I won't get into details here but optimizers are used to update the model weights to minimize the loss function (margin of error).
Most training methods today use the AdamW optimizer, a descendant of the Adam optimizer that Diederik Kingma co-introduced back in 2014.
Kingma worked with the Nous team (@bloc97_ & @theemozilla) to improve on the AdamW optimizer and use the discrete cosine transform (DCT) to decouple fast and slow momentum changes.
The core concept here is that it's possible to split the importance of gradient changes into fast & slow. There are some changes that need to be immediately communicated and others that can be done over time.
It turned out that the gradient compression solution ended up being as good as, if not better than, models that just used AdamW + all-reduce. And most importantly, there was an 857x reduction (nearly three orders of magnitude) in communication needs!
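For intuition, here's a rough sketch of the fast/slow split: DCT the gradient, send only the top few components now, and keep the rest as a locally accumulated residual. It's a big simplification of the DeMo/DisTrO approach, not the paper's actual algorithm, and the 5% "fast" fraction is an assumption.

```python
# Rough illustration of the fast/slow split: take the DCT of a gradient,
# transmit only the few highest-energy components now, and keep the rest as a
# locally-accumulated residual. A simplification of the DeMo/DisTrO idea, not
# the paper's algorithm; the 5% "fast" fraction is an assumption.
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(2)
grad = rng.normal(size=1024)              # pretend this is one layer's gradient
residual = np.zeros_like(grad)            # slow momentum kept on the local node

def compress_step(grad, residual, keep_frac=0.05):
    total = dct(grad + residual, norm="ortho")
    k = int(keep_frac * total.size)
    top = np.argsort(np.abs(total))[-k:]  # highest-energy ("fast") components
    fast = np.zeros_like(total)
    fast[top] = total[top]                # only this sparse vector goes over the wire
    new_residual = idct(total - fast, norm="ortho")  # everything else stays local
    return fast, new_residual

fast, residual = compress_step(grad, residual)
sent = np.count_nonzero(fast)
print(f"sent {sent} of {grad.size} coefficients "
      f"(~{grad.size / sent:.0f}x fewer values per sync)")
```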
The Nous team also announced their 15B parameter model in December 2024 - the second poster child for DeAI.
SWARM Parallelism
The third main approach is SWARM parallelism which heavily relies on pipeline parallelism. You know the classic picture of a neural net which has columns of circles that go from left to right? Each of those columns is a layer and pipeline parallelism basically splits the training vertically by layers.
So some GPUs are focused on the initial inputs, some are focused on the middle blackbox layers, and others are focused on the weights and layers closest to the output.
As models get super large, this approach is fantastic as it removes the GPU RAM bottleneck.
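Here's a tiny sketch of the idea: cut the layer stack into stages so each "device" only keeps its own slice of layers in memory and just hands activations to the next one. The stage count and layer sizes are arbitrary illustrations, not anything from the paper.

```python
# Tiny sketch of pipeline parallelism: the layer stack is cut into stages and
# each "device" only holds (and needs RAM for) its own slice of layers.
# Activations are what get handed from stage to stage.
import numpy as np

rng = np.random.default_rng(3)
NUM_LAYERS, HIDDEN = 12, 64
layers = [rng.normal(scale=0.1, size=(HIDDEN, HIDDEN)) for _ in range(NUM_LAYERS)]

NUM_STAGES = 4                                  # "devices" in the pipeline
stages = np.array_split(np.arange(NUM_LAYERS), NUM_STAGES)

def run_stage(stage_layers, activations):
    """One device's share of the forward pass: only its own layers in memory."""
    for W in stage_layers:
        activations = np.tanh(activations @ W)
    return activations

x = rng.normal(size=(8, HIDDEN))                # a micro-batch of activations
for stage_id, layer_ids in enumerate(stages):
    x = run_stage([layers[i] for i in layer_ids], x)
    # in a real setup, x would now be sent over the network to the next device
print("output shape after all stages:", x.shape)
```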
However, as Sam puts it concisely, "the key innovation with SWARM is their approach to handling the flow of data through the network of devices in a messy training network".
Basically, the architecture manages all the different kinds of hardware (that all have separate compute power) during a run.
I still need to dive deeper into how the routing mechanism works, but one of the most interesting discoveries by @m_ryabinin, @Tim_Dettmers, and team was the square-cube law of distributed training, which notes that compute time grows much faster than communication time as model size increases.
This is a huge win for distributed model training because the biggest bottleneck of quick GPU communication becomes less important.
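Rough arithmetic behind that intuition (illustrative numbers, not figures from the paper): for a dense layer, compute scales with the square of the hidden size while the activations you have to ship scale roughly linearly, so the compute-to-communication ratio keeps improving as models grow.

```python
# Rough arithmetic behind the square-cube-style intuition: for a dense layer
# with hidden size d and batch b, compute scales with b*d^2 while the
# activations that must cross the network scale with b*d, so the
# compute-to-communication ratio keeps improving as d grows.
# Illustrative numbers only.
BATCH = 32

for hidden in (1_024, 4_096, 16_384, 65_536):
    flops = 2 * BATCH * hidden**2          # one matmul's worth of compute
    comm_values = BATCH * hidden           # activations handed to the next stage
    print(f"hidden={hidden:>6}  compute/comm ratio = {flops / comm_values:,.0f}")
```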
Non-Transformer Architecture Routes
I haven't looked into any of the following but just noting them since Naman included them in his post. There are some alternative architectures (mostly sparse / mixture-of-experts style rather than the standard dense transformer) also trying to make progress in this vertical. Some examples he lists:
- Mixtral
- Switch transformers
- DiPaCo (distributed path composition)
All the approaches discussed above in reducing GPU communications are basically different optimizations on the current centralized pre-training process.
The part that stood out to me most - and something I think should be the key takeaway of this post - is that the real magic will come when these approaches are combined together to immediately 100x or even 1000x the training efficiency.
For example, Distro + DiLoCo or SWARM + DiLoCo.
It seems like the teams are actively exploring each other's work, and we'll probably see a lot more of these optimization permutations in 2025.
One big callout here is to differentiate two terms I often see incorrectly interchanged. Distributed just means the hardware is not co-located...even big tech companies like Microsoft are working on this.
Decentralized specifically means that there's no central node and the training process is fault tolerant - exactly what the Bitcoin network preaches.
So what are the specific needs for decentralized training to happen?
This gets into the topic of security, encryption, and verification.
Again, details are out of scope rn but wanted to include a few examples that Sam included in his post:
- Zero knowledge proofs (@ezklxyz)
- Fully homomorphic encryption
- Multi party computation
- Trusted execution environments (@PhalaNetwork, Flashbots, @sxysun1)
Incentive Alignment
I've compared decentralized AI networks to Bitcoin a couple of times in this post.
It's worth noting that the same financial incentive mechanisms will need to apply here as well.
- slashing mechanisms to penalize malicious actors (toy sketch after this list)
- a huge reduction in up-front cost, with hardware providers betting on the long-term revenues and appreciation of network ownership
- network effects as markets form around a liquid model token
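As promised above, here's a toy sketch of what a slashing check could look like. Everything in it (stake sizes, the slash fraction, the verification stand-in) is hypothetical and not modeled on any live protocol's design.

```python
# Toy illustration of the slashing bullet above. Stake sizes, the slash
# fraction, and the "verification" rule are all hypothetical.
from dataclasses import dataclass

SLASH_FRACTION = 0.10   # assumed penalty for a provably bad result

@dataclass
class Provider:
    name: str
    stake: float          # tokens locked as collateral for honest work

def verify(result_ok: bool) -> bool:
    """Stand-in for a real verification step (ZK proof, TEE attestation, etc.)."""
    return result_ok

def settle(provider: Provider, result_ok: bool) -> None:
    if verify(result_ok):
        return                                   # honest work keeps its full stake
    penalty = provider.stake * SLASH_FRACTION    # bad work gets slashed
    provider.stake -= penalty
    print(f"{provider.name} slashed {penalty:.1f} tokens, {provider.stake:.1f} remaining")

node = Provider("gpu-node-1", stake=100.0)
settle(node, result_ok=False)
```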
@_AlexanderLong from @PluralisHQ has some great insights here on protocol models (post linked below as well).
Also worth noting that Nous Research recently announced Nous Psyche, which aims to bring these verification and incentive mechanisms to life in partnership with Solana.
Since the DeepSeek craziness from a month ago, the next big thing in AI seems to be researching new reinforcement learning (RL) techniques and using chain of thought (CoT) to let models reason through problems at inference time.
DeepSeek showed us that it's possible to match current model performance benchmarks without the same hardware requirements by doing other kinds of optimizations.
We could go into a lot more depth here, but for the scope of this post I want to call out one thing: if the next big thing in AI really is moving from pre-training scaling to RL & inference-time compute, then the uphill battle of compute constraints that decentralized AI faces somewhat goes away.
I'm not well versed here so don't want to make any claims just yet but Ronan had a fantastic sentence in his post I wanted to copy over:
"The newest vector to scale model performance — reasoning and inference-time compute — naturally enjoys this described communication reduction. There is no inherent disadvantage to a distributed approach: the importance of this paradigm shift for distributed (and decentralized) training cannot be understated."
That's all I have for today's post!
I hope you all have a great weekend and I'll see you next week 🤝
- YB