jimbo_joe 26 minutes ago [-]
The blog post nails the problem well: with distributed training handled by many teams, the problem becomes organizational rather than tech-based. Having a technology solution to more easily persist pipeline and artifact configs is a good idea, but it only yields an incremental improvement. IMO this can only be solved with a shared culture around experimentation, which has to be maintained and reinvented as the company (and hence the teams responsible for training) grows.
Basically, if every layer of the stack, from tooling to training to applied, is aiming for repeatable training with predictable capability improvements that end users can see, then it's more likely to happen.
itsdesmond 15 minutes ago [-]
I speak from ignorance, having not worked on this type of thing: are people NOT doing this? Are folks… what, passing CSVs and clicking GUIs to kick off one-off expensive, long running training runs? That’s absolutely crazy to me. Test on production head-ass.
mschwaig 58 minutes ago [-]
I've been wondering whether this could be done in Nix, and toyed around with the idea just a little bit using nano-gpt.
Hermeticity always seems to mean isolation, but depending on who you ask it does not always mean computing some sort of hash over all build input as the 'identity' of a particular step in the pipeline, like Nix does.
If you do that hash-based identity part, looking up intermediary results and resuming from there happens using this sort of hash.
Does Savanna do that, or will it resume where it left off based on a less strict notion of identity?
I could see arguments for either approach.
michaelchicory 28 minutes ago [-]
Savanna works in the same way as Nix: each training stage is defined fully in code, so it computes a hash over all inputs (training image, code version, environment, training configuration, etc.) and maps this to the eventual output. When the same inputs come along again, Savanna just reads the output from cache. As the post mentions, this is particularly helpful in sweeps that span multiple training stages.
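For readers who want the idea in concrete terms, here is a minimal sketch of hash-keyed stage caching as described above. It is not Savanna's actual API: `stage_key`, `StageCache`, and the input field names are hypothetical, and the in-memory dict stands in for whatever artifact store a real system would use.

```python
import hashlib
import json


def stage_key(inputs: dict) -> str:
    """Hash every declared input of a stage into one stable identity.

    Assumption: inputs are JSON-serializable. Keys are sorted so that
    dict ordering does not change the hash.
    """
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class StageCache:
    """Hypothetical cache mapping an input hash to a stage output."""

    def __init__(self):
        self._store = {}  # stand-in for a persistent artifact store
        self.runs = 0     # counts how often a stage actually executed

    def run(self, inputs: dict, fn):
        key = stage_key(inputs)
        if key not in self._store:
            # Cache miss: execute the (expensive) stage once and record it.
            self.runs += 1
            self._store[key] = fn(inputs)
        # Cache hit or freshly computed: same inputs give the same output.
        return self._store[key]
```

In a sweep, stages whose inputs are unchanged hash to the same key and are read from the cache, so only the stages downstream of a changed parameter need to run again.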
delichon 12 hours ago [-]
Some good stuff here from Dwarkesh around mashing up training and inference:
He predicts this kind of model factory will become central to organizational learning and operations. Updating and upgrading the model stack becomes the core staff function.
faangguyindia 9 hours ago [-]
Interesting points made in the video.
But models did not become good at coding just because coding is replayable. It’s because there are countless repos, issues, Stack Overflow threads, and Reddit posts/comments/questions where a solution is clearly marked as “solved” or “that helped,” and AI can learn from that feedback.
Being replayable does play a role because a solution can be tested against a compiler, and the resulting errors or lack of errors/warnings can reveal whether it worked.
This becomes much harder in fields like fitness, where changes take much longer and cause and effect relationships are not straightforward to establish.
Your muscle gain increased but was it because you increased protein intake? Or was it because you started eating more carbs, which added more energy to the system?
Once protein needs are already met, calories may become the limiting factor. In that case, the additional gains may come primarily from increased calorie intake rather than the higher protein intake itself.
AI is bad at fitness, evidently.
Many people forget, conversation with a model also generates training data. This is how your problems, algorithms, solutions end up in training data and end up right at your competitors without your competitor trying to actively steal your code.
I simply do not expose core algorithms which improve my product to AI agents.
NitpickLawyer 5 hours ago [-]
> But models did not become good at coding just because coding is replayable. It’s because there are countless repos, issues, Stack Overflow threads, and Reddit posts/comments/questions where a solution is clearly marked as “solved” or “that helped,” and AI can learn from that feedback.
That's at least a 2yo take. Today's gains for SotA (either closed or open models) come from RLVR 100%. The model unrolls many iterations, those iterations get verified w/ tests/known tests/rubrics and the model learns from that (GRPO or similar).
And what's cool about this (and why scale really matters now) is that you can mostly get this process automated (i.e. take a known good repo, ask one agent to remove one feature, keep the tests, ask another model to add that feature back, verify that old tests work on new implementation, repeat). This is why top labs are pulling away in the breadth of their capabilities, compared to open models. It's scale, pure and simple. And the better their models become, the larger the gap due to automating better cases.
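As a rough illustration of the "GRPO or similar" step, the sketch below computes group-relative advantages: several rollouts of the same task are scored by a verifier, and each rollout's reward is normalized against the group's mean and standard deviation. The function name, the `eps` term, and the use of the population standard deviation are assumptions for illustration; actual implementations differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize verifier rewards within one group of rollouts.

    rewards: one score per rollout of the same prompt, e.g. 1.0 if the
    tests passed and 0.0 otherwise (an assumed reward scheme).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Rollouts better than the group average get a positive advantage,
    # worse ones a negative advantage; eps avoids division by zero.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every rollout in a group gets the same reward, all advantages are zero and that group contributes no learning signal, which is one reason tasks of mixed difficulty matter for this kind of training.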
typs 5 hours ago [-]
Some interesting ideas in this, but I think his argument is somewhat undermined by the fact that his main example of computer use has actually gotten much better recently because of RLVR
jaggederest 11 hours ago [-]
I think this is an interesting thing that will happen once the rate of change slows down a little bit - imagine a world where there's more or less a couple base models and everyone trains on top of them, and the bitter lesson is defunct just via sheer physics (maybe we have the best models we can physically run in reasonable energy density substrates, or something), then it becomes "your personal model" with your overlay, training, or feedback on top.
michaelchicory 42 minutes ago [-]
Author here, happy to answer any questions!
virajk_31 49 minutes ago [-]
I doubt if anything can really solve "oops"
SpyCoder77 12 hours ago [-]
What is this "aleph" thing in names now? First aleph neuro, and now aleph alpha.
fxwin 5 hours ago [-]
fwiw aleph alpha have been around since 2019
verelo 11 hours ago [-]
I'm glad you're asking because I've seen it too and don't get it either. I assumed initially it was alpha as a typo, then I Googled it and got even more confused.
boothby 11 hours ago [-]
First letter of the Hebrew alphabet, used by mathematicians to denote infinities.
verelo 11 hours ago [-]
That's what Google told me, but I still don't see how it links to this?
UltraSane 8 hours ago [-]
It is just vibes, man. It sounds cool, nothing more.
akoboldfrying 9 hours ago [-]
It doesn't -- it's marketing, much like adding "Labs" to the end of your company's name. Its association with infinity makes the company sound cooler to potential customers, many of whom are software engineers who consciously or unconsciously view pure mathematics as a prestige "final form" of their own logic-focused mental ability.
random3 12 hours ago [-]
> TL;DR: Model training has grown complex
So they’ve built Savanna, a workflow engine, because the existing zoo of hundreds of workflow engines didn’t cut it :)
joelschw 3 hours ago [-]
No, this is a composability surface. Flyte is their workflow engine for durable execution, etc.
usernametaken29 5 hours ago [-]
I was thinking the same. Airflow does exactly the same thing. The only benefit here is that it’s their little workflow engine so they can get all their little edge cases accounted for…
https://youtu.be/20p5-kQXF_Q?is=72ImTNxkOEKmOXQ9