Wow, it’s amazing that just 3.3% of the training set coming from the same model can already start to mess it up.
Wow, it’s amazing that just 3.3% of the training set coming from the same model can already start to mess it up.
I’ve read some snippets of AI written books and it really does feel like my brain is short circuiting
At least in this case, we can be pretty confident that there’s no higher function going on. It’s true that AI models are a bit of a black box that can’t really be examined to understand why exactly they produce the results they do, but they are still just a finite amount of data. The black box doesn’t “think” any more than a river decides its course, though the eventual state of both is hard to predict or control. In the case of model collapse, we know exactly what’s going on: the AI is repeating and amplifying the little mistakes it’s made with each new generation. There’s no mystery about that part, it’s just that we lack the ability to directly tune those mistakes out of the model.
I guess the question is, what happens to the kernel when all the people who learned on C are gone? The majority of even the brightest new devs aren’t going to cut their teeth on C, and will feel the same resistance to learning a new language when they think that there are diminishing returns to be had compared to what’s new and modern and, most importantly, familiar.
I honestly get the hostility, the fast pace of technology has left a lot of older devs being seen as undesirable because the don’t know the new stuff, even if their fundamental understanding of low level languages could be a huge asset. Their knowledge of C is vast and valuable, and they’re working on a project that thrives because of it. To have new people come to the project and say “Yeah, we could do this without having to worry about all that stuff” feels like throwing away a lot of the skill they’ve built. I’m not sure what the solution is, I really don’t think there are enough new C developers in the world to keep the project going strong into the future though. Maybe a fork is just the way to go; time will tell which is more sustainable.
Permissive licenses mean faster and more widespread adoption, it’s up to project maintainers if the tradeoff is worth it. Ideally a company would realize that an open source part of their project probably isn’t radically going to affect their revenue stream, but you don’t just have to convince devs, you have to convince the suits and lawyers, and they will tell you to just build your own rather than give up any precious IP.
I mean, we’ve seen already that AI companies are forced to be reactive when people exploit loopholes in their models or some unexpected behavior occurs. Not that they aren’t smart people, but these things are very hard to predict, and hard to fix once they go wrong.
Also, what do you mean by synthetic data? If it’s made by AI, that’s how collapse happens.
The problem with curated data is that you have to, well, curate it, and that’s hard to do at scale. No longer do we have a few decades’ worth of unpoisoned data to work with; the only way to guarantee training data isn’t from its own model is to make it yourself