Great article. Since language models only learn the distribution of words and, with certain clever hacks (like attention), how to contextualize it, the next big models will appear to learn basic compositionality on the most common examples, but will fail to generalize to complex unseen ones. There must be a better way than just building larger and larger models to approach AGI, and I think it is important for the field to start exploring alternatives that get us closer before a new AI winter arrives.
Sure, it's a comment on Scott's post: https://astralcodexten.substack.com/p/i-won-my-three-year-ai-progress-bet/comment/9068389
Are the numerous grammatical and "writing" errors in this interesting piece intentional (i.e., purposefully creating dirty data...), or the result of speech-to-text input problems, or a lack of rewrite time? It seems strange to discuss language-centric issues without making it easier for one's readers.
If Musk takes you up on your bet, you should wait till 2029, then ask the AI who wins. If it says "Elon should pay Gary," demand your money. If it says "Gary should pay Elon," point out that it clearly still hasn't solved compositionality, and demand your money.
May I suggest using Grammarly before posting? Without the typos, the article would be even better.
Hi there, I am the mysterious entity known only as Vitor, and I really don't see what my credentials have to do with anything. A bet like this is obviously not meant to settle the question scientifically. Rather, it's a tool that helps people articulate their intuitions in a falsifiable way, and makes it harder to shift the goalposts after the fact. Even a sloppy bet like this provides some sort of signal, though I'd say it's best interpreted as being analogous to the playful, exploratory phase of scientific research, trying to make my intuitions a bit more precise.
Now, in this case I'd agree with you that the bet is too generous towards Scott's side. By taking only the best sample, the AI's job is made too easy. I realized this shortly after I made the bet, but of course I wouldn't try to pull out over something like this.
cheers!
ha ha! next time let me know and i will help you craft a more rigorous bet :)
Vitor, do you agree that Scott Alexander won the bet?
To be clear, I don't think he did. In Scott's images, the cat and the robot are not in a factory and the robot isn't looking at the cat; the llama doesn't have a bell on its tail; and the robot "farmer" is not in a cathedral and doesn't look like a farmer. I assumed that, to win a bet that was about compositionality, all the details in the prompt would have to be right, and unambiguously so, but that is not the case.
As I commented on the original post, I disagree with Scott and am not conceding the bet.
can you link or paste here your objection?
Thank you for clarifying, Vitor. Sorry that I missed your comment on Scott Alexander's post.
The commercial sector requires a constant stream of low-cost, hype-able news. That drains the resources required to actually investigate the issue in depth. Will AI get caught up in a hare-and-tortoise race?
"One of Silicon Valley’s Sharpest Minds"
LOL. Musk has proven that he's not that.
Edit: Oops, I wrote that before reading the whole article. It's true, though.
Sure, the bar was set very low in the bet with Vitor, but Scott Alexander did say "Without wanting to claim that Imagen has fully mastered compositionality....."
i seriously doubt it has even partly “mastered compositionality”; at best it has (roughly) hash-tabled a somewhat larger fraction of cases, without any real comprehension (which is what we see, e.g., in ever-bigger models doing multiplication).
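to make the "hash table" point concrete, here's a toy sketch (purely illustrative; no real model is a literal lookup table, and the numbers below are made up): a system that has only memorized the input/output pairs it was shown answers correctly inside that set and has nothing to say outside it, while a system with the actual procedure generalizes to any operands.

```python
# Toy illustration only: memorizing input/output pairs vs. having the procedure.

# "Training data": products of small numbers only.
train_pairs = {(a, b): a * b for a in range(100) for b in range(100)}

def memorized_multiply(a, b):
    """Lookup-table 'multiplication': only works for pairs it has stored."""
    return train_pairs.get((a, b))  # returns None for anything unseen

def procedural_multiply(a, b):
    """The actual algorithm: works for any operands."""
    return a * b

# In-distribution: both succeed.
print(memorized_multiply(37, 52), procedural_multiply(37, 52))          # 1924 1924

# Out-of-distribution (more digits): the lookup table has nothing to offer.
print(memorized_multiply(1234, 5678), procedural_multiply(1234, 5678))  # None 7006652
```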
Are grammar and spelling errors supposed to make us believe this wasn't actually written by GPT-3? Nice try.
With respect to compositionality, there is a big difference between Dall-E and Imagen. Dall-E is trained by contrastive learning: it matches a set of captions against a set of images, and treating each caption as a simple bag of words is usually enough to do well at that task. Given the way it is trained, it would be shocking if it *could* understand compositional descriptions.
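To make "contrastive" concrete, here is a minimal sketch of a CLIP-style objective (my own illustrative code, not Dall-E's actual training loop): the text encoder only has to make each caption score higher with its own image than with the other images in the same batch, and a rough bag-of-words summary of the caption is often discriminative enough for that.

```python
# Minimal sketch of a CLIP-style contrastive objective (illustrative only).
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim) outputs of separate image/text encoders."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity of every image with every caption in the batch.
    logits = image_emb @ text_emb.t() / temperature            # (batch, batch)

    # Each image's true caption sits on the diagonal; the rest are negatives.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)                # image -> caption
    loss_t2i = F.cross_entropy(logits.t(), targets)            # caption -> image
    return (loss_i2t + loss_t2i) / 2
```

Nothing in that objective rewards getting the *relations* between words right; it only rewards telling captions apart within a batch, which is why word order and composition get so little training signal.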
Imagen is an image generator hooked up to a large language model, and the LLM is pre-trained on vast reams of actual text (not just image captions). The predict-the-next-token loss function of LLMs is not necessarily a great task for learning compositional relationships, but it does have some compositional elements, since the LLM must at least learn to produce grammatically correct output, and it has seen many, many detailed descriptions of scenes within its training set, e.g. in novels and news articles.
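For comparison, here is the shape of the next-token objective I mean (again just a sketch of the standard loss; `model` is a stand-in for any LLM that maps token ids to logits, not Imagen's actual code):

```python
# Sketch of the standard next-token prediction objective (illustrative only).
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """token_ids: (batch, seq_len) ints; model returns (batch, seq_len-1, vocab) logits."""
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]   # shift by one position
    logits = model(inputs)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),                 # flatten over positions
        targets.reshape(-1),
    )
```

Predicting each next word forces the model to track, for example, which adjective attaches to which noun, so there is a weak but real compositional signal in there.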
Moreover, although the transformer architecture as a whole is not fully recursive to unlimited depth, it can capture limited recursion with bounded depth by using many stacked layers. The attention mechanism is known to be capable of representing parse trees internally. (See "Global Relational Models of Source Code"). LLMs have recently had a fair amount of success at producing other recursive, compositional data structures, such as source code.
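A crude way to see how stacked layers buy bounded recursion depth (an analogy only, not a claim about what transformer layers literally compute): if each "layer" can only resolve the innermost level of bracketing, a fixed stack of K layers handles expressions nested up to depth K and gets stuck beyond that.

```python
# Analogy only: each "layer" resolves one level of parenthesization in parallel,
# so a fixed stack of K layers handles nesting depth <= K and no more.
import re

def one_layer(expr):
    """Evaluate every innermost (paren-free) parenthesized subexpression."""
    # eval is fine here: the toy input is trusted arithmetic text.
    return re.sub(r"\(([^()]+)\)", lambda m: str(eval(m.group(1))), expr)

def bounded_depth_eval(expr, num_layers):
    for _ in range(num_layers):
        expr = one_layer(expr)
    return expr

print(bounded_depth_eval("(1+(2+3))", num_layers=2))       # "6"      -- depth 2: fine
print(bounded_depth_eval("(1+(2+(3+4)))", num_layers=2))   # "(1+9)"  -- depth 3: stuck
```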
Thus, an LLM hooked up to an image generator should be capable, in theory, of parsing a compositional scene description, and generating an image from it. Better pre-training tasks on text (beyond predict-the-next-token) and on images (perhaps tracking objects in video over multiple frames) would doubtless further improve the model.
What is surprising to me is not that these models have limitations (of course they do), but that they do so astonishingly well, even when trained with brain-dead loss functions like predict-the-next-token.
re Imagen, see my previous essay (with Google in the title); I have repeatedly tried to get Google to test a serious benchmark on Imagen and they refuse to respond.
You definitely make a good point regarding Google's unwillingness to allow access to their models for legitimate research.
I think part of the problem is that these models are getting so powerful that the big players (Google, OpenAI) are truly frightened about the potential for abuse. Or about the potential for lawsuits -- e.g. copyright violation, DMCA, libel, privacy, deep fakes, etc. There are many good reasons why management/legal might want to lock down these models tighter than an <insert colorful metaphor here>. I would hesitate to assume malice, when you can blame bureaucracy.
That being said, it only took ~6 months to get from Dall-E/Imagen to the open-source Stable Diffusion, so it's not clear that locking down the models has accomplished much, other than to frustrate people like you, and send everyone else off to use the freely-available alternatives.
potential for abuse has *zero* to do with Google themselves testing and releasing results on a legitimate benchmark generated by researchers at Meta and HuggingFace.
You're right, but as an ML researcher myself, I am keenly aware there are an almost unlimited number of experiments, comparisons, and benchmarks that I could potentially do, but I personally only have time to do a tiny fraction of them. In fact, every time I try to get a paper past peer review, at least one reviewer is miffed that I didn't run my model on their favorite benchmark. There are a LOT of experiments being done with LLMs all the time these days, and the fact that Google (or OpenAI, or anybody else) didn't prioritize the exact one that you wanted them to do doesn't mean anything, except that they happen to have other priorities. (Which, honestly, given the topic of your blog, is not all that surprising... :-))
However, the way that science is supposed to work is that if there's an experiment that you personally care about, then you should have the ability to investigate it yourself. In that sense, having these big models locked up within corporate research labs does present a legitimate concern with respect to scientific transparency and reproducibility.
Since Stable Diffusion was released, there's been an absolute explosion of enthusiasts and tinkerers experimenting with what the system is capable of. On the one hand, that proves my point about the benefits of openness. On the other hand, I find it scary as hell.
I'm not sure I know what the right balance is. Perhaps, as with the James Webb Telescope, interested scientists could apply for "time" on these big models? I think that's what OpenAI did with Dall-E...
The problems in AI are well documented - see the number of papers and articles claiming AGI will be the end of us all - due to the tradeoff between usefully relaxing requirements and limitations to allow improvisational improvements and having those limitations keep us all safe. When you allow peer review, you are putting code where it can be used by anyone who doesn't share the same ethical limitations as those using the code in progress... do no evil? Or don't allow it to be done at all. We need a way to share with trusted reviewers, but not with the unvetted general public. We can see the complications this adds in action in the cyber arms race.
Dude, you are in serious need of a proofreader. I mean, it's a good article, and I agree with your premise, but look at your footnote: "... asked if I could they could it for me..." srsly? There are many similar examples, and they detract. I share your frustration with AI hype, however. Thanks also for providing us yet more evidence that Elon Musk ain't the genius he thinks he is (and I know, he's got lots of company).
I made a prediction market based on this post: https://manifold.markets/SneakySly/will-ai-image-generating-models-sco
I'm kinda intrigued as to whether this compositionality issue will leave AI-based content moderation systems vulnerable.
hugely so. there is no reliable automated content moderation yet
Another great article.
I do think DeepMind should maybe get a Nobel Prize for the results they got on protein folding with AlphaFold. This technology is powerful in its own way, as long as results are discrete or need not be very precise.
But general AI or its equivalent on digital computers: no chance in hell. And if you want to laugh at a sign of this 'second run of AI hype', look at US Patent 11396271 (for an app that warns pedestrians at a crossing that an oncoming self-driving car (should but) will not stop... "a method and system for communication between a vulnerable road user and an autonomous vehicle using augmented reality to highlight information to the vulnerable road user regarding potential interactions between the autonomous vehicle and the vulnerable road user.")