Dear colleagues, while I fully agree with your findings, I'm puzzled by the sentiment.
You write: "Relating language and the visual world robustly may require altogether new approaches to learning language."
Why do you think that current technology *would* be able to handle these kinds of prompts in the first place? The underlying approaches (diffusion models, word2vec, and GPT-based LLMs) are all fundamentally well understood, and I don't believe any party has made meaningful structural changes to them.
To me it is not merely plausible but virtually certain that neither CLIP nor GPT has anything like a world-model epistemology, and that it *cannot* have one. Why do you suppose it might?
I don't suppose that it would. But a lot of people are confused, and it's good to remind them that scaling is not a panacea.