
Very nice examples, especially in showing that ['understanding' statistical relations between the pixels of images] is only an approximation of [understanding images], whatever the size of the statistics.

It is really strange that you can see how apt these examples and interpretations are, and yet the belief that these systems have actual AGI-like understanding stays alive.

It seems easier to fool humans with text than it is with images. Good grammar and well-formed sentences are a proxy our intelligence uses to establish the intelligence of an author, and thus to trust what we are told. Good grammar and sentences are much easier to approximate with token statistics than good meaning is, and on many subjects there are enough token statistics to get 'good enough' results. Which is, of course, a fascinating result in itself.
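
To make that point concrete, here is a minimal sketch (the toy corpus and names are mine, not from the post) of how pure token statistics can produce locally fluent but meaningless text. Even a crude bigram model gets the local grammar roughly right while saying nothing:

```python
import random
from collections import defaultdict

# Toy corpus; in a real LLM the "statistics" are vastly larger,
# but the principle is the same: next-token choice from counts.
corpus = (
    "the model sees the image and the model describes the image "
    "the text reads well and the text says nothing new"
).split()

# Count which words follow which word.
follows = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev].append(nxt)

# Generate: each step is locally plausible, the whole is meaningless.
word = "the"
out = [word]
for _ in range(12):
    word = random.choice(follows.get(word, corpus))
    out.append(word)
print(" ".join(out))
```

The output reads like sentences, which is exactly the proxy our trust latches onto; no meaning was consulted at any step.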

But good pixel statistics are even further from meaning than good token statistics, and thus have a much harder job; add to that the fact that two kinds of statistics (on the text and on the pixels) must work together here. Could that be why 'correctness' is so much harder to achieve for images than for text?
