I think that image models are a completely different beast from language models, and I'm simply not well enough informed about them. So take what I'm going to say with a grain of salt.
I think it's possible that image models do some sort of abstraction that resembles how humans handle images, including modelling a third dimension not present in a 2D picture, or abstractions like foreground vs. background. Whether they actually do, I don't know.
And unlike with language models, image model hallucinations (e.g. people with six fingers) don't seem to contradict the idea that the model still recognises individual objects.
Interesting take, what do you make of this one?
This video gives a decent explanation of what might be going on with the hands, if you're interested.
Thanks - I’ll check it out.