I think that image models are a completely different beast from language models, and I’m simply not informed enough about image models. So take what I’m going to say with a grain of salt.
I think it’s possible that image models do some sort of abstraction that resembles how humans handle images, including modelling a third dimension not present in a 2D picture, or abstractions like foreground vs. background. Whether they actually do, I don’t know.
And unlike with language models, image model hallucinations (e.g. people with six fingers) don’t seem to contradict the idea that the model still recognises individual objects.
This video gives a decent explanation of what might be going on with the hands if you’re interested.
Thanks - I’ll check it out.