::: GROUNDING MULTIMODAL LARGE LANGUAGE MODELS TO THE WORLD
The Kosmos-2 model is introduced.
It is trained on an internet-scale visual-grounding dataset (text spans paired with bounding boxes in an image), which is produced automatically: spaCy extracts noun chunks from the captions, and a separate pretrained grounding model links them to image regions. Visual grounding is interpreted as linking text expressions to bounding boxes. Each bounding box is represented with discretized location tokens: the image is split into a grid of uniform-size cells, each token refers to one cell, and the tokens for the top-left and bottom-right corners together represent the box. Experiments show that grounding and referring performance is indeed improved and that the model has strong zero-shot and few-shot learning abilities, while maintaining performance similar to Kosmos-1 (from which the model is initialized) on other multimodal and text-only tasks.
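The grid-based box representation can be illustrated with a minimal Python sketch. This is not the model's actual code: the function names, the row-major cell numbering, and the default grid size of 32 are assumptions made for illustration only.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def box_to_location_tokens(box: Box, image_size: Tuple[int, int],
                           grid: int = 32) -> Tuple[int, int]:
    """Map a pixel box to two cell indices (top-left, bottom-right).

    Hypothetical helper: cells are numbered row by row, from 0 to grid*grid - 1.
    """
    x0, y0, x1, y1 = box
    width, height = image_size

    def cell_index(x: float, y: float) -> int:
        # Clamp so that points on or beyond the image border stay in the grid.
        col = min(max(int(x / width * grid), 0), grid - 1)
        row = min(max(int(y / height * grid), 0), grid - 1)
        return row * grid + col

    return cell_index(x0, y0), cell_index(x1, y1)


def location_tokens_to_box(tokens: Tuple[int, int],
                           image_size: Tuple[int, int],
                           grid: int = 32) -> Box:
    """Recover an approximate pixel box from two cell indices (cell centers)."""
    top_left, bottom_right = tokens
    width, height = image_size
    cell_w, cell_h = width / grid, height / grid
    row0, col0 = divmod(top_left, grid)
    row1, col1 = divmod(bottom_right, grid)
    return ((col0 + 0.5) * cell_w, (row0 + 0.5) * cell_h,
            (col1 + 0.5) * cell_w, (row1 + 0.5) * cell_h)


tokens = box_to_location_tokens((56, 112, 168, 196), image_size=(224, 224))
approx_box = location_tokens_to_box(tokens, image_size=(224, 224))
```

Discretization loses precision: the decoded box matches the original only to within one grid cell, which is the trade-off for expressing coordinates as a small vocabulary of extra tokens.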
1. visual grounding
Text in the LLM's output refers to a specific visual object, which is represented as a bounding box in an image.