Joint Learning of Distributed Representations for Images and Texts

04/13/2015
by   Xiaodong He, et al.

This technical report provides additional details of the deep multimodal similarity model (DMSM) proposed in Fang et al. (2015, arXiv:1411.4952). The model is trained by maximizing the global semantic similarity between images and their natural-language captions, using the public Microsoft COCO database, which consists of a large set of images paired with corresponding captions. The learned representations aim to capture combinations of various visual concepts and cues.
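To make the training objective concrete, below is a minimal PyTorch sketch of a DMSM-style setup: two towers map image and caption features into a shared semantic space, and the loss maximizes the softmax probability of each true caption under smoothed cosine similarity, with the other captions in the batch serving as sampled negatives. The tower architectures, feature dimensions, and the smoothing factor `gamma` are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of a DMSM-style training objective (assumed details:
# tower sizes, gamma, and in-batch negative sampling are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Maps an input feature vector into the shared semantic space."""
    def __init__(self, in_dim: int, out_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.Tanh(),
            nn.Linear(128, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def dmsm_loss(img_vec: torch.Tensor, cap_vec: torch.Tensor,
              gamma: float = 10.0) -> torch.Tensor:
    """Negative log-likelihood of the matching caption for each image.

    Similarity is smoothed cosine similarity; within a batch, every
    other caption acts as a sampled negative for the softmax.
    """
    img = F.normalize(img_vec, dim=1)
    cap = F.normalize(cap_vec, dim=1)
    sims = gamma * img @ cap.t()              # (B, B) scaled cosine similarities
    targets = torch.arange(sims.size(0))      # caption i matches image i
    return F.cross_entropy(sims, targets)

# Toy usage with random stand-ins for image CNN features and caption
# text features (both hypothetical input representations).
image_tower, text_tower = Tower(in_dim=256), Tower(in_dim=512)
imgs, caps = torch.randn(8, 256), torch.randn(8, 512)
loss = dmsm_loss(image_tower(imgs), text_tower(caps))
loss.backward()
print(f"DMSM batch loss: {loss.item():.4f}")
```

Maximizing the per-image caption likelihood in this way pushes matching image-caption pairs together and non-matching pairs apart in the shared space, which is the global similarity criterion the abstract describes.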
