Dynamic Multimodal Instance Segmentation guided by natural language queries
In this paper, we address the task of segmenting an object given a natural language expression that refers to it, i.e., a referring expression. Current techniques tackle this task either by (i) directly or recursively merging the linguistic and visual information along the channel dimension and then applying convolutions, or by (ii) mapping the expression to a space in which it acts as a filter whose response indicates the presence of the object at each spatial coordinate of the image, so that a convolution can be applied to locate the object. We propose a novel method that merges the best of both worlds to exploit the recursive nature of language and that, during upsampling, takes advantage of the intermediate features generated while downsampling the image, so that detailed segmentations can be obtained. We compare our method with state-of-the-art approaches on four standard datasets, where it yields high performance and surpasses all previous methods on six of the eight standard dataset splits for this task. The full implementation of our method and training routines, written in PyTorch, is available at <https://github.com/andfoy/query-objseg>.
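To make the contrast concrete, below is a minimal PyTorch sketch of the two fusion strategies and the skip-style upsampling the abstract alludes to. All module names, shapes, and layers are illustrative assumptions for exposition, not the paper's actual architecture (see the linked repository for that).

```python
# Hypothetical sketch of (i) concatenation fusion, (ii) language-as-filter,
# and skip connections during upsampling. Names and shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, C, H, W = 2, 64, 32, 32           # batch, channels, spatial dims (assumed)
D = 128                              # language embedding size (assumed)

visual = torch.randn(B, C, H, W)     # visual features from a CNN backbone
query = torch.randn(B, D)            # expression embedding, e.g. from an LSTM

# (i) Concatenation fusion: tile the expression embedding over the spatial
# grid, stack it with the visual features along the channel axis, convolve.
tiled = query[:, :, None, None].expand(B, D, H, W)
fused = torch.cat([visual, tiled], dim=1)          # (B, C + D, H, W)
concat_conv = nn.Conv2d(C + D, C, kernel_size=3, padding=1)
response_i = concat_conv(fused)                    # (B, C, H, W)

# (ii) Language-as-filter: project the expression into a 1x1 convolutional
# kernel whose response map scores the referent's presence per location.
to_filter = nn.Linear(D, C)
kernels = to_filter(query).view(B, 1, C, 1, 1)     # one kernel per sample
response_ii = torch.stack(
    [F.conv2d(visual[b:b + 1], kernels[b]) for b in range(B)]
).squeeze(1)                                       # (B, 1, H, W)

# Skip-style upsampling: fuse an upsampled coarse map with a higher-resolution
# feature map saved during the downsampling pass, to recover fine detail.
encoder_feat = torch.randn(B, C, H * 2, W * 2)     # saved encoder feature
upsampled = F.interpolate(response_i, scale_factor=2, mode='bilinear',
                          align_corners=False)
refined = torch.cat([upsampled, encoder_feat], dim=1)   # (B, 2C, 2H, 2W)
```

In this reading, approach (i) lets convolutions learn the interaction between modalities, while approach (ii) turns the expression itself into the detector; the skip connections reuse encoder features so the final mask is not limited by the coarse resolution of the fused representation.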