[1707.06320] Learning Visually Grounded Sentence Representations


As we have seen in several papers from the last issue, supervised sentence representations are useful for many tasks. Kiela et al. propose learning grounded sentence representations from a multimodal task, namely image captioning. They show that the learned representations transfer well to various classification tasks, entailment, and word similarity. The takeaway: sufficiently large datasets that require some form of natural language understanding are useful for inducing good general-purpose representations. On which tasks would representations learned on the image-recipe data from above perform well?
