Authors
Yezhou Yang, Ching Lik Teo, Hal Daumé III, Yiannis Aloimonos
Publication date
2011/7/27
Conference
Proceedings of the Conference on Empirical Methods in Natural Language Processing
Pages
444-454
Publisher
Association for Computational Linguistics
Description
We propose a sentence generation strategy that describes images by predicting the most likely nouns, verbs, scenes, and prepositions that make up the core sentence structure. The inputs are initial noisy estimates of the objects and scenes detected in the image using state-of-the-art trained detectors. Because predicting actions directly from still images is unreliable, we estimate verbs using a language model trained on the English Gigaword corpus, together with the probabilities of co-located nouns, scenes, and prepositions. These estimates serve as parameters of an HMM that models the sentence generation process, with hidden nodes as sentence components and image detections as the emissions. Experimental results show that our strategy of combining vision and language produces readable and descriptive sentences compared to naive strategies that use vision alone.
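The abstract describes decoding over an HMM whose hidden nodes are sentence components and whose emissions are noisy image detections. As a rough illustration only (not the paper's actual implementation), the sketch below shows standard Viterbi decoding in Python for such a model; the state labels, probabilities, and function name are hypothetical placeholders, with transition scores standing in for corpus-derived co-location statistics and emission scores for detector confidences.

```python
import numpy as np

def viterbi(states, log_prior, log_trans, log_emit):
    """Most likely hidden-state sequence under an HMM (log domain).

    states:    list of S hidden-state labels (candidate sentence components)
    log_prior: (S,) log initial-state probabilities
    log_trans: (S, S) log transition probabilities, e.g. standing in for
               corpus co-location statistics (hypothetical values here)
    log_emit:  (T, S) per-step log emission scores, e.g. noisy detector
               confidences for each candidate component
    """
    T, S = log_emit.shape
    delta = np.full((T, S), -np.inf)    # best log score ending in state s at step t
    back = np.zeros((T, S), dtype=int)  # backpointers to recover the path

    delta[0] = log_prior + log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # rows: prev state, cols: current
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]

    # Trace the highest-scoring path backwards from the final step.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [states[s] for s in reversed(path)]

# Toy usage with made-up numbers: three candidate components, two steps.
states = ["dog", "run", "park"]
log_prior = np.log([0.5, 0.2, 0.3])
log_trans = np.log([[0.1, 0.7, 0.2],
                    [0.3, 0.1, 0.6],
                    [0.4, 0.4, 0.2]])
log_emit = np.log([[0.6, 0.1, 0.3],   # detector scores at step 1
                   [0.2, 0.5, 0.3]])  # detector scores at step 2
print(viterbi(states, log_prior, log_trans, log_emit))  # -> ['dog', 'run']
```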
Citations per year
2012: 13, 2013: 26, 2014: 33, 2015: 47, 2016: 50, 2017: 57, 2018: 51, 2019: 50, 2020: 41, 2021: 50, 2022: 39, 2023: 37, 2024: 21
Scholar articles
Y Yang, CL Teo, H Daumé III, Y Aloimonos - Proceedings of the 2011 conference on empirical …, 2011
Y Yang, CL Teo, H Daumé III, Y Aloimonos - Corpus-guided sentence generation of …, 2011