Seminar on Vision+Language
CSE 704: Seminar:
Readings in Joint Visual, Lingual and Physical Models and Inference Algorithms
Instructors: | Jason Corso (jcorso) |
Course Webpage: | http://www.cse.buffalo.edu/~jcorso/t/2014S_SEM |
Meeting Times: | T 1200-1400 |
Location: | Davis 113A |
Office Hours: | R 1200-1400 |
Course Overview: This seminar will study modeling and inference when multimodal data is available. The focus is on vision and language, but other modalities, such as action (physical motion) and audition, will also be considered. The seminar will center on reading and discussing topic-relevant research papers.
[1 credit hour] Each student will be required to present a paper during the semester, attend all classes, and actively participate in discussions. [3 credit hours] All of the above, plus an implementation of one research paper on real data and a written experimental report. S grades are given for satisfactory performance and are not guaranteed. Students are expected to attend all classes, to present informed and professional content, and to actively participate in discussions. Students are also expected to have a strong working knowledge of two or more of computer vision, linguistics, machine learning, and data mining.
Prerequisites: It is assumed that the students have significant experience with at least two of computer vision, linguistics/NLP, machine learning, and data mining. Note, this is an advanced course. Don't register if you are not prepared.
Grading: Grading is P/F by departmental policy.
Date | Paper | Presenter(s) | Download |
1/28 | G. Monaci, P. Vandergheynst, and F. T. Sommer, “Learning bimodal structure in audio-visual data,” IEEE Transactions on Neural Networks, vol. 20, no. 12, pp. 1898–910, 2009. | Suren Kumar | |
2/4 | R. Socher and L. Fei-Fei, “Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010. | Ran Xu | |
2/11 | Ladicky et al. "Joint Optimization for Object Class Segmentation and Dense Stereo Reconstruction," IJCV 2012. | Chenliang Xu | |
2/17 | Roller, S. and Schulte im Walde, S., "A Multimodal LDA Model Integrating Textual, Cognitive and Visual Modalities," EMNLP 2013. | Sarma Tangirala | |
2/24 | A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. A. Ranzato, and T. Mikolov, “DeViSE: A deep visual-semantic embedding model,” in Proceedings of Advances in Neural Information Processing Systems, 2013. | Caiming Xiong | |
3/4 | NO CLASS | ||
3/11 | Guadarrama et al. "YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-shot Recognition." ICCV 2013. | Wei Chen | |
3/11 | A. Barbu, N. Siddharth, and J. M. Siskind, "Saying what you’re looking for: Linguistics meets video search," No. arxiv:1309.5174, Sept. 2013. | Suchismit Mahapatra | |
3/18 | SPRING RECESS | ||
3/25 | D. L. Chen and R. J. Mooney, "Learning to Interpret Natural Language Navigation Instructions from Observations." AAAI 2011. | Kaushal Bondada | |
4/1 | Batmanghelich, N. K. et al. "Joint Modeling of Imaging and Genetics," IPMI 2013. | Duygu Sarikaya | |
4/8 | C. Matuszek, N. Fitzgerald, L. Zettlemoyer, L. Bo, and D. Fox, “A joint model of language and perception for grounded attribute learning,” in Proceedings of the 29th International Conference on Machine Learning (ICML-12), 2012. | Vikas Dhiman | |
4/8 | K. Barnard and D. Forsyth, "Learning the semantics of words and pictures," in International Conference on Computer Vision, vol. 2, pp. 408–415, 2001. | Aswin Gokulachandran | |
4/15 | N. Siddharth, A. Barbu, and J. M. Siskind, "Seeing what you're told: Sentence-guided activity recognition in video," No. arxiv:1308.4189, Aug. 2013. | Shao-Hang Hsieh | |
4/15 | A. Farhadi, M. Hejrati, A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, “Every picture tells a story: Generating sentences for images,” in Proceedings of European Conference on Computer Vision, 2010. | Vaijayanti Maitra | |
4/22 | Guillaumin, M. et al. "Multimodal Semi-Supervised Learning for Image Classification," CVPR 2010. | Smriti Jha | |
4/22 | G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, "Baby talk: Understanding and generating simple image descriptions," CVPR 2011. | Radhakrishna Dasari | |
4/29 | Fidler et al. "A Sentence is Worth a Thousand Pixels." CVPR 2013. | Jiasen Lu | |
4/29 | Le, D.-T., Bernardi, R., and Uijlings, J. "Exploiting language models to recognize unseen actions," ACM ICMR 2013. | Liang Zhao | |
5/6 | Wrap-Up Discussion Class |