Prof. Corso moved to the Electrical Engineering and Computer Science Department at the University of Michigan in August 2014, where he continues his work and research group in high-level computer vision at the intersection of perception, semantics/language, and robotics. Unless you are looking for something specific archived here, you probably want to visit his new page.
Seminar on Vision+Language

CSE 704: Seminar: Readings in Joint Visual, Lingual and Physical Models and Inference Algorithms
SUNY at Buffalo
Spring 2014


Instructor: Jason Corso (jcorso)
Course Webpage: http://www.cse.buffalo.edu/~jcorso/t/2014S_SEM
Meeting Times: T 1200-1400
Location: Davis 113A
Office Hours: R 1200-1400

News

  • Schedule changed slightly: 3/4 canceled and paper shifted to 3/11.
  • 2/11 Full schedule for the semester is now available.
  • First day of class 1/28.
  • Site is newly up; check back later for more detailed information.

Main Course Material

Topic Description

Course Overview: This seminar will study modeling and inference when multimodal data are available. The focus is on vision and language, but other modalities, such as action (physical motion) and audition, will also be considered. The seminar centers on reading and discussing topic-relevant research papers.
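
To make the seminar's subject matter concrete, below is a toy, self-contained Python/NumPy sketch of one technique from the reading list: a DeViSE-style visual-semantic embedding in the spirit of the Frome et al. (2013) paper on the schedule. This is not code from any assigned paper; all features, labels, dimensions, and hyperparameters are synthetic choices made purely for illustration. The idea is to learn a linear map so that an image's features project near the word vector of its label, trained with a margin rank loss against sampled wrong labels.

# Toy sketch of a DeViSE-style joint visual-semantic embedding.
# All data below is random and purely illustrative; in practice the word
# vectors would come from a language model and the image features from a
# vision pipeline, as in the papers on the schedule.
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_TXT, N_CLASSES, N_IMAGES = 64, 16, 5, 200

# Unit-norm "word vectors" for the class labels.
word_vecs = rng.normal(size=(N_CLASSES, D_TXT))
word_vecs /= np.linalg.norm(word_vecs, axis=1, keepdims=True)

# Synthetic "image features": each is a noisy linear image of its label's
# word vector, so a linear map back to text space is recoverable.
labels = rng.integers(0, N_CLASSES, size=N_IMAGES)
mix = rng.normal(size=(D_TXT, D_IMG))
img_feats = word_vecs[labels] @ mix + 0.1 * rng.normal(size=(N_IMAGES, D_IMG))

# Learn M so that M @ image_feature scores its true label's word vector
# above a sampled wrong label's, via a hinge (margin rank) loss.
M = 0.01 * rng.normal(size=(D_TXT, D_IMG))
lr, margin = 0.01, 0.1
for epoch in range(50):
    for x, y in zip(img_feats, labels):
        z = M @ x                               # image projected into text space
        wrong = rng.integers(0, N_CLASSES)
        if wrong == y:
            continue
        if margin - word_vecs[y] @ z + word_vecs[wrong] @ z > 0:
            M += lr * np.outer(word_vecs[y] - word_vecs[wrong], x)

# Retrieval: rank all labels by similarity to a projected image.
scores = (M @ img_feats[0]) @ word_vecs.T
print("true label:", labels[0], "ranking:", np.argsort(-scores))

Because ranking happens in word-vector space, labels never seen during training can still be scored as long as they have a word vector; this is the zero-shot angle pursued by several of the scheduled papers.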

[1 credit hour] Each student will be required to present a paper during the course of the semester, attend all classes, and actively participate in discussions.

[3 credit hours] All of the above, plus implementation of one research paper on real data and a written experimental report.

S grades are given for satisfactory performance and are not guaranteed. Students are expected to attend all classes, to present informed and professional content, and to actively participate in discussions. Students are also expected to have a strong working knowledge of two or more of computer vision, linguistics, machine learning, and data mining.

Prerequisites: It is assumed that students have significant experience with at least two of computer vision, linguistics/NLP, machine learning, and data mining. Note that this is an advanced course; do not register if you are not prepared.

Grading: P/F by departmental policy.


Course Outline and Schedule

Here is the schedule of presentations.
Date | Paper | Presenter(s) | Download
1/28 | G. Monaci, P. Vandergheynst, and F. T. Sommer, “Learning bimodal structure in audio-visual data,” IEEE Transactions on Neural Networks, vol. 20, no. 12, pp. 1898–1910, 2009. | Suren Kumar | pdf
2/4 | R. Socher and L. Fei-Fei, “Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010. | Ran Xu |
2/11 | L. Ladicky et al., “Joint Optimization for Object Class Segmentation and Dense Stereo Reconstruction,” IJCV 2012. | Chenliang Xu |
2/18 | S. Roller and S. Schulte im Walde, “A Multimodal LDA Model Integrating Textual, Cognitive and Visual Modalities,” EMNLP 2013. | Sarma Tangirala |
2/25 | A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. A. Ranzato, and T. Mikolov, “DeViSE: A deep visual-semantic embedding model,” in Advances in Neural Information Processing Systems, 2013. | Caiming Xiong |
3/4 | NO CLASS | |
3/11 | S. Guadarrama et al., “YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-shot Recognition,” ICCV 2013. | Wei Chen |
3/11 | A. Barbu, N. Siddharth, and J. M. Siskind, “Saying what you’re looking for: Linguistics meets video search,” arXiv:1309.5174, Sept. 2013. | Suchismit Mahapatra |
3/18 | SPRING RECESS | |
3/25 | D. L. Chen and R. J. Mooney, “Learning to Interpret Natural Language Navigation Instructions from Observations,” AAAI 2011. | Kaushal Bondada |
4/1 | N. K. Batmanghelich et al., “Joint Modeling of Imaging and Genetics,” IPMI 2013. | Duygu Sarikaya |
4/8 | C. Matuszek, N. Fitzgerald, L. Zettlemoyer, L. Bo, and D. Fox, “A joint model of language and perception for grounded attribute learning,” in Proceedings of the 29th International Conference on Machine Learning (ICML-12), 2012. | Vikas Dhiman |
4/8 | K. Barnard and D. Forsyth, “Learning the semantics of words and pictures,” in International Conference on Computer Vision, vol. 2, pp. 408–415, 2001. | Aswin Gokulachandran |
4/15 | N. Siddharth, A. Barbu, and J. M. Siskind, “Seeing what you’re told: Sentence-guided activity recognition in video,” arXiv:1308.4189, Aug. 2013. | Shao-Hang Hsieh |
4/15 | A. Farhadi, M. Hejrati, A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, “Every picture tells a story: Generating sentences for images,” in Proceedings of European Conference on Computer Vision, 2010. | Vaijayanti Maitra |
4/22 | M. Guillaumin et al., “Multimodal Semi-Supervised Learning for Image Classification,” CVPR 2010. | Smriti Jha |
4/22 | G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, “Baby talk: Understanding and generating simple image descriptions,” CVPR 2011. | Radhakrishna Dasari |
4/29 | S. Fidler et al., “A Sentence is Worth a Thousand Pixels,” CVPR 2013. | Jiasen Lu |
4/29 | D.-T. Le, R. Bernardi, and J. Uijlings, “Exploiting language models to recognize unseen actions,” ACM ICMR 2013. | Liang Zhao |
5/6 | Wrap-Up Discussion Class | |
Here are possible papers to choose from. This is not an exhaustive list; students should seek out additional papers and email any suggestions to me.
  • N. Siddharth, A. Barbu, and J. M. Siskind, “Seeing what you’re told: Sentence-guided activity recognition in video,” arXiv:1308.4189, Aug. 2013.
  • A. Barbu, N. Siddharth, and J. M. Siskind, “Saying what you’re looking for: Linguistics meets video search,” arXiv:1309.5174, Sept. 2013.
  • A. Barbu, A. Bridge, Z. Burchill, D. Coroian, S. Dickinson, S. Fidler, A. Michaux, S. Mussman, S. Narayanaswamy, D. Salvi, L. Schmidt, J. Shangguan, J. M. Siskind, J. Waggoner, S. Wang, J. Wei, Y. Yin, and Z. Zhang, “Video in sentences out,” in Proceedings of Uncertainty in Artificial Intelligence, 2012.
  • K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. M. Blei, and M. I. Jordan, “Matching words and pictures,” Journal of Machine Learning Research, vol. 3, pp. 1107–1135, 2003.
  • K. Barnard and D. Forsyth, “Learning the semantics of words and pictures,” in International Conference on Computer Vision, vol. 2, pp. 408–415, 2001.
  • J. R. Bender, “Connecting language and vision using a conceptual semantics,” Master’s thesis, Massachusetts Institute of Technology, 2001.
  • D. L. Chen and R. J. Mooney, “Learning to interpret natural language navigation instructions from observations,” in Proceedings of AAAI Conference on Artificial Intelligence, 2011.
  • P. Das, R. K. Srihari, and J. J. Corso, “Translating related words to videos and back through latent topics,” in Proceedings of Sixth ACM International Conference on Web Search and Data Mining, 2013.
  • P. Das, C. Xu, R. F. Doell, and J. J. Corso, “A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2013.
  • A. Farhadi, M. Hejrati, A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, “Every picture tells a story: Generating sentences for images,” in Proceedings of European Conference on Computer Vision, 2010.
  • A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. A. Ranzato, and T. Mikolov, “DeViSE: A deep visual-semantic embedding model,” in Advances in Neural Information Processing Systems, 2013.
  • N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadarrama, “Generating natural-language video descriptions using text-mined knowledge,” in Proceedings of AAAI Conference on Artificial Intelligence, 2013.
  • G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, “Baby talk: Understanding and generating simple image descriptions,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011.
  • C. Matuszek, N. Fitzgerald, L. Zettlemoyer, L. Bo, and D. Fox, “A joint model of language and perception for grounded attribute learning,” in Proceedings of the 29th International Conference on Machine Learning (ICML-12), 2012.
  • P. McKevitt, “Integration of natural language and vision processing: Grounding representations (editorial),” Artificial Intelligence Review, vol. 10, no. 1-2, pp. 7–13, 1996.
  • C. Meini and A. Paternoster, “Understanding language through vision,” Artificial Intelligence Review, vol. 10, no. 1-2, pp. 37–48, 1996.
  • D. Roy and N. Mukherjee, “Towards situated speech understanding: Visual context priming of language models,” Computer Speech and Language, vol. 19, no. 2, pp. 227–248, 2005.
  • S. Tellex, T. Kollar, S. Dickerson, M. R. Walter, A. G. Banerjee, S. Teller, and N. Roy, “Understanding natural language commands for robotic navigation and mobile manipulation,” in Proceedings of AAAI Conference on Artificial Intelligence, 2011.
  • S. Roller and S. Schulte im Walde, “A multimodal LDA model integrating textual, cognitive and visual modalities,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2013.

last updated: Sat Jun 21 07:38:47 2014; copyright jcorso