Vision and Language

Georgia Tech, College of Computing

Fall 2017: CS 8803 (CVL)

Class meets Tue, Thu 1:30-2:45pm, Klaus 2456

Learn about, apply and (possibly) advance state-of-the-art techniques in problems at the intersection of computer vision and natural language processing.

Course Information


In this class, you will learn about, apply and (possibly) advance state-of-the-art techniques in problems at the intersection of computer vision and natural language processing. With progress in AI, researchers are increasingly looking to move from learning from a single modality like vision, language or speech, to jointly reasoning about and leveraging multiple modalities. This course will provide an opportunity to learn about and build AI agents that can see and talk about what they see!

The focus of this course is on reading and critiquing published research papers, and on doing a semester-long (research) project.

We will read and analyze the strengths and weaknesses of research papers on a variety of important topics pertaining to vision and language, and identify open research questions. See the schedule for a list of topics we will cover.

Over the course of the semester, you will also undertake a research project with a concrete objective, likely in teams of 3-4 students (depending on enrollment). While certainly not a requirement for the class, students should actively consider submitting a paper at the end of the course to a top-tier conference in Computer Vision, Natural Language Processing, Machine Learning, or AI.

Piazza: We'll be conducting all class-related discussions on Piazza. The sooner you begin asking questions on Piazza (rather than via email), the sooner you'll benefit from the collective knowledge of your classmates and instructors. Please ask questions on Piazza before emailing the TAs or instructor.
Sign up here! Class page on Piazza: https://piazza.com/gatech/fall2017/cs8803cvl/home.

Class meets: Tue, Thu 1:30-2:45pm, Klaus 2456

    
Instructor: Devi Parikh
Email: parikh@gatech.edu
Office hours: Tue, Thu 2:45-3:15pm (right after class), outside Klaus 2456

TAs: Arjun Chandrasekaran, Larry He
Emails: carjun@gatech.edu, larry.he@gatech.edu
Office hours:
  Arjun: Tue 6-7pm, CCB level 2 lobby
  Larry: Fri 3:45-4:45pm, CCB commons area (level 1)

Announcements


Final project spotlight videos for this class (Fall 2017) are available here!

See list of projects below.

Course Structure


The first few classes will introduce you to active research topics in vision and language - image captioning, visual question answering (VQA), and visual dialog. The goal is to provide a sufficient overview of problems in vision and language to enable an informed decision regarding the course project topic. You will work on the project throughout the semester as part of a team of 3-4 students (depending on enrollment).

After the introductory classes, you will read and review a paper (listed in the schedule) prior to each class. Each lecture will start with a brief discussion of the paper that was reviewed. The discussion will be led by two students -- one highlighting the strengths of the paper, and the other highlighting the weaknesses.

Following the paper discussion, 3 teams will present their project ideas, updates and the issues they faced for about 10 min. each. The goal is to have enough time for discussion and brainstorming. On average, each team will present updates on their project ~3 times over the course of the semester. At the end of the semester, teams will give final project presentations.

The class will vote on the best discussion participant, best project presentation and best project!

Feedback is very welcome. If you have any questions or concerns about the class or the requirements, please be sure to discuss them with the instructor early on.

No laptops, cell phones or other distractions in class please.

Summary:
  • 16 paper reviews
  • ~1 discussion in favor of or against a paper
  • 4 project presentations (idea proposal, 2 updates, final presentation)
  • 1 project spotlight video

NOTE: There will be no final exam!

Auditing the class: Students are required to submit reviews for 3 papers on the topics of image captioning, VQA and visual dialog: Vinyals et al., CVPR 2015 (due before Sep. 7), Agrawal, Lu and Antol et al., ICCV 2015 (due before Sep. 26), and Das et al., CVPR 2017 (due before Nov. 2). Students should attend all lectures and participate in discussions. Students need not lead discussions or do a project.

Pass/Fail: Students are required to submit reviews for 7 papers of their choice from the list of papers in the schedule. Students should attend all lectures and participate in discussions. Students need not lead discussions or do a project.

Recommended Background


CS 8803: Vision and Language is an ADVANCED class. This should not be your first exposure to computer vision, machine learning or deep learning. You will need:

  • Intro-level Computer Vision
  • Intro-level Machine Learning
  • Basic knowledge of deep learning
    This class assumes working familiarity with deep learning; if you don't have this background, this class may not be for you. If you are rusty on Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs), take a look at introductory lectures on these topics before the course gets underway.
  • Curiosity and Passion for research
    • The projects will likely address a novel problem or address an existing problem in a novel way. The curiosity to investigate a problem and the ability to dive deep into it are essential to succeed in this course.
  • Programming
    • This is a demanding class in terms of programming skills.
    • Projects will typically involve programming in Python or Lua and using a deep learning library such as PyTorch, TensorFlow, Torch, Caffe, or Theano (see the short sketch after this list for the flavor of code involved).
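
For a rough sense of the programming involved, below is a minimal sketch of a CNN-encoder / RNN-decoder model of the kind used for image captioning. It assumes PyTorch is installed; the class name, layer sizes, and toy data are illustrative choices, not code provided by the course.

    # Minimal, illustrative CNN + RNN sketch (assumes PyTorch; all names here
    # are made up for illustration and are not course-provided code).
    import torch
    import torch.nn as nn

    class SimpleCaptioner(nn.Module):
        def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
            super().__init__()
            # CNN encoder: a tiny conv stack standing in for a pretrained
            # backbone; maps an RGB image to a single feature vector.
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, embed_dim),
            )
            # RNN decoder: embeds caption tokens and predicts the next word.
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, images, captions):
            img_feat = self.encoder(images).unsqueeze(1)     # (B, 1, E)
            tok_embs = self.embed(captions)                  # (B, T, E)
            # Condition the decoder by prepending the image feature.
            inputs = torch.cat([img_feat, tok_embs], dim=1)  # (B, T+1, E)
            hidden, _ = self.lstm(inputs)
            return self.out(hidden)                          # (B, T+1, vocab)

    # Forward pass on random toy data, just to show the shapes.
    model = SimpleCaptioner(vocab_size=1000)
    images = torch.randn(4, 3, 64, 64)          # 4 RGB images, 64x64
    captions = torch.randint(0, 1000, (4, 12))  # 4 captions of 12 token ids
    print(model(images, captions).shape)        # torch.Size([4, 13, 1000])

Course projects will build on much larger pretrained components and real datasets, but comfort with this style of model definition, tensor shapes, and training loops is the level of fluency assumed.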

Schedule

Date Topic Review paper Project updates
W1: Aug 22 Introduction and Class Administrivia (slides) N/A N/A
W1: Aug 24 No class (Devi at IJCAI) N/A N/A
Aug 29 Sign up for paper discussion here by 11:59 pm
W2: Aug 29 Image captioning talks - Introduction (slides), Jiasen (slides), Rama (slides), Ashwin (slides), Arjun (slides) N/A N/A
W2: Aug 31 Visual Question Answering (VQA) talks - Aishwarya (slides), Jiasen (slides), Yash (slides) N/A N/A
W3: Sep 5 Visual dialog talks - Abhishek (slides), Jiasen (slides) N/A N/A
W3: Sep 7 Image captioning Vinyals et al., CVPR 2015
Submit here
For: Rahul Solanki, Karthik G
Against: Vineet Vinayak
N/A
Sep 8 Sign up your teams here by 11:59 pm Project ideas
W4: Sep 12 No class (Hurricane Irma) N/A N/A
W4: Sep 14 Image captioning with attention Xu et al., ICML 2015
Submit here
For: Wengling Chen, Hang Wu
Against: Sarah Nguyen
Teams 1-3
1. Yen-Chang, Fu-jen, Zhen, Weiyu
2. Amit, Huda, Cusuh, Vincent
3. Manasa Kadiri, Chaitanya, Kevin Garanger
W5: Sep 19 Image captioning with attention Lu and Xiong et al., CVPR 2017
Submit here
For: Manasa, Amit
Against: Arindam Duttagupta
Teams 4-6
4. Akanksha, James, Murali, Siddharth
5. Sainandan R, Rahul Solanki, Arindam D
6. Sajid, Meera, Abhishek
W5: Sep 21 Image captioning with pragmatics Vedantam et al., CVPR 2017
Submit here
For: Sai Sathiesh, Abhishek
Against: Aditi
Teams 7-9
7. Hang, Yun, Sarah
8. Nirbhay, Samyak, Karthik
9. Nilaksh, Madhuri, Varsha, Anmol
W6: Sep 26 VQA dataset and models Agrawal, Lu and Antol et al., ICCV 2015
Submit here
For: Huda, Ting Gu
Against: Anmol Kalia
Teams 10-12
10. Nauman Ahad, Zhaohua, Sai Sathiesh Rajan, Ting
11. Nishanth Rao, Shivam Agarwal, Aditi Mukhtyar, Wengling Chen
12. Xiaojing, Jyothi, Audrija
W6: Sep 28 Multimodal fusion in VQA Fukui, Huk, Park, Yang and Rohrbach et al., EMNLP 2016
Submit here
For: Shivam, Nilaksh
Against: Yen-Chang Hsu
Teams 13-14
13. Rupak Vignesh, Harish Haresamudram, Varun Agrawal
14. Vineet, Sumit
Project update-1
W7: Oct 3 No class (Devi away) N/A N/A
W7: Oct 5 Compositional models for VQA Andreas et al., CVPR 2016
Submit here
For: Nirbhay, Garanger
Against: Sajid
Teams 1-3
1. Yen-Chang, Fu-jen, Zhen, Weiyu
2. Amit, Huda, Cusuh, Vincent
3. Manasa Kadiri, Chaitanya, Kevin Garanger
W8: Oct 10 No class (Fall break! :D ) N/A N/A
W8: Oct 12 VQA model with attention Lu et al., NIPS 2016
Submit here
For: Nishanth
Against: Samyak
Teams 4-6
4. Akanksha, James, Murali, Siddharth
5. Sainandan R, Rahul Solanki, Arindam D
6. Sajid, Meera, Abhishek
W9: Oct 17 Analyzing behavior of VQA models Agrawal et al., EMNLP 2016
Submit here
For: Cusuh, Rupak Vignesh
Against: Weiyu
Teams 7-9
7. Hang, Yun, Sarah
8. Nirbhay, Samyak, Karthik
9. Nilaksh, Madhuri, Varsha, Anmol
W9: Oct 19 Balancing biases in VQA dataset Goyal and Khot et al., CVPR 2017
Submit here
For: Sainandan R, Madhuri
Against: Varsha
Teams 10-12
10. Nauman Ahad, Zhaohua, Sai Sathiesh Rajan, Ting
11. Nishanth Rao, Shivam Agarwal, Aditi Mukhtyar, Wengling Chen
12. Xiaojing, Jyothi, Audrija
W10: Oct 24 No class (Devi at ICCV) N/A N/A
W10: Oct 26 No class (Devi at ICCV) N/A N/A
W11: Oct 31 Question relevance in VQA Ray et al., EMNLP 2016
Submit here
For: Fu-Jen Chu
Against: Varun
Teams 13-14
13. Rupak Vignesh, Harish Haresamudram, Varun Agrawal
14. Vineet, Sumit
Project update-2
W11: Nov 2 Visual dialog dataset and models Das et al., CVPR 2017
Submit here
For: Harish, Chaitanya
Against: Vincent
Teams 1-3
1. Yen-Chang, Fu-jen, Zhen, Weiyu
2. Amit, Huda, Cusuh, Vincent
3. Manasa Kadiri, Chaitanya, Kevin Garanger
W12: Nov 7 Visual dialog with deep RL Das and Kottur et al., ICCV 2017
Submit here
For: Murali, Audrija
Against: Siddharth
Teams 4-6
4. Akanksha, James, Murali, Siddharth
5. Sainandan R, Rahul Solanki, Arindam D
6. Sajid, Meera, Abhishek
W12: Nov 9 Language emergence Kottur et al., EMNLP 2017
Submit here
For: Zhaohua, Zhen Liu
Against: James
Teams 7-9
7. Hang, Yun, Sarah
8. Nirbhay, Samyak, Karthik
9. Nilaksh, Madhuri, Varsha, Anmol
W13: Nov 14 Image captioning Kulkarni et al., CVPR 2011
Submit here
For: Meera, Akanksha
Against: Jyothi
Teams 10-12
10. Nauman Ahad, Zhaohua, Sai Sathiesh Rajan, Ting
11. Nishanth Rao, Shivam Agarwal, Aditi Mukhtyar, Wengling Chen
12. Xiaojing, Jyothi, Audrija
W13: Nov 16 Image captioning Farhadi et al., ECCV 2010
Submit here
For: Xiaojing
Against: Sumit
Teams 13-14
13. Rupak Vignesh, Harish Haresamudram, Varun Agrawal
14. Vineet, Sumit
W14: Nov 21 Leveraging language to learn visual classifiers Gupta and Davis, ECCV 2008
Submit here
For: Yun
Against: Nauman
N/A
W14: Nov 23 No class (Happy Thanksgiving!) N/A N/A
W15: Nov 28 Final project presentations N/A N/A
W15: Nov 30 Final project presentations N/A N/A
Dec 5, 11:59 pm Project video due

Reviews (30% of final grade)

  • Due: 11:59 pm on the day before class. Late reviews will not be accepted.
  • The three reviews with the lowest scores will be dropped at the end of the semester.
  • You need not submit a paper review for a class in which you are leading the discussion of the paper.
  • Where to submit reviews? Submit each review via the link (to a Google Form) listed for that paper in the schedule. See submission instructions below.
  • Length <=1 page
  • 11 pt Times New Roman, 1 inch margins
  • Organization of Review:
  • Summary:
    • What is this paper about?
    • What is the main contribution?
    • Describe the main approach & results. Just facts, no opinions yet.
  • Strengths:
    • Is there a new theoretical insight?
    • Or a significant empirical advance? Did they solve a standing open problem?
    • Or a good formulation for a new problem?
    • Or a faster/better solution for an existing problem?
    • Any good practical outcome (code, algorithm, etc)?
    • Are the experiments well executed?
    • Useful for the community in general?
  • Weaknesses:
    • What can be done better?
    • Any missing baselines? Missing datasets?
    • Any odd design choices in the algorithm not explained well? Quality of writing?
    • Is there sufficient novelty in what they propose? Minor variation of previous work?
    • Why should anyone care? Is the problem interesting and significant?
  • Reflections:
    • How does this relate to other papers we have read?
    • What are the next research directions in this line of work?
    • What (directly or indirectly related) new ideas did this paper give you? What would you be curious to try?
  • Most interesting thought: Describe what you believe is your most insightful thought about the paper. It could be next research directions in this line of work, (directly or indirectly related) new ideas that this paper gave you, things that you would be curious to try, connections you've made to other work, etc.
  • Submission: Please type out your review in a text editor of your choice and ensure that your review is of appropriate length before submitting. Your grades for each paper review along with comments (if there are any) will be emailed to you. Please keep in mind that the reviews are due at 11:59pm the night before the class. Any reviews submitted after that will be returned with a grade of 0.
  • Tips and examples of good reviews (please note that the examples may not match the recommended format exactly): Example1, Example2, Example3.
  • Sample of high scoring review and "most interesting thought" from the class for Vinyals et al., CVPR 2015 by James Mullenbach: Sample Review.

Paper Discussion (10% of final grade)


You will lead the discussion of a paper you have read approximately once during the semester. You will be asked to argue either in favor of the paper or against it. In either case, come prepared with 5 points of discussion (in favor of or against the paper).

NOTE: You need not submit a review for the paper you are leading a discussion on.

Presentations


Slides should be made as visual (with videos, images, animations) and clear as possible. Students should practice their talks ahead of time to make sure they are of appropriate length -- not shorter by more than a few minutes, and certainly not longer (we will set a timer that will go off). The talks should be well organized and polished. Take a look at some example presentations: Example1, Example2.

Initial and update presentations (30% of final grade. See schedule)

Initial presentation: Each team will present for about 10 min. In the first presentation, teams will present a project proposal organized as follows:

  • Problem statement: Clearly state the goal of your project. Specify the input and desired output.
  • Related work: Briefly describe existing related work (with citations) and what your project brings to the table that these other works do not. The most relevant papers may not necessarily be papers listed on the schedule, so be sure to also look beyond the list.
  • Approach: Describe the technical approach you plan to employ. Clearly state the assumptions of your approach.
  • Experiments and results: Describe the experimental setup you will follow, which datasets you will use, which existing code you will exploit, what you will implement yourself, and what you would define as a success for the project. If you plan on collecting your own data, describe what data collection protocol you will follow. Specify if you plan on experimentally analyzing different characteristics of your approach, or if you will compare to existing techniques. Provide a list of experiments you will perform. Describe what you expect the experiments to reveal, or what is uncertain about the potential outcomes. If you have any preliminary results, please summarize those as well.
  • Timeline: Present a timeline of the planned tasks/goals. Clearly state what you plan to complete by the next presentation. Break this down into two sets -- tasks that you are sure you will have completed and tasks that are a bit of a long shot but you would like to complete. Please also use a similar breakdown during the update presentations. You will be expected to try hard to stick to this timeline.
Update presentations: In the following two presentations, you will update the class on your progress. You will remind the class of your problem statement, and provide a quick recap of the approach. Remind us of your timeline from your earlier presentation, and then describe your current results, any challenges or issues that you faced, and an updated timeline. Presentations will be <=7 min. long and will be followed by 8 min. of discussion.

Final presentation (15% of final grade. In class, Nov. 28, 30)

Each team will explain their project in a 7 min. presentation with an organization similar to the project proposal presentation, except now describing actual outcomes rather than plans. In addition, describe any challenges you faced and any insights on future extensions of the project. 3 min. of Q&A will follow each presentation.

Project video (15% of final grade. Due Dec. 5, 11:59 pm)

Teams will prepare a 1 min. YouTube video summarizing the project. The video is a teaser meant to convey the main points and spark the viewer's interest in learning more. It should be understandable by anyone familiar with AI. Please email the YouTube link to the TAs and the instructor.

See: Example 1, Example 2, Example 3, Example 4, Example 5, Example 6, Example 7, Example 8, Example 9, Example 10.

Final Project Spotlights (Fall 2017)

The final project spotlight videos for this course are available here. The list of projects is:

Projects


Your project will typically involve addressing a novel problem or addressing an existing problem in a novel way. Your goal should be to advance a state-of-the-art technique, or to introduce a new task in vision and language, benchmark basic approaches on it, and potentially propose an interesting model for the new task. Refer to the schedule to find topics of interest, but feel free to be creative and come up with your own! If you need help with ideas for your project, please talk to the TAs, any of the introductory speakers, or the instructor. While certainly not a requirement for the class, students should actively consider submitting a paper at the end of the course to a top-tier conference in Computer Vision, Natural Language Processing, Machine Learning, or AI.

Projects typically fall under one of these categories:

  • Novel problem / task / application.
  • Application/survey - compare several algorithms on an application domain of interest. These make the most sense if you expect an interesting trend or finding to emerge from the analysis.
  • Formulation/Development - formulate a new model or algorithm for a new or existing problem.
  • Analysis - analyze an existing algorithm.

Project teams should have 3-4 students (depending on enrollment). No more than 15 teams in the class total.

You may combine this with a project for another course, but you must clearly delineate the parts specific to each.
