Josef Sivic: Learning visual representations from Internet data

Josef Sivic holds a permanent position as an INRIA senior researcher (directeur de recherche) in the Department of Computer Science at the École Normale Supérieure (ENS) in Paris. He received a degree from the Czech Technical University, Prague, in 2002 and a PhD from the University of Oxford in 2006. His research interests are in developing learnable image representations for automatic visual search and recognition applied to large image and video collections. Before joining INRIA, Dr. Sivic spent six months at the Computer Science and Artificial Intelligence Lab at the Massachusetts Institute of Technology. He has authored more than 60 scientific publications, has served as an area chair for major computer vision conferences (CVPR’11, ICCV’11, ECCV’12, CVPR’13 and ICCV’13) and as a program chair for ICCV’15. He currently serves as an associate editor for the International Journal of Computer Vision and is a Senior Fellow in the Learning in Machines & Brains program of the Canadian Institute for Advanced Research. He was awarded an ERC grant in 2013. His talk will take place on Friday, April 22nd, 2016, 10:30am in room E105.

Learning visual representations from Internet data

An unprecedented amount of visual data is now available on the Internet. Wouldn’t it be great if a machine could automatically learn from this data? For example, imagine a machine that can learn how to change a flat tire on a car by watching instruction videos on YouTube, or that can learn how to navigate in a city by observing street-view imagery. Learning from Internet data is, however, a very challenging problem because the data comes only with weak supervisory signals, such as human narration of the instruction video or noisy geotags for street-level imagery. In this talk, I will describe our recent progress on learning visual representations from such weakly annotated visual data.

In the first part of the talk, I will describe a new convolutional neural network architecture that is trainable in an end-to-end manner for the visual place recognition task. I will show that the network can be trained from weakly annotated Google Street View Time Machine imagery and significantly improves over the current state of the art in visual place recognition.

In the second part of the talk, I will describe a technique for automatically learning the main steps to complete a certain task, such as changing a car tire, from a set of narrated instruction videos. The method solves two clustering problems, one in text and one in video, linked by joint constraints, to obtain a single coherent sequence of steps in both modalities. I will show results on a newly collected dataset of instruction videos from YouTube that include complex interactions between people and objects, and are captured in a variety of indoor and outdoor settings.

Joint work with J.-B. Alayrac, P. Bojanowski, N. Agrawal, S. Lacoste-Julien, I. Laptev, R. Arandjelovic, P. Gronat, A. Torii and T. Pajdla.