Themos Stafylakis: Deep Word Embeddings for Audiovisual Speech Recognition

Themos Stafylakis is a Marie Curie Research Fellow working on audiovisual automatic speech recognition at the Computer Vision Laboratory of the University of Nottingham (UK). He holds a PhD from the Technical University of Athens (Greece) on Speaker Diarization for Broadcast News. He has a strong publication record in speaker recognition and diarization, the result of a 5-year post-doc at CRIM (Montreal, Canada) under the supervision of Patrick Kenny. He is currently working on lip-reading and audiovisual speech recognition using deep learning methods. His talk takes place on November 22, 2017 at 13:00 in room A112.

Deep Word Embeddings for Audiovisual Speech Recognition

During the last few years, visual and audiovisual automatic speech recognition (ASR) has been witnessing a renaissance, which can largely be attributed to the advent of deep learning methods. Deep architectures and learning algorithms initially proposed for audio-based ASR are being combined with powerful computer vision models and are finding their way into lipreading and audiovisual ASR. In my talk, I will go through some of the most recent advances in audiovisual ASR, with emphasis on those based on deep learning. I will then present a deep architecture for visual and audiovisual ASR which attains state-of-the-art results on the challenging Lipreading-in-the-Wild database. Finally, I will focus on how this architecture can generalize to words unseen during training and discuss its applicability to continuous-speech audiovisual ASR.

Tunç Aydın: Extracting transparent image layers for high-quality compositing

Tunç Aydın is a Research Scientist at Disney Research, located at the Zürich lab. His current research primarily focuses on image and video processing problems that address various movie production challenges, such as natural matting, green-screen keying, color grading, edge-aware filtering, and temporal coherence, among others. He has also been interested in analyzing visual content in terms of visual quality and aesthetic plausibility by utilizing knowledge of the human visual system. In his work he tends to utilize High Dynamic Range, Stereoscopic 3D, and High Frame-rate content, in addition to standard 8-bit images and videos.

Prior to joining Disney Research, he worked as a Research Associate at the Max-Planck-Institut für Informatik from 2006 to 2011, where he obtained his PhD degree under the supervision of Karol Myszkowski and Hans-Peter Seidel. He received the Eurographics PhD award in 2012 for his dissertation. He holds a Master’s degree in Computer Science from the College of Computing at Georgia Institute of Technology, and a Bachelor’s degree in Civil Engineering from Istanbul Teknik Universitesi. His talk takes place on Wednesday, November 1, 2017 at 13:00 in room A112.

Extracting transparent image layers for high-quality compositing

Compositing is an essential task in visual content production. For instance, a contemporary feature film production that doesn’t involve any compositing work is a rare occasion. However, achieving production-level quality often requires a significant amount of manual labor by digital compositing artists, mainly due to the limits of the existing tools for various compositing tasks. In this presentation I will talk about our recent work that aims at improving upon existing compositing technologies, focusing on natural matting, green-screen keying, and color editing. We tackle natural matting using a novel affinity-based approach, whereas for green-screen keying and color editing we introduce a “color unmixing” framework, which we specialize individually for the two problem domains. Using these new techniques we achieve state-of-the-art results while also significantly reducing manual interaction time.


Jakub Mareček: Urban Traffic Management – Traffic State Estimation, Signalling Games, and Traffic Control

Jakub Mareček is a research staff member at IBM Research. Together with some fabulous colleagues, Jakub develops solvers for optimisation and control problems at IBM’s Smarter Cities Technology Centre. Jakub joined IBM Research from the School of Mathematics at the University of Edinburgh in August 2012. Prior to his brief post-doc in Edinburgh, Jakub had presented an approach to general-purpose integer programming in his dissertation at the University of Nottingham and worked in two start-up companies in Brno, the Czech Republic. His talk takes place on Monday, October 16, 2017 at 13:30 in room D0207.

Urban Traffic Management: Traffic State Estimation, Signalling Games, and Traffic Control

In many engineering applications, one needs to identify a model of a non-linear system, increasingly using large volumes of heterogeneous, streamed data, and apply some form of (optimal) control. First, we illustrate why much of classical identification and control is not applicable to problems involving time-varying populations of agents, such as in smart grids and intelligent transportation systems. Second, we use tools from robust statistics and convex optimisation to present alternative approaches to closed-loop system identification, and tools from iterated function systems to identify controllers for such systems with certain probabilistic guarantees on the performance for the individual interacting with the controller.

Marc Delcroix and Keisuke Kinoshita: NTT far-field speech processing research

Marc Delcroix is a senior research scientist at NTT Communication Science Laboratories, Kyoto, Japan. He received the M.Eng. degree from the Free University of Brussels, Brussels, Belgium, and the Ecole Centrale Paris, Paris, France, in 2003, and the Ph.D. degree from the Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Japan, in 2007. His research interests include robust multi-microphone speech recognition, acoustic model adaptation, integration of speech enhancement front-ends and recognition back-ends, speech enhancement, and speech dereverberation. He took an active part in the development of NTT’s robust speech recognition systems for the REVERB and the CHiME 1 and 3 challenges, which all achieved the best performance on their respective tasks. He was one of the organizers of the REVERB challenge 2014 and of ASRU 2017. He is a visiting lecturer at the Faculty of Science and Engineering of Waseda University, Tokyo, Japan.

Keisuke Kinoshita is a senior research scientist at NTT Communication Science Laboratories, Kyoto, Japan. He received the M.Eng. degree and the Ph.D. degree from Sophia University, Tokyo, Japan, in 2003 and 2010, respectively. He joined NTT in 2003 and has since been working on speech and audio signal processing. His research interests include single- and multi-channel speech enhancement and robust speech recognition. He was the Chief Coordinator of the REVERB challenge 2014 and an organizing committee member of ASRU 2017. He received the IEICE Paper Award (2006), the ASJ Technical Development Award (2009), the ASJ Awaya Young Researcher Award (2009), the Japan Audio Society Award (2010), and the Maeshima Hisoka Award (2017). He is a visiting lecturer at the Faculty of Science and Engineering of Doshisha University, Tokyo, Japan.

Their talk takes place on Monday, August 28, 2017 at 13:00 in room A112.

NTT far-field speech processing research

The success of voice search applications and voice-controlled devices such as the Amazon Echo confirms that speech is becoming a common modality for accessing information. Despite great recent progress in the field, it is still challenging to achieve high automatic speech recognition (ASR) performance when using microphones distant from the speakers (far-field), because of noise, reverberation and potentially interfering speakers. It is even more challenging when the target speech consists of spontaneous conversations.

At NTT, we are pursuing research on far-field speech recognition, focusing on speech enhancement front-ends and robust ASR back-ends, towards building next-generation ASR systems able to understand natural conversations. Our research achievements have been combined into the ASR systems we developed for the REVERB and CHiME 3 challenges, and for meeting recognition.

In this talk, after giving a brief overview of the research activity of our group, we will introduce two of our recent research achievements in more detail. First, we will present our work on speech dereverberation using the weighted prediction error (WPE) algorithm. We have recently proposed an extension of WPE that integrates deep neural network based speech modeling into the WPE framework, and we demonstrate further potential performance gains for reverberant speech recognition.
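
For concreteness, here is a minimal single-channel sketch of the weighted-prediction-error idea for one STFT frequency bin: late reverberation is predicted from delayed past frames by weighted least squares and subtracted. The tap count, delay and iterative power estimate are illustrative assumptions, not the exact NTT recipe (which is multi-channel and, in the extension discussed in the talk, replaces the iterative power estimate with a DNN-based speech model).

```python
import numpy as np

def wpe_single_bin(y, taps=10, delay=3, iterations=3, eps=1e-6):
    """Simplified single-channel WPE for one STFT frequency bin.
    y: complex spectrogram sequence of shape (T,). Returns a dereverberated sequence."""
    T = len(y)
    x = y.copy()
    for _ in range(iterations):
        # Time-varying power of the current estimate acts as the weight
        # (a DNN speech model can supply this estimate instead).
        lam = np.maximum(np.abs(x) ** 2, eps)
        # Delayed observation matrix: column k holds y[t - delay - k].
        Y = np.zeros((T, taps), dtype=complex)
        for k in range(taps):
            shift = delay + k
            Y[shift:, k] = y[: T - shift]
        # Weighted least squares for the linear prediction filter g.
        Yw = Y / lam[:, None]
        R = Yw.conj().T @ Y + eps * np.eye(taps)
        p = Yw.conj().T @ y
        g = np.linalg.solve(R, p)
        # Subtract the predicted late reverberation.
        x = y - Y @ g
    return x
```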

Next, we will discuss our recent work on acoustic model adaptation to create ASR back-ends robust to speaker and environment variations. We have recently proposed a context adaptive neural network architecture, which is a powerful way to exploit speaker or environment information to perform rapid acoustic model adaptation.

S. Umesh: Acoustic Modelling of low-resource Indian languages

S. Umesh is a professor in the Department of Electrical Engineering at the Indian Institute of Technology Madras. His research interests are mainly in automatic speech recognition, particularly low-resource modelling and speaker normalization and adaptation. He has also been a visiting researcher at AT&T Laboratories, Cambridge University, and RWTH Aachen under the Humboldt Fellowship. He is currently leading a consortium of 12 Indian institutions to develop speech-based systems in the agricultural domain. His talk takes place on Tuesday, June 27, 2017 at 13:00 in room A112.

Acoustic Modelling of low-resource Indian languages

In this talk, I will present recent efforts in India to build speech-based systems in the agriculture domain that provide easy access to information for about 600 million farmers. The systems are being developed by a consortium of 12 Indian institutions, initially in 12 languages, which will then be expanded to another 12 languages. Since the usage is in extremely noisy environments such as fields, the emphasis is on high accuracy by using directed queries which elicit short, phrase-like responses. Within this framework, we explored cross-lingual and multilingual acoustic modelling techniques using subspace GMMs and phone-CAT approaches. We also extended the use of phone-CAT to phone-mapping and articulatory-feature extraction, which were then fed to a DNN-based acoustic model. Further, we explored the joint estimation of the acoustic model (DNN) and the articulatory feature extractors. These approaches gave significant improvements in recognition performance compared to building systems using data from only one language. Finally, since the speech consisted mostly of short and noisy utterances, conventional adaptation and speaker-normalization approaches could not be easily used. We investigated the use of a neural network to map filter-bank features to fMLLR/VTLN features, so that the normalization can be done at frame level without a first-pass decode or the need for long utterances to estimate the transforms. Alternately, we used a teacher-student framework where a teacher trained on normalized features provides “soft targets” to a student network trained on un-normalized features. In both approaches, we obtained recognition performance that is better than i-vector-based normalization schemes.
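
As an illustration of the teacher-student idea mentioned above, the sketch below trains a student against the teacher’s soft posteriors with a plain cross-entropy. The temperature value and numpy formulation are assumptions for illustration only, not the exact setup used in the consortium’s systems.

```python
import numpy as np

def softmax(z, T=1.0):
    e = np.exp((z - z.max(axis=-1, keepdims=True)) / T)
    return e / e.sum(axis=-1, keepdims=True)

def teacher_student_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy of the student (fed un-normalized features) against the
    soft targets produced by a teacher trained on fMLLR/VTLN-normalized features."""
    q = softmax(teacher_logits, T)                 # teacher posteriors (soft targets)
    log_p = np.log(softmax(student_logits, T) + 1e-12)
    return float(-(q * log_p).sum(axis=-1).mean())
```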

Kwang In Kim: Toward Intuitive Imagery: User Friendly Manipulation and Exploration of Images and Videos

Kwang In Kim is a senior lecturer in computer science at the University of Bath. He received a BSc in computer engineering from Dongseo University in 1996, and MSc and PhD degrees in computer engineering from Kyungpook National University in 1998 and 2000, respectively. He was a post-doctoral researcher at KAIST, the Max-Planck-Institute for Biological Cybernetics, Saarland University, and the Max-Planck-Institute for Informatics from 2000 to 2013. Before joining Bath, he was a lecturer at the School of Computing and Communications, Lancaster University. His research interests include machine learning, vision, graphics, and human-computer interaction. His talk takes place on Wednesday, May 10th, 2017, at 3:30pm in room E105.

Toward Intuitive Imagery: User Friendly Manipulation and Exploration of Images and Videos

With the ubiquity of image and video capture devices, it is easy to form large collections of images and videos. Two important questions in this context are 1) how to retain the quality of individual images and videos and 2) how to explore the resulting large collections. Unlike professionally captured photographs and videos, the quality of imagery casually captured by regular users is usually low. In this talk, we will discuss manipulating and improving such images and videos in several respects. The central theme of the talk is user-friendliness. Unlike existing sophisticated algorithms, our approaches focus on enabling non-expert users to freely manipulate and improve personal imagery collections. We present two specific examples in this context: image enhancement and video object removal. Turning to exploration, existing interfaces to video collections are often simply lists of text-ranked videos, which do not exploit the visual content relationships between videos, or other implicit relationships such as spatial or geographical ones. In the second part of the talk, we discuss data structures and interfaces that exploit content relationships present in images and videos.

Reinhold Häb-Umbach: Neural Network Supported Acoustic Beamforming

Reinhold Häb-Umbach is a professor of Communications Engineering at the University of Paderborn, Germany. His main research interests are in the fields of statistical signal processing and pattern recognition, with applications to speech enhancement, acoustic beamforming and source separation, as well as automatic speech recognition and unsupervised learning from speech and audio. He has more than 200 scientific publications, and recently co-authored the book Robust Automatic Speech Recognition – a Bridge to Practical Applications (Academic Press, 2015). He is a fellow of the International Speech Communication Association (ISCA). His talk takes place on Monday, April 24th, at 1pm in room D0207.

Neural Network Supported Acoustic Beamforming for Speech Enhancement and Recognition

Abstract: With multiple microphones, spatial information can be exploited to extract a target signal from a noisy environment. While the theory of statistically optimum beamforming is well established, the challenge lies in the estimation of the beamforming coefficients from the noisy input signal. Traditionally these coefficients are derived from an estimate of the direction of arrival of the target signal, while more elaborate methods estimate the power spectral density (PSD) matrices of the desired and the interfering signals, thus avoiding the assumption of anechoic signal propagation. We have proposed to estimate these PSD matrices using spectral masks determined by a neural network. This combination of data-driven approaches with statistically optimum multi-channel filtering has delivered competitive results on the recent CHiME challenge. In this talk, we detail this approach and show that the concept is more general and can, for example, also be used for dereverberation. When the beamformer is used as a front-end for a speech recognition system, we further show how the neural network for spectral mask estimation can be optimized with respect to a word-error-rate-related criterion in an end-to-end setup.
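
A minimal numpy sketch of the general recipe: spectral masks produced by a neural network yield mask-weighted spatial covariance (PSD) matrices, from which an MVDR beamformer is formed. The rank-one steering-vector extraction and the MVDR formulation are one common instantiation and are assumptions here, not necessarily the exact variant used in the systems described in the talk.

```python
import numpy as np

def mask_based_mvdr(Y, speech_mask, noise_mask, eps=1e-8):
    """Y: (F, T, M) multi-channel STFT; masks: (F, T) values in [0, 1] from a NN.
    Returns the beamformed single-channel STFT of shape (F, T)."""
    F, T, M = Y.shape
    out = np.zeros((F, T), dtype=complex)
    for f in range(F):
        Yf = Y[f]                                   # (T, M)
        ws = speech_mask[f][:, None]
        wn = noise_mask[f][:, None]
        # Mask-weighted PSD matrices: Phi = sum_t w[t] * y[t] y[t]^H.
        Phi_s = Yf.T @ (ws * Yf.conj()) / max(ws.sum(), eps)
        Phi_n = Yf.T @ (wn * Yf.conj()) / max(wn.sum(), eps) + eps * np.eye(M)
        # Steering vector as the principal eigenvector of the speech PSD matrix.
        d = np.linalg.eigh(Phi_s)[1][:, -1]
        num = np.linalg.solve(Phi_n, d)
        w = num / (d.conj() @ num)                  # MVDR weights
        out[f] = Yf @ w.conj()                      # w^H y for every frame
    return out
```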

Jiří Matas: Tracking with Discriminative Correlation Filters

Jiří Matas is a full professor at the Center for Machine Perception, Czech Technical University in Prague. He holds a PhD degree from the University of Surrey, UK (1995). He has published more than 200 papers in refereed journals and conferences. Google Scholar reports about 22,000 citations to his work and an h-index of 53.
He received the best paper prize at the International Conference on Document Analysis and Recognition in 2015, the Scandinavian Conference on Image Analysis 2013, the Image and Vision Computing New Zealand Conference 2013, the Asian Conference on Computer Vision 2007, and the British Machine Vision Conferences in 2002 and 2005. His students have received a number of awards, e.g. the Best Student Paper award at ICDAR 2013, a Google Fellowship in 2013, and various “Best Thesis” prizes.
J. Matas is on the editorial board of IJCV and was the Associate Editor-in-Chief of IEEE T. PAMI. He is a member of the ERC Computer Science and Informatics panel. He has served in various roles at major international conferences, e.g. ICCV, CVPR, ICPR, NIPS, ECCV, co-chairing ECCV 2004 and CVPR 2007. He is a program co-chair for ECCV 2016.
His research interests include object recognition, text localization and recognition, image retrieval, tracking, sequential pattern recognition, invariant feature detection, and Hough Transform and RANSAC-type optimization. His talk takes place on Thursday, March 2nd, at 1pm in room E105.

Tracking with Discriminative Correlation Filters

Visual tracking is a core video processing problem with many applications, e.g. in surveillance, autonomous driving, sport analysis, augmented reality, film post-production and medical imaging.

In the talk, tracking methods based on Discriminative Correlation Filters (DCF) will be presented. DCF-based trackers are currently the top performers on most commonly used tracking benchmarks. Starting from the oldest and simplest versions of DCF trackers like MOSSE, we will progress to kernel-based and multi-channel variants including those exploiting CNN features. Finally, the Discriminative Correlation Filter with Channel and Spatial Reliability will be introduced.
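
To make the correlation-filter idea concrete, here is a bare-bones numpy sketch of a MOSSE-style filter: it is learned in closed form in the Fourier domain and applied by a single element-wise multiplication per frame. Preprocessing, online updates and the multi-channel/kernel extensions of the later trackers are omitted, and the regularization value is an illustrative choice.

```python
import numpy as np

def train_mosse(patches, responses, lam=1e-2):
    """Learn the conjugate filter H* that minimizes sum_i |F_i * H* - G_i|^2 (regularized by lam).
    patches, responses: lists of equally sized 2-D arrays (training patches and
    desired Gaussian-shaped response maps)."""
    num, den = 0.0, 0.0
    for p, g in zip(patches, responses):
        F_ = np.fft.fft2(p)
        G = np.fft.fft2(g)
        num = num + G * np.conj(F_)
        den = den + F_ * np.conj(F_)
    return num / (den + lam)          # element-wise closed-form solution for H*

def track(H_conj, patch):
    """Correlate a new patch with the filter; the response peak gives the target shift."""
    response = np.real(np.fft.ifft2(np.fft.fft2(patch) * H_conj))
    return np.unravel_index(np.argmax(response), response.shape)
```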

Time permitting, I will briefly introduce a problem that has been so far largely ignored by the computer vision community – tracking of blurred, fast moving objects.

Video recording of the talk is publicly available.

Piotr Didyk: Perception and Personalization in Digital Content Reproduction

Piotr Didyk is an Independent Research Group Leader at the Cluster of Excellence on ”Multimodal Computing and Interaction” at Saarland University (Germany), where he heads a group on Perception, Display, and Fabrication. He is also appointed as a Senior Researcher at the Max Planck Institute for Informatics. Prior to this, he spent two years as a postdoctoral associate at the Massachusetts Institute of Technology. In 2012, he obtained his PhD from the Max Planck Institute for Informatics and Saarland University for his work on perceptual display. During his studies, he was also a visiting student at MIT. In 2008, he received his M.Sc. degree in Computer Science from the University of Wrocław (Poland). His research interests include human perception, new display technologies, image/video processing, and computational fabrication. His main focus is on techniques that account for properties of the human sensory system and human interaction to improve the perceived quality of final images, videos, and 3D prints. His talk takes place on Wednesday, February 15th, 1pm in room A113.

Perception and Personalization in Digital Content Reproduction

There has been a tremendous increase in the quality and number of new output devices, such as stereo and automultiscopic screens, portable and wearable displays, and 3D printers. Unfortunately, the capabilities of these emerging technologies outpace the methods and tools for creating content. Also, the current level of understanding of how these new technologies influence user experience is insufficient to fully exploit their advantages. In this talk, I will present our recent efforts in the context of perception-driven techniques for digital content reproduction. I will demonstrate that careful combinations of new hardware, computation, and models of human perception can lead to solutions that provide a significant increase in perceived quality. More precisely, I will discuss two techniques for overcoming limitations of 3D displays. They exploit information about gaze direction as well as the motion-parallax cue. I will also demonstrate a new design of automultiscopic screen for cinema and a prototype of a near-eye augmented reality display that supports focus cues. Next, I will show how careful rendering of frames enables continuous frame-rate manipulations, giving artists a new tool for video manipulation. The technique can, for example, reduce temporal artifacts without sacrificing the cinematic look of movie content. In the context of digital fabrication, I will present a perceptual model of compliance, with applications to 3D printing.

Manuel M. Oliveira: Efficient Deconvolution Techniques for Computational Photography

Manuel M. Oliveira is an Associate Professor of Computer Science at the Federal University of Rio Grande do Sul (UFRGS), in Brazil. He received his PhD from the University of North Carolina at Chapel Hill, in 2000. Before joining UFRGS in 2002, he was an Assistant Professor of Computer Science at the State University of New York at Stony Brook (2000 to 2002). In the 2009-2010 academic year, he was a Visiting Associate Professor at the MIT Media Lab. His research interests cover most aspects of computer graphics, but especially the frontiers among graphics, image processing, and vision (both human and machine). In these areas, he has contributed a variety of techniques including relief texture mapping, real-time filtering in high-dimensional spaces, efficient algorithms for Hough transform, new physiologically-based models for color perception and pupil-light reflex, and novel interactive techniques for measuring visual acuity. Dr. Oliveira was program co-chair of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games 2010 (I3D 2010), and general co-chair of ACM I3D 2009. He is an Associate Editor of IEEE TVCG and IEEE CG&A, and a member of the CIE Technical Committee TC1-89 “Enhancement of Images for Colour Defective Observers”. He received the ACM Recognition of Service Award in 2009 and in 2010. His talk will take place on Tuesday, January 31st, 1 pm in room E105.

Efficient Deconvolution Techniques for Computational Photography

Abstract: Deconvolution is a fundamental tool for many imaging applications ranging from microscopy to astronomy. In this talk, I will present efficient deconvolution techniques tailored to two important computational photography applications: estimating color and depth from a single photograph, and motion deblurring from camera shake. For the first, I will describe a coded-aperture method based on a family of masks obtained as the convolution of one “hole” with a structural component consisting of an arrangement of Dirac delta functions. We call this arrangement of delta functions the structural component of the mask, and use it to efficiently encode scene distance information. I will then show how one can design well-conditioned masks for which deconvolution can be efficiently performed by inverse filtering. I will demonstrate the effectiveness of this approach by constructing a mask for distance coding and using it to recover color and depth information from single photographs. This leads to a significant speedup, extended range, and higher depth resolution compared to previous approaches. For the second application, I will present an efficient technique for high-quality non-blind deconvolution based on the use of sparse adaptive priors. Despite its ill-posed nature, I will show how to model the non-blind deconvolution problem as a linear system, which is solved in the frequency domain. This clean formulation lends itself to a simple and efficient implementation, which is faster and tends to produce results with higher peak signal-to-noise ratio than previous methods.
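
As a frequency-domain baseline for the non-blind setting discussed above, the sketch below applies classic Wiener-regularized inverse filtering. It illustrates why a linear-system formulation solved in the frequency domain is fast, but it is not the sparse-adaptive-prior method of the talk, and the constant noise-to-signal ratio is an assumption.

```python
import numpy as np

def wiener_deconvolve(blurred, kernel, nsr=1e-2):
    """Non-blind deconvolution by regularized inverse filtering in the frequency domain.
    blurred: observed image; kernel: known blur kernel; nsr: noise-to-signal ratio."""
    H = np.fft.fft2(kernel, s=blurred.shape)        # zero-padded kernel spectrum
    B = np.fft.fft2(blurred)
    W = np.conj(H) / (np.abs(H) ** 2 + nsr)         # Wiener inverse filter
    return np.real(np.fft.ifft2(W * B))
```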

Video recording of the talk is publicly available.

Tomáš Mikolov: Neural Networks for Natural Language Processing

Tomáš Mikolov has been a research scientist at Facebook AI Research since 2014. Previously, he was a member of the Google Brain team, where he developed efficient algorithms for computing distributed representations of words (the word2vec project). He obtained his PhD from Brno University of Technology for work on recurrent neural network based language models (RNNLM). His long-term research goal is to develop intelligent machines capable of communicating with people using natural language. His talk will take place on Tuesday, January 3rd, 2017, 5pm in room E112.

Neural Networks for Natural Language Processing

Abstract: Neural networks are currently very successful in various machine learning tasks that involve natural language. In this talk, I will describe how recurrent neural network language models have been developed, as well as their most frequent applications to speech recognition and machine translation. Next, I will talk about distributed word representations, their interesting properties, and efficient ways to compute them. Finally, I will describe our latest efforts to create a novel dataset that would allow researchers to develop new types of applications that include communication with human users in natural language.
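
One of the “interesting properties” of these distributed representations is that simple vector arithmetic captures analogies. The toy sketch below assumes a dictionary of unit-normalized word vectors (e.g. exported from a trained word2vec model) and is meant only to illustrate the property, not any particular toolkit’s API.

```python
import numpy as np

def analogy(emb, a, b, c, topn=3):
    """Answer "a is to b as c is to ?" by nearest neighbours of b - a + c.
    emb: dict mapping words to unit-normalized numpy vectors."""
    v = emb[b] - emb[a] + emb[c]
    v /= np.linalg.norm(v)
    scores = {w: float(v @ u) for w, u in emb.items() if w not in (a, b, c)}
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# e.g. analogy(emb, "man", "king", "woman") is expected to rank "queen" highly.
```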

Gernot Ziegler: Data Parallelism in Computer Vision

Gernot Ziegler (Dr.-Ing.) is an Austrian engineer with an MSc degree in Computer Science and Engineering from Linköping University, Sweden, and a PhD from the University of Saarbrücken, Germany. He pursued his PhD studies at the Max-Planck-Institute for Informatics in Saarbrücken, Germany, specializing in GPU algorithms for computer vision and data-parallel algorithms for spatial data structures. He then joined NVIDIA’s DevTech team, where he consulted on high performance computing and automotive computer vision on graphics hardware. In 2016, Gernot founded his own consulting company to explore the applications of his computer vision expertise on graphics hardware in the mobile consumer, industrial vision, and heritage digitalization domains. His talk will take place on Wednesday, December 14th, 2016, 1pm in room E105.

Data Parallelism in Computer Vision

Abstract: In algorithmic design, serial data dependencies which accelerate CPU processing for computer vision are often counterproductive for the data-parallel GPU. The talk presents data structures and algorithms that enable data parallelism for connected components, line detection, feature detection, marching cubes and octree generation. We will point out the important aspects of data-parallel design that will allow you to design new algorithms for GPGPU-based computer vision and image processing yourself. As food for thought, I will sketch algorithmic ideas that could lead to new collaborative results in real-time computer vision.
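
As one example of trading a serial algorithm for a data-parallel one, the sketch below labels connected components by iterated neighbourhood minimum propagation: every pixel independently takes the smallest label in its 4-neighbourhood, an update with no serial dependencies that maps naturally onto the GPU. Here numpy stands in for the per-pixel parallelism; real GPU implementations use faster label-equivalence or pointer-jumping schemes, so this is only a sketch of the data-parallel formulation.

```python
import numpy as np

def connected_components(mask, max_iters=10_000):
    """Label 4-connected components of a boolean image by parallel min-propagation.
    Returns an integer label image; background pixels get -1."""
    H, W = mask.shape
    INF = np.iinfo(np.int64).max
    labels = np.where(mask, np.arange(H * W, dtype=np.int64).reshape(H, W), INF)
    for _ in range(max_iters):
        padded = np.pad(labels, 1, constant_values=INF)
        # Minimum over the 4-neighbourhood, computed for all pixels at once.
        neigh = np.minimum.reduce([padded[:-2, 1:-1], padded[2:, 1:-1],
                                   padded[1:-1, :-2], padded[1:-1, 2:]])
        new = np.where(mask, np.minimum(labels, neigh), INF)
        if np.array_equal(new, labels):
            break
        labels = new
    return np.where(mask, labels, -1)
```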

Video recording of the talk is publicly available.

Stefan Jeschke: Recent Advances in Vector Graphics Creation and Display

Stefan Jeschke is a scientist at IST Austria. He received an M.Sc. in 2001 and a Ph.D. in 2005, both in computer science from the University of Rostock, Germany. Afterwards, he spent several years as a post-doc researcher in several projects at Vienna University of Technology and Arizona State University. His research interests include modeling and display of vectorized image representations, applications and solvers for PDEs, as well as modeling and rendering complex natural phenomena, preferably in real time. His talk will take place on Tuesday, November 8th, 2016, 1pm in room G202.

Recent Advances in Vector Graphics Creation and Display

This talk gives an overview of my recent work on vector graphics representations as semantically meaningful image descriptions, in contrast to pixel-based raster images. I will cover the problem of how to efficiently create vector graphics, either from scratch or from given raster images. The goal is to support designers in producing complex, high-quality representations with only limited manual input. Furthermore, I will talk about various new developments that are mainly based on so-called “diffusion curves”. Here the goal is to improve the expressiveness of such representations, for example by adding textures, so that natural images appear more realistic without adding excessive amounts of geometry beyond what can be handled by a designer. Rendering such representations at interactive frame rates on modern GPUs is another aspect I will cover in this talk.
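
To give a feel for the “diffusion curves” primitive, the toy sketch below diffuses colors that are fixed along curve pixels into the rest of the image by Jacobi iterations on the Laplace equation. It is a deliberately naive CPU version with periodic boundaries via np.roll; the systems discussed in the talk solve this far more efficiently on the GPU and add further attributes such as blur and texture.

```python
import numpy as np

def render_diffusion_curves(colors, on_curve, iters=2000):
    """colors: (H, W, 3) array, valid where on_curve is True; on_curve: (H, W) bool.
    Returns an image in which the curve colors are smoothly diffused over the domain."""
    img = np.where(on_curve[..., None], colors, 0.0).astype(float)
    for _ in range(iters):
        # Jacobi update for the Laplace equation: average of the 4-neighbourhood.
        avg = 0.25 * (np.roll(img, 1, 0) + np.roll(img, -1, 0) +
                      np.roll(img, 1, 1) + np.roll(img, -1, 1))
        img = np.where(on_curve[..., None], colors, avg)
    return img
```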

Video recording of the talk is publicly available.

Tomáš Pajdla: 3D Reconstruction from Photographs and Algebraic Geometry

Tomáš Pajdla is a Distinguished Researcher at CIIRC – the Czech Institute of Informatics, Robotics and Cybernetics (ciirc.cvut.cz) and an Assistant Professor at the Faculty of Electrical Engineering (fel.cvut.cz) of the Czech Technical University in Prague. He works on geometry, algebra and optimization in computer vision and robotics, 3D reconstruction from images, and visual object recognition. He is known for his contributions to the geometry of cameras, image matching, 3D reconstruction, visual localization, camera and hand-eye calibration, and algebraic methods in computer vision. He coauthored works awarded best paper prizes at OAGM 1998 and 2013, BMVC 2002 and ACCV 2014. His talk will take place on Wednesday, November 2nd, 2016, 1pm in room E105.

3D Reconstruction from Photographs and Algebraic Geometry

Abstract: We will show a connection between state-of-the-art 3D reconstruction from photographs and algebraic geometry. In particular, we will show how some modern tools from computational algebraic geometry can be used to solve classical as well as recent problems in computing camera calibration and orientation in space. We will present applications in large-scale reconstruction from photographs, robotics, and camera calibration.

Video recording of the talk is publicly available.

Ralf Schlüter: On the Relation between Error Measures, Statistical Modeling, and Decision Rules

Ralf Schlüter studied physics at RWTH Aachen University, Germany, and Edinburgh University, Scotland. He received the Diplom degree with honors in physics in 1995 and the Dr.rer.nat. degree with honors in computer science in 2000, both from RWTH Aachen University. From November 1995 to April 1996 he was with the Institute for Theoretical Physics B at RWTH Aachen, where he worked on statistical physics and stochastic simulation techniques. Since May 1996 he has been with the Faculty of Mathematics, Computer Science and Natural Sciences of RWTH Aachen University, where he currently is Academic Director. He leads the automatic speech recognition group at the Human Language Technology and Pattern Recognition lab. His research interests cover speech recognition in general, discriminative training, neural networks, information theory, stochastic modeling, signal analysis, and theoretical aspects of pattern classification. His talk will take place on Tuesday, August 23rd, 2016, 10am in room A112.

On the Relation between Error Measures, Statistical Modeling, and Decision Rules

Abstract: The aim of automatic speech recognition (ASR), or more generally, pattern classification, is to minimize the expected error rate. This requires a consistent interaction of the error measure with statistical modeling and the corresponding decision rule. Nevertheless, the error measure is often not considered consistently in ASR:

  • error measures usually are not easily tractable due to their discrete nature,
  • the quantitative relation between modeling and the error measure is, at least analytically, unclear and usually only exploited empirically,
  • the standard decision rule does not consider word error loss.

In this presentation, bounds on the classification error will be presented that can be directly related to acoustic and language modeling. A first analytic relation between language model perplexity and sentence error is established, and the quantitative effects of context reduction and feature omission on the error rate are derived. The corresponding error bounds were discovered, and finally analytically proven, within a simulation-induced framework, which will be outlined. Also, first attempts at designing a training criterion that supports the use of the standard decision rule while retaining the target of minimum word error rate are discussed. Finally, conditions will be presented under which the standard decision rule does in fact implicitly optimize the word/token error rate in spite of its sentence/segment-based target.
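
For reference, the contrast between the standard decision rule and one that actually targets word error can be written as follows; this is the standard textbook formulation, given here only to make the last bullet point above concrete:

```latex
\hat{W}_{\text{MAP}} = \arg\max_{W}\; p(W \mid X)
\qquad\text{vs.}\qquad
\hat{W}_{\text{MWE}} = \arg\min_{W} \sum_{W'} p(W' \mid X)\, L_{\mathrm{Lev}}(W, W'),
```

where L_Lev denotes the word-level Levenshtein (edit) distance. The left rule minimizes the expected sentence error (0-1 loss); only the right rule minimizes the expected word error.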

Elmar Eisemann: Everything Counts – Rendering Highly-detailed Environments in Real-time

Elmar Eisemann is a professor at TU Delft, heading the Computer Graphics and Visualization Group. Before that, he was an associate professor at Telecom ParisTech (until 2012) and a senior scientist heading a research group in the Cluster of Excellence (Saarland University / MPI Informatik) (until 2009). He studied at the École Normale Supérieure in Paris (2001-2005) and received his PhD from the University of Grenoble at INRIA Rhône-Alpes (2005-2008). He spent several research visits abroad: at the Massachusetts Institute of Technology (2003), the University of Illinois Urbana-Champaign (2006), and Adobe Systems Inc. (2007, 2008). His interests include real-time and perceptual rendering, alternative representations, shadow algorithms, global illumination, and GPU acceleration techniques. He coauthored the book “Real-time Shadows” and has participated in various committees and editorial boards. He was a local organizer of EGSR 2010, EGSR 2012 and HPG 2012, and paper chair of HPG 2015. His work has received several distinction awards and he was honored with the Eurographics Young Researcher Award 2011. His talk will take place on Friday, May 20th, 2016, 2pm in room E105.

Everything Counts – Rendering Highly-detailed Environments in Real-time

A traditional challenge in computer graphics is the simulation of natural scenes, including complex geometric models and a realistic reproduction of physical phenomena, which requires novel theoretical insights, appropriate algorithms, and well-designed data structures. In particular, there is a need for efficient image-synthesis solutions, a need fueled by the development of modern display devices that support 3D stereo, have high resolution and refresh rates, and offer deep color palettes.

In this talk, we will present methods for efficient image synthesis that address recent rendering challenges. In particular, we will focus on large-scale data sets and present novel techniques to encode highly detailed geometric information in a compact representation. Further, we will give an outlook on rendering techniques for modern display devices, as these often require very different solutions. In particular, human perception starts to play an increasing role and has high potential to be a key factor in future rendering solutions.

Video recording of the talk is publicly available.

Josef Sivic: Learning visual representations from Internet data

Josef Sivic holds a permanent position as an INRIA senior researcher (directeur de recherche) in the Department of Computer Science at the École Normale Supérieure (ENS) in Paris. He received a degree from the Czech Technical University in Prague in 2002 and a PhD from the University of Oxford in 2006. His research interests are in developing learnable image representations for automatic visual search and recognition applied to large image and video collections. Before joining INRIA, Dr. Sivic spent six months at the Computer Science and Artificial Intelligence Lab at the Massachusetts Institute of Technology. He has published more than 60 scientific publications, has served as an area chair for major computer vision conferences (CVPR’11, ICCV’11, ECCV’12, CVPR’13 and ICCV’13) and as a program chair for ICCV’15. He currently serves as an associate editor for the International Journal of Computer Vision and is a Senior Fellow in the Learning in Machines & Brains program of the Canadian Institute for Advanced Research. He was awarded an ERC grant in 2013. His talk will take place on Friday, April 22nd, 2016, 10:30am in room E105.

Learning visual representations from Internet data

Abstract:
An unprecedented amount of visual data is now available on the Internet. Wouldn’t it be great if a machine could automatically learn from this data? For example, imagine a machine that can learn how to change a flat tire of a car by watching instruction videos on Youtube, or that can learn how to navigate in a city by observing street-view imagery. Learning from Internet data is, however, a very challenging problem, as the data is equipped only with weak supervisory signals, such as the human narration of an instruction video or noisy geotags for street-level imagery. In this talk, I will describe our recent progress on learning visual representations from such weakly annotated visual data.

In the first part of the talk, I will describe a new convolutional neural network architecture that is trainable in an end-to-end manner for the visual place recognition task. I will show that the network can be trained from weakly annotated Google Street View Time Machine imagery and significantly improves over the current state of the art in visual place recognition.
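
A common way to exploit such weak, GPS-level annotation is a ranking loss in which the best-matching geographically nearby image acts as the positive and far-away images act as negatives; the minimal sketch below illustrates that idea on precomputed descriptors. The squared-distance metric and margin value are illustrative assumptions, not the exact training objective of the presented work.

```python
import numpy as np

def weakly_supervised_ranking_loss(q, potential_positives, negatives, margin=0.5):
    """q: query descriptor; potential_positives: descriptors of geographically
    nearby images (weak labels); negatives: descriptors of far-away images.
    Returns a hinge ranking loss that pushes the query closer to its best
    potential positive than to any negative by the given margin."""
    d_pos = min(float(np.sum((q - p) ** 2)) for p in potential_positives)
    return sum(max(0.0, margin + d_pos - float(np.sum((q - n) ** 2)))
               for n in negatives)
```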

In the second part of the talk, I will describe a technique for automatically learning the main steps to complete a certain task, such as changing a car tire, from a set of narrated instruction videos. The method solves two clustering problems, one in text and one in video, linked by joint constraints to obtain a single coherent sequence of steps in both modalities. I will show results on a newly collected dataset of instruction videos from Youtube that include complex interactions between people and objects, and are captured in a variety of indoor and outdoor settings.

Joint work with J.-B. Alayrac, P. Bojanowski, N. Agrawal, S. Lacoste-Julien, I. Laptev, R. Arandjelovic, P. Gronat, A. Torii and T. Pajdla.

Tomáš Werner: Linear Programming Relaxation Approach to Discrete Energy Minimization

Tomáš Werner works as a researcher at the Center for Machine Perception, Faculty of Electrical Engineering, Czech Technical University, where he also obtained his PhD degree. In 2001-2002 he worked as a post-doc at the Visual Geometry Group, Oxford University, UK. In the past, his main interest was multiple view geometry and three-dimensional reconstruction in computer vision. Today, his interests are in machine learning and optimization, in particular graphical models. He is a (co-)author of more than 70 publications, with 350 citations in WoS. His talk was originally scheduled for Wednesday, February 24, 2016, 1pm in room G202. THE TALK IS POSTPONED; it will take place on Tuesday, April 12, 2016, 2pm in room A113.

Linear Programming Relaxation Approach to Discrete Energy Minimization

Abstract: Discrete energy minimization consists of minimizing a function of many discrete variables that is a sum of functions, each depending on a small subset of the variables. This is also known as MAP inference in graphical models (Markov random fields) or weighted constraint satisfaction. Many successful approaches to this useful but NP-complete problem are based on its natural LP relaxation. I will discuss this LP relaxation in detail, along with algorithms able to solve it for very large instances, which appear e.g. in computer vision. In particular, I will discuss in detail a convex message passing algorithm, generalized min-sum diffusion.
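
For a pairwise model, the natural (local-polytope) LP relaxation discussed in the talk can be written as follows; this is the standard formulation, with theta the unary and pairwise cost functions and mu the relaxed indicator (pseudomarginal) variables:

```latex
\min_{\mu \ge 0} \;
  \sum_{i}\sum_{x_i} \theta_i(x_i)\,\mu_i(x_i)
+ \sum_{(i,j)\in E}\sum_{x_i, x_j} \theta_{ij}(x_i, x_j)\,\mu_{ij}(x_i, x_j)
\quad\text{s.t.}\quad
\sum_{x_j} \mu_{ij}(x_i, x_j) = \mu_i(x_i), \qquad
\sum_{x_i} \mu_i(x_i) = 1 .
```

Integral solutions of mu correspond to label assignments, so the relaxation lower-bounds the discrete minimum; message-passing algorithms such as min-sum diffusion optimize a reparametrized form of its dual.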

Christian Theobalt: Reconstructing the Real World in Motion

Christian Theobalt is a Professor of Computer Science and the head of the research group “Graphics, Vision, & Video” at the Max-Planck-Institute for Informatics, Saarbruecken, Germany. He is also an adjunct faculty at Saarland University. From 2007 until 2009 he was a Visiting Assistant Professor in the Department of Computer Science at Stanford University. Most of his research deals with algorithmic problems that lie on the boundary between the fields of Computer Vision and Computer Graphics, such as dynamic 3D scene reconstruction and marker-less motion capture, computer animation, appearance and reflectance modelling, machine learning for graphics and vision, new sensors for 3D acquisition, advanced video processing, as well as image- and physically-based rendering.

For his work, he received several awards, including the Otto Hahn Medal of the Max-Planck Society in 2007, the EUROGRAPHICS Young Researcher Award in 2009, and the German Pattern Recognition Award 2012. Further, in 2013 he was awarded an ERC Starting Grant by the European Union. In 2015, the German business magazine Capital elected him as one of the top 40 innovation leaders under 40. Christian Theobalt is a Principal Investigator and a member of the Steering Committee of the Intel Visual Computing Institute in Saarbruecken. He is also a co-founder of a spin-off company from his group – www.thecaptury.com – that is commercializing a new generation of marker-less motion and performance capture solutions.

Reconstructing the Real World in Motion

Even though many challenges remain unsolved, in recent years computer graphics algorithms to render photo-realistic imagery have seen tremendous progress. An important prerequisite for high-quality renderings is the availability of good models of the scenes to be rendered, namely models of shape, motion and appearance. Unfortunately, the technology to create such models has not kept pace with the technology to render the imagery. In fact, we observe a content creation bottleneck, as it often takes man months of tedious manual work by animation artists to craft models of moving virtual scenes.

To overcome this limitation, the graphics and vision communities have been developing techniques to capture dense 4D (3D+time) models of dynamic scenes from real-world examples, for instance from footage of real-world scenes recorded with cameras or other sensors. One example is performance capture: methods that measure detailed dynamic surface models, for example of actors or an actor’s face, from multi-view video and without markers in the scene. Even though such 4D capture methods have made big strides ahead, they are still at an early stage. Their application is limited to scenes of moderate complexity in controlled environments, reconstructed detail is limited, and captured content cannot be easily modified, to name only a few restrictions. Recently, the need for efficient dynamic scene reconstruction methods has further increased due to developments in other thriving research domains, such as virtual and augmented reality, 3D video, and robotics.

In this talk, I will elaborate on some ideas on how to go beyond the current limits of 4D reconstruction, and show some results from our recent work. For instance, I will show how we can take steps to capture dynamic models of humans and general scenes in unconstrained environments with few sensors. I will also show how we can capture higher shape detail as well as material parameters of scenes outside of the lab. The talk will also show how one can effectively reconstruct very challenging scenes of a smaller scale, such as hand motion. Further on, I will discuss how we can capitalize on more sophisticated light transport models to enable high-quality reconstruction in much more uncontrolled scenes, eventually also outdoors, with only a few cameras, or even just a single one. Ideas on how to perform deformable scene reconstruction in real time will also be presented, if time allows.

His talk takes place on Wednesday, March 23, 2016, 1pm in room G202.

Video recording of the talk is publicly available.

Christoph H. Lampert: Classifier Adaptation at Prediction Time

Christoph Lampert received the PhD degree in mathematics from the University of Bonn in 2003. In 2010 he joined the Institute of Science and Technology Austria (IST Austria), first as an Assistant Professor and since 2015 as a Professor. His research on computer vision and machine learning has won several international and national awards, including the best paper prize of CVPR 2008. In 2012 he was awarded an ERC Starting Grant by the European Research Council. He is an Editor of the International Journal of Computer Vision (IJCV), an Action Editor of the Journal of Machine Learning Research (JMLR), and Associate Editor in Chief of the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). His talk takes place on Tuesday, January 12, 2016, 1pm in room E104.

Classifier Adaptation at Prediction Time

Abstract: In the era of “big data” and a large commercial interest in computer vision, it is only a matter of time until we buy commercial object recognition systems in pre-trained form instead of training them ourselves. This, however, poses a problem of domain adaptation: the data distribution in which a customer plans to use the system will almost certainly differ from the data distribution that the vendor used during training. Two relevant effects are a change of the class ratios and the fact that the image sequences that need to be classified in real applications are typically not i.i.d. In my talk I will introduce a simple probabilistic technique that can adapt the object recognition system to the test-time distribution without having to change the underlying pre-trained classifiers. I will also introduce a framework for creating realistically distributed image sequences that offer a way to benchmark such adaptive recognition systems. Our results show that the above “problem” of domain adaptation can actually be a blessing in disguise: with proper adaptation, the error rates on realistic image sequences are typically lower than on standard i.i.d. test sets.
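
To illustrate the simplest ingredient of prediction-time adaptation, changed class ratios, the sketch below rescales a pre-trained classifier’s posteriors by the ratio of test to training priors and renormalizes. This is a generic prior-correction step under the assumption of known test priors; the method in the talk additionally estimates the test-time distribution and handles non-i.i.d. sequences, which this sketch does not.

```python
import numpy as np

def adapt_posteriors(posteriors, train_priors, test_priors):
    """posteriors: (N, C) outputs of a classifier trained under train_priors.
    Re-weights them for a deployment distribution with test_priors:
    p_test(y|x) is proportional to p_train(y|x) * p_test(y) / p_train(y)."""
    w = np.asarray(test_priors, dtype=float) / np.asarray(train_priors, dtype=float)
    adapted = np.asarray(posteriors, dtype=float) * w
    return adapted / adapted.sum(axis=-1, keepdims=True)
```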