Tag Archives: Past talks

Talks already given.

Barbara Schuppler: Automatic speech recognition for conversational speech, or: What we can learn from human talk in interaction

Barbara Schuppler (Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria) pursued her PhD research at Radboud Universiteit Nijmegen (The Netherlands) and at NTNU Trondheim (Norway) within the Marie Curie Research Training Network “Sound to Sense”. The central topic of her thesis was the analysis of conditions for variation in large conversational speech corpora using ASR technology. Currently, she is working on an FWF-funded Elise Richter Grant entitled ”Cross-layer prosodic models for conversational speech,” and in October 2019 she starts her follow-up project “Cross-layer language models for conversational speech.” Her research continues to be interdisciplinary; it includes the development of automatic tools for the study of prosodic variation, the study of reduction and phonetic detail in conversational speech, and the integration of linguistic knowledge into ASR technology.

Automatic speech recognition for conversational speech, or: What we can learn from human talk in interaction

In the last decade, conversational speech has received a lot of attention among speech scientists. On the one hand, accurate automatic speech recognition (ASR) systems are essential for conversational dialogue systems, as these become more interactional and social rather than solely transactional. On the other hand, linguists study natural conversations because they reveal insights into how speech processing works that go beyond what controlled experiments provide. Investigating conversational speech, however, requires not only applying existing methods to new data, but also developing new categories, new modeling techniques, and new knowledge sources. Whereas traditional models are trained on either text or acoustic information, I propose language models that incorporate information on the phonetic variation of the words (i.e., pronunciation variation and prosody) and relate this information to the semantic context of the conversation and to the communicative functions in the conversation. This approach to language modeling is in line with the theoretical model proposed by Hawkins and Smith (2001), where the perceptual system accesses meaning from speech by using the most salient sensory information from any combination of levels/layers of formal linguistic analysis. The overall aim of my research is to create cross-layer models for conversational speech. In this talk, I will illustrate general challenges of ASR for conversational speech, present results from my recent and ongoing projects on pronunciation and prosody modeling, and discuss directions for future research.

Her talk takes place on Thursday, October 31, 2019 at 13:00 in “little theater” R211 (next to Kachnicka student club in “Stary Pivovar”).

Pratibha Moogi: India Centric R&D efforts in artificial intelligence

Pratibha Moogi holds a PhD from OGI, School of Engineering, OHSU, Portland, and a Master's degree from IIT Kanpur. She has served at SRI International and in many R&D groups, including Texas Instruments, Nokia, and Samsung. Currently she is a Director in the Data Science Group (DSG) at [24]7.ai, a leading B2B customer operation & journey analytics company. She is also actively involved in mentoring India-wide training initiatives and start-ups working in ML and AI, strengthening the local Indian ecosystem. She has 16+ years of industry experience working on a diverse set of multimedia processing and ML-based technologies, namely speech & audio recognition, fingerprint and iris biometrics, and computer vision based solutions and use-case scenarios. Her current interests are emerging fields of applying machine learning to interdisciplinary, cross-domain areas, e.g., predictive analytics based on multichannel data sources.

India Centric R&D efforts in artificial intelligence

India, a country of ~1.3 billion people, ~300 million smartphone users, and ~600 million internet users, is getting to use and experience AI- and ML-flavored solutions every single day, more than ever: an intelligent camera that takes the picture when you give that perfect smile, beautifies your face, tags your pictures based on the content and subject you tried to capture, or hides your gallery photos from intruders using your fingerprint, iris, or face biometrics; or fetching the details of the very product you just spotted in a mall, or of something your friend is holding right now, powered by content- (image-) based information (product) search algorithms. Speech and language technologies are redefining the voice interface for common Indian users, who speak 28+ local languages. Voice analytics solutions are empowering BPO (customer care) centers, whether it is routing millions of calls using automatically detected customer intents, segregating calls by positive or negative customer sentiment, or automatically generating business insights that drive more profit, revenue, and higher customer satisfaction scores, all powered by predictive analytics solutions. This talk covers some of the India-centric R&D efforts I have experienced while working on a variety of products, services, and solutions over the last decade. The talk is organized into the following topics: 1. AI/ML and the Digital India context (problems & opportunities, GDP landscape, start-up scenario), 2. products & solutions and recent deployments, 3. the present R&D spectrum of algorithmic research efforts, 4. overall learnings from the Indian market.

Her talk takes place on Friday, September 13, 2019 at 13:00 in room A112.

Itshak Lapidot: Speaker Diarization and a bit more

Itshak Lapidot emigrated from the USSR to Israel in 1971. He received his B.Sc., M.Sc., and Ph.D. degrees from the Electrical and Computer Engineering Department of Ben-Gurion University, Beer-Sheva, Israel, in 1991, 1994 and 2001, respectively. During one year (2002-2003) he held a postdoctoral position at IDIAP, Switzerland. Dr. Lapidot was previously a lecturer in the Electrical and Electronics Engineering Department at Sami Shamoon College of Engineering (SCE) in Beer-Sheva, Israel, and served as a researcher at the Laboratoire Informatique d’Avignon (LIA), University of Avignon, France, for one year (2011-2012). Recently, Dr. Lapidot assumed a teaching position with the Electrical Engineering Department at the Afeka Academic College of Engineering and joined the ACLP research team. Dr. Lapidot’s primary research interests are speaker diarization, speaker clustering and speaker verification. He is also interested in clustering and time series analysis from a theoretical point of view.

Speaker Diarization and a bit more

The talk will present four approaches developed for speaker and speech technologies that can also be applied to other machine learning (ML) problems:
1. Speaker diarization – answering the question “Who spoke when?” when there is no knowledge about the speakers or the environments; no prior knowledge can be used, so the problem is unsupervised. When no prior information is available, not even to train a GMM, a total variability matrix or a PLDA model, a different approach must be taken that uses only the data of the given conversation. One possible solution is Viterbi-based segmentation with hidden Markov models (HMMs). It assumes a high correlation between the log-likelihood and the diarization error rate (DER), and this assumption leads to various problems. One possible solution will be shown, applicable not only to probabilistic systems but to a much broader family of solutions named hidden distortion models (HDMs).
2. In applications such as homeland security, clustering a large number of short segments is very important. The number of segments can range from hundreds to tens of thousands, and the number of speakers from 2 up to tens of speakers (about 60). Several variants of the mean-shift clustering algorithm will be presented to solve this problem, together with an automatic way to estimate clustering validity. This is important because clustering can be viewed as preprocessing for other tasks, e.g., speaker verification, and bad clustering will lead to poor verification results. As manual assessment of the clustering is not feasible, an automatic tool is almost a must (a toy sketch of mean-shift clustering on speaker embeddings is given after this list).
3. Data-homogeneity measure for voice comparison – given two speech utterances for speaker verification, it is important that the utterances are valid for a reliable comparison. The utterances may be too short, or may not share enough common information for comparison; in such cases a high or low likelihood ratio is meaningless. The test of data quality should be independent of the verification system. An entropy-based measure of this kind will be presented and its relation to verification performance will be shown.
4. Database assessment – when sequential data such as speech is divided into training, development and evaluation sets, it is very difficult to know whether the sets are statistically meaningful for learning (even a fair coin can land on tails 100 times in a row). It is important to verify the statistical validity of the datasets prior to the training, development and evaluation process, and this should be done independently of the verification system/approach. Such a data assessment, based on the entropy of the speech waveform, will be presented.
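Item 2 above relies on mean-shift clustering of speaker representations. As a purely illustrative sketch (not Dr. Lapidot's algorithm), the following snippet clusters fixed-length speaker embeddings with a flat-kernel mean shift; the function name, bandwidth and the random embeddings are assumptions made for the example.

```python
import numpy as np

def mean_shift(embeddings, bandwidth, n_iter=50, merge_tol=0.5):
    """Shift every point towards the mean of its neighbours, then merge
    points that converged to (almost) the same mode into one cluster."""
    modes = embeddings.copy()
    for _ in range(n_iter):
        for i, m in enumerate(modes):
            dist = np.linalg.norm(embeddings - m, axis=1)
            neighbours = embeddings[dist < bandwidth]
            if len(neighbours):
                modes[i] = neighbours.mean(axis=0)
    # group near-identical modes into cluster labels
    labels, centers = np.full(len(modes), -1, dtype=int), []
    for i, m in enumerate(modes):
        for c, center in enumerate(centers):
            if np.linalg.norm(m - center) < merge_tol:
                labels[i] = c
                break
        if labels[i] == -1:
            centers.append(m)
            labels[i] = len(centers) - 1
    return labels

# toy data: 1000 short segments represented by 100-dimensional embeddings
segments = np.random.randn(1000, 100)
print(mean_shift(segments, bandwidth=12.0))
```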

His talk takes place on Tuesday, January 15, 2019 at 13:00 in room A113.

Misha Pavel: Digital Phenotyping Using Computational Models of Neuropsychological Processes Underlying Behavioral States and their Dynamics

Misha Pavel holds a joint faculty appointment in the College of Computer & Information Science and Bouvé College of Health Sciences. His background comprises electrical engineering, computer science and experimental psychology, and his research is focused on multiscale computational modeling of behaviors and their control, with applications ranging from elder care to augmentation of human performance. Professor Pavel is using these model-based approaches to develop algorithms transforming unobtrusive monitoring from smart homes and mobile devices to useful and actionable knowledge for diagnosis and intervention. Under the auspices of the Northeastern-based Consortium on Technology for Proactive Care, Professor Pavel and his colleagues are targeting technological innovations to support the development of economically feasible, proactive, distributed, and individual-centered healthcare. In addition, Professor Pavel is investigating approaches to inferring and augmenting human intelligence using computer games, EEG and transcranial electrical stimulation. Previously, Professor Pavel was the director of the Smart and Connected Health Program at the National Science Foundation, a program co-sponsored by the National Institutes of Health. Earlier, he served as the chair of the Department of Biomedical Engineering at Oregon Health & Science University, a Technology Leader at AT&T Laboratories, a member of the technical staff at Bell Laboratories, and faculty member at Stanford University and New York University. He is a Senior Life Member of IEEE.

Digital Phenotyping Using Computational Models of Neuropsychological Processes Underlying Behavioral States and their Dynamics

Human behaviors are both key determinants of health and effective indicators of individuals’ health and mental states. Recent advances in sensing, communication technology and computational modeling provide an unprecedented opportunity to monitor individuals in the wild, i.e., in their daily lives. Continuous monitoring enables digital phenotyping: characterization of health states and inference of subtle changes in health states, thereby facilitating theoretical insights into human neuropsychology and neurophysiology. Moreover, temporally dense measurements may provide opportunities for optimal just-in-time interventions that help individuals improve their health behaviors. Harvesting the potential benefits of digital phenotyping is, however, limited by the variability of behaviors as well as contextual and environmental effects that may significantly distort measured data. To mitigate these adverse effects, we have been developing computational models of a variety of physiological, neuropsychological and behavioral phenomena. In this talk, I will briefly discuss a continuum of models ranging from completely data-driven to principle-based, causal and mechanistic. I will then describe a few examples of approaches in several domains including cognition, sensory-motor behaviors and affective states. I will also describe a framework that can use such approaches as components of future proactive and distributed care, tailored to individuals.

His talk takes place on Monday, December 3, 2018 at 13:00 in room A113.

Jiří Schimmel: Spatial Audio Coding Using Ambisonic

Jiří Schimmel joined the Department of Telecommunications of FEEC BUT as a doctoral student in 1999. In 2006 he defended his doctoral thesis on the topic “Audio Effect Synthesis Using Non-Linear Signal Processing” and in 2016 his habilitation thesis on “New Methods of Spatial Audio Coding and Rendering”. His professional scientific activity is focused on research in digital audio signal processing and on the research and development of real-time signal processing systems and multi-channel sound systems. He also cooperates with domestic and foreign companies (C-Mexx, DFM, Audified).

Spatial Audio Coding Using Ambisonic

Ambisonic is a mathematically based acoustic signal processing technology that attempts to capture and reproduce information from a complete three-dimensional sound field, including the exact localization of each sound source and the environmental characteristics of the field. Basically, it is a simplified solution of the wave equation for a progressive convergent spherical wave using spherical harmonic decomposition of the wave field. The theory and technologies related to ambisonic were developed already in the 1970s, but real-time use has only been enabled by modern computing technology. The outputs of the encoding process are so-called ambisonic components, whose number determines the order of the ambisonic as well as the accuracy of the encoding and of the subsequent reconstruction of the sound field. There are two ways to obtain the ambisonic components: encoding a sound object, and capturing the sound field with a 3D microphone. The encoding process is based on finding the weighting factors of the ambisonic components according to the position of the audio object. For 3D sound field capture, a set of microphones is used that forms a virtual 3D microphone whose outputs are identical to the ambisonic components. The decoding process is based on reconstruction of the sound field using several sound sources (loudspeakers), which requires further simplifications. Although the sound field is fully described mathematically in ambisonic, there are still many problems that need to be addressed in its practical use.
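To make the encoding step above concrete, here is a minimal sketch of first-order (B-format) encoding: a mono sound object is weighted into the four ambisonic components W, X, Y, Z according to its azimuth and elevation. The weighting factors follow the classic first-order convention; the function name and the toy signal are just for illustration.

```python
import numpy as np

def encode_first_order(signal, azimuth, elevation):
    """Encode a mono sound object into first-order ambisonic components."""
    w = signal / np.sqrt(2.0)                             # omnidirectional component
    x = signal * np.cos(azimuth) * np.cos(elevation)      # front-back
    y = signal * np.sin(azimuth) * np.cos(elevation)      # left-right
    z = signal * np.sin(elevation)                        # up-down
    return np.stack([w, x, y, z])

mono = np.sin(2 * np.pi * 440 * np.arange(48000) / 48000)   # 1 s of a 440 Hz tone
b_format = encode_first_order(mono, azimuth=np.pi / 4, elevation=0.0)
print(b_format.shape)   # (4, 48000): the W, X, Y, Z components
```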

His talk takes place on Tuesday, October 2, 2018 at 13:00 in room A113.

Petr Dokládal: Image processing in Non-Destructive Testing

Petr Dokládal is a senior researcher with the Center for Mathematical Morphology, a joint research lab of Armines and MINES ParisTech, Paris, France. He graduated from the Technical University in Brno, Czech Republic, in 1994, as a telecommunication engineer, received his Ph.D. degree in 2000 from the Marne la Vallée University, France, in general computer sciences, specialized in image processing and received his habilitation from the ParisEst University in 2013. His research interests include mathematical morphology, image segmentation, object tracking and pattern recognition.

Image processing in Non-Destructive Testing

Non-destructive testing is a frequent task in industry for material control and structure inspection. There are many imaging techniques available to make defects visible. Effort is being made to automate the process to make it repeatable, more accurate, cheaper and environmentally friendly. Other techniques (able to work remotely, easier to automate) are being developed. Most of these techniques are still followed by a visual inspection performed by qualified personnel.

At the beginning of this talk we will review various inspection techniques used in industry. In the second part we will focus on the detection of cracks. From the image processing point of view, cracks are thin, curvilinear structures. They are not always easy to detect, especially when surrounded by noise. We show in this talk how cracks can be detected by using path openings, an operator from mathematical morphology. Then, inspired by the a contrario approach, we show how to choose a convenient threshold value to obtain a binary result. The a contrario approach, instead of modeling the structures to detect, models the noise in order to detect structures deviating from the model. In this scope, we assume the noise is composed of pixels that are independent random variables. Hence cracks, which are curvilinear and not necessarily connected sequences of bright pixels, are detected as abnormal sequences of bright pixels. Finally, a fast approximation of the solution based on parsimonious path openings is shown.
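The a contrario threshold selection can be illustrated with a toy calculation (generic background, not the speaker's actual detector): if noise pixels exceed a gray-level threshold independently with probability p and we test n candidate paths, a path of L bright pixels becomes "meaningful" once the expected number of such events in pure noise falls below epsilon.

```python
import math

def min_meaningful_length(p, n_tests, epsilon=1.0):
    """Smallest path length L with NFA = n_tests * p**L < epsilon."""
    return math.ceil(math.log(epsilon / n_tests) / math.log(p))

# e.g. 10% of noise pixels pass the gray-level test, 10^6 candidate paths
print(min_meaningful_length(p=0.1, n_tests=1e6))   # -> 6 pixels
```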

His talk takes place on Tuesday, September 18, 2018 at 13:00 in room A113.

Santosh Mathan: Scaling up Cognitive Efficacy with Neurotechnology

Santosh Mathan is an Engineering Fellow at Honeywell Aerospace. His research lies at the intersection of human-computer interaction, machine learning, and neurophysiological sensing. Santosh is principal investigator and program manager on several efforts to use neurotechnology in practical settings. These efforts, carried out in collaboration with academic and industry researchers around the world, have led to the development of systems that can estimate changes in cognitive function following brain trauma, identify fluctuations in attention, boost the activity of cortical networks underlying fluid intelligence, and serve as the basis for hands-free robotic control. Papers describing these projects have won multiple best paper awards at research conferences, and have been covered by the press in publications including the Wall Street Journal and Wired. He has been awarded over 19 US patents. Santosh has a doctoral degree in Human-Computer Interaction from the School of Computer Science at Carnegie Mellon University, where his research explored the use of computational cognitive models for diagnosing and remedying student difficulties during skill acquisition.

Scaling up Cognitive Efficacy with Neurotechnology

Cognition and behavior arise from the activity of billions of neurons. Ongoing research indicates that non-invasive neural sensing techniques can provide a window into this never-ending storm of electrical activity in our brains and yield rich information of interest to system designers and trainers. Direct measurement of brain activity has the potential to provide objective measures that can help estimate the impact of a system on users during the design process, estimate cognitive proficiency during training, and provide new modalities for humans to interact with computer systems. In this presentation, Santosh Mathan will review research in the Honeywell Advanced Technology organization that offers novel tools and techniques to advance human-computer interaction. While many of these research explorations are at an early stage, they offer a preview of practical tools that lie just around the corner for researchers and practitioners with an interest in boosting human performance in challenging task environments.

His talk takes place on Friday, August 24, 2018 at 13:00 in room A112.

Slides of the talk are publicly available.

Niko Brummer: Tractable priors, likelihoods, posteriors and proper scoring rules for the astronomically complex problem of partitioning a large set of recordings w.r.t. speaker

Niko Brummer received B.Eng. (1986), M.Eng. (1988) and Ph.D. (2010) degrees, all in electronic engineering, from Stellenbosch University. He worked as a researcher at DataFusion (later called Spescom DataVoice) and at AGNITIO, and is currently with Nuance Communications. Most of his research over the last 25 years has been applied to automatic speaker and language recognition, and he has participated in most of the NIST SRE and LRE evaluations of these technologies, from the year 2000 to the present. He has been contributing to the Odyssey Workshop series since 2001 and was the organizer of Odyssey 2008 in Stellenbosch. His FoCal and Bosaris Toolkits are widely used for fusion and calibration in speaker and language recognition research.

His research interests include development of new algorithms for speaker and language recognition, as well as evaluation methodologies for these technologies. In both cases, his emphasis is on probabilistic modelling. He has worked with both generative (eigenchannel, JFA, i-vector PLDA) and discriminative (system fusion, discriminative JFA and PLDA) recognizers. In evaluation, his focus is on judging the goodness of classifiers that produce probabilistic outputs in the form of well calibrated class likelihoods.

Tractable priors, likelihoods, posteriors and proper scoring rules for the astronomically complex problem of partitioning a large set of recordings w.r.t. speaker

Real-world speaker recognition problems are not always arranged into neat, NIST-style challenges with large labelled training databases and binary target/non-target evaluation trials. In the most general case we are given a (sometimes large) collection of recordings and ideally we just want to go and recognize the speakers in there. This problem is usually called speaker clustering, and solutions like AHC (agglomerative hierarchical clustering) exist. The catch is that neither AHC, nor indeed any other yet-to-be-invented algorithm, can find the correct solution with certainty. In the simple case of binary trials, we in the speaker recognition world are already very comfortable with dealing with this uncertainty: the recognizers quantify their uncertainty as likelihood ratios. We know how to calibrate these likelihood ratios, how to use them to make Bayes decisions, and how to judge their goodness with proper scoring rules. At first glance all of these things seem hopelessly intractable for the clustering problem because of the astronomically large size of the solution space. In this talk I show otherwise and propose a suite of tractable tools for probabilistic clustering.
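For the binary-trial case mentioned above, the standard tools are easy to state. A hedged sketch (general speaker-recognition background, not the clustering generalization of the talk) of Cllr, the proper scoring rule commonly used to judge log-likelihood-ratios, and of the Bayes decision threshold; the function names and toy scores are illustrative.

```python
import numpy as np

def cllr(target_llrs, nontarget_llrs):
    """Log-likelihood-ratio cost (llrs are natural logarithms of the LR)."""
    c_tar = np.mean(np.log2(1 + np.exp(-np.asarray(target_llrs))))
    c_non = np.mean(np.log2(1 + np.exp(np.asarray(nontarget_llrs))))
    return 0.5 * (c_tar + c_non)

def bayes_decision(llr, prior_target, c_miss=1.0, c_fa=1.0):
    """Accept the target hypothesis when the LLR exceeds the Bayes threshold."""
    threshold = np.log(c_fa * (1 - prior_target)) - np.log(c_miss * prior_target)
    return llr > threshold

print(cllr(target_llrs=[3.0, 2.5, 4.0], nontarget_llrs=[-2.0, -3.5, -1.0]))
print(bayes_decision(llr=1.2, prior_target=0.01))   # False: evidence too weak for this prior
```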

His talk takes place on Monday, April 16, 2018 at 13:00 in room G202.

Video recording of the talk is publicly available.

Slides of the talk are publicly available.

David Filip: Standardization and Research

David Filip is Chair (Convener) of the OASIS XLIFF OMOS TC; Secretary, Lead Editor and Liaison Officer of the OASIS XLIFF TC; a former Co-Chair and Editor for the W3C ITS 2.0 Recommendation; Steering Committee member of GALA TAPICC; Advisory Editorial Board member for the Multilingual magazine; and co-moderator of the Standards IG at JIAMCATT. David has also been appointed as an NSAI expert to ISO TC 37/SC 3 and /SC 5, and to ISO/IEC JTC 1/WG 9, /SC 38, and /SC 42. His specialties include open standards and process metadata, workflow and meta-workflow automation. David works as a Research Fellow at the ADAPT Research Centre, Trinity College Dublin, Ireland. Before 2011, he oversaw key research and change projects for Moravia’s worldwide operations. David held research scholarships at universities in Vienna, Hamburg and Geneva, and graduated in 2004 from Brno University with a PhD in Analytic Philosophy. David also holds master’s degrees in Philosophy, Art History, Theory of Art and German Philology.

Standardization and Research

David will talk about the multilingual content standardization ecosystem, starting with foundational standards such as XML and Unicode, moving through XML vocabularies for payload and metadata exchange, up to API and reference architecture specifications. He will explain basic standardization principles with special regard for internet-based technologies, touching on different standardization cultures ranging from industry associations and ad hoc consortia, through IETF, OASIS, W3C and Unicode, to traditional SDOs such as ISO, ISO/IEC, ASTM etc. David will also touch on the relationship between standardization, research, and innovation, and on whether and how it is important for research groups and institutes to participate in standardization. The difference between anticipatory and post hoc standardization will be explained, as well as how royalty-free standards create and grow markets for technology and innovation.

His talk takes place on Thursday, March 22, 2018 at 13:00 in room E104.

Jan Kybic: Accelerating image registration

Jan Kybic was born in Prague, Czech Republic, in 1974. He received Mgr. (BSc.) and Ing. (MSc.) degrees with honors from the Czech Technical University, Prague, in 1996 and 1998, respectively. In 2001, he obtained his Ph.D. in biomedical image processing from Ecole Polytechnique Federale de Lausanne (EPFL), Switzerland, for his thesis on elastic image registration using parametric deformation models. Between October 2002 and February 2003, he held a post-doc research position at INRIA, Sophia-Antipolis, France. Since 2003 he has been a Senior Research Fellow with the Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague; he passed his habilitation (Associate Professor) in 2010 and became a full professor in 2015. He was a Vice-Dean in 2011-2013 and a Department Head in 2013-2017. Jan Kybic has authored or co-authored 31 articles in peer-reviewed international scientific journals, one book, two book chapters, and over 80 conference publications. He has supervised nine PhD students, six of whom have already successfully graduated. He has also supervised over twenty master's, bachelor's and short-term student projects.

He is a member of IEEE and served as an Associate Editor for IEEE Transactions on Medical Imaging and as a reviewer for numerous international journals and conferences. He was a general chair of the ISBI 2016 conference.

His research interests include signal and image processing, medical imaging, image registration, splines and wavelets, inverse problems, elastography, computer vision, numerical methods, algorithm theory and control theory.

He teaches Digital Image Processing and Medical Imaging courses.

Accelerating image registration

Image registration is one of the key image analysis tasks, especially in biomedical imaging. However, accurate image registration methods are often slow, and this problem is exacerbated by the steadily increasing resolution of today’s acquisition methods. In my talk, I will present two of my relatively recent ideas on how image registration can be accelerated.

First, we take advantage of the fact that image registration is mostly driven by image edges. We take this idea to the extreme: we approximate the similarity criterion by sampling only a small number of sparse keypoints and consider only normal displacements. Furthermore, we simplify the images by segmenting them first. The segmentation can be performed jointly and alternated with the registration steps. Compared to classical image registration methods, our approach is at least one order of magnitude faster.

The second approach is based on matching generalized geometric graphs and is suitable for images containing linear structures with branches, such as road networks, rivers, blood vessels or neural fibers. Previously used methods could only match relatively small graphs, required a good initial guess of the transformation, or could not be used for non-linear deformations. We present two methods which do not have such limitations: one is based on active testing and the second on Monte Carlo tree search, formulating the problem as a single-player game. Our method can handle thousands of nodes and thus match very large images quickly. Besides several medical applications, we show, for example, how to solve the localization problem by matching a small aerial photo with a large map.

His talk takes place on Thursday, March 1, 2018 at 13:00 in room E104.

 

Ondřej Bojar: Neural Machine Translation: From Basics to Semiotics

Ondřej Bojar is an Assistant Professor at Charles University, Institute of Formal and Applied Linguistics (UFAL). Since his participation in the JHU summer engineering workshop in 2006, where the MT system Moses was released, Ondřej Bojar has been primarily active in the field of machine translation (MT), regularly taking part in and later also co-organizing the WMT evaluation campaigns and contributing to the best practices of MT evaluation. Ondřej Bojar is the main author of the hybrid system Chimera, which outperformed all competing systems (including Google Translate) in English-to-Czech translation from 2013 through 2015. A variant of that system has been used in several commercial contracts of the department. Ondřej Bojar is now catching up with neural MT (NMT) and his main interest (aside from again reaching the best translation performance) lies in the study of the representations learned by deep learning models. Is NMT learning any representations of sentence *meaning*, or is it merely a much more advanced and softer variant of the copy-paste translation performed by previous approaches? His talk takes place on Tuesday, January 16, 2018 at 13:00 in room E105.

Neural Machine Translation: From Basics to Semiotics

In my talk, I will highlight the benefit that neural machine translation (NMT) has over previous statistical approaches to MT. I will then present the current state of the art in neural machine translation, briefly describing the current best architectures and their performance and limitations. In the second part of the talk, I will outline my planned search for correspondence between sentence meaning as traditionally studied by linguistics (or even semantics and semiotics) and the continuous representations learned by neural networks.

Vlastimil Havran: Surface reflectance in rendering algorithms

Vlastimil Havran is an Associate Professor at the Czech Technical University in Prague. His research interests include data structures and algorithms for rendering images and videos, visibility calculations, geometric range searching for global illumination, software architectures for rendering, applied Monte Carlo methods, data compression, etc. His talk takes place on Monday, December 4, 2017 at 12:00 in room E105.

Surface reflectance in rendering algorithms

The rendering of images by computers, i.e., computationally solving the rendering equation, consists of three components: computing visibility (for example by ray tracing), the interaction of light with surfaces, and efficient Monte Carlo sampling algorithms. In this talk, we focus on various aspects of surface reflectance, a key issue for achieving high fidelity of objects’ visual appearance in rendered images, not only in the movie industry but also in real-time applications of virtual and augmented reality. First, we recall the basic concepts of surface reflectance and its use in the rendering equation. Then we present our results on surface reflectance characterization and its possible use in rendering algorithms. Further, we show why the standard surface reflectance model, usually represented as a bidirectional reflectance distribution function (BRDF), needs to be extended spatially to achieve high fidelity of visual appearance. As this spatial extension leads to big data problems, we describe our algorithm for compression of spatially varying surface reflectance data. We also describe an effective, perceptually motivated method to compare two similar surface reflectance datasets, where one can be the reference data and the other the result of its compression. As the last topic, we describe the concepts and problems involved when measuring such surface reflectance datasets for real-world applications.
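For reference, the rendering equation mentioned above is commonly written as follows (a standard formulation, not specific to the speaker's results); f_r is the BRDF discussed in the talk, and n is the surface normal.

```latex
L_o(\mathbf{x}, \omega_o) = L_e(\mathbf{x}, \omega_o)
  + \int_{\Omega} f_r(\mathbf{x}, \omega_i, \omega_o)\,
    L_i(\mathbf{x}, \omega_i)\,(\omega_i \cdot \mathbf{n})\, \mathrm{d}\omega_i
```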

 

Themos Stafylakis: Deep Word Embeddings for Audiovisual Speech Recognition

Themos Stafylakis is a Marie Curie Research Fellow working on audiovisual automatic speech recognition at the Computer Vision Laboratory of the University of Nottingham (UK). He holds a PhD from the Technical University of Athens (Greece) on speaker diarization for broadcast news. He has a strong publication record in speaker recognition and diarization, as a result of his 5-year post-doc at CRIM (Montreal, Canada) under the supervision of Patrick Kenny. He is currently working on lip-reading and audiovisual speech recognition using deep learning methods. His talk takes place on November 22, 2017 at 13:00 in room A112.

Deep Word Embeddings for Audiovisual Speech Recognition

During the last few years, visual and audiovisual automatic speech recognition (ASR) have been witnessing a renaissance, which can largely be attributed to the advent of deep learning methods. Deep architectures and learning algorithms initially proposed for audio-based ASR are combined with powerful computer vision models and are finding their way to lipreading and audiovisual ASR. In my talk, I will go through some of the most recent advances in audiovisual ASR, with emphasis on those based on deep learning. I will then present a deep architecture for visual and audiovisual ASR which attains state-of-the-art results on the challenging Lipreading-in-the-Wild database. Finally, I will focus on how this architecture can generalize to words unseen during training and discuss its applicability to continuous-speech audiovisual ASR.

Tunç Aydın: Extracting transparent image layers for high-quality compositing

Tunç Aydın is a Research Scientist at Disney Research, located at the Zürich lab. His current research primarily focuses on image and video processing problems that address various movie production challenges, such as natural matting, green-screen keying, color grading, edge-aware filtering, and temporal coherence, among others. He has also been interested in analyzing visual content in terms of visual quality and aesthetic plausibility by utilizing knowledge of the human visual system. In his work he tends to utilize high dynamic range, stereoscopic 3D, and high frame-rate content, in addition to standard 8-bit images and videos.

Prior to joining Disney Research, he worked as a Research Associate at the Max-Planck-Institut für Informatik from 2006 to 2011, where he obtained his PhD degree under the supervision of Karol Myszkowski and Hans-Peter Seidel. He received the Eurographics PhD award in 2012 for his dissertation. He holds a Master’s degree in Computer Science from the College of Computing at Georgia Institute of Technology, and a Bachelor’s degree in Civil Engineering from Istanbul Teknik Universitesi. His talk takes place on Wednesday, November 1, 2017 at 13:00 in room A112.

Extracting transparent image layers for high-quality compositing

Compositing is an essential task in visual content production. For instance, a contemporary feature film production that doesn’t involve any compositing work is a rare occasion. However, achieving production-level quality often requires a significant amount of manual labor by digital compositing artists, mainly due to the limits of existing tools available for various compositing tasks. In this presentation I will talk about our recent work that aims at improving existing compositing technologies, focusing on natural matting, green-screen keying, and color editing. We tackle natural matting using a novel affinity-based approach, whereas for green-screen keying and color editing we introduce a “color unmixing” framework, which we specialize individually for the two problem domains. Using these new techniques we achieve state-of-the-art results while also significantly reducing the manual interaction time.

 

Jakub Mareček: Urban Traffic Management – Traffic State Estimation, Signalling Games, and Traffic Control

Jakub Mareček is a research staff member at IBM Research. Together with some fabulous colleagues, Jakub develops solvers for optimisation and control problems at IBM’s Smarter Cities Technology Centre. Jakub joined IBM Research from the School of Mathematics at the University of Edinburgh in August 2012. Prior to his brief post-doc in Edinburgh, Jakub had presented an approach to general-purpose integer programming in his dissertation at the University of Nottingham and worked in two start-up companies in Brno, the Czech Republic. His talk takes place on Monday, October 16, 2017 at 13:30 in room D0207.

Urban Traffic Management: Traffic State Estimation, Signalling Games, and Traffic Control

In many engineering applications, one needs to identify a model of a non-linear system, increasingly using large volumes of heterogeneous, streamed data, and to apply some form of (optimal) control. First, we illustrate why much of classical identification and control is not applicable to problems involving time-varying populations of agents, such as in smart grids and intelligent transportation systems. Second, we use tools from robust statistics and convex optimisation to present alternative approaches to closed-loop system identification, and tools from iterated function systems to identify controllers for such systems with certain probabilistic guarantees on the performance for the individual interacting with the controller.

Marc Delcroix and Keisuke Kinoshita: NTT far-field speech processing research

Marc Delcroix is a senior research scientist at NTT Communication Science Laboratories, Kyoto, Japan. He received the M.Eng. degree from the Free University of Brussels, Brussels, Belgium, and the Ecole Centrale Paris, Paris, France, in 2003, and the Ph.D. degree from the Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Japan, in 2007. His research interests include robust multi-microphone speech recognition, acoustic model adaptation, integration of speech enhancement front-ends and recognition back-ends, speech enhancement and speech dereverberation. He took an active part in the development of NTT robust speech recognition systems for the REVERB and the CHiME 1 and 3 challenges, which all achieved the best performance on their tasks. He was one of the organizers of the REVERB challenge 2014 and of ASRU 2017. He is a visiting lecturer at the Faculty of Science and Engineering of Waseda University, Tokyo, Japan.

Keisuke Kinoshita is a senior research scientist at NTT Communication Science Laboratories, Kyoto, Japan. He received the M.Eng. degree and the Ph.D degree from Sophia University in Tokyo, Japan in 2003 and 2010 respectively. He joined NTT in 2003 and since then has been working on speech and audio signal processing. His research interests include single- and multichannel speech enhancement and robust speech recognition. He was the Chief Coordinator of REVERB challenge 2014, an organizing committee member of ASRU-2017. He was honored to receive IEICE Paper Awards (2006), ASJ Technical Development Awards (2009), ASJ Awaya Young Researcher Award (2009), Japan Audio Society Award (2010), and Maeshima Hisoka Award (2017). He is a visiting lecturer at the Faculty of Science and Engineering of Doshisha University, Tokyo, Japan.

Their talk takes place on Monday, August 28, 2017 at 13:00 in room A112.

NTT far-field speech processing research

The success of voice search applications and voice-controlled devices such as the Amazon Echo confirms that speech is becoming a common modality for accessing information. Despite great recent progress in the field, it is still challenging to achieve high automatic speech recognition (ASR) performance when using microphones distant from the speakers (far-field), because of noise, reverberation and potentially interfering speakers. It is even more challenging when the target speech consists of spontaneous conversations.

At NTT, we are pursuing research on far-field speech recognition, focusing on speech enhancement front-ends and robust ASR back-ends, towards building next-generation ASR systems able to understand natural conversations. Our research achievements have been combined into the ASR systems we developed for the REVERB and CHiME 3 challenges, and for meeting recognition.

In this talk, after giving a brief overview of the research activity of our group, we will introduce two of our recent research achievements in more detail. First, we will present our work on speech dereverberation using the weighted prediction error (WPE) algorithm. We have recently proposed an extension of WPE that integrates deep neural network based speech modeling into the WPE framework, and we demonstrate further potential performance gains for reverberant speech recognition.
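For readers unfamiliar with WPE, the core idea is to predict the late reverberation of each STFT frame from delayed past frames and subtract it. The snippet below is a heavily simplified single-channel, single-frequency-bin sketch of that iteration (real implementations, e.g. the open-source nara_wpe package, are multi-channel and considerably more involved); the tap count, delay and random test signal are assumptions for the example.

```python
import numpy as np

def wpe_single_bin(y, taps=10, delay=3, iterations=3, eps=1e-8):
    """y: complex STFT frames of one frequency bin, shape (T,)."""
    T = len(y)
    # Y_tilde[t] holds the delayed past frames [y[t-delay], ..., y[t-delay-taps+1]]
    Y_tilde = np.zeros((T, taps), dtype=complex)
    for k in range(taps):
        shift = delay + k
        Y_tilde[shift:, k] = y[:T - shift]
    d = y.copy()                                   # current dereverberated estimate
    for _ in range(iterations):
        lam = np.maximum(np.abs(d) ** 2, eps)      # time-varying source power estimate
        R = (Y_tilde.conj().T / lam) @ Y_tilde     # weighted correlation matrix
        r = (Y_tilde.conj().T / lam) @ y           # weighted cross-correlation vector
        g = np.linalg.solve(R + eps * np.eye(taps), r)
        d = y - Y_tilde @ g                        # subtract predicted late reverberation
    return d

frames = np.random.randn(200) + 1j * np.random.randn(200)   # stand-in for one STFT bin
print(wpe_single_bin(frames).shape)   # (200,)
```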

Next, we will discuss our recent work on acoustic model adaptation to create ASR back-ends robust to speaker and environment variations. We have recently proposed a context adaptive neural network architecture, which is a powerful way to exploit speaker or environment information to perform rapid acoustic model adaptation.

S. Umesh: Acoustic Modelling of low-resource Indian languages

S. Umesh is a professor in the Department of Electrical Engineering at the Indian Institute of Technology Madras. His research interests are mainly in automatic speech recognition, particularly in low-resource modelling and speaker normalization & adaptation. He has also been a visiting researcher at AT&T Laboratories, Cambridge University and RWTH Aachen under a Humboldt Fellowship. He is currently leading a consortium of 12 Indian institutions developing speech-based systems in the agricultural domain. His talk takes place on Tuesday, June 27, 2017 at 13:00 in room A112.

Acoustic Modelling of low-resource Indian languages

In this talk, I will present recent efforts in India to build speech-based systems in the agriculture domain to provide easy access to information for about 600 million farmers. These systems are being developed by a consortium of 12 Indian institutions, initially in 12 languages, which will then be expanded to another 12 languages. Since the usage is in extremely noisy environments such as fields, the emphasis is on high accuracy by using directed queries which elicit short, phrase-like responses. Within this framework, we explored cross-lingual and multilingual acoustic modelling techniques using subspace GMMs and phone-CAT approaches. We also extended the use of phone-CAT to phone mapping and articulatory feature extraction, which were then fed to a DNN-based acoustic model. Further, we explored the joint estimation of the acoustic model (DNN) and the articulatory feature extractors. These approaches gave significant improvements in recognition performance when compared to building systems using data from only one language. Finally, since the speech consisted of mostly short and noisy utterances, conventional adaptation and speaker-normalization approaches could not be easily used. We investigated the use of a neural network to map filter-bank features to fMLLR/VTLN features, so that normalization can be done at the frame level without a first-pass decode or the need for long utterances to estimate the transforms. Alternatively, we used a teacher-student framework where a teacher trained on normalized features provides “soft targets” to a student network trained on un-normalized features. In both approaches, we obtained recognition performance better than i-vector-based normalization schemes.
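The teacher-student idea mentioned at the end can be sketched generically (this is the common soft-target recipe, not necessarily the authors' exact setup): the student, fed un-normalized features, is trained to match the senone posteriors of a teacher that sees fMLLR/VTLN-normalized features. The class count, temperature and random tensors below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def student_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between teacher and student posterior distributions."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_student, soft_targets, reduction="batchmean")

# toy example: a batch of 32 frames, 3000 senone classes
teacher_logits = torch.randn(32, 3000)                        # teacher sees normalized features
student_logits = torch.randn(32, 3000, requires_grad=True)    # student sees un-normalized features
loss = student_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```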

Kwang In Kim: Toward Intuitive Imagery: User Friendly Manipulation and Exploration of Images and Videos

Kwang In Kim is a senior lecturer of computer science at the University of Bath. He received a BSc in computer engineering from Dongseo University in 1996, and MSc and PhD degrees in computer engineering from Kyungpook National University in 1998 and 2000, respectively. He was a post-doctoral researcher at KAIST, at the Max-Planck-Institute for Biological Cybernetics, at Saarland University, and at the Max-Planck-Institute for Informatics from 2000 to 2013. Before joining Bath, he was a lecturer at the School of Computing and Communications, Lancaster University. His research interests include machine learning, vision, graphics, and human-computer interaction. His talk takes place on Wednesday, May 10th, 2017, at 3:30pm in room E105.

Toward Intuitive Imagery: User Friendly Manipulation and Exploration of Images and Videos

With the ubiquity of image and video capture devices, it is easy to form collections of images and videos. Two important questions in this context are 1) how to retain the quality of individual images and videos and 2) how to explore the resulting large collections. Unlike professionally captured photographs and videos, the quality of imagery casually captured by regular users is usually low. In this talk, we will discuss manipulating and improving such images and videos in several respects. The central theme of the talk is user-friendliness. Unlike existing sophisticated algorithms, our approaches focus on enabling non-expert users to freely manipulate and improve personal imagery collections. We present two specific examples in this context: image enhancement and video object removal. Existing interfaces to video collections are often simply lists of text-ranked videos, which do not exploit the visual content relationships between videos or other implicit relationships such as spatial or geographical relationships. In the second part of the talk, we discuss data structures and interfaces that exploit content relationships present in images and videos.

Reinhold Häb-Umbach: Neural Network Supported Acoustic Beamforming

Reinhold Häb-Umbach is a professor of Communications Engineering at the University of Paderborn, Germany. His main research interests are in the fields of statistical signal processing and pattern recognition, with applications to speech enhancement, acoustic beamforming and source separation, as well as automatic speech recognition and unsupervised learning from speech and audio. He has more than 200 scientific publications, and recently co-authored the book Robust Automatic Speech Recognition – a Bridge to Practical Applications (Academic Press, 2015). He is a fellow of the International Speech Communication Association (ISCA). His talk takes place on Monday, April 24th, at 1pm in room D0207.

Neural Network Supported Acoustic Beamforming for Speech Enhancement and Recognition

With multiple microphones, spatial information can be exploited to extract a target signal from a noisy environment. While the theory of statistically optimum beamforming is well established, the challenge lies in the estimation of the beamforming coefficients from the noisy input signal. Traditionally these coefficients are derived from an estimate of the direction of arrival of the target signal, while more elaborate methods estimate the power spectral density (PSD) matrices of the desired and the interfering signals, thus avoiding the assumption of anechoic signal propagation. We have proposed to estimate these PSD matrices using spectral masks determined by a neural network. This combination of data-driven approaches with statistically optimum multi-channel filtering has delivered competitive results on the recent CHiME challenge. In this talk, we detail this approach and show that the concept is more general and can, for example, also be used for dereverberation. When used as a front-end for a speech recognition system, we further show how the neural network for spectral mask estimation can be optimized w.r.t. a word-error-rate-related criterion in an end-to-end setup.
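A condensed numpy sketch of the pipeline described above may help: a speech mask (random here, network-predicted in practice) is used to estimate speech and noise PSD matrices, from which an MVDR beamformer is built per frequency bin. This is illustrative only; the actual CHiME systems differ in many details, and all shapes and names are assumptions.

```python
import numpy as np

def estimate_psd(stft, mask):
    """stft: (channels, time, freq) complex; mask: (time, freq) in [0, 1]."""
    weighted = stft * mask[None, ...]
    psd = np.einsum('ctf,dtf->fcd', weighted, stft.conj())    # (freq, ch, ch)
    return psd / np.maximum(mask.sum(axis=0), 1e-8)[:, None, None]

def mvdr_weights(psd_speech, psd_noise):
    """Steering vector = principal eigenvector of the speech PSD per frequency."""
    n_freq, n_ch, _ = psd_speech.shape
    w = np.zeros((n_freq, n_ch), dtype=complex)
    for f in range(n_freq):
        _, vecs = np.linalg.eigh(psd_speech[f])
        d = vecs[:, -1]                                        # dominant speech direction
        num = np.linalg.solve(psd_noise[f] + 1e-8 * np.eye(n_ch), d)
        w[f] = num / (d.conj() @ num)
    return w

# toy data: 6 channels, 100 frames, 257 frequency bins
stft = np.random.randn(6, 100, 257) + 1j * np.random.randn(6, 100, 257)
speech_mask = np.random.rand(100, 257)          # in practice: neural network output
psd_s = estimate_psd(stft, speech_mask)
psd_n = estimate_psd(stft, 1.0 - speech_mask)
w = mvdr_weights(psd_s, psd_n)
enhanced = np.einsum('fc,ctf->tf', w.conj(), stft)             # beamformed STFT
print(enhanced.shape)   # (100, 257)
```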

Jiří Matas: Tracking with Discriminative Correlation Filters

Jiří Matas is a full professor at the Center for Machine Perception, Czech Technical University in Prague. He holds a PhD degree from the University of Surrey, UK (1995). He has published more than 200 papers in refereed journals and conferences. Google Scholar reports about 22 000 citations to his work and an h-index of 53.
He received the best paper prize at the International Conference on Document Analysis and Recognition in 2015, the Scandinavian Conference on Image Analysis 2013, Image and Vision Computing New Zealand Conference 2013, the Asian Conference on Computer Vision 2007, and at British Machine Vision Conferences in 2002 and 2005. His students received a number of awards, e.g. Best Student paper at ICDAR 2013, Google Fellowship 2013, and various “Best Thesis” prizes.
J. Matas is on the editorial board of IJCV and was the Associate Editor-in-Chief of IEEE T. PAMI. He is a member of the ERC Computer Science and Informatics panel. He has served in various roles at major international conferences, e.g. ICCV, CVPR, ICPR, NIPS and ECCV, co-chairing ECCV 2004 and CVPR 2007. He was a program co-chair for ECCV 2016.
His research interests include object recognition, text localization and recognition, image retrieval, tracking, sequential pattern recognition, invariant feature detection, and Hough Transform and RANSAC-type optimization. His talk takes place on Thursday, March 2nd, at 1pm in room E105.

Tracking with Discriminative Correlation Filters

Visual tracking is a core video processing problem with many applications, e.g. in surveillance, autonomous driving, sport analysis, augmented reality, film post-production and medical imaging.

In the talk, tracking methods based on Discriminative Correlation Filters (DCF) will be presented. DCF-based trackers are currently the top performers on most commonly used tracking benchmarks. Starting from the oldest and simplest versions of DCF trackers like MOSSE, we will progress to kernel-based and multi-channel variants including those exploiting CNN features. Finally, the Discriminative Correlation Filter with Channel and Spatial Reliability will be introduced.
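As background for the talk, the oldest DCF tracker mentioned, MOSSE, fits in a few lines: the filter is learned in the Fourier domain from training patches and desired Gaussian-shaped responses, and applied to a new patch by an element-wise multiplication. The sketch below is illustrative only (no cosine windowing, online updates or multi-channel features), and all names and sizes are assumptions.

```python
import numpy as np

def train_mosse(patches, desired_responses, lam=1e-2):
    """patches, desired_responses: arrays of shape (n, H, W)."""
    F = np.fft.fft2(patches)
    G = np.fft.fft2(desired_responses)
    numerator = np.sum(G * np.conj(F), axis=0)
    denominator = np.sum(F * np.conj(F), axis=0) + lam   # regularized
    return numerator / denominator                        # conjugate filter H*

def track(filter_conj, patch):
    """Correlation response; the target sits at the location of its maximum."""
    response = np.real(np.fft.ifft2(filter_conj * np.fft.fft2(patch)))
    return np.unravel_index(np.argmax(response), response.shape)

# toy example: 8 training patches of size 64x64 with a Gaussian peak at the center
patches = np.random.rand(8, 64, 64)
yy, xx = np.mgrid[0:64, 0:64]
gauss = np.exp(-((yy - 32) ** 2 + (xx - 32) ** 2) / (2 * 2.0 ** 2))
H_conj = train_mosse(patches, np.repeat(gauss[None], 8, axis=0))
print(track(H_conj, np.random.rand(64, 64)))   # estimated target location (row, col)
```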

Time permitting, I will briefly introduce a problem that has been so far largely ignored by the computer vision community – tracking of blurred, fast moving objects.

Video recording of the talk is publicly available.