András Lőrincz: Towards human-machine and human-robot interactions “with a little help from my friends”

András Lőrincz, a professor and senior researcher, has been teaching at the Faculty of Informatics at Eötvös University, Budapest since 1998. His research focuses on human-machine interaction and its applications in neurobiological and cognitive modeling, as well as medicine. He founded the Neural Information Processing Group of Eötvös University and directs a multidisciplinary team of mathematicians, programmers, computer scientists and physicists. He has acted as the PI of several successful international projects in collaboration with Panasonic, Honda Future Technology Research, the Information Directorate of the US Air Force, and Robert Bosch, Ltd. Hungary, among others. He took part in several EU Framework Program projects.

He habilitated in laser physics at the University of Szeged in 1998 and in informatics at Eötvös Loránd University in 2008. He has conducted research on and taught quantum control, photoacoustics and artificial intelligence at the Hungarian Academy of Sciences, University of Chicago, Brown University, Princeton University, the Illinois Institute of Technology, University of Szeged, and Eötvös Loránd University. He has authored about 300 peer-reviewed scientific publications.

He was elected a Fellow of the European Coordinating Committee for Artificial Intelligence (EurAI) in 2006 for his pioneering work in the field of artificial intelligence. He received the Innovative Researcher Prize of the University in 2009 and in 2019.

Partners: Barcelona University (on personality estimation and human-human interaction), Technical University of Delft (on human-human interaction), Rush Medical School, Chicago (on autism diagnosis and PTSD therapy).

Towards human-machine and human-robot interactions “with a little help from my friends”

Our work in the Neural Information Processing Group focuses on human-machine interactions. The first part of the talk will be an introduction to the technologies that we can or should use for effective interactions, such as the detection of environmental context; ongoing activity, including body movement and manipulation; hidden parameters, i.e. intention, mood and personality state; communication signals, i.e. body, head, hand, face and gaze gestures; and the body parameters that can be measured optically or by intelligent means, e.g. temperature, blood pressure and stress levels.

In the second part of the talk, I will review (a) what body and environment estimation methods we have, (b) what we can say about human-human interactions, which will also give insights into the requirements of human-machine and human-robot interactions, (c) what applications we have or can target in the areas of autism, “continuous healthcare” and “home and public safety”, and (d) what technologies are missing and where we are looking for partners.

His talk was scheduled for Thursday, June 2, 2022 at 14:00 in room A112. The talk is postponed; a new date and time will be announced.

Heikki Kälviäinen: Computer Vision Applications

Heikki Kälviäinen has been a Professor of Computer Science and Engineering since 1999. He is the head of the Computer Vision and Pattern Recognition Laboratory (CVPRL) at the Department of Computational Engineering of Lappeenranta-Lahti University of Technology LUT, Finland. Prof. Kälviäinen’s research interests include computer vision, machine vision, pattern recognition, machine learning, and digital image processing and analysis. Besides LUT, Prof. Kälviäinen has worked as a Visiting Professor at the Faculty of Information Technology of Brno University of Technology, Czech Republic, the Center for Machine Perception (CMP) of Czech Technical University, and the Centre for Vision, Speech, and Signal Processing (CVSSP) of University of Surrey, UK, and as a Professor of Computing at Monash University Malaysia.

Computer Vision Applications

The presentation considers computer vision, especially from the point of view of applications. Digital image processing and analysis with machine learning methods enable efficient solutions for various areas of useful data-centric engineering applications. Challenges with image acquisition, data annotation with expert knowledge, and clustering and classification, including the training of deep learning methods, are discussed. Different applications are given as examples based on the novel data available: plankton in the Baltic Sea, Saimaa ringed seals in Lake Saimaa, and logs in the sawmill industry. In the first application, the motivation is that distributions of plankton types give much information about the condition of the sea water system, e.g., about climate change. An imaging flow cytometer can produce a lot of plankton images, which should be classified into different plankton types. Manual classification of these images is very laborious, and thus a CNN-based method has been developed to automatically recognize the plankton types in the Baltic Sea. In the second application, the Saimaa ringed seals are automatically identified individually using camera trap images to help this very small population survive in nature. The CNN-based re-identification methods rely on the pelage patterns of the seals. The third application is related to the sawmill industry. The digitalization of the sawmill industry is important for optimizing material flows and quality. The research is focused on seeing inside the log to be able to predict which kinds of sawn boards are produced after cutting the log.

His talk takes place on Wednesday, May 11, 2022 at 13:00 in room A112.

Augustin Žídek: Protein Structure Prediction with AlphaFold

Augustin Žídek works as a Research Engineer at DeepMind and has been a member of the protein folding team since 2017. He studied Computer Science at the University of Cambridge. He enjoys working at the boundary of research and engineering, hiking, playing musical instruments and fixing things.

Protein Structure Prediction with AlphaFold

In this talk, we will discuss what proteins are, what the protein folding problem is and why it is an important scientific challenge. We will then talk about AlphaFold, a machine learning model developed by DeepMind that is able to predict protein 3D structure with high accuracy, as well as its architecture and applications.

His talk takes place on Tuesday, March 1, 2022 at 13:00 in room A112. The talk will be streamed live at https://youtu.be/udyjZXtUuDw.

Hema A. Murthy: Signal Processing Guided Machine Learning

Hema A. Murthy is currently a Professor at the Department of Computer Science and Engineering. She has been with the department for the last 35 years. She currently leads an 18-institute consortium that focuses on speech as part of the national language translation mission, an ambitious project whose objective is to produce speech-to-speech translation in Indian languages and Indian English.

Signal Processing Guided Machine Learning

In this talk we will focus on using signal processing algorithms in tandem with machine learning algorithms for various tasks in speech, music and brain signals. The primary objective is to understand events of interest from the perspective of the chosen domain. Appropriate signal processing is employed to detect events. Machine learning algorithms are then made to focus on learning the statistical characteristics of these events. The primary advantage of this approach is that it significantly reduces both computation and data costs. Examples from speech synthesis, Indian art music, and neuronal signals and EEG signals will be considered.

Her talk takes place on Tuesday, February 8, 2022 at 13:00 CET, virtually on Zoom at https://cesnet.zoom.us/j/91741432360.

Slides of the talk are publicly available.

Bernhard Egger: Inverse Graphics and Perception with Generative Face Models

Prof. Dr. Bernhard Egger studies how humans and machines can perceive faces and shapes in general. In particular, he focuses on statistical shape models and 3D Morphable Models. He is a junior professor at the chair of visual computing at Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). Before joining FAU he was a postdoc in Josh Tenenbaum’s Computational Cognitive Science Lab at the Department of Brain and Cognitive Sciences at MIT and the Center for Brains, Minds and Machines (CBMM), and in Polina Golland’s group at the MIT Computer Science & Artificial Intelligence Lab. He did his PhD on facial image annotation and interpretation in unconstrained images in the Graphics and Vision Research Group at the University of Basel. Before his doctorate he obtained his M.Sc. and B.Sc. in Computer Science at the University of Basel and an upper secondary school teaching diploma at the University of Applied Sciences Northwestern Switzerland.

Inverse Graphics and Perception with Generative Face Models

Human object perception is remarkably robust: even when confronted with blurred or sheared photographs, or pictures taken under extreme illumination conditions, we can often recognize what we’re seeing and even recover rich three-dimensional structure. This robustness is especially notable when perceiving human faces. How can humans generalize so well to highly distorted images, transformed far beyond the range of natural face images we are normally exposed to? In this talk I will present an Analysis-by-Synthesis approach based on 3D Morphable Models that can generalize well across various distortions. We find that our top-down inverse rendering model better matches human percepts than either an invariance-based account implemented in a deep neural network, or a neural network trained to perform approximate inverse rendering in a feedforward circuit.

His talk takes place on Wednesday, January 19, 2022 at 15:00 in room A112. The talk will be streamed live at https://www.youtube.com/watch?v=l9Aqz-86pUg.

Ondřej Dušek: Better Supervision for End-to-end Neural Dialogue Systems

Ondřej Dušek is an assistant professor at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University. His research is in the areas of dialogue systems and natural language generation; he specifically focuses on neural-network-based approaches to these problems and their evaluation. He is also involved in the THEaiTRE project on automatic theatre play generation. Ondřej got his PhD in 2017 at Charles University. Between 2016 and 2018, he worked at the Interaction Lab at Heriot-Watt University in Edinburgh, one of the leading groups in dialogue systems and natural-language interaction with computers and robots. There he co-organized the E2E NLG text generation challenge and co-led a team of PhD students in the Amazon Alexa Prize dialogue system competition, which came third in two consecutive years.

Better Supervision for End-to-end Neural Dialogue Systems

While end-to-end neural models have been the research trend in task-oriented dialogue systems in the past years, they still suffer from significant problems: The neural models often produce replies inconsistent with past dialogue context or database results, their replies may be dull and formulaic, and they require large amounts of annotated data to train. In this talk, I will present two of our recent experiments that aim at solving these problems.

First, our end-to-end neural system AuGPT based on the GPT-2 pretrained language model aims at consistency and variability in dialogue responses by using massive data augmentation and filtering as well as specific auxiliary training objectives which check for dialogue consistency. It reached favorable results in terms of both automatic metrics and human judgments (in the DSTC9 competition).

Second, we designed a system that is able to discover relevant dialogue slots (domain attributes) without any human annotation. It uses weak supervision from generic linguistic annotation models (semantic parser, named entities), which is further filtered and clustered. We train a neural slot tagger on the discovered slots, which then reaches state-of-the-art results in dialogue slot tagging without labeled training data. We further show that the discovered slots are helpful for training an end-to-end neural dialogue system.

His talk takes place on Wednesday, December 1, 2021 at 15:00 in “little theater” R211 (next to Kachnicka student club in “Stary Pivovar”). The talk will be streamed live and recorded at https://www.youtube.com/watch?v=JzBy-QuLxiE.

Tanel Alumäe: Weakly supervised training for speaker and language recognition

Tanel Alumäe is the head of the Laboratory of Language Technology at Tallinn University of Technology (TalTech). He received his PhD degree from the same university in 2006. Since then, he has worked in several research teams, including LIMSI/CNRS, Aalto University and Raytheon BBN Technologies. His recent research has focused on practical approaches to low-resource speech and language processing.

Weakly supervised training for speaker and language recognition

Speaker identification models are usually trained on data where the speech segments corresponding to the target speakers are hand-annotated. However, the process of hand-labelling speech data is expensive and doesn’t scale well, especially if a large set of speakers needs to be covered. Similarly, spoken language identification models require large amounts of training samples from each language that we want to cover.
This talk will show how metadata accompanying speech data found on the internet can be treated as weak and/or noisy labels for training speaker and language identification models. Speaker identification models can be trained using only the information about speakers appearing in each of the recordings in training data, without any segment-level annotation. For spoken language identification, we can often treat the detected language of the description of the multimedia clip as a noisy label. The latter method was used to compile VoxLingua107, a large-scale speech dataset for training spoken language identification models. The dataset consists of short speech segments automatically extracted from YouTube videos and labeled according to the language of the video title and description, with some post-processing steps to filter out false positives. It contains data for 107 languages, with 62 hours per language on average. A model trained on this dataset can be used as-is, or finetuned for a particular language identification task using only a small amount of manually verified data.

His talk takes place on Tuesday, November 9, 2021 at 13:00 in “little theater” R211 (next to Kachnicka student club in “Stary Pivovar”). The talk will be streamed live and recorded at https://youtu.be/fpsC0jzZSvs. Thanks to the FIT student union for support!

Boliang Zhang: End-to-End Task-oriented Dialog Agent Training and Human-Human Dialog Collection

Boliang Zhang is a research scientist at DiDi Labs, Los Angeles, CA. Currently, he works on building intelligent chatbots to help humans fulfill tasks. Before that, he interned at Microsoft, Facebook, and AT&T Labs. He received his Ph.D. in 2019 at Rensselaer Polytechnic Institute. His thesis focuses on applications of neural networks to information extraction for low-resource languages. He has a broad interest in applications of natural language processing. He participated in the DARPA Low Resource Languages for Emergent Incidents (LORELEI) project, where he, as a core system developer, built named entity recognition and linking systems for low-resource languages, such as Hausa and Oromo, and achieved first place in the evaluation four times in a row. At DiDi Labs, he led a small group that competed in the Multi-domain Task-oriented Dialog Challenge of DSTC9 and tied for first place among ten teams.

End-to-End Task-oriented Dialog Agent Training and Human-Human Dialog Collection

Task-oriented dialog systems aim to communicate with users through natural language to accomplish a wide range of tasks, such as restaurant booking, weather querying, etc. With the rising trend of artificial intelligence, they have attracted attention from both academia and industry. In the first half of this talk, I will introduce our participation in the DSTC9 Multi-domain Task-oriented Dialog Challenge and present our end-to-end dialog system. Compared to the traditional pipelined dialog architecture, where modules like Natural Language Understanding (NLU), Dialog Manager (DM), and Natural Language Generation (NLG) work separately and are optimized individually, our end-to-end system is a GPT-2 based, fully data-driven method that jointly predicts belief states, database queries, and responses. In the second half of the talk, as we found that existing dialog collection tools have limitations in real-world scenarios, I will introduce a novel human-human dialog platform that reduces all agent activity (API calls, utterances) to a series of clicks, yet maintains enough flexibility to satisfy users. This platform enables real-time agents to do real tasks while storing all agent actions, which are later used for training chatbots.

The talk will take place on Tuesday, April 20th at 17:00 CEST (sorry for the late hour, but Boliang is on the US West Coast), virtually on Zoom at https://cesnet.zoom.us/j/95296064691.

Video recording of the talk is publicly available.

Slides of the talk are publicly available.

Srikanth Madikeri: Automatic Speech Recognition for Low-Resource languages

Srikanth Madikeri got his Ph.D. in Computer Science and Engineering from the Indian Institute of Technology Madras (India) in 2013. During his Ph.D., he worked on automatic speaker recognition and spoken keyword spotting. He is currently working as a Research Associate at Idiap Research Institute (Martigny, Switzerland) in the Speech Processing group. His current research interests include automatic speech recognition for low-resource languages, automatic speaker recognition and speaker diarization.

Automatic Speech Recognition for Low-Resource languages

This talk focuses on automatic speech recognition (ASR) systems for low-resource languages with applications to information retrieval.
A common approach to improving ASR system performance for low-resource languages is to train multilingual acoustic models by pooling resources from multiple languages. In this talk, we present the challenges and benefits of different multilingual modeling approaches with Lattice-Free Maximum Mutual Information (LF-MMI), the state-of-the-art technique for hybrid ASR systems. We also present an incremental semi-supervised learning approach applied to multi-genre speech recognition, a common task in the MATERIAL program. The simple approach helps avoid fast saturation of performance improvements when using large amounts of data for semi-supervised learning. Finally, we present Pkwrap, a PyTorch wrapper on Kaldi (among the most popular speech recognition toolkits), that helps combine the benefits of training acoustic models with PyTorch and Kaldi. The toolkit, now available at https://github.com/idiap/pkwrap, is intended to provide the fast prototyping benefits of PyTorch while using necessary functionalities from Kaldi (LF-MMI, parallel training, decoding, etc.).

The talk will take place on Monday, March 8th, 2021 at 13:00 CET, virtually on Zoom at https://cesnet.zoom.us/j/98589068121.

Jan Ullrich: Research on head-marking languages – its contribution to linguistic theory and implications for NLP and NLU

Jan Ullrich is the linguistic director of The Language Conservancy, an organization serving indigenous communities in projects of language documentation and revitalization. His main research interests are in morphosyntactic analyses, semantics, corpus linguistics, lexicography, and second language acquisition.
He holds a Ph.D. in linguistics from Heinrich-Heine-Universität in Düsseldorf. He has taught at Indiana University, University of North Dakota, Oglala Lakota College, and Sitting Bull College and has given lectures at a number of institutions in Europe and North America.
Ullrich has been committed to and worked in fieldwork documentation and analysis of endangered languages since 1992, primarily focusing on the Dakotan branch of the Siouan language family (e.g. Lakhota, Dakhota, Assiniboine, Stoney). His research represents highly innovative, and in parts groundbreaking, analysis of predication and modification in Lakhota. He is the author and co-author of a number of highly acclaimed publications, such as the New Lakota Dictionary and the Lakota Grammar Handbook.

Research on head-marking languages: its contribution to linguistic theory and implications for NLP and NLU

Some of the most widely used linguistic theories, and especially those which have been more or less unsuccessfully applied in computer parsing and NLP, are affected by three main problems: (a) they are largely based on the study of dependent-marking syntax, which means they ignore half of the world’s languages, (b) they are syntacto-centric and mostly disregard semantics, and (c) they are not monostratal, but instead propose deep structures which cannot readily be accessed by statistically driven models and parsing algorithms.
This presentation will introduce a number of the broadly relevant theoretical concepts developed from the study of head-marking languages, such as Lakhóta (Siouan), and some of their implications for NLP and NLU. It will offer a brief introduction to Role and Reference Grammar, a theory which connects structure and function by implementing a two-way linking algorithm between constituency-based structural analysis and semantics.

His talk will be held jointly as a VGS-IT seminar and a lecture of the MTIa master course and takes place on Thursday, February 27th, 2020 at 12:00 in room E112.

Jan Chorowski: Representation learning for speech and handwriting

Jan Chorowski is an Associate Professor at the Faculty of Mathematics and Computer Science at the University of Wrocław and Head of AI at NavAlgo. He received his M.Sc. degree in electrical engineering from the Wrocław University of Technology, Poland and his EE PhD from the University of Louisville, Kentucky in 2012. He has worked with several research teams, including Google Brain, Microsoft Research and Yoshua Bengio’s lab at the University of Montreal. He led a research topic during the JSALT 2019 workshop. His research interests are applications of neural networks to problems which are intuitive and easy for humans and difficult for machines, such as speech and natural language processing.

Representation learning for speech and handwriting

Learning representations of data in an unsupervised way is still an open problem of machine learning. We consider representations of speech and handwriting learned using autoencoders equipped with autoregressive decoders such as WaveNets or PixelCNNs. In those autoencoders, the encoder only needs to provide the little information needed to supplement all that can be inferred by the autoregressive decoder. This allows learning a representation able to capture high-level semantic content from the signal, e.g. phoneme or character identities, while being invariant to confounding low-level details in the signal such as the underlying pitch contour or background noise. I will show how the design choices of the autoencoder, such as the kind of bottleneck and its hyperparameters, impact the induced latent representation. I will also show applications to unsupervised acoustic unit discovery on the ZeroSpeech task. Finally, I’ll show how knowledge about the average unit duration can be enforced during training, as well as during inference on new data.

His talk takes place on Friday, January 10, 2020 at 13:00 in room A112.

Ilya Oparin (Apple, USA): Connecting and Comparing Language Model Interpolation Techniques

Ilya Oparin leads the Language Modeling team that contributes to improving Siri at Apple. He did his Ph.D. on language modeling of inflectional languages at the University of West Bohemia in collaboration with the Speech@FIT group at Brno University of Technology. Before joining Apple in 2014, Ilya spent 3 years as a post-doc in the Spoken Language Processing group at LIMSI. Ilya’s research interests cover any topics related to language modeling for automatic speech recognition and, more broadly, for natural language processing.

Connecting and Comparing Language Model Interpolation Techniques

In this work, we uncover a theoretical connection between two language model interpolation techniques, count merging and Bayesian interpolation. We compare these techniques as well as linear interpolation in three scenarios with abundant training data per component model. Consistent with prior work, we show that both count merging and Bayesian interpolation outperform linear interpolation. We include the first (to our knowledge) published comparison of count merging and Bayesian interpolation, showing that the two techniques perform similarly. Finally, we argue that other considerations will make Bayesian interpolation the preferred approach in most circumstances.
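
As a point of reference, linear interpolation, the baseline named in the abstract, can be sketched in a few lines. This is a minimal illustration and not the speaker's implementation: the toy bigram tables `news` and `chat`, the weights, and the function name `interpolate` are all assumptions made for the example. Count merging and Bayesian interpolation are not reproduced here.

```python
def interpolate(models, weights, word, history):
    """Linearly interpolate component LM probabilities for one word.

    Each model is a hypothetical table {history: {word: probability}}.
    The weights are fixed, history-independent, and must sum to 1.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(
        w * m.get(history, {}).get(word, 0.0)
        for m, w in zip(models, weights)
    )


# Two toy component models (illustrative numbers only).
news = {"the": {"cat": 0.2, "dog": 0.8}}
chat = {"the": {"cat": 0.6, "dog": 0.4}}

p_cat = interpolate([news, chat], [0.25, 0.75], "cat", "the")
p_dog = interpolate([news, chat], [0.25, 0.75], "dog", "the")
print(p_cat, p_dog)
```

Because each component distribution sums to one and the weights sum to one, the interpolated probabilities over the vocabulary also sum to one.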

His talk takes place on Thursday, December 19, 2019 at 13:00 in “little theater” R211 (next to Kachnicka student club in “Stary Pivovar”).

Barbara Schuppler: Automatic speech recognition for conversational speech, or: What we can learn from human talk in interaction

Barbara Schuppler (Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria) pursued her PhD research at Radboud Universiteit Nijmegen (The Netherlands) and at NTNU Trondheim (Norway) within the Marie Curie Research Training Network “Sound to Sense”. The central topic of her thesis was the analysis of conditions for variation in large conversational speech corpora using ASR technology. Currently, she is working on an FWF-funded Elise-Richter Grant entitled “Cross-layer prosodic models for conversational speech,” and in October 2019 she starts her follow-up project “Cross-layer language models for conversational speech.” Her research continues to be interdisciplinary; it includes the development of automatic tools for the study of prosodic variation, the study of reduction and phonetic detail in conversational speech and the integration of linguistic knowledge into ASR technology.

Automatic speech recognition for conversational speech, or: What we can learn from human talk in interaction

In the last decade, conversational speech has received a lot of attention among speech scientists. On the one hand, accurate automatic speech recognition (ASR) systems are essential for conversational dialogue systems, as these become more interactional and social rather than solely transactional. On the other hand, linguists study natural conversations, as they reveal insights beyond those of controlled experiments with respect to how speech processing works. Investigating conversational speech, however, requires not only applying existing methods to new data, but also developing new categories and new modeling techniques and including new knowledge sources. Whereas traditional models are trained on either text or acoustic information, I propose language models that incorporate information on the phonetic variation of the words (i.e., pronunciation variation and prosody) and relate this information to the semantic context of the conversation and to the communicative functions in the conversation. This approach to language modeling is in line with the theoretical model proposed by Hawkins and Smith (2001), where the perceptual system accesses meaning from speech by using the most salient sensory information from any combination of levels/layers of formal linguistic analysis. The overall aim of my research is to create cross-layer models for conversational speech. In this talk, I will illustrate general challenges for ASR with conversational speech, I will present results from my recent and ongoing projects on pronunciation and prosody modeling, and I will discuss directions for future research.

Her talk takes place on Thursday, October 31, 2019 at 13:00 in “little theater” R211 (next to Kachnicka student club in “Stary Pivovar”).

Pratibha Moogi: India Centric R&D efforts in artificial intelligence

Pratibha Moogi holds a PhD from OGI, School of Engineering, OHSU, Portland and a Master’s from IIT Kanpur. She has worked at SRI International and in many R&D groups, including Texas Instruments, Nokia, and Samsung. Currently she is a Director in the Data Science Group (DSG) at [24]7.ai, a leading B2B customer operation and journey analytics company. She is also actively involved in mentoring India-wide training initiatives and start-ups working in the domain of ML and AI to strengthen the local Indian ecosystem. She has 16+ years of industry experience working on a diverse set of multimedia processing and ML-based technologies, namely speech and audio recognition, fingerprint and iris biometrics, and the development of computer vision based solutions and use-case scenarios. Her current interests are the emerging fields of applying machine learning to interdisciplinary, cross-domain areas, e.g. predictive analytics based on multichannel data sources.

India Centric R&D efforts in artificial intelligence

India is a country of ~1.3 billion people, with ~300 million smartphone users and ~600 million internet users who encounter AI- and ML-flavored solutions every single day, more than ever: an intelligent camera that takes a picture when you give that perfect smile, beautifies your face, or tags your pictures based on the content or subject you tried to capture in some of your perfect shots; hiding your gallery photos from intruders using your fingerprint, iris, or face biometric; or fetching the details of a product that you just spotted in a mall, or that your friend has right now, empowered by content (image) based information (product) search algorithms. Speech and language technologies are redefining the voice interface for ordinary Indian users, who speak 28+ local languages. Voice analytics solutions are empowering BPO (customer care) centers, whether by routing millions of calls using automatically detected customer intents, segregating calls by positive or negative customer sentiment, or automatically generating business insights that can drive more profits, revenues, and higher customer satisfaction scores, all powered by predictive analytics solutions. This talk covers some of the India-centric R&D efforts that I experienced while working on a variety of products, services, and solutions over the last decade. The talk is organized into the following topics: 1. AI/ML – Digital India – Context (Problems & Opportunities, GDP landscape, Start-ups scenario), 2. Products & Solutions – recent deployments, 3. Present R&D spectrum – Algorithmic research efforts, 4. Overall learnings from the Indian market.

Her talk takes place on Friday, September 13, 2019 at 13:00 in room A112.

Itshak Lapidot: Speaker Diarization and a bit more

Itshak Lapidot emigrated from the USSR to Israel in 1971. He received his B.Sc., M.Sc., and Ph.D. degrees from the Electrical and Computer Engineering Department of Ben-Gurion University, Beer-Sheva, Israel in 1991, 1994 and 2001, respectively. For one year (2002-2003) he held a postdoctoral position at IDIAP, Switzerland. Dr. Lapidot was previously a lecturer at the Electrical and Electronics Engineering Department at Sami Shamoon College of Engineering (SCE) in Beer-Sheva, Israel, and served as a Researcher at the Laboratoire Informatique d’Avignon (LIA), University of Avignon in France for one year (2011-2012). Recently, Dr. Lapidot assumed a teaching position with the Electrical Engineering Department at the Afeka Academic College of Engineering and joined the ACLP research team. Dr. Lapidot’s primary research interests are speaker diarization, speaker clustering and speaker verification. He is also interested in clustering and time series analysis from a theoretical point of view.

Speaker Diarization and a bit more

The talk presents four approaches that were developed for speaker and speech technologies but can also be applied to other machine learning (ML) technologies:
1. Speaker diarization – answering the question “Who spoke when?”. When nothing is known about the speakers or the environment, no prior knowledge can be used and the problem is unsupervised. When no prior information can be used, even to train a GMM, a Total Variability matrix or PLDA, a different approach is needed, one that uses only the data of the given conversation. One possible solution is Viterbi-based segmentation with hidden Markov models (HMMs). It assumes a high correlation between the log-likelihood and the diarization error rate (DER), and this assumption leads to various problems. One possible solution will be shown, which applies not only to probabilistic systems but to a much broader family of solutions named hidden-distortion-models (HDMs).
2. In various applications, such as homeland security, clustering a large number of short segments is very important. The number of segments can range from hundreds to tens of thousands, and the number of speakers from 2 up to tens of speakers (about 60 speakers). Several variants of the mean-shift clustering algorithm will be presented to solve the problem. An automatic way to estimate the clustering validity will be presented as well. This is very important, as clustering can be viewed as preprocessing for other tasks, e.g., speaker verification, and bad clustering will lead to poor verification results. As manual assessment of the clustering is not feasible, an automatic tool is almost a “must”.
3. Data-homogeneity measure for voice comparison – given two speech utterances for speaker verification, it is important that the utterances are valid for a reliable comparison. The utterances may be too short, or may not share enough common information for comparison. In such cases a high or low likelihood ratio is meaningless. The test of the data quality should be independent of the verification system. Such an entropy-based measure will be presented, and its relation to verification performance will be shown.
4. Database assessment – when data are divided into train, development and evaluation datasets, and the data are sequential, as speech is, it is very difficult to know whether the sets are statistically meaningful for learning (even a fair coin can land on tails 100 times in a row). It is important to verify the statistical validity of the datasets before training, development and evaluation, and this should be verified independently of the verification system or approach. Such a data assessment, based on the entropy of the speech waveform, will be presented.
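
To give a feel for the clustering step in item 2, the following is a minimal sketch of plain flat-kernel mean shift on toy data. It is not one of the variants presented in the talk: the 2-D points stand in for per-segment speaker representations, and the names `mean_shift`, `label_modes`, `bandwidth` and `merge_radius` are assumptions made for this illustration only.

```python
import math

def mean_shift(points, bandwidth, n_iter=50, tol=1e-6):
    """Flat-kernel mean shift: repeatedly move each mode estimate to the
    mean of all data points that lie within `bandwidth` of it."""
    modes = [list(p) for p in points]
    for _ in range(n_iter):
        max_move = 0.0
        for i, m in enumerate(modes):
            neigh = [p for p in points if math.dist(p, m) <= bandwidth]
            new = [sum(c) / len(neigh) for c in zip(*neigh)]
            max_move = max(max_move, math.dist(new, m))
            modes[i] = new
        if max_move < tol:  # every mode has stopped moving
            break
    return modes

def label_modes(modes, merge_radius):
    """Assign the same label to points whose modes ended up close together."""
    centres, labels = [], []
    for m in modes:
        for k, c in enumerate(centres):
            if math.dist(m, c) <= merge_radius:
                labels.append(k)
                break
        else:  # no existing centre is close enough: open a new cluster
            centres.append(m)
            labels.append(len(centres) - 1)
    return labels, centres

# Toy data (hypothetical): three short segments from each of two speakers.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, -0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centres = label_modes(mean_shift(points, bandwidth=1.0), merge_radius=0.5)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The number of clusters is not fixed in advance; it follows from the bandwidth, which is one reason mean shift suits a task where the number of speakers is unknown.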

His talk takes place on Tuesday, January 15, 2019 at 13:00 in room A113.

Misha Pavel: Digital Phenotyping Using Computational Models of Neuropsychological Processes Underlying Behavioral States and their Dynamics

Misha Pavel holds a joint faculty appointment in the College of Computer & Information Science and Bouvé College of Health Sciences. His background comprises electrical engineering, computer science and experimental psychology, and his research is focused on multiscale computational modeling of behaviors and their control, with applications ranging from elder care to augmentation of human performance. Professor Pavel is using these model-based approaches to develop algorithms transforming unobtrusive monitoring from smart homes and mobile devices to useful and actionable knowledge for diagnosis and intervention. Under the auspices of the Northeastern-based Consortium on Technology for Proactive Care, Professor Pavel and his colleagues are targeting technological innovations to support the development of economically feasible, proactive, distributed, and individual-centered healthcare. In addition, Professor Pavel is investigating approaches to inferring and augmenting human intelligence using computer games, EEG and transcranial electrical stimulation. Previously, Professor Pavel was the director of the Smart and Connected Health Program at the National Science Foundation, a program co-sponsored by the National Institutes of Health. Earlier, he served as the chair of the Department of Biomedical Engineering at Oregon Health & Science University, a Technology Leader at AT&T Laboratories, a member of the technical staff at Bell Laboratories, and faculty member at Stanford University and New York University. He is a Senior Life Member of IEEE.

Digital Phenotyping Using Computational Models of Neuropsychological Processes Underlying Behavioral States and their Dynamics

Human behaviors are both key determinants of health and effective indicators of individuals’ health and mental states. Recent advances in sensing, communication technology and computational modeling are creating an unprecedented opportunity to monitor individuals in the wild, that is, in their daily lives. Continuous monitoring thereby enables Digital Phenotyping: the characterization of health states and the inference of subtle changes in those states, which in turn facilitates theoretical insights into human neuropsychology and neurophysiology. Moreover, temporally dense measurements may provide opportunities for optimal just-in-time interventions that help individuals improve their health behaviors. Harvesting the potential benefits of digital phenotyping is, however, limited by the variability of behaviors as well as by contextual and environmental effects that may significantly distort measured data. To mitigate these adverse effects, we have been developing computational models of a variety of physiological, neuropsychological and behavioral phenomena. In this talk, I will briefly discuss a continuum of models ranging from completely data-driven to principle-based, causal and mechanistic. I will then describe a few examples of approaches in several domains including cognition, sensory-motor behaviors and affective states. I will also describe a framework that can use such approaches as components of future proactive and distributed care, tailored to individuals.

His talk takes place on Monday, December 3, 2018 at 13:00 in room A113.

Jiří Schimmel: Spatial Audio Coding Using Ambisonic

Jiří Schimmel joined the Department of Telecommunications of FEEC BUT as a doctoral student in 1999. In 2006 he defended his doctoral thesis on the topic “Audio Effect Synthesis Using Non-Linear Signal Processing” and in 2016 his habilitation thesis on “New Methods of Spatial Audio Coding and Rendering”. His professional scientific activity focuses on research in digital audio signal processing and on the research and development of real-time signal processing systems and multi-channel sound systems. He also cooperates with domestic and foreign companies (C-Mexx, DFM, Audified).

Spatial Audio Coding Using Ambisonic

Ambisonic is a mathematically based acoustic signal processing technology that attempts to capture and reproduce information from a complete three-dimensional sound field, including the exact localization of each sound source and the environmental characteristics of the field. In essence, it is a simplified solution of the wave equation for a progressive convergent spherical wave, using a spherical harmonic decomposition of the wave field. The theory and technologies related to ambisonic were developed as early as the 1970s, but real-time use has only been enabled by modern computing technologies. The output of the coding process is a set of so-called ambisonic components, whose number determines the order of the ambisonic as well as the accuracy of the encoding and of the subsequent reconstruction of the sound field. There are two ways to obtain the ambisonic components: encoding a sound object, or capturing the sound field with a 3D microphone. The encoding process is based on finding the weighting factors of the ambisonic components according to the position of an audio object. For 3D sound field capture, a set of microphones is used that forms a virtual 3D microphone whose components are identical to the ambisonic components. The decoding process is based on reconstructing the sound field using several sound sources (loudspeakers), which requires further simplifications. Although the sound field is mathematically fully described in ambisonic, there are still many problems that need to be addressed in its practical use.
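
As an illustration of the encoding step, the sketch below computes the weighting factors for a single audio object at first order. It assumes the traditional first-order B-format convention (W scaled by 1/√2, x pointing to the front, y to the left, z up, azimuth measured counter-clockwise from the front), which the abstract itself does not specify; the function names are hypothetical.

```python
import math

def num_components(order):
    """Number of ambisonic components of a full 3-D representation."""
    return (order + 1) ** 2

def encode_first_order(sample, azimuth_deg, elevation_deg):
    """Encode one mono sample of an audio object into W, X, Y, Z.

    Assumed convention: traditional B-format, W weighted by 1/sqrt(2).
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)                  # omnidirectional component
    x = sample * math.cos(az) * math.cos(el)     # front-back
    y = sample * math.sin(az) * math.cos(el)     # left-right
    z = sample * math.sin(el)                    # up-down
    return w, x, y, z

# A source straight ahead contributes only to W and X.
print(encode_first_order(1.0, azimuth_deg=0.0, elevation_deg=0.0))
print(num_components(1), num_components(3))  # 4 16
```

Higher orders add further spherical-harmonic components in the same way, with (order + 1)² components in total, which is why accuracy grows with the order.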

His talk takes place on Tuesday, October 2, 2018 at 13:00 in room A113.

Petr Dokládal: Image processing in Non-Destructive Testing

Petr Dokládal is a senior researcher with the Center for Mathematical Morphology, a joint research lab of Armines and MINES ParisTech, Paris, France. He graduated from the Technical University in Brno, Czech Republic, in 1994 as a telecommunication engineer, received his Ph.D. degree in general computer sciences, specializing in image processing, from the Marne la Vallée University, France, in 2000, and received his habilitation from the ParisEst University in 2013. His research interests include mathematical morphology, image segmentation, object tracking and pattern recognition.

Image processing in Non-Destructive Testing

Non-destructive testing is a frequent task in industry for material control and structure inspection. There are many imaging techniques available to make defects visible. Effort is being made to automate the process to make it repeatable, more accurate, cheaper and environmentally friendly. Other techniques (able to work remotely, easier to automate) are being developed. Most of these techniques are still followed by a visual inspection performed by qualified personnel.

At the beginning of this talk we will review a few of the various inspection techniques used in industry. In the second part we will focus on the detection of cracks. From the image processing point of view, cracks are thin, curvilinear structures. They are not always easy to detect, especially when surrounded by noise. We show in this talk how cracks can be detected by using path openings, an operator from mathematical morphology. Then, inspired by the a contrario approach, we will show how to choose a convenient threshold value to obtain a binary result. The a contrario approach, instead of modeling the structures to detect, models the noise in order to detect structures that deviate from the model. In this scope, we assume noise composed of pixels that are independent random variables. Hence cracks, which are curvilinear and not necessarily connected sequences of bright pixels, are detected as abnormal sequences of bright pixels. Finally, a fast approximation of the solution based on parsimonious path openings is shown.
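
The idea behind a path opening can be sketched on a small binary image. The code below is a simplified illustration, not the method of the talk: it handles a single orientation only (paths that advance one column to the east per step and may drift one row up or down), whereas a complete path opening combines several orientations and also exists for grey-level images. The function names and the toy image are made up for this example.

```python
def path_lengths_west_east(img):
    """For every foreground pixel, length of the longest west-to-east path
    of foreground pixels passing through it (0 for background pixels)."""
    rows, cols = len(img), len(img[0])
    fwd = [[0] * cols for _ in range(rows)]  # longest path ending at the pixel
    bwd = [[0] * cols for _ in range(rows)]  # longest path starting at the pixel
    for c in range(cols):
        for r in range(rows):
            if img[r][c]:
                prev = 0
                if c > 0:
                    prev = max(fwd[rr][c - 1]
                               for rr in (r - 1, r, r + 1) if 0 <= rr < rows)
                fwd[r][c] = prev + 1
    for c in range(cols - 1, -1, -1):
        for r in range(rows):
            if img[r][c]:
                nxt = 0
                if c < cols - 1:
                    nxt = max(bwd[rr][c + 1]
                              for rr in (r - 1, r, r + 1) if 0 <= rr < rows)
                bwd[r][c] = nxt + 1
    # The pixel itself is counted in both fwd and bwd, hence the -1.
    return [[fwd[r][c] + bwd[r][c] - 1 if img[r][c] else 0
             for c in range(cols)] for r in range(rows)]

def path_opening(img, min_length):
    """Keep only pixels that belong to a path of at least `min_length` pixels."""
    return [[1 if v >= min_length else 0 for v in row]
            for row in path_lengths_west_east(img)]

# A thin, slightly bent "crack" of 5 pixels plus two isolated noise pixels.
img = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0],
]
for row in path_opening(img, min_length=4):
    print(row)
```

With `min_length=4` the five crack pixels survive and both isolated pixels are removed. Choosing that length (or, for grey-level images, the threshold) is the step that the a contrario reasoning in the abstract addresses.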

His talk takes place on Tuesday, September 18, 2018 at 13:00 in room A113.

Santosh Mathan: Scaling up Cognitive Efficacy with Neurotechnology

Santosh Mathan is an Engineering Fellow at Honeywell Aerospace. His research lies at the intersection of human-computer interaction, machine learning, and neurophysiological sensing. Santosh is principal investigator and program manager on several efforts to use neurotechnology in practical settings. These efforts, carried out in collaboration with academic and industry researchers around the world, have led to the development of systems that can estimate changes in cognitive function following brain trauma, identify fluctuations in attention, boost the activity of cortical networks underlying fluid intelligence, and serve as the basis for hands-free robotic control. Papers describing these projects have won multiple best paper awards at research conferences, and have been covered by the press in publications including the Wall Street Journal and Wired. He has been awarded over 19 US patents. Santosh has a doctoral degree in Human-Computer Interaction from the School of Computer Science at Carnegie Mellon University, where his research explored the use of computational cognitive models for diagnosing and remedying student difficulties during skill acquisition.

Scaling up Cognitive Efficacy with Neurotechnology

Cognition and behavior arise from the activity of billions of neurons. Ongoing research indicates that non-invasive neural sensing techniques can provide a window into this never-ending storm of electrical activity in our brains, and yield rich information of interest to system designers and trainers. Direct measurement of brain activity has the potential to provide objective measures that can help estimate the impact of a system on users during the design process, estimate cognitive proficiency during training, and provide new modalities for humans to interact with computer systems. In this presentation, Santosh Mathan will review research in the Honeywell Advanced Technology organization that offers novel tools and techniques to advance Human-Computer Interaction. While many of these research explorations are at an early stage, they offer a preview of practical tools that lie around the corner for researchers and practitioners with an interest in boosting human performance in challenging task environments.

His talk takes place on Friday, August 24, 2018 at 13:00 in room A112.

Slides of the talk are publicly available.

Niko Brummer: Tractable priors, likelihoods, posteriors and proper scoring rules for the astronomically complex problem of partitioning a large set of recordings w.r.t. speaker

Niko Brummer received B.Eng (1986), M.Eng (1988) and Ph.D. (2010) degrees, all in electronic engineering, from Stellenbosch University. He worked as a researcher at DataFusion (later called Spescom DataVoice) and AGNITIO, and is currently with Nuance Communications. Most of his research for the last 25 years has been applied to automatic speaker and language recognition, and he has participated in most of the NIST SRE and LRE evaluations in these technologies, from the year 2000 to the present. He has been contributing to the Odyssey Workshop series since 2001 and was the organizer of Odyssey 2008 in Stellenbosch. His FoCal and Bosaris Toolkits are widely used for fusion and calibration in speaker and language recognition research.

His research interests include development of new algorithms for speaker and language recognition, as well as evaluation methodologies for these technologies. In both cases, his emphasis is on probabilistic modelling. He has worked with both generative (eigenchannel, JFA, i-vector PLDA) and discriminative (system fusion, discriminative JFA and PLDA) recognizers. In evaluation, his focus is on judging the goodness of classifiers that produce probabilistic outputs in the form of well calibrated class likelihoods.

Tractable priors, likelihoods, posteriors and proper scoring rules for the astronomically complex problem of partitioning a large set of recordings w.r.t. speaker

Real-world speaker recognition problems are not always arranged into neat, NIST-style challenges with large labelled training databases and binary target/non-target evaluation trials. In the most general case we are given a (sometimes large) collection of recordings and ideally we just want to go and recognize the speakers in there. This problem is usually called speaker clustering, and solutions like AHC (agglomerative hierarchical clustering) exist. The catch is that neither AHC nor indeed any other yet-to-be-invented algorithm can find the correct solution with certainty. In the simple case of binary trials, we in the speaker recognition world are already very comfortable with dealing with this uncertainty: the recognizers quantify their uncertainty as likelihood ratios. We know how to calibrate these likelihood ratios, how to use them to make Bayes decisions and how to judge their goodness with proper scoring rules. At first glance all of these things seem to be hopelessly intractable for the clustering problem because of the astronomically large size of the solution space. In this talk we show otherwise and propose a suite of tractable tools for probabilistic clustering.
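
To see why the solution space is called astronomically large: the number of ways to partition n recordings into speaker clusters is the Bell number B(n). The short sketch below (the function name is chosen for this illustration and is not taken from the talk) computes it with the Bell triangle.

```python
def bell_number(n):
    """Number of ways to partition a set of n items (Bell triangle)."""
    row = [1]
    for _ in range(n):
        nxt = [row[-1]]            # each row starts with the last entry of the previous one
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
    return row[0]

for n in (3, 5, 10):
    print(n, bell_number(n))
# 3 5
# 5 52
# 10 115975

# Number of decimal digits of B(100), i.e. partitions of just 100 recordings.
print(len(str(bell_number(100))))
```

Ten recordings already admit 115,975 different partitions, so enumerating every hypothesis to normalize a posterior is out of the question for realistic collections.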

His talk takes place on Monday, April 16, 2018 at 13:00 in room G202.

Video recording of the talk is publicly available.

Slides of the talk are publicly available.