Hynek Hermansky has been active in speech research for over 40 years. He is a Life Fellow of the IEEE and a Fellow of the International Speech Communication Association, has authored or co-authored more than 350 papers with over 20,000 citations, holds more than 20 patents, and has received the IEEE James L. Flanagan Speech and Audio Processing Award and the ISCA Medal for Scientific Achievements. He started his career in 1972 at Brno University of Technology, obtained his D.Eng. degree from the University of Tokyo, and worked for Panasonic Technologies, U S WEST Advanced Technologies, the Oregon Graduate Institute, IDIAP Martigny, the Johns Hopkins University, and Google DeepMind. Currently, he is a Researcher at Speech@FIT BUT and an Emeritus Professor at the Johns Hopkins University.
Learning: It’s not just for machines anymore
Machine recognition of speech requires training on a large amount of speech data. Consequently, research in machine recognition of speech consists mainly of getting hold of large amounts of speech training data and combining them, often by trial and error, with an appropriate combination of processing modules. Advances are mostly evaluated by error rates observed in recognition of test data. Such a process may be missing one of the prime goals of scientific endeavor, which is to obtain new knowledge applicable to other applications. We argue that speech data can be used to obtain relevant knowledge about hearing, which is used in decoding messages in speech, and report on some experiments that support this notion.
His talk takes place Wednesday, November 22, 2023 at 14:00 in E105.
Sébastien Lefèvre has been a Full Professor in Computer Science at the University of South Brittany (Vannes Institute of Technology) since September 2010. He founded the OBELIX group at the IRISA laboratory and led the group from 2013 to 2021 (Prof. Nicolas Courty has led the group since March 2021). He also coordinates the GeoData Science track within the Erasmus Mundus Copernicus Master in Digital Earth. His main research topics are image analysis/processing, pattern recognition and indexing, machine learning, deep learning and data mining with applications in remote sensing for Earth observation.
Deep Learning in Computer Vision: Are Numerous Labels the Holy Grail?
Deep Learning has been successful in a wide range of computer vision tasks, at the cost of high computational resources and large labeled datasets required to train the models. The latter is a strong bottleneck in numerous applications where collecting annotated data is challenging.
In this talk, I will present some of our work attempting to alleviate the need for large annotated datasets. More precisely, the methods we develop rely on semi-supervised, weakly-supervised and unsupervised settings, domain adaptation, data simulation and active learning, among other frameworks. Various applications in Earth Observation will be provided to illustrate the relevance of these solutions for a wide range of problems such as semantic segmentation, image classification, or object detection.
His talk takes place on Thursday, June 15, 2023 at 14:00 in G108.
Jiri Mekyska is head of the BDALab (Brain Diseases Analysis Laboratory) at the Brno University of Technology, where he leads a multidisciplinary team of researchers (signal processing engineers, data scientists, neuroscientists, psychologists) with a special focus on the development of new digital biomarkers facilitating understanding, diagnosis and monitoring of neurodegenerative (e.g. Parkinson’s disease) and neurodevelopmental (e.g. dysgraphia) disorders.
Acoustic analysis of speech and voice disorders in patients with Parkinson’s disease
Parkinson’s disease (PD) is the second most frequent neurodegenerative disease and is associated with several motor and non-motor features. Up to 90 % of PD patients develop a motor speech disorder called hypokinetic dysarthria (HD). HD manifests in the areas of phonation (e.g. increased instability of articulatory organs, microperturbation in pitch and amplitude), articulation (e.g. rigidity of tongue and jaw, slow alternating motion rate), prosody (e.g. monopitch, monoloudness), and respiration (e.g. airflow insufficiency). Acoustic analysis of these specific speech/voice disorders enables neurologists and speech-language therapists to effectively monitor the progress of PD as well as to diagnose it. In this talk, we will present a concept of acoustic HD analysis. We will then present some recent findings focused on the prediction of motor (freezing of gait) and non-motor (cognitive) deficits based on the acoustic analysis, and discuss applications of acoustic HD analysis in treatment effect monitoring (based on high-frequency repetitive transcranial magnetic stimulation) and in PD diagnosis. Finally, we will present some future directions in terms of integration into Health 4.0 systems.
His talk takes place on Tuesday, May 16, 2023 at 15:00 in E105.
András Lőrincz, a professor and senior researcher, has been teaching at the Faculty of Informatics at Eötvös University, Budapest since 1998. His research focuses on human-machine interaction and its applications in neurobiological and cognitive modeling, as well as medicine. He founded the Neural Information Processing Group of Eötvös University and directs a multidisciplinary team of mathematicians, programmers, computer scientists and physicists. He has acted as the PI of several successful international projects in collaboration with Panasonic, Honda Future Technology Research, the Information Directorate of the US Air Force, and Robert Bosch, Ltd. Hungary, among others. He has taken part in several EU Framework Program projects.
He habilitated in laser physics at the University of Szeged in 1998 and in informatics at the Eötvös Loránd University in 2008. He conducted research and taught quantum control, photoacoustics and artificial intelligence at the Hungarian Academy of Sciences, University of Chicago, Brown University, Princeton University, the Illinois Institute of Technology, University of Szeged, and Eötvös Loránd University. He has authored about 300 peer-reviewed scientific publications.
In 2006, he was elected a Fellow of the European Coordinating Committee for Artificial Intelligence (EurAI) for his pioneering work in the field of artificial intelligence. He received the Innovative Researcher Prize of the University in 2009 and in 2019.
Partners: Barcelona University (on personality estimation and human-human interaction), Technical University of Delft (on human-human interaction), Rush Medical School, Chicago (on autism diagnosis and PTSD therapy).
Towards human-machine and human-robot interactions “with a little help from my friends”
Our work in the Neural Information Processing Group focuses on human-machine interactions. The first part of the talk will be an introduction to the technologies that we can or should use for effective interactions, such as the detection of environmental context, ongoing activity, including body movement, manipulation, and hidden parameters, i.e. intention, mood and personality state, as well as communication signals: body, head, hand, face and gaze gestures, plus the body parameters that can be measured optically or by intelligent means, i.e., the temperature, blood pressure and stress levels, among others.
In the second part of the talk, I will review (a) what body and environment estimation methods we have, (b) what we can say about human-human interactions, which will also give insights into the requirements of human-machine and human-robot interactions, (c) what applications we have or can target in the areas of autism, “continuous healthcare” and “home and public safety”, and (d) what technologies are missing and where we are looking for partners.
His talk takes place on Tuesday, November 1, 2022 at 13:00 in “little theater” R211 (next to Kachnicka student club in “Stary Pivovar”).
Heikki Kälviäinen has been a Professor of Computer Science and Engineering since 1999. He is the head of the Computer Vision and Pattern Recognition Laboratory (CVPRL) at the Department of Computational Engineering of Lappeenranta-Lahti University of Technology LUT, Finland. Prof. Kälviäinen’s research interests include computer vision, machine vision, pattern recognition, machine learning, and digital image processing and analysis. Besides LUT, Prof. Kälviäinen has worked as a Visiting Professor at the Faculty of Information Technology of Brno University of Technology, Czech Republic, the Center for Machine Perception (CMP) of Czech Technical University, and the Centre for Vision, Speech, and Signal Processing (CVSSP) of University of Surrey, UK, and as a Professor of Computing at Monash University Malaysia.
Computer Vision Applications
The presentation considers computer vision, especially from the point of view of applications. Digital image processing and analysis with machine learning methods enable efficient solutions for various areas of useful data-centric engineering applications. Challenges with image acquisition, data annotation with expert knowledge, and clustering and classification, including the training of deep learning methods, are discussed. Different applications are given as examples based on fresh, novel data: plankton in the Baltic Sea, Saimaa ringed seals in Lake Saimaa, and logs in the sawmill industry. In the first application, the motivation is that distributions of plankton types give much information about the condition of the sea water system, e.g., about climate change. An imaging flow cytometer can produce a lot of plankton images, which should be classified into different plankton types. Manual classification of these images is very laborious, and thus a CNN-based method has been developed to automatically recognize the plankton types in the Baltic Sea. In the second application, the Saimaa ringed seals are automatically identified individually using camera trap images to help this very small population survive in nature. CNN-based re-identification methods are based on the pelage patterns of the seals. The third application is related to the sawmill industry. The digitalization of the sawmill industry is important for optimizing material flows and quality. The research is focused on seeing inside the log to be able to predict which kinds of sawn boards are produced after cutting the log.
His talk takes place on Wednesday, May 11, 2022 at 13:00 in room A112.
Augustin Žídek works as a Research Engineer at DeepMind and has been a member of the protein folding team since 2017. He studied Computer Science at the University of Cambridge. He enjoys working at the boundary of research and engineering, hiking, playing musical instruments and fixing things.
Protein Structure Prediction with AlphaFold
In this talk, we will discuss what proteins are, what the protein folding problem is and why it is an important scientific challenge. We will then talk about AlphaFold, a machine learning model developed by DeepMind that is able to predict protein 3D structure with high accuracy, its architecture and applications.
Hema A. Murthy is currently a Professor at the Department of Computer Science and Engineering. She has been with the department for the last 35 years. She currently leads an 18-institute consortium that focuses on speech as part of the national language translation mission, an ambitious project where the objective is to produce speech-to-speech translation in Indian languages and Indian English.
Signal Processing Guided Machine Learning
In this talk we will focus on using signal processing algorithms in tandem with machine learning algorithms for various tasks in speech, music and brain signals. The primary objective is to understand events of interest from the perspective of the chosen domain. Appropriate signal processing is employed to detect events. Machine learning algorithms are then made to focus on learning the statistical characteristics of these events. The primary advantage of this approach is that it significantly reduces both computation and data costs. Examples from speech synthesis, Indian art music, and neuronal signals and EEG signals will be considered.
Her talk takes place on Tuesday, February 8, 2022 at 13:00 CET, virtually on Zoom https://cesnet.zoom.us/j/91741432360.
Prof. Dr. Bernhard Egger studies how humans and machines can perceive faces and shapes in general. In particular, he focuses on statistical shape models and 3D Morphable Models. He is a junior professor at the chair of visual computing at Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). Before joining FAU he was a postdoc in Josh Tenenbaum’s Computational Cognitive Science Lab at the Department of Brain and Cognitive Sciences at MIT and the Center for Brains, Minds and Machines (CBMM), and in Polina Golland’s group at the MIT Computer Science & Artificial Intelligence Lab. He did his PhD on facial image annotation and interpretation in unconstrained images in the Graphics and Vision Research Group at the University of Basel. Before his doctorate he obtained his M.Sc. and B.Sc. in Computer Science at the University of Basel and an upper secondary school teaching diploma at the University of Applied Sciences Northwestern Switzerland.
Inverse Graphics and Perception with Generative Face Models
Human object perception is remarkably robust: Even when confronted with blurred or sheared photographs, or pictures taken under extreme illumination conditions, we can often recognize what we’re seeing and even recover rich three-dimensional structure. This robustness is especially notable when perceiving human faces. How can humans generalize so well to highly distorted images, transformed far beyond the range of natural face images we are normally exposed to? In this talk I will present an Analysis-by-Synthesis approach based on 3D Morphable Models that can generalize well across various distortions. We find that our top-down inverse rendering model better matches human percepts than either an invariance-based account implemented in a deep neural network, or a neural network trained to perform approximate inverse rendering in a feedforward circuit.
Ondřej Dušek is an assistant professor at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University. His research is in the areas of dialogue systems and natural language generation; he specifically focuses on neural-network-based approaches to these problems and their evaluation. He is also involved in the THEaiTRE project on automatic theatre play generation. Ondřej got his PhD in 2017 at Charles University. Between 2016 and 2018, he worked at the Interaction Lab at Heriot-Watt University in Edinburgh, one of the leading groups in dialogue systems and natural-language interaction with computers and robots. There he co-organized the E2E NLG text generation challenge and co-led a team of PhD students in the Amazon Alexa Prize dialogue system competition, which came third in two consecutive years.
Better Supervision for End-to-end Neural Dialogue Systems
While end-to-end neural models have been the research trend in task-oriented dialogue systems in the past years, they still suffer from significant problems: The neural models often produce replies inconsistent with past dialogue context or database results, their replies may be dull and formulaic, and they require large amounts of annotated data to train. In this talk, I will present two of our recent experiments that aim at solving these problems.
First, our end-to-end neural system AuGPT based on the GPT-2 pretrained language model aims at consistency and variability in dialogue responses by using massive data augmentation and filtering as well as specific auxiliary training objectives which check for dialogue consistency. It reached favorable results in terms of both automatic metrics and human judgments (in the DSTC9 competition).
Second, we designed a system that is able to discover relevant dialogue slots (domain attributes) without any human annotation. It uses weak supervision from generic linguistic annotation models (semantic parser, named entities), which is further filtered and clustered. We train a neural slot tagger on the discovered slots, which then reaches state-of-the-art results in dialogue slot tagging without labeled training data. We further show that the discovered slots are helpful for training an end-to-end neural dialogue system.
His talk takes place on Wednesday, December 1, 2021 at 15:00 in “little theater” R211 (next to Kachnicka student club in “Stary Pivovar”). The talk will be streamed live and recorded at https://www.youtube.com/watch?v=JzBy-QuLxiE.
Tanel Alumäe is the head of the Laboratory of Language Technology at Tallinn University of Technology (TalTech). He received his PhD degree from the same university in 2006. Since then, he has worked in several research teams, including LIMSI/CNRS, Aalto University and Raytheon BBN Technologies. His recent research has focused on practical approaches to low-resource speech and language processing.
Weakly supervised training for speaker and language recognition
Speaker identification models are usually trained on data where the speech segments corresponding to the target speakers are hand-annotated. However, the process of hand-labelling speech data is expensive and doesn’t scale well, especially if a large set of speakers needs to be covered. Similarly, spoken language identification models require large amounts of training samples from each language that we want to cover.
This talk will show how metadata accompanying speech data found on the internet can be treated as weak and/or noisy labels for training speaker and language identification models. Speaker identification models can be trained using only the information about speakers appearing in each of the recordings in the training data, without any segment-level annotation. For spoken language identification, we can often treat the detected language of the description of the multimedia clip as a noisy label. The latter method was used to compile VoxLingua107, a large-scale speech dataset for training spoken language identification models. The dataset consists of short speech segments automatically extracted from YouTube videos and labeled according to the language of the video title and description, with some post-processing steps to filter out false positives. It contains data for 107 languages, with 62 hours per language on average. A model trained on this dataset can be used as-is, or finetuned for a particular language identification task using only a small amount of manually verified data.
His talk takes place on Tuesday, November 9, 2021 at 13:00 in “little theater” R211 (next to Kachnicka student club in “Stary Pivovar”). The talk will be streamed live and recorded at https://youtu.be/fpsC0jzZSvs (thanks to the FIT student union for support!).
Boliang Zhang is a research scientist at DiDi Labs, Los Angeles, CA. Currently, he works on building intelligent chatbots to help humans fulfill tasks. Before that, he interned at Microsoft, Facebook, and AT&T Labs. He received his Ph.D. in 2019 at Rensselaer Polytechnic Institute. His thesis focuses on applications of neural networks to information extraction for low-resource languages. He has a broad interest in applications of natural language processing. He participated in the DARPA Low Resource Languages for Emergent Incidents (LORELEI) project, where he, as a core system developer, built a named entity recognition and linking system for low-resource languages, such as Hausa and Oromo, which achieved first place in the evaluation four times in a row. At DiDi Labs, he led a small group that competed in the Multi-domain Task-oriented Dialog Challenge of DSTC9 and tied for first place among ten teams.
End-to-End Task-oriented Dialog Agent Training and Human-Human Dialog Collection
Task-oriented dialog systems aim to communicate with users through natural language to accomplish a wide range of tasks, such as restaurant booking, weather querying, etc. With the rising trend of artificial intelligence, they have attracted attention from both academia and industry. In the first half of this talk, I will introduce our participation in the DSTC9 Multi-domain Task-oriented Dialog Challenge and present our end-to-end dialog system. Compared to the traditional pipelined dialog architecture, where modules like Natural Language Understanding (NLU), Dialog Manager (DM), and Natural Language Generation (NLG) work separately and are optimized individually, our end-to-end system is a GPT-2 based, fully data-driven method that jointly predicts belief states, database queries, and responses. In the second half of the talk, as we found that existing dialog collection tools have limitations in real-world scenarios, I will introduce a novel human-human dialog platform that reduces all agent activity (API calls, utterances) to a series of clicks, yet maintains enough flexibility to satisfy users. This platform enables real-time agents to do real tasks while storing all of the agents’ actions, which are later used for training chatbots.
The talk will take place on Tuesday April 20th at 17:00 CEST (sorry for the late hour, but Boliang is on the US West Coast), virtually on Zoom https://cesnet.zoom.us/j/95296064691.
Srikanth Madikeri got his Ph.D. in Computer Science and Engineering from the Indian Institute of Technology Madras (India) in 2013. During his Ph.D., he worked on automatic speaker recognition and spoken keyword spotting. He is currently working as a Research Associate at the Idiap Research Institute (Martigny, Switzerland) in the Speech Processing group. His current research interests include automatic speech recognition for low-resource languages, automatic speaker recognition and speaker diarization.
Automatic Speech Recognition for Low-Resource Languages
This talk focuses on automatic speech recognition (ASR) systems for low-resource languages with applications to information retrieval.
A common approach to improving ASR system performance for low-resource languages is to train multilingual acoustic models by pooling resources from multiple languages. In this talk, we present the challenges and benefits of different multilingual modeling approaches with Lattice-Free Maximum Mutual Information (LF-MMI), the state-of-the-art technique for hybrid ASR systems. We also present an incremental semi-supervised learning approach applied to multi-genre speech recognition, a common task in the MATERIAL program. The simple approach helps avoid fast saturation of performance improvements when using large amounts of data for semi-supervised learning. Finally, we present Pkwrap, a PyTorch wrapper on Kaldi (among the most popular speech recognition toolkits), that helps combine the benefits of training acoustic models with PyTorch and Kaldi. The toolkit, now available at https://github.com/idiap/pkwrap, is intended to provide the fast prototyping benefits of PyTorch while using necessary functionalities from Kaldi (LF-MMI, parallel training, decoding, etc.).
The talk will take place on Monday March 8th 2021 at 13:00 CET, virtually on Zoom https://cesnet.zoom.us/j/98589068121.
Jan Ullrich is the linguistic director of The Language Conservancy, an organization serving indigenous communities in projects of language documentation and revitalization. His main research interests are in morphosyntactic analyses, semantics, corpus linguistics, lexicography, and second language acquisition.
He holds a Ph.D. in linguistics from Heinrich-Heine-Universität in Düsseldorf. He has taught at Indiana University, University of North Dakota, Oglala Lakota College, and Sitting Bull College and has given lectures at a number of institutions in Europe and North America.
Ullrich has been committed to and worked in fieldwork documentation and analysis of endangered languages since 1992, primarily focusing on the Dakotan branch of the Siouan language family (e.g. Lakhota, Dakhota, Assiniboine, Stoney). His research represents highly innovative, and in parts groundbreaking, analysis of predication and modification in Lakhota. He is the author and co-author of a number of highly acclaimed publications, such as the New Lakota Dictionary and the Lakota Grammar Handbook.
Research on head-marking languages: its contribution to linguistic theory and implications for NLP and NLU
Some of the most widely used linguistic theories, and especially those which have been more or less unsuccessfully applied in computer parsing and NLP, are affected by three main problems: (a) they are largely based on the study of dependent-marking syntax, which means they ignore half of the world’s languages, (b) they are syntacto-centric and mostly disregard semantics, and (c) they are not monostratal, but instead propose deep structures which cannot readily be accessed by statistically driven models and parsing algorithms.
This presentation will introduce a number of broadly relevant theoretical concepts developed from the study of head-marking languages, such as Lakhóta (Siouan), and some of their implications for NLP and NLU. It will offer a brief introduction to Role and Reference Grammar, a theory which connects structure and function by implementing a two-way linking algorithm between constituency-based structural analysis and semantics.
Jan Chorowski is an Associate Professor at the Faculty of Mathematics and Computer Science at the University of Wrocław and Head of AI at NavAlgo. He received his M.Sc. degree in electrical engineering from the Wrocław University of Technology, Poland, and his PhD in electrical engineering from the University of Louisville, Kentucky in 2012. He has worked with several research teams, including Google Brain, Microsoft Research and Yoshua Bengio’s lab at the University of Montreal. He led a research topic during the JSALT 2019 workshop. His research interests are applications of neural networks to problems which are intuitive and easy for humans and difficult for machines, such as speech and natural language processing.
Representation learning for speech and handwriting
Learning representations of data in an unsupervised way is still an open problem of machine learning. We consider representations of speech and handwriting learned using autoencoders equipped with autoregressive decoders such as WaveNets or PixelCNNs. In those autoencoders, the encoder only needs to provide the little information needed to supplement all that can be inferred by the autoregressive decoder. This allows learning a representation able to capture high-level semantic content from the signal, e.g. phoneme or character identities, while being invariant to confounding low-level details in the signal such as the underlying pitch contour or background noise. I will show how the design choices of the autoencoder, such as the kind of bottleneck and its hyperparameters, impact the induced latent representation. I will also show applications to unsupervised acoustic unit discovery on the ZeroSpeech task. Finally, I’ll show how knowledge about the average unit duration can be enforced during training, as well as during inference on new data.
His talk takes place on Friday, January 10, 2020 at 13:00 in room A112.
Ilya Oparin leads the Language Modeling team that contributes to improving Siri at Apple. He did his Ph.D. on language modeling of inflectional languages at the University of West Bohemia in collaboration with the Speech@FIT group at Brno University of Technology. Before joining Apple in 2014, Ilya spent three years as a postdoc in the Spoken Language Processing group at LIMSI. Ilya’s research interests cover all topics related to language modeling for automatic speech recognition and, more broadly, for natural language processing.
Connecting and Comparing Language Model Interpolation Techniques
In this work, we uncover a theoretical connection between two language model interpolation techniques, count merging and Bayesian interpolation. We compare these techniques as well as linear interpolation in three scenarios with abundant training data per component model. Consistent with prior work, we show that both count merging and Bayesian interpolation outperform linear interpolation. We include the first (to our knowledge) published comparison of count merging and Bayesian interpolation, showing that the two techniques perform similarly. Finally, we argue that other considerations will make Bayesian interpolation the preferred approach in most circumstances.
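As a rough illustration of two of the techniques named above, the following minimal sketch contrasts linear interpolation with count merging on unsmoothed bigram counts. It uses the textbook definitions, not the formulation from the talk; the toy corpora (`news`, `chat`), the weights, and all function names are hypothetical, and Bayesian interpolation is left out.

```python
from collections import Counter


def history_count(bigrams, history):
    """Total count of bigrams that start with the given history word."""
    return sum(c for (h, _), c in bigrams.items() if h == history)


def cond_prob(bigrams, history, word):
    """Unsmoothed maximum-likelihood estimate of p(word | history)."""
    total = history_count(bigrams, history)
    return bigrams[(history, word)] / total if total else 0.0


def linear_interpolation(models, lambdas, history, word):
    """p(w|h) = sum_i lambda_i * p_i(w|h), with history-independent weights."""
    return sum(lam * cond_prob(m, history, word)
               for lam, m in zip(lambdas, models))


def count_merging(models, betas, history, word):
    """p(w|h) = sum_i beta_i * c_i(h, w) / sum_i beta_i * c_i(h)."""
    num = sum(b * m[(history, word)] for b, m in zip(betas, models))
    den = sum(b * history_count(m, history) for b, m in zip(betas, models))
    return num / den if den else 0.0


# Hypothetical toy bigram counts from two component corpora.
news = Counter({("new", "york"): 8, ("new", "car"): 2})
chat = Counter({("new", "york"): 1, ("new", "phone"): 1})
models = [news, chat]

p_linear = linear_interpolation(models, [0.5, 0.5], "new", "york")
p_merged = count_merging(models, [1.0, 1.0], "new", "york")
print(f"linear interpolation: {p_linear:.2f}")
print(f"count merging:        {p_merged:.2f}")
```

In this toy setup, count merging gives more weight to the component that has seen the history "new" more often, whereas linear interpolation applies the same weights to every history.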
His talk takes place on Thursday, December 19, 2019 at 13:00 in “little theater” R211 (next to Kachnicka student club in “Stary Pivovar”).
Barbara Schuppler (Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria) pursued her PhD research at Radboud Universiteit Nijmegen (The Netherlands) and at NTNU Trondheim (Norway) within the Marie Curie Research Training Network “Sound to Sense”. The central topic of her thesis was the analysis of conditions for variation in large conversational speech corpora using ASR technology. Currently, she is working on an FWF-funded Elise-Richter Grant entitled “Cross-layer prosodic models for conversational speech,” and in October 2019 she starts her follow-up project “Cross-layer language models for conversational speech.” Her research continues to be interdisciplinary; it includes the development of automatic tools for the study of prosodic variation, the study of reduction and phonetic detail in conversational speech and the integration of linguistic knowledge into ASR technology.
Automatic speech recognition for conversational speech, or: What we can learn from human talk in interaction
In the last decade, conversational speech has received a lot of attention among speech scientists. On the one hand, accurate automatic speech recognition (ASR) systems are essential for conversational dialogue systems, as these become more interactional and social rather than solely transactional. On the other hand, linguists study natural conversations, as they reveal insights beyond those of controlled experiments with respect to how speech processing works. Investigating conversational speech, however, requires not only applying existing methods to new data, but also developing new categories and new modeling techniques and including new knowledge sources. Whereas traditional models are trained on either text or acoustic information, I propose language models that incorporate information on the phonetic variation of the words (i.e., pronunciation variation and prosody) and relate this information to the semantic context of the conversation and to the communicative functions in the conversation. This approach to language modeling is in line with the theoretical model proposed by Hawkins and Smith (2001), where the perceptual system accesses meaning from speech by using the most salient sensory information from any combination of levels/layers of formal linguistic analysis. The overall aim of my research is to create cross-layer models for conversational speech. In this talk, I will illustrate general challenges for ASR with conversational speech, I will present results from my recent and ongoing projects on pronunciation and prosody modeling, and I will discuss directions for future research.
Her talk takes place on Thursday, October 31, 2019 at 13:00 in “little theater” R211 (next to Kachnicka student club in “Stary Pivovar”).
Pratibha Moogi holds a PhD from OGI, School of Engineering, OHSU, Portland and a Master’s degree from IIT Kanpur. She has worked at SRI International and in many R&D groups, including Texas Instruments, Nokia, and Samsung. Currently she is serving as a Director in the Data Science Group (DSG) at a leading B2B customer operation & journey analytics company, 7.ai. She is also actively involved in mentoring India-wide training initiatives and start-ups working in the domain of ML and AI to strengthen the local Indian ecosystem. She has 16+ years of industry experience working on a diverse set of multimedia processing and ML-based technologies, namely speech and audio recognition, fingerprint and iris biometrics, and computer-vision-based solutions and use-case scenario development. Her current interests are emerging fields of applying machine learning to interdisciplinary, cross-domain areas, e.g. predictive analytics based on multichannel data sources.
India Centric R&D efforts in artificial intelligence
In India, a country of ~1.3 billion people, ~300 million smartphone users and ~600 million internet users are getting to use and feel AI- and ML-flavored solutions every single day, more than ever: an intelligent camera that takes a picture when you give that perfect smile, beautifies your face, and tags your pictures based on the content or subject you tried to capture in some of your perfect shots; hiding your gallery photos from intruders using your fingerprint, iris, or face biometrics; or fetching the details of the very product that you just spotted in a mall, or that your friend has right now, empowered by content (image) based information (product) search algorithms. Speech & Language Technologies are redefining the voice interface for ordinary Indian users, who speak 28 or more local languages. Voice analytics solutions are empowering BPO (customer care) centers, whether by routing millions of calls using automatically detected customer intents, segregating calls by positive or negative customer sentiment, or automatically generating business insights that can drive more profit and revenue and higher customer satisfaction scores, all powered by predictive analytics solutions. This talk covers some of the India-centric R&D efforts that I experienced while working on a variety of products, services, and solutions over the last decade. The talk is organized into the following topics: 1. AI/ML – Digital India – Context (problems & opportunities, GDP landscape, start-up scenario), 2. Products & solutions – recent deployments, 3. Present R&D spectrum – algorithmic research efforts, 4. Overall learnings from the Indian market.
Her talk takes place on Friday, September 13, 2019 at 13:00 in room A112.
Itshak Lapidot emigrated from the USSR to Israel in 1971. He received his B.Sc., M.Sc., and Ph.D. degrees from the Electrical and Computer Engineering Department of Ben-Gurion University, Beer-Sheva, Israel, in 1991, 1994 and 2001, respectively. For one year (2002-2003) he held a postdoctoral position at IDIAP, Switzerland. Dr. Lapidot was previously a lecturer at the Electrical and Electronics Engineering Department of Sami Shamoon College of Engineering (SCE) in Beer-Sheva, Israel, and served for one year (2011-2012) as a Researcher at the Laboratoire Informatique d’Avignon (LIA), University of Avignon, France. Recently, Dr. Lapidot assumed a teaching position with the Electrical Engineering Department at the Afeka Academic College of Engineering and joined the ACLP research team. Dr. Lapidot’s primary research interests are speaker diarization, speaker clustering and speaker verification. He is also interested in clustering and time series analysis from a theoretical point of view.
Speaker Diarization and a bit more
The talk presents four approaches that are applied to speaker and speech technologies but can also be applied to other machine learning (ML) technologies:
1. Speaker diarization answers the question “Who spoke when?”. When there is no knowledge about the speakers and the environments, no prior knowledge can be used and the problem is of the unsupervised type. When no prior information can be used, even to train a GMM, a Total Variability matrix or PLDA, a different approach must be taken, one which uses only the data of the given conversation. One possible solution is Viterbi-based segmentation with hidden Markov models (HMMs). It assumes a high correlation between the log-likelihood and the diarization error rate (DER). This assumption leads to various problems. One possible solution will be shown, applicable not only to probabilistic systems but to a much broader family of solutions named hidden-distortion models (HDMs).
2. In applications such as homeland security, clustering a large number of short segments is very important. The number of segments can range from hundreds to tens of thousands, and the number of speakers from 2 up to tens of speakers (about 60 speakers). Several variants of the mean-shift clustering algorithm will be presented to solve the problem. An automatic way to estimate the clustering validity will be presented as well. This is very important, as clustering can be viewed as preprocessing before other tasks, e.g., speaker verification. Bad clustering will lead to poor verification results. As manual assessment of the clustering is not feasible, an automatic tool is almost a must.
3. Data-homogeneity measure for voice comparison: given two speech utterances for speaker verification, it is important that the utterances are valid for reliable comparison. The utterances may be too short, or may not share enough common information for comparison. In this case a high or low likelihood ratio is meaningless. The test of the data quality should be independent of the verification system. Such an entropy-based measure will be presented and its relation to verification performance will be shown.
4. Database assessment: when sequential data such as speech are divided into training, development and evaluation datasets, it is very difficult to know whether the sets are statistically meaningful for learning (even a fair coin can land on tails 100 times in a row). It is important to verify the statistical validity of the datasets prior to the training, development and evaluation process, and this should be verified independently of the verification system/approach. Such a data assessment will be presented, based on the entropy of the speech waveform.
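As a rough illustration of the mean-shift idea mentioned in item 2, the following Python sketch clusters toy 2-D points with a Gaussian kernel. It is not one of the variants presented in the talk: the function names (`mean_shift`, `group_modes`), the bandwidth, the merge radius and the toy data are all assumptions made for this example. In practice the points would be some vector representation of the short speech segments.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_shift(points, bandwidth, n_iter=100, tol=1e-9):
    """Move a copy of every point uphill toward a local density mode.

    Each step replaces a mode estimate with the Gaussian-weighted mean
    of all data points (hypothetical helper, basic mean-shift only).
    """
    modes = [list(p) for p in points]
    dim = len(points[0])
    for _ in range(n_iter):
        max_move = 0.0
        for i, m in enumerate(modes):
            weights = [math.exp(-sq_dist(m, p) / (2.0 * bandwidth ** 2))
                       for p in points]
            total = sum(weights)
            new = [sum(w * p[d] for w, p in zip(weights, points)) / total
                   for d in range(dim)]
            max_move = max(max_move, math.sqrt(sq_dist(m, new)))
            modes[i] = new
        if max_move < tol:  # stop once no estimate moves noticeably
            break
    return modes


def group_modes(modes, merge_radius):
    """Assign the same label to points whose modes (nearly) coincide."""
    labels, centers = [], []
    for m in modes:
        for k, c in enumerate(centers):
            if math.sqrt(sq_dist(m, c)) < merge_radius:
                labels.append(k)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels, centers


# Toy data (assumed): two well-separated groups of three points each.
toy_points = [(0.0, 0.0), (0.5, 0.2), (-0.3, 0.4),
              (10.0, 10.0), (10.4, 9.7), (9.8, 10.3)]
labels, centers = group_modes(mean_shift(toy_points, bandwidth=1.0),
                              merge_radius=0.5)
print(labels, len(centers))
```

The number of clusters is not given in advance; it falls out of the number of distinct modes, which is one reason mean-shift suits problems where the number of speakers is unknown.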
His talk takes place on Tuesday, January 15, 2019 at 13:00 in room A113.
Misha Pavel holds a joint faculty appointment in the College of Computer & Information Science and Bouvé College of Health Sciences. His background comprises electrical engineering, computer science and experimental psychology, and his research is focused on multiscale computational modeling of behaviors and their control, with applications ranging from elder care to augmentation of human performance. Professor Pavel is using these model-based approaches to develop algorithms transforming unobtrusive monitoring from smart homes and mobile devices to useful and actionable knowledge for diagnosis and intervention. Under the auspices of the Northeastern-based Consortium on Technology for Proactive Care, Professor Pavel and his colleagues are targeting technological innovations to support the development of economically feasible, proactive, distributed, and individual-centered healthcare. In addition, Professor Pavel is investigating approaches to inferring and augmenting human intelligence using computer games, EEG and transcranial electrical stimulation. Previously, Professor Pavel was the director of the Smart and Connected Health Program at the National Science Foundation, a program co-sponsored by the National Institutes of Health. Earlier, he served as the chair of the Department of Biomedical Engineering at Oregon Health & Science University, a Technology Leader at AT&T Laboratories, a member of the technical staff at Bell Laboratories, and faculty member at Stanford University and New York University. He is a Senior Life Member of IEEE.
Digital Phenotyping Using Computational Models of Neuropsychological Processes Underlying Behavioral States and their Dynamics
Human behaviors are both key determinants of health and effective indicators of individuals’ health and mental states. Recent advances in sensing, communication technology and computational modeling are creating unprecedented opportunities to monitor individuals in the wild, that is, in their daily lives. Continuous monitoring thereby enables Digital Phenotyping, i.e., the characterization of health states and the inference of subtle changes in them, and facilitates theoretical insights into human neuropsychology and neurophysiology. Moreover, temporally dense measurements may provide opportunities for optimal just-in-time interventions helping individuals to improve their health behaviors. Harvesting the potential benefits of digital phenotyping is, however, limited by the variability of behaviors as well as contextual and environmental effects that may significantly distort measured data. To mitigate these adverse effects, we have been developing computational models of a variety of physiological, neuropsychological and behavioral phenomena. In this talk, I will briefly discuss a continuum of models ranging from completely data-driven to principle-based, causal and mechanistic. I will then describe a few examples of approaches in several domains including cognition, sensory-motor behaviors and affective states. I will also describe a framework that can use such approaches as components of future proactive and distributed care, tailored to individuals.
His talk takes place on Monday, December 3, 2018 at 13:00 in room A113.
Jiří Schimmel became a doctoral student in the Department of Telecommunications of FEEC BUT in 1999. In 2006 he defended his doctoral thesis on the topic “Audio Effect Synthesis Using Non-Linear Signal Processing” and in 2016 his habilitation thesis on “New Methods of Spatial Audio Coding and Rendering”. His professional scientific activity is focused on research in the area of digital audio signal processing and on the research and development of real-time signal processing systems and multi-channel sound systems. He also cooperates with domestic and foreign companies (C-Mexx, DFM, Audified).
Spatial Audio Coding Using Ambisonic
Ambisonic is a mathematically based acoustic signal processing technology that attempts to capture and reproduce information from a complete three-dimensional sound field, including the exact localization of each sound source and the environmental characteristics of the field. Basically, this is a simplified solution of the wave equation for the progressive convergent spherical wave using spherical harmonic decomposition of the wave field. Theory and technologies related to ambisonic were already developed in the 1970s, but its real-time use has only been enabled by modern computing technologies. The outputs of the coding process are so-called ambisonic components, whose number determines the order of the ambisonic as well as the accuracy of the encoding and the subsequent reconstruction of the sound field. There are two ways to obtain the ambisonic components: encoding a sound object and capturing the sound field using a 3D microphone. The encoding process is based on finding weighting factors of the ambisonic components according to the position of an audio object. For 3D sound field capture, a set of microphones is used that forms a virtual 3D microphone whose components are identical to the ambisonic components. The decoding process is based on reconstruction of the sound field using several sound sources (loudspeakers), which requires further simplifications. Although the sound field is mathematically fully described in ambisonic, there are still many problems that need to be addressed in its practical use.
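The weighting factors used when encoding a sound object can be illustrated for the first-order case. The Python sketch below is not taken from the talk. It assumes the traditional B-format convention (W scaled by 1/√2, azimuth measured counterclockwise from the front, angles in radians); other normalizations exist, and the function names are hypothetical. It also uses the standard count of (N+1)² components for a full 3D representation of order N.

```python
import math


def num_components(order):
    """Number of ambisonic components for a full 3D field of a given order."""
    return (order + 1) ** 2


def encode_first_order(sample, azimuth, elevation):
    """Encode one mono sample into first-order components (W, X, Y, Z).

    Assumed convention: traditional B-format, angles in radians,
    azimuth counterclockwise from the front, elevation upward.
    """
    w = sample / math.sqrt(2.0)                           # omnidirectional
    x = sample * math.cos(azimuth) * math.cos(elevation)  # front-back
    y = sample * math.sin(azimuth) * math.cos(elevation)  # left-right
    z = sample * math.sin(elevation)                      # up-down
    return w, x, y, z


# A unit-amplitude source straight ahead excites only W and X.
front = encode_first_order(1.0, 0.0, 0.0)
# The same source directly to the left excites only W and Y.
left = encode_first_order(1.0, math.pi / 2.0, 0.0)
print(num_components(1), front, left)
```

Higher orders add more components and therefore finer spatial resolution, at the cost of more channels to store and more loudspeakers needed for decoding.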
His talk takes place on Tuesday, October 2, 2018 at 13:00 in room A113.