Shuai Wang: Speaker Representation Learning – Theories, Applications and Practice

Shuai Wang obtained a Ph.D. degree at Shanghai Jiao Tong University in September 2020, under the supervision of Kai Yu and Yanmin Qian. During his Ph.D., his research interests included deep learning-based approaches for speaker recognition, speaker diarization, and voice activity detection. After graduation, he joined Tencent Games as a senior researcher, where he (informally) led a speech group and extended his research interest to speech synthesis, voice conversion, music generation, and audio retrieval. Currently, he is with the SpeechLab at Shenzhen Research Institute of Big Data, Chinese University of Hong Kong (Shenzhen), led by Haizhou Li.

Speaker Representation Learning: Theories, Applications and Practice

Speaker individuality information is one of the most critical elements of speech signals. Modeled thoroughly and accurately, this information can be applied in various intelligent speech applications, such as speaker recognition, speaker diarization, speech synthesis, and target speaker extraction. In this talk, I would like to approach the speaker characterization problem from a broader perspective, extending beyond just speaker recognition. First, I will present the developmental history and paradigm shifts in speaker modeling within the framework of deep representation learning. Next, I will discuss recent advances in pre-trained model-based methods and self-supervised training techniques. I will also cover topics such as robustness, efficiency, and interpretability, as well as the various applications of speaker modeling technologies. Finally, I will introduce two open-source toolkits I developed: wespeaker and wesep. Wespeaker is currently one of the most popular toolkits for speaker embedding learning, while wesep extends its capabilities to target speaker extraction, seamlessly integrating with wespeaker. You can find related works and recommended references in my overview paper titled “Overview of Speaker Modeling and Its Applications: From the Lens of Deep Speaker Representation Learning”.

His talk takes place on Tuesday, September 10, 2024 at 13:00 in A112. The talk will be streamed live at https://youtube.com/live/FMY5_smgrYY.

Sriram Ganapathy: Factorized self-supervision models for speech representation learning

Sriram Ganapathy is an Associate Professor in Electrical Engineering at the Indian Institute of Science, Bangalore, where he heads the activities of the Learning and Extraction of Acoustic Patterns (LEAP) lab. He is also a visiting research scientist at Google Research India, Bangalore. His research interests include signal processing, machine learning methodologies for speech and speaker recognition, and auditory neuroscience. Prior to joining the Indian Institute of Science, he was a research staff member at the IBM Watson Research Center, Yorktown Heights. He received his Doctor of Philosophy from the Center for Language and Speech Processing, Johns Hopkins University. He obtained his Bachelor of Technology from College of Engineering, Trivandrum, India and Master of Engineering from the Indian Institute of Science, Bangalore. He has also worked as a Research Assistant at the Idiap Research Institute, Switzerland. Over the past 15 years, he has published more than 120 peer-reviewed journal/conference publications in the areas of deep learning and speech/audio processing. Dr. Ganapathy currently serves as the IEEE Sigport Chief Editor and as a member of the IEEE Education Board, and functions as subject editor for the Elsevier Speech Communication Journal. He is also a recipient of several awards, including the Department of Science and Technology (DST) Early Career Award in India, the Department of Atomic Energy (DAE), India Young Scientist Award and the Verisk AI Faculty Award. He is a senior member of the IEEE Signal Processing Society and a member of the International Speech Communication Association (ISCA).

Factorized self-supervision models for speech representation learning

In recent years, self-supervised learning (SSL) of speech has enabled substantial advances in downstream applications by generating succinct representations of the speech signal. The paradigm in most of these works involves frame-level (20-30ms) contrastive or predictive modeling of speech representations. However, the speech signal entails information sources at multiple levels – semantic information encoded at the frame level, non-semantic information at the utterance level, and channel/ambient information encoded at the recording-session level. In this talk, I will describe efforts undertaken by our group on learning representations at multiple scales in a factorized manner.

In the first part, I will elaborate on an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework. The input “time-frequency” representations from the convolutional neural network (CNN) module are processed with long short-term memory (LSTM) layers, which have smaller computational requirements than other models. We explore techniques that improve the speaker invariance of the learned representations and illustrate the effectiveness of the proposed approach in two settings: i) completely unsupervised speech applications on the sub-tasks described as part of the ZeroSpeech 2021 challenge and ii) semi-supervised automatic speech recognition (ASR) applications on the TIMIT dataset and on the GramVaani challenge Hindi dataset. In these experiments, we achieve state-of-the-art results for various ZeroSpeech tasks (as of 2023). In the second part of the talk, I will discuss our recent proposal of a framework for Learning Disentangled (Learn2Diss) representations of speech, which consists of a frame-level and an utterance-level encoder module. The two encoders are initially learned independently, where the frame-level model is inspired by existing self-supervision techniques, thereby learning pseudo-phonemic representations, while the utterance-level encoder is inspired by contrastive learning of pooled embeddings, thereby learning pseudo-speaker representations. The joint learning of these two modules consists of disentangling the two encoders using a mutual-information-based criterion. With several downstream evaluation experiments, we show that the proposed Learn2Diss framework achieves state-of-the-art results on a variety of tasks, including those in the SUPERB challenge. Finally, I will highlight a related effort towards zero-shot emotion conversion and conclude the talk with a discussion of future prospects for these work streams.

His talk takes place on Wednesday, June 26, 2024 at 13:00 in E112. The talk will be streamed live at https://youtube.com/live/2IcAJmFH4Ys.

Preslav Nakov: Factuality Challenges in the Era of Large Language Models

Preslav Nakov is a Professor at Mohamed bin Zayed University of Artificial Intelligence. Previously, he was Principal Scientist at the Qatar Computing Research Institute, HBKU, where he led the Tanbih mega-project, developed in collaboration with MIT, which aims to limit the impact of “fake news”, propaganda and media bias by making users aware of what they are reading, thus promoting media literacy and critical thinking. He received his PhD degree in Computer Science from the University of California at Berkeley, supported by a Fulbright grant. He is Chair-Elect of the European Chapter of the Association for Computational Linguistics (EACL), Secretary of ACL SIGSLAV, and Secretary of the Truth and Trust Online board of trustees. Formerly, he was PC chair of ACL 2022, and President of ACL SIGLEX. He is also a member of the editorial board of several journals including Computational Linguistics, TACL, ACM TOIS, IEEE TASL, IEEE TAC, CS&L, NLE, AI Communications, and Frontiers in AI. He authored a Morgan & Claypool book on Semantic Relations between Nominals, two books on computer algorithms, and 250+ research papers. He received a Best Paper Award at ACM WebSci’2022, a Best Long Paper Award at CIKM’2020, a Best Demo Paper Award (Honorable Mention) at ACL’2020, a Best Task Paper Award (Honorable Mention) at SemEval’2020, a Best Poster Award at SocInfo’2019, and the Young Researcher Award at RANLP’2011. He was also the first to receive the Bulgarian President’s John Atanasoff award, named after the inventor of the first automatic electronic digital computer. His research was featured by over 100 news outlets, including Forbes, Boston Globe, Aljazeera, DefenseOne, Business Insider, MIT Technology Review, Science Daily, Popular Science, Fast Company, The Register, WIRED, and Engadget, among others.

Factuality Challenges in the Era of Large Language Models

We will discuss the risks, the challenges, and the opportunities that Large Language Models (LLMs) bring regarding factuality. We will then delve into our recent work on using LLMs to assist fact-checking (e.g., claim normalization, stance detection, question-guided fact-checking, program-guided reasoning, and synthetic data generation for fake news and propaganda identification), on checking and correcting the output of LLMs, on detecting machine-generated text (blackbox and whitebox), and on fighting the ongoing misinformation pollution with LLMs. Finally, we will discuss work on safeguarding LLMs, and the safety mechanisms we incorporated in Jais-chat, the world’s best open Arabic-centric foundation and instruction-tuned LLM.

His talk takes place on Thursday, February 29, 2024 at 16:00 in A112. The talk will be streamed live at https://www.youtube.com/live/niT_shR8jbU.

Michael Buchholz: Connected, Cooperative Automated Mobility Supported by Intelligent Infrastructure

Michael Buchholz earned his diploma degree in electrical engineering and information technology as well as his Ph.D. degree from the Faculty of Electrical Engineering and Information Technology at today’s Karlsruhe Institute of Technology. Since 2009, he has been a research group leader and lecturer at the Institute of Measurement, Control, and Microtechnology, Ulm University, Germany, where in 2022 he finished his “Habilitation” (post-doctoral lecturing qualification) in the field of automation engineering, based on his research on cooperative, connected automated mobility. His further research interests include electric mobility, the modelling and control of mechatronic systems, and system identification.

Connected, Cooperative Automated Mobility Supported by Intelligent Infrastructure

Fully automated driving of vehicles in mixed traffic is a complex task with dynamically changing conditions, e.g., due to weather and other road users. Urban areas are especially challenging, showing a high traffic density and limited field of view (FOV) of the on-board sensors of an automated vehicle (AV). The latter is caused by occlusions, e.g., due to buildings, vegetation, or other traffic participants. To overcome these FOV limitations, the first part of this talk presents a supporting solution: intelligent infrastructure connected with the AVs via mobile communication, realized as a test site in real traffic in Ulm, Germany. In the second part, a solution is proposed to enhance this system by an additional cooperative planner, which proposes manoeuvres to cooperative connected road users that ensure safety for vulnerable road users and can enhance traffic efficiency. Results from a proof of concept in mixed traffic at the test site in Ulm will be shown to demonstrate the possibilities of this approach.

His talk takes place on Friday, January 19, 2024 at 9:30 in G108.

Ondřej Klejch: Deciphering Speech – a Zero-Resource Approach to Cross-Lingual Transfer in ASR

Ondřej Klejch is a senior researcher in the Centre for Speech Technology Research in the School of Informatics at the University of Edinburgh. He obtained his Ph.D. from the University of Edinburgh in 2020 and received his M.Sc. and B.Sc. from Charles University in Prague. He has been working on building automatic speech recognition systems with limited training data and supervision within several large projects funded by EPSRC, H2020, and IARPA. His recent work investigated semi-supervised and unsupervised training methods for automatic speech recognition in low-resource languages.

Deciphering Speech: a Zero-Resource Approach to Cross-Lingual Transfer in ASR

Automatic speech recognition technology has achieved outstanding performance in recent years. This progress has been possible thanks to the advancements in deep learning and the availability of large training datasets. The production models are typically trained on thousands of hours of manually transcribed speech recordings to achieve the best possible accuracy. Unfortunately, due to the expensive and time-consuming manual annotation process, automatic speech recognition is available only for a fraction of all languages and their speakers.

In this talk, I will describe methods we have successfully used to improve the language coverage of automatic speech recognition. I will describe semi-supervised training approaches for building systems with only a few hours of manually transcribed training data and large amounts of crawled audio and text. Subsequently, I will discuss training dynamics of semi-supervised training approaches and why a good language model is necessary for their success. I will then present a novel decipherment approach for training an automatic speech recognition system for a new language without any manually transcribed data. This method can “decipher” speech in a new language using as little as 20 minutes of audio and paves the way for providing automatic speech recognition in many more languages in the future. Finally, I will talk about open challenges when training and evaluating automatic-speech-recognition models for low-resource languages.

His talk takes place on Thursday, December 14, 2023 at 14:00 in A113.

Hynek Hermansky: Learning: It’s not just for machines anymore

Hynek Hermansky has been active in speech research for over 40 years. He is a Life Fellow of IEEE and a Fellow of the International Speech Communication Association, has authored or co-authored more than 350 papers with over 20,000 citations, holds more than 20 patents, and received the IEEE James L. Flanagan Speech and Audio Processing Award and the ISCA Medal for Scientific Achievements. He started his career in 1972 at Brno University of Technology, obtained his D.Eng. degree from the University of Tokyo, and worked for Panasonic Technologies, U S WEST Advanced Technologies, the Oregon Graduate Institute, IDIAP Martigny, the Johns Hopkins University, and Google DeepMind. Currently, he is a Researcher at Speech@FIT BUT, and an Emeritus Professor at the Johns Hopkins University.

Learning: It’s not just for machines anymore

Machine recognition of speech requires training on a large amount of speech training data. Consequently, research in machine recognition of speech consists mainly of getting hands on large amounts of speech training data and combining them, often by trial and error, with an appropriate combination of processing modules. Advances are mostly evaluated by error rates observed in recognition of test data. Such a process may be missing one of the prime goals of scientific endeavor, which is to obtain new knowledge applicable to other applications. We argue that speech data can be used to obtain relevant knowledge about hearing, which is used in decoding messages in speech, and report on some experiments that support this notion.

His talk takes place on Wednesday, November 22, 2023 at 14:00 in E105.

Video recording of the talk is publicly available.

Sébastien Lefèvre: Deep Learning in Computer Vision – Are Numerous Labels the Holy Grail?

Sébastien Lefèvre has been a Full Professor in Computer Science at the University of South Brittany (Vannes Institute of Technology) since September 2010. He founded the OBELIX group at the IRISA laboratory and led the group from 2013 to 2021 (Prof. Nicolas Courty has led the group since March 2021). He is also coordinating the GeoData Science track within the Erasmus Mundus Copernicus Master in Digital Earth. His main research topics are image analysis/processing, pattern recognition and indexing, machine learning, deep learning and data mining with applications in remote sensing for Earth observation.

Deep Learning in Computer Vision: Are Numerous Labels the Holy Grail?

Deep Learning has been successful in a wide range of computer vision tasks, at the cost of high computational resources and large labeled datasets required to train the models. The latter is a strong bottleneck in numerous applications where collecting annotated data is challenging.
In this talk, I will present some of our work attempting to alleviate the need for large annotated datasets. More precisely, the methods we develop rely on semi-supervised, weakly-supervised, and unsupervised settings, domain adaptation, data simulation, and active learning, among other frameworks. Various applications in Earth Observation will be provided to illustrate the relevance of these solutions for a wide range of problems such as semantic segmentation, image classification, or object detection.

His talk takes place on Thursday, June 15, 2023 at 14:00 in G108.

Jiri Mekyska: Acoustic analysis of speech and voice disorders in patients with Parkinson’s disease

Jiri Mekyska is head of the BDALab (Brain Diseases Analysis Laboratory) at the Brno University of Technology, where he leads a multidisciplinary team of researchers (signal processing engineers, data scientists, neuroscientists, psychologists) with a special focus on the development of new digital biomarkers facilitating understanding, diagnosis and monitoring of neurodegenerative (e.g. Parkinson’s disease) and neurodevelopmental (e.g. dysgraphia) disorders.

Acoustic analysis of speech and voice disorders in patients with Parkinson’s disease

Parkinson’s disease (PD) is the second most frequent neurodegenerative disease and is associated with several motor and non-motor features. Up to 90 % of PD patients develop a motor speech disorder called hypokinetic dysarthria (HD). HD manifests in the field of phonation (e.g. increased instability of articulatory organs, microperturbation in pitch and amplitude), articulation (e.g. rigidity of tongue and jaw, slow alternating motion rate), prosody (e.g. monopitch, monoloudness), and respiration (e.g. airflow insufficiency). Acoustic analysis of these specific speech/voice disorders enables neurologists and speech-language therapists to effectively monitor the progress of PD as well as to diagnose it. In this talk, we will present a concept of acoustic HD analysis. Subsequently, we will present some recent findings on the prediction of motor (freezing of gait) and non-motor (cognitive) deficits based on acoustic analysis, and we will discuss applications of acoustic HD analysis in treatment effect monitoring (based on high-frequency repetitive transcranial magnetic stimulation) and in PD diagnosis. Finally, we will present some future directions in terms of integration into Health 4.0 systems.

His talk takes place on Tuesday, May 16, 2023 at 15:00 in E105.

András Lőrincz: Towards human-machine and human-robot interactions “with a little help from my friends”

András Lőrincz, a professor and senior researcher, has been teaching at the Faculty of Informatics at Eötvös University, Budapest since 1998. His research focuses on human-machine interaction and its applications in neurobiological and cognitive modeling, as well as medicine. He founded the Neural Information Processing Group of Eötvös University and directs a multidisciplinary team of mathematicians, programmers, computer scientists and physicists. He has acted as the PI of several successful international projects in collaboration with Panasonic, Honda Future Technology Research, the Information Directorate of the US Air Force, and Robert Bosch, Ltd. Hungary, among others. He took part in several EU Framework Program projects.

He habilitated in laser physics at the University of Szeged in 1998 and in informatics at the Eötvös Loránd University in 2008. He conducted research and taught quantum control, photoacoustics and artificial intelligence at the Hungarian Academy of Sciences, University of Chicago, Brown University, Princeton University, the Illinois Institute of Technology, University of Szeged, and Eötvös Loránd University. He authored about 300 peer-reviewed scientific publications.

He was elected a Fellow of the European Coordinating Committee for Artificial Intelligence (EurAI) in 2006 for his pioneering work in the field of artificial intelligence. He received the Innovative Researcher Prize of the University in 2009 and in 2019.

Partners: Barcelona University (on personality estimation and human-human interaction), Technical University of Delft (on human-human interaction), Rush Medical School, Chicago, on autism diagnosis and PTSD therapy.

Towards human-machine and human-robot interactions “with a little help from my friends”

Our work in the Neural Information Processing Group focuses on human-machine interactions. The first part of the talk will be an introduction to the technologies that we can or should use for effective interactions, such as the detection of environmental context, ongoing activity, including body movement, manipulation, and hidden parameters, i.e. intention, mood and personality state, as well as communication signals: body, head, hand, face and gaze gestures, plus the body parameters that can be measured optically or by intelligent means, i.e., the temperature, blood pressure and stress levels, among others.

In the second part of the talk, I will review (a) what body and environment estimation methods we have, (b) what we can say about human-human interactions, which will also give insights into the requirements of human-machine and human-robot interactions, (c) what applications we have or can target in the areas of autism, “continuous healthcare” and “home and public safety”, and (d) what technologies are missing and where we are looking for partners.

His talk takes place on Tuesday, November 1, 2022 at 13:00 in “little theater” R211 (next to Kachnicka student club in “Stary Pivovar”).

Heikki Kälviäinen: Computer Vision Applications

Heikki Kälviäinen has been a Professor of Computer Science and Engineering since 1999. He is the head of the Computer Vision and Pattern Recognition Laboratory (CVPRL) at the Department of Computational Engineering of Lappeenranta-Lahti University of Technology LUT, Finland. Prof. Kälviäinen’s research interests include computer vision, machine vision, pattern recognition, machine learning, and digital image processing and analysis. Besides LUT, Prof. Kälviäinen has worked as a Visiting Professor at the Faculty of Information Technology of Brno University of Technology, Czech Republic, the Center for Machine Perception (CMP) of Czech Technical University, and the Centre for Vision, Speech, and Signal Processing (CVSSP) of University of Surrey, UK, and as a Professor of Computing at Monash University Malaysia.

Computer Vision Applications

The presentation considers computer vision, especially from the point of view of applications. Digital image processing and analysis with machine learning methods enable efficient solutions for various areas of useful data-centric engineering applications. Challenges with image acquisition, data annotation with expert knowledge, and clustering and classification, including the training of deep learning methods, are discussed. Different applications are given as examples based on novel data: plankton in the Baltic Sea, Saimaa ringed seals in Lake Saimaa, and logs in the sawmill industry. In the first application, the motivation is that distributions of plankton types give much information about the condition of the sea water system, e.g., about climate change. An imaging flow cytometer can produce a lot of plankton images, which should be classified into different plankton types. Manual classification of these images is very laborious, and thus a CNN-based method has been developed to automatically recognize the plankton types in the Baltic Sea. In the second application, the Saimaa ringed seals are automatically identified individually using camera trap images, to help this very small population survive in nature. The CNN-based re-identification methods are based on the pelage patterns of the seals. The third application is related to the sawmill industry. The digitalization of the sawmill industry is important for optimizing material flows and quality. The research is focused on seeing inside the log to be able to predict which kinds of sawn boards are produced after cutting the log.

His talk takes place on Wednesday, May 11, 2022 at 13:00 in room A112.

Augustin Žídek: Protein Structure Prediction with AlphaFold

Augustin Žídek works as a Research Engineer at DeepMind and has been a member of the protein folding team since 2017. He studied Computer Science at the University of Cambridge. He enjoys working at the boundary of research and engineering, hiking, playing musical instruments and fixing things.

Protein Structure Prediction with AlphaFold

In this talk, we will discuss what proteins are, what the protein folding problem is and why it is an important scientific challenge. We will then talk about AlphaFold, a machine learning model developed by DeepMind that is able to predict protein 3D structure with high accuracy, and about its architecture and applications.

His talk takes place on Tuesday, March 1, 2022 at 13:00 in room A112. The talk will be streamed live at https://youtu.be/udyjZXtUuDw.

Hema A. Murthy: Signal Processing Guided Machine Learning

Hema A. Murthy is currently a Professor at the Department of Computer Science and Engineering. She has been with the department for the last 35 years. She currently leads an 18-institute consortium that focuses on speech as part of the national language translation mission, an ambitious project where the objective is to produce speech-to-speech translation in Indian languages and Indian English.

Signal Processing Guided Machine Learning

In this talk we will focus on using signal processing algorithms in tandem with machine learning algorithms for various tasks in speech, music and brain signals. The primary objective is to understand events of interest from the perspective of the chosen domain. Appropriate signal processing is employed to detect events. Machine learning algorithms are then made to focus on learning the statistical characteristics of these events. The primary advantage of this approach is that it significantly reduces both computation and data costs. Examples from speech synthesis, Indian art music, neuronal signals and EEG signals will be considered.

Her talk takes place on Tuesday, February 8, 2022 at 13:00 CET, virtually on Zoom at https://cesnet.zoom.us/j/91741432360.

Slides of the talk are publicly available.

Bernhard Egger: Inverse Graphics and Perception with Generative Face Models

Prof. Dr. Bernhard Egger studies how humans and machines can perceive faces and shapes in general. In particular, he focuses on statistical shape models and 3D Morphable Models. He is a junior professor at the chair of visual computing at Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). Before joining FAU, he was a postdoc in Josh Tenenbaum’s Computational Cognitive Science Lab at the Department of Brain and Cognitive Sciences at MIT and the Center for Brains, Minds and Machines (CBMM), and in Polina Golland’s group at the MIT Computer Science & Artificial Intelligence Lab. He did his PhD on facial image annotation and interpretation in unconstrained images in the Graphics and Vision Research Group at the University of Basel. Before his doctorate he obtained his M.Sc. and B.Sc. in Computer Science at the University of Basel and an upper secondary school teaching Diploma at the University of Applied Sciences Northwestern Switzerland.

Inverse Graphics and Perception with Generative Face Models

Human object perception is remarkably robust: Even when confronted with blurred or sheared photographs, or pictures taken under extreme illumination conditions, we can often recognize what we’re seeing and even recover rich three-dimensional structure. This robustness is especially notable when perceiving human faces. How can humans generalize so well to highly distorted images, transformed far beyond the range of natural face images we are normally exposed to? In this talk I will present an Analysis-by-Synthesis approach based on 3D Morphable Models that can generalize well across various distortions. We find that our top-down inverse rendering model better matches human percepts than either an invariance-based account implemented in a deep neural network, or a neural network trained to perform approximate inverse rendering in a feedforward circuit.

His talk takes place on Wednesday, January 19, 2022 at 15:00 in room A112. The talk will be streamed live at https://www.youtube.com/watch?v=l9Aqz-86pUg.

Ondřej Dušek: Better Supervision for End-to-end Neural Dialogue Systems

Ondřej Dušek is an assistant professor at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University. His research is in the areas of dialogue systems and natural language generation; he specifically focuses on neural-network-based approaches to these problems and their evaluation. He is also involved in the THEaiTRE project on automatic theatre play generation. Ondřej got his PhD in 2017 at Charles University. Between 2016 and 2018, he worked at the Interaction Lab at Heriot-Watt University in Edinburgh, one of the leading groups in dialogue systems and natural-language interaction with computers and robots. There he co-organized the E2E NLG text generation challenge and co-led a team of PhD students in the Amazon Alexa Prize dialogue system competition, which came third in two consecutive years.

Better Supervision for End-to-end Neural Dialogue Systems

While end-to-end neural models have been the research trend in task-oriented dialogue systems in the past years, they still suffer from significant problems: The neural models often produce replies inconsistent with past dialogue context or database results, their replies may be dull and formulaic, and they require large amounts of annotated data to train. In this talk, I will present two of our recent experiments that aim at solving these problems.

First, our end-to-end neural system AuGPT based on the GPT-2 pretrained language model aims at consistency and variability in dialogue responses by using massive data augmentation and filtering as well as specific auxiliary training objectives which check for dialogue consistency. It reached favorable results in terms of both automatic metrics and human judgments (in the DSTC9 competition).

Second, we designed a system that is able to discover relevant dialogue slots (domain attributes) without any human annotation. It uses weak supervision from generic linguistic annotation models (semantic parser, named entities), which is further filtered and clustered. We train a neural slot tagger on the discovered slots, which then reaches state-of-the-art results in dialogue slot tagging without labeled training data. We further show that the discovered slots are helpful for training an end-to-end neural dialogue system.

His talk takes place on Wednesday, December 1, 2021 at 15:00 in “little theater” R211 (next to Kachnicka student club in “Stary Pivovar”). The talk will be streamed live and recorded at https://www.youtube.com/watch?v=JzBy-QuLxiE.

Tanel Alumäe: Weakly supervised training for speaker and language recognition

Tanel Alumäe is the head of the Laboratory of Language Technology at Tallinn University of Technology (TalTech). He received his PhD degree from the same university in 2006. Since then, he has worked in several research teams, including LIMSI/CNRS, Aalto University and Raytheon BBN Technologies. His recent research has focused on practical approaches to low-resource speech and language processing.

Weakly supervised training for speaker and language recognition

Speaker identification models are usually trained on data where the speech segments corresponding to the target speakers are hand-annotated. However, the process of hand-labelling speech data is expensive and doesn’t scale well, especially if a large set of speakers needs to be covered. Similarly, spoken language identification models require large amounts of training samples from each language that we want to cover.
This talk will show how metadata accompanying speech data found on the internet can be treated as weak and/or noisy labels for training speaker and language identification models. Speaker identification models can be trained using only the information about which speakers appear in each of the recordings in the training data, without any segment-level annotation. For spoken language identification, we can often treat the detected language of the description of the multimedia clip as a noisy label. The latter method was used to compile VoxLingua107, a large-scale speech dataset for training spoken language identification models. The dataset consists of short speech segments automatically extracted from YouTube videos and labeled according to the language of the video title and description, with some post-processing steps to filter out false positives. It contains data for 107 languages, with 62 hours per language on average. A model trained on this dataset can be used as-is, or finetuned for a particular language identification task using only a small amount of manually verified data.

His talk takes place on Tuesday, November 9, 2021 at 13:00 in “little theater” R211 (next to Kachnicka student club in “Stary Pivovar”). The talk will be streamed live and recorded at
https://youtu.be/fpsC0jzZSvs – thanks FIT student union for support!

Boliang Zhang: End-to-End Task-oriented Dialog Agent Training and Human-Human Dialog Collection

Boliang Zhang is a research scientist at DiDi Labs, Los Angeles, CA. Currently, he works on building intelligent chatbots to help humans fulfill tasks. Before that, he interned at Microsoft, Facebook, and AT&T Labs. He received his Ph.D. in 2019 at Rensselaer Polytechnic Institute. His thesis focused on applications of neural networks to information extraction for low-resource languages. He has a broad interest in applications of natural language processing. He participated in the DARPA Low Resource Languages for Emergent Incidents (LORELEI) project, where he, as a core system developer, built a named entity recognition and linking system for low-resource languages, such as Hausa and Oromo, and achieved first place in the evaluation four times in a row. At DiDi Labs, he led a small group that competed in the Multi-domain Task-oriented Dialog Challenge of DSTC9 and tied for first place among ten teams.

End-to-End Task-oriented Dialog Agent Training and Human-Human Dialog Collection

Task-oriented dialog systems aim to communicate with users through natural language to accomplish a wide range of tasks, such as restaurant booking and weather querying. With the rising trend of artificial intelligence, they have attracted attention from both academia and industry. In the first half of this talk, I will introduce our participation in the DSTC9 Multi-domain Task-oriented Dialog Challenge and present our end-to-end dialog system. Compared to the traditional pipelined dialog architecture, where modules like Natural Language Understanding (NLU), Dialog Manager (DM), and Natural Language Generation (NLG) work separately and are optimized individually, our end-to-end system is a GPT-2 based, fully data-driven method that jointly predicts belief states, database queries, and responses. In the second half of the talk, as we found that existing dialog collection tools have limitations in real-world scenarios, I will introduce a novel human-human dialog platform that reduces all agent activity (API calls, utterances) to a series of clicks, yet maintains enough flexibility to satisfy users. This platform enables real-time agents to do real tasks while storing all agent actions for later use in training chatbots.

The talk will take place on Tuesday April 20th at 17:00 CEST (sorry for late hour, but Boliang is on the US West Coast), virtually on zoom https://cesnet.zoom.us/j/95296064691.

Video recording of the talk is publicly available.

Slides of the talk are publicly available.

Srikanth Madikeri: Automatic Speech Recognition for Low-Resource languages

Srikanth Madikeri got his Ph.D. in Computer Science and Engineering from Indian Institute of Technology Madras (India) in 2013. During his Ph.D., he worked on automatic speaker recognition and spoken keyword spotting. He is currently working as a Research Associate at Idiap Research Institute (Martigny, Switzerland) in the Speech Processing group. His current research interests include automatic speech recognition for low-resource languages, automatic speaker recognition, and speaker diarization.

Automatic Speech Recognition for Low-Resource languages

This talk focuses on automatic speech recognition (ASR) systems for low-resource languages with applications to information retrieval.
A common approach to improving ASR performance for low-resource languages is to train multilingual acoustic models by pooling resources from multiple languages. In this talk, we present the challenges and benefits of different multilingual modeling approaches with Lattice-Free Maximum Mutual Information (LF-MMI), the state-of-the-art technique for hybrid ASR systems. We also present an incremental semi-supervised learning approach applied to multi-genre speech recognition, a common task in the MATERIAL program. The simple approach helps avoid fast saturation of performance improvements when using large amounts of data for semi-supervised learning. Finally, we present Pkwrap, a PyTorch wrapper on Kaldi (among the most popular speech recognition toolkits), that helps combine the benefits of training acoustic models with PyTorch and Kaldi. The toolkit, now available at https://github.com/idiap/pkwrap, is intended to provide the fast prototyping benefits of PyTorch while using necessary functionalities from Kaldi (LF-MMI, parallel training, decoding, etc.).

The talk will take place on Monday March 8th 2021 at 13:00 CET, virtually on zoom https://cesnet.zoom.us/j/98589068121.

Jan Ullrich: Research on head-marking languages – its contribution to linguistic theory and implications for NLP and NLU

Jan Ullrich is the linguistic director of The Language Conservancy, an organization serving indigenous communities in projects of language documentation and revitalization. His main research interests are in morphosyntactic analyses, semantics, corpus linguistics, lexicography, and second language acquisition.
He holds a Ph.D. in linguistics from Heinrich-Heine-Universität in Düsseldorf. He has taught at Indiana University, University of North Dakota, Oglala Lakota College, and Sitting Bull College and has given lectures at a number of institutions in Europe and North America.
Ullrich has been committed to and worked in fieldwork documentation and analysis of endangered languages since 1992, primarily focusing on the Dakotan branch of the Siouan language family (e.g. Lakhota, Dakhota, Assiniboine, Stoney). His research represents highly innovative, and in parts groundbreaking, analysis of predication and modification in Lakhota. He is the author and co-author of a number of highly acclaimed publications, such as the New Lakota Dictionary and the Lakota Grammar Handbook.

Research on head-marking languages: its contribution to linguistic theory and implications for NLP and NLU

Some of the most widely used linguistic theories, and especially those which have been more or less unsuccessfully applied in computer parsing and NLP, are affected by three main problems: (a) they are largely based on the study of dependent-marking syntax, which means they ignore half of the world’s languages, (b) they are syntacto-centric and mostly disregard semantics, and (c) they are not monostratal, but instead propose deep structures which cannot readily be accessed by statistically driven models and parsing algorithms.
This presentation will introduce a number of broadly relevant theoretical concepts developed from the study of head-marking languages, such as Lakhóta (Siouan), and some of their implications for NLP and NLU. It will offer a brief introduction to Role and Reference Grammar, a theory which connects structure and function by implementing a two-way linking algorithm between constituency-based structural analysis and semantics.

His talk will be held jointly as a VGS-IT seminar and a lecture of the MTIa master course and takes place on Thursday, February 27th, 2020 at 12:00 in room E112.

Jan Chorowski: Representation learning for speech and handwriting

Jan Chorowski is an Associate Professor at the Faculty of Mathematics and Computer Science at the University of Wrocław and Head of AI at NavAlgo. He received his M.Sc. degree in electrical engineering from the Wrocław University of Technology, Poland and EE PhD from the University of Louisville, Kentucky in 2012. He has worked with several research teams, including Google Brain, Microsoft Research and Yoshua Bengio’s lab at the University of Montreal. He led a research topic during the JSALT 2019 workshop. His research interests are applications of neural networks to problems which are intuitive and easy for humans and difficult for machines, such as speech and natural language processing.

Representation learning for speech and handwriting

Learning representations of data in an unsupervised way is still an open problem of machine learning. We consider representations of speech and handwriting learned using autoencoders equipped with autoregressive decoders such as WaveNets or PixelCNNs. In those autoencoders, the encoder only needs to provide the little information needed to supplement all that can be inferred by the autoregressive decoder. This allows learning a representation able to capture high-level semantic content from the signal, e.g. phoneme or character identities, while being invariant to confounding low-level details in the signal such as the underlying pitch contour or background noise. I will show how the design choices of the autoencoder, such as the kind of bottleneck and its hyperparameters, impact the induced latent representation. I will also show applications to unsupervised acoustic unit discovery on the ZeroSpeech task. Finally, I’ll show how knowledge about the average unit duration can be enforced during training, as well as during inference on new data.

His talk takes place on Friday, January 10, 2020 at 13:00 in room A112.

Ilya Oparin (Apple, USA): Connecting and Comparing Language Model Interpolation Techniques

Ilya Oparin leads the Language Modeling team that contributes to improving Siri at Apple. He did his Ph.D. on language modeling of inflectional languages at the University of West Bohemia in collaboration with the Speech@FIT group at Brno University of Technology. Before joining Apple in 2014, Ilya spent 3 years as a post-doc in the Spoken Language Processing group at LIMSI. Ilya’s research interests cover any topics related to language modeling for automatic speech recognition and more broadly for natural language processing.

Connecting and Comparing Language Model Interpolation Techniques

In this work, we uncover a theoretical connection between two language model interpolation techniques, count merging and Bayesian interpolation. We compare these techniques as well as linear interpolation in three scenarios with abundant training data per component model. Consistent with prior work, we show that both count merging and Bayesian interpolation outperform linear interpolation. We include the first (to our knowledge) published comparison of count merging and Bayesian interpolation, showing that the two techniques perform similarly. Finally, we argue that other considerations will make Bayesian interpolation the preferred approach in most circumstances.
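
As a rough illustration of two of the techniques named in the abstract, the sketch below contrasts linear interpolation with count merging on toy bigram counts. It uses the commonly cited textbook forms of both methods with unsmoothed maximum-likelihood estimates; the function names, the toy data, and the weights are assumptions made for illustration and are not taken from the talk. Bayesian interpolation is left out because it additionally needs each component model's probability of the history.

```python
# Toy component models: bigram counts stored as {history: {word: count}}.
# The data and weights below are hypothetical, chosen only for illustration.
model_a = {"new": {"york": 8, "car": 2}}   # has seen the history "new" 10 times
model_b = {"new": {"york": 1, "car": 1}}   # has seen the history "new" 2 times
models = [model_a, model_b]


def history_count(counts, history):
    """Total number of times `history` was observed in one component."""
    return sum(counts.get(history, {}).values())


def cond_prob(counts, history, word):
    """Unsmoothed maximum-likelihood estimate of p(word | history)."""
    total = history_count(counts, history)
    return counts.get(history, {}).get(word, 0) / total if total else 0.0


def linear_interpolation(models, lambdas, history, word):
    """p(w|h) = sum_i lambda_i * p_i(w|h), with history-independent weights."""
    return sum(lam * cond_prob(m, history, word)
               for m, lam in zip(models, lambdas))


def count_merging(models, betas, history, word):
    """p(w|h) = sum_i beta_i * c_i(h) * p_i(w|h) / sum_j beta_j * c_j(h).

    Each component's weight is scaled by how often it has seen the history.
    """
    hist_counts = [history_count(m, history) for m in models]
    norm = sum(b * c for b, c in zip(betas, hist_counts))
    if norm == 0:
        return 0.0
    return sum(b * c * cond_prob(m, history, word)
               for m, b, c in zip(models, betas, hist_counts)) / norm


p_linear = linear_interpolation(models, [0.5, 0.5], "new", "york")
p_merged = count_merging(models, [1.0, 1.0], "new", "york")
print(f"linear interpolation: {p_linear:.2f}")  # 0.65
print(f"count merging:        {p_merged:.2f}")  # 0.75
```

With equal weights, count merging leans towards the component that has seen the history more often, whereas linear interpolation weights both components the same regardless of the history.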

His talk takes place on Thursday, December 19, 2019 at 13:00 in “little theater” R211 (next to Kachnicka student club in “Stary Pivovar”).