Category Archives: News

Lenka Zdeborová: Towards Understanding of Scaling, Emergence and Transformers

Lenka Zdeborová

Lenka Zdeborová is a Professor of Physics and of Computer Science in École Polytechnique Fédérale de Lausanne where she leads the Statistical Physics of Computation Laboratory. She received a PhD in physics from University Paris-Sud and from Charles University in Prague in 2008. She spent two years in the Los Alamos National Laboratory as the Director’s Postdoctoral Fellow. Between 2010 and 2020 she was a researcher at CNRS working in the Institute of Theoretical Physics in CEA Saclay, France. In 2014, she was awarded the CNRS bronze medal, in 2016 Philippe Meyer prize in theoretical physics and an ERC Starting Grant, in 2018 the Irène Joliot-Curie prize, in 2021 the Gibbs lectureship of AMS. She is an editorial board member for Journal of Physics A, Physical Review E, Physical Review X, SIMODS, Machine Learning: Science and Technology, and Information and Inference. Lenka’s expertise is in applications of concepts from statistical physics, such as advanced mean field methods, replica method and related message-passing algorithms, to problems in machine learning, signal processing, inference and optimization. She enjoys erasing the boundaries between theoretical physics, mathematics and computer science.

Towards Understanding of Scaling, Emergence and Transformers

For over four decades, statistical physics has studied exactly solvable models of artificial neural networks. In this talk, we will explore how these models offer insights into deep learning and large language models. Specifically, we will examine a research strategy that trades distributional assumptions about data for precise control over learning behavior in high-dimensional settings. We will discuss several types of phase transitions that emerge in this limit, particularly as a function of data quantity. In particular, we will highlight how discontinuous phase transitions are linked to algorithmic hardness, impacting the behavior of gradient-based learning algorithms. Finally, we will cover recent progress in learning from sequences and advances in understanding generalization in modern architectures, including the role of dot-product attention layers in transformers.

Her talk takes place on Tuesday, May 20, 2025 at 13:00 in E112. The talk will be streamed live at https://www.youtube.com/live/FCvPhHm5uJw.

Hervé Lombaert: Geometric Deep Learning – Examples on Brain Surfaces

Hervé Lombaert is Associate Professor at Polytechnique Montreal; holder of the Canada Research Chair in Shape Analysis in Medical Imaging at ETS Montreal; and Associate Member of Mila. His research focuses on the statistical analysis of geometry in the context of machine learning and medical imaging. His achievements include early image segmentation methods with graph cuts, installed in hospitals around the world; surface analysis with spectral graphs and graph convolutional networks, used in various neuro institutes; and the first human atlas of the cardiac fibers, a major contribution to cardiology. Hervé has authored over 80 articles, 5 patents, and been recognized by several awards, including the Erbsmann Prize. Herve serves on the editorial boards of the journals IEEE TMI and MedIA. In the past, he has worked in multiple research centers, including Inria, Microsoft Research, Siemens Research, and McGill.

Geometric Deep Learning – Examples on Brain Surfaces

How to analyze the shapes of complex organs, such as the highly folded surface of the brain? This talk will show how spectral shape analysis can benefit general learning problems where data fundamentally lives on surfaces. We exploit spectral coordinates derived from the Laplacian eigenfunctions of shapes. Spectral coordinates have the advantage over Euclidean coordinates, to be geometry aware, invariant to isometric deformations, and to parameterize surfaces explicitly. This change of paradigm, from Euclidean to spectral representations, enables a classifier to be applied *directly* on surface data, via spectral coordinates. Brain matching and learning of surface data will be shown as examples. The talk will focus, first, on the spectral representations of shapes, with an example on brain surface matching; second, on the basics of geometric deep learning; and finally, on the learning of surface data, with an example on automatic brain surface parcellation.

His talk takes place on Thursday, April 17, 2025 at 14:00 in E105. The talk will be streamed live at https://www.youtube.com/live/0LOLhW3TjFA.

Gijs Luijten: XR Projects in Medicine at the smart-XR Lab of the AI-Guided Therapies Group

Gijs Luijten holds a Master’s degree in Technical Medicine, specializing in Medical Imaging and Intervention, from the Technical University of Twente. He focused on Augmented Reality (AR) applications for oral-maxillofacial (OMF) and Free Flap Surgery. At the Institute for Artificial Intelligence in Medicine (IKIM), Gijs conducts research with the University Hospital Essen (UKE). The research aims to improve (surgical) diagnosis and intervention using state-of-the-art 3D technologies, particularly Virtual and Augmented Reality (VR/AR). As a visiting scientist, Gijs collaborated closely with the Technical University of Graz (Institute for Computer Graphics and Vision) and the Medical University of Graz. This collaboration aimed to investigate the effectiveness and usability of AR tools in real clinical settings and to explore interesting combinations of machine learning (ML) and AR. Recently, Gijs has explored diminishing reality in synthetic data of surgeries and created an extensive database and API for 3D medical shapes for machine learning, eye tracking to improve diagnosis, and more. Currently, Gijs is co-managing the XR-Lab at IKIM while working on a combination of ultrasound, machine learning, and augmented reality. A strong desire for continuous learning and improvement in related fields, including machine learning, workflow planning, modeling, and programming is what drives Gijs. A strong desire for continuous learning and improvement in related fields, including machine learning, workflow planning, modeling, and programming is what drives Gijs. Gijs’ work aims to empower researchers, improve patient care, and inspire clinicians, patients, and companies to push the boundaries of current possibilities.

XR Projects in Medicine at the smart-XR Lab of the AI-Guided Therapies Group

The smart XR Lab of the Artificial Intelligence Guided Therapies (AIT) group within the Institute of Institute of Artificial Intelligence in Medicine (IKIM) is located at Essen University Hospital (AöR). Our team develops applications to support researchers, patients, and clinicians.
After an overview of the hospital, institute, and XR Lab, we will present some past and ongoing projects. For education, we developed MultiAR—a multi-user, cross-device platform for collaborative anatomy education—and recently received a grant to further develop virtual reality for educational purposes. For diagnostics, we are measuring eye tracking during the HINTS exam in clinical settings. For patients, we have a project to help them relax during chemotherapy. For clinicians and medical students, we stream ultrasound images to multiple headsets and incorporate automated length and width measurements. In surgery, we are working on several markerless registration projects for CT-to-patient superimposition using both standard and custom algorithms. With the recently acquired Apple Vision Pro and Siemens Cinematic Reality application, we are investigating its added value for surgical planning. To support researchers, we created a 3D surgical instrument collection used to generate synthetic scenes for object detection, segmentation, and diminished reality. We have also developed several facial and skull datasets based on CT scans—datasets that support the training of algorithms for patient-specific implants. All these datasets, along with those from our collaboration partners, have been integrated into MedShapeNet, a database available via API. We hope MedShapeNet will impact medical computer vision research as ShapeNet did for general computer vision. We look forward to sharing our projects for researchers, patients, and clinicians and to collaborating on translating our research into meaningful applications.

His talk takes place on Thursday, April 3, 2025 at 14:00 in E105. The talk will be streamed live at https://www.youtube.com/live/e0_bIDkVr8E.

Julia Berezutskaya: Restoring Communication with Brain Implants

Dr. Julia Berezutskaya is a researcher at University Medical Center Utrecht (the Netherlands) specializing in neurotechnology, brain-computer interfaces, and artificial intelligence. With expertise in neuroscience, AI, and linguistics, her work focuses on decoding brain activity to enable communication for individuals with severe paralysis. A recipient of national and international research grants, Julia is dedicated to translating her research into medical technology that could reduce healthcare costs and significantly enhance quality of life for people with paralysis.

Restoring Communication with Brain Implants

Brain-computer interfaces (BCIs) hold promise for restoring communication in people with severe paralysis. BCIs bypass damaged neural pathways, allowing individuals who can’t move or speak to express themselves directly through brain activity. In our lab we work with brain implants—devices placed under the skull to record directly from the brain tissue. Using advanced algorithms, we analyze these signals in real-time, interpreting them to generate words or sentences on a computer screen.

In this talk, I’ll explore the process of decoding brain signals, from the role of intracranial implants to the machine learning techniques we use to translate neural activity into communication. I’ll highlight real-world applications, showing how BCIs are enabling people with paralysis to regain their voices and reestablish connections with loved ones.

Her talk takes place on Thursday, December 5, 2024 at 14:00 in E112. The talk will be streamed live at https://www.youtube.com/live/XvIihi5MeCI.

Shuai Wang: Speaker Representation Learning – Theories, Applications and PracticeShuai Wang

Shuai Wang obtained a Ph.D. degree at Shanghai Jiao Tong University in 2020.09, under the supervision of Kai Yu and Yanmin Qian. During his Ph.D., his research interests included deep learning-based approaches for speaker recognition, speaker diarization, and voice activity detection. After graduation, he joined Tencent Games as a senior researcher, where he (informally) led a speech group and extended his research interest to speech synthesis, voice conversion, music generation, and audio retrieval. Currently, he is with the SpeechLab at Shenzhen Research Institute of Big Data, Chinese University of Hong Kong (Shenzhen), led by Haizhou Li.

Speaker Representation Learning: Theories, Applications and Practice

Speaker individuality information is one of the most critical elements of speech signals. By thoroughly and accurately modeling this information, it can be applied in various intelligent speech applications, such as speaker recognition, speaker diarization, speech synthesis, and target speaker extraction. In this talk, I would like to approach the speaker characterization problem from a broader perspective, extending beyond just speaker recognition. First, I will present the developmental history and paradigm shifts in speaker modeling within the framework of deep representation learning. Next, I will discuss recent advances in pre-trained model-based methods and self-supervised training techniques. I will also cover topics such as robustness, efficiency, and interpretability, as well as the various applications of speaker modeling technologies. Finally, I will introduce two open-source toolkits I developed: wespeaker and wesep. Wespeaker is currently one of the most popular toolkits for speaker embedding learning, while wesep extends its capabilities to target speaker extraction, seamlessly integrating with wespeaker. You can find related works and recommended references in my overview paper titled “Overview of Speaker Modeling and Its Applications: >From the Lens of Deep Speaker Representation Learning”.

His talk takes place on Tuesday, September 10, 2024 at 13:00 in A112. The talk will be streamed live at https://youtube.com/live/FMY5_smgrYY.

Sriram Ganapathy: Factorized self-supervision models for speech representation learning

Sriram Sriram Ganapathy is an Associate Professor at the Electrical Engineering, Indian Institute of Science, Bangalore, where he heads the activities of the Learning and Extraction of Acoustic Patterns (LEAP) lab. He is also a visiting research scientist at Google Research India, Bangalore. His research interests include signal processing, machine learning methodologies for speech and speaker recognition and auditory neuroscience. Prior to joining the Indian Institute of Science, he was a research staff member at the IBM Watson Research Center, Yorktown Heights. He received his Doctor of Philosophy from the Center for Language and Speech Processing, Johns Hopkins University. He obtained his Bachelor of Technology from College of Engineering, Trivandrum, India and Master of Engineering from the Indian Institute of Science, Bangalore. He has also worked as a Research Assistant in Idiap Research Institute, Switzerland. Over the past 15 years, he has published more than 120 peer-reviewed journal/conference publications in the areas of deep learning, and speech/audio processing. Dr. Ganapathy currently serves as the IEEE Sigport Chief Editor, member of the IEEE Education Board, and functions as subject editor for Elsevier Speech Communication Journal. He is also a recipient of several awards including Department of Science and Technology (DST) Early Career Award in India, Department of Atomic Energy (DAE), India Young Scientist Award and Verisk AI Faculty Award. He is a senior member of the IEEE Signal Processing Society and a member of the International Speech Communication Association (ISCA).

Factorized self-supervision models for speech representation learning

In recent years, self-supervised learning (SSL) of speech has enabled substantial advances in downstream applications by generating succinct representations of the speech signal. The paradigm in most of these works involve the frame-level (20-30ms) contrastive or predictive modeling of speech representations. However, speech signal entails information sources at multiple levels – semantic information encoded at frame-level, non-semantic information at utterance-level and channel/ambient information encoded at the recording session level. In this talk, I will describe efforts undertaken by our group on learning representations at multiple scales in a factorized manner.

In the first part, I will elaborate an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework. The input “time-frequency” representations from the convolutional neural network (CNN) module are processed with long short term memory (LSTM) layers, which are smaller in computational requirements compared to other models. We explore techniques that improve the speaker invariance of the learned representations and illustrate the effectiveness of the proposed approach on two settings, i) completely unsupervised speech applications on the sub-tasks described as part of the ZeroSpeech 2021 challenge and ii) semi-supervised automatic speech recognition (ASR) applications on the TIMIT dataset and on the GramVaani challenge Hindi dataset. In these experiments, we achieve state-of-art results for various Zero-speech tasks (as of 2023). In the second part of the talk, I will discuss our recent proposal on a framework to Learning Disentangled (Learn2Diss) representations of speech, which consists of frame-level and an utterance-level encoder modules. The two encoders are initially learned independently, where the frame-level model is inspired by existing self supervision techniques, thereby learning pseudo-phonemic representations, while the utterance-level encoder is inspired by contrastive learning of pooled embeddings, thereby learning pseudo-speaker representations. The joint learning of these two modules consists of disentangling the two encoders using a mutual information based criterion. With several downstream evaluation experiments, we show that the proposed Learn2Diss framework achieves state-of-the-art results on a variety of tasks, including those in SUPERB challenge. Finally, I will highlight a related effort towards zero-shot emotion conversion and conclude the talk with a discussion of future prospects for these work streams.

His talk takes place on Wednesday, June 26, 2024 at 13:00 in E112. The talk will be streamed live at https://youtube.com/live/2IcAJmFH4Ys.

Preslav Nakov: Factuality Challenges in the Era of Large Language Models

Preslav Preslav Nakov is Professor at Mohamed bin Zayed University of Artificial Intelligence. Previously, he was Principal Scientist at the Qatar Computing Research Institute, HBKU, where he led the Tanbih mega-project, developed in collaboration with MIT, which aims to limit the impact of “fake news”, propaganda and media bias by making users aware of what they are reading, thus promoting media literacy and critical thinking. He received his PhD degree in Computer Science from the University of California at Berkeley, supported by a Fulbright grant. He is Chair-Elect of the European Chapter of the Association for Computational Linguistics (EACL), Secretary of ACL SIGSLAV, and Secretary of the Truth and Trust Online board of trustees. Formerly, he was PC chair of ACL 2022, and President of ACL SIGLEX. He is also member of the editorial board of several journals including Computational Linguistics, TACL, ACM TOIS, IEEE TASL, IEEE TAC, CS&L, NLE, AI Communications, and Frontiers in AI. He authored a Morgan & Claypool book on Semantic Relations between Nominals, two books on computer algorithms, and 250+ research papers. He received a Best Paper Award at ACM WebSci’2022, a Best Long Paper Award at CIKM’2020, a Best Demo Paper Award (Honorable Mention) at ACL’2020, a Best Task Paper Award (Honorable Mention) at SemEval’2020, a Best Poster Award at SocInfo’2019, and the Young Researcher Award at RANLP’2011. He was also the first to receive the Bulgarian President’s John Atanasoff award, named after the inventor of the first automatic electronic digital computer. His research was featured by over 100 news outlets, including Forbes, Boston Globe, Aljazeera, DefenseOne, Business Insider, MIT Technology Review, Science Daily, Popular Science, Fast Company, The Register, WIRED, and Engadget, among others.

Factuality Challenges in the Era of Large Language Models

We will discuss the risks, the challenges, and the opportunities that Large Language Models (LLMs) bring regarding factuality. We will then delve into our recent work on using LLMs to assist fact-checking (e.g., claim normalization, stance detection, question-guided fact-checking, program-guided reasoning, and synthetic data generation for fake news and propaganda identification), on checking and correcting the output of LLMs, on detecting machine-generated text (blackbox and whitebox), and on fighting the ongoing misinformation pollution with LLMs. Finally, we will discuss work on safeguarding LLMs, and the safety mechanisms we incorporated in Jais-chat, the world’s best open Arabic-centric foundation and instruction-tuned LLM.

His talk takes place on Thursday, February 29, 2024 at 16:00 in A112. The talk will be streamed live at https://www.youtube.com/live/niT_shR8jbU.

Michael Buchholz: Connected, Cooperative Automated Mobility Supported by Intelligent Infrastructure

Michael Michael Buchholz earned his diploma degree in electrical engineering and information technology as well as his Ph.D. degree from the Faculty of Electrical Engineering and Information Technology at today’s Karlsruhe Institute of Technology. Since 2009, he is a research group leader and lecturer at the Institute of Measurement, Control, and Microtechnology, Ulm University, Germany, where he finished his “Habilitation” (post-doctoral lecturing qualification) in the field of automation engineering based on his research on cooperative, connected automated mobility in 2022. His further research interests include electric mobility, the modelling and control of mechatronic systems, and system identification.

Connected, Cooperative Automated Mobility Supported by Intelligent Infrastructure

Fully automated driving of vehicles in mixed traffic is a complex task with dynamically changing conditions, e.g., due to weather and other road users. Urban areas are especially challenging, showing a high traffic density and limited field of view (FOV) of the on-board sensors of an automated vehicle (AV). The latter is caused by occlusions, e.g., due to buildings, vegetation, or other traffic participants. To overcome the these FOV limitations, in the first part, this talk presents a supporting solution by intelligent infrastructure connected with the AVs via mobile communication realized as a test site in real traffic in Ulm, Germany. In the second part, a solution is proposed to enhance this system by an additional cooperative planner, which proposes manoeuvres to cooperative connected road users that ensure safety for vulnerable road users and can enhance traffic efficiency. Results from a proof of concept in mixed traffic at the test site in Ulm will be shown to demonstrate the possibilities of this approach.

His talk takes place on Friday, January 19, 2024 at 9:30 in G108.

Ondřej Klejch: Deciphering Speech – a Zero-Resource Approach to Cross-Lingual Transfer in ASR

Ondřej Ondřej Klejch is a senior researcher in the Centre for Speech Technology Research in the School of Informatics at the University of Edinburgh. He obtained his Ph.D. from the University of Edinburgh in 2020 and received his M.Sc. and B.Sc. from Charles University in Prague. He has been working on building automatic speech recognition systems with limited training data and supervision within several large projects funded by EPSRC, H2020, and IARPA. His recent work investigated semi-supervised and unsupervised training methods for automatic speech recognition in low-resource languages.

Deciphering Speech: a Zero-Resource Approach to Cross-Lingual Transfer in ASR

Automatic speech recognition technology has achieved outstanding performance in recent years. This progress has been possible thanks to the advancements in deep learning and the availability of large training datasets. The production models are typically trained on thousands of hours of manually transcribed speech recordings to achieve the best possible accuracy. Unfortunately, due to the expensive and time-consuming manual annotation process, automatic speech recognition is available only for a fraction of all languages and their speakers.

In this talk, I will describe methods we have successfully used to improve the language coverage of automatic speech recognition. I will describe semi-supervised training approaches for building systems with only a few hours of manually transcribed training data and large amounts of crawled audio and text. Subsequently, I will discuss training dynamics of semi-supervised training approaches and why a good language model is necessary for their success. I will then present a novel decipherment approach for training an automatic speech recognition system for a new language without any manually transcribed data. This method can “decipher” speech in a new language using as little as 20 minutes of audio and paves the way for providing automatic speech recognition in many more languages in the future. Finally, I will talk about open challenges when training and evaluating automatic-speech-recognition models for low-resource languages.

His talk takes place on Thursday, December 14, 2023 at 14:00 in A113.

Hynek Hermansky: Learning: It’s not just for machines anymore

Hynek Hermansky has been active in speech research for over 40 years, is a Life Fellow of IEEE, Fellow of the International Speech Communication Association, authored or co-authored more than 350 papers with over 20,000 citations, holds more than 20 patents and received IEEE James L. Flanagan Speech and Audio Processing Award, and ISCA Medal for Scientific Achievements. He started his career in 1972 at Brno University of Technology, obtained his D.Eng.. degree from the University of Tokyo, worked for Panasonic Technologies, U S WEST Advanced Technologies, the Oregon Graduate Institute, IDIAP Martigny, the Johns Hopkins University, and Google Deep Mind. Currently, he is a Researcher at Speech@FIT BUT, and an Emeritus Professor at the Johns Hopkins University.

Learning: It’s not just for machines anymore

Machine recognition of speech requires training on a large amount of speech training data. Subsequently, research in machine recognition of speech consists mainly of getting hands-on large amounts of speech training data combined, often by a try-and-error, with the appropriate combination of processing modules. Advances are mostly being evaluated by error rates observed in recognition of test data. Such a process may be missing one of the prime goals of scientific endeavor, which is to obtain new knowledge, applicable to other applications. We argue that speech data can be used to obtain relevant hearing knowledge, which is used in decoding messages in speech, and report on some experiments, which support this notion.

His talk takes place Wednesday, November 22, 2023 at 14:00 in E105.

Video recording of the talk is publicly available.

Sébastien Lefèvre: Deep Learning in Computer Vision – Are Numerous Labels the Holy Grail?

Jiri Sébastien Lefèvre is currently a Full Professor in Computer Science at the University of South Brittany (Vannes Institute of Technology) since September 2010. He founded the OBELIX group from IRISA laboratory, and led the group from 2013 to 2021 (Prof. Nicolas Courty is leading the group since March 2021). He is also coordinating the GeoData Science track within the Erasmus Mundus Copernicus Master in Digital Earth. His main research topics are image analysis/processing, pattern recognition and indexing, machine learning, deep learning and data mining with applications in remote sensing for Earth observation.

Deep Learning in Computer Vision: Are Numerous Labels the Holy Grail?

Deep Learning has been successful in a wide range of computer vision tasks, at the cost of high computational resources and large labeled datasets required to train the models. The latter is a strong bottleneck in numerous applications where collecting annotated data is challenging.
In this talk, I will present some of our works attempting to alleviate our need for large annotated datasets. More precisely, the methods we develop rely on semi-supervised, weakly-supervised, unsupervised settings, domain adaptation, data simulation, active learning, among other frameworks. Various applications in Earth Observation will be provided to illustrate the relevance of these solutions for a wide range of problems such as semantic segmentation, image classification, or object detection.

His talk takes place in Thursday, June 15, 2023 at 14:00 in G108.

Jiri Mekyska: Acoustic analysis of speech and voice disorders in patients with Parkinson’s disease

Jiri Jiri Mekyska is head of the BDALab (Brain Diseases Analysis Laboratory) at the Brno University of Technology, where he leads a multidisciplinary team of researchers (signal processing engineers, data scientists, neuroscientists, psychologists) with a special focus on the development of new digital biomarkers facilitating understanding, diagnosis and monitoring of neurodegenerative (e.g. Parkinson’s disease) and neurodevelopmental (e.g. dysgraphia) disorders.

Acoustic analysis of speech and voice disorders in patients with Parkinson’s disease

Parkinson’s disease (PD) is the second most frequent neurodegenerative disease, which is associated with several motor and non-motor features. Up to 90 % of PD patients develop a motor speech disorder called hypokinetic dysarthria (HD). HD manifests in the field of phonation (e.g. increased instability of articulatory organs, microperturbation in pitch and amplitude), articulation (e.g. rigidity of tongue and jaw, slow alternating motion rate), prosody (e.g. monopitch, monoloudness), and respiration (e.g. airflow insufficiency). Acoustic analysis of these specific speech/voice disorders enables neurologists and speech-language therapists to effectively monitor the progress of PD as well as to diagnose it. In the frame of this talk, we will present a concept of acoustic HD analysis. Consequently, we will present some recent findings focused on the prediction of motor (freezing of gait) and non-motor (cognitive) deficits based on the acoustic analysis, we will discuss an application of acoustic HD analysis in treatment effect monitoring (based on high-frequency repetitive transcranial magnetic stimulation), and in PD diagnosis. Finally, we will present some future directions in terms of integration into Health 4.0 systems.

His talk takes place in Tuesday, May 16, 2023 at 15:00 in E105.

András Lőrincz: Towards human-machine and human-robot interactions “with a little help from my friends”

András András Lőrincz, a professor and senior researcher, has been teaching at the Faculty of Informatics at Eötvös University, Budapest since 1998. His research focuses on human-machine interaction and their applications in neurobiological and cognitive modeling, as well as medicine. He has founded the Neural Information Processing Group of Eötvös University and he directs a multidisciplinary team of mathematicians, programmers, computer scientists and physicists. He has acted as the PI of several successful international projects in collaboration with Panasonic, Honda Future Technology Research and the Information Directorate of the US Air Force, Robert Bosch, Ltd. Hungary, among others. He took part in several EU Framework Program projects.

He is a habilitated professor at the University of Szeged (1998) on laser physics and habilitated in the field of Informatics at the Eötvös Loránd University in 2008. He conducted research and taught quantum control, photoacoustics and artificial intelligence at the Hungarian Academy of Sciences, University of Chicago, Brown University, Princeton University, the Illinois Institute of Technology, University of Szeged, and Eötvös Loránd University. He authored about 300 peer reviewed scientific publications.

He has become an elected Fellow of the European Coordinating Committee for Artificial Intelligence (EurAI) for his pioneering work in the field of artificial intelligence in 2006. He has received the Innovative Researcher Prize of the University in 2009 and in 2019.

Partners: Barcelona University (on personality estimation and human-human interaction), Technical University of Delft (on human-human interaction), Rush Medical School, Chicago, on autism diagnosis and PTSD therapy.

Towards human-machine and human-robot interactions “with a little help from my friends”

Our work in the Neural Information Processing Group focuses on human-machine interactions. The first part of the talk will be an introduction to the technologies that we can or should use for effective iterations, such as the detection of environmental context, ongoing activity, including body movement, manipulation, and hidden parameters, i.e. intention, mood and personality state, as well as communication signals: body, head, hand, hand, face and gaze gestures, plus the body parameters that can be measured optically or by intelligent means, i.e., the temperature, blood pressure and stress levels, among others.

In the second part of the talk, I will review (a) what body and environment estimation methods we have, (b) what we can say about human-human interactions, which will also give insights into the requirements of human-machine and human-robot interactions, (c) what applications we have or can target in the areas of autism, “continuous healthcare” and “home and public safety”. (d) I will also list what technologies are missing and what we are looking for partners in.

His talk takes place on Tuesday, November 1, 2022 at 13:00 in “little theater” R211 (next to Kachnicka student club in “Stary Pivovar”).

Heikki Kälviäinen: Computer Vision Applications

Heikki Heikki Kälviäinen has been a Professor of Computer Science and Engineering since 1999. He is the head of the Computer Vision and Pattern Recognition Laboratory (CVPRL) at the Department of Computational Engineering of Lappeenranta-Lahti University of Technology LUT, Finland. Prof. Kälviäinen’s research interests include computer vision, machine vision, pattern recognition, machine learning, and digital image processing and analysis. Besides LUT, Prof. Kälviäinen has worked as a Visiting Professor at the Faculty of Information Technology of Brno University of Technology, Czech Republic, the Center for Machine Perception (CMP) of Czech Technical University, and the Centre for Vision, Speech, and Signal Processing (CVSSP) of University of Surrey, UK, and as a Professor of Computing at Monash University Malaysia.

Computer Vision Applications

The presentation considers computer vision, especially a point of view of applications. Digital image processing and analysis with machine learning methods enable efficient solutions for various areas of useful data-centric engineering applications. Challenges with image acquisition, data annotation with expert knowledge, and clustering and classification, including deep learning method training are discussed. Different applications are given as examples based on the fresh novel data available: planktons in the Baltic Sea, Saimaa ringed seals in Lake Saimaa, and logs in the sawmill industry. In the first application the motivation is that distributions of plankton types give much information about the condition of the sea water system, e.g., about the climate change. An imaging flow cytometer can produce a lot of plankton images which should be classified into different plankton types. Manual classification of these images is very laborious, and thus, a CNN-based method has been developed to automatically recognize the plankton types in the Baltic Sea. In the second application the Saimaa ringed seals are automatically identified individually using camera trap images for assisting this very small population to survive in nature. CNN-based re-identification methods are based on pelage patterns of the seals. The third application is related to the sawmill industry. The digitalization of the sawmill industry is important for optimizing material flows and the quality. The research is focused on seeing inside the log to be able to predict which kinds of sawn boards are produced after cutting the log.

His talk takes place on Wednesday, May 11, 2022 at 13:00 in room A112.

Augustin Žídek: Protein Structure Prediction with AlphaFold

Augustin Žídek works as a Research Engineer at DeepMind and has been a member of the protein folding team since 2017. He studied Computer Science at the University of Cambridge. He enjoys working at the boundary of research and engineering, hiking, playing musical instruments and fixing things.

Protein Structure Prediction with AlphaFold

In this talk, we will discuss what proteins are, what is the protein folding problem and why it is an important scientific challenge. We will then talk about AphaFold, a machine learning model developed by DeepMind that is able to predict protein 3D structure with high accuracy, its architecture and applications.

His talk takes place on Tuesday, March 1, 2022 at 13:00 in room A112. The talk will be streamed live at https://youtu.be/udyjZXtUuDw.

Hema A. Murthy: Signal Processing Guided Machine Learning

Hema Hema A. Murthy is currently a Professor at the Department of Computer Science and Engineering. She has been with the department for the last 35 years. She currently leads an 18 Institute consortium that focuses on speech as part of the national language translation mission, an ambitious project where the objective is to produce speech to speech translation in Indian languages and Indian English.

Signal Processing Guided Machine Learning

In this talk we will focus on using signal processing algorithms in tandem with machine learning algorithms for various tasks in speech, music and brain signals. The primary objective is to understand events of interest from the perspective of the chosen domain. Appropriate signal processing is employed to detect events. Machine learning algorithms are then made to focus on learning the statistical characteristics of these events. The primary advantage of this approach is that it significantly reduces both computation and data costs. Examples from speech synthesis, Indian art music, and neuronal signals and EEG signals will be considered.

Her talk takes place on Tuesday, February 8, 2022 at 13:00 CET, virtually on zoom https://cesnet.zoom.us/j/91741432360.

Slides of the talk are publicly available.

Bernhard Egger: Inverse Graphics and Perception with Generative Face Models

Jan Prof. Dr. Bernhard Egger studies how humans and machines can perceive faces and shapes in general. In particular, he chooses to focus on statistical shape models and the 3D Morphable Models. He is a junior professor at the chair of visual computing at Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). Before joing FAU he was a postdoc in Josh Tenenbaum‘s Computational Cognitive Science Lab at the Departement of Brain and Cognitive Sciences at MIT and the Center for Brains, Minds and Machines (CBMM) and Polina Golland‘s group at MIT Computer Science & Artificial Intelligence Lab. He did his PhD on facial image annotation and interpretation in unconstrained images in the Graphics and Vision Research Group at the University of Basel. Before his doctorate he obtained his M.Sc. and B.Sc. in Computer Science at the University of Basel and an upper secondary school teaching Diploma at the University of Applied Sciences Northwestern Switzerland.

Inverse Graphics and Perception with Generative Face Models

Human object perception is remarkably robust: Even when confronted with blurred or sheared photographs, or pictures taken under extreme illumination conditions, we can often recognize what we’re seeing and even recover rich three-dimensional structure. This robustness is especially notable when perceiving human faces. How can humans generalize so well to highly distorted images, transformed far beyond the range of natural face images we are normally exposed to? In this talk I will present an Analysis-by-Synthesis approach based on 3D Morphable Models that can generalize well across various distortions. We find that our top-down inverse rendering model better matches human precepts than either an invariance-based account implemented in a deep neural network, or a neural network trained to perform approximate inverse rendering in a feedforward circuit.

His talk takes place on Wednesday, January 19, 2022 at 15:00 in room A112. The talk will be streamed live at https://www.youtube.com/watch?v=l9Aqz-86pUg.

Ondřej Dušek: Better Supervision for End-to-end Neural Dialogue Systems

Dusek Ondřej Dušek is an assistant professor at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University. His research is in the areas of dialogue systems and natural language generation; he specifically focuses on neural-networks-based approaches to these problems and their evaluation. He is also involved in the THEaiTRE project on automatic theatre play generation. Ondřej got his PhD in 2017 at Charles University. Between 2016 and 2018, he worked at the Interaction Lab at Heriot Watt University in Edinburgh, one of the leading groups in dialogue systems and natural-language interaction with computers and robots. There he co-organized the E2E NLG text generation challenge and co-led a team of PhD students in the Amazon Alexa Prize dialogue system competition, which came third in two consecutive years.

Better Supervision for End-to-end Neural Dialogue Systems

While end-to-end neural models have been the research trend in task-oriented dialogue systems in the past years, they still suffer from significant problems: The neural models often produce replies inconsistent with past dialogue context or database results, their replies may be dull and formulaic, and they require large amounts of annotated data to train. In this talk, I will present two of our recent experiments that aim at solving these problems.

First, our end-to-end neural system AuGPT based on the GPT-2 pretrained language model aims at consistency and variability in dialogue responses by using massive data augmentation and filtering as well as specific auxiliary training objectives which check for dialogue consistency. It reached favorable results in terms of both automatic metrics and human judgments (in the DSTC9 competition).

Second, we designed a system that is able to discover relevant dialogue slots (domain attributes) without any human annotation. It uses weak supervision from generic linguistic annotation models (semantic parser, named entities), which is further filtered and clustered. We train a neural slot tagger on the discovered slots, which then reaches state-of-the-art results in dialogue slot tagging without labeled training data. We further show that the discovered slots are helpful for training an end-to-end neural dialogue system.

His talk takes place on Wednesday, December 1, 2021 at 15:00 in “little theater” R211 (next to Kachnicka student club in “Stary Pivovar”). The talk will be streamed live and recorded at https://www.youtube.com/watch?v=JzBy-QuLxiE.

Tanel Alumäe: Weakly supervised training for speaker and language recognition

Tanel Tanel Alumäe is the head of Laboratory of Language Technology at Tallinn University of Technology (TalTech). He received his PhD degree from the same university in 2006. After that, he has worked in several research teams, including LIMSI/CNRS, Aalto University and Raytheon BBN Technologies. His recent research has focused on practical approaches to low-resource speech and language processing.

Weakly supervised training for speaker and language recognition

Speaker identification models are usually trained on data where the speech segments corresponding to the target speakers are hand-annotated. However, the process of hand-labelling speech data is expensive and doesn’t scale well, especially if a large set of speakers needs to be covered. Similarly, spoken language identification models require large amounts of training samples from each language that we want to cover.
This talk will show how metadata accompanied with speech data found on the internet can be treated as weak and/or noisy labels for training speaker and language identification models. Speaker identification models can be trained using only the information about speakers appearing in each of the recordings in training data, without any segment level annotation. For spoken language identification, we can often treat the detected language of the description of the multimedia clip as a noisy label. The latter method was used to compile VoxLingua107, a large scale speech dataset for training spoken language identification models. The dataset consists of short speech segments automatically extracted from YouTube videos and labeled according the language of the video title and description, with some post-processing steps to filter out false positives. It contains data for 107 languages, with 62 hours per language on the average. A model trained on this dataset can be used as-is, or finetuned for a particular language identification task using only a small amount of manually verified data.

His talk takes place on Tuesday, November 9, 2021 at 13:00 in “little theater” R211 (next to Kachnicka student club in “Stary Pivovar”). The talk will be streamed live and recorded at
https://youtu.be/fpsC0jzZSvs – thanks FIT student union for support!

Boliang Zhang: End-to-End Task-oriented Dialog Agent Training and Human-Human Dialog Collection

Srikanth Boliang Zhang is a research scientist at DiDi Labs, Los Angeles, CA. Currently, he works on building intelligent chatbots to help humans fulfill tasks. Before that, he has interned at Microsoft, Facebook, and AT&T Labs. He received his Ph.D. in 2019 at Rensselaer Polytechnic Institute. His thesis topic focuses on applications of neural networks for information extraction for low-resource languages. He has a broad interest in applications of natural language processing. He participated in DARPA Low Resource Languages for Emergent Incidents (LORELEI) project, where he, as a core system developer, built named entity recognition and linking system for low-resource languages, such as Hausa and Oromo, and achieves first place in the evaluation four times in a row. At DiDi Labs, he leads a small group to compete in the Multi-domain Task-oriented Dialog Challenge of DSTC9 and tied for first place among ten teams.

End-to-End Task-oriented Dialog Agent Training and Human-Human Dialog Collection

Task-oriented dialog systems aim to communicate with users through natural language to accomplish a wide range of tasks, such as restaurant booking, weather querying, etc. With the rising trend of artificial intelligence, they have attracted attention from both academia and industry. In the first half of this talk, I will introduce our participation in the DSTC9 Multi-domain Task-oriented Dialog Challenge and present our end-to-end dialog system. Compared to traditional pipelined dialog architecture where modules like Natural Language Understanding (NLU), Dialog Manager (DM), and Natural Language Generation (NLG) work separately and are optimized individually, our end-to-end system is a GPT-2 based fully data-driven method that jointly predicts belief states, database queries, and responses. In the second half of the talk, as we found that existing dialog collection tool has limitations in the real world scenario, I will introduce a novel human-human dialog platform that reduces all agent activity (API calls, utterances) to a series of clicks, yet maintains enough flexibility to satisfy users. This platform enables real-time agents to do real tasks, meanwhile stores all agent’s actions that are used for training chatbots later on.

The talk will take place on Tuesday April 20th at 17:00 CEST (sorry for late hour, but Boliang is on the US West Coast), virtually on zoom https://cesnet.zoom.us/j/95296064691.

Video recording of the talk is publicly available.

Slides of the talk are publicly available.