Sriram Ganapathy: Factorized self-supervision models for speech representation learning

Sriram Ganapathy is an Associate Professor in the Department of Electrical Engineering, Indian Institute of Science, Bangalore, where he heads the activities of the Learning and Extraction of Acoustic Patterns (LEAP) lab. He is also a visiting research scientist at Google Research India, Bangalore. His research interests include signal processing, machine learning methodologies for speech and speaker recognition, and auditory neuroscience. Prior to joining the Indian Institute of Science, he was a research staff member at the IBM Watson Research Center, Yorktown Heights. He received his Doctor of Philosophy from the Center for Language and Speech Processing, Johns Hopkins University. He obtained his Bachelor of Technology from the College of Engineering, Trivandrum, India, and his Master of Engineering from the Indian Institute of Science, Bangalore. He has also worked as a Research Assistant at the Idiap Research Institute, Switzerland. Over the past 15 years, he has published more than 120 peer-reviewed journal/conference publications in the areas of deep learning and speech/audio processing. Dr. Ganapathy currently serves as the IEEE SigPort Chief Editor and a member of the IEEE Education Board, and is a subject editor for the Elsevier Speech Communication journal. He is also a recipient of several awards, including the Department of Science and Technology (DST) Early Career Award in India, the Department of Atomic Energy (DAE), India, Young Scientist Award, and the Verisk AI Faculty Award. He is a senior member of the IEEE Signal Processing Society and a member of the International Speech Communication Association (ISCA).

Factorized self-supervision models for speech representation learning

In recent years, self-supervised learning (SSL) of speech has enabled substantial advances in downstream applications by generating succinct representations of the speech signal. The paradigm in most of these works involves frame-level (20-30 ms) contrastive or predictive modeling of speech representations. However, the speech signal carries information at multiple levels – semantic information encoded at the frame level, non-semantic information at the utterance level, and channel/ambient information encoded at the recording-session level. In this talk, I will describe the efforts undertaken by our group on learning representations at multiple scales in a factorized manner.

In the first part, I will elaborate on an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework. The input “time-frequency” representations from the convolutional neural network (CNN) module are processed with long short-term memory (LSTM) layers, which have lower computational requirements than other models. We explore techniques that improve the speaker invariance of the learned representations and illustrate the effectiveness of the proposed approach in two settings: i) completely unsupervised speech applications on the sub-tasks of the ZeroSpeech 2021 challenge, and ii) semi-supervised automatic speech recognition (ASR) applications on the TIMIT dataset and on the GramVaani challenge Hindi dataset. In these experiments, we achieve state-of-the-art results for various ZeroSpeech tasks (as of 2023).

In the second part of the talk, I will discuss our recently proposed framework for Learning Disentangled (Learn2Diss) representations of speech, which consists of a frame-level and an utterance-level encoder module. The two encoders are initially learned independently: the frame-level model is inspired by existing self-supervision techniques and thereby learns pseudo-phonemic representations, while the utterance-level encoder is inspired by contrastive learning of pooled embeddings and thereby learns pseudo-speaker representations. The joint learning of the two modules then disentangles the two encoders using a mutual-information-based criterion. Through several downstream evaluation experiments, we show that the proposed Learn2Diss framework achieves state-of-the-art results on a variety of tasks, including those in the SUPERB challenge. Finally, I will highlight a related effort towards zero-shot emotion conversion and conclude the talk with a discussion of future prospects for these work streams.
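To make the two-encoder disentanglement idea more concrete, the following is a minimal PyTorch sketch, assuming a CNN+LSTM frame-level encoder, a mean-pooled utterance-level encoder, and a CLUB-style variational critic as the mutual-information penalty. All module names, dimensions, and the specific MI estimator are illustrative assumptions based on the abstract, not the actual Learn2Diss implementation.

    # Illustrative sketch only; not the authors' implementation.
    import torch
    import torch.nn as nn

    class FrameEncoder(nn.Module):
        """Frame-level encoder: CNN feature extractor followed by LSTM layers."""
        def __init__(self, dim=256):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
                nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.GELU(),
            )
            self.lstm = nn.LSTM(dim, dim, num_layers=2, batch_first=True)

        def forward(self, wav):                      # wav: (batch, samples)
            feats = self.cnn(wav.unsqueeze(1))       # (batch, dim, frames)
            out, _ = self.lstm(feats.transpose(1, 2))
            return out                               # (batch, frames, dim)

    class UtteranceEncoder(nn.Module):
        """Utterance-level encoder: projection of a mean-pooled frame sequence."""
        def __init__(self, dim=256):
            super().__init__()
            self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

        def forward(self, frame_feats):              # (batch, frames, dim)
            return self.proj(frame_feats.mean(dim=1))  # (batch, dim)

    class MICritic(nn.Module):
        """Conditional Gaussian critic q(utt | frame) for a CLUB-style MI upper bound."""
        def __init__(self, dim=256):
            super().__init__()
            self.mu = nn.Linear(dim, dim)
            self.logvar = nn.Linear(dim, dim)

        def forward(self, frame_feats, utt_emb):
            ctx = frame_feats.mean(dim=1)            # summarize the frame sequence
            mu, logvar = self.mu(ctx), self.logvar(ctx)
            # log-likelihood of matched (positive) vs. shuffled (negative) pairs
            pos = -((utt_emb - mu) ** 2 / logvar.exp()).sum(-1)
            shuffled = utt_emb[torch.randperm(utt_emb.size(0))]
            neg = -((shuffled - mu) ** 2 / logvar.exp()).sum(-1)
            return (pos - neg).mean()                # minimize to discourage shared information

In joint training, such an MI estimate would be added with a small weight to the frame-level and utterance-level SSL objectives, so that minimizing it discourages speaker-like information in the frame encoder and content-like information in the utterance encoder.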

His talk takes place on Wednesday, June 26, 2024 at 13:00 in E112. The talk will be streamed live at https://youtube.com/live/2IcAJmFH4Ys.