Electroencephalography (EEG) based speech technologies using neural networks

Date

2021-11-02

Authors

Krishna, Gautam, Ph.D.

Abstract

The emergence of virtual personal assistants such as Apple Siri, Amazon Alexa, Google Assistant, Samsung Bixby, and Microsoft Cortana has improved the user experience for smartphone and personal computer users. The automatic speech recognition (ASR) system forms an important component of a virtual assistant: it converts speech signals into text, which is then processed by the natural language understanding (NLU) component. However, the performance of an ASR system degrades in the presence of background noise, which hurts virtual assistants in noisy environments such as shopping malls or airports. On the other hand, studies have demonstrated that humans perform speech recognition with a lower word error rate (WER) than machines in the presence of background noise. This motivated me to investigate how non-invasive electroencephalography (EEG) brain signal features, recorded synchronously with noisy speech, can be used to improve the performance of ASR and other speech processing models. Current state-of-the-art ASR systems are trained to recognize only acoustic features, which limits the technology's accessibility for people with speaking disabilities. This motivated me to investigate techniques for designing ASR systems capable of recognizing EEG features with no speech input. Humans speak at a high rate of about 150 words per minute, so an EEG speech prosthetic that first recognizes text from the EEG signal and then generates speech from the recognized text using a state-of-the-art speaker-dependent or speaker-independent text-to-speech (TTS) system may suffer from high latency. This motivated me to investigate algorithms that generate speech signals directly from EEG signals instead of translating EEG signals to text.

First, in this thesis, we demonstrate a neural network-based algorithm that uses EEG features to improve the performance of ASR and voice activity detection (VAD) systems operating in the presence of background noise on a limited English vocabulary. We also show that EEG features can be used to improve the performance of an end-pointer model, an extension of the VAD application.

Second, in this thesis, we demonstrate a neural network-based algorithm that performs isolated speech recognition with high accuracy on a limited English vocabulary using only EEG features, with no speech input, and we then study three techniques inspired by representation learning to improve the performance of continuous speech recognition systems that use only EEG features with no speech.

Third, in this thesis, we study different techniques to generate speech features from EEG features and vice versa. We study a recurrent neural network (RNN) based regression model, with and without an attention layer, for generating acoustic features from EEG features, and we demonstrate low test-time error rates with both variants. For the majority of our experiments, the attention model outperformed the RNN model without an attention layer in terms of test-time error rates, even though for some subjects adding the attention layer was not helpful. We further identified the appropriate sampling frequency and acoustic feature dimension for generating, from EEG signals, an audio waveform whose broad characteristics are closer to those of the ground-truth audio waveform.
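As a minimal illustration of this third contribution, the sketch below shows an RNN regression model with a self-attention layer mapping a sequence of EEG feature frames to acoustic (e.g., MFCC) feature frames. The feature dimensions, hidden size, single-head attention, and choice of PyTorch are illustrative assumptions, not the exact configuration used in the thesis.

# Hypothetical sketch: RNN regression with an attention layer, mapping a
# sequence of EEG feature frames to acoustic (e.g., MFCC) feature frames.
# Layer sizes and feature dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class EEGToAcousticRegressor(nn.Module):
    def __init__(self, eeg_dim=30, acoustic_dim=13, hidden_dim=128):
        super().__init__()
        self.rnn = nn.GRU(eeg_dim, hidden_dim, batch_first=True)
        # Self-attention over the RNN hidden states; one head for simplicity.
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=1,
                                          batch_first=True)
        self.out = nn.Linear(hidden_dim, acoustic_dim)

    def forward(self, eeg):                    # eeg: (batch, time, eeg_dim)
        h, _ = self.rnn(eeg)                   # (batch, time, hidden_dim)
        h, _ = self.attn(h, h, h)              # attention-weighted states
        return self.out(h)                     # (batch, time, acoustic_dim)

model = EEGToAcousticRegressor()
eeg = torch.randn(8, 200, 30)                  # 8 utterances, 200 frames
pred = model(eeg)                              # predicted acoustic frames
loss = nn.functional.mse_loss(pred, torch.randn_like(pred))

Dropping the self.attn call yields the RNN regression baseline without the attention layer, which is the comparison reported in the abstract.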
Fourth, in this thesis, we demonstrate an algorithm that uses EEG features to improve speech recognition performance for aphasia, apraxia, and dysarthria speech. We demonstrate that our proposed algorithm can, in real time, make use of ear EEG, dry EEG, and acoustic features to outperform a baseline speech recognition model trained using only acoustic features when tested on several subjects with various severity levels of aphasia, apraxia, and dysarthria. We further show that the algorithm can be extended to other tasks, such as speaker identification and voice activity detection for impaired speech, and that it can also be used to improve the performance of speech recognition systems operating in the presence of background noise in addition to impaired speech.
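One simple way to realize this kind of EEG-plus-acoustic recognizer is frame-level fusion of the two time-aligned feature streams feeding a single encoder. The sketch below is a hedged illustration of that idea: the dimensions, the fusion-by-concatenation choice, and the CTC training objective are assumptions for exposition, not necessarily the architecture proposed in the thesis.

# Hypothetical sketch: frame-level fusion of time-aligned EEG and acoustic
# features for a speech recognizer trained with CTC. All dimensions and the
# fusion-by-concatenation strategy are illustrative assumptions.
import torch
import torch.nn as nn

class FusionASR(nn.Module):
    def __init__(self, acoustic_dim=13, eeg_dim=30, hidden_dim=256,
                 vocab_size=32):                      # vocab incl. CTC blank
        super().__init__()
        self.encoder = nn.GRU(acoustic_dim + eeg_dim, hidden_dim,
                              num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, acoustic, eeg):
        fused = torch.cat([acoustic, eeg], dim=-1)    # (batch, time, 43)
        h, _ = self.encoder(fused)
        return self.classifier(h).log_softmax(-1)     # CTC log-probs

model = FusionASR()
acoustic = torch.randn(4, 300, 13)                    # 4 utterances
eeg = torch.randn(4, 300, 30)                         # time-aligned EEG
log_probs = model(acoustic, eeg).transpose(0, 1)      # (time, batch, vocab)
targets = torch.randint(1, 32, (4, 20))               # dummy label indices
loss = nn.CTCLoss()(log_probs, targets,
                    input_lengths=torch.full((4,), 300),
                    target_lengths=torch.full((4,), 20))

The baseline described in the abstract corresponds to training the same encoder on the acoustic stream alone; the fused model's advantage comes from the additional EEG evidence when the acoustic signal is degraded by noise or impairment.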
