IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING, VOL. 22, NO. 5, SEPTEMBER 2014


A Multi-Views Multi-Learners Approach Towards Dysarthric Speech Recognition Using Multi-Nets Artificial Neural Networks

Seyed Reza Shahamiri, Member, IEEE, and Siti Salwah Binti Salim

Abstract—Automatic speech recognition (ASR) can be very helpful for speakers who suffer from dysarthria, a neurological disability that damages the control of motor speech articulators. Although a few attempts have been made to apply ASR technologies to dysarthric speech, previous studies show that such ASR systems have not attained an adequate level of performance. In this study, a dysarthric multi-networks speech recognizer (DM-NSR) model is provided using a realization of the multi-views multi-learners approach called multi-nets artificial neural networks, which tolerates the variability of dysarthric speech. In particular, the DM-NSR model employs several ANNs (as learners) to approximate the likelihood of ASR vocabulary words and to deal with the complexity of dysarthric speech. The proposed DM-NSR approach was realized as both speaker-dependent and speaker-independent paradigms. In order to highlight the performance of the proposed model over legacy models, multi-views single-learner versions of the DM-NSRs were also provided, and their efficiencies were compared in detail. Moreover, a comparison among the prominent dysarthric ASR methods and the proposed one is provided. The results show that the DM-NSR improved the recognition rate by up to 24.67% and reduced the error rate by up to 8.63% over the reference model.

Index Terms—Dysarthria, dysarthric speech recognition, multi-nets artificial neural networks, multi-views multi-learners (MVML).

Manuscript received September 24, 2013; revised January 15, 2014; accepted February 24, 2014. Date of publication March 11, 2014; date of current version September 04, 2014. This work was supported by High Impact Research, Ministry of Education Malaysia under Grant UM.C/HIR/625/1/MOE/FCSIT/05. The authors are with the Department of Software Engineering, Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur 50603, Malaysia (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TNSRE.2014.2309336

I. INTRODUCTION

DYSARTHRIA is a neurological disability that damages the control of motor speech articulators; this impediment is caused by a lack of control over the speech-related muscles, a lack of coordination among them, or their paralysis. It is often associated with irregular phonation and amplitude [1], [2], which result in compromised speech signals and reduced intelligibility of speech [3], [4]. Since disabled persons are often physically incapacitated and unable to use keyboards, automatic speech recognition (ASR) systems can be very helpful for speakers with dysarthria [5], [6]. These systems identify the uttered word(s) represented as acoustic signals. ASR systems

have applications in many domains, including health care, military service, telephony, and others [7]. ASR systems designed for normal speakers usually perform worse for individuals who suffer from dysarthria [8] than for speakers without speech disabilities, since dysarthric and normal speech are different [9]–[11]. According to Rudzicz [12], word recognition rates of normal ASR systems used for dysarthric speech were 26.2%–81.8% lower than those of the same systems used for normal speech; as such, the current trend is to create specialized ASR systems for individuals with dysarthria [3], [8], [13], [14]. Therefore, it is necessary to propose ASR systems specifically built for users with dysarthria that deliver adequate accuracy; such specialized speech recognizers have generally achieved comparatively better performance for people with speech disorders [2], [4], [8], [14].

Although a few attempts have been made to apply ASR technologies to dysarthric speech, previous studies show that ASR systems for users with dysarthria have not attained an adequate performance level in terms of accuracy because of the complexity of dysarthric speech recognition [10]. For example, because of increased variability due to the physical fatigue and frustration of individuals with dysarthria, as well as variations in the severity levels of the disease, it is difficult to produce an accurate ASR system for individuals with dysarthria.

The Multi-Views Multi-Learners (MVML) model is an approach to solving complex pattern recognition problems in which legacy multi-views single-learner (MVSL) approaches have failed to provide high accuracy [15], [16]. An MVSL model may fail because it has only a single learner approximating several views, and the learner cannot adequately converge on all the views. The MVML model, in contrast, suggests that the use of multiple learners increases classification performance, since each learner is responsible for approximating one or a few of the views [17]. The classification task for dysarthric speech is extremely complicated due to the complexity of dysarthric speech and its variability. The Multi-Nets Artificial Neural Networks (M-N ANNs) model [18] is a realization of the MVML approach; each ANN learns one of the views, and together they constitute an ensemble of learners. M-N ANNs have been proven capable of recognizing complex patterns better than MVSL ANN-based classifiers because the complexity of the function is distributed among several independent neural networks so that the overall classification becomes easier [19]; under similar conditions, an MVSL ANN-based system fails to learn the function with
adequate accuracy.

In this paper, we studied whether the MVML approach towards ANN-based dysarthric ASR systems can improve the performance of the recognition task over that of the MVSL model. Here, a whole-word¹ and isolated-word² dysarthric multi-networks speech recognizer (DM-NSR) approach is proposed using M-N ANNs for the disabled that tolerates the variability of speech. We studied the applications of M-N ANNs as speaker-dependent (SD) and speaker-independent (SI) dysarthric ASR paradigms; the SD paradigm was verified with data obtained from the same speakers that were used to train the ASR systems. In contrast, the SI paradigm was evaluated using pronunciations of users whose acoustic data were not provided during the training process (i.e., the system may identify unforeseen speech). Speech data from different severity levels of disability were considered. To highlight the advantages of the proposed model, a similar MVSL ANN-based dysarthric ASR system [20] was provided as the reference model; it was trained and tested with the same data and the same methodology. Finally, the results obtained from the proposed DM-NSR model and the reference model were compared.

¹The system uses word acoustic features to recognize the pronunciations instead of phoneme features.
²Isolated-word means it does not recognize continuous speech and requires the speaker to pause between words.

II. RELATED WORK

Proposals of speech recognizers for dysarthric people have been considered before. Table I compares state-of-the-art ASR systems for English dysarthric speakers, inclusive of the proposed model. It is pertinent to note that experimenting with more subjects of both sexes and a variety of speech intelligibility levels leads to more conclusive results, which show whether a system is capable of tolerating the variability of dysarthric speech; hence, the number of dysarthric subjects is an important criterion for validating the results. In the following, each dysarthric ASR system is discussed further. The ASR algorithm is usually selected among hidden Markov models (HMMs), support vector machines (SVMs), and ANNs.

TABLE I COMPARATIVE STUDY OF STATE-OF-THE-ART ENGLISH DYSARTHRIC SPEECH RECOGNIZERS

The first ASR system shown in Table I was proposed by Hasegawa-Johnson et al. [14]. They provided two ASR systems based on the data collected from three subjects with dysarthria (one female and two males) along with one control subject. The first system was based on HMMs, and the second was based on SVMs. The former was successful for two subjects, but it failed for one of the subjects, who had a tendency to delete consonants in a word. Similarly, the SVM-based ASR system failed to perform for one of the subjects with dysarthria, because he suffered from stuttering, but it was successful for the other two subjects. The authors concluded that HMM-based dysarthric ASR models may provide robustness against large-scale word-length fluctuations, and SVM-based models can handle the deletion or reduction of consonants.

The next HMM-based speech recognizer was proposed by Selouani et al. [3]. The dysarthric subjects were four male individuals from the Nemours database [21] in addition to one control speaker. The authors claimed that the ASR system obtained
an average recognition rate of 69.1%; however, the commonly cited evaluation criterion for continuous speech is word error rate (WER) rather than recognition rate. Moreover, the speech intelligibility of the dysarthric subjects employed in this study is unknown.

Green et al. [22] conducted experiments with STARDUST (speech training and recognition for dysarthric users of assistive technology) with three dysarthric subjects. They reported a mean recognition accuracy of 94.33% for a ten-word vocabulary, but information about the gender of the subjects and their speech intelligibility is not available. Hawley et al. [8] studied further with two female and five male dysarthric subjects with low speech intelligibility. They obtained a higher average recognition rate of 95.40% for a similar vocabulary, although the improvements were achieved after the users underwent speech training, which provided a controlled environment.

ANN-based ASR systems have been successfully employed for normal speech, as reported in the literature (such as [23], [24]). Jayaram and Abdelhamied studied the application of ANNs as a dysarthric ASR model [25]. The authors applied ANNs in a ten-word-vocabulary ASR system to recognize the speech of one male dysarthric subject with 20% speech intelligibility. A maximum recognition rate of 78.25% was obtained.

Recently, there has been a trend to use hybrid ANN/HMM algorithms for normal speech recognizers, but their application to dysarthric speech has not received the same attention. The reason may be the difficulty of obtaining ANN training data: the hybrid algorithm usually employs neural networks to approximate the language model in order to perform phoneme recognition. However, since the speech intelligibility of disabled people can be extremely low, dysarthric speech segmentation and labelling can be very challenging, requiring complex methods such as forced alignment and knowledge of the vocal tract [26]. The low intelligibility of dysarthric speech is due to a combination of many articulatory behaviours that may lead to phonemic insertion errors in or around words [12], [27]; hence, phoneme-based dysarthric ASR systems that depend on posterior probabilities may not be a proper approach because the segmentation data may be inaccurate and misleading. Nonetheless, postprocessing techniques such as confusion matrices and finite-state transducers [28] can be used to remedy the situation.

In this context, Morales and Cox applied a confusion matrix to improve the performance of dysarthric speech recognition. They used pronunciations of 10 dysarthric subjects from the Nemours database (74 sentences for each speaker), in which a subset of the first 34 sentences was used for training and the rest for testing. The best average recognition rate was less than 50%, and using a confusion matrix was reported to have some issues with deletion and an inability to model specific phoneme sequences [29]. In a later attempt, the authors employed weighted finite-state transducers (WFSTs) in the confusion matrix and improved the accuracy of their previous work by about 10% [30]. Another application of WFSTs that reduced dysarthric speech recognition error was proposed by Seong et al. [31]. They constructed a context-dependent confusion matrix, interpolated it with a context-independent matrix, and built a WFST integrated with dictionary and language-model
transducers in order to correct speech recognition errors. The lowest WER reported in this work was 34.03%.

The N-best word list is another postprocessing technique used in dysarthric speech recognition; it can be manual [35] or automatic, based on articulatory data acquired using electromagnetic articulography (EMA) [26], [33]. In the manual approach, users are able to select the spoken word among those found in the N-best list. On the other hand, the automated approach uses a systematic method to select the most probable utterance by rearranging the likelihoods extracted from the articulatory models.

The problem with the N-best list is that it may fail if the correct word is not found in the list; hence, the list must be accurate. In addition, EMA-based methods cannot be considered if articulatory data are not available.

In the context of hybrid ANN/HMM dysarthric ASR studies, Polur and Miller [13] studied the hybrid algorithm for a dysarthric ASR system and verified it against data collected from three male dysarthric subjects with moderate speech intelligibility. The hybrid ASR system was an extended version of the researchers' earlier experiments with HMMs [36]. Neural
networks were considered to provide posterior probabilities; forced alignment and ergodic HMMs were applied in order to perform speech segmentation and prepare the ANN training data. The authors concluded that the hybrid algorithm provided better recognition rates than legacy HMM-based dysarthric speech recognizers.

Besides the above SD and speaker-adaptive (SA) systems, there have been a few unsuccessful attempts to provide SI ASR systems for users with dysarthria. They were unsuccessful because the error rates were too high for these speech recognition systems to be of any practical use. For example, Sanders and his colleagues [10] studied how a normal, SI HMM-based ASR system behaved when it was used by people with dysarthria. The ASR system, trained with nonspeech-disordered speech data, was evaluated with dysarthric data acquired from two male speakers with mild dysarthria. The results for the two evaluation subjects showed WERs of 15.4% and 41%, respectively. However, the same ASR system had better performance when it was trained and tested with dysarthric data (i.e., as an SD ASR system). The SD ASR system evaluated with data of the same two dysarthric subjects had WERs of 2.6% and zero, respectively. Similar results were described by Talbot for the ENABL project [11]. The author verified a commercial ASR system with data collected from 10 individuals with dysarthria (five males and five females) and reported a very high error rate.

Sharma and Hasegawa-Johnson used the same database as this study to provide speaker-adaptive and speaker-dependent speech recognizers for users with dysarthria [34]. For the adaptive model, they provided an SI ASR system for speakers without disabilities trained with the TIMIT corpus; then, the authors utilized the speech of seven speakers with dysarthria from the UA-Speech database to verify the ASR system, provided as an SA dysarthric ASR paradigm. The maximum average recognition rate was 36.8% for the SA system and 30.84% for the SD system.

Among the ASR systems discussed in this paper, only Sharma and Hasegawa-Johnson [34] verified their approach against all dysarthric severity levels, but the recognition rates were low. We verified the DM-NSR model with several subjects whose speech intelligibility ranges from as low as 2% to as high as 95%, and the recognition rates of our proposed approach are considerably higher.

III. MATERIALS AND PARTICIPANTS

We used speech materials provided by the UA-Speech database of dysarthric speakers produced by the University of Illinois [37]. Although other dysarthric databases were available at the time of writing this report, most of them were not suitable for this research, since the context of this study was to provide a fixed-length, isolated-word speech recognizer. Thus, an isolated-word dysarthric speech corpus including enough pronunciations of the vocabulary (Table II) was necessary. For example, LDC recently released the TORGO database [38], [39], but it could not be considered in this research because insufficient repetitions of some of the words were included in the database, such that ANN training samples could not be extracted appropriately. To put it differently, since word acoustic features were used to train the neural networks instead of phoneme features, samples
of vocabulary utterances for each of the dysarthric participants were necessary, but these were not provided by TORGO. In addition, training a speaker-independent ASR system requires a large number of participants in order to increase the generalizability of the classifier; TORGO provides speech materials from only eight dysarthric subjects.

TABLE II VOCABULARY

The UA-Speech database contains isolated words: acoustic samples of digits, radio alphabet letters, computer commands, and common words acquired from 19 male and female subjects with dysarthria of different severity levels, varying from very low speech intelligibility (2%) to high intelligibility (95%). All the speech samples were recorded using an array of eight microphones, 6 mm in diameter, and are presented as Microsoft PCM at a sampling rate of 16 kHz in Wave format. The vocabulary used in this experiment is shown in Table II. We utilized the pronunciations of 16 of the subjects with dysarthria to provide the required speech materials for ASR modelling and evaluation. There were not enough data available for the remaining three subjects, so we did not include them in our experiments. Table III provides more information about the subjects with dysarthria used in this research. Moreover, acoustic samples of 11 speakers without disabilities for the same vocabulary (provided by the database) were also considered as control speakers.

TABLE III DYSARTHRIC PARTICIPANTS [37]

Fig. 1. Dysarthric single-network speech recognizer, a MVSL approach.

The database provides three different repetitions of each word per speaker. For the SD DM-NSRs, we used one of the repetitions, together with speech samples of three control speakers per word, as training samples. The second and third dysarthric utterances were considered ASR evaluation data and were employed to measure the ASR performance. The process of selecting one training and two evaluation pronunciations was repeated three times for cross-validation, each time with different sets.

Speech therapists often use clinical assessments of the intelligibility of dysarthric speakers for rehabilitation [40]. Speech intelligibility is the measure of the degree of speech understandability, and it correlates well with ASR accuracy [6]. The UA-Speech database presents the severity of dysarthric speech impairment by the speech intelligibility of each speaker. In this study, we classified the severity of speech impairment as high, moderate, or mild based on the participants' intelligibility provided by the database. If the speech is identified as "High Dysarthric Severity," its intelligibility is low (less than 33%). On the other hand, "Mild Dysarthric Severity" means the intelligibility is high (between 66% and 99%). The rest of the intelligibility values, ranging between 33% and 66%, are defined as "Moderate Dysarthric Severity." Here, we measured the performance of the proposed SI ASR model for each of the above severity levels separately.

IV. METHOD

A. Feature Extraction

Each utterance was represented by 22 frames of mel-cepstrum features with 12 coefficients per frame. The number of frames was selected to match the maximum length of the pronunciations provided by the database. The features extracted from each frame were concatenated to those of the previous frames to create the ANN input vector. In particular, each utterance was represented by a vector of 264 features (12 features per frame × 22 frames); each feature was assigned to one of the input neurons (i.e., each group of twelve input neurons represented one of the frames).
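A minimal sketch of this fixed-length feature construction follows. The use of librosa, the 16-kHz load, and the zero-padding of short utterances are our illustrative assumptions; the paper specifies only the 12 mel-cepstral coefficients per frame over 22 frames.

```python
# Sketch: building the 264-dimensional DM-NSR input vector described above.
# librosa and the padding/truncation strategy are illustrative assumptions.
import numpy as np
import librosa

N_FRAMES = 22   # matches the longest pronunciation in the database
N_COEFFS = 12   # mel-cepstral coefficients per frame

def utterance_features(wav_path: str) -> np.ndarray:
    """Return a flat vector of N_COEFFS * N_FRAMES = 264 features."""
    signal, sr = librosa.load(wav_path, sr=16000)  # UA-Speech is 16-kHz PCM
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=N_COEFFS)  # (12, T)
    frames = mfcc.shape[1]
    if frames < N_FRAMES:  # pad short utterances, truncate long ones
        mfcc = np.pad(mfcc, ((0, 0), (0, N_FRAMES - frames)))
    else:
        mfcc = mfcc[:, :N_FRAMES]
    # Frame-by-frame concatenation: every twelve consecutive entries of the
    # result feed the twelve input neurons assigned to one frame.
    return mfcc.T.reshape(-1)  # shape: (264,)
```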

B. Dysarthric Multi-Networks Speech Recognizer Model

This section explains how M-N ANNs can be employed to formulate the DM-NSR. ANN classifiers comprising only one neural network may not be able to approximate highly complex functions that represent multiple views [15]. The main drawback of such systems is that they learn all the views using only one particular ANN; thus, they are of a MVSL nature. The consequence is that the monolithic ANN may fail to provide an accurate approximation of all the views; the burden of functional complexity and the number of views are too heavy for one neural network. On the other hand, M-N ANNs use several parallel and standalone ANNs to learn the views; the administration of the view-approximation procedure is divided among all the ANNs. In more precise terms, approximating a complex function comprising multiple views is distributed among multiple ANNs, and, as one of the learners, each ANN is assigned to accomplish a part of the approximation job and learn a single view so that the complex function is simplified [18], [19].

Due to the variations and complexity of impaired dysarthric speech, the recognition task becomes complicated because the speech features are unlikely to be the same as those used for training. Consequently, MVSL ANN-based ASR systems may not be able to identify the uttered words accurately. It is shown here that better dysarthric ASR performance is achieved by using DM-NSRs because they perform better classification. Fig. 1 shows the structure of an isolated-word dysarthric ASR system based on a monolithic ANN, which is called a dysarthric single-network speech recognizer (DS-NSR) in this paper; Fig. 2 depicts the same for a DM-NSR. Assume the ASR vocabulary is defined as

$$W = \{w_1, w_2, \ldots, w_n\}$$

where $w_j$ is one of the words to be identified by the ASR system, and $n$ is the size of the vocabulary. The feature vector is

$$X = (x_1, x_2, \ldots, x_m)$$

where $m = 264$ is the number of acoustic features per utterance (Section IV-A).

Fig. 2. Dysarthric multi-networks speech recognizer, a MVML approach.

Consider $\mathcal{X}_j$ to be any possible values of $X$ that represent $w_j$; then (1) gives the complete feature values for the vocabulary $W$:

$$\mathcal{X} = \{\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_n\}. \qquad (1)$$

MVSL-based approaches use only one ANN to approximate $\mathcal{X}$. On the other hand, the DM-NSR approach assigns a network to learn only the features associated with a particular view; hence, the DM-NSR is

$$\mathrm{DM\text{-}NSR} = \{\mathrm{ANN}_1, \mathrm{ANN}_2, \ldots, \mathrm{ANN}_n\}$$

where $\mathrm{ANN}_j$ is responsible for predicting the likelihood of $w_j$. Let $\mathcal{X}_j$ be the acoustic feature values that present $w_j$; then $\mathrm{ANN}_j$ approximates the feature values defined by (2)

$$\mathrm{ANN}_j \colon X \mapsto P(w_j \mid X) \qquad (2)$$

in which $j = 1, 2, \ldots, n$. The ANN outputs are likelihoods

$$O = \big(P(w_1 \mid X), P(w_2 \mid X), \ldots, P(w_n \mid X)\big)$$

where $P(w_j \mid X)$ is the probability of $X$ presenting $w_j$. Once a feature vector is given to a DM-NSR, it feeds the feature data to all the ANNs that make up the recognizer. The selected ANN, and hence the recognized word, is the one with the maximum likelihood, as shown by (3)

$$w^{*} = \arg\max_{1 \le j \le n} P(w_j \mid X). \qquad (3)$$

It is noted that the $j$th output (i.e., $P(w_j \mid X)$) is the likelihood of the $j$th word, generated by the $j$th neural network. Since each $\mathrm{ANN}_j$ that forms the DM-NSR learns only its associated $\mathcal{X}_j$, the modelling accuracy and generalizability are increased because the number of parameters required to approximate each $\mathcal{X}_j$ is decreased.

Thus, if $\mathrm{ANN}_j$ is given dysarthric acoustic data, it should be able to recognize its associated features better than a single neural network that approximates the whole of $\mathcal{X}$. In addition, since the networks that make up DM-NSRs work independently, DM-NSRs provide full customizability to choose different ANN models, architectures, and parameters as required,
without affecting other ANNs that have acceptable recognition rates; this flexibility is helpful for increasing the quality of the recognition task. Nevertheless, this is not true for a DS-NSR, since any modification to the neural network influences the whole system, and only one ANN model or architecture can be used.
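The decision rule in (3) can be sketched as follows, with one binary network per vocabulary word. scikit-learn, the hidden-layer size, and the class layout are our illustrative choices, not the authors' implementation; the paper's actual structures and training parameters are those of Table IV.

```python
# Sketch of the DM-NSR decision rule in (3): one binary MLP per vocabulary
# word, each trained to emit the likelihood P(w_j | X) of its own word.
# scikit-learn and the hidden-layer size are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier

class DMNSR:
    def __init__(self, vocabulary):
        self.nets = {w: MLPClassifier(hidden_layer_sizes=(100,),
                                      activation="logistic")  # sigmoid
                     for w in vocabulary}

    def fit(self, features, words):
        # Every ANN_j sees the same feature vectors but a word-specific
        # binary target: 1 if the sample is its word, 0 otherwise.
        words = np.asarray(words)
        for w, net in self.nets.items():
            net.fit(features, (words == w).astype(int))

    def recognize(self, x):
        # Feed the feature vector to all networks and select the word whose
        # network reports the maximum likelihood (argmax over j).
        scores = {w: net.predict_proba(x.reshape(1, -1))[0, 1]
                  for w, net in self.nets.items()}
        return max(scores, key=scores.get)
```

Because each network is standalone, retraining or resizing the net for one word leaves every other entry of `self.nets` untouched, which is exactly the customizability argued for above.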


TABLE IV ANN STRUCTURES AND THEIR TRAINING PARAMETERS

C. Evaluation Criteria

The word recognition rate and accuracy are considered the evaluation criteria used to assess the quality of the ANN-based speech recognizers produced in this study. These two parameters are defined as follows.

1) Recognition rate (RR): the proportion of correct identifications of the vocabulary words by the ASR system. This conveys the correctness of the recognizers' results when evaluation data are given to the system:

$$\mathrm{RR} = \frac{\text{number of correctly recognized evaluation samples}}{\text{total number of evaluation samples}} \times 100\%.$$

2) Normalized root mean square error (NRMSE): used to measure the accuracy of the system. NRMSE is usually measured in computational neurosciences in order to show how well a system learns a model. Here, it is based on the calculation of the absolute distance between the ideal results and the actual results produced by the ASR system during the evaluation procedures. This parameter shows how close the ASR results are to the ideal ones in practice and how well the ASR system approximates the likelihood; lower NRMSE percentage values show that the ASR provides likelihoods that are more accurate. NRMSE is defined as

$$\mathrm{NRMSE} = \frac{\sqrt{\dfrac{1}{Tn}\sum_{t=1}^{T}\sum_{j=1}^{n}\left(d_{tj} - o_{tj}\right)^{2}}}{X_{\max} - X_{\min}}$$

in which $T$ is the number of evaluation samples, $n$ is the vocabulary size, $d_{tj}$ and $o_{tj}$ are the ideal (target) and actual likelihoods of word $w_j$ for evaluation sample $t$, and the parameters $X_{\max}$ and $X_{\min}$ are set to the maximum and minimum of the sigmoid activation function.
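Both criteria are straightforward to compute; a small numpy sketch follows, assuming the targets and outputs are arranged as T × n likelihood matrices (the matrix layout is our assumption, and the sigmoid range gives X_max = 1 and X_min = 0).

```python
# Sketch: computing RR and NRMSE for a batch of evaluation samples.
# `targets` and `outputs` are T x n matrices of ideal and produced
# likelihoods (row = evaluation sample, column = vocabulary word).
import numpy as np

def recognition_rate(targets: np.ndarray, outputs: np.ndarray) -> float:
    # A sample counts as correct when the maximum-likelihood word matches
    # the target word.
    correct = np.argmax(outputs, axis=1) == np.argmax(targets, axis=1)
    return 100.0 * correct.mean()

def nrmse(targets: np.ndarray, outputs: np.ndarray,
          x_max: float = 1.0, x_min: float = 0.0) -> float:
    rmse = np.sqrt(np.mean((targets - outputs) ** 2))
    return 100.0 * rmse / (x_max - x_min)
```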

V. EXPERIMENTS AND RESULTS

We trained 443 neural networks; all of them were MLPs with one hidden layer trained by the feed-forward and back-propagation algorithm, owing to their robustness and ease of implementation. The ANNs consisted of 264 input neurons, as explained in Section IV-A, and the activation function was the sigmoid. The number of hidden neurons $h$ was selected by a fixed rule based on the number of input neurons $n_i$ and the number of output neurons $n_o$. The rest of the training parameters of all the ANNs were chosen by trial and error, and those providing the best results are discussed here. The structures of the ANNs are shown in Table IV. In the following, we explain two sets of experiments carried out in
achieving the objectives of this study and present the evaluation results.

A. Speaker-Dependent DM-NSR Experiments

The first set of experiments was conducted to measure the performance of the proposed DM-NSR as an SD dysarthric ASR system. We trained several different M-N ANN-based ASR systems for the dysarthric subjects shown in Table III. Each subject was provided with both an SD DM-NSR and a dysarthric single-network speech recognizer (DS-NSR) for comparison purposes; all were trained and evaluated with the very same data. Three-fold cross-validation was considered; in particular, in each fold, one of the repetitions of each word was used for training and the rest for evaluation (i.e., the evaluation set for each subject consists of 50 dysarthric pronunciations per fold). The evaluation and training utterances were changed for the next fold, but the control training pronunciations were fixed for all folds. The results of these experiments are shown in Table V.

TABLE V SPEAKER-DEPENDENT DM-NSR AND DS-NSR RESULTS

In order to measure NRMSE, the following output conventions of ANN classification are adopted. If an ANN classifies a given feature vector as the word that is assigned to it, the target value is one (i.e., 100% likelihood); therefore, it should produce a result very close or equal to one. On the other hand, if the features do not represent the assigned word, the target result is zero (i.e., 0% likelihood), and the output it generates should be very close or equal to zero in order to indicate that this is not the word that it was trained for. For instance, if one of the ANNs is given a feature vector that represents its associated word and produces a likelihood of 98%, then the distance for this particular network is 2%. However, if the ANN produces 5% likelihood, which means that it does not identify the word, then the distance is 95%.

B. Speaker-Independent DM-NSR Experiments

The evaluation of the proposed isolated-word SI DM-NSR is discussed in this section. For the SI experiments, we needed most of the data to train the system to maximize its generalizability. Hence, we were left with very few pieces of data to
test the system. In order to overcome the problem, we repeated the training and testing process three times; each time, 13 different subjects were chosen for training and three for testing. With the subsequent repetitions, different testing subjects were selected and the previous ones were used for training. In particular, the testing subjects of the first experiment became training subjects for the second experiment, and the data of three of the previous training participants were employed for testing. Likewise, the third experiment was conducted with different testing and training participants. For each testing subject, 75 different dysarthric pronunciations of the vocabulary were employed for testing (i.e., three repetitions of each vocabulary word), in addition to five silence samples. Hence, the SI testing data set comprised 80 samples per speaker. No data of control speakers were applied in testing. Table VI shows the results obtained by applying the test data sets to both SI ASR systems. The evaluation parameters were measured for each dysarthric severity level (i.e., for each dysarthric subject) separately.

TABLE VI SPEAKER-INDEPENDENT DM-NSR AND DS-NSR RESULTS
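The subject rotation described above can be sketched as follows; the subject identifiers and the particular three-way grouping are placeholders illustrating the protocol, not the paper's actual assignment.

```python
# Sketch of the SI rotation: three rounds, each holding out a disjoint
# group of three dysarthric subjects for testing and training on the
# remaining thirteen. Subject identifiers are placeholders.
subjects = [f"S{i:02d}" for i in range(1, 17)]               # 16 subjects
test_groups = [subjects[0:3], subjects[3:6], subjects[6:9]]  # rotated hold-outs

for round_no, test_subjects in enumerate(test_groups, start=1):
    train_subjects = [s for s in subjects if s not in test_subjects]
    # Train a DM-NSR on train_subjects, then evaluate on the 80 samples per
    # held-out speaker (75 word pronunciations plus 5 silence samples).
    print(f"round {round_no}: train on {len(train_subjects)} subjects, "
          f"test on {test_subjects}")
```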

VI. DISCUSSION

In order to highlight the superiority of MVML ANN-based dysarthric ASR systems over MVSL-based systems, several ASR systems were provided and compared in detail. Among the 114 experiments, 96 were conducted to measure the performance of the proposed approach as a speaker-dependent dysarthric ASR and 18 as a speaker-independent one. In order to highlight the reliability of the results, Figs. 3 and 4 depict the distributions of the DM-NSR results for the speaker-dependent and speaker-independent experiments, respectively, in accordance with their standard deviations. The RRs and NRMSEs followed a normal distribution, because a minimum of 72.91% of the observations fall within one standard deviation of the mean.

Fig. 3. Speaker-dependent DM-NSR results distribution.

Moreover, Fig. 5 shows the DM-NSR performance improvements over the reference model. As shown, the SD DM-NSRs improved the average recognition rate by 15.89% and decreased the average NRMSE by 5.14%. For the speaker-independent ASR system, the DM-NSR increased the average recognition rate by 15.69% and decreased the error rate by 6.25%. The reduction of the error rates shows that the likelihood values produced by the proposed model are closer to the target values and more accurate; thus, the MVML-based ASR system is more effective in identifying dysarthric speech.

Fig. 5. DM-NSR performance improvements over the reference model.

In addition to its superior performance, another advantage of the DM-NSR is the flexibility to modify each ANN without affecting the other neural networks. This is helpful when the ASR system does not identify some of the words in the vocabulary but recognizes the rest appropriately. DM-NSRs offer the capability to improve the recognition rate for words that were recognized poorly without affecting the networks that are functioning properly. As an illustration, during the experiments we noticed some misclassified results; in particular, the ASR systems provided similar likelihood values for words with similar attributes. For the DS-NSRs, it was impossible to modify the neural network parameters to enhance the likelihood prediction for those incorrect classifications without affecting their correct results; any modification to the ANN structure would reduce the recognition rates from the ones reported here. On the other hand, since the DM-NSR provides a standalone ANN for estimating each digit likelihood, it was possible to customize the ANNs associated with the misclassified words without modifying the other ANNs. Thus, we were able to rectify the faulty networks by trial and error until the DM-NSR learnt those similar samples successfully.

Finally, since the proposed approach requires word-based acoustic features instead of phoneme-based features, there is no need for
speech segmentation; hence, applying the proposed approach is straightforward. Because the intelligibility of dysarthric speech is often very low, identifying the phonemes and labelling them in order to segment dysarthric speech samples is a difficult, error-prone, and time-consuming process. Nevertheless, speech segmentation is crucial for the assessment of phonatory dysfunction and dysarthria levels. Similarly, the performance of phoneme-based dysarthric ASR systems at the phonetic level can be improved by employing postprocessing techniques, such as those discussed in Section II.

Fig. 4. Speaker-independent DM-NSR results distribution.

VII. CONCLUSION

In this paper, we studied the application of M-N ANNs to provide a dysarthric ASR model. An isolated-word DM-NSR approach that improves on the performance of legacy MVSL ANN-based dysarthric ASR models was proposed. The proposed DM-NSR approach was evaluated in speaker-dependent and speaker-independent paradigms. Furthermore, in order to highlight the performance of the proposed model, DS-NSRs were also provided for comparative purposes. The results indicate that the DM-NSR provides better performance than the
MVSL-based ASR system; in particular, the DM-NSR trained as a speaker-independent ASR improved recognition rates for high, moderate, and low intelligibility dysarthric speech from 65.83%, 64.58%, and 48.75% to 83.00%, 79.16%, and 63.75%, respectively. Similarly, the SD DM-NSRs improved the mean recognition rate from 64.94% to 80.83%. The DM-NSRs recorded lower error rates in all of the experiments. Thus, it is concluded that the proposed approach handles dysarthric speech better than DS-NSRs.

REFERENCES

[1] K. Rosen and S. Yampolsky, "Automatic speech recognition and a review of its functioning with dysarthric speech," Augment. Alt. Commun., vol. 16, no. 1, pp. 48–60, 2000.
[2] P. D. Polur and G. E. Miller, "Effect of high-frequency spectral components in computer recognition of dysarthric speech based on a Mel-cepstral stochastic model," J. Rehabil. Res. Develop., vol. 42, no. 3, pp. 363–371, May–Jun. 2005.
[3] S.-A. Selouani, M. S. Yakoub, and D. O'Shaughnessy, "Alternative speech communication system for persons with severe speech disorders," EURASIP J. Adv. Signal Process., 2009.
[4] S. O. C. Morales and S. J. Cox, "Modelling errors in automatic speech recognition for dysarthric speakers," EURASIP J. Adv. Signal Process., pp. 1–14, 2009.
[5] P. C. Doyle, H. A. Leeper, A. L. Kotler, N. Thomas-Stonell, C. O'Neill, M. C. Dylke, and K. Rolls, "Dysarthric speech: A comparison of computerized speech recognition and listener intelligibility," J. Rehabil. Res. Develop., vol. 34, no. 3, pp. 309–316, Jul. 1997.
[6] L. Ferrier, H. Shane, H. Ballard, T. Carpenter, and A. Benoit, "Dysarthric speakers' intelligibility and speech characteristics in relation to computer speech recognition," Augment. Alt. Commun., vol. 11, no. 3, pp. 165–175, 1995.

[7] P. Kitzing, A. Maier, and V. L. Ahlander, "Automatic speech recognition (ASR) and its use as a tool for assessment or therapy of voice, speech, and language disorders," Logopedics Phoniatr. Vocol., vol. 34, no. 2, pp. 91–96, 2009.
[8] M. S. Hawley, P. Enderby, P. Green, S. Cunningham, S. Brownsell, J. Carmichael, M. Parker, A. Hatzis, P. O'Neill, and R. Palmer, "A speech-controlled environmental control system for people with severe dysarthria," Med. Eng. Phys., vol. 29, no. 5, pp. 586–593, Jun. 2007.
[9] K. Hux, J. Rankin-Erickson, N. Manasse, and E. Lauritzen, "Accuracy of three speech recognition systems: Case study of dysarthric speech," Augment. Alt. Commun., vol. 16, no. 3, pp. 186–196, 2000.
[10] E. Sanders, M. Ruiter, L. Beijer, and H. Strik, "Automatic recognition of Dutch dysarthric speech: A pilot study," in Proc. 7th Int. Conf. Spoken Language Process., Denver, CO, 2002, pp. 661–664.
[11] N. Talbot, "Improving the speech recognition in the ENABL project," KTH TMH-QPSR, vol. 41, no. 1, pp. 31–38, 2000.
[12] F. Rudzicz, "Using articulatory likelihoods in the recognition of dysarthric speech," Speech Commun., vol. 54, no. 3, pp. 430–444, 2012.
[13] P. D. Polur and G. E. Miller, "Investigation of an HMM/ANN hybrid structure in pattern recognition application using cepstral analysis of dysarthric (distorted) speech signals," Med. Eng. Phys., vol. 28, no. 8, pp. 741–748, Oct. 2006.
[14] M. Hasegawa-Johnson, J. Gunderson, A. Perlman, and T. Huang, "HMM-based and SVM-based recognition of the speech of talkers with spastic dysarthria," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Toulouse, France, 2006, pp. 1060–1063.
[15] Q. Zhang and S. Sun, "Multiple-view multiple-learner active learning," Pattern Recognit., vol. 43, no. 9, pp. 3113–3119, 2010.
[16] S. Sun and Q. Zhang, "Multiple-view multiple-learner semi-supervised learning," Neural Process. Lett., vol. 34, no. 3, pp. 229–240, 2011.
[17] S. Sun, "A survey of multi-view machine learning," Neural Comput. Appl., pp. 1–8, 2013.
[18] S. R. Shahamiri and S. S. B. Salim, "Real-time frequency-based noise-robust automatic speech recognition using multi-nets artificial neural networks: A multi-views multi-learners approach," Neurocomputing, vol. 129, no. 10, pp. 199–207, 2014.
[19] S. R. Shahamiri, W. M. N. W. Kadir, S. Ibrahim, and S. Z. B. Hashim, "An automated framework for software test oracle," Inf. Software Technol., vol. 53, no. 7, pp. 774–788, Jul. 2011.
[20] S. R. Shahamiri and S. S. B. Salim, "Artificial neural networks as speech recognisers for dysarthric speech: Identifying the best-performing set of MFCC parameters and studying a speaker-independent approach," Adv. Eng. Informat., vol. 28, no. 1, pp. 102–110, 2014.
[21] X. Menendez-Pidal, J. B. Polikoff, S. M. Peters, J. E. Leonzio, and H. T. Bunnell, "The Nemours database of dysarthric speech," in Proc. 4th Int. Conf. Spoken Language, Philadelphia, PA, 1996, vol. 3, pp. 1962–1965.
[22] P. Green, J. Carmichael, A. Hatzis, P. Enderby, M. Hawley, and M. Parker, "Automatic speech recognition with sparse training data for dysarthric speakers," in Proc. 8th Eur. Conf. Speech Commun. Technol., Geneva, Switzerland, 2003, pp. 1189–1192.
[23] E. Trentin and M. Gori, "A survey of hybrid ANN/HMM models for automatic speech recognition," Neurocomputing, vol. 37, pp. 91–126, 2001.
[24] G. Dede and M. H. Sazli, "Speech recognition with artificial neural networks," Digital Signal Process., vol. 20, no. 3, pp. 763–768, May 2010.
[25] G. Jayaram and K. Abdelhamied, "Experiments in dysarthric speech recognition using artificial neural networks," J. Rehabil. Res. Develop., vol. 32, no. 2, pp. 162–169, May 1995.
[26] F. Rudzicz, "Articulatory knowledge in the recognition of dysarthric speech," IEEE Trans. Audio, Speech, Language Process., vol. 19, no. 4, pp. 947–960, Apr. 2011.
[27] P. Raghavendra, E. Rosengren, and S. Hunnicutt, "An investigation of different degrees of dysarthric speech as input to speaker-adaptive and speaker-dependent recognition systems," Augment. Alt. Commun., vol. 17, no. 4, pp. 265–275, 2001.
[28] S. O. C. Morales and S. J. Cox, "Modelling confusion matrices to improve speech recognition accuracy, with an application to dysarthric speech," in Proc. 8th Annu. Conf. Int. Speech Commun. Assoc., Aug. 2007, pp. 277–280.


[29] S. O. C. Morales and S. J. Cox, "Application of weighted finite-state transducers to improve recognition accuracy for dysarthric speech," in Proc. 9th Annu. Conf. Int. Speech Commun. Assoc., Brisbane, Australia, 2008, pp. 1761–1764.
[30] S. C. Morales and S. Cox, "Modelling errors in automatic speech recognition for dysarthric speakers," EURASIP J. Adv. Signal Process., vol. 2009, no. 1, p. 308340, 2009.
[31] W. K. Seong, J. H. Park, and H. K. Kim, "Dysarthric speech recognition error correction using weighted finite state transducers based on context-dependent pronunciation variation," in Proc. 13th Int. Conf. Comput. Helping People Special Needs, Linz, Austria, 2012, pp. 475–482.
[32] W. Seong, J. Park, and H. Kim, "Dysarthric speech recognition error correction using weighted finite state transducers based on context-dependent pronunciation variation," in Computers Helping People with Special Needs, K. Miesenberger, A. Karshmer, P. Penaz, and W. Zagler, Eds. Berlin, Germany: Springer, Lecture Notes Comput. Sci., 2012, pp. 475–482.
[33] F. Rudzicz, "Correcting errors in speech recognition with articulatory dynamics," in Proc. 48th Annu. Meet. Assoc. Comput. Linguist., Uppsala, Sweden, 2010, pp. 60–68.
[34] H. V. Sharma and M. Hasegawa-Johnson, "State-transition interpolation and MAP adaptation for HMM-based dysarthric speech recognition," in Proc. NAACL HLT 2010 Workshop Speech Language Process. Assist. Technol., Los Angeles, CA, 2010, pp. 72–79.
[35] J.-P. Hosom, T. Jakobs, A. Baker, and S. Fager, "Automatic speech recognition for assistive writing in speech supplemented word prediction," in Proc. 11th Annu. Conf. Int. Speech Commun. Assoc., Makuhari, Japan, 2010, pp. 2674–2677.
[36] P. D. Polur and G. E. Miller, "Experiments with fast Fourier transform, linear predictive and cepstral coefficients in dysarthric speech recognition algorithms using hidden Markov model," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 13, no. 4, pp. 558–561, Dec. 2005.
[37] H. Kim, M. Hasegawa-Johnson, A. Perlman, J. Gunderson, T. Huang, K. Watkin, and A. S. Frame, "Dysarthric speech database for universal access research," in Proc. 9th Annu. Conf. Int. Speech Commun. Assoc., Brisbane, Australia, 2008, pp. 1741–1744.
[38] F. Rudzicz, G. Hirst, P. van Lieshout, G. Penn, F. Shein, A. Namasivayam, and T. Wolff, TORGO Database of Dysarthric Articulation. Philadelphia, PA: Linguistic Data Consortium, 2012.


[39] F. Rudzicz, A. Namasivayam, and T. Wolff, "The TORGO database of acoustic and articulatory speech from speakers with dysarthria," Language Resources Evaluat., vol. 46, no. 4, pp. 523–541, 2012.
[40] R. D. Kent, "Research on speech motor control and its disorders: A review and prospective," J. Commun. Disorders, vol. 33, no. 5, pp. 391–428, 2000.

Seyed Reza Shahamiri (M'13) received the B.S. and M.S. degrees in computer engineering from Islamic Azad University, Tehran, Iran, in 2004 and 2007, respectively, and the Ph.D. degree in computer science from Universiti Teknologi Malaysia, Skudai, Malaysia, in 2011. He is a Senior Lecturer in the Department of Software Engineering, University of Malaya, Kuala Lumpur, Malaysia. His current research interests are centered on complex pattern recognition, disabled speech processing, and intelligent software testing. He leads and teaches modules at both the undergraduate and postgraduate levels in computer science.

Siti Salwah Binti Salim received the Ph.D. degree in computer science from the University of Manchester Institute of Science and Technology, Manchester, U.K., in 1998. She is a Professor in the Department of Software Engineering, Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia. She supervises Ph.D. degree and M.S. degree students in the areas of requirements engineering, human–computer interaction, automatic speech recognition, component-based software development, and e-learning. She also leads and teaches modules at both the B.Sc. degree and M.Sc. degree levels in software engineering.
