
Journal of Neural Engineering J. Neural Eng. 11 (2014) 036010 (12pp)

doi:10.1088/1741-2560/11/3/036010

EEG classification in a single-trial basis for vowel speech perception using multivariate empirical mode decomposition

Jongin Kim 1, Suh-Kyung Lee 1 and Boreom Lee 1,2,3

1 Department of Medical System Engineering (DMSE), Gwangju Institute of Science and Technology (GIST), Gwangju, Republic of Korea
2 School of Mechatronics, Gwangju Institute of Science and Technology (GIST), Gwangju, Republic of Korea
3 Author to whom any correspondence should be addressed.

E-mail: [email protected]

Received 4 December 2013, revised 17 March 2014
Accepted for publication 19 March 2014
Published 8 May 2014

Abstract

Objective. The objective of this study is to find components that might be related to phoneme representation in the brain and to discriminate EEG responses for each speech sound on a trial basis. Approach. We used multivariate empirical mode decomposition (MEMD) and the common spatial pattern for feature extraction. We chose three vowel stimuli, /a/, /i/ and /u/, based on previous findings that the brain can detect changes in the formant frequency (F2) of vowels. EEG activity was recorded from seven native Korean speakers at the Gwangju Institute of Science and Technology. We applied MEMD over the EEG channels to extract speech-related brain signal sources, and looked for the intrinsic mode functions that were dominant in the alpha band. After the MEMD procedure, we applied the common spatial pattern algorithm to enhance the classification performance, and used linear discriminant analysis (LDA) as a classifier. Main results. The brain responses to the three vowels could be classified as one of the learned phonemes on a single-trial basis with our approach. Significance. The results of our study show that brain responses to vowels can be classified in single trials using MEMD and LDA. This approach may not only become a useful tool for the brain–computer interface but could also be used for discriminating the neural correlates of categorical speech perception.

Keywords: single-trial EEG classification, multivariate empirical mode decomposition, vowel speech perception, ASSR

(Some figures may appear in colour only in the online journal)

1. Introduction

Neurolinguistics is the study of the neural mechanisms of language processing. Paul Broca was the first to discover the connection between brain regions and the motor production of speech, and the area controlling speech production is known as Broca's area. Subsequently, Carl Wernicke proposed that different brain regions control different linguistic tasks, and discovered Wernicke's area, which spans the region between the temporal and parietal lobes and handles speech comprehension (Brown and Hagoort 2000, Caplan 1987).

At the level of speech perception, many theories have been developed and experiments conducted on phoneme representation, but how phonemes are encoded and accessed in the brain remains controversial. Although the perception of general sound is more or less continuous, it has been widely accepted through behavioral tasks (Johnson 2012, Liberman et al 1957) that the perception of speech sounds is categorical. For example, listeners can more easily discriminate acoustic differences between speech sounds when the sounds are located in different phonetic categories than when they are located in the same phonetic category, even if the acoustic differences are equivalent. Moreover, phonetic categories in adults depend on the listener's native language, while infants can discriminate




almost all phonetic boundaries used in human languages (Werker and Tees 1984). Owing to the remarkable development of brain imaging and electrophysiological techniques such as EEG and MEG, there has been increasing interest in finding the neural basis of categorical perception. Previous studies on the neural correlates of categorical perception have used the mismatch negativity (MMN). The MMN is an event-related potential (ERP) that occurs in response to an infrequent change in a repetitive sequence of stimuli such as sounds. It demonstrates the brain's capability to perform automatic comparisons between successive stimuli. Näätänen et al (1997) reported evidence for the existence of language-dependent phoneme (vowel) representations in the human brain by measuring the MMN with an oddball paradigm. They presented Finnish subjects with the Finnish phoneme prototype /e/ as the frequent stimulus, and other Finnish phoneme prototypes or a non-prototype as the infrequent stimulus. As a result, they found that the brain's automatic change detection response was weaker when the infrequent stimulus was a non-prototype than when it was a Finnish prototype (Näätänen et al 1997). Dehaene-Lambertz also reported that the MMN was enhanced by native phonemes (Dehaene-Lambertz 1997). The results of these studies indicate that the brain has language-specific neural representations in sensory memory centers (e.g., the auditory cortex in the left hemisphere). Most studies on phoneme representation in the brain try to elicit the MMN with an oddball paradigm to identify the brain's capability for phoneme change detection, averaging over multiple trials and subjects. It remains unclear, however, whether distinct phoneme responses can be found in single trials. In our present study, we intend to discriminate the brain responses to the three Korean vowels /a/, /i/ and /u/ for each trial, using an auditory steady-state response (ASSR)-based paradigm, pattern recognition and signal processing techniques. The objective of this study is to find components that might be related to phoneme representation in the brain and to discriminate EEG responses for each speech sound on a trial basis. We expect that our method, by virtue of its superior classification accuracy, could be utilized for various applications related to the brain–computer interface (BCI) and audio-verbal language rehabilitation. The paper is organized as follows: in section 2, we describe our speech stimuli and data collection process, as well as the algorithms we used in this study, such as multivariate empirical mode decomposition (MEMD) and the common spatial pattern (CSP). Section 3 provides the results of this experiment, and section 4 the interpretation of our results and future directions.

2. Methods

2.1. Stimulus selection

For evoking distinct EEG responses, the selection of stimuli is an important factor for consideration. According to Näätänen et al (1997), phonemes with different formant frequencies (F2) evoke different levels of MMN. Based on this finding, we chose three Korean vowels that have very distinct formant frequencies, which are the peaks found in the spectrum envelope of a sound (Benade 1990). Vowels can be categorized based on their articulatory characteristics, such as tongue body position (Peterson and Barney 1952). Due to such vocal tract configurations during production, each class of vowels is associated with consequent acoustic characteristics, e.g., high vowels with low first formants and back vowels with low second formants. The tongue body positions for /u/, /i/ and /a/ are high-back, high-front and low-back, respectively, and they therefore differ greatly in their formant frequencies (Stevens 2000) (see figure 1). Thus, the three vowels /a/, /i/ and /u/ were selected as our stimuli for evoking distinct brain responses (see figure 2). All of the vowels were obtained from the Naver standard pronunciation service (http://dic.naver.com/), which provides Korean speech sounds produced by Korean professional announcers and reviewed by the National Institute of the Korean Language over a six-month period.

Figure 1. Formant frequencies of vowels. Data points for /i/, /u/, /a/ and /æ/ are averages from Peterson and Barney (1952).

2.2. Data acquisition and experimental procedure

Seven healthy subjects were selected (native Korean male speakers with a mean age of 25.5 years) among the graduate students in the Department of Medical System Engineering (DMSE) and School of Mechatronics at Gwangju Institute of Science and Technology (GIST). None of the subjects had experienced a head injury or had a neurological disorder that could affect our experimental results. All but one subject (S1) were right-handed. Before the main experiment, a pre-test was conducted so that the subjects could familiarize themselves with the experimental protocol. All data were acquired at GIST DMSE, and a 64-channel EEG device by Electrical Geodesics, Inc. was used for the EEG recording. We followed the international 10–20 system, and the layout is available at ftp://ftp.egi.com. The sampling rate was set at 256 Hz, and all recordings were performed in a dimly lit room in the evening to prevent noise. We instructed the subjects to be seated in an armchair wearing a set of earphones (ER-4P, Etymotic Research). In order to avoid the subject's


Figure 2. The time course (first row) and the linear predictive coefficient (LPC) spectra (second row) of three Korean vowels used in our experiment. The peaks of LPC spectra refer to the formant frequencies of speech signals. Vowel /a/ shows peaks at F1 = 951 Hz, F2 = 1285 Hz, and F3 = 2722 Hz; vowel /i/ at F1 = 270 Hz, F2 = 2703 Hz, and F3 = 3430 Hz; and vowel /u/ at F1 = 258 Hz, F2 = 740 Hz, and F3 = 2819 Hz. All stimulus amplitudes were normalized to the range [−1, 1].

prediction of the following stimulus, the stimuli /a/, /i/ and /u/ were randomly presented to the subjects. A pre-stimulus interval of 0.5 s was set for baseline correction and a post-stimulus interval of 1.5 s for preventing an overlap of brain responses with successive stimuli (see figure 3). We acquired EEG data from 90 trials in a single session, and two sessions of the experiment were carried out for a total of 180 trials per subject.
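As an illustration of the trial structure described above, the following is a minimal sketch (in Python/NumPy, not the authors' code) of how continuous EEG could be cut into baseline-corrected epochs; the array names, onset handling and synthetic data are our assumptions.

```python
import numpy as np

FS = 256                          # sampling rate (Hz), as in the recording setup
PRE, STIM, POST = 0.5, 1.0, 1.5   # pre-stimulus, stimulus (~1 s), post-stimulus (s)

def epoch_trials(eeg, onsets):
    """Cut continuous EEG (channels x samples) into baseline-corrected trials.

    eeg    : 2-D array, shape (n_channels, n_samples)
    onsets : stimulus-onset sample indices, one per trial
    Returns an array of shape (n_trials, n_channels, n_epoch_samples).
    """
    pre = int(PRE * FS)
    post = int((STIM + POST) * FS)
    trials = []
    for t0 in onsets:
        epoch = eeg[:, t0 - pre:t0 + post].astype(float)
        baseline = epoch[:, :pre].mean(axis=1, keepdims=True)
        trials.append(epoch - baseline)          # subtract pre-stimulus mean
    return np.stack(trials)

if __name__ == "__main__":
    # synthetic example: 64 channels, 180 fake trials
    rng = np.random.default_rng(0)
    eeg = rng.standard_normal((64, FS * 600))
    onsets = np.arange(180) * (3 * FS) + FS
    X = epoch_trials(eeg, onsets)
    print(X.shape)   # (180, 64, 768): 0.5 s + 1 s + 1.5 s at 256 Hz
```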

Figure 3. Stimulus /a/, /i/ or /u/ was randomly presented during the stimulus session (approximately 1 s, depending on the length of the stimulus). A pre-stimulus interval (0.5 s) and a post-stimulus interval (1.5 s) were also assigned.

2.3. Overall signal processing procedure for classification

We applied an IIR band-pass filter (window: Butterworth, order: 5, bandwidth: 8–30 Hz) and a baseline correction using the pre-stimulus data to eliminate residual artifacts. Subsequently, we decomposed the processed EEG data into intrinsic mode functions (IMFs) using MEMD for the estimation of the frequency components. We labeled each IMF as corresponding to the beta or alpha band based on the result of the average power spectra analysis (see figure 7). We selected the IMFs that were dominant in the alpha band as features for classification, based on the result of section 3.2. To demonstrate the superiority of the MEMD algorithm for accurately estimating the frequency components in EEG, we compared the results of MEMD with those of the IIR band-pass filter and the continuous wavelet transform (CWT). In the case of the CWT, a Morlet wavelet was applied, and the scales corresponding to the alpha band were reconstructed. The detailed procedures of the wavelet transform and MEMD for the estimation of frequency components are described in sections 2.4 and 2.5, respectively. For enhancing

the classification performance, we used a CSP filter after MEMD and the comparative algorithms. As a classifier, we used linear discriminant analysis (LDA).
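The preprocessing filter described above can be sketched as follows; we assume SciPy, and the use of zero-phase filtering (filtfilt) is our assumption, since the paper specifies only the filter type, order and band.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 256  # Hz

def bandpass_8_30(eeg, fs=FS, order=5):
    """5th-order Butterworth IIR band-pass (8-30 Hz), applied channel-wise.

    Zero-phase filtering with filtfilt is an assumption here; only the filter
    type, order and band are stated in the text.
    """
    b, a = butter(order, [8.0, 30.0], btype="bandpass", fs=fs)
    return filtfilt(b, a, eeg, axis=-1)

# usage: filter each epoched trial before MEMD / CSP / LDA
trials = np.random.randn(10, 64, 768)   # (n_trials, n_channels, n_samples), synthetic
filtered = bandpass_8_30(trials)
```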

2.4. Estimation of frequency component using wavelet for comparison

To justify the utility of the IMF components as our feature vector, we also tested performance using the wavelet transform. The wavelet transform is a very popular method for estimating the time–frequency spectrum of a signal. The basic idea of the wavelet transform is that any general function can be expressed in terms of a particular set of basis functions. A signal decomposed by the wavelet transform can be reconstructed as a linear combination of the wavelet functions weighted by the wavelet coefficients. The main difference between the short-time Fourier transform (STFT) and the wavelet transform is that the


time–frequency aspect ratio is varied in the wavelet transform. This characteristic makes it possible for the wavelet transform to provide good time localization at high frequencies and good frequency localization at low frequencies. Also, compared with the STFT, applying this method to EEG can more selectively extract feature vectors related to the transient characteristics of the signal. In this study, we used the Morlet wavelet as the mother wavelet. The transform applies scaling functions and wavelet functions, which are related to low-pass and high-pass filters, respectively. By the successive application of high-pass filters (wavelet functions) and low-pass filters (scaling functions), the input signal is decomposed into different frequency bands (Subasi 2007). The wavelet coefficients computed with the Morlet wavelet allow the EEG data to be compactly represented in the time and frequency domains. In our study, to extract the components related to our stimuli, the scales corresponding to the alpha band were utilized based on the result of the grand average analysis. Detailed information about this result is provided in section 3.2.
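As a rough illustration of this comparison method, the sketch below computes a Morlet CWT, keeps the scales whose centre frequencies fall in the alpha band and sums them back into a time-domain signal. PyWavelets is assumed, and the simple real-part sum is only a crude stand-in for a proper inverse transform; the authors' exact scale selection and reconstruction are not specified.

```python
import numpy as np
import pywt

FS = 256  # Hz

def alpha_band_cwt(signal, fs=FS, fmin=8.0, fmax=13.0):
    """Morlet CWT of one channel; keep the scales whose centre frequency falls
    in the alpha band and crudely reconstruct a band-limited signal.
    """
    scales = np.arange(1, 128)
    coefs, freqs = pywt.cwt(signal, scales, "morl", sampling_period=1.0 / fs)
    keep = (freqs >= fmin) & (freqs <= fmax)
    # Summing the (real) coefficients over the kept scales is only a rough,
    # unnormalized approximation of an inverse transform.
    alpha_signal = np.real(coefs[keep]).sum(axis=0)
    return alpha_signal, coefs, freqs
```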

2.5. Estimation of frequency component using multivariate empirical mode decomposition (MEMD)

In order to understand MEMD, it is necessary to study the basic concept of empirical mode decomposition (EMD). EMD is a fully data-driven method for time–frequency analysis of non-stationary and nonlinear signals (Huang et al 1998). The main idea of EMD is an iterative sifting procedure which decomposes a signal into a sum of IMFs. There are two criteria for an IMF (Huang et al 1998):

(1) The numbers of extrema (maxima and minima) and zero-crossings are the same or differ at most by one.
(2) The mean of the envelope defined by the maxima (upper envelope) and the envelope defined by the minima (lower envelope) is 0 over the entire region; that is, the envelopes have to be symmetric with respect to zero.

Based on these criteria, an IMF can be extracted from a signal using the sifting process described below (Huang et al 1998):

(1) Detect all local extrema over all the time points of the signal x(t).
(2) Connect all local maxima to obtain an upper envelope using the cubic spline method.
(3) Repeat procedure 2 for the local minima to obtain a lower envelope.
(4) Calculate the mean of the upper and lower envelopes:

$$m_1(t) = (e_{\mathrm{up}}(t) + e_{\mathrm{low}}(t))/2. \qquad (1)$$

(5) Calculate the difference between the mean of the envelopes and the original data x(t):

$$d_1(t) = x(t) - m_1(t). \qquad (2)$$

(6) Check whether d1(t) satisfies the criteria of an IMF or not.
(7) If d1(t) does not satisfy the criteria of an IMF, set d1(t) to x(t), and repeat steps 1–6.
(8) This process is repeated until d1(t) satisfies the criteria of an IMF; the first signal to satisfy the criteria is called the first IMF, h1(t).

For extracting the remaining IMFs (the hi's), the same procedure is repeated for the residual signal, which is defined as the difference between the original signal and the extracted IMFs (h1, ..., h_{i-1}). The signal x(t) can then be expressed by EMD as

$$x(t) = \sum_{i=1}^{N} h_i(t) + r(t), \qquad (3)$$

where hi(t) is the ith IMF and r(t) is the residual signal.

EMD is only useful for analyzing univariate time-series signals, not multivariate time-series signals (Rehman and Mandic 2010). If EMD is used for multivariate time-series data, mode misalignment and mode mixing problems occur (Meng and Hualou 2012). Therefore, methods such as MEMD should be adopted for analyzing multivariate signals such as high-density multichannel EEG. Since local extrema are not defined for a multivariate signal, a critical procedure of MEMD is calculating the local mean (Rehman and Mandic 2010). MEMD obtains multiple multi-dimensional envelopes by projecting the multivariate signal along different directions. The local mean is generated by averaging these envelopes, but it is very difficult to select suitable direction vectors. The accuracy of the local mean estimation depends on the uniformity of the direction vector sampling. Direction vectors in an n-dimensional space can be represented as points on an (n − 1)-sphere, so the problem of finding direction vectors in the n-dimensional space can be changed into the problem of finding a uniform sampling scheme on the (n − 1)-sphere. There are two methods for sampling unit hyperspheres: (1) a uniform angular sampling method and (2) a quasi-Monte Carlo-based low-discrepancy sequence. The uniform angular sampling method is a convenient way to find direction vectors, but it does not generate a uniform sampling distribution. On the other hand, the quasi-Monte Carlo-based low-discrepancy sequence provides a more uniform sampling distribution. Monte Carlo methods are algorithms that approximate the integral of a function as the average of the function evaluated at sampled points. Unlike the general Monte Carlo method, which uses random sampling such as a pseudo-random sequence to model systems stochastically, the quasi-Monte Carlo method uses a low-discrepancy sequence, which provides better equidistribution in a given volume than a pseudo-random sequence. Discrepancy is a quantitative measure of the deviation from a uniform distribution of points in a given volume. According to Niederreiter (1992), the discrepancy of a set X = {x1, ..., xN} can be defined as

$$D_N(X) = \sup_{Q \in J} \left| \frac{C(Q;X)}{N} - \lambda_n(Q) \right|, \qquad (4)$$

where J is the set of n-dimensional intervals, C(Q; X) counts the number of points xi that fall into Q and λn(Q) is the n-dimensional volume (Lebesgue measure) of Q. By using the low-discrepancy sequence, the quasi-Monte Carlo method achieves faster convergence and better accuracy compared to the conventional Monte Carlo method.


Since the calculation of the local mean via envelope averaging can be changed into a numerical integration problem, the quasi-Monte Carlo method is a very suitable tool for accurately estimating the local mean for MEMD. The MEMD algorithm can be summarized as follows (Rehman and Mandic 2010):

(1) Select an appropriate point set by finding a uniform sampling scheme on the (n − 1)-sphere.
(2) Project the multi-dimensional signal $[x(t)]_{t=1}^{T}$ along the direction vectors found in step 1.
(3) Find the time points $t_i^{\theta_j}$ corresponding to the maxima of the projected data, where $\theta_j$ is the angle on the (n − 1)-sphere and j is the index of the direction vector.
(4) Interpolate $[t_i^{\theta_j}, x(t_i^{\theta_j})]$ to obtain the multivariate envelope curves $[e^{\theta_j}(t)]_{j=1}^{J}$.
(5) Estimate the mean of the envelope curves for the set of J direction vectors:

$$m(t) = \frac{1}{J} \sum_{j=1}^{J} e^{\theta_j}(t). \qquad (5)$$

(6) Calculate the 'detail', d(t) = x(t) − m(t).
(7) Repeat the previous steps until the 'detail' satisfies the criteria for a multivariate IMF.

The only difference between the criteria for multivariate IMFs and those for EMD IMFs is that, for multivariate IMFs, the number of extrema does not need to equal the number of zero crossings. In this study, we conducted MEMD and then selected the IMFs that were dominant in the alpha band. Matlab was used for conducting MEMD, and all the MEMD source codes we used were obtained from www.commsp.ee.ic.ac.uk/∼mandic (Rehman and Mandic 2010).
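To make the projection-based local-mean step above concrete, the following Python sketch estimates one multivariate local mean; it is not the authors' Matlab implementation. For brevity it uses random (Gaussian) direction vectors rather than the low-discrepancy sequence discussed above, and it omits boundary handling and the stopping criteria of the full sifting loop.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def hypersphere_directions(n_dirs, n_dim, seed=None):
    """Direction vectors on the (n_dim-1)-sphere.

    Normalized Gaussian samples are used here for brevity; MEMD proper uses a
    quasi-Monte Carlo low-discrepancy sequence for a more uniform covering.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((n_dirs, n_dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def multivariate_local_mean(x, n_dirs=64):
    """One MEMD-style local-mean estimate for a signal x of shape (T, n_dim).

    Project x along each direction, locate the projection maxima, interpolate
    the full multivariate signal through those points (one envelope per
    direction) and average the envelopes.
    """
    T, n_dim = x.shape
    t = np.arange(T)
    envelopes = []
    for d in hypersphere_directions(n_dirs, n_dim):
        proj = x @ d                                   # 1-D projection
        idx = argrelextrema(proj, np.greater)[0]       # projection maxima
        if len(idx) < 4:                               # too few extrema: skip direction
            continue
        idx = np.concatenate(([0], idx, [T - 1]))      # naive end-point handling
        spline = CubicSpline(idx, x[idx], axis=0)
        envelopes.append(spline(t))
    return np.mean(envelopes, axis=0)                  # assumes at least one envelope

# sifting would then iterate d(t) = x(t) - m(t) until d(t) is a multivariate IMF
```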

2.6. Common spatial pattern filter (CSP)

CSP filtering is widely used for extracting features from EEG data for BCI (Ramoser et al 2000, Park et al 2013). The purpose of the CSP algorithm is to find spatial filters that maximize the variance of the signals in one class while simultaneously minimizing the variance in the other class, for the discrimination of two populations. To accomplish this, the CSP algorithm constructs the spatial filters based on the simultaneous diagonalization of two covariance matrices. Let the single-trial EEG data be represented as an N × L matrix E, where N is the number of channels and L is the number of sample points. First, the EEG signals should be mean subtracted. The normalized spatial covariance of E can be acquired from

$$C = \frac{E E^{T}}{\mathrm{trace}(E E^{T})}, \qquad (6)$$

where T denotes the transpose operator and trace is the sum of the diagonal elements of the matrix. The spatial covariance $C_f$, $f \in \{/a/, /i/\}$, is obtained by averaging the normalized spatial covariance matrices over all trials of each task group, /a/ or /i/. The composite spatial covariance $C_c$ is calculated as

$$C_{/a/} + C_{/i/} = C_c. \qquad (7)$$

Using singular value decomposition (SVD), $C_c$ can be factored as

$$C_c = V_c \Lambda_c V_c^{T}, \qquad (8)$$

where $V_c$ is the matrix of eigenvectors and $\Lambda_c$ is the diagonal matrix of eigenvalues. By using the whitening transformation $P = \sqrt{\Lambda_c^{-1}}\, V_c^{T}$, all the eigenvalues of $P C_c P^{T}$ become equal to unity. If $S_{/a/} = P C_{/a/} P^{T}$ and $S_{/i/} = P C_{/i/} P^{T}$, the eigenvector matrices for $S_{/a/}$ and $S_{/i/}$ are identical. Therefore, the following equations are satisfied:

$$B^{T} S_{/a/} B = \Lambda_{/a/}, \quad B^{T} S_{/i/} B = \Lambda_{/i/} \quad \text{and} \quad \Lambda_{/a/} + \Lambda_{/i/} = I, \qquad (9)$$

where I is the identity matrix. In addition, when an eigenvector of $S_{/a/}$ has the largest eigenvalue, the corresponding eigenvector of $S_{/i/}$ has the smallest eigenvalue and vice versa, because the sum of the eigenvalues of the two groups is always unity. As a result, the final spatial filter that accomplishes the purpose of CSP is given by $W = (B^{T} P)^{T}$. Single-trial EEG data E are projected as

$$Y = W E. \qquad (10)$$

If the eigenvalues $\Lambda_{/a/}$ are sorted in descending order, the time-series signals that have the maximum difference in variance between the two populations are contained in the first and last m row vectors $y_p$ ($p = 1, \ldots, m, N - m + 1, \ldots, N$) of Y. We obtained the features as follows:

$$f_p = \log\!\left(\frac{\mathrm{var}(y_p)}{\sum_{i \in \{1,\ldots,m,\,N-m+1,\ldots,N\}} \mathrm{var}(y_i)}\right), \qquad (11)$$

where var(·) denotes the variance.
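A compact sketch of the CSP computation described above (equations (6)–(11)), written in Python/NumPy as an illustration rather than the authors' implementation:

```python
import numpy as np

def csp_filters(trials_a, trials_i, m=2):
    """Spatial filters for two classes of trials, each of shape (n_trials, N, L).

    Steps follow the text: normalized covariances (6), composite covariance (7),
    whitening of its eigendecomposition (8), joint diagonalization (9) and
    selection of the first and last m filters.  Regularization is omitted.
    """
    def mean_cov(trials):
        covs = []
        for E in trials:
            E = E - E.mean(axis=1, keepdims=True)      # mean-subtract each channel
            C = E @ E.T
            covs.append(C / np.trace(C))               # eq. (6)
        return np.mean(covs, axis=0)

    C_a, C_i = mean_cov(trials_a), mean_cov(trials_i)
    C_c = C_a + C_i                                    # eq. (7)
    lam, V = np.linalg.eigh(C_c)                       # eq. (8)
    P = np.diag(1.0 / np.sqrt(lam)) @ V.T              # whitening transform
    S_a = P @ C_a @ P.T
    lam_a, B = np.linalg.eigh(S_a)
    B = B[:, np.argsort(lam_a)[::-1]]                  # sort by descending eigenvalue
    # filters taken as the rows of B^T P (the columns of W = (B^T P)^T in the text)
    W = B.T @ P
    keep = list(range(m)) + list(range(W.shape[0] - m, W.shape[0]))
    return W[keep]

def csp_features(W, E):
    """Log-variance features of eq. (11) for a single trial E (N x L)."""
    Y = W @ E
    var = Y.var(axis=1)
    return np.log(var / var.sum())
```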

2.7. Linear discriminant analysis (LDA)

LDA is a well-known method in the machine learning field. The purpose of LDA is to find the optimal linear combination of features that differentiates the classes. LDA assumes that the distribution of the observations (feature vectors) is Gaussian with an equal covariance matrix $\Sigma$ for both classes (i.e. class i and class j). Therefore, the likelihood of an observation x belonging to class $y_i$ in m dimensions can be computed as

$$p(x|y_i) = \mathcal{N}(u_i, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - u_i)^{T} \Sigma^{-1} (x - u_i)\right), \qquad (12)$$

where $u_i$ and $\Sigma$ are the mean vector and covariance matrix of the observations with class $y_i$, respectively. In the case of two-class classification, the discriminant function can be computed as

$$g_{ij} = g_i(x) - g_j(x) = \ln\!\left(p(x|y_i)P(y_i)\right) - \ln\!\left(p(x|y_j)P(y_j)\right) = w^{T} x + b. \qquad (13)$$



The projection vector w and bias term b are computed using the training set. The projection vector is computed as

$$w^{T} = \left(\Sigma^{-1}(u_i - u_j)\right)^{T}. \qquad (14)$$

Finally, the LDA classifier assigns a test observation vector to a class label according to sign($w^{T} x + b$) (Venables and Ripley 2002).
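For illustration, a minimal two-class LDA following equations (12)–(14) could look like the sketch below; in practice a library implementation would normally be used, and the handling of the class priors here is a standard textbook assumption rather than a detail given in the paper.

```python
import numpy as np

class SimpleLDA:
    """Two-class LDA with a shared covariance matrix (eqs (12)-(14))."""

    def fit(self, X, y):
        X0, X1 = X[y == 0], X[y == 1]
        self.u0, self.u1 = X0.mean(axis=0), X1.mean(axis=0)
        # pooled (shared) covariance estimate
        S = np.cov(X0, rowvar=False) * (len(X0) - 1) \
          + np.cov(X1, rowvar=False) * (len(X1) - 1)
        Sigma_inv = np.linalg.pinv(S / (len(X) - 2))
        self.w = Sigma_inv @ (self.u0 - self.u1)                 # eq. (14)
        p0, p1 = len(X0) / len(X), len(X1) / len(X)
        self.b = -self.w @ (0.5 * (self.u0 + self.u1)) + np.log(p0 / p1)
        return self

    def predict(self, X):
        # class 0 if w.x + b > 0, otherwise class 1 (the sign rule of the text)
        return (X @ self.w + self.b <= 0).astype(int)
```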

3. Results

3.1. Simulation

In this section, we examine whether MEMD is an appropriate tool for estimating the frequency components of interest. For the simulation, we generated three different non-stationary signals, x1, x2 and x3, by combining a multi-phasic signal with Gaussian white noise (−20 dB). We considered the multi-phasic wave as an interesting frequency component analogous to speech-related EEG components buried in noise, and conducted MEMD over the three simulated signals. The simulation signals were decomposed into IMF components by MEMD. As shown in figure 4, IMFs 1–6 appear to be random noise, whereas IMFs 7–10 are similar to sine waves. We computed the power spectra of all IMFs and identified that IMFs 7–10 are dominant in the 2–13 Hz range. A notable point is that the combination of IMFs 7–10 is almost identical to the original signal. The relative root mean square errors (RRMSE) of the reconstructed x1, x2 and x3 are 21.39%, 20.19% and 21.39%, respectively. For comparison, we reconstructed the simulation signals using the wavelet coefficients in the same sub-band (2–13 Hz); the RRMSE of the reconstructed x1, x2 and x3 using wavelet coefficients are 48.77%, 39.37% and 39.25%, respectively. RRMSE is calculated as

$$\mathrm{RRMSE} = \frac{\mathrm{RMS}(x_{\mathrm{rec}} - x_{\mathrm{orig}})}{\mathrm{RMS}(x_{\mathrm{orig}})} \times 100\%, \qquad (15)$$

where x_rec is the reconstructed signal and x_orig is the clean original simulation signal. According to this result, it is reasonable to expect that MEMD may extract the signal sources of interest from noisy EEG signals more successfully than wavelet coefficients.
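A small sketch of the RRMSE measure of equation (15), with a hypothetical reconstruction from selected IMFs shown in the comments:

```python
import numpy as np

def rrmse(x_rec, x_orig):
    """Relative root-mean-square error of eq. (15), in percent."""
    rms = lambda s: np.sqrt(np.mean(np.square(s)))
    return rms(x_rec - x_orig) / rms(x_orig) * 100.0

# e.g. reconstruction from selected IMFs (here IMFs 7-10, as in the simulation):
# imfs has shape (n_imfs, n_samples) from the MEMD of one noisy channel
# x_rec = imfs[6:10].sum(axis=0)
# print(rrmse(x_rec, clean_signal))
```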

3.2. Time–frequency analysis of grand average and single trials

As a preliminary investigation, we averaged all the raw EEG data over subjects for each stimulus to identify speech-related brain activities. We computed the time–frequency representation (TFR) of the grand-averaged EEG data. To analyze the statistical significance of the observed power spectrum across the three experimental conditions, an ANOVA test was conducted. Afterwards, the power of the statistically significant areas (p < 0.01) was represented with its original value and the power of the other areas was replaced with zero. As shown in figure 5, much of the alpha band (8–13 Hz) power in both the right and left temporal areas is statistically different between the three experimental conditions, and distinct responses for the three vowels can also be observed in the temporal region of the topography. The TFR of single-trial EEG data was computed in the same way as the grand average analysis. A statistically significant difference between the three stimuli is also observed at the single-trial level. The significant regions differed in each trial, but much of the alpha band power in almost all trials is statistically different between the three experimental conditions (see figure 6). Therefore, we assumed that alpha band components are related to our stimuli.

3.3. Classification results of real EEG recordings

For preprocessing, an IIR band-pass filter was applied to the raw EEG data (window: Butterworth, order: 5, bandwidth: 8–30 Hz), and MEMD was conducted for the estimation of the frequency components. Based on the result of section 3.2, that alpha band components are related to our stimuli, we tried to find the IMF components related to the alpha band during the analysis period. Figure 7 shows the average power spectral density of the first four IMFs (IMFs 1–4) over all trials. In all subjects, IMFs 1 and 2 are usually dominant in the alpha band; however, IMF 1 also contains most of the beta band. We therefore used IMF 2, or the summation of IMF 2 and IMF 3, for further analysis. The feature vectors f = [f1, f2, f7, f8] with m = 2 were computed for classification using the CSP algorithm over the whole 1 s period. The feature vectors, obtained from equation (11), were classified using LDA. For estimating the classification accuracy appropriately, ten-fold cross validation was applied. All the data were divided into training and test sets, and only the training set was used for constructing the classifier. We repeated this procedure ten times with different random partitions. The classification accuracies for the seven subjects are shown in table 1, where the highest classification accuracies among the three methods are indicated in bold. Feature vectors were acquired using CSP, equation (11) with m = 2, in all three methods (MEMD, wavelet transform, IIR band-pass filter). All the overall classification accuracies were always well above chance level. MEMD features always showed better overall classification accuracies than CWT and the IIR band-pass filter. A one-tailed t-test was conducted between MEMD and the other methods; the result showed that the classification performance of MEMD is significantly better than both CWT and the IIR band-pass filter (p < 0.01). These results clearly justify our use of MEMD for the single-trial EEG classification of vowel speech perception.

4. Discussion and conclusion

Most studies about phoneme representation in the brain report the occurrence of the MMN (mismatch negativity), which can only be seen by averaging signals from many trials and provides information about the difference between the brain's response components to a 'deviant' phoneme and a 'standard' phoneme at a certain time point. This study, however, used an ASSR-based paradigm which provided time-series information about each phoneme response, and with the application of appropriate signal processing algorithms, these responses could be classified as one of the learned phonemes (Korean vowels /a/, /i/ and /u/) on a single-trial basis. Classification of EEG responses to speech sounds in single trials was very challenging due to a low SNR of the EEG



Figure 4. MEMD results for three different simulation data sets. The simulation data consist of common components and stimulus-specific components. Stimulus-specific components appear only in a specific part of the signals and are different for all three simulations. On the other hand, common components are present at all time points of the signals and do not vary across simulations. The common components consist of a 2 Hz sine wave overlapped with a 5 Hz sine wave. The stimulus-specific components are as follows: in X1, a 10 Hz sine wave overlapped with a 12 Hz sine wave; in X2, a 9 Hz sine wave overlapped with an 11 Hz sine wave; in X3, an 11 Hz sine wave overlapped with an 8 Hz sine wave. Gaussian white noise (−20 dB) was added to all simulation data. (a) The first and second rows represent the clean simulation signals and the noise-contaminated simulation signals, respectively. (b) The first to tenth rows represent the intrinsic mode functions (IMFs). (c) MEMD successfully extracts the sources of interest (multi-phasic waves) with the original frequencies of each signal. The original signals were reconstructed by combining IMFs 7–10. The reconstructed signal (blue solid line) is compared with the original signal (red dashed line).

signals; therefore signal processing techniques were necessary to extract the phoneme responses from the EEG. Since there is no clear standard for selecting the signals of interest related to speech

stimuli, it was imperative that we defined a standard for selecting phoneme responses prior to feature classification. Considering that most of the alpha band power of



Figure 5. (a) Time–frequency representation (TFR) of EEG signals averaged over all subjects using a Morlet mother wavelet. The EEG signals were obtained from eight electrodes in the right and left temporal areas during each of the three experimental conditions (vowels /a/, /i/ and /u/). The TFRs are plotted for the first 1 s after stimulus onset and for the frequency range of 4–40 Hz. After performing an ANOVA test on the TFRs of the three experimental conditions, the power of a statistically significant region is represented with its original value, while the power of other regions is labeled zero (p < 0.01). The brain response to vowel /a/ shows high activation during 0–0.2 s and 0.4–0.8 s time intervals in the right and left temporal areas; the response to /i/ at 0.4–0.9 s in the right and left temporal areas; and the response to /u/ at 0.2–0.4 s in the right temporal and at 0.4–0.6 s in the left temporal area (red rectangles). (b) Topography of alpha band power (8–13 Hz) also shows high activation in the temporal areas for the same time intervals, as mentioned above.

grand-averaged EEG was statistically different between the three experimental conditions (see figures 5 and 6), we assumed the components that are dominant in the alpha band to be speech-related responses. First, we tested a feature extraction method that uses the complete fingerprint in the wavelet domain, i.e. the wavelet coefficients over the whole frequency and time interval (Bostanov 2004). However, features extracted from the complete fingerprint yielded worse classification rates than features using only the alpha band in our study. We speculate that this is because the brain response related to auditory linguistic processing is mainly concentrated in the alpha band, so that features using only the alpha band showed performance superior to the complete fingerprint. Many previous studies have also reported that auditory linguistic processing is closely related to the alpha-band components of EEG; that is, alpha-band EEG oscillations were found during speech perception and production (Krause et al 1994, Bonte et al 2009, Obleser and Weisz 2012, Tremblay et al 2008, Kawasaki et al 2013). For this reason, we used features extracted from the alpha band in this study. In the case of MEMD, the power of the IMF 2 components was dominant in the alpha band in all subjects (see figure 7).
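The alpha-dominance check described above could be sketched as follows; the Welch PSD and the band-power ratio used as a ranking criterion are our assumptions, chosen to mirror the inspection of figure 7 rather than to reproduce the authors' exact procedure.

```python
import numpy as np
from scipy.signal import welch

FS = 256  # Hz

def alpha_dominant_imfs(imfs, fs=FS, band=(8.0, 13.0)):
    """Rank IMFs by the fraction of their power lying in the alpha band.

    imfs has shape (n_imfs, n_samples).  Returns IMF indices sorted from
    most to least alpha-dominant.
    """
    scores = []
    for imf in imfs:
        f, pxx = welch(imf, fs=fs, nperseg=min(256, len(imf)))
        in_band = (f >= band[0]) & (f <= band[1])
        scores.append(pxx[in_band].sum() / pxx.sum())
    return np.argsort(scores)[::-1]
```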


Figure 6. Time–frequency representation (TFR) of EEG signals using a Morlet mother wavelet for S1 on a single-trial basis. The EEG signals were obtained from eight electrodes in the right and left temporal areas during each of the three experimental conditions (vowels /a/, /i/ and /u/). The TFRs are plotted for the first 1 s after stimulus onset and for the frequency range of 4–40 Hz. After performing an ANOVA test on the TFRs of the three experimental conditions, the power of a statistically significant region is represented with its original value, while the power of other regions is labeled zero (p < 0.01). Statistically significant regions are usually distributed in the theta (4–8 Hz) and alpha (8–12 Hz) bands (red rectangles).



Figure 7. Average power spectral density of the first four IMFs over all trials for each subject. IMFs 1 and 2 are usually dominant in most of the beta band and the alpha band, respectively.

Therefore, we considered IMF 2 to be the prime component for classifying speech-related EEG data. Next, we tried to use not only linear discriminant analysis (LDA) but also a Gaussian mixture hidden Markov model (GM-HMM) as a classifier, because we supposed that the temporal information of the EEG could be useful for classification in this study. An interesting finding is that the number of states in the GM-HMM did not affect the classification accuracies. Moreover, the GM-HMM with one state showed slightly better classification accuracy than those with other numbers of states. This means that temporal information is not an effective feature for classifying vowel perception in this study, which obviates the need for a GM-HMM. We speculate that the short length of the vowels makes temporal information ineffective for discrimination in our experiment. In addition, we also performed quadratic discriminant analysis (QDA) in our study. As a result, LDA showed better classification accuracies than QDA; therefore, we chose LDA as the classifier because of its simplicity and high accuracy in this study.

Since EMD uses an adaptive basis and the frequency is derived by differentiation, EMD does not need spurious harmonics to represent nonlinear waveform deformations, nor is it subject to the uncertainty-principle limitation on time–frequency resolution (Huang 2005). For multivariate time-series data, however, EMD is not a suitable algorithm because of mode misalignment and mode mixing problems (Meng and Hualou 2012). Therefore, MEMD is a powerful tool for nonlinear and non-stationary multivariate data analysis. To demonstrate the utility of MEMD in our experiment, we compared MEMD with both the CWT and the IIR band-pass filter. In overall classification accuracy, IMF 2 shows performance superior to both the CWT and the IIR band-pass filter (see table 1). This result suggests that MEMD is more useful for extracting speech-related components from EEG signals than the other algorithms, and that IMF 2 can be a suitable feature vector for classifying phoneme responses in single trials. The superiority of MEMD for BCI was already demonstrated by Park et al, who successfully classified brain responses to limb movement imagination using MEMD (Park et al 2013).


Table 1. Classification accuracies in % for MEMD, CWT and IIR band-pass filtering (BPF) (8–12 Hz). Accuracies are expressed as mean and associated standard deviation.

Subject | Feature | /a/ vs /i/    | /a/ vs /u/    | /i/ vs /u/    | Overall
S1      | BPF     | 62.50 ± 10.51 | 63.82 ± 14.63 | 55.97 ± 16.42 | 60.76 ± 12.25
S1      | CWT     | 65.77 ± 9.97  | 61.45 ± 17.23 | 58.18 ± 11.02 | 61.80 ± 11.61
S1      | IMF2    | 71.41 ± 10.62 | 67.46 ± 9.78  | 67.89 ± 8.09  | 68.92 ± 8.59
S2      | BPF     | 60.36 ± 10.24 | 60.00 ± 8.23  | 62.95 ± 17.81 | 61.10 ± 11.09
S2      | CWT     | 59.25 ± 15.12 | 61.52 ± 7.11  | 63.11 ± 16.60 | 61.29 ± 12.44
S2      | IMF2    | 72.16 ± 11.85 | 69.02 ± 10.61 | 69.41 ± 7.59  | 70.19 ± 9.89
S3      | BPF     | 73.82 ± 15.86 | 52.31 ± 12.47 | 52.20 ± 13.13 | 59.44 ± 12.62
S3      | CWT     | 67.91 ± 13.91 | 56.35 ± 8.33  | 56.02 ± 14.56 | 60.09 ± 11.76
S3      | IMF2    | 74.00 ± 12.29 | 65.90 ± 9.10  | 70.12 ± 13.75 | 70.00 ± 11.17
S4      | BPF     | 68.91 ± 10.83 | 63.45 ± 12.16 | 58.43 ± 12.90 | 63.59 ± 11.26
S4      | CWT     | 68.13 ± 15.63 | 64.38 ± 11.11 | 56.00 ± 10.61 | 62.83 ± 11.65
S4      | IMF2    | 75.85 ± 14.38 | 71.29 ± 10.11 | 67.50 ± 10.49 | 71.54 ± 10.76
S5      | BPF     | 60.00 ± 14.14 | 44.91 ± 12.05 | 68.18 ± 10.08 | 57.69 ± 11.69
S5      | CWT     | 61.00 ± 15.95 | 55.91 ± 23.07 | 63.73 ± 13.37 | 60.21 ± 16.56
S5      | IMF2+3  | 71.00 ± 14.49 | 65.45 ± 12.76 | 71.27 ± 16.62 | 69.24 ± 13.72
S6      | BPF     | 62.79 ± 17.16 | 55.39 ± 14.97 | 60.73 ± 19.76 | 59.63 ± 16.62
S6      | CWT     | 65.88 ± 17.64 | 63.46 ± 11.47 | 51.00 ± 16.79 | 60.11 ± 14.50
S6      | IMF2    | 71.43 ± 13.29 | 69.68 ± 11.59 | 69.45 ± 15.79 | 70.18 ± 13.15
S7      | BPF     | 58.48 ± 13.78 | 65.17 ± 13.96 | 57.31 ± 10.34 | 60.32 ± 11.84
S7      | CWT     | 49.09 ± 17.12 | 53.89 ± 11.82 | 61.28 ± 16.40 | 54.75 ± 15.03
S7      | IMF2    | 69.21 ± 15.78 | 64.86 ± 13.12 | 68.95 ± 15.46 | 67.67 ± 13.87

An ERP and fMRI study (also an MMN paradigm) that used synthetic sine-wave analogues of speech (Dehaene-Lambertz et al 2005) demonstrated that switching perception from a 'non-speech mode' to a 'speech mode' enhances the listener's ability to discriminate an acoustic change that crosses one's native phonemic boundary, with enhanced activation in the posterior parts of the left superior temporal sulcus for the 'speech mode' and enhanced activation in the supramarginal gyrus for a phonemic change in the 'speech mode'. Their results provided evidence for the presence of distinct networks for phonemic versus non-phonemic (acoustic) processing. In our study, the fact that we used actual speech sounds makes it very likely that the recognition of our stimuli as vowels (i.e., speech sounds), in addition to their formant characteristics, contributed to the fingerprints of the EEG responses used for classification. In future studies, we expect to apply our approach to brain–computer interface (BCI) technology aimed at allowing patients who cannot move spontaneously to communicate by using their brain signals (Brunner et al 2010). Although an EEG-based BCI has generally not been considered practical due to the use of unpleasant gels, recently developed sensing technology can acquire EEG data solely by using capacitive electrodes; therefore, an EEG-based BCI is, at present, a method most applicable to everyday life (Baek et al 2013). Almost all of the current EEG-based BCI paradigms are based on visual stimuli or visual feedback, so these paradigms cannot be adopted for patients who have lost visual function due to diseases or accidents (Kim et al 2011). Additionally, a visual stimulus-based paradigm evokes a high electrooculogram (EOG), which may contaminate the EEG and worsen the stability of the system. For this reason, there has been increasing interest in the ASSR (auditory steady-state response)-based paradigm. An ASSR-based BCI system generally uses pure tone sounds for evoking distinct brain responses. However,

classifying ASSR responses poses great difficulty because they are much smaller than the responses to visual stimuli. Besides, the motor imagery paradigm is also widely used for BCI because of its simple task. However, all conventional BCI paradigms have disadvantages that include a slow communication rate and massive subject training. If it were possible to directly determine the intention of the user using the speech and language processing of the brain, the performance of the BCI system could be dramatically improved, and the need for subject training would decrease (Pei et al 2011). Therefore, many researchers want to utilize the language processing of the brain for BCI systems (Deng et al 2010, Brumberg et al 2010, DaSalla et al 2009). Nevertheless, since language and speech processing in the brain is very complex, more sophisticated algorithms should be utilized for extracting speech-processing-related features from brain signals. In this study, we successfully classified responses to three vowels with remarkable accuracy. Our method of investigation shows the possibility of using vowel speech sounds instead of pure tone sounds in an ASSR-based BCI system. Considering the results of our study, MEMD-based classification extracted the speech-related features in the brain more precisely than conventional methods such as CWT and BPF. Therefore, if the MEMD-based feature extraction method is used for a BCI system based on speech and language processing, we expect that the BCI system might show better classification performance compared to a conventional BCI system. Next, our studies could also be used in clinical fields such as diagnosis and rehabilitation for people with speech and language disorders. Froud and Khamis-Dakwar presented randomized sequences of consonant–vowel (CV) syllables in two oddball paradigms, phonemic (/ba/, /pa/) and allophonic (/pa/, /pha/), to five children with childhood apraxia of speech (CAS) and five children without CAS (Froud and Khamis-Dakwar 2012). They found that in the phonetic contrast


experiment, the MMN response to deviant sounds was observed only for the children without CAS and not for the children with CAS; however, in the allophonic experiment, the MMN response was not elicited in the children without CAS, while the CAS group showed an MMN-like response. Based on this study, we can expect that different responses to speech sounds in an ASSR-based paradigm would be observed between subjects with CAS and subjects without CAS. Moreover, an application to EEG biofeedback teaching tools, which may enable the speech-impaired to shift their EEG signals toward normal, could be made possible through further development of our study.

Acknowledgments

This research was supported by the Pioneer Research Center Program through the National Research Foundation of Korea funded by the Ministry of Science, ICT & Future Planning (grant no. 2012-0009462).

References

Baek H J, Kim H S, Heo J, Lim Y G and Pak K S 2013 Brain–computer interfaces using capacitive measurement of visual or auditory steady-state responses J. Neural Eng. 10 024001
Benade A H 1990 Fundamentals of Musical Acoustics (Dover Books on Music) 2nd edn (New York: Dover)
Bonte M, Valente G and Formisano E 2009 Dynamic and task-dependent encoding of speech and voice by phase reorganization of cortical oscillations J. Neurosci. 29 1699–706
Bostanov V 2004 BCI competition 2003—data sets Ib and IIb: feature extraction from event-related brain potentials with the continuous wavelet transform and the t-value scalogram IEEE Trans. Biomed. Eng. 51 1057–61
Brown C M and Hagoort P 2000 The Neurocognition of Language (New York: Oxford University Press)
Brumberg J S, Nieto-Castanon A, Kennedy P R and Guenther F H 2010 Brain–computer interfaces for speech communication Speech Commun. 52 367–79
Brunner P, Joshi S, Briskin S, Wolpaw J R, Bischof H and Schalk G 2010 Does the 'P300' speller depend on eye gaze? J. Neural Eng. 7 056013
Caplan D 1987 Neurolinguistics and Linguistic Aphasiology (Cambridge: Cambridge University Press)
DaSalla C S, Kambara H, Sato M and Koike Y 2009 Single-trial classification of vowel speech imagery using common spatial patterns Neural Netw. 22 1334–9
Dehaene-Lambertz G 1997 Electrophysiological correlates of categorical phoneme perception in adults NeuroReport 8 919–24
Dehaene-Lambertz G, Pallier C, Serniclaes W, Sprenger-Charolles L, Jobert A and Dehaene S 2005 Neural correlates of switching from auditory to speech perception Neuroimage 24 21–33
Deng S, Srinivasan R, Lappas T and D'Zmura M 2010 EEG classification of imagined syllable rhythm using Hilbert spectrum methods J. Neural Eng. 7 046006
Froud K and Khamis-Dakwar R 2012 Mismatch negativity responses in children with a diagnosis of childhood apraxia of speech (CAS) Am. J. Speech-Lang. Pathol. 21 302–12
Huang N E 2005 Hilbert–Huang Transform and Its Applications (Singapore: World Scientific)
Huang N E, Shen Z, Long S R, Wu M C, Shih H H, Zheng Q, Yen N-C, Tung C C and Liu H H 1998 The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis Proc. R. Soc. Lond. A 454 903–95
Johnson K 2012 Acoustic and Auditory Phonetics (New York: Wiley)
Kawasaki M, Yamada Y, Ushiku Y, Miyauchi E and Yamaguchi Y 2013 Inter-brain synchronization during coordination of speech rhythm in human-to-human social interaction Sci. Rep. 3 1692
Kim D-W, Hwang H-J, Lim J-H, Lee Y-H, Jung K-Y and Im C-H 2011 Classification of selective attention to auditory stimuli: toward vision-free brain–computer interfacing J. Neurosci. Methods 197 180–5
Krause C M, Lang H A, Laine M, Helle S I, Kuusisto M J and Pörn B 1994 Event-related desynchronization evoked by auditory stimuli Brain Topogr. 7 107–12
Liberman A M, Harris K S, Hoffman H S and Griffith B C 1957 The discrimination of speech sounds within and across phoneme boundaries J. Exp. Psychol. 54 358–68
Meng H and Hualou L 2012 Adaptive multiscale entropy analysis of multivariate neural data IEEE Trans. Biomed. Eng. 59 12–5
Näätänen R et al 1997 Language-specific phoneme representations revealed by electric and magnetic brain responses Nature 385 432–4
Niederreiter H 1992 Random Number Generation and Quasi-Monte Carlo Methods (Philadelphia, PA: Society for Industrial and Applied Mathematics)
Obleser J and Weisz N 2012 Suppressed alpha oscillations predict intelligibility of speech and its acoustic details Cereb. Cortex 22 2466–77
Park C, Looney D, Rehman N, Ahrabian A and Mandic D P 2013 Classification of motor imagery BCI using multivariate empirical mode decomposition IEEE Trans. Neural Syst. Rehabil. Eng. 21 10–22
Pei X, Barbour D, Leuthardt E and Schalk G 2011 Decoding vowels and consonants in spoken and imagined words using electrocorticographic signals in humans J. Neural Eng. 8 046028
Peterson G E and Barney H L 1952 Control methods used in a study of the vowels J. Acoust. Soc. Am. 24 175
Ramoser H, Muller-Gerking J and Pfurtscheller G 2000 Optimal spatial filtering of single trial EEG during imagined hand movement IEEE Trans. Rehabil. Eng. 8 441–6
Rehman N and Mandic D P 2010 Multivariate empirical mode decomposition Proc. R. Soc. Lond. A 466 1291–302
Stevens K N 2000 Acoustic Phonetics (Current Studies in Linguistics) (Cambridge, MA: MIT Press)
Subasi A 2007 EEG signal classification using wavelet feature extraction and a mixture of expert model Expert Syst. Appl. 32 1084–93
Tremblay P, Shiller D M and Gracco V L 2008 On the time-course and frequency selectivity of the EEG for different modes of response selection: evidence from speech production and keyboard pressing J. Clin. Neurophysiol. 119 88–99
Venables W and Ripley B D 2002 Modern Applied Statistics with S-PLUS 4th edn (New York: Springer)
Werker J F and Tees R C 1984 Cross-language speech perception: evidence for perceptual reorganization during the first year of life Infant Behav. Dev. 7 49–63

