Eur Arch Otorhinolaryngol (2015) 272:3391–3399 DOI 10.1007/s00405-015-3708-4

LARYNGOLOGY

Exploring the feasibility of smart phone microphone for measurement of acoustic voice parameters and voice pathology screening

Virgilijus Uloza1 • Evaldas Padervinskis1 • Aurelija Vegiene1 • Ruta Pribuisiene1 • Viktoras Saferis2 • Evaldas Vaiciukynas3 • Adas Gelzinis3 • Antanas Verikas3,4



Received: 20 April 2015 / Accepted: 30 June 2015 / Published online: 11 July 2015 © Springer-Verlag Berlin Heidelberg 2015

Abstract The objective of this study was to evaluate the reliability of acoustic voice parameters obtained using smart phone (SP) microphones and to investigate the utility of SP voice recordings for voice screening. Voice samples of the sustained vowel /a/ obtained from 118 subjects (34 normal and 84 pathological voices) were recorded simultaneously through two microphones: an oral AKG Perception 220 microphone and an SP Samsung Galaxy Note 3 microphone. Acoustic voice signal data were measured for fundamental frequency, jitter and shimmer, normalized noise energy (NNE), signal to noise ratio and harmonic to noise ratio using Dr. Speech software. Discriminant analysis-based correct classification rate (CCR) and random forest classifier (RFC) based equal error rate (EER) were used to evaluate the feasibility of acoustic voice parameters for classifying normal and pathological voice classes. The Lithuanian version of the Glottal Function Index (GFI_LT) questionnaire was utilized

for self-assessment of the severity of voice disorder. The correlations of acoustic voice parameters obtained with the two types of microphones were statistically significant and strong (r = 0.73–1.0) for all measurements. When classifying into normal/pathological voice classes, the O-NNE revealed a CCR of 73.7 % and the pair of SP-NNE and SP-Shimmer parameters revealed a CCR of 79.5 %. Fusion of the results obtained from SP voice recordings and GFI data provided a CCR of 84.6 %, and the RFC revealed an EER of 7.9 %. In conclusion, measurements of acoustic voice parameters using the SP microphone were shown to be reliable in clinical settings, demonstrating a high CCR and low EER when distinguishing normal and pathological voice classes, and validate the suitability of the SP microphone signal for the task of automatic voice analysis and screening.

Keywords Acoustic analysis • Voice screening • Smart phone


Introduction


Dysphonia caused by laryngeal dysfunction may be related to benign, malignant, behavioral, and neurologic factors. Estimates of the prevalence of voice disorders in the general population vary from 6 to 9 % [1, 2]. In the United States, voice problems affect approximately one out of 13 adults annually [3]. Correct diagnosis of a laryngeal/voice disorder is an essential step towards its appropriate treatment and controlling costs. However, increased time from the first primary care visit to the first otolaryngology evaluation (laryngoscopy) is rather often associated with a change of the initial diagnosis and treatment, and with increased health care costs [4]. Clinical

1 Department of Otolaryngology, Lithuanian University of Health Sciences, Eiveniu 2, 50009 Kaunas, Lithuania
2 Department of Physics, Mathematics and Biophysics, Lithuanian University of Health Sciences, Kaunas, Lithuania
3 Department of Electric Power Systems, Kaunas University of Technology, Kaunas, Lithuania
4 Department of Intelligent Systems, Halmstad University, Halmstad, Sweden



diagnostics of laryngeal/voice disorders is rather complex and based on a multidimensional approach including perception of voice changes, video laryngostroboscopy, acoustic voice analysis, measurement of voice aerodynamics, and subjective rating of voice by the patient [5]. If necessary, the diagnosis is finalized by the results of histological examination. Therefore, earlier and qualified otolaryngology examination involving the basic measurements mentioned above may reduce health care expenditures in the evaluation and management of patients with diverse laryngeal/voice disorders, including laryngeal carcinoma. Automated acoustic analysis-based voice screening could be one potential approach helping primary care physicians and other public health care services to identify the patients who require early otolaryngological referral, thereby improving the diagnostics and management of patients with laryngeal/voice disorders. The main goal of automated pathological voice/speech detection systems is to categorize any input voice as either normal or pathological [6]. Currently, there is an increasing demand for robust measures of voice quality. However, a comprehensive, systematic and routine measurement of acoustic voice parameters for diagnostic and/or voice screening purposes, or for following treatment, is only possible in hospitals with voice laboratory facilities [7]. One possible solution providing automated acoustic analysis-based voice screening could be the use of telemedicine and/or telephony-based voice pathology assessment using various automated analysis algorithms. Several earlier studies showed that the performance of speech and speaker recognition systems decreased when processing telephone-quality signals compared with systems utilizing high-quality recordings [8].
However, more recent studies highlighted the real possibility of cost-effective remote detection and assessment of voice pathology over telephone channels, reaching normal/pathological voice classification accuracy close to 90 % [6, 9–11]. Current advancement in digital technology has widened access to portable devices capable of recording acoustic signals in high-quality audio formats and transmitting the digitized audio files via computer networks. The high sampling rate (48.0–90.0 kHz) afforded by contemporary models of smart phones may prove to be an important aspect, making them an easily accessible audio recording tool for collecting voice recordings that preserve sufficient acoustic detail for voice analysis and monitoring [12]. As a result, some sporadic reports regarding the applicability and effectiveness of iPhone-based voice recordings for acoustic voice assessment have already appeared in the literature [12]. A more recent study by Mat Baki et al. demonstrated that voice recordings performed with an iPod's internal microphone and analyzed with the OperaVox™ software application installed on an iPod touch 4th generation


were statistically comparable to the "gold standard", i.e., the Multidimensional Voice Program (MDVP, KayPentax, NJ, USA) [7]. Specific voice-related questionnaire data are also an important information source and may contain additional information that is not present in the acoustic or visual modalities. In 2005, Bach et al. developed and validated the Glottal Function Index (GFI) questionnaire, an easily self-administered and reliable 4-item battery designed to assess the presence and degree of vocal dysfunction in adults [13]. Subsequently, a culturally adapted and validated Lithuanian version of the GFI (GFI_LT) was confirmed to be a valid and reliable tool for self-assessment of the severity of voice disorders in Lithuanian-speaking patients [14]. Analysis of voice-related questionnaire data can also be used for voice screening purposes; however, only a few attempts in this field have been made [15, 16]. Nevertheless, the results of these studies demonstrated that questionnaire data may contain information that is significant for voice classification into normal/pathological classes. Therefore, the aim of the present study was to evaluate the reliability of acoustic voice parameters obtained simultaneously using oral and smart phone microphones, and to investigate the utility of combined use of the SP microphone signal and GFI questionnaire data for voice categorization for voice screening purposes.

Methods

Ethical considerations

This reliability study was approved by the Local Ethics Committee (reference number: P2-24/2013). The study group consisted of 118 individuals examined at the Department of Otolaryngology of the Lithuanian University of Health Sciences, Kaunas, Lithuania. The normal voice subgroup was composed of 34 selected healthy volunteers who considered their voice to be normal and had no history of chronic laryngeal diseases or other long-lasting voice disorders. No pathological alterations in the larynx were found during video laryngostroboscopy performed with an XION EndoSTROB DX device (XION GmbH, Berlin, Germany) using a 70° rigid endoscope. Acoustic voice signal parameters of these normal voice subgroup subjects, obtained using Tiger Electronics (Seattle, WA) Dr. Speech software (Voice Assessment, Version 3.0), were within the normal range [17]. The pathological voice subgroup consisted of 84 patients who represented a rather common, clinically discriminative group of laryngeal diseases including mass


lesions of the vocal folds (nodules, polyps, cysts, papillomata, keratosis, and carcinoma), paralysis and reflux laryngitis. Demographic data of the total study group and diagnoses of the pathological voice subgroup are presented in Table 1.

Voice recordings

The mixed-gender database of voice recordings used in this study contained 118 digital voice recordings of sustained phonation of the vowel sound /a/. Voice samples obtained from each subject were recorded in a sound-proof booth simultaneously through two microphones: an oral cardioid AKG Perception 220 (AKG Acoustics, Vienna, Austria) microphone and the internal microphone of a Samsung Galaxy Note 3 smart phone. Both microphones were placed side by side at a 10.0 cm distance from the mouth (the subjects were seated with a head rest), keeping an approximately 90° microphone-to-mouth angle. The subjects were asked to phonate the sustained vowel /a/ at comfortable pitch and loudness for at least 5 s. The voice signal from the oral microphone was recorded in the "wav" file format using Audacity® software (http://audacity.sourceforge.net/) at a rate of 44,100 samples per second, with 16 bits allocated per sample. An external M-Audio (Cumberland, RI) sound card was used for digitization of the voice recordings. SP recordings were made in the "wav" file format using the Smart Voice Recorder application (Smartmob Development, http://recorder.smartmobdev.com).

Glottal function index questionnaire

Each participant of the study (normal and pathological voice groups) filled in the GFI_LT questionnaire at baseline, along with the voice recordings, at least 1 week before treatment.
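The recording protocol above (mono "wav", 44,100 samples per second, 16 bits per sample, at least 5 s of sustained phonation) can be checked programmatically before a file is passed to analysis. The sketch below is illustrative and not part of the study's pipeline; it uses only the Python standard library, and the synthetic 180 Hz tone (roughly the mean F0 of the study group) is a hypothetical stand-in for a real recording.

```python
import math
import struct
import wave

RATE = 44100      # samples per second, as used for the recordings
MIN_DUR = 5.0     # the protocol asks for at least 5 s of phonation

def write_tone(path, seconds=5.0, f0=180.0, rate=RATE):
    """Write a synthetic vowel-like tone as a 16-bit mono 'wav' file."""
    n = int(seconds * rate)
    with wave.open(path, "wb") as w:
        w.setnchannels(1)          # mono
        w.setsampwidth(2)          # 16 bits per sample
        w.setframerate(rate)
        frames = b"".join(
            struct.pack("<h", int(20000 * math.sin(2 * math.pi * f0 * i / rate)))
            for i in range(n)
        )
        w.writeframes(frames)

def check_recording(path):
    """Return (rate, bits, seconds, ok); ok flags files unsuitable for analysis."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        bits = 8 * w.getsampwidth()
        seconds = w.getnframes() / rate
    ok = rate >= 44100 and bits == 16 and seconds >= MIN_DUR
    return rate, bits, seconds, ok

write_tone("sample.wav")
print(check_recording("sample.wav"))  # (44100, 16, 5.0, True)
```

A real deployment would run such a check on each SP upload before the 5 s analysis segment is cut from the onset of phonation.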


Acoustic analysis

To ensure correct comparison of the acoustic voice signals recorded with the two different microphones, equal segments of the recordings of the sustained vowel /a/ (5 s duration starting from the onset of phonation) from each recording session were analyzed using Tiger Electronics (Seattle, WA) Dr. Speech software (Voice Assessment, Version 3.0). Acoustic voice signal data were measured for fundamental frequency (F0), percent jitter and shimmer, normalized noise energy (NNE), signal to noise ratio (SNR) and harmonic to noise ratio (HNR).

Statistics

Statistical analysis was performed using IBM SPSS Statistics for Windows, Version 20.0 (IBM Corp., Armonk, NY). Data are presented as mean ± standard deviation (SD). Student's t test was used for testing hypotheses about the equality of means. The size of the differences among the mean values of the groups was evaluated by estimation of the type II error β. The size of a difference was considered significant if β ≤ 0.2 (i.e., post hoc power of the statistical test ≥ 0.8) at type I error α = 0.05. Correlations among acoustic voice parameters were evaluated using Pearson correlation coefficients (r). The level of statistical significance for hypothesis testing was 0.05. The level of agreement between acoustic voice parameters measured from recordings made with the oral and SP microphones was evaluated visually using Bland–Altman plots. This type of analysis is suitable when neither way of recording can be considered a 'gold standard'. It also makes the comparison scale-sensitive: correlation does not require that the compared measurements are on the same scale, so samples in poor agreement may in fact show high correlation [18]. The Bland–Altman

Table 1 Demographic data of the study group

Diagnosis                              Total number   Gender                          Age (years)
                                       (n = 118)      Female (n = 73)  Male (n = 45)  Mean   SD
Normal voice                           34             23               11             41.8   16.96
Nodules, cysts                         16             14               2              34.6   14.48
Polyps                                 26             15               11             45.9   10.76
Carcinoma, keratosis, papillomatosis   21             11               10             50.5   13.25
Vocal fold paralysis                   9              6                3              54.7   11.93
Reflux laryngitis                      10             3                7              52     14.02
Dysphonia, presbylaryngis              2              1                1              61.5   24.75

SD standard deviation


plot, also known as the Tukey mean-difference plot, is a scatterplot of two variables: the average of the two measurements on the horizontal axis and the difference between the two measurements on the vertical axis. The plot shows the amount of disagreement between the two measures (via differences) and indicates how this disagreement relates to the magnitude of the measurements [19].

Classifiers

Linear discriminant analysis (LDA) was performed to determine the suitability of acoustic voice parameters for discriminating the normal and pathological voice groups and for selecting an optimum set of parameters for the classification task. Performance of the LDA was summarized by the correct classification rate (CCR). Stepwise feature selection was used for the LDA, and the CCR was obtained using leave-one-out validation. A random forest classifier (RFC) was used to evaluate the feasibility of acoustic voice parameters for discriminating subjects into normal and pathological voice classes. The RFC is a committee of decision trees built using different bootstrap samples of the original data and random subsets of features [20]. Out-of-bag validation in the RFC is done similarly to the leave-one-out strategy: each data instance is classified only by the trees that did not have this instance in their bootstrap sample. The out-of-bag equal error rate (EER) is obtained after thresholding soft decisions at the specific operating point where the error for one class becomes equal to the error for the other class or, in accuracy terms, where specificity equals sensitivity. Methodological recommendations for voice pathology detection advise using the detection error trade-off (DET) curve because, due to its logarithmic axes, such a curve tends to be linear, which allows comparison of several systems at a glance more easily than with the receiver operating characteristic (ROC) curve [21]. Nevertheless, measurement of the area under the curve (AUC) from the ROC plot can be valuable for objective comparisons.
DET, EER, ROC and AUC were estimated from out-of-bag data using the convex hull approximation of the BOSARIS toolkit [22]. The audio and questionnaire modalities were combined at the feature level by a simple concatenation of acoustic parameters and responses to questionnaire items into a single feature vector. A conditional RFC was used to obtain an unbiased measure of variable importance [23, 24].
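The out-of-bag EER just described can be illustrated with a small threshold sweep. This is a simplified sketch, not the BOSARIS convex-hull procedure used in the study; the scores and labels below are synthetic stand-ins for out-of-bag soft decisions, and the variable names are hypothetical.

```python
def eer(scores, labels):
    """Equal error rate by a simple threshold sweep.

    scores: higher = more 'pathological'; labels: 1 pathological, 0 normal.
    Sweeps thresholds and returns the error at the point where the
    false-alarm rate (normals above threshold) and the miss rate
    (pathologicals at or below threshold) are closest to equal.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best_gap, best_err = 2.0, None
    for t in sorted(set(scores)):
        miss = sum(1 for s in pos if s <= t) / len(pos)  # 1 - sensitivity
        fa = sum(1 for s in neg if s > t) / len(neg)     # 1 - specificity
        if abs(miss - fa) < best_gap:
            best_gap, best_err = abs(miss - fa), (miss + fa) / 2
    return best_err

# Feature-level fusion in the study is plain concatenation before training,
# e.g. fused = acoustic_params + gfi_responses (hypothetical names).
labels = [1] * 5 + [0] * 5
scores = [0.9, 0.8, 0.7, 0.6, 0.2] + [0.1, 0.15, 0.3, 0.25, 0.65]
print(eer(scores, labels))  # 0.2
```

With real out-of-bag probabilities from the forest, the same sweep yields the operating point at which specificity equals sensitivity.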

Results

In Table 2, the mean values and standard deviations of the acoustic voice parameters obtained with both the oral and SP microphones in the normal voice subgroup are presented.


Statistically significant differences were revealed for all measured acoustic voice parameters except shimmer and F0. The mean values of the acoustic voice parameters tended to be higher for the oral microphone recordings. Differences in the acoustic voice parameters reflecting voice signal turbulences were mostly within the 3.4–9.5 % range; however, for jitter this difference reached 19.9 %. Table 3 presents the mean values and standard deviations of the acoustic voice parameters in the pathological voice subgroup. No statistically significant differences between the means of the acoustic voice parameters obtained with the oral and SP microphones were found for the jitter, shimmer and F0 parameters. Mean values of the acoustic voice parameters reflecting voice signal turbulences (NNE, HNR and SNR) differed statistically significantly, within the 8.3–19.0 % range. Results of the paired correlation analysis are presented in Table 4. Generally, the statistical analysis showed significant strong correlations (r = 0.78–0.91) among the measured instrumental voice parameters reflecting pitch and amplitude perturbations (jitter and shimmer) and the measurements of voice signal turbulences (NNE, HNR and SNR) obtained from both the oral and SP microphones. An exception was the moderate correlation between the jitter measurements (r = 0.63 in the pathological voice group and r = 0.67 in the total group). F0 registered with the oral and SP microphones was almost identical and correlated perfectly (r = 1.0; p < 0.01). Bland–Altman analysis, shown in Fig. 1, revealed that some disagreement for F0 still exists: one normal subject had an SP-based measurement higher by ~20 Hz, and one pathological subject had an oral microphone-based measurement higher by ~20 Hz. Apart from these outliers, the remaining differences were within the smaller (-10; 10) Hz range, and the mean of the differences was at zero for both types of subjects.
Jitter and shimmer measurements also had means of differences around zero; however, the variance of the measurements was increasingly higher for pathological cases: higher measured values showed higher disagreement, and jitter for three pathological subjects was evaluated as higher by the oral microphone than by the SP. The means of differences for the measurements of voice signal turbulences (NNE, SNR and HNR) were not at zero. Differences in the NNE measurements were distinct from the differences in the other turbulence measurements in that stronger disagreement was observed for normal subjects, with the SP indicating higher NNE values. Meanwhile, differences in SNR and HNR varied more for pathological subjects, and recordings by the oral microphone indicated higher SNR and HNR values, especially for pathological subjects. Table 5 presents the results of classification of the voice signal into two classes, i.e., normal and pathological voice. As the

Table 2 Comparison of the means of acoustic voice parameters obtained from oral and smart phone microphones in the normal voice subgroup

Parameter   Oral mean (SD)    SP mean (SD)     P        β**     Difference
                                                               Absolute   %
Jitter      0.26 (0.14)       0.21 (0.08)      0.003*           0.05       19.9
Shimmer     2.22 (0.99)       2.13 (0.95)      0.452            0.088      4.0
NNE         -12.23 (4.63)     -11.07 (4.49)    0.016*   0.308   1.16       9.5
HNR         24.78 (3.34)      23.94 (3.78)     0.027*   0.389   0.845      3.4
SNR         23.31 (3.43)      22.2 (3.7)       0.002*   0.108   1.109      4.8
F0          198.05 (50.11)    197.87 (50.28)   0.833    0.138   0.181      0.1

SD standard deviation, O oral microphone, SP smart phone microphone
* Statistically significant difference
** Computed at α = 0.05

Table 3 Comparison of the means of acoustic voice parameters obtained from oral and smart phone microphones in the pathological voice subgroup

Parameter   Oral mean (SD)    SP mean (SD)     P        β**     Difference
                                                               Absolute   %
Jitter      0.63 (0.63)       0.55 (0.46)      0.132            0.085      13.5
Shimmer     4.00 (2.18)       4.21 (2.17)      0.261            -0.204     -5.1
NNE         -6.89 (4.86)      -5.58 (4.24)     0.000*   0.00    1.308      19.0
HNR         19.69 (5.35)      18.1 (5.36)      0.000*   0.004   1.583      8.0
SNR         18.25 (5.17)      16.74 (5.15)     0.000*   0.007   1.509      8.3
F0          179.98 (50.32)    179.15 (50.67)   0.102            0.840      0.5

SD standard deviation, O oral microphone, SP smart phone microphone
* Statistically significant difference
** Computed at α = 0.05

Table 4 Paired correlations (Pearson's r) between acoustic voice parameters obtained with the oral and smart phone microphones

Acoustic voice parameters    Normal voice group   Pathological voice group   All data
O-Jitter & SP-Jitter         0.75*                0.63*                      0.67*
O-Shimmer & SP-Shimmer       0.76*                0.73*                      0.78*
O-NNE & SP-NNE               0.83*                0.91*                      0.91*
O-HNR & SP-HNR               0.83*                0.84*                      0.87*
O-SNR & SP-SNR               0.85*                0.83*                      0.87*
O-F0 & SP-F0                 1*                   1*                         1*

O oral microphone, SP smart phone microphone
* p < 0.01


Fig. 1 Bland–Altman plots for measurements of acoustic voice parameters. The horizontal solid line denotes the mean of differences and the horizontal dashed lines correspond to ±2 standard deviations from the mean of differences. Dark color and squares correspond to pathological subjects; bright color and circles correspond to normal subjects. Mean on the X axes is the calculated average value for the pair of measurements: (oral microphone + SP)/2

outcome of the LDA of the separate acoustic voice parameters, the consequent CCRs discriminating the normal and pathological voice subgroups were determined. As follows from Table 5, for the oral microphone, O-NNE was the most discriminative parameter and provided a CCR of 73.7 %. For the SP microphone, a pair of acoustic voice parameters, i.e., SP-Shimmer and SP-NNE, provided a CCR of 79.5 %.

Table 5 CCR achieved by the LDA when classifying into normal and pathological voice classes using acoustic voice parameters obtained from the oral and smart phone microphones and GFI data

Microphones    Parameters           CCR (%)
Oral           O-NNE                73.7
SP             SP-Shimmer, SP-NNE   79.5
Oral and GFI   O-NNE, GFI           85.1
SP and GFI     SP-NNE, GFI          83.8

CCR correct classification rate, O oral microphone, SP smart phone microphone, GFI glottal function index

LDA fusing the entire set of acoustic voice parameters and GFI data selected an optimum pair of parameters discriminating the normal and pathological voice subgroups. For the oral microphone this pair included O-NNE and GFI, achieving a CCR of 85.1 %, and for the SP microphone the pair included SP-NNE and GFI, achieving a CCR of 83.8 %. Consequently, the combination of acoustic voice parameters and GFI data increased the CCR when discriminating normal and pathological voice classes for both the oral and SP microphone voice recordings. Results of the RFC performance when classifying data into normal and pathological voice classes using acoustic voice parameters obtained from the oral and SP microphones and GFI data are summarized in Fig. 2. As shown in Fig. 2, the oral microphone (EER = 29.78 %) was outperformed by the SP microphone (EER = 21.32 %); however, the GFI items (EER = 10.15 %) proved to be an even better single non-invasive modality for voice pathology detection. Fusing the audio data with the responses to GFI items improved detection further, with the fusion of the SP microphone and GFI being the most successful, achieving the best overall EER of 7.94 %. Further combination of both microphones and the GFI data could not outperform this result. Judging from the DET curve, fusing acoustic voice parameters with GFI data is efficient not only around the EER operating point, but also has the highest specificity (lowest false alarm probability) near the high-sensitivity (low miss probability) mode of operation, which is appealing for initial voice screening.

Fig. 2 Detection performance of the random forest classifier: DET curves (left) and ROC curves (right). O oral microphone, SP smart phone microphone, EER equal error rate, AUC area under the curve, GFI glottal function index
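The high-sensitivity operating mode mentioned above can be made concrete: for screening, one fixes a minimum sensitivity and picks the threshold that maximizes specificity subject to that constraint. The sketch below is illustrative only; the scores and labels are synthetic stand-ins, and the function is not part of the study's toolchain.

```python
def screening_threshold(scores, labels, min_sensitivity=0.95):
    """Pick a decision threshold for a screening setting.

    scores: higher = more 'pathological'; labels: 1 pathological, 0 normal.
    Among thresholds that keep sensitivity (pathological detection rate)
    at or above `min_sensitivity`, choose the one with the highest
    specificity. Returns (threshold, sensitivity, specificity).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best = None
    for t in sorted(set(scores)):
        sens = sum(1 for s in pos if s >= t) / len(pos)
        spec = sum(1 for s in neg if s < t) / len(neg)
        if sens >= min_sensitivity and (best is None or spec > best[2]):
            best = (t, sens, spec)
    return best

# Synthetic scores: pathological voices tend to score higher.
labels = [1] * 4 + [0] * 6
scores = [0.9, 0.85, 0.7, 0.55] + [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
print(screening_threshold(scores, labels, min_sensitivity=1.0))
```

With these toy data the sweep settles on a threshold of 0.55, which catches every pathological case while still rejecting five of the six normal ones; a missed pathology is costlier than a false referral in initial screening, which is why the sensitivity floor comes first.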

Discussion

Automated acoustic analysis of voice is increasingly used in voice clinics for the collection of objective non-invasive voice data, for documenting and quantifying dysphonia changes and the outcomes of therapeutic and/or phonosurgical treatment of voice problems [17, 25–29], and for screening for laryngeal disorders [9, 10, 30, 31]. One of the most important factors determining the reliability and practical utility of screening and categorization of voice disorders is voice recordings of acceptable quality. Therefore, the type and technical characteristics of the microphone may determine the final results of acoustic voice analysis [32]. Results of the present study revealed strong, statistically significant intercorrelations (r = 0.78–0.91), with a small exception for jitter (r = 0.68), between acoustic voice parameters obtained using the standard oral cardioid and SP microphones, thus confirming the acceptability of SP microphones for acoustic voice measurements in clinical settings and/or for screening purposes. Moreover, for the F0 data there was perfect agreement between the two microphones' recordings in our series. Despite the statistically significant differences among the mean values of some acoustic voice parameters found in this study, with the oral microphone showing a slight tendency towards higher mean values, these differences were only in the range of 3.4–9.5 %. Some exception was
shown for jitter in the normal voice group (difference 19.9 %) and for NNE in the pathological voice group (difference 19.0 %). However, these differences between acoustic voice parameters obtained with the different microphones ultimately had no significant impact on the accuracy of classification into normal/pathological voice classes. Acoustic voice parameters were more useful for voice pathology detection with the RFC when estimated from recordings made with the SP microphone, compared with the standard microphone voice recordings. For example, SP-based jitter was found to be the most important variable for the RFC after the GFI items. In this study, the use of the sustained vowel /a/ was motivated by the fact that steady-state phonations (i.e., with time and frequency invariance) are simple and time effective, reduce variance, feature a simple acoustic structure, and provide reliable detection and computation of acoustic features [10, 28, 33, 34]. It was demonstrated in previous studies that the vowel /a/ achieves the lowest equal error rate in laryngeal pathology detection [35]. Moreover, sustained vowels are not influenced by speech rate and stress; they typically do not contain fast voice onsets and terminations, voiceless phonemes, or prosodic fluctuations in F0 and amplitude [12]. Ultimately, they are relatively free from influences related to different languages and could therefore be considered universal and suitable for voice screening purposes [33]. When planning the design of the present study, it was presumed that the combination of automated acoustic analysis of the sustained vowel /a/ and voice-related questionnaire data would improve the discrimination of normal and pathological voice classes. In this study, the combined use of acoustic voice analysis results and GFI_LT questionnaire data revealed evident benefits in discriminating the normal and pathological voice groups. To the best of our knowledge,


this has been presented for the first time. The discriminant analysis determined the O-NNE and SP-NNE parameters as optimal, providing CCRs of 73.7 and 79.5 %, respectively, when classifying normal and pathological voice samples. However, fusion of the results obtained from voice recordings and GFI_LT data increased the CCR to 84.2 % for the oral microphone voice recordings and to 84.6 % for the SP microphone recordings. Furthermore, fusing the audio data with the responses to GFI_LT items improved detection further, with the fusion of the SP microphone and GFI being the most successful, achieving the best overall EER of 7.94 %; including both microphones besides the GFI_LT could not outperform this result. Noteworthy is the fact that, in the task of distinguishing between the normal and pathological voice classes, the GFI_LT data outperformed the acoustic data when using the RFC. This is not surprising, since our previous investigations have shown that questionnaire data may carry more information relevant for the classification task than acoustic data [16]. On the other hand, one can expect higher classification accuracy from the RFC when using more parameters to represent the acoustic data than the few computed by the Dr. Speech software and used in the present study. Moreover, the relatively high discriminative power of the GFI_LT data is an encouraging result for developers of future web-based voice screening systems, because such a sensor-independent data source of high discriminative power may lessen possible acoustic parameter-dependent differences in the sensitivity of a combined classifier built using data of both types (acoustic parameters and questionnaire data). This would be of great importance if different voice recording devices, i.e., different smart phones or different microphones, were to be in use. Some limitations of the present study must be considered, because only the Dr. Speech system, registering a rather limited number of clinically comprehensible acoustic voice parameters reflecting perturbation (jitter and shimmer) and turbulent noise variables (NNE, HNR, SNR) in the voice signal, was used [26]. This limitation of the analysis system presumably reduces the accuracy of classification into normal and pathological voice classes. Therefore, future investigations should concentrate on the utility of a large variety of voice signal feature types in classifying the voice into healthy and different pathological voice classes, using sophisticated contemporary methods of automated voice analysis [31, 36, 37]. It will also be of great importance to analyze how well the SP microphone performs in an ordinary environment and in the presence of background noise. Results of the present study confirmed that SP-based voice recordings provide suitable quality for automated acoustic voice analysis. Moreover, the portability, patient/user-friendliness, low cost and applicability of SP-based devices


not only in clinical settings give them greater utility, and they may therefore be preferred by patients and clinicians for voice data collection in both home and clinical settings [7]. It is important to point out that SP-based voice recordings and automatic voice analysis are not considered a substitute for clinical examination; however, they are seen as having a potential role in screening for laryngeal diseases and in the subsequent referral of selected individuals for earlier otolaryngological examination and visualization of the larynx (video laryngostroboscopy, indirect/direct microlaryngoscopy), thus improving the diagnostics of laryngeal diseases. On the other hand, acoustic voice analysis may be an important part of follow-up and monitoring of voice treatment results.

Conclusions

In summary, the measurements of acoustic voice parameters using the SP microphone were shown to be reliable in clinical settings and demonstrated a high CCR and low EER when distinguishing between the healthy and pathological voice groups. Our study validates the suitability of the SP microphone signal for the task of automatic voice analysis and voice screening.

Acknowledgments This study was supported by grant VP1-3.1-ŠMM-10-V-02-030 from the Ministry of Education and Science of the Republic of Lithuania.

Compliance with ethical standards

Conflict of interest No conflicts of interest to declare.

References

1. Roy N, Merrill RM, Thibeault S, Parsa RA, Gray SD, Smith EM (2004) Prevalence of voice disorders in teachers and the general population. J Speech Lang Hear Res 47:281–293
2. Branski RC, Cukier-Blaj S, Pusic A, Cano SJ, Klassen A, Mener D et al (2010) Measuring quality of life in dysphonic patients: a systematic review of content development in patient-reported outcomes measures. J Voice 24:193–198
3. Bhattacharyya N (2014) The prevalence of voice problems among adults in the United States. Laryngoscope 124:2359–2362
4. Cohen SM, Kim J, Roy N, Courey M (2014) Delayed otolaryngology referral for voice disorders increases health care costs. Am J Med 128:11–18
5. Dejonckere PH, Bradley P, Clemente P, Cornut G, Crevier-Buchman L, Friedrich G et al (2001) A basic protocol for functional assessment of voice pathology, especially for investigating the efficacy of (phonosurgical) treatments and evaluating new assessment techniques. Eur Arch Otorhinolaryngol 258:77–82
6. Kaleem MF, Ghoraani B, Guergachi A, Krishnan S (2011) Telephone-quality pathological speech classification using empirical mode decomposition. Conf Proc IEEE Eng Med Biol Soc 2011:7095–7098
7. Mat Baki M, Wood G, Alston M, Ratcliffe P, Sandhu G, Rubin JS, Birchall MA (2015) Reliability of OperaVox against the Multidimensional Voice Program (MDVP). Clin Otolaryngol 40:22–28
8. Reynolds DA (1995) Large population speaker identification using clean and telephone speech. IEEE Signal Process Lett 2:46–48
9. Moran RJ, Reilly RB, de Chazal P, Lacy PD (2006) Telephony-based voice pathology assessment using automated speech analysis. IEEE Trans Biomed Eng 53:468–477
10. Wormald RN, Moran RJ, Reilly RB, Lacy PD (2008) Performance of an automated, remote system to detect vocal fold paralysis. Ann Otol Rhinol Laryngol 117:834–838
11. Jokinen E, Yrttiaho S, Pulakka H, Vainio M, Alku P (2012) Signal-to-noise ratio adaptive post-filtering method for intelligibility enhancement of telephone speech. J Acoust Soc Am 132:3990–4001
12. Lin E, Hornibrook J, Ormond T (2012) Evaluating iPhone recordings for acoustic voice assessment. Folia Phoniatr Logop 64:122–130
13. Bach KK, Belafsky PC, Wasylik K, Postma GN, Koufman JA (2005) Validity and reliability of the glottal function index. Arch Otolaryngol Head Neck Surg 131:961–964
14. Pribuisiene R, Baceviciene M, Uloza V, Vegiene A, Antuseva J (2012) Validation of the Lithuanian version of the glottal function index. J Voice 26:73–78
15. Verikas A, Gelzinis A, Bacauskiene M, Uloza V, Kaseta M (2009) Using the patient's questionnaire data to screen laryngeal disorders. Comput Biol Med 39:148–155
16. Verikas A, Bacauskiene M, Gelzinis A, Vaiciukynas E, Uloza V (2012) Questionnaire- versus voice-based screening for laryngeal disorders. Expert Syst Appl 39:6254–6262
17. Uloza V, Saferis V, Uloziene I (2005) Perceptual and acoustic assessment of voice pathology and the efficacy of endolaryngeal phonomicrosurgery. J Voice 19:138–145
18. Bland JM, Altman D (1986) Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 327:307–310
19. Elliott AC, Woodward WA (2007) Statistical analysis quick reference guidebook: with SPSS examples. Sage Publications, New York
20. Breiman L (2001) Random forests. Mach Learn 45:5–32
21. Saenz-Lechon N, Godino-Llorente JI, Osma-Ruiz V, Gomez-Vilda P (2006) Methodological issues in the development of automatic systems for voice pathology detection. Biomed Signal Process Control 1:120–128
22. Brümmer N, de Villiers E (2013) The BOSARIS toolkit: theory, algorithms and code for surviving the new DCF. arXiv preprint arXiv:1304.2865
23. Hothorn T, Hornik K, Zeileis A (2006) Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat 15:651–674
Elliott AC, Woodward WA (2007) Statistical analysis quick reference guidebook: with SPSS examples. Sage Publications, New York 20. Breiman L (2001) Random forests. Mach Learn 45:5–32 21. Saenz-Lechon N, Godino-Llorente JI, Osma-Ruiz V, GomezVilda P (2006) Methodological issues in the development of automatic systems for voice pathology detection. Biomed Signal Process Control 1:120–128 22. Bru¨mmer N, de Villiers E (2013) The BOSARIS toolkit: Theory, algorithms and code for surviving the new dcf. ArXiv Preprint ArXiv 1304.2865 23. Hothorn T, Hornik K, Zeileis A (2006) Unbiased recursive partitioning: a conditional inference framework. J Comput Gr Stat 15:651–674

3399 24. Strobl C, Malley J, Tutz G (2009) An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychol Methods 14:323–348 25. Eadie TL, Doyle PC (2005) Classification of dysphonic voice: acoustic and auditory-perceptual measures. J Voice 19:1–14 26. Smits I, Ceuppens P, De Bodt MS (2005) A comparative study of acoustic voice measurements by means of Dr. Speech and computerized speech lab. J Voice 19:187–196 27. Oguz H, Demirci M, Safak MA, Arslan N, Islam A, Kargin S (2007) Effects of unilateral vocal cord paralysis on objective voice measures obtained by Praat. Eur Arch Otorhinolaryngol 264:257–261 28. Zhang Y, Jiang JJ (2008) Acoustic analyses of sustained and running voices from patients with laryngeal pathologies. J Voice 22:1–9 29. Maryn Y, Corthals P, De Bodt M, Van Cauwenberge P, Deliyski D (2009) Perturbation measures of voice: a comparative study between multi-dimensional voice program and praat. Folia Phoniatr Logop 61:217–226 30. Linder R, Albers AE, Hess M, Po¨ppl SJ, Scho¨nweiler R (2008) Artificial neural network-based classification to screen for dysphonia using psychoacoustic scaling of acoustic voice features. J Voice 22:155–163 31. Muhammad G, Mesallam TA, Malki KH, Farahat M, Mahmood A, Alsulaiman M (2012) Multidirectional regression (MDR)based features for automatic voice disorder detection. J Voice 26:19–27 32. Svec JG, Granqvist S (2010) Guidelines for selecting microphones for human voice production research. Am J Speech Lang Pathol 19:356–368 33. Moon KR, Chung SM, Park HS, Kim HS (2012) Materials of acoustic analysis: sustained vowel versus sentence. J Voice 26:563–565 34. Kaleem M, Ghoraani B, Guergachi A, Krishnan S (2013) Pathological speech signal analysis and classification using empirical mode decomposition. Med Biol Eng Comput 51:811–821 35. 
Henrı´quez P, Alonso JB, Ferrer MA, Travieso CM, GodinoLlorente JI, Dı´az-de-Marı´a F (2009) Characterization of healthy and pathological voice through measures based on nonlinear dynamics. Audio Speech Lang Process IEEE Trans 17:1186–1195 36. Uloza V, Verikas A, Bacauskiene M, Gelzinis A, Pribuisiene R, Kaseta M, Saferis V (2011) Categorizing normal and pathological voices: automated and perceptual categorization. J Voice 25:700–708 37. Vaiciukynas E, Verikas A, Gelzinis A, Bacauskiene M, Uloza V (2012) Exploring similarity-based classification of larynx disorders from human voice. Speech Commun 54:601–610
