International Journal of Audiology 2014; 53: 336–344

Original Article

User-operated speech in noise test: Implementation and comparison with a traditional test

Ellen Raben Pedersen & Peter Møller Juhl

The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Odense, Denmark

Abstract

Objective: The purpose of this study was to implement and evaluate a user-operated speech-in-noise test. Design: The test is based on the Danish speech material Dantale II, which consists of five-word sentences (Wagener et al, 2003). For each word presented, the subject selected a response from ten alternative words. Two versions of the test were made: one with and one without the possibility, for each word presented, to answer “I do not know” (?-button). Using a listening test, the two versions were evaluated against a traditional test, in which the subjects orally repeated the words they perceived. Study sample: Twenty-four normal-hearing subjects. Results: The speech intelligibility as a function of the signal-to-noise ratio can be described by logistic functions in the different user-operated tests and in the traditional test. The logistic parameters obtained from the user-operated test with the ?-button agree with the parameters obtained in a traditional test. The homogeneity of the speech material is uninfluenced when the material is used in a user-operated test. Conclusions: It is reasonable to use the Dantale II speech material for a user-operated speech-in-noise test, and the use of the ?-button is favourable.

Key Words: Dantale II; speech intelligibility in noise; user-operated test; matrix test; alternative forced choice; ?-button; discrimination function; normal-hearing subjects

Correspondence: Ellen Raben Pedersen, The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Campusvej 55, DK-5230 Odense M, Denmark. E-mail: erpe@mmmi.sdu.dk
(Received 8 April 2013; accepted 25 October 2013)
ISSN 1499-2027 print/ISSN 1708-8186 online © 2014 British Society of Audiology, International Society of Audiology, and Nordic Audiological Society
DOI: 10.3109/14992027.2013.860486

Abbreviations
AFC: alternative forced choice
HINT: hearing-in-noise test
HL: hearing level
SNR: signal-to-noise ratio
SRT: speech reception threshold in noise (50% speech understanding)
s50: the slope of the discrimination function at the SRT

Many people, and especially hearing-impaired listeners, find it difficult to hear and understand speech in a noisy environment. Over the years, different speech-in-noise tests have been developed to obtain a measure of hearing-impaired individuals’ ability to hear speech in background noise. In Danish clinical practice the commonly used speech test is the Dantale I test, which consists of lists of single monosyllables and numbers of one syllable (Elberling et al, 1989). Danish speech tests based on sentences also exist. The most common are the Dantale II test (Wagener et al, 2003) and the Danish version of the hearing-in-noise test (HINT) (Nielsen & Dau, 2011), which consist of syntactically fixed but semantically unpredictable (nonsense) sentences and everyday sentences, respectively. In both tests the sentences are presented in speech-shaped interfering noise. The Dantale II speech material is used in the present study. The Danish Dantale II speech material was developed in analogy to the materials for the Swedish Hagerman test (Hagerman, 1982) and the German Oldenburg sentence test (Wagener et al, 1999a,b,c). Within the European HearCom project the material has also been developed for other languages, e.g. Polish (Ozimek et al, 2010), Spanish (Hochmuth et al, 2012), and French (Jansen et al, 2012).

Speech-in-noise tests based on a speech material such as Dantale II are, due to the composition of the material, normally implemented with an adaptive test procedure using word scoring. In an adaptive test procedure, sentences are presented consecutively, and the presentation level, i.e. the signal-to-noise ratio (SNR), of each sentence depends on the number of correctly answered words in the previous sentence(s). The SNR that corresponds to a particular speech intelligibility percentage is the speech reception threshold (SRT). Typically the SRT reported corresponds to a probability of 50% correct responses (Brand & Kollmeier, 2002). When the subject answers two or fewer words correctly (below 50%), the SNR has to be raised, and when the subject answers three or more words correctly (above 50%), the SNR has to be reduced. Typically, the test stops after a fixed number of sentences. The change in the SNR is determined by the expected relation between the SNR and the speech intelligibility, described by a logistic function (Brand & Kollmeier, 2002).

A traditional speech-in-noise test involves the subject and an operator. The subject has to orally repeat the perceived word(s) after each presented word or sentence. The operator then registers whether the subject’s answer is correct or incorrect. The purpose of this study is to implement and evaluate a user-operated speech-in-noise test based on the speech material Dantale II. The possible benefits of a user-operated test are a reduction of the test time for the operator and the elimination of the possible influence of the operator on the evaluation of the subject’s response.

The German, Polish, and Spanish speech materials have been implemented and evaluated in user-operated test versions, where the subject had to select a response from ten alternative words listed in a matrix for each word presented (Brand et al, 2004; Ozimek et al, 2010; Hochmuth et al, 2012). Besides the ten alternative words, the subject in the German and in the Spanish test had the option to answer “I do not know” for each word presented. Each “I do not know” answer was interpreted as an incorrect answer. In the Polish test the “I do not know” option was not included (and the test used sentence scoring instead of word scoring). Without the possibility to answer “I do not know” (and when using word scoring), the Polish user-operated test was designed as five subsequent ten-alternative forced choice (10AFC) tests, i.e. a closed test. Brand et al (2004), Ozimek et al (2010), and Hochmuth et al (2012) examined how the design of the user-operated test affects the SRT value. In the user-operated German test with the “I do not know” option the SRT value was found to be similar to the SRT value obtained from a traditional test (Brand et al, 2004). Correspondingly, for the Polish version no statistically significant difference in the SRT value was found between the user-operated test without the “I do not know” option and a traditional test (Ozimek et al, 2010). However, for the Spanish version the SRT value in the user-operated test with the “I do not know” option was found to be significantly different from the SRT value obtained from a traditional test: the SRT value was lower (better) for the user-operated test than for the traditional test (Hochmuth et al, 2012). Besides the SRT values, Hochmuth et al (2012) examined how the design of the user-operated test affects the slope of the discrimination function at the SRT, and found similar values in the user-operated test and in the traditional test.
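The adaptive up-down rule described above can be sketched in a few lines of Python. This is a simplified illustration with a fixed 2 dB step and a simulated listener; the actual procedure of Brand & Kollmeier (2002) uses adaptive step sizes, and all parameter values here are assumptions for demonstration only.

```python
import math
import random

def logistic_p(snr, srt=-8.4, s50=0.132):
    """Probability of hearing a word (logistic function of Brand &
    Kollmeier, 2002); s50 given as a fraction per dB."""
    return 1.0 / (1.0 + math.exp(4.0 * s50 * (srt - snr)))

def adaptive_track(n_sentences=20, start_snr=0.0, step=2.0,
                   srt=-8.4, s50=0.132, rng=None):
    """Toy adaptive procedure: after each five-word sentence, raise the
    SNR when <= 2 words were correct and lower it when >= 3 were
    correct, so the track converges towards the 50% point (the SRT)."""
    rng = rng or random.Random(0)
    snr, track = start_snr, []
    for _ in range(n_sentences):
        correct = sum(rng.random() < logistic_p(snr, srt, s50)
                      for _ in range(5))
        track.append((snr, correct))
        snr += step if correct <= 2 else -step
    return track
```

With the assumed parameter values, the simulated track drifts from the 0 dB SNR starting level down towards the reference SRT region around −8 dB SNR.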
In a user-operated test without the “I do not know” option the subject is forced to guess when a word is not heard. This study distinguishes between two types of guesses: pure guesses and motivated guesses. A pure guess is made when a subject has not heard the word at all. A motivated guess can be made if a subject has heard part of the word and uses this information when picking among the limited pool of alternative words. Since the subjects in a user-operated test are more likely to guess the correct word by a motivated guess, the SRT value is expected to be lower than in a traditional test. In a user-operated test without the “I do not know” option, pure guesses are taken into account by adapting the logistic function for a traditional test (with, in theory, an infinite number of alternative words) to a closed test procedure (with a limited number of alternative words). The discrimination function for the user-operated test without the “I do not know” option will, in the limit of low SNRs, not have the value 0 as in a traditional test, but the reciprocal of the number of alternative words (corresponding to a forced random guess). Thus, the slope of the discrimination function

for a user-operated test without the “I do not know” option will be smaller than that for a traditional test. The homogeneity of the speech material can be assessed by the slope of the discrimination function that corresponds to the traditional test; the steeper the slope, the more homogeneous the material, i.e. the more even the difficulty of the different words (Wagener et al, 2003). During the generation of the Danish speech material the sound pressure level of particular words was adjusted in order to optimize the homogeneity of the material (Wagener et al, 2003). Other speech materials have also been optimized, e.g. the Spanish material (Hochmuth et al, 2012). For both the Danish and the Spanish material the optimization was performed with a traditional test. The possibility of guessing the correct word from the alternative words in a user-operated test may affect the homogeneity of the material: e.g. if a word in the material sounds similar to another word not included in the material, a subject in a traditional test may answer with that not-included word, whereas a subject in a user-operated test will have a much better chance of picking the correct word, being helped by the limited pool of alternative words. However, the slope found by Hochmuth et al (2012) indicates that the homogeneity of the Spanish material is uninfluenced when it is used in a user-operated test compared to using the material in a traditional test, for which it originally was designed and optimized. In this study the Danish speech material Dantale II is implemented as a user-operated test, where the subject had to select a response from ten alternative words in a matrix for each word presented. Two versions of the test were made: one with the option, for each word, to answer “I do not know” (?-button) and one without that option.
The “I do not know” answers were handled in two ways: (i) as incorrect answers, in agreement with Brand et al (2004) and Hochmuth et al (2012), and (ii) as 1/10 correct answers. The advantage of including the ?-buttons is that the subject is not forced to choose an answer to a word that has not been heard, i.e. the subjects are not forced to guess. However, a disadvantage is that different subjects will use the ?-buttons differently: some subjects will use a ?-button whenever they are not quite sure which word has been presented, whereas others will choose a word instead and almost never use the ?-buttons. To equalize a potentially different use of the ?-buttons, the conversion of each “I do not know” answer into a 1/10 correct answer was introduced. The conversion corresponds to picking the right word 10% of the time by random guessing among the ten possible words. The two user-operated test versions and the different ways of handling the “I do not know” answers were evaluated against a traditional test by a listening test.

In the present study three overlapping research questions are addressed: (1) Can it, for a traditional test, be confirmed that a logistic function can be used to describe the speech intelligibility as a function of SNR, and is it also reasonable to use a logistic function for the two versions of the user-operated test, including the different ways of handling the “I do not know” answers? (2) Does the possibility to answer “I do not know”, and the different interpretations of the “I do not know” answers, influence the test result, and if so, how? (3) Is it reasonable to use the speech material Dantale II in a user-operated test, i.e. is the homogeneity of the speech material influenced when it is used in a user-operated test instead of a traditional test, for which the material originally was designed?
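The two ways of handling the “I do not know” answers can be illustrated with a small scoring helper (a hypothetical sketch, not the authors' code): handling (i) counts each “?” as incorrect, while handling (ii) credits it with the chance level of 1/10 correct.

```python
def score_sentence(responses, targets):
    """Score one five-word sentence; '?' marks an 'I do not know' answer.
    Returns (strict, converted): 'strict' counts each '?' as wrong
    (handling i); 'converted' additionally credits each '?' with 1/10
    correct (handling ii), the expected gain from a random guess among
    the ten alternatives."""
    strict = sum(r == t for r, t in zip(responses, targets))
    n_dontknow = sum(r == "?" for r in responses)
    converted = strict + n_dontknow / 10.0
    return strict, converted
```

For example, responding “Ingrid ? syv ? huse” to the target “Ingrid finder syv røde huse” scores 3 correct words under handling (i) and 3.2 under handling (ii).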
To the authors’ knowledge, only the third question has been addressed in the available literature, and only by the study of Hochmuth et al (2012) using the Spanish speech material.


Methods

Speech material

In this study the Danish speech material Dantale II, which is described by Wagener et al (2003), was used to determine the speech reception threshold in noise. The speech material consists of 160 test sentences and a noise signal. The noise signal was generated by superimposing the test sentences many times, whereby the signal became speech-shaped without strong fluctuations. The test sentences are assembled in 16 lists with ten sentences each. The sentences have a syntactically fixed structure of five words from different word classes: name, verb, numeral, adjective, and noun. The lists are based on the same 50 words (ten words for each of the five word classes). The different words in the Dantale II material are shown in Table 1. The sentences are not fully meaningful, and therefore the words cannot be predicted from the context. The words can then be regarded as single words in sequence; hence word scoring was used. As an example, the first sentence in list 1 is: “Ingrid finds seven red houses” (translation of the Danish sentence: “Ingrid finder syv røde huse”). Because of the fixed structure, a test based on a speech material such as Dantale II is called a matrix test.
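The matrix structure described above can be illustrated with a short sketch that assembles a sentence by drawing one word per class from the 5 × 10 word matrix (word lists taken from Table 1). Note that the actual Dantale II test presents recorded sentences from the 16 fixed lists, not sentences generated on the fly; this is purely illustrative.

```python
import random

# The 5 x 10 Dantale II word matrix (word classes: name, verb,
# numeral, adjective, noun), as listed in Table 1.
MATRIX = {
    "name":      ["Anders", "Birgit", "Henning", "Ingrid", "Kirsten",
                  "Linda", "Michael", "Niels", "Per", "Ulla"],
    "verb":      ["ejer", "finder", "får", "havde", "købte",
                  "låner", "ser", "solgte", "valgte", "vandt"],
    "numeral":   ["tre", "fem", "seks", "syv", "otte",
                  "ni", "ti", "tolv", "fjorten", "tyve"],
    "adjective": ["fine", "flotte", "gamle", "hvide", "nye",
                  "pæne", "røde", "sjove", "smukke", "store"],
    "noun":      ["biler", "blomster", "gaver", "huse", "jakker",
                  "kasser", "masker", "planter", "ringe", "skabe"],
}

def random_sentence(rng=None):
    """Assemble one syntactically fixed, semantically unpredictable
    sentence by drawing one word per class, e.g. 'Ingrid finder syv
    røde huse'."""
    rng = rng or random.Random()
    return [rng.choice(MATRIX[c])
            for c in ("name", "verb", "numeral", "adjective", "noun")]
```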

Discrimination functions

The discrimination function describes the speech intelligibility, or discrimination, p as a function of the SNR. In Brand & Kollmeier (2002) the discrimination function for a traditional test, where the subjects orally repeat the perceived words, is given by a logistic function:

p_TT(SNR, SRT, s50) = 1 / (1 + exp(4 · s50 · (SRT − SNR))),  (1)

where the subscript TT denotes the traditional test. The parameter SRT refers to the SNR corresponding to a probability of 50% correct responses, and s50 is the slope of the discrimination function for the traditional test at the SRT. The reference data (for normal-hearing native Danes) are: SRT = −8.4 dB SNR (standard deviation: 0.95 dB SNR) and s50 = 13.2 %/dB (standard deviation: 1.9 %/dB) (Wagener et al, 2003).

The procedure for a traditional test is open, which means that the number of alternative responses for each word is in theory infinite. The logistic function for the open test procedure is adapted to a closed test procedure, i.e. an n-alternative forced choice (nAFC) procedure (without the possibility to answer “I do not know”). In a test based on an nAFC procedure, the subject has to select a response from n alternative responses. In the limit of low SNRs (i.e. where the level of the words is much lower than the level of the noise) the discrimination function for an nAFC procedure will have the value 1/n due to pure guesses, whereas the value will approach 1 for high SNRs, as is also the case for the traditional test. Hence, the discrimination function for an nAFC procedure corresponds to a vertical compression of the discrimination function for the traditional test. For an nAFC procedure the probability of hearing each word is equal to p_TT(SNR, SRT, s50). It is assumed that when a word is heard, it is responded to correctly. The probability of not hearing a word (and having to guess) is 1 − p_TT(SNR, SRT, s50). The probability function (discrimination function) of correctly responded words, both heard and correctly guessed, is therefore:

p_nAFC(SNR, SRT, s50) = p_TT(SNR, SRT, s50) + (1/n) · (1 − p_TT(SNR, SRT, s50))
                      = 1/n + p_TT(SNR, SRT, s50) · (1 − 1/n).  (2)

Motivated guesses have not been taken into account. Equation 1 is inserted into Equation 2:

p_nAFC(SNR, SRT, s50) = 1/n + (1 − 1/n) · 1 / (1 + exp(4 · s50 · (SRT − SNR))),  (3)

where s50 and SRT are the parameters of the traditional test. The slope of Equation 3 at SNR = SRT is s50 · (1 − 1/n). The factor (1 − 1/n) corresponds to the factor by which the discrimination function for the traditional test is vertically compressed. At SNR = SRT we expect 1/2 · (1 − 1/n) + 1/n = 1/2 · (1 + 1/n) correct responses on average from a normal-hearing subject. Equation 3 is in agreement with Green et al (1989). In the following, n is set to ten, by which all the words within the different word classes in the speech material are represented.

From the values of the speech intelligibility at the different SNRs, the most likely discrimination function was determined using the method of maximum likelihood. The likelihood of the discrimination function p(SNR, SRT, s50) is:

l(p(SNR, SRT, s50)) = ∏_{k=1}^{m} p(SNR_k, SRT, s50)^(c_k) · (1 − p(SNR_k, SRT, s50))^(5 − c_k),  (4)

where the number five in the exponent of the last term is the number of words in each test sentence. The parameter m is the number of sentences presented, and c_k is the number of correct words registered for the kth sentence. To determine the most likely discrimination function, the parameters SRT and s50 are varied until log l(p(SNR, SRT, s50)) is maximized (Newey & McFadden, 1994).

Table 1. The ten alternative Danish words for each presented word. The table is for the version of the user-operated test with the possibility, for each word presented, to answer “I do not know” (?-button). The five columns represent the different word classes: name, verb, numeral, adjective, and noun. For each presented word the subjects had to choose one word from the corresponding column or use the ?-button. The words in each column are listed alphabetically, except for the numerals, which are listed in numerical order. During the test the matrix was visible on a touch screen.

Name       Verb      Numeral   Adjective   Noun
Anders     ejer      tre       fine        biler
Birgit     finder    fem       flotte      blomster
Henning    får       seks      gamle       gaver
Ingrid     havde     syv       hvide       huse
Kirsten    købte     otte      nye         jakker
Linda      låner     ni        pæne        kasser
Michael    ser       ti        røde        masker
Niels      solgte    tolv      sjove       planter
Per        valgte    fjorten   smukke      ringe
Ulla       vandt     tyve      store       skabe
?          ?         ?         ?           ?
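As a sketch of this fitting step, the following Python code (not the authors' implementation, which was written in MATLAB) evaluates Equations 3 and 4 and estimates SRT and s50 by a crude grid search over the log-likelihood; the grid bounds are illustrative assumptions.

```python
import math

def p_nafc(snr, srt, s50, n=10):
    """Discrimination function for an nAFC procedure (Equation 3);
    s50 given as a fraction per dB. For n -> infinity this reduces to
    the open-test logistic function (Equation 1)."""
    p_tt = 1.0 / (1.0 + math.exp(4.0 * s50 * (srt - snr)))
    p = 1.0 / n + (1.0 - 1.0 / n) * p_tt
    return min(max(p, 1e-12), 1.0 - 1e-12)  # guard the logs below

def log_likelihood(data, srt, s50, n=10):
    """log l(p) from Equation 4; data holds (SNR_k, c_k) pairs, with
    c_k the number of correct words (0..5) in the k-th sentence."""
    return sum(c * math.log(p_nafc(s, srt, s50, n))
               + (5 - c) * math.log(1.0 - p_nafc(s, srt, s50, n))
               for s, c in data)

def fit_ml(data, n=10):
    """Return the (SRT, s50) pair maximizing the log-likelihood over a
    coarse grid; a sketch, not the authors' numerical optimizer."""
    candidates = ((srt / 10.0, s50 / 1000.0)
                  for srt in range(-150, 1)       # SRT: -15.0 .. 0.0 dB
                  for s50 in range(50, 301, 5))   # s50: 0.05 .. 0.30 /dB
    return max(candidates,
               key=lambda q: log_likelihood(data, q[0], q[1], n))
```

Fitting counts that roughly follow the reference discrimination function returns an SRT estimate near the reference value of −8.4 dB SNR.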

User-operated test versions In the user-operated test the subjects have to select a response from ten alternative words listed in a matrix for each presented word (see Table 1). The ten alternative words for each of the five words in


each sentence are the 50 words on which the 16 lists in the speech material are based. As described in the introduction, the user-operated test was implemented in two versions: one with the possibility, for each word, to answer “I do not know” (press a ?-button), and one without that possibility. When the ?-button is available, the test procedure is not entirely closed. The original intention with the ?-button was to increase the subjects’ engagement in the test: it can be frustrating to be forced to choose an answer to a word that has not been heard at all, and this can potentially draw attention away from the next sentence to be presented. Each “I do not know” answer was handled in two ways: either as an incorrect answer or as a 1/10 correct answer. The conversion into a 1/10 correct answer corresponds to picking the right word 10% of the time by random guessing among the ten possible words (when many sentences are presented); in the remaining 90% of the time an incorrect word will be picked. This conversion was introduced in order to equalize a potentially different use of the ?-button among the subjects: some subjects would use a ?-button whenever they were not quite sure which word had been presented, whereas others would choose a word instead and almost never use the ?-button. If the conversion is to be adopted in an adaptive test procedure, the test itself has to make a random guess among the ten possible words instead of converting each “I do not know” answer into a 1/10 correct answer.
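The adaptation just described can be sketched as a hypothetical helper: in an adaptive procedure, the SNR update needs a definite answer per word, so each “?” press is resolved by a live uniform draw among the ten alternatives rather than being converted to a 1/10 correct answer afterwards.

```python
import random

def resolve_dont_know(response, alternatives, rng=None):
    """Replace an 'I do not know' press ('?') with a uniform random
    draw among the alternative words, so an adaptive procedure gets a
    definite (correct or incorrect) answer. Hypothetical helper, not
    the authors' code."""
    rng = rng or random.Random()
    return rng.choice(alternatives) if response == "?" else response
```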

Subjects

A listening test was performed with 24 normal-hearing subjects (12 males and 12 females, aged 21–39 years with an average age of 26 years). Even though all the subjects considered themselves to be normal-hearing, their hearing thresholds were measured at the frequencies 0.5, 1, 2, and 4 kHz. One of the subjects had a threshold of 15 dB hearing level (HL) at 4 kHz, whereas the remaining subjects had thresholds of at most 10 dB HL. All the subjects were accepted for the test, and they participated voluntarily without payment.

Test course

The 24 normal-hearing subjects were divided into three listening groups of eight persons each (balanced regarding age), which were presented with the traditional test and the two versions of the user-operated test (with and without the ?-buttons), respectively. The subjects in each listening group are assumed to have a similar ability to understand speech in noise. In the traditional test the subject’s task was to orally repeat the perceived words. The operator then registered the number of correctly repeated words. In the user-operated tests the subject had to select, for each presented word, a response from ten alternative words listed in a matrix on a touch screen. The response alternatives were the same for each sentence presented. They were (for each word class) listed alphabetically, except for the numerals, which were listed in numerical order (see Table 1). In the test with the ?-button, the subjects were told to use this possibility when they had not heard the word. No further guidance regarding the ?-button was given.

The subjects in each listening group performed both a training and a test session. The training session contained four test lists (i.e. 200 words), as recommended by Wagener et al (2003), and was carried out according to the adaptive procedure described in Brand & Kollmeier (2002). The purpose of the training session was to make the test subjects familiar with the test and the test situation. Data from the training session were not included in the results.

In order to investigate whether it is reasonable to use a logistic function to describe the speech intelligibility, and to interpret the “I do not know” answers in different ways, an adaptive procedure was not used in the test session. Instead, eight test lists (i.e. 400 words) were presented at eight different SNRs ranging from −15 to −1 dB SNR in increments of 2 dB, i.e. 50 words were presented at each level for each person. For each SNR the discrimination score, i.e. the percentage of correctly answered words (Boothroyd, 1968), was determined. The percentages were used to determine the most likely discrimination function and thereby the values of SRT and s50. The SNRs were the same for all three listening groups and were set from the reference data for the traditional test (Wagener et al, 2003) to cover the entire discrimination function. The noise level was held constant at 65 dB(C), and the noise was present between sentences as well. The C-weighting was chosen according to the international standard ISO 8253-3:2012. The different SNRs were realized by adjusting the speech level. The presentation order of the lists at the different SNRs was chosen randomly.

Equipment

A specially designed measurement program was developed in MATLAB 6.5 according to the test method. During the listening test a laptop with a touch screen (Acer model TravelMate C300XCi) was used; the touch screen was operated with a special pen. The test sentences and the noise signal were presented to the subjects via a loudspeaker (Vifa P13WH00-08 in a 6.6-litre vented cabinet), which was connected to the laptop through a power amplifier (Bruel & Kjaer, type 2706). The subjects were seated 1.2 m in front of the loudspeaker. Control measurements of the test system were made by means of a microphone (Bruel & Kjaer, type 4165), a power supply (Bruel & Kjaer, type 2804), and a measuring amplifier (Bruel & Kjaer, type 2636). By the control measurements the amplification characteristic and the frequency response were verified and found to be linear. The sound field around the listening point was found to be quasi-free according to ISO 8253-2:2009. The level of the background noise was lower than 45 dB(C).

Statistical analyses

For the statistical analyses the computer program SPSS 11.5.1 for Windows was used (www.spss.co.in). Mainly parametric statistical models were used. All analyses were performed two-sided at a 0.05 significance level. The Shapiro-Wilk test was used to ascertain whether different samples could be assumed to come from a normal distribution. To investigate whether it is reasonable to use a logistic function to describe the speech intelligibility as a function of SNR (research question 1), the Pearson product-moment correlation coefficient between the subject’s real responses and the responses determined from the logistic function in Equation 3 at the eight SNRs was calculated for each of the 24 subjects. The response determined from the logistic function was found at each SNR by inserting the values of SRT and s50 estimated for each subject into Equation 3. For the subjects who were presented with the traditional test, the number n was set to infinity, i.e. Equation 3 then becomes Equation 1. Data from this study show that the subjects at low SNRs (where the subjects are expected not to be able to hear the words) used the “I do not know” option instead of guessing. Thus the number n was also set to infinity for the subjects who were presented with the user-operated test with the ?-buttons, where each “I do not know” answer was handled as an incorrect answer.

In order to test for possible differences between the mean values of SRT and s50 for the three listening groups, including the two ways to interpret the “I do not know” answers (research questions 2 and 3), the one-way ANOVA test and the t-test were used. Before the ANOVA tests were executed, Levene’s test was used to assess the equality of variances in the different samples. To test whether the mean values of SRT and s50 differ from the reference values found by Wagener et al (2003), the one-sample t-test was used. The Shapiro-Wilk test showed that the values of s50 for the user-operated test with the ?-button, where “I do not know” answers were interpreted as 1/10 correct answers, cannot be assumed to come from a normal distribution. Hence, the non-parametric Kruskal-Wallis test was used to analyse results involving these data.
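The paper's analyses were run in SPSS; as a minimal stdlib sketch of one of them, the following code computes the one-sample t statistic used to compare a group's mean SRT or s50 with the reference values of Wagener et al (2003). Obtaining the p-value additionally requires the t distribution, which SPSS (or e.g. scipy.stats) provides.

```python
import math
import statistics

def one_sample_t(values, mu0):
    """One-sample t statistic of 'values' against the reference mean
    mu0; returns (t, degrees of freedom). A sketch of the test, not
    the SPSS implementation used in the study."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)       # sample standard deviation
    t = (mean - mu0) / (sd / math.sqrt(n))
    return t, n - 1
```

For a result group of eight subjects, the degrees of freedom are 7, matching the t(7) statistics reported in the Results section.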

Results

In order to analyse and present the results, the three listening groups will be handled as four result groups in what follows:

• Group 1 = the traditional test,
• Group 2 = the user-operated test with ?-button, where “I do not know” answers are handled as incorrect answers,
• Group 3 = the user-operated test with ?-button, where “I do not know” answers are handled as 1/10 correct answers, and
• Group 4 = the user-operated test without ?-button.

Hence, result groups 1 and 4 are simply the two corresponding listening groups, whereas result groups 2 and 3 refer to the same listening group; the difference lies in the post-experiment handling of the “I do not know” answers.

Logistic regression

The Pearson product-moment correlation coefficient between the subject’s real responses and the responses determined from the logistic function was comparable for all subjects across the four result groups. For all subjects the correlation coefficient was above 0.98, with p-values less than 0.001. Due to the high correlation, it is reasonable to use the logistic function in Equation 3 to describe the speech intelligibility as a function of SNR for all four result groups. Figure 1 shows the percentage of correct answers at the eight SNRs and a plot of the most likely discrimination function for eight selected subjects. The most likely discrimination function was found by inserting the values of SRT and s50 into Equation 3 for each subject. Figure 1, a and b, are for two subjects from group 1; Figure 1, c and d, are for two subjects from group 2; Figure 1, e and f, are for two subjects from group 3; and Figure 1, g and h, are for two subjects from group 4. The curves in Figure 1 correspond to the subjects with the lowest (Figure 1, a, c, e, and g) and the highest (Figure 1, b, d, f, and h) correlation coefficients within each of the four result groups, i.e. the worst and the best fits. For the subjects with the highest correlation coefficients, the curves for the most likely discrimination functions almost coincide with the mean values of correct answers.

The influence of the ?-buttons

The values of SRT and s50 for the 24 subjects in the listening test appear in Table 2. The mean values of SRT and s50 within each of the four result groups were found by averaging the subjects’ individual values. If the mean values of SRT and s50 were instead found by fitting the logistic function directly to the overall average of the correct answers across subjects, the average value of s50 would be lower than the ones in the table, due to the variation in SRT between subjects.

In order to test for differences between the mean SRT values obtained in groups 1, 2, and 4, an ANOVA test was performed. The ANOVA test shows no significant difference in the means (F(2,21) = 3.183, p = 0.062). If group 2 is replaced by group 3, the ANOVA test still shows no significant difference (F(2,21) = 3.353, p = 0.055). In both cases a higher degree of the variance in the SRT values (assessed by the degree of explanation R2 and the adjusted R2) could be explained when focusing on two result groups only instead of including all three groups at once. The conclusions about equality of means are thus based on t-tests including two result groups at a time. For comparing the mean values of SRT for groups 2 and 3 the paired-samples t-test was used, whereas the independent-samples t-test was used for comparing the mean values of SRT for the other groups. The paired-samples t-test was used for groups 2 and 3 since the data for these two groups come from the same subjects. The results from the t-tests are shown in Table 3, from which it can be concluded that the mean values of SRT for groups 2 and 3 are not statistically significantly different from the SRT mean value for group 1, whereas the SRT mean value for group 4 differs statistically significantly from those of the other result groups. It can also be concluded that there is no statistically significant difference in SRT between groups 2 and 3, i.e. no difference between the two ways to interpret the “I do not know” answers was found.

One-sample t-tests show that the mean values of SRT for groups 1–3 agree with the reference value, SRT = −8.4 dB SNR, found by Wagener et al (2003) (t(7) = 0.984, p = 0.358; t(7) = 0.501, p = 0.631; t(7) = 0.820, p = 0.439, respectively). The mean value of SRT for group 4 disagrees with the reference value (t(7) = 2.422, p = 0.046); it is lower than the reference value. From the mean values of SRT and s50 in Table 2, an average discrimination function for each of the four result groups is determined by Equation 3 and shown in Figure 2. Figure 2, a, shows the discrimination functions for groups 1 and 2; the two curves almost coincide. For the curves in Figure 2, a, the number n was set to infinity. Figure 2, b, shows a theoretically predicted discrimination function for a 10AFC test, based on the SRT and s50 estimated for group 1. The figure also shows the discrimination functions for groups 3 and 4. For the curves in Figure 2, b, the number n was set to ten. The difference between the curve for group 1 and the theoretically predicted curve illustrates correctly responded words due to pure guesses. From Figure 2, b, it is seen that the curve for group 3 almost coincides with the theoretically predicted curve. It is also seen that the speech intelligibility at a given SNR is consistently higher for group 4 than for the theoretically predicted curve, in particular for speech intelligibility in the range of 20% to 80%. The difference between these two curves describes the amount of correctly responded words due to motivated guesses.

The homogeneity of the speech material

The homogeneity of the speech material can be described by the slope of the discrimination function, as mentioned in the introduction. To investigate whether the homogeneity of the speech material is influenced when the material is used in a user-operated test

User-operated speech in noise test    341

Speech intelligibility [%]

Speech intelligibility [%]

Speech intelligibility [%]

Speech intelligibility [%]

100

(a)

(b)

(c)

(d)

(e)

(f)

(g)

(h)

50

0 100

50

0 100

50

0 100

50

0

−15 −13 −11

−9

−7

−5

−3

−1

Presentation level [dB SNR]

−15 −13 −11

−9

−7

−5

−3

−1

Presentation level [dB SNR]

Figure 1.  One subfigure for each of eight subjects: a and b, are for two subjects from group 1; c and d, are for two subjects from group 2; e and f are for two subjects from group 3; and g and h are for two subjects from group 4. The subfigures to the left (a, c, e, and g) are for the subjects with the lowest correlation coefficient within the particular result group, whereas subfigures to the right (b, d, f, and h) are for the subjects with the highest correlation coefficient within the result group. Each subfigure shows the mean value (black dots) and one standard deviation (vertical bars) of percent correct answers for the eight different SNRs for one subject, as well as the most likely discrimination function determined by Equation 3. Each subject was presented with ten sentences of each five words at each of eight SNRs, i.e. in total each of the eight subfigures shows results from the presentation of 400 words. compared to using the material in a traditional test, the mean values of s50 in Table 2 are compared. The ANOVA test between the mean value of s50 for group 1, 2, and 4 revealed no significant differences (F(2,21)  1.398, p  0.269). Equivalently, the nonparametric Kruskal-Wallis test shows no statistical significant difference between the mean values of s50 for group 1, 3, and 4 (X2(2)  1.865, p  0.394). One sample t-tests show that the mean value of s50 found for group 1 and 4 agree with the reference value, s50  13.2 %/dB found by Wagener et  al (2003) (t(7)  1.495, p  0.179; t(7)   0.069, p  0.947, respectively). The mean value of s50 found for group 2 is statistically different from (higher than) the reference value, as the mean value of s50 found for group 2 disagree with the reference value (t(7)  3.191, p  0.015). Since the values of s50 for group 3 cannot be assumed to come from a normal

distribution, a one sample t-tests could not be performed for the mean of these data.
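The shape of the discrimination functions discussed above can be sketched in code. The exact form of Equation 3 is not reproduced in this excerpt; the sketch below assumes the logistic form commonly used for matrix tests, with SRT and s50 as the midpoint and slope of the underlying logistic and a guessing floor of 100/n percent for an n-alternative forced-choice task (n = infinity for the open-set traditional test). The function name and exact formula are illustrative assumptions, not the authors' implementation.

```python
import math

def discrimination(snr, srt, s50, n=float("inf")):
    """Speech intelligibility in percent at presentation level `snr` (dB SNR).

    Assumed logistic form of the paper's Equation 3: a guessing floor of
    100/n percent for an n-alternative forced-choice (nAFC) task vertically
    compresses the curve, while n = inf gives the plain logistic used for
    the open-set traditional test. `srt` is in dB SNR, `s50` in %/dB.
    """
    guess = 0.0 if math.isinf(n) else 100.0 / n
    return guess + (100.0 - guess) / (1.0 + math.exp(4.0 * s50 / 100.0 * (srt - snr)))

# Group 1 (traditional test): mean SRT = -8.0 dB SNR, s50 = 14.6 %/dB
print(round(discrimination(-8.0, -8.0, 14.6), 1))        # 50.0 (% at the SRT)
# The same parameters with a 10AFC guessing floor (cf. Figure 2, b)
print(round(discrimination(-15.0, -8.0, 14.6, n=10), 1))
```

Note that with a finite n the floor lifts the whole curve, so the measured 50%-correct point no longer coincides with the SRT parameter of the underlying logistic, which is the vertical compression referred to in the text.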

Discussion
In this study two versions of a user-operated speech-in-noise test based on the Danish speech material Dantale II were implemented and evaluated. In the user-operated test the subject had to select a response from ten alternative words listed in a matrix for each presented word; the test was implemented with and without the possibility to answer "I do not know" for each presented word. A press on the ?-button was handled in two ways: either as an incorrect answer or converted into a 1/10 correct answer, corresponding to a random pure guess among the ten possible words. There are many deliberations to be done when implementing a user-operated matrix test. Some of them will be discussed in the following.

Table 2.  Values of SRT and s50 for all 24 subjects in the listening test. Each subject was presented with ten sentences of five words each at each of eight SNRs, i.e. in total each subject was presented with 400 words. For the subjects who were presented with the version of the user-operated test with the ?-button, the "I do not know" answers were handled as incorrect answers (group 2) and as 1/10 correct answers (group 3). Mean value and one standard deviation (Std.) across the eight subjects in each result group are shown at the bottom of each group.

Group 1
Subject   SRT [dB SNR]   s50 [%/dB]
1         −7.4           12.6
2         −6.1           15.0
3         −9.1           14.4
4         −9.4           17.1
5         −7.3           11.3
6         −8.4           19.4
7         −8.7           14.4
8         −7.8           12.5
Mean      −8.0           14.6
Std.       1.08           2.63

Group 2
Subject   SRT [dB SNR]   s50 [%/dB]
9         −6.8           14.9
10        −9.0           16.1
11        −8.1           14.3
12        −8.2           11.9
13        −9.0           15.9
14        −8.1           15.6
15        −8.2           16.1
16        −8.8           13.8
Mean      −8.3           14.8
Std.       0.71           1.45

Group 3
Subject   SRT [dB SNR]   s50 [%/dB]
9         −6.8           13.9
10        −9.0           14.8
11        −7.7           15.2
12        −8.2           10.9
13        −8.9           15.8
14        −8.1           15.1
15        −8.2           16.1
16        −8.6           14.7
Mean      −8.2           14.5
Std.       0.71           1.62

Group 4
Subject   SRT [dB SNR]   s50 [%/dB]
17        −8.0           11.7
18        −9.1           15.6
19        −9.2            9.0
20        −10.2          12.3
21        −8.0           13.2
22        −9.4           12.9
23        −9.9           14.1
24        −8.8           16.4
Mean      −9.1           13.1
Std.       0.80           2.32

The number of alternative words
In this study the number of alternative words for each presented word in the user-operated tests was set to ten, corresponding to the ten words within each word class in the speech material. The number of alternative words could be reduced in order to make the matrix from which the subjects select a response easier to manage, e.g. for elderly people. If the number of alternative words is reduced, the probability of responding with the correct word by a pure guess increases, i.e. the vertical compression of the discrimination function increases (see Equation 3). Hence, the fewer alternative words, the more sentences have to be presented to obtain the same variance of the test. The number of correctly responded words will rise not only due to pure guesses but also due to motivated guesses, because when part of a word is heard it becomes easier to choose the correct word. If the number of alternative words is reduced, it has to be decided which words should be the alternative words and whether the alternative words should be the same at repeated presentations of a given word.

Table 3.  Results from t-tests comparing the mean values of SRT between the four result groups. For comparing the mean values of SRT for groups 2 and 3 the paired samples t-test was used; otherwise the independent samples t-test was used. Significant results at the 0.05 level are marked with an asterisk (*).

Groups compared   Result
1 vs. 2           t(14) = 0.548, p = 0.592
1 vs. 3           t(14) = 0.370, p = 0.717
1 vs. 4           t(14) = 2.228, p = 0.043*
2 vs. 3           t(7) = 1.745, p = 0.125
2 vs. 4           t(14) = 2.143, p = 0.050*
3 vs. 4           t(14) = 2.351, p = 0.034*
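The choice between the paired and the independent samples t-test in Table 3 can be illustrated with a minimal pure-Python sketch; in practice a statistics package would be used, and the helper names and example data below are hypothetical, not the authors' analysis.

```python
import math
from statistics import mean, stdev

def paired_t(x, y):
    """Paired samples t statistic: the two samples come from the same
    subjects, as for result groups 2 and 3. Returns (t, degrees of freedom)."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / (stdev(d) / math.sqrt(len(d))), len(d) - 1

def independent_t(x, y):
    """Independent samples t statistic with pooled variance, as for
    comparisons between different subjects (e.g. group 1 vs. group 4)."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(sp2 * (1.0 / nx + 1.0 / ny)), nx + ny - 2

# Hypothetical SRT values (dB SNR): two scorings of the same four subjects
a = [-6.8, -9.0, -8.1, -8.2]
b = [-6.8, -9.0, -7.7, -8.2]
t, df = paired_t(a, b)
print(round(t, 2), df)  # -1.0 3
```

The paired test works on per-subject differences and therefore removes the between-subject variation in SRT, which is why it is the appropriate choice when the same subjects produce both scorings.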

The homogeneity of the speech material
If the alternative words sound like or are spelled like the presented word, it is harder to make a motivated guess and thereby choose the correct word than if the alternative words are not phonetically similar to the presented word. The difficulty of a presented word therefore depends on the alternative words. In this study it has been investigated whether the difficulty of the different words is influenced when the speech material is used in a user-operated test compared to using the material in a traditional test. No statistically significant difference between the values of s50 was found, and therefore the homogeneity of the speech material is not influenced when the material is used in a user-operated test, for which the speech material was not originally designed and optimized. The s50 for group 2 was found to disagree with the reference value from Wagener et al (2003), which can be explained by the low standard deviation of s50 for this group. However, the reason for the low standard deviation of s50 is not known at the moment.

The "I do not know" option
Beyond the number of alternative words, it must also be decided whether the possibility to answer "I do not know" should be offered, i.e. whether the ?-button should be available. From the statistical analyses of the data in this study (Table 3) it is found that the mean value of SRT is lower in the user-operated test without the ?-button than in the two other tests (independent of the interpretation of "I do not know" answers in the user-operated test with the ?-button). This finding is also illustrated in Figure 2. The lower SRT value means that when subjects have the possibility to press the ?-button, they make fewer motivated guesses than when the ?-button is not available. The subjects in group 4 expressed that they were somewhat frustrated at being forced to choose an answer to a word which had not been heard, and that this had some influence on their engagement in the test. Hence, in order to keep the subjects engaged in the test, including the ?-button seems favourable.

Figure 2.  Average discrimination functions plotted from the data in Table 2 by inserting the values of SRT and s50 into Equation 3. In a, the dashed (--) curve is for group 1 and the dotted (··) curve is for group 2. In b, the solid curve is the predicted discrimination function for a 10AFC test, theoretically determined from the data for the traditional test; the dotted (··) curve is for group 3, and the dashed-dotted (-·) curve is for group 4.

Interpretation of "I do not know" answers
When including the ?-button it must be decided how to interpret "I do not know" answers. If the subjects do not use the "I do not know" option alike and the "I do not know" answers are not converted into 1/10 correct answers, it can be difficult to choose the right discrimination function to describe the speech intelligibility as a function of SNR. For subjects who used the ?-button whenever they were not quite sure which word had been presented, the number n in Equation 3 should be set to infinity. In the extreme case where a subject never uses the ?-button, however, the number n should be set to ten (corresponding to the user-operated test without the ?-button). The standard deviation of SRT found in this study (Table 2) is the same in the user-operated test whether the "I do not know" answers are interpreted as incorrect answers or as 1/10 correct answers. This indicates that the subjects used the "I do not know" option alike, and therefore both methods of handling the "I do not know" answers work well. However, the authors still prefer handling "I do not know" answers as guesses rather than as incorrect answers, since this would equalize different use of the option to some degree.
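The two interpretations of a ?-button press can be sketched as a small scoring routine. This is an illustrative reconstruction, not the authors' code; the function name and the representation of responses are assumptions, with ten alternatives per word as in the Dantale II matrix.

```python
def percent_correct(responses, convert_unknown=True, n_alternatives=10):
    """Score responses to presented words: True (correct), False (incorrect),
    or None (a press on the ?-button).

    With convert_unknown=True each "I do not know" answer is counted as
    1/n_alternatives correct, matching the chance level of a random pure
    guess among the alternatives (group 3 scoring); otherwise it is
    counted as incorrect (group 2 scoring).
    """
    credit = 0.0
    for r in responses:
        if r is True:
            credit += 1.0
        elif r is None and convert_unknown:
            credit += 1.0 / n_alternatives
    return 100.0 * credit / len(responses)

# One five-word sentence: two correct words, one error, two "?" presses
words = [True, True, False, None, None]
print(round(percent_correct(words, convert_unknown=False), 1))  # 40.0
print(round(percent_correct(words, convert_unknown=True), 1))   # 44.0
```

The conversion only matters when the ?-button is actually used, which is consistent with the observation that the two scorings gave the same standard deviation of SRT.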

Results in other studies
In this study it is found that the mean value of SRT is lower in the user-operated test without the ?-button than in the two other tests (independent of the interpretation of "I do not know" answers). This finding is in agreement with the study by Brand et al (2004) on the German user-operated test with a ?-button, where "I do not know" answers were interpreted as incorrect answers. However, the finding is in disagreement with Ozimek et al (2010) and Hochmuth et al (2012), involving the Polish and Spanish speech materials, respectively. Ozimek et al (2010) found no statistically significant difference in SRT between the user-operated test without the "I do not know" option and a traditional test; the mean SRT across the two tests was −8.0 dB SNR (standard deviation: 1.3 dB SNR). Hochmuth et al (2012) found a statistically significant difference in SRT between the user-operated test with the "I do not know" option and a traditional test: the mean SRT was −7.2 dB SNR (standard deviation: 0.7 dB SNR) for the user-operated test and −6.2 dB SNR (standard deviation: 0.8 dB SNR) for the traditional test. Hochmuth et al (2012) suggest that the difference in SRT between a user-operated test and a traditional test can be reduced if the test subjects are trained extensively before the actual measurements, making them more familiar with the speech material; the reasoning is that the subjects will then make more motivated guesses both in the user-operated test with the "I do not know" option and in the traditional test. However, the training would then have to be so extensive that the subjects in a traditional test learn the words in the material by heart, which seems unlikely to happen during a normal clinical test.

(Dis)advantages of a user-operated test
A user-operated test implemented as the versions in this study requires that the subject can read and cope with the alternative words listed in the response matrix. The latter may be a problem especially for elderly people (Hochmuth et al, 2012). Another disadvantage is a longer test time for the subject than in a traditional test, because the subject has to find the answer among the alternatives rather than just saying the words. The test time per ten sentences was on average 1.9 minutes (standard deviation: 16.6 seconds) for the traditional test and 2.8 minutes (standard deviation: 31.6 seconds) for the two versions of the user-operated test. However, the test time for the operator and the possible influence of the operator on the evaluation of the subject's responses are reduced in the user-operated test versions compared with a traditional test. Hence, the operator does not need to be fluent in Danish. Due to the less central role of the operator in the test course, a user-operated test can be designed for internet use. Deliberations about its implementation would be needed, e.g. on how to ensure that the test subjects have understood their task, how to allow for the learning effect (which will affect the test duration), and how to handle different technical equipment. Speech-in-noise screening tests on the internet have previously been implemented using digit triplets and monosyllables (Leensen et al, 2011a,b), but to the knowledge of the authors an internet test based on sentences like the Dantale II speech material does not exist.

Conclusions
Analysing the results from the listening test gave the following answers to the three research questions in the introduction: (1) The speech intelligibility as a function of the SNR in all three listening groups, including the two ways of handling "I do not know" answers, is well described by a logistic function. (2) The mean value of SRT from the user-operated test with the ?-button statistically equals the results found in a traditional test, whereas a lower SRT mean value is found for the user-operated test without the ?-button than in the two other tests (independent of the interpretation of "I do not know" answers). Hence, when the subjects have the possibility to press the ?-button, they make fewer motivated guesses than when the ?-button is not available. No statistical difference was found between handling "I do not know" answers as incorrect answers and handling them as 1/10 correct answers. (3) The homogeneity of the speech material is not influenced when the material is used in a user-operated test compared to when it is used in a traditional test.

Note 1. The information that the "I do not know" option was included in the Spanish test is obtained from an anonymous reviewer of this paper.

Acknowledgements
The authors thank Carsten Daugaard and Søren L. Jørgensen from DELTA Technical Audiological Laboratory, Odense, Denmark, for providing some of the equipment and for valuable discussions. The authors also thank the subjects who participated in the tests, and the reviewers for valuable comments.

Declaration of interest: The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the paper.

References
Boothroyd A. 1968. Statistical theory of the speech discrimination score. J Acoust Soc Am, 43, 362–367.
Brand T.C. & Kollmeier B. 2002. Efficient adaptive procedures for threshold and concurrent slope estimates for psychophysics and speech intelligibility tests. J Acoust Soc Am, 111, 2801–2810.
Brand T., Wittkop T., Wagener K.C. & Kollmeier B. 2004. Comparison of the Oldenburg sentence test and the Freiburg word test as closed versions (in German). Proceedings of the 7th annual meeting of the Deutsche Gesellschaft für Audiologie (DGA), Leipzig.
Elberling C., Ludvigsen C. & Lyregaard P.E. 1989. Dantale: A new Danish speech material. Scand Audiol, 18, 169–175.
Green D.M., Richards V.M. & Forrest T.G. 1989. Stimulus step size and heterogeneous stimulus conditions in adaptive psychophysics. J Acoust Soc Am, 86, 629–636.
Hagerman B. 1982. Sentences for testing speech intelligibility in noise. Scand Audiol, 11, 79–87.
Hochmuth S., Brand T., Zokoll M.A., Castro F.Z., Wardenga N. et al. 2012. A Spanish matrix sentence test for assessing speech reception thresholds in noise. Int J Audiol, 51, 536–544.
ISO 8253-2:2009. Acoustics – Audiometric test methods – Part 2: Sound field audiometry with pure-tone and narrow-band test signals. Switzerland: International Organization for Standardization.
ISO 8253-3:2012. Acoustics – Audiometric test methods – Part 3: Speech audiometry. Switzerland: International Organization for Standardization.
Jansen S., Luts H., Wagener K.C., Kollmeier B., Del Rio M. et al. 2012. Comparison of three types of French speech-in-noise tests: A multi-center study. Int J Audiol, 51, 164–173.
Leensen M.C., de Laat J.A. & Dreschler W.A. 2011a. Speech-in-noise screening tests by internet, part 1: Test evaluation for noise-induced hearing loss identification. Int J Audiol, 50, 823–834.
Leensen M.C., de Laat J.A., Snik A.F. & Dreschler W.A. 2011b. Speech-in-noise screening tests by internet, part 2: Improving test sensitivity for noise-induced hearing loss. Int J Audiol, 50, 835–848.
Nielsen J.B. & Dau T. 2011. The Danish hearing-in-noise test. Int J Audiol, 50, 202–208.
Newey W.K. & McFadden D. 1994. Chapter 36: Large sample estimation and hypothesis testing. In: R. Engle and D. McFadden (eds.), Handbook of Econometrics, Vol. 4. Amsterdam: Elsevier Science, pp. 2111–2245.
Ozimek E., Warzybok A. & Kutzner D. 2010. Polish sentence matrix test for speech intelligibility measurement in noise. Int J Audiol, 49, 444–454.
Wagener K., Brand T. & Kollmeier B. 1999a. Development and evaluation of a German sentence test, part II: Optimization of the Oldenburg sentence test (in German). Zeitschrift für Audiologie, 38, 44–56.
Wagener K., Brand T. & Kollmeier B. 1999b. Development and evaluation of a German sentence test, part III: Evaluation of the Oldenburg sentence test (in German). Zeitschrift für Audiologie, 38, 86–95.
Wagener K.C., Josvassen J.L. & Ardenkjaer R. 2003. Design, optimization and evaluation of a Danish sentence test in noise. Int J Audiol, 42, 10–17.
Wagener K., Kühnel V. & Kollmeier B. 1999c. Development and evaluation of a German sentence test, part I: Design of the Oldenburg sentence test (in German). Zeitschrift für Audiologie, 38, 4–15.

