International Journal of Audiology 2013; 52: 861–864

Technical Report

Automated auditory response detection: Improvement of the statistical test strategy

Int J Audiol Downloaded from informahealthcare.com by Central Michigan University on 12/01/14 For personal use only.

Ekkehard Stürzebecher∗ & Mario Cebulla†
∗WDH Denmark, Petershagen, Germany
†Comprehensive Hearing Center (CHC), Department of Otorhinolaryngology, Plastic, Aesthetic and Reconstructive Head and Neck Surgery, Julius Maximilian-University Hospitals, Würzburg, Germany

Abstract

Objective: Automated auditory response detection is always performed by applying an appropriate statistical test to a sample of stimulus-related epochs of the raw EEG. The often-used sequential test strategy saves time, but the multiple testing increases the probability of falsely detected responses. Therefore, the critical test value must guarantee the specified error probability for the maximum test step number. However, response detection at all lower test step numbers is disadvantaged. We propose calculating the critical test values for each test step number, which correspond exactly to the given error probability. Design: The critical test values for each test step were calculated by the method described by Stürzebecher et al (2005). A table with the test values was implemented with customized software of the Eclipse ASSR system® (Interacoustics, Denmark). Study sample: Table-related testing was performed on a sample of raw EEG data collected during the routine clinical measurement of frequency-specific auditory steady-state responses (ASSR) for hearing threshold assessment. Results: The new test strategy leads to a significantly increased detection rate and a significantly shorter detection time. Conclusions: The new test strategy can improve the performance of the objective hearing threshold assessment and of the newborn hearing screening.

Key Words: ABR; automated auditory response detection; repeated testing; error probability; critical test value

Correspondence: Mario Cebulla, Department of Experimental Audiology & Electrophysiology, University Clinic Würzburg, Josef-Schneider-Str. 2, D-97080 Würzburg, Germany. E-mail: [email protected] (Received 18 April 2013; accepted 1 July 2013) ISSN 1499-2027 print/ISSN 1708-8186 online © 2013 British Society of Audiology, International Society of Audiology, and Nordic Audiological Society DOI: 10.3109/14992027.2013.822995

Abbreviations

ABR   Auditory brainstem responses
ASSR  Auditory steady-state responses
EEG   Electroencephalogram
OAE   Otoacoustic emissions

Automated auditory response detection is always performed by applying an appropriate statistical test to a sample of stimulus-related epochs of the raw electroencephalogram (EEG). Statistical testing can be performed in two ways. One is to apply the test only once to a sample of a predetermined size. The test is applied once the complete sample of EEG epochs has been collected. This procedure has the following disadvantage: with a fixed, small number of epochs, small responses may not be detected. If a large sample size is chosen, time will be wasted if there are responses with a larger amplitude that could have been detected using a considerably smaller sample size.

The other testing approach is to use a sequential test strategy. The statistical test is applied as soon as a predefined minimum number of stimulus-related epochs are available. If no response is detected by this first test step, the sample is extended by a given number of epochs (e.g. by one epoch) and the test is repeated. This procedure is repeated until a response is detected or a predefined maximum number of epochs has been reached (i.e. a predefined maximum examination time has expired). This stepwise test procedure saves time compared to the use of a fixed sample size that is tested only once. However, there is a distinct disadvantage to a sequential test strategy: multiple testing increases the probability of a false rejection of the null hypothesis (of a false pass in a hearing screening). The probability of a falsely detected response increases with each additional test step. Therefore, in the case of multiple testing, a correction (reduction) of the given significance level p, for instance, by Bonferroni’s correction rule (Hochberg & Tamhane, 1987) is necessary. With the corrected significance level p’, the required probability p for a false rejection of the null hypothesis is guaranteed for the given maximum number of test steps. However, as a consequence of the increased critical test value, the probability of a false rejection of the alternative hypothesis inevitably increases. Therefore, an existing response might no longer be detected.

The above-mentioned Bonferroni correction only applies in the case of multiple tests of independent data. In the case of sequential tests of EEG epochs, the statistical test is applied to dependent data. With dependent data, Bonferroni’s correction is too conservative (Hochberg & Tamhane, 1987), i.e. a lower level of p’ than necessary is chosen to fulfill the given significance level p for the maximum number of test steps. As a consequence, the detection rate is reduced, and the detection time is prolonged beyond the extent required. Stürzebecher et al (2005) described a simulation procedure for assessing the critical test value necessary for sequential testing of a sample of dependent data. The proposed method avoids the described disadvantage of Bonferroni’s correction because the calculated critical test value ensures a correct significance level of p for the given maximum number of test steps. Compared to the use of Bonferroni’s correction, this produces an increased detection rate and a shorter mean detection time. However, this critical test value ensures the specified error probability only for the maximum number of test steps. The detection of all responses that could be accomplished with a number of test steps smaller than the maximum number is still disadvantaged because, with this critical test value, the error probability for all numbers of test steps below the maximum is lower than p.

Therefore, a new idea, which the present report is based on, is to calculate the critical test value for each test step number that corresponds exactly to the given error probability and to implement a table of these values in the detection algorithm. Because, with every test step, the probability of a falsely detected response increases in comparison to the preceding test step, the critical test value must also increase with every step. Using a table of critical test values adapted to each test step is referred to hereafter as ‘table-related testing’. The aim of the study was to examine whether such table-related testing can enhance the test performance. The outcomes of the investigation of the table-related testing are presented in this report and will show that response detection is significantly improved by this type of statistical testing.
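The idea of a step-specific table can be illustrated with a small Monte Carlo sketch. It is not the simulation procedure of Stürzebecher et al (2005), and it does not use the modified q-sample test applied in this report. As a stand-in, it assumes a Rayleigh statistic computed on uniformly distributed phases (the null hypothesis of no response), and it takes the critical value at step k to be the (1 − p) quantile of the largest statistic seen up to step k. The function name `simulate_critical_table` and all parameter names are hypothetical. The Bonferroni value is computed under the further assumption that a single test value follows a chi-square distribution with two degrees of freedom.

```python
import numpy as np

def simulate_critical_table(alpha=0.05, n_min=20, n_max=180, step=2,
                            n_runs=5000, seed=0):
    """Monte Carlo estimate of one critical value per test step under H0."""
    rng = np.random.default_rng(seed)
    # H0 (no response): the phase of every epoch is uniform on the circle.
    phases = rng.uniform(0.0, 2.0 * np.pi, size=(n_runs, n_max))
    cos_sum = np.cumsum(np.cos(phases), axis=1)
    sin_sum = np.cumsum(np.sin(phases), axis=1)
    sizes = np.arange(n_min, n_max + 1, step)   # sample size at each test step
    idx = sizes - 1
    # Stand-in statistic (assumption): Rayleigh statistic 2 * n * Rbar**2.
    stat = 2.0 * (cos_sum[:, idx] ** 2 + sin_sum[:, idx] ** 2) / sizes
    # Largest statistic seen up to each step; steps share epochs, so the
    # successive test values are dependent.
    running_max = np.maximum.accumulate(stat, axis=1)
    # Critical value for step k: (1 - alpha) quantile of the running maximum.
    return sizes, np.quantile(running_max, 1.0 - alpha, axis=0)

sizes, table = simulate_critical_table()
fixed_value = table[-1]                       # value adjusted to the last step
# Bonferroni-corrected value under the chi-square(2) approximation (assumption).
bonferroni_value = -2.0 * np.log(0.05 / len(table))
print(len(table), float(table[0]), float(fixed_value), float(bonferroni_value))
```

In this sketch the table rises from step to step and ends at the value that a fixed critical value adjusted to the last step would take, while the Bonferroni value lies above both, which mirrors the argument made in the text for dependent data.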

Methods

The investigation of the table-related testing was performed on raw EEG data. The data were collected using customized software which allows the saving of the raw EEG data by the Eclipse ASSR system® (Interacoustics, Denmark) during the routine measurement of frequency-specific auditory steady-state responses (ASSR) for hearing threshold assessment. The stimuli were narrow-band chirps. The data were stored on a hard disk together with the indication of start and stop of the stimuli. No further information, e.g. the stimulus level, could be stored. ASSR at four frequencies (500, 1000, 2000, and 4000 Hz) were recorded simultaneously from both sides. Because the stimulus repetition rate differed slightly with the four frequencies, the response harmonics were located in the spectrum at different frequencies. The eight responses could, thereby, be assigned to the four frequencies easily. The minimum step width for the stimulus level was 10 dB.

Forty-nine patients were included in the investigation. The patients’ ages ranged from 19 days to 6.25 years (mean of 9.8 months). The hearing loss of the patients ranged from mild to severe. Cochlear implant candidates with no response at all four frequencies were excluded.

The maximum test time for the Eclipse ASSR system® is 360 seconds. With an epoch length of two seconds, the number of epochs within the maximum test time is 180. To reduce the maximum number of test steps, the test algorithm uses a test step width of 2. This means that the test is carried out again after every two further epochs are added to the sample. The testing starts with a minimum sample size of 20 epochs. This results in a maximum of 81 test steps. The stimulation at a given stimulus level was stopped as soon as a response was detected. The detection time was calculated from the number of epochs necessary to detect a response. In those cases where a response was detected with table-related testing, but not when using the method with a fixed critical test value, a test duration of 360 seconds was assumed for the latter when calculating the mean detection time.

The response detection was performed in the frequency domain by a modified version (Cebulla et al, 2006) of Mardia’s q-sample uniform scores test (Mardia, 1972), which used 12 harmonics to detect a response. The statistical test was applied twice to the raw data: once with the critical test value adjusted to the maximum number of test steps and again with table-related testing. The given error probability for the false detection of a response was 5%.

The verification that the specified error probability of 5% was actually observed was carried out as follows: The detection algorithm was applied to the spectral noise components adjacent to the response harmonics. The percentage of falsely detected responses was determined. It reflects the actual error probability. For this measure, all available recordings were used. In addition, the noise components were tested separately using only those recordings that had a record length of 360 seconds.

The differences between the detection rates resulting from the two different test applications were analysed for significance using McNemar’s test (Siegel, 1956), whereas the detection-time differences were tested for significance using the paired Wilcoxon signed-rank test (Siegel, 1956). The difference between the percentages of falsely detected responses was tested for significance with the two-sample proportions test (Newcombe, 1998).
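The test-step arithmetic and the stepwise procedure described above can be written out as a short sketch. The timing constants are those given in the text. The function `detection_time` and its arguments are hypothetical names introduced for illustration, and the statistic values passed to it are assumed to be precomputed by whatever statistical test is in use; this is not the code of the Eclipse ASSR system®.

```python
MAX_TEST_TIME_S = 360   # maximum test time given in the text
EPOCH_LENGTH_S = 2      # epoch length given in the text
MIN_EPOCHS = 20         # minimum sample size at the first test step
STEP_WIDTH = 2          # epochs added between two test steps

max_epochs = MAX_TEST_TIME_S // EPOCH_LENGTH_S
# Sample size (number of epochs) at each test step: 20, 22, ..., 180.
sample_sizes = list(range(MIN_EPOCHS, max_epochs + 1, STEP_WIDTH))
max_test_steps = len(sample_sizes)

def detection_time(statistics, critical_values):
    """Return the detection time in seconds, or None if no response is found.

    statistics[k] is the test value at test step k + 1 (hypothetical input);
    critical_values[k] is the critical value for that step. A fixed critical
    value is simply the same number repeated for every step.
    """
    for n_epochs, value, critical in zip(sample_sizes, statistics, critical_values):
        if value > critical:
            # Stimulation stops at the first step that exceeds the critical value.
            return n_epochs * EPOCH_LENGTH_S
    return None

print(max_epochs, max_test_steps)
```

For the mean detection time, a run in which nothing is detected with the fixed critical value would be counted as `MAX_TEST_TIME_S`, following the rule stated in the Methods.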

Results

The lower portion of Figure 1 shows the graphic representation of the simulated critical test values for each of the test steps 1 to 81. These form the basis for the table-related testing. In the lower test-step range, the critical test value ascends steeply, but it flattens asymptotically toward the maximum number of test steps. The straight line in the upper portion represents the critical test value adjusted to the maximum number of test steps. Until now, this critical value has been used for all test steps. For the test steps 1 to 80, the critical test values for table-related testing are smaller than the critical test value adjusted to the maximum number of test steps. As expected, both have the same value for test step 81.

Table 1 presents the detection rates and times for the four audiometric frequencies and the two types of testing. For all four frequencies, table-related testing led to a higher detection rate and a shorter detection time when compared to the use of a critical test value adjusted to the maximum number of test steps. All differences are highly significant (p < 0.001). The detection of the 500 Hz responses had the greatest gain with table-related testing. While the detection rate grows with the remaining frequencies by approximately 9 to 11 percentage points, the gain with 500 Hz amounts to approximately 15 percentage points. The detection time shows the same trend. While the decrease in the mean detection time with the other frequencies ranges between 51 and 56 seconds, it was 82 seconds with 500 Hz. When using table-related testing, the detection of the 500 Hz responses takes approximately the same time as the response detection with the remaining frequencies.

Table 2 shows the results of the verification of the specified error probability. With the critical test value adjusted to the maximum number of test steps, the percentage of falsely detected responses is 2.6% if all record lengths are included. This is clearly lower than the specified 5%.
The difference between 2.6% and 4.9% (record length of 360 s only; Table 2) is significant (p < 0.001). The percentage of falsely detected responses corresponds to the given error probability only if the test lasts the complete 360 seconds. By contrast, the error probability is about 5% for all record lengths if table-related testing is used.

[Figure 1: critical test value (axis range 17 to 24) plotted against number of epochs (axis range 0 to 200); curves: fixed critical value, table-related.]

Figure 1. Graphic representation of the simulated critical test values. The straight line in the upper portion represents the critical test value adjusted to the maximum number of test steps. The lower portion shows the graphic representation of the simulated critical test values for each of the test steps 1 to 81 which are used for table-related testing.
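The gains quoted in the Results can be recomputed directly from the values in Table 1. The short sketch below does only this arithmetic; the dictionary names are hypothetical, and the numbers are transcribed from Table 1 (first entry: critical test value adjusted to the maximum number of test steps, second entry: table-related testing).

```python
# Detection rate in percent, per frequency in Hz: (adjusted, table-related).
detection_rate = {500: (53.3, 68.2), 1000: (65.2, 74.3),
                  2000: (66.4, 75.1), 4000: (71.4, 82.4)}
# Mean detection time in seconds, per frequency in Hz: (adjusted, table-related).
mean_time_s = {500: (178, 96), 1000: (143, 92),
               2000: (149, 93), 4000: (141, 88)}

# Gain in detection rate, in percentage points.
rate_gain = {f: round(table - adjusted, 1)
             for f, (adjusted, table) in detection_rate.items()}
# Reduction of the mean detection time, in seconds.
time_saving = {f: adjusted - table
               for f, (adjusted, table) in mean_time_s.items()}

print(rate_gain)    # {500: 14.9, 1000: 9.1, 2000: 8.7, 4000: 11.0}
print(time_saving)  # {500: 82, 1000: 51, 2000: 56, 4000: 53}
```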

Discussion

Repeated testing is applied in devices using ASSR for objective hearing threshold assessment, and is also used for newborn hearing screening, where a statistical detection of auditory brainstem responses (ABR) or otoacoustic emissions (OAE) is carried out. In all cases, the user must rely on the manufacturer to make sure that the specified error probability for a falsely detected response is guaranteed up to the maximum test duration. The user must also be aware that they may be responsible for an unwanted increase in the error probability. For example, repeating an OAE-screening test several times on a newborn who has failed the first test constitutes multiple testing, and every repetition raises the probability of the false detection of a response.

The results of the calculations to verify the specified error probability given in Table 2 show that using a constant critical test value adjusted to the maximum test step number is unfavorable. It worsens the response detection at all test step numbers smaller than the maximum, because the critical test value is larger than necessary for the smaller test step numbers. Thus, with these test steps, the error probability is lower than the specified error probability. Therefore, response detection takes a greater amount of time, and small responses cannot be detected within the given maximum test time. Using a table of critical test values adjusted to each test step maintains the specified error probability at all test steps.

As the results presented in Table 1 show, table-related testing leads to a higher detection rate and shorter detection times when compared to the use of a critical test value adjusted to the maximum number of test steps. However, even the improved detection rate seems rather low. This can be explained by the fact that with a threshold measurement, the stimulus level is always reduced to a value at which no response can be detected. Moreover, as can occur with children with unknown hearing loss, the starting stimulus level may have been too low, and, therefore, no response was detectable. For this reason, among the three to four recordings with different intensities at each frequency, there was at least one where no response was detectable by the algorithm present in the device.

The customized software of the Eclipse ASSR system® that was used only allows the storage of the raw EEG data. There is no information about the stimulus levels used or the patient’s behavioral threshold. That is not a disadvantage for the present study because the study was not designed to examine the improvement of the objective hearing threshold assessment by the Eclipse ASSR system®.

Table 1. Detection rate and mean detection time which arise with the two test conditions. "Adjusted" denotes the critical test value adjusted to the maximum number of test steps; "Table-related" denotes table-related testing.

Frequency   N     Detection rate               Mean detection time
                  Adjusted   Table-related     Adjusted   Table-related
500 Hz      190   53.3%      68.2%             178 s      96 s
1000 Hz     192   65.2%      74.3%             143 s      92 s
2000 Hz     181   66.4%      75.1%             149 s      93 s
4000 Hz     184   71.4%      82.4%             141 s      88 s

Table 2. Percentage of falsely detected responses.

Test condition                                                      Record lengths   Falsely detected   N
Critical test value adjusted to the maximum number of test steps   All              2.6%               2056
Critical test value adjusted to the maximum number of test steps   360 s            4.9%               288
Table-related testing                                               All              5.3%               2056


The aim of the study was to examine if table-related testing can improve the statistical detection of ASSR. The results show that table-related testing can, indeed, improve objective ASSR detection. In addition, it seems reasonable to discuss the consequences of the results for the objective threshold assessment. The threshold assessment was performed in steps of 10 dB until no response could be detected. The 10 dB higher level was considered as the objective threshold. Therefore, one can assume that in most of the cases where, in contrast to the test procedure with a fixed critical test value, a response could be detected by table-related testing, the objective threshold is lowered by 10 dB.

Due to the reduced synchronization in the apical part of the cochlea, the amplitude of the 500 Hz response is smaller than that of the responses to the higher stimulus frequencies (Picton et al, 2003). This is also true with the narrow-band chirp stimuli used here, which improve the synchronization, especially in the low-frequency range. The lower response amplitude at 500 Hz makes response detection more difficult. Therefore, there is a larger gap between the objective hearing threshold and the behavioral threshold when compared to the other frequencies (Herdman & Stapells, 2003). As the present results show, table-related testing has increased the detection rate with 500 Hz more than with the other frequencies. Therefore, the difference between the 500 Hz detection rate and the detection rates with the higher frequencies is smaller with table-related testing. Due to this fact, the disadvantage existing with 500 Hz is likely to be reduced by table-related testing. The detection rate with the other three frequencies was already relatively high with the critical test value adjusted to the maximum number of test steps. Therefore, only a smaller improvement was possible.

In addition to the increase in detection rate, table-related testing leads to a decrease in response detection time. The mean time for detecting the single responses is significantly reduced at all four frequencies by table-related testing. Consequently, the use of table-related testing will significantly reduce the time required for measuring the complete objective threshold. A threshold estimation (four frequencies, both ears) by the Eclipse ASSR system® currently takes about 30 minutes. It is to be expected that threshold assessment with table-related testing will take less than 30 minutes.

The presented results were gained from calculations on frequency-specific ASSR measurements with hearing-impaired subjects. Therefore, the improvements that result from the new test strategy can be claimed to be valid only for these specific conditions. But it is very likely that this technique will also work for other responses and subject groups. This was demonstrated by using table-related testing for retesting a large ABR database of approximately 1900 sets of stored raw EEG data from newborn hearing screening that was acquired with the screening device MB11 BERAphone® (MAICO, Germany). The results (not published) confirmed those obtained with the ASSR data. The time necessary to perform a screening test, which is already short with the algorithm present in the device, is significantly shortened, and the specificity is significantly improved.

Acknowledgements

The authors thank Christiane Walk and Ralph Keim, Department of Audiology, University Clinic Würzburg, for assisting with the data collection.

Declaration of interest: The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the paper.

References

Cebulla M., Stürzebecher E. & Elberling C. 2006. Objective detection of auditory steady-state responses: Comparison of one-sample and q-sample tests. J Am Acad Audiol, 17, 93–103.
Herdman A.T. & Stapells D.K. 2003. Auditory steady-state response thresholds of adults with sensorineural hearing impairments. Int J Audiol, 42, 237–248.
Hochberg Y. & Tamhane A.C. 1987. Multiple Comparison Procedures: Wiley.
Mardia K.V. 1972. Statistics of Directional Data. London, New York: Academic Press.
Newcombe R.G. 1998. Interval estimation for the difference between independent proportions: Comparison of eleven methods. Stat Med, 17, 873–890.
Picton T.W., John M.S., Dimitrijevic A. & Purcell D. 2003. Human auditory steady-state responses. Int J Audiol, 42, 177–219.
Siegel S. 1956. Non-parametric Statistics for the Behavioral Sciences. London: McGraw-Hill.
Stürzebecher E., Cebulla M. & Elberling C. 2005. Automated auditory response detection: Statistical problems with repeated testing. Int J Audiol, 44, 110–117.
