JOURNAL OF NEUROTRAUMA 32:1281–1286 (August 15, 2015) © Mary Ann Liebert, Inc. DOI: 10.1089/neu.2014.3623

Short Communication

Procedures for the Comparative Testing of Noninvasive Neuroassessment Devices

Paul E. Rapp,1 David O. Keyser,1 and Adele M. K. Gilpin2

Abstract

A sequential process for comparison testing of noninvasive neuroassessment devices is presented. Comparison testing of devices in a clinical population should be preceded by computational research and reliability testing with healthy populations, as opposed to proceeding immediately to testing with clinical participants. A five-step process is outlined as follows:

1. Complete a preliminary literature review identifying candidate measures.
2. Conduct systematic simulation studies to determine the computational properties and data requirements of candidate measures.
3. Establish the test–retest reliability of each measure in a healthy comparison population and the clinical population of interest.
4. Investigate the clinical validity of reliable measures in appropriately defined clinical populations.
5. Complete device usability assessment (weight, simplicity of use, cost effectiveness, ruggedness) only for devices and measures that are promising after steps 1 through 4 are completed. Usability may be considered throughout the device evaluation process, but such considerations are subordinate to the higher priorities addressed in steps 1 through 4.

Key words: assessment; psychophysiology; test–retest reliability; validity; qEEG

Introduction

This communication recommends sequential procedures for comparison testing of biomedical devices used in the noninvasive assessment of neuropsychiatric disorders. Quantitative electroencephalographic (qEEG) assessment of traumatic brain injury is used as a specific example, but the issues addressed are commonly encountered in biomedical device comparative test and evaluation studies and generalize to other device categories, such as devices for assessing heart rate variability, event-related potentials, and eye tracking, and to other neuropsychiatric disorders, such as major depressive disorder, anxiety disorders, schizophrenia, and post-traumatic stress disorder.

When evaluating biomedical devices, it is useful to make a distinction among "devices," "signals," and "measures." As the term is used here, a "device" is a hardware system that acquires signals. An example of a "signal" is a digitized multichannel EEG signal. Other examples include reaction times, electrocardiograms (ECGs), and eye movement trajectories. A "measure" is a quantity computed from a signal. In the case of EEGs, commonly encountered measures include band-specific power spectral densities, between-channel coherence, and single-channel algorithmic complexity.1–3 Generally, clinical decisions are based on measures and not on signals, although it is recognized that in some instances evaluations are based on qualitative visual inspections of EEGs and ECGs.

In the case of traumatic brain injury, particularly mild traumatic brain injury, a visually examined EEG is usually uninformative. This has motivated the search for quantitative measures calculated from EEG signals that provide a clinically useful assessment when combined in a multivariate statistical analysis. This search for clinically useful quantitative measures is the focus of this communication.

The question, "What device should be used?" cannot be answered without first answering the question, "What measure(s) should be obtained?" For a given device category and clinical population, the operational question is, "What device should be used?" This may be addressed by constructing a decision matrix (Table 1). Implementing the decision process requires: 1) identifying pertinent criteria; 2) establishing a scoring system for each criterion; and 3) constructing a scoring algorithm to produce an aggregate score that includes a weighting factor for each criterion. The aggregate score is not a simple additive score. Some criteria are dichotomous in nature and would be scored Yes/No or Present/Absent. Additionally, some criteria are so critical that an entry of "No" or "Absent" results in deselection (i.e., an aggregate score of zero). This must be incorporated into the scoring algorithm. As discussed below, each measure from each device signal must be considered separately and its scores incorporated into the decision matrix. The numerical values quantifying reliability, validity, and usability that are introduced in the subsequent sections of this paper can be entered into the decision matrix.
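To make the scoring procedure concrete, the following is a minimal sketch, in Python, of one way a weighted aggregate score with deselecting dichotomous criteria could be computed. The criterion names, weights, and scores shown are hypothetical illustrations and are not values drawn from this study.

```python
# Minimal sketch of a decision-matrix aggregate score (hypothetical values).
# Numeric criteria contribute a weighted sum; a "No" on a critical
# dichotomous criterion deselects the measure (aggregate score of zero).

def aggregate_score(scores, weights, critical):
    """scores: dict mapping criterion -> numeric value or bool (Yes/No).
    weights: dict mapping numeric criterion -> weighting factor.
    critical: set of dichotomous criteria whose failure deselects."""
    for criterion in critical:
        if not scores.get(criterion, False):
            return 0.0  # deselection: a critical criterion is "No"/absent
    return sum(weights.get(criterion, 1.0) * value
               for criterion, value in scores.items()
               if not isinstance(value, bool))  # booleans act only as gates

# Hypothetical entries for one measure computed from one device's signal.
scores = {"Criterion 1 (test-retest reliability)": 0.85,
          "Criterion 2 (diagnostic accuracy)": 0.80,
          "Criterion 3 (acceptable epoch length)": True}
weights = {"Criterion 1 (test-retest reliability)": 2.0,
           "Criterion 2 (diagnostic accuracy)": 3.0}
print(aggregate_score(scores, weights,
                      critical={"Criterion 3 (acceptable epoch length)"}))
```

In a full implementation, the same computation would be repeated for every measure obtained from every device signal, and the resulting aggregate scores would fill the cells of Table 1.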

1 Department of Military and Emergency Medicine, Uniformed Services University of the Health Sciences, Bethesda, Maryland.
2 Department of Epidemiology and Public Health, University of Maryland School of Medicine, Baltimore, Maryland.


Step 1: Identifying Candidate Measures

For any given signal category (EEG, ECG, etc.), an extraordinarily large number of candidate measures are available, and given the rate of mathematical research, that number increases daily. Examples drawn from the qEEG literature include spectral measures (frequency band–specific power spectral density, wavelet transforms, the S transform, the Gabor transform), measures of synchronization (coherence, phase synchronization, phase locking index, phase locking value, phase lag index), measures of functional connectivity (Pearson product moment correlation, Spearman rank order correlation, Kendall rank order correlation, mutual information), network analysis/graph theory (clustering coefficient, path length), measures of causal relationships (Granger causality, directed mutual information), entropies (approximate entropy, sample entropy, multiscale entropy), and symbolic dynamics (metric entropy, normalized Lempel-Ziv complexity, n-th order entropy).

Detailed, simulation-based assessment of all existing measures is clearly impossible. Therefore, as a first step, a survey of prior clinical literature is indicated. This has been done in the case of qEEG and the assessment of traumatic brain injury, as described by Rapp and colleagues.4 Even when the studies themselves will not withstand statistical scrutiny, the reports may identify promising measures for further evaluation. However, given the size and diversity of the literature and the extraordinarily large number of candidate qEEG measures, an all-inclusive "fishing trip" is not a feasible search procedure. The examination of the literature should be directed by physiological hypotheses concerning the pathophysiological etiology of the specific neuropsychiatric disorder under consideration.

For example, it is reasonable to hypothesize that traumatic brain injury will cause nerve damage, leading to altered functional connectivity and network geometry. A targeted review of the literature indicated that this is indeed the case,4 and therefore measures of functional connectivity would be high-priority measures in the evaluation of traumatic brain injury. Similarly, it is hypothesized that stimulus-response coupling and other cognitive processes fundamental to effective consciousness are implemented physiologically by transient brain cell assemblies that result from local and inter-regional synchronization of neural activity. This would suggest that aberrant EEG synchronization behavior would be observed in patients presenting with psychotic disorders. Studies by Bressler,5 Rockstroh and colleagues,6 and Spencer and colleagues7,8 are representative examples drawn from a larger literature that confirms this anticipation.

Step 2: Determination of the Data Requirements and the Computational Properties of a Measure Must Precede an Investigation of Test–Retest Reliability

The data requirements and the computational procedures for calculating a measure must be established before proceeding to an evaluation of its reliability in a healthy population. Inadequate data or inappropriate computational methods can result in the unnecessary failure to identify a measure that would be clinically valuable if computed correctly. Conversely, many theoretically attractive measures are of limited value when applied to real-world data.

Several factors must be considered when determining a measure's data requirements: time spanned by the recorded signal (epoch length), sampling frequency, signal resolution (in the case of EEGs, this is the number of bits in the digitizer), sensitivity to changes in experimental conditions, and robustness to noise. In its simplest form, the epoch length and sampling frequency question asks how many data points are needed for a meaningful calculation of the measure under consideration. For example, many measures of heart rate variability require at least 5 min of ECG data.9 Measures with acceptable reliability with 5 min of data can fail to satisfy reliability evaluations when computed using 1-min datasets. The question of optimal sampling frequency must reflect the intrinsic time scale of the observed dynamical process. To a degree, the classical sampling theorems of signal processing theory can offer some guidance.10 Criteria based on the autocorrelation time and on the first minimum of the mutual information function also have been proposed.11,12

In the case of a digitized waveform, signal resolution is determined by the number of bits in the digitizer. In the case of interval data (e.g., heart interbeat interval sequences and neural spike trains), the resolution is determined by the clock speed of the recording device. The minimum acceptable values for digitizer resolution and clock speed can be estimated by simulation studies. An instructive computational example of the importance of amplitude resolution when computing a dynamical measure is given by Mees and colleagues.13 In this study, the dynamical measure was the number of significant singular values computed from an embedded time series. This measure is important because this number may provide an upper bound on the dimensional complexity of the dynamical system that generated the observed signal. The number of significant figures in the amplitude was reduced computationally following the sequence 18, 14, 10, and 6 significant figures. The corresponding numbers of significant singular values were 17, 15, 11, and 7. It is seen that very different inferences about the dynamical structure of the system under investigation would follow from different amplitude resolutions.
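As an illustration of how such a simulation study can be structured, the sketch below (in Python, using NumPy) embeds a synthetic two-frequency signal, quantizes its amplitude to progressively fewer significant figures, and prints the normalized singular spectrum of the embedded data. The signal, embedding parameters, and quantization scheme are illustrative assumptions; this is not a reproduction of the Mees and colleagues computation.

```python
# Sketch: effect of amplitude resolution on the singular value spectrum of
# an embedded time series (illustrative signal and parameters only).
import numpy as np

def embed(x, dim, lag):
    """Time-delay embedding: rows are delay vectors of length `dim`."""
    n = len(x) - (dim - 1) * lag
    return np.column_stack([x[i * lag: i * lag + n] for i in range(dim)])

def quantize(x, sig_figs):
    """Round amplitudes to roughly `sig_figs` significant figures of the
    full-scale value, a simple model of reduced digitizer resolution."""
    scale = 10.0 ** (sig_figs - 1 - np.floor(np.log10(np.abs(x).max())))
    return np.round(x * scale) / scale

t = np.arange(20000) * 0.01
signal = np.sin(t) + 0.5 * np.sin(2.7 * t)   # noise-free two-frequency signal

for sig_figs in (18, 14, 10, 6):   # resolution sequence examined by Mees et al.
    s = np.linalg.svd(embed(quantize(signal, sig_figs), dim=20, lag=5),
                      compute_uv=False)
    # Singular values below the quantization-noise floor are not dynamically
    # significant; the floor rises as amplitude resolution is reduced, so
    # fewer components stand above it.
    print(f"{sig_figs:2d} sig figs, log10(s/s_max):",
          np.round(np.log10(s / s[0]), 1))
```

In an actual evaluation, the same manipulation would be applied to device-recorded signals, and the resolution at which dynamically significant components begin to be lost would help set the minimum acceptable digitizer specification.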

Table 1. Device Selection Decision Matrix for Clinical Population X

                  Device A                Device B                ...    Device K
                  Measures 1, 2, ... k    Measures 1, 2, ... k    ...    Measures 1, 2, ... k
Criterion 1
Criterion 2
...
Criterion N
Aggregate score

Scores for each criterion are used to construct an aggregate score for each measure for each device. Criterion scores can be numeric or categorical (Yes/No). The algorithm used to compute the aggregate score incorporates weighting factors for each criterion.

A clinically useful measure must be sensitive to clinically significant factors and comparatively robust to other factors. The complexity of the central nervous system, combined with the sensitivity of measures computed from EEG signals, presents a significant challenge when searching for measures that meet this balance. For example, Polich and Herbst14 compiled extensive lists of factors altering the P300 component of event-related potentials. They cited natural factors (e.g., time of day, food intake), induced factors (exercise, fatigue, drugs), and constitutional factors (age, intelligence, gender, personality). It is impossible to control rigorously for all of these factors in a clinical setting. Further, Polich and Herbst listed factors that are known and reasonably well characterized. We also must recognize that there are countless other factors influencing brain electrical behavior that are not yet discovered. These factors may change during recordings or between recording sessions and may mask disease-dependent changes in central nervous system (CNS) activity or may give false-positive indications of disease. Therefore, we need to identify dynamical measures that strike the requisite balance between sensitivity to clinical state and robustness to clinically insignificant factors.

Some dynamical measures are extremely sensitive and thus probably are not appropriate in clinical applications. Nonlinear measures (e.g., the correlation dimension as quantified by the Grassberger-Procaccia algorithm) provide discouraging examples of this sensitivity.15 Several investigators calculated the correlation dimension of EEGs using the Grassberger-Procaccia algorithm (reviewed by Pritchard and Duke16) and reported evidence for low-dimensional chaotic behavior in the EEG. Other investigators, such as Destexhe and colleagues,17 took a more cautious view and suggested that while dispositive evidence of chaotic behavior in the EEG might not be possible, the correlation dimension of the EEG would still be useful in discriminating between different behavioral states. Theiler and Rapp18 re-examined 110 EEG records that in an earlier analysis had provided evidence for low-dimensional behavior when analyzed by the Grassberger-Procaccia algorithm. Signals were obtained from healthy controls in three behavioral states: 1) eyes closed, no task; 2) eyes closed, counting silently backwards by seven; and 3) eyes closed, counting silently forward by two. On reexamination, previously observed evidence of low-dimensional EEG structure was found to be an artifact due to autocorrelations in an oversampled signal. Additionally, the corrected correlation dimension failed to meet the more modest objective of distinguishing between behavioral states.

Theiler and Rapp's analysis raises another critical point. A distinction must be made between a statistical measure that quantifies between-group differences and a statistical measure that quantifies error in a classification (diagnostic) process. For example, consider the comparison of the first minimum of the autocorrelation function obtained in the no-task state and in the serial sevens state. The probability that the difference of the means of the two groups is due to random variation, as determined by a t test, is 0.002. The probability of classification error using this measure is 0.42, where it should be recalled that classification by random assignment results in a misclassification error of 0.5; that is, P_SAME ≠ P_ERROR.
In a computational example,19 it was shown that a system in which P_t-test = P_SAME = 3.2 × 10^-13 could still yield P_ERROR = 0.32. It is not uncommon in the qEEG community to encounter reports where it is suggested that a low value of P_SAME ensures a successful diagnostic measure. This is most emphatically not the case. The infrequently reported P_ERROR is the statistic of interest.
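The distinction between P_SAME and P_ERROR is easy to reproduce numerically. The sketch below uses two assumed, heavily overlapping normal distributions and a simple threshold classifier; it illustrates the general point rather than recreating the published example.

```python
# Sketch: a very small t-test p-value (P_SAME) can coexist with a large
# classification error (P_ERROR). Distributions and sample sizes are
# illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 2000
group_a = rng.normal(loc=0.0, scale=1.0, size=n)   # e.g., no-task state
group_b = rng.normal(loc=0.2, scale=1.0, size=n)   # e.g., serial-sevens state

# P_SAME: probability that the observed difference of means is due to chance.
_, p_same = stats.ttest_ind(group_a, group_b)

# P_ERROR: misclassification rate of a single-threshold classifier placed
# midway between the two group means.
threshold = 0.5 * (group_a.mean() + group_b.mean())
errors = np.sum(group_a >= threshold) + np.sum(group_b < threshold)
p_error = errors / (2 * n)

print(f"P_SAME  = {p_same:.1e}")   # very small: the group means differ
print(f"P_ERROR = {p_error:.2f}")  # close to 0.5: nearly useless diagnostically
```

With large samples and heavily overlapping distributions, the group difference is statistically unambiguous while individual classification remains close to chance, which is exactly the situation the text warns against.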

A determination of a measure's sensitivity to noise is another important step in its characterization. Different measures have differing sensitivities to noise. For example, the Lyapunov exponent is sensitive to noise. In contrast, complexity measures calculated after partitioning a signal about its median value are comparatively robust to noise and have been applied successfully in the analysis of EEG signals.20

A measure's robustness to noise can be investigated computationally. Measures of EEG functional connectivity provide an instructive example.21 In this study, 8-sec, 10-channel EEG records were obtained from healthy controls in two behavioral states: eyes closed, no task, and eyes open, no task. Four measures of functional connectivity were compared: Pearson r, Spearman rho, Kendall tau, and mutual information. Noise was added to the original EEG records to produce progressively decreasing signal-to-noise ratios. Mutual information was found to be the most robust to noise of the four measures. This is an important consideration when selecting between candidate measures. The data in the Bonita and colleagues study were obtained from highly cooperative healthy control participants.21 Signals obtained from clinical participants are often noisier. Robustness to noise is therefore an important consideration when evaluating candidate clinical qEEG measures.

Bonita and colleagues also provide an example of how a qEEG measure's sensitivity to epoch duration can be investigated.21 In this part of the analysis, the ability of the four measures of functional connectivity to distinguish between behavioral states was measured as a function of epoch duration. Of the four measures examined, only mutual information was able to distinguish between the two states with 1 sec of data.

Determination of epoch length, sampling frequency, signal resolution, sensitivity to experimental conditions, and robustness to noise are interrelated. High-resolution, noise-free data give the best results and provide an indication of the smallest possible data set that can be used to compute a measure. Data requirements will increase as noise levels increase. Understanding the interplay among these factors requires systematic simulation studies. Simulations can be employed to calculate long, high-resolution, noise-free signals that can be used to calculate gold-standard estimates of dynamical measures. Signal quality can then be manipulated computationally and the effects on the measure's estimate can be evaluated. These results can then be used to specify, and evaluate, signal acquisition protocols.

An additional complication must be recognized. Thus far, we have focused on data requirements. A measure's value is sensitive to both the data and the computational procedure used to compute the measure from the data. The computational properties of a measure are determined not only by its mathematical definition but by the numerical algorithm used to estimate the measure. The choice of algorithm can have an important effect on the accuracy of a measure's estimate, and data requirements can vary significantly depending on the algorithm. A measure that has failed reliability testing when computed with one algorithm might well succeed if another algorithm is used. Mutual information, an important measure of central nervous system functional connectivity,21 provides an example. It was found that the data requirements and computation times of the Fraser-Swinney mutual information algorithm12 are significantly different from those of the Cellucci algorithm.22 Additionally, the optimal choice of algorithm may change in response to changes in data set size and data quality.
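A simulation of the kind described above for the Bonita and colleagues noise study can be sketched as follows. Two coupled synthetic signals stand in for a pair of EEG channels, increasing amounts of noise are added, and two connectivity measures are recomputed at each noise level: the Pearson correlation and a simple histogram (plug-in) estimate of mutual information. The signals, noise levels, and binned estimator are illustrative assumptions, not the published data or the Fraser-Swinney or Cellucci algorithms.

```python
# Sketch: track how two functional-connectivity estimates (Pearson r and a
# simple binned mutual-information estimate) change as noise is added.
import numpy as np
from scipy import stats

def binned_mutual_information(x, y, bins=16):
    """Plug-in mutual information estimate (nats) from a 2-D histogram.
    A deliberately simple estimator used only for illustration."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(2)
t = np.arange(2000) * 0.01
source = np.sin(2 * np.pi * t) + 0.3 * rng.standard_normal(t.size)
coupled = 0.8 * source + 0.2 * rng.standard_normal(t.size)   # shared activity

for noise_sd in (0.0, 0.5, 1.0, 2.0):     # decreasing signal-to-noise ratio
    xa = source + noise_sd * rng.standard_normal(t.size)
    xb = coupled + noise_sd * rng.standard_normal(t.size)
    r, _ = stats.pearsonr(xa, xb)
    mi = binned_mutual_information(xa, xb)
    print(f"noise sd {noise_sd:3.1f}:  r = {r:5.2f}   MI = {mi:5.2f} nats")
```

Rerunning such a sketch with shorter or longer segments of the synthetic signals gives a corresponding picture of a measure's sensitivity to epoch duration.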
Khan and colleagues23 found that kernel density estimators of mutual information performed best for very short data sets (50 or 100 points) and high noise levels, while k-nearest-neighbor methods performed better with slightly longer data sets (1000 points) at lower noise levels. Another example of sensitivity to the choice of numerical algorithm is given by the singular value decomposition.

The singular value decomposition can be used to determine the number of dynamically significant principal components in a signal13 and is therefore a valuable tool for comparing signal qualities obtained from different devices. Mathematically, computing the singular value decomposition is equivalent to diagonalization of the covariance matrix, but the Golub-Reinsch algorithm24 is more robust numerically. Comparative testing of available algorithms should therefore be an element in the assessment of candidate measures.

It is not enough to compute only the point value of a measure. When possible, the computational process should include estimation of the associated confidence interval. In the case of nonlinear measures, comparisons against surrogate data should be included. Procedures for constructing surrogate data sets are described by Theiler and colleagues25 and Schreiber and Schmitz.26 These procedures can be generalized to analyze multichannel signals.27 This generalization is particularly important in qEEG studies. If an estimation process fails to reject the surrogate null hypothesis, the measure should probably not be used in a clinical decision process. It should be noted that the method of surrogate data can be misapplied. Inappropriate computational processes can result in the false-positive rejection of the surrogate null hypothesis. Procedures to guard against this possibility have been described by Rapp and colleagues.28

Step 3: Determination of a Measure's Test–Retest Reliability in a Healthy Population and the Clinical Population Must Precede Investigation of Its Clinical Utility

Test–retest reliability is an essential property of a clinically useful measure. Determination of a measure's test–retest reliability in both a healthy and a clinical population must precede investigations of its clinical utility. Reliability is necessary, but not sufficient, to ensure clinical usefulness. Body temperature provides an illustrative example. Body temperature is a useful clinical measure because it is relatively constant in a healthy individual. If temperature varied randomly between 85°F and 105°F, it would be of little value. Therefore, an investigation of a measure's clinical validity should be undertaken only if the measure has been shown to have adequate test–retest reliability.

It is therefore necessary to select a procedure for quantifying test–retest reliability. A recent review4 identified seven procedures that have been used in qEEG reliability studies. That review recommended the intraclass correlation coefficient,29 where it should be noted that the equation used to calculate the coefficient will depend on the statistical model appropriate to the given study. Guidance for the selection of an appropriate intraclass correlation coefficient is provided by Müller and Büttner.30 It is essential to identify the variant of the intraclass correlation coefficient used in any report of the results.31 Additionally, it is necessary to estimate and report the coefficient's confidence interval.32 There is a large literature proposing different methods to estimate the sample size needed in a test–retest reliability study. The method constructed by Zou33 is recommended. Healthy controls should be used in initial reliability testing because this population will give the best reliability results, thus establishing a ceiling for reliability. If a measure fails with this population, it will not succeed clinically.
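As a concrete illustration of the reliability statistic discussed above, the following sketch computes one common variant, the two-way random-effects, absolute-agreement, single-measurement intraclass correlation coefficient (often written ICC(2,1) following Shrout and Fleiss), together with a simple bootstrap confidence interval. As the text notes, the appropriate variant depends on the study design; the simulated data, session counts, and bootstrap interval are illustrative assumptions rather than a recommended analysis plan.

```python
# Sketch: ICC(2,1) (two-way random effects, absolute agreement, single
# measurement) with a bootstrap confidence interval, on simulated data.
import numpy as np

def icc_2_1(data):
    """data: (n_subjects, k_sessions) array of one measure's values."""
    n, k = data.shape
    grand = data.mean()
    row_means = data.mean(axis=1)            # subject means
    col_means = data.mean(axis=0)            # session means
    ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)
    ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)
    ss_err = (np.sum((data - grand) ** 2)
              - k * np.sum((row_means - grand) ** 2)
              - n * np.sum((col_means - grand) ** 2))
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

rng = np.random.default_rng(3)
n_subjects, k_sessions = 30, 3
subject_effect = rng.normal(0.0, 1.0, size=(n_subjects, 1))   # stable trait
noise = rng.normal(0.0, 0.5, size=(n_subjects, k_sessions))   # session noise
data = subject_effect + noise

point = icc_2_1(data)
boot = [icc_2_1(data[rng.integers(0, n_subjects, n_subjects)])
        for _ in range(2000)]                 # resample subjects with replacement
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"ICC(2,1) = {point:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

The same computation, repeated for each candidate measure on each device, yields the reliability point estimates and confidence intervals that can be entered into the decision matrix.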
As with all other aspects of clinical research, there are methodological challenges associated with selecting the healthy control group. Determination of the inclusion/exclusion criteria used to specify this group requires care so as not to be either overly inclusive or underinclusive. It has been recognized that if the criteria are too rigorous, an atypical control population will have been enrolled.

Specifications for control groups for condition-specific research are sometimes laid out in consensus documents and, when available, these may be good starting points.34,35 Obviously, the criteria defining an appropriate control group for a geriatric cancer treatment study are not the same as those for a control group for PTSD in a military population. Choices of the control and condition populations are related to the specificity with which one may answer the ultimate questions of which devices and measures are best (i.e., best for whom).

The test–retest protocol must specify those factors, among those identified in the previous section, that will be controlled at each assessment. As indicated, the list of factors that can affect CNS activity is prohibitively long. It is impossible to control for all of these factors. Choices must be made; thus, it is essential that documentation of the results of device evaluation explicitly identify the controlled factors. The choice of time interval between assessments presents an additional problem, and as before, there is no single correct selection. In some applications, the retest interval is indicated by the objectives of the study. For instance, suppose that the objective is an assessment of the therapeutic effects of a CNS-active medication over a six-week period. This objective determines that test–retest reliability with healthy controls should, at a minimum, be assessed at six weeks. In the case of test–retest evaluations undertaken in the absence of a specific clinical application, an explicit indication of the appropriate test–retest interval is usually not available. A best estimate based on anticipated future use must be made.

A related question also requires attention: how many retest assessments should be performed? This question also may be investigated through simulations. As a practical matter, the number of retest assessments is usually determined by the time and resources available to a project; however, the effect of such a choice of convenience needs exploration.

The interpretation of reliability measures introduces challenges. Portney and Watkins36 make the following valuable observation: "As a general guideline, we suggest that values above .75 are indicative of good reliability, and those below .75 poor to moderate reliability. For many clinical measurements, reliability should exceed .90 to ensure reasonable validity. These are only guidelines, however, and should not be used as absolute standards. Researchers and clinicians must defend their judgments within the context of the specific scores being assessed and the degree of acceptable precision in the measurement" (emphasis in the original text).

Additionally, it should be remembered that a high degree of variability is a long-known characteristic of an injured central nervous system,37,38 and in this area it is not sufficient to explore reliability only in healthy controls. It is imperative to do so in the clinical population as well. A valuable example is presented by Bleiberg and colleagues.39 Neuropsychological performance of six controls and six patients was assessed 30 times over 4 days. Patients showed "erratic and inconsistent performance," while control participants showed consistent improvement. The patients, who were 12 to 30 months post-injury at the time of the study, had initially presented with mild to moderate traumatic brain injury. All had made excellent recoveries and had returned to pre-injury vocational and social status. However, Bleiberg and colleagues reported, "Inconsistent performance was observed even in those subjects with TBI whose initial performance was equal to or better than that of control subjects."

Once a measure's reliability has been determined, limited between-device comparisons become possible for a clinical population. A better device is provisionally considered to be the one that gives the greater healthy control and clinical population reliability.

It should be stressed that this comparison is specific to each measure: the device that is best by a given set of criteria for one measure is not necessarily best for all measures. It also should be recognized that reliability is not the same as validity. Validity is evaluated in the next step.

Step 4: Determination of a Measure's Clinical Validity Should Be Quantified With Calculations of Effect Size, Sensitivity, Specificity, and Related Measures

The clinical validity of reliable measures is investigated in the next phase of the evaluation. We operationally define a measure's validity as its ability to discriminate between the clinical population of interest and an appropriately constructed comparison group. As in the case of reliability studies, it is essential to define the study population with precision and, as before, there is no single universally appropriate set of inclusion/exclusion criteria. Patient characterization protocols specifically directed to psychophysiological studies of mild traumatic brain injury are suggested by Rapp and colleagues.40

As in the case of reliability, there are several procedures for quantifying validity. They are complementary; more than one is needed to obtain an understanding of a measure's validity. An important element in the evaluation of validity was implicitly introduced in the discussion of the inability of the EEG's correlation dimension to distinguish between behavioral states, specifically that P_SAME ≠ P_ERROR. Similarly, a statistical separation of the clinical population and the control population, giving a low value of P_SAME, does not ensure diagnostic efficacy. This requires a low value of P_ERROR. A commonly used measure of separation is the effect size.41 A number of variant definitions of effect size have been used. An important variant, Hedges' g,42 is the special case of the between-group Mahalanobis distance computed with a single variable. Effect size (Mahalanobis distance) alone is not an adequate measure of between-group separation. P_ERROR is a function of both the Mahalanobis distance and the size of the sample population. A high value of effect size can still result in an unacceptable P_ERROR in an underpowered study.43

Sensitivity and specificity are also important measures of diagnostic validity.36 Sensitivity is defined as the probability that a true case (a true positive) will be identified as a case. Specificity is the probability that a true normal (a true negative) will be identified as normal. It is seen that specificity and sensitivity are interrelated. For example, suppose that continuous measure M is used as a diagnostic measure where a participant is classified as a case if M exceeds threshold M_T. If M_T is set to zero, all participants are classified as cases, giving perfect sensitivity, but in this extreme limit all normal participants also are classified as cases. In practical applications, the value of M_T will be determined by the clinical implications of misdiagnosis. It could be that a false-positive diagnosis would expose the participant to a potentially dangerous and, in this case, valueless procedure. Conversely, a false-negative diagnosis might prevent access to a benign life-saving intervention. Sensitivity and specificity are thus seen to be coupled functions of threshold M_T.
This relationship can be quantified by the area under the receiver operating characteristic curve (AUC).44 The area under the curve takes the value of 1 for a perfect test and 0.5 for a test that gives a random diagnosis. Hand45,46 has addressed the limitations of the area under the curve: "However, the AUC also has a much more serious deficiency and one which appears not to have been previously recognized. That is that it is fundamentally incoherent in terms of misclassification costs: the AUC uses different misclassification distributions for different classifiers."45 Hand has introduced an alternative measure, the H measure, which addresses this issue.
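The validity statistics named in this step can all be computed from the same two samples of a measure. The sketch below uses simulated values of a hypothetical measure M for a clinical group and a comparison group, one assumed threshold M_T, and the probabilistic definition of the AUC; it is an illustration, not an analysis of any published data set, and the H measure is not computed here.

```python
# Sketch: effect size (Hedges' g), sensitivity and specificity at a
# threshold M_T, and the area under the ROC curve for a simulated measure M.
import numpy as np

rng = np.random.default_rng(4)
cases = rng.normal(1.0, 1.0, size=60)      # clinical group (true positives)
controls = rng.normal(0.0, 1.0, size=60)   # comparison group (true negatives)
n1, n2 = cases.size, controls.size

# Hedges' g: standardized mean difference with a small-sample bias correction.
pooled_sd = np.sqrt(((n1 - 1) * cases.var(ddof=1) +
                     (n2 - 1) * controls.var(ddof=1)) / (n1 + n2 - 2))
d = (cases.mean() - controls.mean()) / pooled_sd
g = d * (1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0))

# Sensitivity and specificity at one choice of threshold M_T.
m_t = 0.5
sensitivity = np.mean(cases > m_t)       # true cases classified as cases
specificity = np.mean(controls <= m_t)   # true normals classified as normal

# AUC via its probabilistic definition: the probability that a randomly
# chosen case scores higher than a randomly chosen control (ties count 1/2).
diff = cases[:, None] - controls[None, :]
auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (n1 * n2)

print(f"Hedges' g = {g:.2f}  sensitivity = {sensitivity:.2f}  "
      f"specificity = {specificity:.2f}  AUC = {auc:.2f}")
```

Sweeping M_T across the observed range of M traces out the full ROC curve and makes the coupled movement of sensitivity and specificity described above directly visible.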

Issues of sensitivity and specificity are particularly important when evaluating psychophysiological measures of neuropsychiatric disorders. As has been argued at length elsewhere,43 these measures can be nonspecific. For example, similar abnormalities in CNS synchronization are seen in at least fifteen distinct neuropsychiatric disorders.4 This suggests that the greatest utility of these measures may lie in their longitudinal application as monitors of treatment response.

Numerical values of AUC, H, P_ERROR, sensitivity, and specificity can be incorporated into the decision matrix. It must be recognized that this evaluation is specific to the clinical populations under investigation and must be repeated if different populations are to be examined. With the identification of quantitative measures of diagnostic efficacy in place, the next step in device comparison is now possible. For any given measure, the best device for that measure is the one that gives the greatest diagnostic success. Ideally, devices should be compared using the same participant on the same day. As a methodological note, care should be taken to randomly assign study participants to different, pre-determined orders of device presentation.

Step 5: Usability Can Be Considered After Reliable and Valid Measures Have Been Identified

Once reliable and valid measures have been identified, it is possible to ask the question, "What is the best device for obtaining this measure or group of measures?" At this point, specific comparative device testing and selection activities become appropriate. Additional considerations of usability (weight, size, simplicity of use, cost effectiveness, and ruggedness) can now be introduced into the analysis. Some usability data may have been collected during earlier testing, but any further needed data should be collected at this time. These considerations must be subordinate to considerations of reliability and validity. A rugged instrument that produces unreliable measurements is of limited clinical value.

Summary

This analysis began with the operational question, "What neuroassessment device should be purchased?" An answer requires systematic preparatory investigations. A five-step sequential process is recommended as follows:

1. Review the relevant clinical literature to identify candidate measures.
2. Establish optimal computational procedures for estimating candidate measures through systematic simulation studies.
3. Assess the test–retest reliability of candidate measures in a healthy control population and the clinical population of interest.
4. Quantitatively establish the clinical validity of any given measure, specific to each clinical population under consideration.
5. Conduct comparative device testing and device selection for devices using reliable and valid clinical measures.

The comparative testing of neuroassessment devices in a clinical population should be preceded by computational research and research with healthy comparison populations. Proceeding immediately to testing with clinical participants is not warranted.

Acknowledgments

PER and DOK would like to acknowledge support from the Uniformed Services University, the US Marine Corps Systems Command, and the Defense Medical Research and Development Program.

The opinions and assertions contained herein are the private opinions of the authors and are not to be construed as official or as reflecting the views of the United States Department of Defense.

Author Disclosure Statement

No competing financial interests exist.

References

1. Sörnmo, L. and Laguna, P. (2005). Bioelectric Signal Processing in Cardiac and Neurological Applications. Elsevier Academic Press: Burlington, MA.
2. Blinowska, K. and Zygierewicz, J. (2012). Practical Biomedical Signal Analysis Using Matlab. CRC Press: Boca Raton, FL.
3. Kropotov, J. (2009). Quantitative EEG, Event Related Potentials and Neurotherapy. Elsevier: Amsterdam.
4. Rapp, P.E., Keyser, D.O., Albano, A.M., Hernandez, R., Gibson, D., Zambon, R., Hairston, W.D., Hughes, J.D., Krystal, A., and Nichols, A. (2014). Traumatic brain injury detection using electrophysiological methods. Front. Human Neurosci., submitted.
5. Bressler, S.L. (2003). Cortical coordination dynamics and the disorganization syndrome in schizophrenia. Neuropsychopharmacology 28, 535–539.
6. Rockstroh, B.S., Wienbruch, C., Ray, W.J., and Elbert, T. (2007). Abnormal oscillatory brain dynamics in schizophrenia: a sign of deviant communication in a neural network? BMC Psychiatry 7, 44–53.
7. Spencer, K.M., Niznikiewicz, M.A., Shenton, M.E., and McCarley, R.W. (2008). Sensory-evoked gamma oscillations in chronic schizophrenia. Biol. Psychiatry 63, 744–747.
8. Spencer, K.M., Salisbury, D.F., Shenton, M.E., and McCarley, R.W. (2008). γ-band auditory steady-state responses are impaired in first episode psychosis. Biol. Psychiatry 64, 369–375.
9. Task Force of the European Society of Cardiology and the North American Society of Pacing and Electrophysiology. (1996). Heart rate variability: standards of measurement, physiological interpretation and clinical use. Circulation 93, 1043–1065.
10. Marks, R.J. (1991). Introduction to Shannon Sampling and Interpolation Theory. Springer Verlag: New York.
11. Albano, A.M., Muench, J., Schwartz, C., Mees, A.I., and Rapp, P.E. (1988). Singular-value decomposition and the Grassberger-Procaccia algorithm. Phys. Rev. A 38, 3017–3026.
12. Fraser, A.M. and Swinney, H.L. (1986). Independent coordinates for strange attractors from mutual information. Phys. Rev. A 33, 1134–1140.
13. Mees, A.I., Rapp, P.E., and Jennings, L.S. (1987). Singular value decomposition and embedding dimension. Phys. Rev. A 36, 340–346.
14. Polich, J. and Herbst, K.L. (2000). P300 as a clinical assay: rationale, evaluation and findings. Int. J. Psychophysiol. 38, 3–19.
15. Rapp, P.E., Albano, A.M., Schmah, T.I., and Farwell, L.A. (1993). Filtered noise can mimic low dimensional chaotic attractors. Phys. Rev. E 47, 2289–2297.
16. Pritchard, W.S. and Duke, D.W. (1992). Measuring chaos in the brain: a tutorial review of nonlinear dynamical analysis. Int. J. Neurosci. 67, 31–80.
17. Destexhe, A., Sepulchre, J.A., and Babloyantz, A. (1988). A comparative study of the experimental quantification of deterministic chaos. Phys. Lett. A 132, 101–106.
18. Theiler, J. and Rapp, P.E. (1996). Re-examination of evidence for low-dimensional nonlinear structure in the human electroencephalogram. Electroencephalogr. Clin. Neurophysiol. 98, 213–222.
19. Rapp, P.E., Watanabe, T.A.A., Faure, P., and Cellucci, C.J. (2002). Nonlinear signal classification. International Journal of Bifurcation and Chaos 12, 1273–1293.
20. Watanabe, T.A.A., Cellucci, C.J., Kohegyi, E., Bashore, T.R., Josiassen, R.C., Greenbaun, N.N., and Rapp, P.E. (2003). The algorithmic complexity of multichannel EEGs is sensitive to changes in behavior. Psychophysiology 40, 77–97.
21. Bonita, J.D., Ambolode, L.C.C., Rosenberg, B.M., Cellucci, C.J., Watanabe, T.A.A., Rapp, P.E., and Albano, A.M. (2014). Time domain measures of CNS functional connectivity: a comparison of linear, nonparametric and nonlinear measures. Cogn. Neurodyn. 8, 1–15.
22. Cellucci, C.J., Albano, A.M., and Rapp, P.E. (2005). Statistical validation of mutual information calculations: comparisons of alternative numerical algorithms. Phys. Rev. E 71 (6 Pt 2), 066208.
23. Khan, S., Bandyopadhyay, S., Ganguly, A.R., Saigal, S., Erickson, D.J., Protopopescu, V., and Ostrouchov, G. (2007). Relative performance of mutual information estimation methods for quantifying the dependence among short noisy data. Phys. Rev. E 76, 026209.
24. Golub, G.H. and Reinsch, C. (1970). Singular value decomposition and least squares solutions. Numer. Math. 14, 403–420.
25. Theiler, J., Eubank, S., Longtin, A., Galdrikian, B., and Farmer, J.D. (1992). Testing for nonlinearity in time series: the method of surrogate data. Physica D 58, 77–94.
26. Schreiber, T. and Schmitz, A. (1996). Improved surrogate data for nonlinearity tests. Phys. Rev. Lett. 77, 635–638.
27. Prichard, D. and Theiler, J. (1994). Generating surrogate data for time series with several simultaneously measured variables. Phys. Rev. Lett. 73, 951–954.
28. Rapp, P.E., Cellucci, C.J., Watanabe, T.A.A., Albano, A.M., and Schmah, T.I. (2001). Surrogate data pathologies and the false-positive rejection of the null hypothesis. International Journal of Bifurcation and Chaos 11, 983–997.
29. Dunn, G. (1989). Design and Analysis of Reliability Studies: The Statistical Evaluation of Measurement Errors. Hodder Arnold: London.
30. Müller, R. and Büttner, P. (1994). A critical discussion of intraclass correlation coefficients. Statistics in Medicine 13, 2465–2476.
31. Krebs, D.E. (1986). Declare your ICC type. Physical Therapy 66, 1431.
32. Stratford, P.W. (1989). Confidence limits for your ICC. Phys. Ther. 69, 237–238.
33. Zou, G.Y. (2012). Sample size formulas for estimating intraclass correlation coefficients with precision and assurance. Statistics Med. 31, 3972–3981.
34. ICH Expert Working Group. (2000). Choice of control group and related issues in clinical trials E10. International Conference on Harmonization of Technical Requirements for Registration of Pharmaceuticals for Human Use, San Diego, CA.
35. Au, D.H., Castro, M., and Krishnan, J.A. (2007). Selection of controls in clinical trials. Proc. Am. Thorac. Soc. 4, 567–569.
36. Portney, L.G. and Watkins, M.P. (2009). Foundations of Clinical Research: Applications to Practice, 3rd ed. Pearson Education: Upper Saddle River, NJ, pp. 594–595.
37. Hughlings-Jackson, J. (1882). On some implications of dissolution of the nervous system. In: Selected Writings of John Hughlings Jackson, Vol. 2. J.J. Taylor (ed). Hodder and Stoughton: New York, pp. 29–45.
38. Head, H. (1926). Aphasia and Kindred Disorders of Speech. Cambridge University Press: Cambridge, UK.
39. Bleiberg, J., Garmoe, W.S., Halpern, E.L., Reeves, D.L., and Nadler, J.D. (1997). Consistency of within-day and across-day performance after mild brain injury. Neuropsychiatry Neuropsychol. Behav. Neurol. 10, 247–253.
40. Rapp, P.E., Rosenberg, B.M., Keyser, D.O., Nathan, D., Toruno, K.M., Cellucci, C.J., Albano, A.M., Wylie, S.A., Gibson, D., Gilpin, A.M.K., and Bashore, T.R. (2013). Patient characterization protocols for psychophysiological studies of traumatic brain injury and post-TBI psychiatric disorders. Front. Neurol. 4, 91.
41. Ellis, P.D. (2010). The Essential Guide to Effect Sizes: An Introduction to Statistical Power, Meta-Analysis and the Interpretation of Research Results. Cambridge University Press: Cambridge, UK.
42. Hedges, L.V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. J. Educational Statistics 6, 106–128.
43. Rapp, P.E., Cellucci, C.J., Keyser, D.O., Gilpin, A.M.K., and Darmon, D.M. (2013b). Statistical issues in TBI clinical studies. Front. Neurol. 4, 177.
44. Krzanowski, W. and Hand, D.J. (2009). ROC Curves for Continuous Data. Chapman Hall/CRC: Boca Raton, FL.
45. Hand, D.J. (2009). Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning 77, 103–123.
46. Hand, D.J. (2009). Evaluating diagnostic tests: the area under the ROC curve and the balance of errors. Statistics Med. 29, 1502–1510.

Address correspondence to:
David O. Keyser, PhD
Department of Military and Emergency Medicine
Uniformed Services University of the Health Sciences
4301 Jones Bridge Road
Bethesda, MD 20814
E-mail: [email protected]
