627488

research-article2016

JADXXX10.1177/1087054715627488Journal of Attention DisordersKaralunas et al.

Article

Test–Retest Reliability and Measurement Invariance of Executive Function Tasks in Young Children With and Without ADHD

Journal of Attention Disorders 1­–14 © The Author(s) 2016 Reprints and permissions: sagepub.com/journalsPermissions.nav DOI: 10.1177/1087054715627488 jad.sagepub.com

Sarah L. Karalunas1, Karen L. Bierman2, and Cynthia L. Huang-Pollock2

Abstract Objective: Measurement reliability is assumed when executive function (EF) tasks are used to compare groups or to examine relationships between cognition and etiologic and maintaining factors of psychiatric disorders. However, the test–retest reliabilities of many commonly used EF tasks have rarely been examined in young children. Furthermore, measurement invariance between typically developing and psychiatric populations has not been examined. Method: Test–retest reliability of a battery of commonly used EF tasks was assessed in a group of children between the ages of 5 and 6 years old with (n = 63) and without (n = 44) ADHD. Results: Few individual tasks achieved adequate reliability. However, confirmatory factor analysis (CFA) models identified two factors, working memory and inhibition, with test– retest correlations approaching 1.0. Multiple indicators multiple causes (MIMIC) models confirmed configural measurement invariance between the groups. Conclusion: Problems created by poor reliability, including reduced power to detect group differences, index change over time, or to identify relationships with other measures, may be mitigated using latent variable approaches. (J. of Att. Dis. XXXX; XX(X) XX-XX) Keywords ADD/ADHD, reliability, executive function, confirmatory factor analysis, children There is increasing recognition that behaviorally based diagnostic categories, such as those used in Diagnostic and Statistical Manual of Mental Disorders (5th ed.; DSM-5; American Psychiatric Association, 2013), result in the creation of groups that are phenotypically and mechanistically heterogeneous (Insel et al., 2010; Sanislow et al., 2010). To resolve the issues created by diagnostic heterogeneity, researchers have increasingly turned to endophenotype measures and biomarkers (Kendler & Neale, 2010; Lenzenweger, 2013; Nolen-Hoeksema & Watkins, 2011). Neurocognitive processes, such as working memory, inhibition, and other executive functions (EF), have been specifically highlighted by the recent National Institutes of Mental Health (NIMH) Research Domain Criteria Initiative (RDoC) as potential endophenotypes or biomarkers that may help elucidate mechanisms of psychiatric disorders, aid in treatment matching, and facilitate development of novel treatments (Insel et al., 2010; Nolen-Hoeksema & Watkins, 2011; Sanislow et al., 2010). However, the psychometric properties of these measures may limit their value, especially when used with young children. ADHD is emblematic of the problems created by heterogeneity within DSM diagnostic categories, and measures of EF, speed/variability of response, and response to reward contingencies feature prominently in theoretical models of

ADHD (Barkley, 1997; Castellanos, Sonuga-Barke, Milham, & Tannock, 2006; Diamond, 2005). However, a central concern and substantial obstacle to using these measures as endophenotypes or biomarkers is that their test–retest reliability is not always known (Kuntsi, Neale, Chen, Faraone, & Asherson, 2006). When test–retest reliability is low, it not only attenuates between-group differences but also reduces statistical power to detect associations with genes, disease symptoms, or other outcome measures (Green et al., 2004; Kendler & Neale, 2010). Thus, better characterization of the reliability of cognitive tasks is essential. Studies that have directly assessed test–retest reliability of EF tasks in children have focused primarily on middle childhood and adolescence. In this age range, there is at least some evidence of adequate reliability for the most commonly used neurocognitive measures, including working memory span tasks, reaction time measures, and computerized measures of inhibitory control (Archibald & 1

Oregon Health & Science University, Portland, OR, USA The Pennsylvania State University, University Park, PA, USA

2

Corresponding Author: Sarah L. Karalunas, Oregon Health & Science University, 3181 SW Sam Jackson Park Rd., Portland, OR 97239-9979, USA. Email: [email protected]

Downloaded from jad.sagepub.com at Bobst Library, New York University on February 11, 2016

2

Journal of Attention Disorders 

Kerns, 1999; Bishop, Aamodt-Leeper, Creswell, McGurk, & Skuse, 2001; Kindlon, Mezzacappa, & Earls, 1995; Kuntsi, Andreou, Ma, Borger, & Van der Meere, 2005; Kuntsi, Stevenson, Oosterlaan, & Sonuga-Barke, 2001; Soreni, Crosbie, Ickowicz, & Schachar, 2009; Thorell, 2007). However, reliability estimates are population specific, and tasks that demonstrate adequate or better reliability in middle childhood and adolescence may not be adequate for younger children. In particular, the rapid pace of maturation and development in early childhood may result in lower reliability estimates if the rate of development is not consistent across individuals. Similarly, individual differences in learning effects (which occur when the initial task exposure results in improved performance on later administrations) may also lower test–retest reliability. Although learning and maturational effects are both reflected in test–retest data as improvements in performance between testing sessions, shorter test–retest intervals (several weeks) are primarily influenced by learning effects whereas longer intervals (several months) also capture the effects of maturation. Neither maturation nor learning effects prevents a measure from being reliable, as long as those effects are consistent across the entire sample (Rousson, Gasser, & Burkhardt, 2002); however, how each of these processes affect task reliability remains unclear because few studies have assessed the reliability of neurocognitive tasks in early childhood. Gnys and Willis (1991) found good test–retest reliability for the Tower of Hanoi and verbal fluency tests (rxx < .70) in a sample of 96 typically developing preschool and kindergarten children. Beck, Schaefer, Pang, and Carlson (2011) similarly found good test–retest reliabilities (intraclass correlation coefficients < .69) for a series of tasks measuring inhibitory control for typically developing children ranging from 2 to 5 years old. However, both studies used same-day test–retest intervals, potentially leading to inflated reliabilities, and both studies noted the need to examine longer retest intervals. Thorell & Wahlstedt (2006) found adequate or better test–retest reliabilities for inhibitory control and working memory tasks in a group of 4- to 5-year-old children over a 2-week retest interval; however, the sample was small (n = 22). One of the largest studies of executive task reliability in young children to date comes from Willoughby and Blair (2011). Here, authors found moderate reliabilities (rxx = .52-.66) over a 2- to 4-week time period for a test battery of inhibitory control, working memory, and attention shifting tasks in an epidemiological sample of more than 100 preschool-aged children. Thus, although there is evidence of moderate reliability for at least some tasks in early childhood, the number of studies examining these effects is small, and studies have used relatively short test–retest intervals, which emphasize learning effects, leaving the additional effects of maturation on task reliability unclear.

An additional issue is that traditional test–retest analyses rely on the correlations between individual tasks to assess reliability, which conflates true score and error variance and does not provide an adequate measure of the reliability of the underlying construct being measured. In contrast, latent variable models, such as confirmatory factor analysis (CFA), partition variance into common and specific (taskspecific and error) variance (Kline, 2013; Vandenberg & Lance, 2000). Thus, the test–retest reliability of the factor scores may better reflect the stability of the underlying EF abilities as compared with individual tests. Consistent with this hypothesis, Willoughby and Blair (2011) applied CFA in their sample of preschool children to demonstrate that reliability of the underlying EF factor approached unity even when individual task reliabilities did not. Although this suggests that latent variable approaches may be useful for improving psychometric properties of individual neurocognitive tasks, several questions remain. First, all the studies described focus on the measurement of EF factor structure and reliability in a single, typically developing population. If these methods are to be applied to studying psychiatric populations, it is necessary to establish that the tasks function similarly in each group. In other words, measurement invariance must be established (Muthén, 1989; Vandenberg & Lance, 2000; Woods, 2009). Measurement invariance refers to whether a measure, in this case tests of EF, is psychometrically equivalent in different groups. If they are not, then groups cannot be compared because the tests may be capturing fundamentally different processes in each of the groups, rendering comparisons meaningless. A critical first step in assessing measurement invariance is the demonstration of configural invariance, which requires that the number and pattern of factor loadings must be the same between groups (Muthén, 1989; Sass, 2011; Vandenberg & Lance, 2000). The question of configural invariance is particularly relevant for young children with and without ADHD. In typically developing preschool-aged children, a single factor is often adequate to capture EF abilities (but see Schoemaker et al., 2012; Wiebe, Espy, & Charak, 2008; Wiebe et al., 2011; Willoughby & Blair, 2011), but during middle childhood, EF becomes more differentiated and is better represented by multiple factors (Lee, Bull, & Ho, 2013; Miyake, Friedman, Emerson, Witzki, & Howerter, 2000; Shing, Lindenberger, Diamond, Li, & Davidson, 2010). It remains unclear at which age multiple as opposed to single factor models become appropriate and whether factors representing different executive abilities are equally reliable. In addition, neuroimaging studies have found that children with ADHD are characterized by protracted development of prefrontal areas of the brain supporting EF (Gilliam et al., 2011; Kofler et al., 2013; Lijffijt, Kenemans, Verbaten, & van Engeland, 2005; Mackie et al., 2007; Shaw et al., 2007), and the disorder is

Downloaded from jad.sagepub.com at Bobst Library, New York University on February 11, 2016

3

Karalunas et al. often conceptualized as a maturational lag, suggesting that the appropriate factor model may differ between children with and without ADHD of the same age. If this were the case, latent variable approaches would be inappropriate for between-group comparisons, and so, establishing invariance is particularly important to inform additional studies of EF. The current study examined the reliability of a battery of common neurocognitive tasks in a sample of kindergartenage children with and without ADHD with assessments in the fall and spring of the kindergarten year. The study adds to a small literature examining individual task reliability in this age range and expands the range of test–retest intervals that have been examined. Furthermore, we expand on prior research by directly comparing reliability in typically developing and ADHD populations on individual tasks, as well as by using CFA models to test for configural measurement invariance between children with and without ADHD and to establish the stability of latent EF factors.

Method All data were collected as part of a larger study examining the impact of a social–emotional intervention program on self-regulatory skills in young childhood. For the larger study, children participated in either (a) a 30-session (16-18 week) small group social skills training intervention condition directly aimed at building social–emotional competency, self-regulatory skills, and EF, or (b) a control condition with the same number of sessions that focused on tutoring in emergent literacy skills (e.g., letter identification, letter-sound correspondence). The control condition was not expected to affect EF, self-regulatory skills, or social–emotional and behavioral outcomes. Because children in the intervention condition were intentionally provided instruction meant to improve executive processes, they were excluded from the current study.

Participants One hundred and seven children aged 5 to 6 years were recruited in two successive cohorts from 48 kindergarten classrooms in six Pennsylvania school districts that included both urban and rural areas. Brochures describing the study were distributed to parents of all 5- to 6-year-old children in the participating classrooms. Interested parents provided their contact information, as well as informed consent for child participation in the study. All procedures received approval from the university institutional review board. Initially, teachers completed two behavioral rating scales: Conners’ ADHD Rating Scale, Short Form– Revised (CTRS-R; Conners, 2003) and the DuPaul ADHD–Rating Scale (ADHD-RS; DuPaul, Power, Anastopoulos, & Reid, 1998). Children with elevated

teacher ratings as well as children without teacher-rated ADHD symptoms were identified as possible study participants. Their parents then completed a structured diagnostic interview (the Diagnostic Interview Schedule for Children–Version IV [DISC-IV]; Shaffer, Fisher, Lucas, Dulcan, & Schwab-Stone, 2000) and parent-report rating forms (the Conners’ Parent Rating Scale, Long Form– Revised [CPRS-R]; Conners, 2003) and the Behavioral Assessment Scale for Preschool Children–Second Edition (BASC-2; Reynolds & Kamphaus, 2004). Children were considered to have ADHD (n = 48) if (a) they met full clinical criteria for a diagnosis of ADHD on the DISC-IV, including criteria for impairment, chronicity, and cross-situational severity; and (b) both the parent- and teacher-reported age-inappropriate levels of inattention or hyperactivity defined as at least one t score ≥ 60 (84th percentile) on the Cognitive Problems/Inattention, Hyperactivity, ADHD Index, or Diagnostic and Statistical Manual of Mental Disorders (4th ed.; DSM-IV; American Psychiatric Association, 1994) Total Index of the Conners’ or the Hyperactivity or Attention Problems Indices of the BASC-2, or ≥ three inattentive symptoms or ≥ three hyperactive/impulsive symptoms or ≥ four total symptoms endorsed as often or very often on the ADHD-RS. Children were considered to have emerging ADHD (n = 15) if they did not meet full diagnostic criteria on the DISC-IV but did have elevated levels of inattention or hyperactivity based on at least one parentreport measure and at least one teacher-report measure (Criterion b above). Finally, children were considered nonADHD controls (n = 44) if (a) they did not meet diagnostic criteria for ADHD on the DISC-IV, (b) teacher ratings of behavior on all relevant indices of the Conners’ and BASC-2 T-scores ≤ 59, and (c) the total number of symptoms endorsed following the “or” algorithm yielded ≤2 inattentive symptoms, ≤2 hyperactive/impulsive symptoms, and ≤3 total symptoms. For all children, dimensional scores of inattention and hyperactivity symptom counts were determined following DSM-IV field trials (Lahey et al., 1994) using an “or” algorithm between parent report on the DISC-IV and teacher report on the ADHD-RS (where a rating of often or almost always would indicate that a symptom was present). Given that many symptoms, particularly inattention symptoms, have low endorsement at this age but are highly endorsed as children reach middle childhood (CurchackLichtin, Chacko, & Halperin, 2014) and that diagnostic stability is improved across early and middle childhood when sub-threshold symptoms are considered (Bauermeister et al., 2011), children with full and emerging ADHD were grouped together for between-group analyses. However, as noted in the “Results” section, primary results were all confirmed using a continuous ADHD symptom count in addition to the categorical diagnostic indicator to ensure that this grouping strategy did not account for results. Description of sample characteristics can be found in Table 1.

Downloaded from jad.sagepub.com at Bobst Library, New York University on February 11, 2016

4

Journal of Attention Disorders 

Table 1.  Sample Characteristics. Measure Gender (% male) Age (in months) Stanford–Binet Abbreviated IQ Conners’ Parent Cognitive Problems/Inattention Conners’ Parent Hyperactivity Conners’ Teacher Cognitive Problems/Inattention Conners’ Teacher Hyperactivity

Control (n = 44)

ADHD (n = 63)

F

Effect Size (η2)

55.3% 70.16 (3.55) 106.79 (9.78) 46.16 (4.55) 46.23 (5.36) 46.97 (3.34) 44.68 (3.05)

68.1% 69.45 (3.99) 97.45 (14.50) 59.68 (12.28) 62.33 (11.79) 64.27 (16.69) 64.80 (14.35)

χ2(1) = 1.03 0.81 12.46** 42.56*** 63.05*** 39.76*** 72.50***

  0.01 0.11 0.29 0.38 0.28 0.41

**p < .01. ***p < .001.

Exclusionary criteria. Exclusionary criteria for the larger study included (a) parent report of a sensorimotor disability, frank neurological disorder, or psychosis; (b) estimated Full Scale Intelligence Quotient (FSIQ) < 70 as measured by a two-subtest short form (Vocabulary and Matrices) of the Stanford–Binet Fifth Edition; (c) low levels of English proficiency that preclude children from completing the assessment battery; or (d) if they were in a temporary custody situation with uncertain outcome. Children taking psychotropic medications were not excluded from the study. One child was prescribed Focalin and asked to discontinue medication 24 hr prior to each testing session. A second child was prescribed Strattera, which could not be safely discontinued, and participated while taking their regular dose at both the test and retest visits.

Cognitive Testing Children were assessed at two time points at the school, in a quiet room outside of the classroom setting and away from peers. The time between assessments ranged between 15 and 26 weeks (M = 21.13, SD = 1.69). Number of weeks between assessments did not differ between children with and without ADHD (M weeks for non-ADHD controls = 20.9, M weeks for children with ADHD = 21.3, p = .331) and was not correlated with child age (r = −.08, p = .43) or IQ (r = −.13, p = .18). All children were tested individually by trained examiners. Inhibitory control.  Inhibitory control was assessed with five tasks: the Walk-a-Line Slowly task (Kochanska, Murray, Jacques, Koenig, & Vandegeest, 1996), the Peg Tapping task (Diamond & Taylor, 1996), the Head-Toes-Knees-Shoulders task (HTKS; Ponitz et al., 2008), a Choice Delay task (Sonuga-Barke, Taylor, Sembi, & Smith, 1992), and a Go/ No-Go (GNG) task (Berlin & Bohlin, 2002). For the Walk-a-Line Slowly task, children were asked to walk along a 6-foot piece of string taped to the floor as the examiner timed them. Children were then asked to repeat the task twice, walking slower, and then walking even slower—as slowly as they could. The total score represented the average percentage by which a child reduced his

or her speed on successive trials. This task has demonstrated adequate interrater reliability with preschool children (intraclass correlation = .98), indicating that raters are able to accurately determine the timing difference between trials (Smith-Donald, Raver, & Hayes, 2007). In the Peg Tapping task, children were asked to tap their peg twice when the interviewer tapped once, and vice versa. After a short set of practice items, their final score was the number of correct trials out of 16 total trials. The HTKS task is a more complex version of the Headto-Toes task. In this task, children habituated to several oral commands (e.g., “Touch your head” and “Touch your toes”). They were then asked to play “a silly game” in which, in response to the command, “Touch your toes,” they were to touch their head. Then, they played another silly game in which they were to touch their knees when asked to touch their shoulders and vice versa. Children earned 2 points for a correct response, 0 points for an incorrect response, and 1 point if they made any motion to the incorrect response but then self-corrected. The outcome variable used was the total number of correct points, with a maximum of 40 points possible. In the Choice Delay task (Sonuga-Barke et al., 1992), children chose between two rewards: (a) a 1-point reward available after 2 s or (b) a 2-point reward available after 30 s. Each trial began immediately after the reward was received from the preceding trial. Children had 20 trials in which they were instructed to earn as many points as possible. The variable used in analyses was the percentage of choices for the 2-point, delayed reward. In the GNG task, children viewed four stimuli (blue triangle, blue square, red triangle, and red square) and were asked to make a key press every time they saw a blue shape (target, 75% of trials) but to withhold a response when they viewed a red shape (25% of trials). Each stimulus appeared for 1,000 ms; children were allowed a total of 2,000 ms to respond. Inhibitory control outcome measures included percent correct hits and commission errors. Reaction time on correct hits and standard deviation of reaction time for correct hits were also recorded as measures of processing speed, which is a separable EF factor, at least in the middlechildhood age range (Rose, Feldman, & Jankowski, 2011).

Downloaded from jad.sagepub.com at Bobst Library, New York University on February 11, 2016

5

Karalunas et al. Working memory. Both verbal and visuospatial working memory were assessed. Verbal working memory was assessed using the Backward Word Span task (Carlson, 2005; Davis & Pratt, 1995). For this task, children listened to a list of words read out loud and then were asked to repeat the words in backward order. The list started with one word and increased by one additional word with successive trials. Children received a score equal to the highest number of words they were able to repeat correctly in reverse order. Visuospatial working memory was assessed using the Finger Windows task from the Wide Range Assessment of Memory and Learning–Second Edition (WRAML-2; Sheslow & Adams, 2003). In the forward condition of this task, the child watched as the examiner put a pencil in a series of holes on a card. The child then recreated this series. The WRAML-2 Finger Windows task includes only a forward condition; however, a backward condition was also created for this study in which children needed to point to the holes the examiner identified in reverse order. In both conditions, children received 1 point for every correct sequence recalled. The Dimensional Change Card Sort (DCCS; Frye, Zelazo, & Palfai, 1995) was also included as a measure of working memory, consistent with prior literature indicating that performance in early childhood is related to ability to use higher order if-then rules (Zelazo, 2004; Zelazo & Frye, 1998) and use working memory to overcome response conflict (Munakata, 2001). Children were shown picture cards that varied along the dimensions of color and shape (e.g., red and blue, rabbits and boats). After learning to sort the cards according color, children were then asked to sort the cards according to shape instead. The score represented the number of trials (out of 6) in which the child correctly shifted sets after the sorting criteria changed.

Data Analysis Two children initially recruited into the study did not participate in either the pre- or post-test due to absence from school on the day of testing and were excluded from all analysis. Seven additional children completed only part of the pre- or post-test battery and 13 children’s files were lost to file corruption on the computerized GNG task at either the pre- or post-test. Finally, two scores were identified as outliers (>5 SDs from the sample mean): one on the GNG Reaction Time measure and one on the Walk-a-Line Slowly task. The outlying scores were treated as missing data for these tasks. In assessing reliability of individual tasks, children were only included if they completed the task at both time points (pre- and post-test). The final Ns for each task are reported in Table 2. In assessing factor structure and factor reliability, all available children were included in analyses using the full information maximum likelihood algorithms in Mplus v7.2 (Muthén & Muthén, 1998-2012) to handle missing data.

Learning and maturation effects.  Learning/maturation effects were assessed with a multivariate, repeated-measures ANOVA including the full battery of EF tasks. Time (pre-/ post-test) was the within-subjects factor and ADHD diagnosis was the between-subjects factor. A main effect of time indicates the presence of learning/maturation effects and a Time × ADHD interaction indicates differences in these effects based on ADHD status. Reliability of individual tasks.  Test–retest reliability for each group was calculated as the age-partial inter-class productmoment correlation coefficients (Pearson correlation) between the two test administrations using SPSS (Rousson et al., 2002). There are no firm criteria for what constitutes “good” reliability. Prior research examining reliability of neurocognitive tasks has adopted the criteria that reliabilities between .50 and .70 are “adequate” and those above .70 are “good” (Kindlon et al., 1995; Kuntsi, Stevenson, et al., 2001). We adopt the same criteria here. The age effects were included to account for differences in age at the initial assessment time period (Kail, 2007; Williams, Ponesse, Schachar, Logan, & Tannock, 1999). Fisher’s r-to-z tests were used to compare the correlation coefficients and determine whether these differed significantly between the diagnostic groups. Factor structure and reliability of factors.  Confirmatory factor analyses (CFAs) were conducted in Mplus v.7.2 (Muthén & Muthén, 1998-2012). First, a series of CFA models were tested in the two diagnostic groups separately. Then, a multiple indicators multiple causes (MIMIC) model (Muthén, 1989) in which ADHD diagnostic status was included as a covariate was used to test for measurement invariance (Muthén, 1989; Woods, 2009). MIMIC models can be used to test for configural invariance by including grouping variables as covariates in the factor model, rather than testing separate models for each group as is required for multiplegroup CFA (Kim, Yoon, & Lee, 2012; Muthén, 1989). MIMIC models assume equivalent factor loading across groups, rather than testing this directly (Muthén, 1989); thus, they cannot be used for testing metric and other types of invariance. However, MIMIC models are preferred for testing measurement invariance in small samples (Muthén, 1989; Woods, 2009), and configural invariance must be established for subsequent tests of metric and other types of measurement invariance to be meaningful (Vandenberg & Lance, 2000). Furthermore, using an MIMIC model approach, both categorical and continuous measures of ADHD could be used in tests of measurement invariance, which is not possible with a multiple-groups approach. Thus, using the MIMIC model approach to establish configural invariance is a first and critical step in determining the utility of latent variable models for comparing between typically developing and psychiatric populations. MIMIC models were estimated using ADHD diagnostic status as a

Downloaded from jad.sagepub.com at Bobst Library, New York University on February 11, 2016

6

Journal of Attention Disorders 

Table 2.  Learning Effects. Measure Delay Aversion (% choice for large reward) Backward Word Span (total points) DCCS (% correct post-switch) Finger Windows   Forward (total points)   Backward (total points) Go/No-Go  Hits  Commissions   RT (ms)   SDRT (ms) Peg Tapping (total points) Walk-a-Line (total points) Head-Toes-Knees-Shoulder (total points)

n

Time 1

Time 2

Learning effect (t)

103 97 104

0.31 (0.19) 2.25 (0.72) 0.82 (0.35)

0.31 (0.22) 2.44 (0.65) 0.94 (0.20)

0.32 2.15* 3.91***

104 104

6.73 (3.29) 3.11 (2.35)

7.84 (3.24) 3.96 (2.42)

3.17** 4.12***

91 91 90 90 104 103 103

53.27 (6.53) 3.76 (3.80) 510.21 (95.84) 374.97 (86.02) 14.04 (3.95) 0.85 (0.86) 27.19 (13.07)

53.76 (5.32) 3.97 (3.61) 472.03 (85.32) 351.37 (74.85) 14.45 (3.03) 1.30 (1.44) 32.30 (8.14)

0.87 0.65 3.86*** 2.83*** 1.50 2.83** 5.23***

Note. DCCS = Dimensional Change Card Sort; RT = reaction time; SDRT = standard deviation of reaction time. *p < .05. **p < .01. ***p < .001.

covariate. Each direct effect was estimated in a separate model with false discovery rate correction employed (Benjamini & Hochberg, 1995). Results were confirmed with total ADHD symptoms (continuous) as a covariate using the same procedures. As a final step in the analyses, preand post-test assessments were used in a single CFA model to determine the test–retest reliability (stability) of the identified factors across time. Power analysis. MIMIC models estimate a much smaller number of parameters than multiple-group CFA approaches by treating the grouping variable as a covariate, thus making them more appropriate in small samples. Monte Carlo simulation power analysis in Mplus indicated adequate power (>.80) for all models, including the MIMIC models. Thus, models were adequately powered to detect configural invariance where it existed.

Results Learning and Maturational Effects A multivariate repeated-measures ANOVA including the EF measures revealed a significant main effect of time (p < .001) and diagnosis (p < .001), but no significant Time × Diagnosis interaction effect (p = .205), indicating the presence of learning and maturational effects (i.e., improvements over time) but no difference in these effects based on diagnostic status. Follow-up univariate tests for the main effect of time indicated significant learning effects for eight of the 12 measures. In all cases, performance improved at the second administration of the test. See Table 2 for means and standard deviations at each time point and summary of significance tests for learning/maturation effects.

Reliability of Individual Measures Table 3 shows the correlation matrix for all EF tasks in the full sample, and Table 4 shows the age-partial test–retest correlations by group. In the typically developing group, three measures reached adequate or better levels of reliability: mean and standard deviation of RT from the GNG task and HTKS. In the ADHD sample, six tasks reached adequate or better reliability: DCCS, Finger Windows Backward, HTKS, as well as GNG Hit Accuracy and commission errors, and Peg Tapping. Test–retest correlations for Word Span, GNG Hit rate, and Peg Tapping were significantly higher in the ADHD than in the typically developing group.

CFA and Reliability of Latent Variables CFA.  A three-factor CFA with (a) Inhibitory Control (GNG Accuracy, GNG Commissions, Peg Tapping, Delay Aversion, Walk-a-Line Slowly, HTKS), (b) Working Memory (Backward Word Span, DCCS, Finger Windows Forward, Finger Windows Backward), and (c) Processing Speed (GNG RT, GNG SDRT) was fit in the full sample. The model did not converge. Problems with model convergence were a result of the linear dependency between the two indicators of the Processing Speed factor: RT and SDRT. This factor was also problematic in that CFA factors defined by only two indicators are not identified and so a minimum of three indicators is recommended to include a factor in CFA (Muthén & Muthén, 2009). Thus, in remaining analyses, we focus on a two-factor CFA model using only the Working Memory and Inhibition factors. The two-factor CFA model for the full sample showed only moderate fit to the data at Time 1, pre-test: Adjusted

Downloaded from jad.sagepub.com at Bobst Library, New York University on February 11, 2016

7

Downloaded from jad.sagepub.com at Bobst Library, New York University on February 11, 2016

9

10

.30** .35** .56*** .60*** .31** .21 .46*** .31** .34** .32 .60*** .40*** −.33** −.15 −.35** −.36* −.39** −.20 −.38** −.41** .11 .21* .18 .09

.08

.12 .13 .07 .02 −.01 .09

.11*** .14

.25* .31*** .44** .40** .20* .48* .39*** .30** .40*** .53** .70*** .63*** .35** .27** .52*** .66*** −.27* −.29* −.50*** −.62***

.12

−.01 .13 −.03 .20 −.18

.28**

.09

.08

.10

.06

−.42*** .74*** .34** .28** −.30* −.14 .38*** .38*** .42***−.19 −.19 .34** .52*** .60***−.21 .04 −.20 −.28* −.23* .57*** .11 −.33* −.32** −.28** .39** −.05 .09 .18 .29** −.12

−.14 .37** .42*** .46***−.13 −.09 .24* .19 .20* −.03 −.36** .61*** .40*** .44***−.26 −.48*** .66*** .31** .24* −.08 .56*** −.63*** −.21* −.19 .06

−.07

−.41*** — −.19 .33** — −.17 .27** .64*** — −.18 −.27 −.25* −.19 — −.17 −.62*** −.46** −.40** .98*** −.06 .13 .11 .14 −.01 .24*



12



13

14

15

16

17

−.60*** −.40** −.43** .79** .76** −.14

.10 .19 .16 .01 .07 .06

.17 .34* .33** .55*** .63*** .03 .30** .29** .32** .37** .26* .37** .29** .47*** .42*** −.11 −.22 −.16 −.33* −.34* −.15 −.28* −.23 −.40** −.48*** .02 .28* .02 .11 .09

−.39* .33** .08 — −.35* .14 .14 .21 — −.66*** .16 .13 .44** .45*** — −.54** .03 .20 .30* .35** .60*** — .36* −.10 −.15 −.26 −.35** −.58*** −.98***

.00

— .11

11

19

20

21

22

23

         



           

−.67*** —     −.33** .36** — −.35** .35** .47*** —   .14 −.22 −.29* −.23* —   .20 −.30* −.32*** −.27* .92*** — −.01 .07 .16 .28** −.30** −.33**



18

.41*** .28** .57*** .67*** .38*** .27** .53*** .36*** .27** .34** .51*** .32** −.28* .03 −.35** −.30 −.49** −.15 −.61*** −.84*** .12 .24* .14 .04

8

.09 −.07 .01 .11 .11 .09



7

         

6

— .21* — .50*** .40*** — .35*** .23* .53*** — −.20* −.24* −.35** −.47***

5

−.06 −.08 .00 .04 −.17

4



3



2

Note. DCCS = Dimensional Change Card Sort; HTKS = Head-Toes-Knees-Shoulders; GNG = Go/No-Go; FWF = Finger Windows Forward; FWB = Finger Windows Backward. RT = reaction time; SDRT = standard deviation of reaction time. *p < .05. **p < .01. ***p < .001.

  1. Delay Aversion   2.  Word Span  3.  DCCS  4.  HTKS   5.  GNG hits   6. GNG commissions   7.  Peg Tapping  8.  FWF  9.  FWB 10.  GNG RT 11.  GNG SDRT 12. Walk-a-Line Slowly 13. Delay Aversion 14.  Word Span 15. DCCS 16. HTKS 17.  GNG hits 18. GNG commissions 19.  Peg Tapping 20. FWF 21. FWB 22.  GNG RT 23.  GNG SDRT 24. Balance

1

Table 3.  Correlation Matrix.

8

Journal of Attention Disorders 

Table 4.  Individual Task Reliability Coefficients. Control Measure

n

Pearson r

Delay Aversion (% choice for large reward) Backward Word Span (total points) DCCS (% correct postswitch) Finger Windows   Forward (total points)   Backward (total points) Go/No-Go  Hits  Commissions   RT (ms)   SDRT (ms) Peg Tapping (total points) Walk-a-Line Head-Toes-Knees-Shoulder (total points)

44

.080

44

ADHD Age-partial r

Comparison

n

Pearson r

Age-partial r

Fisher’s r-to-z (p)

.083

59

.375

.377

.127

–.063

–.067

60

.460

.465

.005

44

.479

.486

60

.493

.501

.922

44 44

.303 .519

.324 .463

60 60

.284 .543

.289 .525

.850 .688

37 37 37 37 44 44 44

.130 .360 .625 .575 .093 .021 .683

.125 .361 .586 .540 .088 .067 .685

54 54 53 53 60 59 59

.596 .568 .350 .446 .727 .064 .590

.588 .559 .303 .364 .720 .064 .584

.013 .253 .106 .316 .05 for χ2 tests, all RMSEA < .03, all CFI > 0.98) and fit significantly better than the one-factor model for both groups at both time points. We next combined the two groups into a single MIMIC analysis in which the ADHD diagnostic indicator was regressed onto the factors to assess measurement invariance between the diagnostic groups. A significant direct effect of ADHD diagnosis on the factor indicators (i.e., on observed task performance) would indicate a lack of measurement invariance. At Time 1, the model fit was good, Adjusted BIC = 2,048.1; χ2(24) = 19.8, p = .711; RMSEA = 0.0; CFI = 1.00. ADHD diagnosis was significantly related to the factors, indicating that the well-documented ADHD-related deficits in working memory and inhibitory control are captured by the latent variables; however, there were no significant direct effects of ADHD diagnosis on any of the factor indicators (all p > .05). The lack of direct effects indicates measurement invariance for this time point. All results were replicated for Time 2, confirming measurement invariance across the diagnostic groups at both time points. All results were also confirmed

Downloaded from jad.sagepub.com at Bobst Library, New York University on February 11, 2016

9

Karalunas et al.

Figure 1.  Figure 1 shows the two-factor CFA model for the full sample at Time 1, including the factor loading for each indicator and the correlation between factors.

Note. CFA = confirmatory factor analysis; GNG Hits = Go/No-Go Correct Go Trials; GNG Com = Go/No-Go Commission Errors; HTKS = HeadToes-Knees-Shoulder; DCCS = Dimensional Change Card Sort; Word Span = Backward Word Span; FWF = Finger Windows Forward; FWB = Finger Windows Backward.

using total ADHD symptoms as a continuous covariate instead of the categorical diagnostic indicator. Stability of factor scores.  Finally, a two-time point, two-factor CFA model was fit to establish the stability of the factor scores across time. Residual covariances between Time 1 and Time 2 tasks scores were allowed to correlate. The model was a good fit for the data, Adjusted BIC = 3,927.0; χ2(90) = 93.0, p = .393; RMSEA = 0.018; CFI = 0.99. The full model is shown in Figure 2. Consistent with prior studies, the stability of the latent variables approached unity. The correlation (stability coefficient) for the Inhibitory Control factor was .98 and for the Working Memory factor was .99.

Discussion Adequate test–retest reliability for neurocognitive tasks is not only critical to the accurate estimation of betweengroup effect sizes (Huang-Pollock, Karalunas, Tam, & Moore, 2012) but also to the ability to detect associations between these cognitive processes, putative genetic mechanisms, symptom domains, and other outcome measures (Green et al., 2004; Kendler & Neale, 2010; Kuntsi et al., 2005; Kuntsi et al., 2006). In the current study, individual

measures used to assess neurocognitive functioning in young children showed a wide range of reliabilities, with only a handful achieving adequate levels. Despite moderate or worse reliabilities for many individual tasks, latent variable modeling indicated test–retest correlations approaching 1.0 for factors measuring both Inhibition and Working Memory. High reliability was achieved over a longer test– retest interval that has previously been examined in this age range, suggesting that maturational effects did not limit the reliability of the underlying EF constructs as measured by the latent variable models. In addition, MIMIC models confirmed configural invariance across typically developing and ADHD samples, which is critical for establishing that latent variable approaches can be used for comparison of these populations. The question of configural invariance between diagnostic groups in this age range is not trivial. In particular, in preschool-aged children, EF appears to be best represented as a unidimensional construct captured by a single factor (Hughes, Ensor, Wilson, & Graham, 2009; Wiebe et al., 2011; Willoughby & Blair, 2011). However, in middle childhood, a multidimensional factor structure with three to five factors is most often found, suggesting that EF abilities become more differentiated with age (Lee et al., 2013; Miyake et al., 2000; Shing et al., 2010). Children in our age

Downloaded from jad.sagepub.com at Bobst Library, New York University on February 11, 2016

10

Journal of Attention Disorders 

Figure 2.  Figure 2 shows the two-time point, two-factor CFA model, including factor loading at each time point and the correlation (stability) of factors across the two testing occasions.

Note. CFA = confirmatory factor analysis; T1 = Time 1 (test); T2 = Time 2 (retest); GNG Hits = Go/No-Go Correct Go Trials; GNG Com = Go/ No-Go Commission Errors; HTKS = Head-Toes-Knees-Shoulder; DCCS = Dimensional Change Card Sort; Word Span = Backward Word Span; FWF = Finger Windows Forward; FWB = Finger Windows Backward.

range are on the cusp of these two time periods. Furthermore, children with ADHD are often conceptualized as having maturational delays in prefrontal regions supporting EF (Gilliam et al., 2011; Kofler et al., 2013; Lijffijt et al., 2005; Mackie et al., 2007; Shaw et al., 2007), which implies that different factor structures may be needed to capture EF in typically developing and ADHD children of the same age. However, this was not the case. In the current study, the same multidimensional factor solution fit the ADHD and typically developing groups equally well. Although groups differed in factor means, capturing well-documented ADHD-related deficits on both working memory and inhibitory control, there were no direct effects of ADHD diagnosis on individual tasks. The lack of direct effects confirms that the tasks functioned similarly in both groups. Future studies with larger samples will be needed to establish strong measurement invariance, including metric and scalar invariance. However, these results suggest that latent variable approaches are a viable solution for addressing problems created by low individual task reliabilities, which reduce power for detecting between-group effects, limit the ability to detect developmental change, and interfere with the use of neurocognitive measures in endophenotype studies.

Although one recommendation from this study is to capitalize on latent variable approaches to maximize reliability, this may not always be possible, and so, several individual tasks that did not achieve adequate reliability are worth highlighting. Consider, first, the Delay Aversion task. Previous delay tasks for which adequate reliability has been reported in preschool-aged children either used test–retest intervals that were exceptionally short (~15 min), which would have artificially inflated their reliabilities, or used tangible rewards (cookies, pennies; Beck et al., 2011; Thorell & Wahlstedt, 2006). Thus, it may be that young children require tangible rewards to elicit reliable reward choices. Second, in middle childhood, mean RT and SDRT have each demonstrated adequate reliability and shown promising associations with behavioral symptoms and putative genetic mechanisms of ADHD (Kuntsi et al., 2005; Kuntsi, Oosterlaan, & Stevenson, 2001; Wood, Asherson, van der Meere, & Kuntsi, 2010). In contrast, in this study, reaction time measures failed to achieve adequate reliability in young children with ADHD, and other studies have found that they do not differentiate young children with and without ADHD (Kalff et al., 2005). The current results suggest that the inability to differentiate groups may be due to low

Downloaded from jad.sagepub.com at Bobst Library, New York University on February 11, 2016

11

Karalunas et al. reliability, rather than because young children with ADHD do not have deficits in processing speed and efficiency. This difference in interpretation has important implications for the search for endophenotypes in particular. One requirement of cognitive endophenotypes is that they should be stable over time regardless of disease course (Gottesman & Gould, 2003). Thus, if speed and variability of information processing do not characterize young children with ADHD, then this would argue against their mediating between gene action and symptom domains. However, if measures of speed and variability of information processing are unreliable in young children with ADHD, then this suggests that reliable measures first need to be identified before developmental trends can be assessed. This problem with interpretation is not limited to reaction time measures. There are a growing number of studies aimed at characterizing the relationships between neurocognitive processes and ADHD symptom domains. These include studies comparing the relative heritability of different cognitive processes (Stins et al., 2005), the strength of relations between endophenotype and symptom domains in different age groups (Brocki, Fan, & Fossella, 2008), and the strength of relations between different neurocognitive processes and symptom domains. In each case, specific attention to task reliability is critical, given that any differences in the strength of association may reflect differences in task reliability rather than the underlying construct being assessed. In the current study, the Delay Aversion and Walk-a-Line Slowly tasks did not load on the Inhibitory Control factor. The inhibition of gross motor movement required for Walka-Line Slowly makes it substantially different from the other inhibitory control tasks, which emphasizes inhibition of prepotent fine motor movements and require stimulus discrimination and choice between two responses options. The low factor loading of the Delay Aversion task is consistent with prior work suggesting reward delay tasks tap a different type of inhibitory control than non-reward-based tasks (Willoughby, Kupersmidt, Voegler-Lee, & Bryant, 2011; Zelazo & Carlson, 2012). In particular, whereas the majority of inhibitory tasks for the current study tapped “cool” executive processes (i.e., inhibitory control elicited in relatively emotion-free contexts), the Delay Aversion tasks taps a “hot” inhibitory process, elicited by the emotionally salient rewards offered. Future studies in which more “hot” executive tasks are administered could test for the presence and reliability of an additional “hot” executive factor. In addition, in the best-fitting final model, the HTKS task loaded on the Working Memory factor as opposed to the Inhibitory Control factor. No cognitive task is process pure; however, the HTKS task is particularly unique in the degree to which it requires many overlapping cognitive processes for successful completion. It is sometimes described

in the literature as a “behavioral regulation” task (Ponitz et al., 2008) to emphasize that multiple cognitive abilities determine a child’s ultimate performance. Our current results indicate that the HTKS task shares more variance with measures of working memory than inhibitory control, suggesting that working memory demands are most salient in determining performance, at least for children in our study’s age range. The current study confirmed configural invariance only for the working memory and inhibitory control factors. A third processing speed factor could not be tested in the CFA models because there were too few and too highly correlated indicators for this factor. Future test batteries that incorporate a larger number of tasks assessing processing speed will be required to address this limitation. Finally, additional studies with larger sample sizes will be required to confirm the current results and to test for other types of measurement invariance, including invariance of factor loadings between groups.

Conclusion and Future Directions Current results provide evidence that many measures of cognitive abilities do not achieve adequate levels of reliability in preschool-aged children. Results have implications for both clinical practice and research. Interpretation of individual test scores is still the norm in most clinical settings, particularly with young children for whom fewer test options are available; however, this practice may lead to misrepresentation of children’s cognitive abilities when test reliability is poor. Emphasis on composite scores or patterns of performance across several tests may be more appropriate. Additional research to develop guidelines for evidence-based assessment is needed. In clinical research, the use of measures with poor reliability hinders the field’s ability to identify associations with symptom domains and may lead to erroneous conclusions about the developmental stability of cognitive phenotypes in psychiatric disorders. Selection of reliable measures is increasingly important as greater emphasis is placed on the use of neurocognitive measures to identify genetic mechanisms with relatively small individual effects, and the current study suggests that latent variable approaches are a viable solution to problems created by low reliability of individual tasks. Declaration of Conflicting Interests The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article:

Downloaded from jad.sagepub.com at Bobst Library, New York University on February 11, 2016

12

Journal of Attention Disorders 

Funding for this study was provided by the National Institutes of Health (grant R34 MH085889).

References American Psychiatric Association. (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: Author. American Psychiatric Association. (2013). Diagnostic and statistical manual of mental disorders (5th ed.). Arlington, VA: American Psychiatric Publishing. Archibald, S. J., & Kerns, K. A. (1999). Identification and description of new tests of executive functioning in children. Child Neuropsychology, 5, 115-129. Barkley, R. A. (1997). Behavioral inhibition, sustained attention, and executive functions: Constructing a unifying theory of ADHD. Psychological Bulletin, 121, 65-94. Bauermeister, J. J., Bird, H. R., Shrout, P. E., Chavez, L., Ramírez, R., & Canino, G. (2011). Short-term persistence of DSM-IV ADHD diagnoses: Influence of context, age, and gender. Journal of the American Academy of Child & Adolescent Psychiatry, 50, 554-562. Beck, D. M., Schaefer, C., Pang, K., & Carlson, S. M. (2011). Executive function in preschool children: Test-retest reliability. Journal of Cognitive Development, 12, 169-193. Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57, 289-300. Berlin, L., & Bohlin, G. (2002). Response inhibition, hyperactivity, and conduct problems among preschool children. Journal of Clinical Child & Adolescent Psychology, 31, 242-251. Bishop, D. V. M., Aamodt-Leeper, G., Creswell, C., McGurk, R., & Skuse, D. H. (2001). Individual differences in cognitive planning on the Tower of Hanoi task: Neuropsychological maturity or measurement error? Journal of Child Psychology and Psychiatry, 42, 551-556. Brocki, K., Fan, J., & Fossella, J. (2008). Placing neuroanatomical models of executive function in a developmental context: Imaging and imaging genetics strategies. Annals of the New York Academy of Sciences, 1129, 246-255. Carlson, S. M. (2005). Developmentally sensitive measures of executive function in preschool children. Developmental Neuropsychology, 28, 595-616. Castellanos, F. X., Sonuga-Barke, E., Milham, M. P., & Tannock, R. (2006). Characterizing cognition in ADHD: Beyond executive dysfunction. Trends in Cognitive Sciences, 10, 117-123. Conners, C. K. (2003). Conners’ Rating Scales: Revised technical manual. New York, NY: Multi-Health Systems. Curchack-Lichtin, J. T., Chacko, A., & Halperin, J. M. (2014). Changes in ADHD symptom endorsement: Preschool to school age. Journal of Abnormal Child Psychology, 42, 993-1004. Davis, H. L., & Pratt, C. (1995). The development of children’s theory of mind: The working memory explanation. Australian Journal of Psychology, 47, 25-31. Diamond, A. (2005). Attention-deficit disorder (attention-deficit/ hyperactivity disorder without hyperactivity): A neurobiologically and behaviorally distinct disorder from attention-deficit/ hyperactivity disorder (with hyperactivity). Development and

Psychopathology. Special Issue: Integrating Cognitive and Affective Neuroscience and Developmental Psychopathology, 17, 807-825. Diamond, A., & Taylor, C. (1996). Development of an aspect of executive control: Development of the abilities to remember what I said and to “do as I say, not as I do.” Developmental Psychobiology, 29, 315-334. DuPaul, G., Power, T., Anastopoulos, A., & Reid, R. (1998). ADHD Rating Scale–IV: Checklists, norms, and clinical interpretation. New York, NY: Guilford Press. Frye, D., Zelazo, P. D., & Palfai, T. (1995). Theory of mind and rule-based reasoning. Cognitive Development, 10, 483-527. Gilliam, M., Stockman, M., Malek, M., Sharp, W., Greenstein, D., Lalonde, F., . . . Shaw, P. (2011). Developmental trajectories of the corpus callosum in attention-deficit/hyperactivity disorder. Biological Psychiatry, 69, 839-846. Gnys, J. A., & Willis, G. (1991). Validation of executive function tasks with young children. Developmental Neuropsychology, 7, 487-501. Gottesman, I. I., & Gould, T. D. (2003). The endophenotype concept in psychiatry: Etymology and strategic intentions. American Journal of Psychiatry, 160, 636-645. Green, M. F., Nuechterlein, K. H., Gold, J. M., Barch, D. M., Cohen, J., Essock, S., . . . Heaton, R. K. (2004). Approaching a consensus cognitive battery for clinical trials in schizophrenia: The NIMH-MATRICS conference to select cognitive domains and test criteria. Biological Psychiatry, 56, 301-307. Huang-Pollock, C. L., Karalunas, S. L., Tam, H., & Moore, A. N. (2012). Evaluating vigilance deficits in ADHD: A meta-analysis of CPT performance. Journal of Abnormal Psychology, 121, 360-371. Hughes, C., Ensor, R., Wilson, A., & Graham, A. (2009). Tracking executive function across the transition to school: A latent variable approach. Developmental Neuropsychology, 35, 20-36. Insel, T., Cuthbert, B., Garvey, M., Heinssen, R., Pine, D. S., Quinn, K., . . . Wang, P. (2010). Research domain criteria (RDoC): Toward a new classification framework for research on mental disorders. American Journal of Psychiatry, 167, 748-751. Kail, R. (2007). Longitudinal evidence that increases in processing speed and working memory enhance children’s reasoning. Psychological Science, 18, 312-313. Kalff, A. C., De Sonneville, L. M. J., Hurks, P. P. M., Hendriksen, J. G. M., Kroes, M., Feron, F. J. M., . . . Jolles, J. (2005). Speed, speed variability, and accuracy of information processing in 5 to 6-year-old children at risk of ADHD. Journal of the International Neuropsychological Society, 11, 173-183. Kendler, K., & Neale, M. (2010). Endophenotype: A conceptual analysis. Molecular Psychiatry, 15, 789-797. Kim, E. S., Yoon, M., & Lee, T. (2012). Testing measurement invariance using MIMIC likelihood ratio test with a critical value adjustment. Educational and Psychological Measurement, 72, 469-492. Kindlon, D. J., Mezzacappa, E., & Earls, F. (1995). Psychometric properties of impulsivity measures: Temporal stability, validity and factor structure. Journal of Child Psychology and Psychiatry, 36, 645-661. Kline, R. (2013). Exploratory and confirmatory factor analysis. In Y. Petscher & C. Schattsschneider (Eds.), Applied quantita-

Downloaded from jad.sagepub.com at Bobst Library, New York University on February 11, 2016

13

Karalunas et al. tive analysis in the social sciences (pp. 171-207). New York, NY: Routledge. Kochanska, G., Murray, K., Jacques, T. Y., Koenig, A. L., & Vandegeest, K. A. (1996). Inhibitory control in young children and its role in emerging internalization. Child Development, 67, 490-507. Kofler, M. J., Rapport, M. D., Sarver, D. E., Raiker, J. S., Orban, S. A., Friedman, L. M., & Kolomeyer, E. G. (2013). Reaction time variability in ADHD: A meta-analytic review of 319 studies. Clinical Psychology Review, 33, 795-811. Kuntsi, J., Andreou, P., Ma, J., Borger, N., & Van der Meere, J. (2005). Testing assumptions for endophenotype studies in ADHD: Reliability and validity of tasks in a general population sample. BMC Psychiatry, 5, e.1-e.11. Kuntsi, J., Neale, B. M., Chen, W., Faraone, S. V., & Asherson, P. (2006). The IMAGE project: Methodological issues of the molecular genetic analysis of ADHD. Behavioral and Brain Functions, 2, 27. Kuntsi, J., Oosterlaan, J., & Stevenson, J. (2001). Psychological mechanisms in hyperactivity: I Response inhibition deficit, working memory impairment, delay aversion, or something else? Journal of Child Psychology and Psychiatry, 42, 199-210. Kuntsi, J., Stevenson, J., Oosterlaan, J., & Sonuga-Barke, E. J. S. (2001). Test-retest reliability of a new delay aversion task and executive function measures. British Journal of Developmental Psychology, 19, 339-348. Lahey, B. B., Applegate, B., McBurnett, K., Biederman, J., Greenhill, L., Hynd, G. W., . . . Shaffer, D. (1994). DSM-IV field trials for attention-deficit hyperactivity disorder in children and adolescents. American Journal of Psychiatry, 151, 1673-1685. Lee, K., Bull, R., & Ho, R. M. (2013). Developmental changes in executive functioning. Child Development, 84, 1933-1953. Lenzenweger, M. F. (2013). Thinking clearly about the endophenotype-intermediate phenotype-biomarker distinctions in developmental psychopathology research. Development and Psychopathology, 25, 1347-1357. Lijffijt, M., Kenemans, J. L., Verbaten, M. N., & van Engeland, H. (2005). A meta-analytic review of stopping performance in attention-deficit/hyperactivity disorder: Deficient inhibitory motor control? Journal of Abnormal Psychology, 114, 216-222. Mackie, S., Shaw, P., Lenroot, R., Pierson, R., Greenstein, D., Nugent, T., . . . Rapoport, J. (2007). Cerebellar development and clinical outcome in attention deficit hyperactivity disorder. American Journal of Psychiatry, 164, 647-655. Miyake, A., Friedman, N. P., Emerson, M. J., Witzki, A. H., & Howerter, A. (2000). The unity and diversity of executive functions and their contributions to complex “frontal lobe” tasks: A latent variable analysis. Cognitive Psychology, 41, 49-100. Munakata, Y. (2001). Graded representations in behavioral dissociations. Trends in Cognitive Sciences, 5, 309-315. Muthén, B. O. (1989). Latent variable modeling in heterogeneous populations. Psychometrika, 54, 557-585. Muthén, B. O., & Muthén, L. (2009). Topic 1. Introductory— Advanced factor analysis and structural equation modeling with continuous outcomes (Short Course Video). Retrieved from https://www.statmodel.com/course_materials.shtml

Muthén, L. K., & Muthén, B. O. (1998-2012). Mplus user’s guide (7th ed.). Los Angeles, CA: Author. Nolen-Hoeksema, S., & Watkins, E. R. (2011). A heuristic for developing transdiagnostic models of psychopathology: Explaining multifinality and divergent trajectories. Perspectives on Psychological Science, 6, 589-609. Ponitz, C. E. C., McClelland, M. M., Jewkes, A. M., Connor, C. M., Farris, C. L., & Morrison, F. J. (2008). Touch your toes! Developing a direct measure of behavioral regulation in early childhood. Early Childhood Research Quarterly, 23, 141-158. Reynolds, C., & Kamphaus, R. (2004). Behavioral assessment system for children manual (2nd ed.). Circle Pines, MN: AGS Publishing. Rose, S. A., Feldman, J. F., & Jankowski, J. J. (2011). Modeling a cascade of effects: The role of speed and executive functioning in preterm/full-term differences in academic achievement. Developmental Science, 14, 1161-1175. Rousson, V., Gasser, T., & Burkhardt, S. (2002). Assessing intrarater, interrater and test-retest reliability of continuous measurement. Statistics in Medicine, 21, 3431-3446. Sanislow, C. A., Pine, D. S., Quinn, K. J., Kozak, M. J., Garvey, M. A., Heinssen, R. K., . . . Cuthbert, B. N. (2010). Developing constructs for psychopathology research: Research domain criteria. The Journal of Abnormal Psychology, 119, 631-639. Sass, D. (2011). Testing measurement invariance and comparing latent factor means within a confirmatory factor analysis framework. Journal of Psychoeducational Assessment, 29, 347-363. Schoemaker, K., Bunte, T., Wiebe, S. A., Espy, K. A., Deković, M., & Matthys, W. (2012). Executive function deficits in preschool children with ADHD and DBD. Journal of Child Psychology and Psychiatry, 53, 111-119. Shaffer, D., Fisher, P., Lucas, C., Dulcan, M., & Schwab-Stone, M. (2000). NIMH Diagnostic Interview Schedule for Children– Version IV (NIMH DISC-IV): Description, differences from previous versions, and reliability of some common diagnoses. Journal of the American Academy of Child & Adolescent Psychiatry, 39, 28-38. Shaw, P., Eckstrand, K., Sharp, W., Blumenthal, J., Lerch, J. P., Greenstein, D., . . . Rapoport, J. L. (2007). Attention-deficit/ hyperactivity disorder is characterized by a delay in cortical maturation. Proceedings of the National Academy of Sciences of the United States of America, 104, 19649-19654. Sheslow, D., & Adams, W. (2003). Wide Range Assessment of Memory and Learning, 2nd ed. (WRAML-2): Administration and technical manual. Wilmington, DE: Wide Range. Shing, Y. L., Lindenberger, U., Diamond, A., Li, S.-C., & Davidson, M. C. (2010). Memory maintenance and inhibitory control differentiate from early childhood to adolescence. Developmental Neuropsychology, 35, 679-697. Smith-Donald, R., Raver, C. C., & Hayes, T. (2007). Preliminary construct and concurrent validity of the Preschool SelfRegulation Assessment (PSRA) for field-based research. Early Childhood Research Quarterly, 22, 173-187. Sonuga-Barke, E., Taylor, E., Sembi, S., & Smith, J. (1992). Hyperactivity and delay aversion: I. The effect of delay on choice. Journal of Child Psychology and Psychiatry, 33, 387-398.

Downloaded from jad.sagepub.com at Bobst Library, New York University on February 11, 2016

14

Journal of Attention Disorders 

Soreni, N., Crosbie, J., Ickowicz, A., & Schachar, R. (2009). Stop signal and Conners’ continuous performance tasks: Testretest reliability of two inhibition measures in ADHD children. Journal of Attention Disorders, 13, 137-143. Stins, J. F., de Sonneville, L. M., Groot, A. S., Polderman, T. C., van Baal, C. G., & Boomsma, D. I. (2005). Heritability of selective attention and working memory in preschoolers. Behavior Genetics, 35, 407-416. Thorell, L. B. (2007). Do delay aversion and executive function deficits make distinct contributions to the functional impact of ADHD symptoms? A study of early academic skill deficits. Journal of Child Psychology and Psychiatry, 48, 1061-1070. Thorell, L. B., & Wahlstedt, C. (2006). Executive functioning deficits in relation to symptoms of ADHD and/or ODD in preschool children. Infant and Child Development, 15, 503-518. Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4-70. Wiebe, S. A., Espy, K. A., & Charak, D. (2008). Using confirmatory factor analysis to understand executive control in preschool children: I. Latent structure. Developmental Psychology, 44, 575-587. Wiebe, S. A., Sheffield, T., Nelson, J. M., Clark, C. A., Chevalier, N., & Espy, K. A. (2011). The structure of executive function in 3-year-olds. Journal of Experimental Child Psychology, 108, 436-452. Williams, B. R., Ponesse, J. S., Schachar, R. J., Logan, G. D., & Tannock, R. (1999). Development of inhibitory control across the life span. Developmental Psychology, 35, 205-213. Willoughby, M., & Blair, C. (2011). Test-retest reliability of a new executive function battery for use in early childhood. Child Neuropsychology, 17, 564-579. Willoughby, M., Kupersmidt, J., Voegler-Lee, M., & Bryant, D. (2011). Contributions of hot and cool self-regulation to preschool disruptive behavior and academic achievement. Developmental Neuropsychology, 36, 162-180. Wood, A. C., Asherson, P., van der Meere, J. J., & Kuntsi, J. (2010). Separation of genetic influences on attention deficit

hyperactivity disorder symptoms and reaction time performance from those on IQ. Psychological Medicine: A Journal of Research in Psychiatry and the Allied Sciences, 40, 1027-1037. Woods, C. M. (2009). Evaluation of MIMIC-model methods for DIF testing with comparison to two-group analysis. Multivariate Behavioral Research, 44, 1-27. Zelazo, P. D. (2004). The development of conscious control in childhood. Trends in Cognitive Sciences, 8, 12-17. Zelazo, P. D., & Carlson, S. M. (2012). Hot and cool executive function in childhood and adolescence: Development and plasticity. Child Development Perspectives, 6, 354-360. Zelazo, P. D., & Frye, D. (1998). Cognitive complexity and control: II. The development of executive function in childhood. Current Directions in Psychological Science, 7, 121-126.

Author Biographies Sarah L. Karalunas is an Assistant Professor of Psychiatry and Behavioral Neuroscience at Oregon Health & Science University. Her research takes a multi-method approach to characterizing the basic cognitive, emotional, and neurophysiological processes that contribute to ADHD, and to understanding how these processes differ between individuals with the disorder. Karen L. Bierman is the McCourtney Professor of Child Studies, Professor of Psychology and Human Development and Family Studies, and Director of the Child Study Center at The Pennsylvania State University. Her research career has focused on social-emotional development and children at risk, with an emphasis on the design and evaluation of school-based programs that promote social competence, school readiness, positive peer relations, and that reduce aggression and related behavior problems. Cynthia L. Huang-Pollock is an Associate Professor of Psychology at The Pennsylvania State University. She is interested in the cognitive and neuropsychological mechanisms that contribute to the development of severe attention, learning, and disruptive behavior problems in children.

Downloaded from jad.sagepub.com at Bobst Library, New York University on February 11, 2016

Test-Retest Reliability and Measurement Invariance of Executive Function Tasks in Young Children With and Without ADHD.

Measurement reliability is assumed when executive function (EF) tasks are used to compare groups or to examine relationships between cognition and eti...
1MB Sizes 0 Downloads 9 Views