LSHSS

Review Article

A Psychometric Review of Norm-Referenced Tests Used to Assess Phonological Error Patterns

Cecilia Kirk and Laura Vigeland

Purpose: The authors provide a review of the psychometric properties of 6 norm-referenced tests designed to measure children’s phonological error patterns. Three aspects of the tests’ psychometric adequacy were evaluated: the normative sample, reliability, and validity. Method: The specific criteria used for determining the psychometric adequacy of these tests were based on current recommendations in the literature. Test manuals and response forms were reviewed for psychometric adequacy according to these criteria. Results: The tests included in this review failed to exhibit many of the psychometric properties required of well-designed norm-referenced tests. Of particular concern were the lack of adequate sample size, poor evidence of construct validity, and lack of information about diagnostic accuracy. Conclusions: To ensure that clinicians have access to valid and reliable tests, test developers must make a greater effort to establish that the tests they design have adequate psychometric properties. The authors hope that this review will help clinicians and other professionals to be more aware of some of the limitations of using these tests to make educational decisions.

The majority of speech-language pathologists (SLPs) who work with pediatric populations report that they always administer a norm-referenced single-word test to assess the speaking skills of children suspected of having a speech sound disorder (SSD; Skahan, Watson, & Lof, 2007). SLPs use single-word tests because they are an efficient way of collecting information about a child’s production of speech sounds, although a connected speech sample should also be included in any phonological assessment battery because it allows speech to be evaluated in a more natural context (Bernthal, Bankson, & Flipsen, 2013). The results of single-word tests are useful for a variety of purposes. One of the most common uses of these tests is to determine whether a client meets eligibility requirements to receive speech-language services. Single-word tests are also used to inform the writing of intervention goals and to document a client’s performance in response to intervention (Hodson, Scherz, & Strattman, 2002). The purpose of the current article is to provide a review of the psychometric properties of six norm-referenced tests that use the elicitation of single words to measure children’s phonological error patterns.

According to a survey conducted by Skahan et al. (2007), phonological error pattern analysis is the most commonly used speech sound analysis procedure administered by pediatric SLPs. This type of analysis is recommended for children with multiple speech errors as part of a comprehensive assessment battery that also includes stimulability testing, the elicitation of a connected speech sample, a case-history review, interviews with parents and teachers, an oral cavity examination, a hearing screening, and perceptual testing (Bernthal et al., 2013). Because tests of phonological error patterns are often used in high-stakes educational decision making, it is important that they can be trusted to provide reliable and valid results. In previously published work, authors have discussed the psychometric adequacy of tests used to identify a variety of impairments in children, including language and articulation tests for preschoolers (McCauley & Swisher, 1984), language tests for young children (Andersson, 2005; Plante & Vance, 1994), and childhood vocabulary tests (Bogue, DeThorne, & Schaefer, 2014). In the current review, we expand on such work by providing recommendations that are specific to evaluating tests used to assess children’s phonological development. In the next section, we discuss the rationale for the criteria that we have used to determine the psychometric adequacy of the tests included in this review.

University of Oregon, Eugene

Correspondence to Cecilia Kirk: [email protected]

Editor: Marilyn Nippold
Associate Editor: Rebecca McCauley

Received July 29, 2013
Revision received March 17, 2014
Accepted July 29, 2014
DOI: 10.1044/2014_LSHSS-13-0053

Disclosure: The authors have declared that no competing interests existed at the time of publication.

Language, Speech, and Hearing Services in Schools • Vol. 45 • 365–377 • October 2014 • © American Speech-Language-Hearing Association


The criteria that we have adopted to evaluate the psychometric adequacy of these tests are based on the assumption that these tests will be used for the most important purposes for which they may have been developed: to diagnose a phonological disorder and determine eligibility for speech services. Each test included in our review is evaluated regarding the appropriateness of its normative sample, as well as whether adequate evidence of reliability and validity is documented by the test developers. The cutoff values for the criteria we have adopted are consistent with the recommendations in the literature where these recommendations exist. However, there are few established guidelines regarding specific cutoff values for determining the adequacy of validity. Instead, test makers are responsible for “furnishing relevant evidence and a rationale in support of the intended test use,” and test users are responsible for “evaluating the evidence in the particular setting in which the test is to be used” (American Educational Research Association [AERA], American Psychological Association, & National Council on Measurement in Education, 1999, p. 11). We have suggested cutoff values that assume the tests will be used for high-stakes educational decision making, especially that related to diagnosis of a phonological disorder.

The Normative Sample

Standardized, norm-referenced educational tests provide normative data that allow a clinician to compare a child’s ability to that of his or her same-age peers. Clinicians use these normative data to make judgments about a child’s score(s) on the test in comparison to the general population even though the entire population was not tested. Because a child’s test score is often used to compare his or her standing with the entire population of same-age peers, it is important that the normative sample represents the range of variability that is present in the general population (Salvia, Ysseldyke, & Bolt, 2013).

Normative Sample Size

In educational testing, normative data are usually provided for specific age groups and occasionally for subgroups within these age groups, such as gender. When test developers collect the normative data, it is important that a sufficiently large number of participants in each subgroup be tested, because this increases the likelihood that the sample represents the variability and distribution of scores of the wider population. To evaluate whether the tests have an adequate sample size, it has been recommended that there should be at least 100 individuals in each subgroup for which normative data are reported (Andersson, 2005; Salvia et al., 2013; Sattler, 2008).

Demographics of the Normative Sample

The demographics of the normative sample should be similar to the demographics of the entire population who may take the test (Anastasi & Urbina, 1997). It is important that the manuals of tests intended for nationwide


use report the race/ethnicity, socioeconomic status (SES), gender, and geographic region of the normative sample because these factors may influence a child’s speech and language development. For example, children from lower SES backgrounds often have lower language test scores than children from higher SES backgrounds (Dollaghan et al., 1999; Hoff, 2003). Furthermore, the demographic subgroups should be represented in proportions within 5% of the general population at the time the normative data are collected (Andersson, 2005).

Recency of the Normative Data

Population demographics change over time, and so normative data must be updated periodically to accurately represent the current population. Salvia et al. (2013) recommend 15 years as the maximum acceptable life of normative data used in ability testing.

Inclusion of Individuals for Whom the Test Is Intended

In Standards for Educational and Psychological Testing, AERA et al. (1999) have recommended that normative samples include individuals for whom the test is intended. Because educational testing is often used to determine whether or not a child has a disability, children with relevant disabilities should be included in the normative sample. Therefore, for tests that assess children’s phonological development, children with SSD should be included in the normative sample. It has also been recommended that when the proportion of children with a disability constitutes a relatively small proportion of the population, these children should be represented in the normative sample in proportions within 2% of the general population (Andersson, 2005). In addition, tests should report the proportions of children with SSD for all normative subgroups. This is particularly important for tests of phonological error patterns because the prevalence of SSD diminishes with age. The prevalence of 3-year-olds with SSD has been estimated to be as high as 15.6% (Campbell et al., 2003), whereas the prevalence in 6-year-olds is estimated to be only 3.8% (Shriberg, Tomblin, & McSweeny, 1999).
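The 5% and 2% representation guidelines above reduce to a simple comparison of proportions. The following is a minimal Python sketch of that comparison using entirely hypothetical census and sample figures; the numbers, subgroup labels, and function are ours for illustration and are not data or procedures from any reviewed test.

```python
# Illustrative sketch (hypothetical figures): checking whether normative-sample
# proportions fall within the tolerances discussed above (5% for demographic
# subgroups, 2% for children with SSD).

CENSUS = {"Northeast": 0.17, "Midwest": 0.21, "South": 0.38, "West": 0.24}
SAMPLE = {"Northeast": 0.14, "Midwest": 0.28, "South": 0.35, "West": 0.23}

def beyond_tolerance(sample, reference, tolerance):
    """Return the subgroups whose sample proportion deviates from the reference by more than tolerance."""
    return {k: (sample[k], reference[k])
            for k in reference
            if abs(sample[k] - reference[k]) > tolerance}

print(beyond_tolerance(SAMPLE, CENSUS, 0.05))        # {'Midwest': (0.28, 0.21)}

# Same idea for representation of children with SSD (2% tolerance), using the
# age-specific prevalence estimates cited above as the reference values.
ssd_prevalence = {"3-year-olds": 0.156, "6-year-olds": 0.038}
ssd_in_sample  = {"3-year-olds": 0.02,  "6-year-olds": 0.01}
print(beyond_tolerance(ssd_in_sample, ssd_prevalence, 0.02))   # both age groups flagged
```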

Reliability

Reliability refers to the consistency of a test’s results when the testing procedure is repeated on a population of individuals or groups (AERA et al., 1999). Test reliability measures the effect of random error on test scores. If a test demonstrates a high degree of reliability, the test is not greatly affected by random error. No test is completely free from random error, but steps should be taken by the test makers to ensure that random error is minimized. Reliability contributes to the evidence supporting the validity of test scores. If a test gives different scores each time that it is administered to the same individual, it is less reliable, and therefore the scores are not as valid as they would be if the scores were more consistent across time. However, a test can give similar results each time it is administered to the same individual but still provide invalid test scores. The test could give the same incorrect results


at each administration, as would be the case if there were a systematic error in the test. Thus, reliability provides necessary but not sufficient evidence of validity (AERA et al., 1999).

The amount of random error in test scores is often measured using reliability correlation coefficients. It has been recommended that if a test score is to be used to make important educational decisions about a student, such as determining a child’s eligibility for services, then the reliability coefficient should be at least .90 (Bracken, 1987; Nunnally, 1978; Salvia et al., 2013). Reliability coefficients may vary according to group differences in ability level. For example, groups of younger participants and groups of less able individuals tend to have lower reliability (Anastasi & Urbina, 1997; Bracken, 1987). It is therefore recommended that reliability should be calculated for each subgroup for which the test manual provides normative data, such as age and gender, as well as for individuals with reduced ability on the skills being tested (AERA et al., 1999). However, reliability correlation coefficients for the total scores obtained can be strong and positive even when there is a lack of agreement on individual items (Hutchinson, 1996). The calculation of point-to-point percentage of agreement for individual test items addresses this issue. On the other hand, minor differences among scorers or differences in the performance of the same individual across different testings can be exaggerated by measuring point-to-point agreement. For this reason, Andersson (2005) recommended that reliability be reported using both point-to-point percentage of agreement and reliability correlation coefficients.

Generalizability coefficients provide an alternative to reliability coefficients as a method of measuring potential sources of error. Generalizability theory (Cronbach, Nageswari, & Gleser, 1963) simultaneously considers a number of factors that might affect reliability and may include individuals, raters, items, and settings, among other possibilities. The amount of error caused by each factor as well as the interaction of factors is quantified by applying the techniques of analysis of variance. However, these methods are not commonly used to evaluate the reliability of tests that measure speech and language ability, and none of the tests included in this review reported generalizability coefficients.

Internal Consistency

Internal consistency is a measure of how well test items that are designed to measure the same construct produce similar scores. One way to calculate the internal consistency of a test is to split the test into two equal halves and then correlate participants’ scores on the two halves. This is called a split-half reliability estimate. Because there are many ways to split a test into two parts, the split-half reliability estimate may differ depending on the manner in which the test is split. A more reliable measure of internal consistency is the coefficient alpha, which estimates the average split-half reliability estimate for all possible ways of dividing the test (Hutchinson, 1996).
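For readers who want to see how the coefficient alpha discussed above is computed, here is a minimal Python sketch. The item scores are simulated and purely illustrative; this is our own worked example, not a procedure taken from any reviewed test manual.

```python
# Illustrative sketch (simulated scores): coefficient alpha for a set of
# dichotomously scored test items. Rows are children, columns are items.
import numpy as np

def coefficient_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha: k/(k - 1) * (1 - sum of item variances / variance of total scores)."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Simulated data: 200 hypothetical children responding to 40 items that all
# tap the same underlying ability, so the items should hang together.
rng = np.random.default_rng(0)
ability = rng.normal(size=200)
items = (ability[:, None] + rng.normal(scale=1.0, size=(200, 40)) > 0).astype(float)

print(f"coefficient alpha = {coefficient_alpha(items):.2f}")  # compare against the .90 criterion
```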

Test–Retest Reliability

Test–retest reliability is a measure of the consistency of test scores over time. The ideal test assigns the same results to the same individual each time it is administered. To calculate test–retest reliability, a test is administered to the same set of participants twice, and reliability correlation coefficients and point-to-point percentage of agreement between the administrations should be calculated. It is important that tests used with children report the range of time intervals between the test and the retest (Linn & Gronlund, 2000; McCauley, 2001). Young children develop quickly, so a child could show differences in ability in less than a month. Conversely, if the test is readministered after only a few days, the child may remember his or her answers and repeat them during the readministration, which would inflate test–retest reliability measures. Salvia et al. (2013) recommended 2 weeks as the preferred time frame between test administrations. For most educational and psychological characteristics, this time period is long enough that the participants’ scores will not be inflated by test learning, but short enough that the participants’ scores will not change due to developmental growth. Many tests report the average interval between administrations for all participants but not the range of intervals. This is problematic because the test could be administered to some children on the same day as the first administration and to others 2 months later. In this instance, the average interval for all children could still be 2 weeks. The range of intervals between testings should be reported so that clinicians can be sure that the test was readministered to all children in the test–retest reliability study within an acceptable time frame.

Interscorer Reliability

Interscorer reliability measures the consistency of test scores across different scorers. To establish that a test has adequate interscorer reliability, a test is administered to a group of participants, and then two or more individuals score the participants’ responses independently. Reliability correlation coefficients and point-to-point percentage of agreement between scorers on individual test items should be calculated.
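Both reliability summaries recommended above, a correlation coefficient on total scores and point-to-point percentage of agreement on individual items, can be computed in a few lines. This is an illustrative Python sketch with made-up scores; it is not output from, or the scoring procedure of, any reviewed test.

```python
# Illustrative sketch (hypothetical scorings): the two reliability summaries
# discussed above for test-retest or interscorer data.
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two sets of total scores."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

def point_to_point_agreement(items_a, items_b):
    """Percentage of individual item judgments on which two scorings agree."""
    a, b = np.asarray(items_a), np.asarray(items_b)
    return float((a == b).mean() * 100)

# Two scorers' item-level judgments (1 = error pattern present) for one child.
scorer_1 = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]
scorer_2 = [1, 0, 1, 1, 1, 0, 1, 0, 0, 0]
print(f"agreement = {point_to_point_agreement(scorer_1, scorer_2):.0f}%")   # 90%

# Total scores for ten children scored twice.
totals_t1 = [12, 5, 20, 8, 15, 3, 9, 17, 6, 11]
totals_t2 = [13, 4, 19, 9, 14, 3, 10, 18, 6, 12]
print(f"r = {pearson_r(totals_t1, totals_t2):.2f}")
```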

Validity

Validity refers to the degree to which evidence is provided to support confident use of test scores for their intended purpose (AERA et al., 1999; Linn & Gronlund, 2000). Test authors are expected to provide validation evidence for the most commonly drawn inferences from performance on their test. The Standards (AERA et al., 1999) identified five types of validity evidence that should be included in test manuals:

1. Evidence based on test content
2. Evidence based on internal structure
3. Evidence based on relations to variables external to the test
4. Evidence based on consequences of testing
5. Evidence based on response processes.

We will evaluate only the first three types of evidence listed above. The fourth type of evidence, which examines the consequences of testing, refers to the way in which test validity is informed by whether or not the anticipated benefits of testing were realized. Salvia et al. (2013) noted that this type of validity evidence has not been widely adopted in educational testing, so it will not be included in the current review. The fifth type of evidence, which examines response processes, refers to the way that test takers arrive at their responses to test questions. This type of evidence is more relevant to the testing of skills that involve problem solving and so will not be included in this review. In addition to the types of evidence listed above, we will discuss validity evidence related to diagnostic accuracy. The identification of children with SSD is often cited as the primary use of tests of phonological error patterns (Skahan et al., 2007). Therefore, to be clinically useful, these tests must accurately diagnose a phonological disorder, and it is for this purpose that we are most interested in examining the question of validity.

Evidence Based on Test Content

Test content refers to the information or skills required to respond correctly to the test items as well as the administration and scoring procedures of the test (AERA et al., 1999). Important validity evidence can be gained by evaluating the extent to which a test assesses all relevant aspects of the construct it is intended to measure. Tests that do not assess all essential skills or that assess irrelevant skills do not have appropriate test content. Tests of phonological error patterns are intended to measure speech production accuracy of common phonological patterns. To supply adequate evidence of content validity, these tests should provide sufficient opportunities to assess errors on a core set of phonological patterns.

Evidence Based on Internal Structure

Internal structure refers to “the degree to which relationships among test items and test components conform to the construct on which the proposed test score interpretations are based” (AERA et al., 1999, p. 13). Tests of phonological error patterns assess multiple error patterns, and a child’s total score on a test of this type depends on his or her combined performance on the various error patterns sampled by the test. To have adequate internal structure, tests of phonological error patterns should provide an equal number of opportunities for each error pattern. This will avoid overestimating or underestimating the total scores of children with particular error patterns. For example, a child might demonstrate a single error pattern for which there are many opportunities on a given test. A different child, however, might demonstrate two error patterns, but the test might include relatively few opportunities for these two error patterns. The overall error score of the second child will be lower than that of the child who demonstrates only one error pattern, and this may misrepresent the seriousness of the second child’s speech sound disorder.
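The distortion described in this example is easy to see with concrete numbers. The sketch below uses hypothetical opportunity counts of our own invention; it is not drawn from the item set of any reviewed test.

```python
# Illustrative sketch (hypothetical opportunity counts): how unequal numbers of
# opportunities per error pattern can distort a total error score.
opportunities = {"gliding of /r/": 14, "velar fronting": 3, "final devoicing": 3}

def total_errors(active_patterns, opportunities):
    """Total error score if the child errs on every opportunity for each active pattern."""
    return sum(opportunities[p] for p in active_patterns)

child_a = ["gliding of /r/"]                       # one error pattern, many opportunities
child_b = ["velar fronting", "final devoicing"]    # two error patterns, few opportunities

print(total_errors(child_a, opportunities))        # 14
print(total_errors(child_b, opportunities))        # 6, despite more active error patterns
```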


Evidence Based on Variables External to the Test

An important source of evidence that can be used to support the validity of test scores is provided by analyzing the relationship between test scores and variables external to the test (AERA et al., 1999). The current review examines two types of evidence based on other variables: developmental trends and concurrent validity.

Developmental trends. Many skills or behaviors assessed in educational decision making are developmental. One way to assess the validity of test scores is to examine test scores across age groups. On tests that measure developmental skills or behaviors, older age groups should demonstrate higher levels of performance than younger age groups. Because the phonological systems of typically developing children are known to develop over time, one type of validity evidence that can be used to evaluate tests of phonological error patterns is developmental trends. However, it should be recognized that age-related trends in development are a relatively weak measure of validity because so many skills and personal qualities are developmental (DeThorne & Schaefer, 2004).

Concurrent validity. This type of validity measures how well an individual’s performance on a test predicts his or her current level of skill on the construct being measured. Concurrent validity is assessed by comparing participants’ scores on the test of interest to their scores on a previously validated test. A strong, positive correlation suggests that the new test measures the same construct as the previously validated test. It should be noted, however, that the value of concurrent validity rests on the reliability and validity of the comparison test. Two tests may be highly correlated with each other, but if the established test itself does not have good validity, this correlation will not support the validity of the scores from the new test. The Standards for Educational and Psychological Testing document (AERA et al., 1999; hereafter, “the Standards”) does not provide guidelines for acceptable levels for validity coefficients, so we follow McCauley’s (2001) recommendation that validity coefficients should be at least .90 if the test is going to be used to make important educational decisions about an individual. To ensure that a test of phonological error patterns has adequate evidence of concurrent validity, it should be compared to a previously validated test of phonological error patterns.

Evidence Based on Diagnostic Accuracy

One form of validity evidence that is not addressed by the Standards (AERA et al., 1999) is diagnostic accuracy. Diagnostic accuracy refers to evidence that the test is valid for the purpose of making diagnostic decisions (Dollaghan, 2007; Greenslade, Plante, & Vance, 2009; Plante & Vance, 1994, 1995). We believe that this form of evidence should be evaluated because a common purpose of norm-referenced tests in educational testing is to diagnose a disability. Diagnostic accuracy is often examined in two ways: (a) sensitivity and specificity, and (b) positive and negative predictive values. Sensitivity refers to the degree to which a test correctly identifies all individuals with a disorder who


had been previously diagnosed using the “gold standard” for diagnosing the disorder. For example, if a test has a sensitivity of .90, this means that if 100 children classified by the gold standard as having the disorder were given a test designed to diagnose that disorder, the test would correctly label 90 children as having the disorder and incorrectly label 10 children as not having the disorder. Conversely, specificity refers to the degree to which a test correctly identifies all individuals who do not have the disorder. The sensitivity and specificity of a test are calculated for a specific cut score, which is the value used to determine which individuals have the disorder and which do not (e.g., −1.5 SD). Depending on the cut score, the sensitivity and specificity values will likely vary inversely. For example, a cut score of −2 SD may be more likely to have excellent specificity because children who truly do not have the disorder (as determined by the gold standard) will have scores above the cut score. However, a cut score of −2 SD may be more likely to have relatively low sensitivity because some children who truly have the disorder will probably score higher than −2 SD.

Unlike sensitivity and specificity, positive predictive value (PPV) and negative predictive value (NPV) are influenced by the prevalence of the disorder in the population of interest (the base rate). PPV measures the probability of an individual having the disorder when the test is positive. Conversely, NPV measures the probability of an individual not having the disorder when the test is negative (Lalkhen & McCluskey, 2008). If the prevalence of a disorder is relatively low in a given population sample, the number of false positives increases, and PPV decreases. Conversely, if the prevalence of a disorder is relatively high in a population, the number of false positives decreases, and PPV increases (Lalkhen & McCluskey, 2008). Given that the diagnostic validity of a test depends on both the discriminatory value of the test and the prevalence of the disorder in the population of interest, test manuals should provide information on PPV and NPV in addition to the sensitivity and specificity of the test (Dollaghan, 2007).

In the previous sections, we provided the framework that we will use to determine the psychometric adequacy of the tests included in this review. We have argued that in order to ensure that the testing process leads to correct inferences about test performance, it is important to consider the characteristics of the normative sample, as well as evidence of the reliability and validity of the test scores. We hope that this review will help clinicians to determine the psychometric adequacy of tests that they are currently using or are considering purchasing.
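As a concrete illustration of the diagnostic accuracy measures defined above, the sketch below computes sensitivity, specificity, PPV, and NPV from a 2 × 2 classification table and shows how PPV falls when the same test is applied at a lower base rate. The counts are invented for illustration and do not come from any reviewed test.

```python
# Illustrative sketch (hypothetical counts): sensitivity, specificity, PPV, and NPV,
# and how PPV/NPV shift with the base rate at a fixed cut score.
def diagnostic_accuracy(tp, fn, fp, tn):
    return {
        "sensitivity": tp / (tp + fn),   # disordered children the test flags
        "specificity": tn / (tn + fp),   # typical children the test clears
        "ppv": tp / (tp + fp),           # probability of disorder given a positive result
        "npv": tn / (tn + fn),           # probability of no disorder given a negative result
    }

# 100 children with the disorder and 100 without (a 50% base rate).
print(diagnostic_accuracy(tp=90, fn=10, fp=8, tn=92))

# Same sensitivity and specificity applied to a low-prevalence group (50 of 1,000):
# PPV drops sharply because false positives now outnumber true positives.
print(diagnostic_accuracy(tp=45, fn=5, fp=76, tn=874))
```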

Method

Inclusion Criteria

To select the tests included in this review, we initially consulted the list of phonological and articulation tests provided online by the American Speech-Language-Hearing Association (n.d.). Tests that met the following criteria were selected for further review: (a) the test examined phonological error patterns using single-word production, (b) the test was published in or after 1990, (c) the test provided normative data as well as standardized instructions for administration and scoring, and (d) the test could be obtained for purchase from the publisher or author. Table 1 provides a list of the six tests that met all four specified inclusion criteria.

The Review Process

The psychometric properties of the six tests included in our review were evaluated according to the 12 criteria listed in Table 2. As previously indicated, these criteria were selected because of their relevance to tests of phonological error patterns. Suggested cutoff values assume the tests will be used to diagnose a phonological disorder and determine eligibility for speech services. Note that for some criteria, there were several subcriteria, for example, the criterion that examined the demographics of the normative sample. All subcriteria relating to a given criterion had to be met before it was determined that a test completely met that criterion. Both authors independently examined the test manuals and response forms for each of the six tests included in the review to identify whether each of the 12 criteria listed in Table 2 was completely met, was not met or was only partially met, or evidence was not provided within the test manual. Point-to-point agreement was calculated for a total of 72 rating judgments (6 tests × 12 criteria). The percentage of agreement between the two scorers was 100%.

The current review evaluated the adequacy of each test’s normative sample in terms of sample size, the demographics of the sample, the recency of the sample, and the inclusion of children with SSD in the sample. Each test was also evaluated for three types of reliability: internal consistency, test–retest reliability, and interscorer reliability. Four types of validity evidence were evaluated: evidence based on test content, evidence based on internal structure, evidence based on variables external to the test, and evidence based on diagnostic accuracy.

In order to evaluate validity evidence based on test content, the tests included in the current review were examined to determine whether they provided sufficient opportunities to assess 11 core phonological error patterns. The complete list of error patterns together with their definitions is provided in the Appendix. The error patterns selected for this review include the 10 most common patterns produced by the 670 children between the ages of 2;0 and 8;11 (years; months) who were tested to create the normative sample of the Diagnostic Evaluation of Articulation and Phonology (DEAP; Dodd, Hua, Crosbie, Holm, & Ozanne, 2006). However, we made two changes to the error patterns listed in the DEAP. Given that velar fronting resolves earlier than palatal fronting in typically developing children (Stoel-Gammon & Dunn, 1985), we evaluated opportunities for fronting of velars separately from fronting of palatals. Furthermore, because vocalization of postvocalic /l/ is widespread in the speech of adult speakers of standard American English (Gordon, 2004a, 2004b; Kretzschmar, 2004; Schneider,



Table 1. Summary of tests that met inclusion criteria.

Test | Analyses available | Age range (years;months) | Number of test items included in phonological analysis

Bankson-Bernthal Test of Phonology (BBTOP; Bankson & Bernthal, 1990) | Consonant Inventory; Word Inventory; *Phonological Processes Inventory | 3;0 to 9;11 | 80

Clinical Assessment of Articulation and Phonology, 2nd ed. (CAAP–2; Secord & Donohue, 2014) | Consonant Inventory; Vowel Checklist; School-Age Sentences; Stimulability Assessment; *Phonological Process Evaluation | 2;6 to 11;11 | 84

Diagnostic Evaluation of Articulation and Phonology (DEAP; Dodd, Hua, Crosbie, Holm, & Ozanne, 2006) | Diagnostic Screen; Oral Motor Screen; Articulation Assessment; Phoneme Stimulability; *Phonology Assessment: Single-Word Production; Phonology Assessment: Connected Speech; Word Inconsistency Assessment | 3;0 to 8;11 | 50

Hodson Assessment of Phonological Patterns, 3rd ed. (HAPP–3; Hodson, 2004) | Preschool Phonological Screening; Consonant Inventory; Vowel Inventory; Stimulability; *Assessment of Phonological Patterns; Multisyllabic Word Screening | 3;0 to 7;11 | 50

Khan-Lewis Phonological Analysis, 2nd ed. (KLPA–2; Khan & Lewis, 2002) | Phonetic Inventory; *Phonological Process Analysis | 2;0 to 21;11 | 53

Structured Photographic Articulation Test II featuring Dudsberry (SPAT–D II; Dawson & Tattersall, 2001) | Consonant Inventory: Single Words; Consonant Inventory: Connected Speech; Vowel Inventory; Stimulability; *Word Level Phonological Processes | 3;0 to 9;11 | 45

Note. An asterisk denotes the specific analysis that was included in this review.

2004; Wells, 1992), we did not examine this error pattern. For a more comprehensive discussion of the theory and research that justifies the selection and definition of all 11 error patterns, please refer to Kirk and Vigeland (in press). It should be remembered that not all of the error patterns evaluated in this review are relevant for all speakers. For example, the error pattern of derhoticization may not be relevant for children who speak African American English because many adult speakers of this dialect speak a non-rhotic variety of English (Schneider, 2004). Similarly, in Spanish, only a limited set of consonants is phonotactically legal in word-final position, which may result in some Spanish speakers deleting word-final consonants when pronouncing English words (Goldstein, 2001). Clinicians are responsible for familiarizing themselves with the cultural and linguistic norms of the clients for whom they provide services so that they can determine whether a client’s response is an appropriate dialectal variation of the standard form. For the purposes of the current review, a word was considered to provide an opportunity for an error pattern only if the target phoneme was a singleton consonant (except for the error patterns of final consonant deletion and cluster reduction). Thus, rain was considered to provide an opportunity for gliding of /r/, but crab was not. When a consonant cluster is reduced to a singleton (e.g., crab produced as cab), this very often removes the opportunity for another error pattern to occur. For example, a child who deletes /r/ in crab cannot glide


the /r/ to [w]. Therefore, clusters do not provide a reliable opportunity for error patterns to surface in a child’s speech. Furthermore, a phoneme or consonant cluster only counted as an opportunity for an error pattern if it was in a stressed syllable. A number of studies have shown that consonants in stressed syllables are produced more accurately than consonants in unstressed syllables (Kirk & Demuth, 2006; Klein, 1981; Zamuner & Gerken, 1998). Note, however, that because the error pattern of weak syllable deletion necessarily involves unstressed syllables, opportunities for this error pattern included unstressed syllables. In addition, the error patterns of velar fronting and stopping of fricatives and affricates are more prevalent in word-initial position than in word-final position (Chiat, 1989; McAllister Byun, 2012; Morrisette, Dinnsen, & Gierut, 2003; Smit, 1993). Asymmetries in the production accuracy of word-initial and word-final consonant clusters have also been reported (Kirk & Demuth, 2005). Therefore, sufficient opportunities had to be provided in both word-initial and word-final positions for these asymmetrical error patterns. Word-medial consonants were not considered valid opportunities for any error pattern because it is not clear whether these consonants are syllable onsets, syllable codas, or ambisyllabic (Rvachew & Andrews, 2002; Stoel-Gammon, 2002). Furthermore, sufficient opportunities for the gliding of liquids had to be provided for both /l/ and /r/, because for most children, gliding of /l/ resolves earlier than gliding of /r/ (Smit, 1993).


Table 2. Criteria used to evaluate the psychometric properties of tests.

Sample size: There must be at least 100 individuals in each subgroup for which normative data is provided.

Demographics: Socioeconomic status, race/ethnicity, gender, and geographic region of the sample must be within 5% of the population for all normative subgroups. The test manual must provide a comparison between the sample demographics and the most recent U.S. Census Bureau data collected at the time of the test’s development.

Recency: Normative data must be no more than 15 years old.

Inclusion of children with SSD: Children with SSD must be represented within 2% of the population for all normative subgroups.

Internal consistency: Coefficient alpha must be at least .90 for all normative subgroups.

Test–retest reliability: Correlation coefficients must be at least .90 and point-to-point percentage of agreement for each item must be at least 90% for all normative subgroups. Participants must have been retested within 1–3 weeks.

Interscorer reliability: Correlation coefficients must be at least .90 and point-to-point percentage of agreement for each item must be at least 90% for all normative subgroups.

Content validity: Tests must provide at least four opportunities for each of 11 core phonological error patterns. The error patterns of velar fronting, stopping of fricatives and affricates, and cluster reduction must provide at least four opportunities in both word-initial and word-final positions. The error pattern of gliding of liquids must provide at least four opportunities for both /l/ and /r/.

Internal structure validity evidence: Opportunities for each error pattern must contribute equally to the total error score.

Evidence of developmental trends: Reported mean test scores for normative samples should improve as age increases.

Concurrent validity: Test must have been compared to another test of phonological error patterns that has been validated, and the correlation coefficient between test scores must be at least .90.

Diagnostic accuracy: Sensitivity and specificity must be at least .90 for one cut score. Positive predictive value and negative predictive value must be at least .90 for at least one cut score and one base rate.

Note. SSD = speech sound disorders.

There is currently no research indicating what constitutes a sufficient number of opportunities for determining that a child’s speech displays a particular error pattern. In the interest of balancing comprehensiveness with the time constraints of clinical practice, we follow McReynolds and Elbert’s (1981) recommendation that a test include at least four exemplars for a given error pattern to be considered active in a child’s phonological system.
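To make the counting rules above concrete, here is a sketch of how opportunities for one error pattern might be tallied over a word list. The word list, phoneme annotations, and function are hypothetical illustrations of the rules described in this section; they are not the items or procedure of any reviewed test.

```python
# Illustrative sketch (hypothetical word list and annotations): tallying opportunities
# for an error pattern under the counting rules above: singleton target consonants only,
# stressed syllables only, and word-initial versus word-final position tracked separately.
from collections import Counter

# Each item: word, target phoneme, position, in a cluster?, in a stressed syllable?
ITEMS = [
    ("cup",    "k", "initial", False, True),
    ("carrot", "k", "initial", False, True),
    ("key",    "k", "initial", False, True),
    ("cow",    "k", "initial", False, True),
    ("book",   "k", "final",   False, True),
    ("crab",   "k", "initial", True,  True),   # cluster: not a valid opportunity
    ("monkey", "k", "medial",  False, False),  # medial/unstressed: not a valid opportunity
]

def velar_fronting_opportunities(items):
    """Count valid opportunities for fronting of /k/, split by word position."""
    counts = Counter()
    for word, phoneme, position, in_cluster, stressed in items:
        if phoneme == "k" and not in_cluster and stressed and position in ("initial", "final"):
            counts[position] += 1
    return counts

counts = velar_fronting_opportunities(ITEMS)
print(counts)                                                      # Counter({'initial': 4, 'final': 1})
print(all(counts.get(p, 0) >= 4 for p in ("initial", "final")))    # False: too few word-final opportunities
```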

Results

A summary of our evaluation is provided in Table 3 with related information offered in the accompanying text.

The Normative Sample

Normative Sample Size

Of the tests examined by the current review, none sampled at least 100 children for each of the subgroups for which normative data were reported. Therefore, these norms are unlikely to be sufficiently large to be representative of the general population.

Demographics of the Normative Sample

Four tests reported the demographics of the sample for socioeconomic status, race/ethnicity, gender, and geographic location (Clinical Assessment of Articulation and Phonology—2nd Edition [CAAP–2; Secord & Donohue,

Table 3. Evaluation of tests based on specific psychometric criteria.

Criterion | BBTOP | CAAP–2 | DEAP | HAPP–3 | KLPA–2 | SPAT–D II
Sample size | ✖ | ✖ | ✖ | ✖ | ✖ | ✖
Demographics | ✖ | ✖ | ✖ | ✖ | ✖ | ✖
Recency | ✖ | ✔ | ✔ | ✔ | ✔ | ✔
Inclusion of children with SSD | ✖ | ✖ | ✖ | ✖ | ✖ | Ø
Internal consistency | ✖ | Ø | ✖ | ✖ | ✖ | ✖
Test–retest reliability | ✖ | ✖ | ✖ | ✖ | ✖ | ✖
Interscorer reliability | ✖ | ✖ | ✖ | ✖ | ✖ | ✖
Content validity | ✖ | ✖ | ✖ | ✖ | ✖ | ✖
Internal structure validity evidence | ✔ | ✖ | ✖ | ✖ | ✖ | ✖
Evidence of developmental trends | ✔ | ✔ | ✔ | ✔ | ✔ | ✔
Concurrent validity | Ø | ✖ | ✖ | Ø | ✖ | ✖
Diagnostic accuracy | Ø | ✖ | ✔ | Ø | Ø | Ø

Note. ✔ = criterion was completely met; ✖ = criterion was not met or was only partially met; Ø = evidence for criterion was not provided in test manual.



2014]; Diagnostic Evaluation of Articulation and Phonology [DEAP; Dodd, Hua, Crosbie, Holm, & Ozanne, 2006]; Hodson Assessment of Phonological Patterns—3rd Edition [HAPP–3; Hodson, 2004]; and Khan-Lewis Phonological Analysis—2nd Edition [KLPA–2; Khan & Lewis, 2002]). The Bankson-Bernthal Test of Phonology (BBTOP; Bankson & Bernthal, 1990) and the Structured Photographic Articulation Test II featuring Dudsberry (SPAT–D II; Dawson & Tattersall, 2001) reported sample demographics only for race/ethnicity, gender, and geographic location. Two tests (DEAP, KLPA–2) reported overall sample demographic characteristics that were within 5% of the general population for each of socioeconomic status, race/ethnicity, gender, and geographic location. Only the CAAP–2 reported sample demographics for each of the four demographic characteristics by normative subgroup, but U.S. Census data were reported only for overall demographics.

Recency of the Normative Data

All tests examined by the current review, except the BBTOP, provide normative data that were collected within the past 15 years. However, the normative data for two tests (KLPA–2, SPAT–D II) are over 10 years old, so these tests will need to be updated soon.

Inclusion of Individuals for Whom the Test Is Intended

No test reported the proportion of children with SSD in each normative subgroup. This is problematic given the age-related differences in the prevalence of SSD. Four tests (CAAP–2, DEAP, HAPP–3, KLPA–2) included children with some type of speech and/or language impairment in the overall normative sample. However, the proportion of children with speech and/or language impairment in the overall normative sample of these four tests ranged from less than 1% to 7%. Given that these four tests all provide normative data for children as young as 3 years, these proportions do not reflect the prevalence of SSD in the general population. The BBTOP reported that the sample consisted of “normally developing children,” and the SPAT–D II failed to report whether or not children with SSD were included.

Reliability

Internal Consistency

No test provided internal consistency data for all subgroups for which it provided normative data. The HAPP–3 reported a coefficient alpha of at least .90 for all normative groups for which it provided internal consistency data. The KLPA–2 reported a coefficient alpha of at least .90 for all groups for which it provided data except one, boys ages 13–14 years. This group had a coefficient alpha of .89, so it was very close to meeting the criterion. The DEAP did not obtain a coefficient alpha of at least .90 for any group for which it provided internal consistency data. The average coefficient alpha for the DEAP was .78, and the range was .55–.88. The BBTOP used split-half reliability coefficients to analyze item consistency and reported a coefficient of at least .90 for all normative groups for which it provided internal consistency


data. The SPAT–D II reported internal consistency coefficients that ranged from .77 to .95 but did not report whether split-half reliability coefficients or coefficient alpha was used to calculate internal consistency. The CAAP–2 did not provide any data on internal consistency.

Test–Retest Reliability

None of the tests provided both correlation coefficients and point-to-point percentage of agreement for test–retest reliability. In addition, no test reported test–retest data for each subgroup for which it provided normative data. Three tests (DEAP, HAPP–3, SPAT–D II) reported test–retest reliability coefficients of at least .90 averaging across all children who were retested. The BBTOP reported an overall reliability coefficient of .85, and the reliability coefficients for the CAAP–2 ranged between .86 and .92. The KLPA–2 assessed test–retest reliability using percentage of agreement on the individual phonological error patterns that occurred in each test item, and it also assessed total agreement on all the error patterns that occurred in each test item. While most of these items had over 90% agreement, 10 words had total agreement that was less than 90%, and three error patterns had less than 90% agreement: velar fronting in swimming (many adult speakers of English substitute the final velar nasal in swimming and other words ending in -ing with an alveolar nasal [Schneider, 2004], so this item does not provide a reliable opportunity for velar fronting), initial voicing in pajamas, and final devoicing in balloons.

Three tests (CAAP–2, DEAP, KLPA–2) reported the range of time between test administrations, and only the CAAP–2 met the requirement of 1–3 weeks between testings. The ranges between test administrations for the other two tests were as follows: DEAP, 10–29 days; and KLPA–2, 0–34 days. Two tests (HAPP–3, SPAT–D II) reported the time between administrations as an average of 2 weeks. However, because the range of times was not provided in these test manuals, it was not possible to determine if all children were retested between 1 and 3 weeks after initial testing. One test (BBTOP) reported an average time between administrations of 8 weeks, which is far outside the recommended time frame.

Interscorer Reliability

For most of the tests in the current review, interscorer reliability was calculated by comparing differences in scoring between two or more clinicians. However, for the DEAP, interscorer reliability coefficients were calculated by asking two clinicians to transcribe and score participants’ productions. Thus, the DEAP interscorer coefficients reflect differences in transcription ability in addition to differences in scoring. None of the tests provided both correlation coefficients and point-to-point percentage of agreement for interscorer reliability. In addition, no test reported interscorer reliability information for all the subgroups for which it provided normative data. Two tests (CAAP–2, HAPP–3) reported a single interscorer reliability correlation coefficient


and in both cases, this was at least .90. The BBTOP reported correlation coefficients for two age groups: 4-year-olds and 8-year-olds. The correlation coefficient for 4-year-olds met criterion, but the correlation coefficient for 8-year-olds was only .85. Instead of reporting an overall reliability coefficient, the DEAP reported interscorer reliability coefficients for each phonological error pattern. This provides important information for clinicians about the specific error patterns that were difficult to score. All but two error patterns (prevocalic voicing and postvocalic devoicing) had a coefficient of at least .90. The reliability coefficients for prevocalic voicing and postvocalic devoicing were both .85.

The KLPA–2 provided percentage of agreement data as a measure of interscorer reliability, but it did not report correlation coefficients. The KLPA–2 provided point-to-point percentage of agreement for two types of interscorer reliability. The first type of interscorer reliability compared scorers on the identification of each individual phonological error pattern that occurred in each test item. In addition, interscorer agreement was calculated for the total agreement on all of the error patterns that occurred in a given test item. Although most items had over 90% agreement for both types of agreement, eight words had an overall agreement of less than 90%, and three error patterns had agreement of less than 90%: final consonant deletion in ball, vocalization of liquids in ball, and cluster simplification in pencils.

The interscorer reliability data provided in the test manual of the SPAT–D II were obtained by asking two scorers to rate responses from two separate administrations of the test. Participants were given the test twice, sometimes up to 16 days after the first administration. This method of collecting reliability data introduces a number of variables that could affect children’s scores in addition to the scoring of the response forms. These variables include administration differences, differences in the external environment, and differences in the internal state of the participants. Because interscorer reliability should only examine differences in scoring, the interscorer reliability of the SPAT–D II was deemed inadequate.

Validity

Evidence Based on Test Content

All six tests provided at least four opportunities for the error patterns of final consonant deletion, initial cluster reduction, prevocalic voicing, postvocalic devoicing, palatal fronting, and stopping of fricatives and affricates. Four tests provided sufficient opportunities for assessing weak syllable deletion (BBTOP, CAAP–2, DEAP, KLPA–2). Only the HAPP–3 provided at least four opportunities for word-final cluster reduction. All tests, except the SPAT–D II, provided sufficient opportunities for assessing velar fronting in word-final position, but only three tests (BBTOP, KLPA–2, SPAT–D II) provided sufficient opportunities for word-initial velar fronting. The lack of opportunities for evaluating velar fronting in word-initial position is of concern because, as discussed above, many children front velars only in word-initial position.

Not a single test provided sufficient opportunities for assessing gliding of the lateral liquid /l/, and only one test (BBTOP) provided adequate opportunities for gliding of the rhotic liquid /r/. Three tests (BBTOP, HAPP–3, KLPA–2) provided sufficient opportunities for assessing derhoticization. Only the BBTOP provided sufficient opportunities for assessing deaffrication. None of the tests that we reviewed provided at least four opportunities for all of the phonological error patterns discussed above. The BBTOP, however, came close to having adequate content validity by meeting criterion for all error patterns except cluster reduction in word-final position and gliding of /l/.

Evidence Based on Internal Structure

For most of the tests (DEAP, HAPP–3, KLPA–2, SPAT–D II), each error pattern contributed a very different number of opportunities to the total error score. However, the BBTOP factored in the unequal number of opportunities for each error pattern by transforming the raw errors for each pattern into a “developmental scale score.” The developmental scale scores for the error patterns are then summed to calculate a “phonological processes composite.” The CAAP–2 went some way toward providing evidence of adequate internal structure by restricting the test items that were counted for each error pattern to between five and 10 items.

Evidence Based on Variables External to the Test

Developmental trends. All six tests demonstrated changes in raw score means that indicated improvements in the tested children’s phonological systems as age increased. Therefore, all tests of phonological error patterns examined by the current review provided adequate evidence of validity for this measure.

Concurrent validity. Only two tests (CAAP–2, DEAP) provided correlation coefficients with another test of phonological error patterns. However, neither test reported correlation coefficients of at least .90. Two tests (KLPA–2, SPAT–D II) reported correlation coefficients with a test of articulation rather than a test of phonological error patterns, so their concurrent validity was deemed inadequate for the purposes of the current review. The remaining two tests (BBTOP, HAPP–3) did not provide any empirical evidence of concurrent validity.

Evidence Based on Diagnostic Accuracy

Only two tests provided data that addressed the validity of using the test for making diagnostic decisions. The DEAP reported adequate measures of sensitivity, specificity, PPV, and NPV. At a cut score of −1 SD, sensitivity and specificity for the classification of a phonological disorder by the DEAP were at least .90. In addition, the PPV and NPV were at least .90 for a base rate of 50% and a cut score of −1 SD. The only other test that examined diagnostic accuracy was the CAAP–2. This test reported only on sensitivity and specificity, which for a cut score of −1 SD came very close to meeting the criterion of at least .90.



Discussion

The purpose of the current review was to provide information for clinicians regarding the psychometric adequacy of tests of phonological error patterns. We examined the psychometric properties of six tests of phonological error patterns. These tests were evaluated on the basis of their normative sample and the evidence supporting their reliability and validity. Because tests of phonological error patterns are often used in high-stakes educational decision making, we set high standards for determining whether or not a test met criteria. In the next section, we discuss the strengths and weaknesses of the tests included in our review. We then provide some suggestions for improving the psychometric adequacy of these tests.

The Normative Sample

Most of the reviewed tests have done a good job of ensuring that the normative data provided in their test manuals are regularly updated. However, these normative data tended to be based on sample sizes that are unacceptably small. It was common for a test manual to report that the normative sample included 100 children for each 1-year age interval but then go on to report the standard scores and percentile ranks for 6-month or even 3-month intervals. As a result, a given data point in the normative tables could have been based on as few as 25 children.

The lack of a representative sample was also apparent in the demographic characteristics of the normative data. Not all tests provided demographic information for each of socioeconomic status, race/ethnicity, gender, and geographic location. Even when sample means were reported for these characteristics, they were not always within 5% of the general population. Furthermore, only four of the six tests included children with some type of speech and/or language impairment in the sample used to create the normative data. Where proportions of children with impairments were reported, they did not reflect the prevalence of SSD in the general population. Greater time and financial resources will need to be expended to ensure that a larger and more representative sample provides the basis for the normative data reported in the test manuals.

Evidence of Reliability

For most of the tests included in our review, the overall reliability coefficients met our criterion. However, none of the reviewed tests reported reliability data for all subgroups for which normative data were provided. Given that ability is more variable in younger children than in older children, it is essential that tests report reliability information for all normative age groups. Furthermore, only one test met the criterion related to the time interval between test administrations in the test–retest studies, making it difficult to evaluate this type of reliability. A further concern is that none of the tests reported both correlation coefficients and point-to-point percentage of agreement as evidence of reliability. Both of these measures should be reported because high correlation coefficients for total scores can be obtained even when there is lack of agreement on individual items. Once again, more resources will need to be expended to ensure that a greater range of ages and measures are sampled when collecting the data used to support reliability.

Evidence of Validity

The established guidelines for determining the adequacy of test validity are not as well specified as for the normative sample and test reliability. This could explain why so few of the tests were able to meet our criteria for determining the adequacy of validity evidence. None of the reviewed tests provided sufficient opportunities for each of a core set of phonological error patterns. The lack of evidence to adequately support content validity is of concern because it suggests that the reviewed tests do not adequately measure the construct that they are designed to assess. One factor that may have contributed to the poor content validity across tests is that many of these tests were originally designed as tests of articulation. Then, at a later date, an error pattern analysis was applied to the original set of test items without any modification of these items. Given that tests of articulation and tests of phonological error patterns provide very different types of analysis of a child’s speech, it should not be assumed that the same list of words is appropriate for both.

An additional concern is that, with two exceptions, the reviewed tests did not provide any evidence that they were valid instruments for the purpose of making diagnostic decisions. This is disturbing because tests of phonological error patterns are often used to diagnose phonological disorders and determine eligibility for speech services. If a test does not report diagnostic accuracy, then clinicians cannot be sure that the test is an appropriate tool for educational decision making. In a review of preschool language tests, Plante and Vance (1994) argued for the importance of including this type of validity evidence in test manuals. However, in the 20 years that have elapsed since this recommendation was made, few test developers have taken it upon themselves to report on diagnostic accuracy in their test manuals.

Suggestions for Improving the Psychometric Adequacy of Tests

As discussed by McCauley and Swisher (1984), test developers should view the assessment of psychometric properties as an integral part of test development. It is the responsibility of test developers to stay current in assessment standards and to modify the procedures used in developing and revising tests according to changes in these standards. Ideally, future test developers will assess the psychometric properties of their tests in all of the areas discussed above and provide this information in the tests’ user manuals. In particular, authors of tests of phonological error patterns should pay special attention to the representativeness of the normative sample, the validity of the construct on which the test is based, and evidence for the



diagnostic accuracy of the test. These are key elements of any well-designed test, and yet most tests reviewed here did not meet criteria in these areas. Following up on our recommendations will require test developers to include a larger and more diverse subset of children in the normative sample. They will also need to provide theoretical and empirical support for the selection and definition of the error patterns used to evaluate content validity. In addition, test developers will need to conduct studies to demonstrate that their test has good diagnostic accuracy. Taking these steps will be time consuming and expensive and will require a high level of psychometric expertise. However, the end product will provide clinicians with more reliable and valid information from which to determine the psychometric adequacy of tests of phonological error patterns. Almost all tests in the current study will need to update their normative data soon, and this renorming process will provide an opportunity to address the deficits outlined above. Addressing these concerns presents an opportunity for collaboration among test developers, university researchers, and practicing SLPs. The test developers could provide logistical and financial resources. University researchers could organize projects to examine test validity, test reliability, and the representativeness of the normative sample. SLPs could help recruit children to participate in these studies (including children with SSD) and also serve as data collectors. Given that so many of the reviewed tests were unable to meet the criteria specified in our review, it would be helpful to have evidence-based practice guidelines that provide recommendations for determining the psychometric adequacy of norm-referenced tests. These recommendations could build on already existing standards, such as those published by AERA et al. (1999). However, because adequacy criteria vary depending on the intended purpose of a test (e.g., diagnostic use, screening), these recommendations will need to be tailored to specific disorders (e.g., SSD), specific populations (e.g., pediatric populations), and the construct being evaluated (e.g., phonological error patterns). We hope that this review will help clinicians to better interpret the psychometric properties of the available singleword tests of phonological error patterns. By becoming more aware of the strengths and limitations of these tests, clinicians will be able to select the instrument that is most appropriate for their specific needs. However, we wish to remind clinicians and other decision makers that although singleword tests can be a valuable part of the assessment battery used to evaluate phonological development, these tests should not be the sole means for determining a child’s eligibility for services. Instead, this decision should depend on a variety of sampling procedures so that any decisions about the need for therapy are based on a representative sample of that individual’s speech.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: Author.
American Speech-Language-Hearing Association. (n.d.). Directory of speech-language pathology assessment instruments: Articulation/phonology test: Children [Online directory of tests]. Retrieved from www.asha.org/assessments.aspx
Anastasi, A., & Urbina, S. (1997). Psychological testing. Upper Saddle River, NJ: Prentice Hall.
Andersson, L. (2005). Determining the adequacy of tests of children’s language. Communication Disorders Quarterly, 26, 207–225.
Bankson, N. W., & Bernthal, J. E. (1990). Bankson-Bernthal Test of Phonology. Austin, TX: Pro-Ed.
Bernthal, J. E., Bankson, N. W., & Flipsen, P. (2013). Articulation and phonological disorders: Speech sound disorders in children (7th ed.). Upper Saddle River, NJ: Pearson.
Bogue, E. L., DeThorne, L. S., & Schaefer, B. A. (2014). A psychometric analysis of childhood vocabulary tests. Contemporary Issues in Communication Science and Disorders, 41, 55–69.
Bracken, B. A. (1987). Limitations of preschool instruments and standards for minimal levels of technical adequacy. Journal of School Psychology, 26, 155–166.
Campbell, T. F., Dollaghan, C. A., Rockette, H. E., Paradise, J. L., Feldman, H. M., Shriberg, L. D., . . . Kurs-Lasky, M. (2003). Risk factors for speech delay of unknown origin in 3-year-old children. Child Development, 74, 346–357.
Chiat, S. (1989). The relation between prosodic structure, syllabification, and segmental realization: Evidence from a child with fricative stopping. Clinical Linguistics & Phonetics, 3, 223–242.
Cronbach, L. J., Nageswari, R., & Gleser, G. C. (1963). Theory of generalizability: A liberation of reliability theory. The British Journal of Statistical Psychology, 16, 137–163.
Dawson, J., & Tattersall, P. (2001). Structured Photographic Articulation Test II featuring Dudsberry. DeKalb, IL: Janelle Publishing.
DeThorne, L. S., & Schaefer, B. A. (2004). A guide to child nonverbal IQ measures. American Journal of Speech-Language Pathology, 13, 275–290.
Dodd, B., Hua, Z., Crosbie, S., Holm, A., & Ozanne, A. (2006). Diagnostic Evaluation of Articulation and Phonology. San Antonio, TX: Pearson.
Dollaghan, C. A. (2007). The handbook for evidence-based practice in communication disorders. Baltimore, MD: Pearson.
Dollaghan, C. A., Campbell, T. F., Paradise, J. L., Feldman, H. M., Janosky, J. E., Pitcairn, D. N., & Kurs-Lasky, M. (1999). Maternal education and measures of early speech and language. Journal of Speech, Language, and Hearing Research, 42, 1432–1443.
Goldstein, B. A. (2001). Assessing phonological skills in Hispanic/Latino children. Seminars in Speech and Language, 22, 39–49.
Gordon, M. J. (2004a). New York, Philadelphia, and other northern cities: Phonology. In E. W. Schneider, K. Burridge, B. Kortman, R. Mesthrie, & C. Upton (Eds.), A handbook of varieties of English: Vol. 1. Phonology (pp. 282–299). Berlin, Germany: Mouton de Gruyter.
Gordon, M. J. (2004b). The West and Midwest: Phonology. In E. W. Schneider, K. Burridge, B. Kortman, R. Mesthrie, & C. Upton (Eds.), A handbook of varieties of English: Vol. 1. Phonology (pp. 338–350). Berlin, Germany: Mouton de Gruyter.
Greenslade, K. J., Plante, E., & Vance, R. (2009). The diagnostic accuracy and construct validity of the Structured Photographic Expressive Language Test—Preschool: Second Edition. Language, Speech, and Hearing Services in Schools, 40, 150–160.


Hodson, B. W. (2004). Hodson Assessment of Phonological Patterns (3rd ed.). Austin, TX: Pro-Ed.
Hodson, B. W., Scherz, J. A., & Strattman, K. H. (2002). Evaluating communicative abilities of a highly unintelligible preschooler. American Journal of Speech-Language Pathology, 11, 236–242.
Hoff, E. (2003). The specificity of environmental influence: Socioeconomic status affects early vocabulary development via maternal speech. Child Development, 74, 1368–1378.
Hutchinson, T. A. (1996). What to look for in the technical manual: Twenty questions for users. Language, Speech, and Hearing Services in Schools, 27, 109–121.
Khan, L., & Lewis, N. (2002). Khan-Lewis Phonological Analysis (2nd ed.). San Antonio, TX: Pearson.
Kirk, C., & Demuth, K. (2005). Asymmetries in the acquisition of word-initial and word-final consonant clusters. Journal of Child Language, 32, 709–734.
Kirk, C., & Demuth, K. (2006). Prosodic context and frequency effects on the variable acquisition of coda consonants. Language Learning and Development, 2, 97–118.
Kirk, C., & Vigeland, L. (in press). Content coverage of single-word tests used to assess common phonological error patterns. Language, Speech, and Hearing Services in Schools.
Klein, H. (1981). Early perceptual strategies for the replication of consonants from polysyllabic lexical models. Journal of Speech and Hearing Research, 24, 535–551.
Kretzschmar, W. A. (2004). Standard American English pronunciation. In E. W. Schneider, K. Burridge, B. Kortman, R. Mesthrie, & C. Upton (Eds.), A handbook of varieties of English: Vol. 1. Phonology (pp. 257–269). Berlin, Germany: Mouton de Gruyter.
Lalkhen, A. G., & McCluskey, A. (2008). Clinical tests: Sensitivity and specificity. Continuing Education in Anaesthesia, Critical Care & Pain, 8(6), 221–223.
Linn, R. L., & Gronlund, N. E. (2000). Measurement and assessment in teaching. Upper Saddle River, NJ: Merrill/Prentice Hall.
McAllister Byun, T. (2012). Positional velar fronting: An updated articulatory account. Journal of Child Language, 39, 1043–1076.
McCauley, R. J. (2001). Assessment of language disorders in children. Mahwah, NJ: Erlbaum.
McCauley, R. J., & Swisher, L. (1984). Psychometric review of language and articulation tests for preschool children. Journal of Speech and Hearing Disorders, 49, 34–42.
McReynolds, L. V., & Elbert, M. (1981). Criteria for phonological process analysis. Journal of Speech and Hearing Disorders, 46, 197–204.
Morrisette, M. L., Dinnsen, D. A., & Gierut, J. A. (2003). Markedness and context effects in the acquisition of place features. The Canadian Journal of Linguistics, 48, 329–355.


Nunnally, J. C. (1978). Psychometric theory. New York, NY: McGraw-Hill.
Plante, E., & Vance, R. (1994). Selection of preschool language tests: A data-based approach. Language, Speech, and Hearing Services in Schools, 25, 15–24.
Plante, E., & Vance, R. (1995). Diagnostic accuracy of two tests of preschool language. American Journal of Speech-Language Pathology, 4, 70–76.
Rvachew, S., & Andrews, E. (2002). The influence of syllable position on children’s production of consonants. Clinical Linguistics & Phonetics, 16, 183–198.
Salvia, J., Ysseldyke, J. E., & Bolt, S. (2013). Assessment in special and inclusive education (12th ed.). Belmont, CA: Wadsworth/Cengage Learning.
Sattler, J. M. (2008). Assessment of children: Cognitive foundations (5th ed.). San Diego, CA: Author.
Schneider, E. W. (2004). Global synopsis: Phonetic and phonological variation in English world-wide. In E. W. Schneider, K. Burridge, B. Kortman, R. Mesthrie, & C. Upton (Eds.), A handbook of varieties of English: Vol. 1. Phonology (pp. 1111–1138). Berlin, Germany: Mouton de Gruyter.
Secord, W., & Donohue, J. (2014). Clinical Assessment of Articulation and Phonology (2nd ed.). Greenville, SC: Super Duper Publications.
Shriberg, L. D., Tomblin, J. B., & McSweeny, J. L. (1999). Prevalence of speech delay in 6-year-old children and comorbidity with language impairment. Journal of Speech, Language, and Hearing Research, 42, 1461–1481.
Skahan, S. M., Watson, M., & Lof, G. L. (2007). Speech-language pathologists’ assessment practices for children with suspected speech sound disorders: Results of a national survey. American Journal of Speech-Language Pathology, 16, 246–259.
Smit, A. B. (1993). Phonologic error distributions in Iowa-Nebraska Articulation Norms Project: Consonant singletons. Journal of Speech and Hearing Research, 36, 533–547.
Stoel-Gammon, C. (2002). Intervocalic consonants in the speech of typically developing children: Emergence and early use. Clinical Linguistics & Phonetics, 16, 155–168.
Stoel-Gammon, C., & Dunn, C. (1985). Normal and disordered phonology in children. Austin, TX: Pro-Ed.
Wells, J. C. (1992). Accents of English. Cambridge, England: Cambridge University Press.
Zamuner, T., & Gerken, A. (1998). Young children’s production of coda consonants in different prosodic environments. In E. Clark (Ed.), Proceedings of the Annual Child Language Research Forum, 29 (pp. 121–128). Stanford, CA: Center for the Study of Language & Information.


Appendix
Summary of Definitions Used to Evaluate Opportunities for Error Patterns

Final consonant deletion (FCD)
Definition: The final consonant or the entire final consonant cluster of a word is deleted. Deletion of word-final /l/ and word-final /r/ are not considered valid opportunities for FCD. Note that the deletion of a single consonant from a cluster of two or three consonants is called cluster reduction, even if the deleted consonant is word-final.
Examples: food → [fu]; beans → [bi]

Weak syllable deletion (WSD)
Definition: A nonfinal, unstressed syllable in a word is deleted.
Examples: tomato → [ˈmeɪɾoʊ]; elephant → [ˈɛfǝnt]

Cluster reduction
Definition: A cluster of two consonants is reduced to one consonant, and a cluster of three consonants is reduced to either one or two consonants.
Examples: bread → [bɛd]; splash → [blæʃ]; wasp → [wɑs]

Prevocalic voicing
Definition: A word-initial voiceless consonant becomes voiced.
Examples: pig → [bɪg]; sock → [zɑk]

Postvocalic devoicing
Definition: A word-final voiced consonant becomes voiceless.
Examples: dog → [dɑk]; nose → [noʊs]

Velar fronting
Definition: A consonant made with a constriction at the velum is replaced by a consonant made at the alveolar ridge. Words containing the present progressive morpheme -ing are not included as opportunities for velar fronting.
Examples: cow → [taʊ]; sing → [sɪn]

Palatal fronting
Definition: A consonant made with a constriction at the hard palate is replaced by a consonant made at the alveolar ridge.
Examples: sheep → [sip]; bridge → [brɪdz]

Stopping of fricatives and affricates
Definition: A fricative or affricate is replaced with a stop consonant that shares the same or similar place of articulation (/f/ and /v/ are replaced with labial stops; /s/, /z/, /ʃ/, /ʒ/, /ʧ/, and /ʤ/ are replaced with alveolar stops). The interdental fricatives, /θ/ and /ð/, and the phoneme /h/ are not included as opportunities for this error pattern.
Examples: four → [pɔɚ]; sun → [tʌn]; dish → [dɪt]

Gliding of liquids
Definition: The liquids /r/ and /l/ are replaced by the glides [w] or [j].
Examples: ring → [wɪŋ]; leaf → [jif]

Derhoticization
Definition: A rhoticized vowel loses its /r/-coloring.
Examples: girl → [gɜl]; star → [stɑʊ]

Deaffrication
Definition: The voiceless affricate /tʃ/ is replaced by the voiceless fricative [ʃ].
Examples: chip → [ʃɪp]; cheese → [ʃiz]

Note. Only singleton consonants (except for final consonant deletion and cluster reduction) in stressed syllables (except for weak syllable deletion) are considered valid opportunities for a given error pattern. Word-medial consonants are not considered valid opportunities for any error pattern.
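For readers who score transcriptions programmatically, the definitions above can also be expressed in code. The Python sketch below operationalizes only the final consonant deletion rule, under simplifying assumptions: the vowel inventory, the transcription format (lists of phone strings), and the function names are ours for illustration and are not taken from any of the reviewed tests or from the authors’ scoring procedures.

```python
# Illustrative sketch only: a simplified check for final consonant deletion (FCD),
# following the appendix definition. Targets and productions are lists of phone
# strings. Word-final /l/ and /r/ are excluded as opportunities, and deleting only
# part of a cluster counts as cluster reduction rather than FCD.

VOWELS = {"i", "ɪ", "e", "ɛ", "æ", "ʌ", "ə", "u", "ʊ", "o", "ɔ", "ɑ",
          "aɪ", "aʊ", "ɔɪ", "ɝ", "ɚ"}  # simplified inventory for this sketch

def final_consonants(word):
    """Return the word-final consonant or consonant cluster."""
    coda = []
    for phone in reversed(word):
        if phone in VOWELS:
            break
        coda.insert(0, phone)
    return coda

def is_fcd_opportunity(target):
    """A valid FCD opportunity: the target ends in a consonant other than /l/ or /r/ alone."""
    coda = final_consonants(target)
    return bool(coda) and coda not in (["l"], ["r"])

def shows_fcd(target, production):
    """True if the entire final consonant (or cluster) of the target is deleted."""
    if not is_fcd_opportunity(target):
        return False
    return len(final_consonants(production)) == 0

# Appendix example: "food" -> [fu] counts as FCD.
print(shows_fcd(["f", "u", "d"], ["f", "u"]))            # True
# "wasp" -> [wɑs] deletes only one member of the cluster,
# so it is cluster reduction rather than FCD under this definition.
print(shows_fcd(["w", "ɑ", "s", "p"], ["w", "ɑ", "s"]))  # False
```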


