Trends in Cognitive Sciences August 2014, Vol. 18, No. 8

Object-based attention is a flexible mechanism that adjusts to top-down demands: it amplifies task-relevant features before task-irrelevant ones, facilitating the selection of entire objects in accord with task goals. The selection of surfaces, such as those used by Schoenfeld et al., is one of many manifestations of object-based attention. Future studies need to evaluate the degree to which the time course of sequential selection depends on task demands and on discriminability in the task-relevant dimension. Furthermore, it would be interesting to see whether the sequential selection mechanism is universal and would generalize to other instances, such as when all of the features of the object are task relevant or when objects are presented for very brief periods of time [2,6]. Importantly, the sequential activation mechanism would need to be tested for instances in which object-based attention selects from location-based representations [8], which, as Schoenfeld et al. note, was not the case in their study. Schoenfeld et al. propose that sequential feature amplification underlies object-based attention and feature binding. Their findings are consistent with the integrated competition [9] and incremental grouping [10] models, which suggest that features are bound partly through interareal connections between modules. The study by Schoenfeld et al. makes an important step in suggesting

that the code that links features between modules would have to accommodate a time separation between sequentially activated features. The challenge for future research remains to reveal additional areas that integrate signals from feature-responsive areas, even when they are separated both spatially and temporally.

References
1 Posner, M.I. et al. (1980) Attention and the detection of signals. J. Exp. Psychol. Gen. 109, 160–174
2 Duncan, J. (1984) Selective attention and the organization of visual information. J. Exp. Psychol. Gen. 113, 501–517
3 O’Craven, K.M. et al. (1999) fMRI evidence for objects as the units of attentional selection. Nature 401, 584–587
4 Martínez, A. et al. (2006) Objects are highlighted by spatial attention. J. Cogn. Neurosci. 18, 298–310
5 Schoenfeld, M.A. et al. (2014) Object-based attention involves the sequential activation of feature-specific cortical modules. Nat. Neurosci. 17, 619–624
6 Valdes-Sosa, M. et al. (1998) Transparent motion and object-based attention. Cognition 66, B13–B23
7 Avrahami, J. (1999) Objects of attention, objects of perception. Percept. Psychophys. 61, 1604–1612
8 Vecera, S.P. and Farah, M.J. (1994) Does visual attention select objects or locations? J. Exp. Psychol. Gen. 123, 146–160
9 Duncan, J. et al. (1997) Competitive brain activity in visual attention. Curr. Opin. Neurobiol. 7, 255–261
10 Roelfsema, P.R. (2006) Cortical algorithms for perceptual grouping. Annu. Rev. Neurosci. 29, 203–227

How phonetically selective is the human auditory cortex?

Shihab Shamma
Institute for Systems Research, Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA
Département d’Études Cognitives, École Normale Supérieure, Paris 75005, France

Responses in the human auditory cortex to natural speech reveal a dual character. Often they are categorically selective to phonetic elements, serving as a gateway to abstract linguistic representations. But at other times they reflect a distributed, generalized spectrotemporal analysis of the acoustic features, as seen in early mammalian auditory cortices.

Recent experiments by Mesgarani et al. reported in Science [1] shed light on the long- and hotly debated question of how speech is encoded in the brain. Speech is uniquely human, and hence studying it in animals has always come with the caveat that findings may not apply. Consequently, it is exciting and satisfying to gain a detailed view of the responsiveness of the human auditory cortex to phonemes in natural speech. The recordings and resulting insights are nothing short of extraordinary in their sweep and clarity, at both the neural and perceptual levels.

Corresponding author: Shamma, S. ([email protected]).
1364-6613/© 2014 Published by Elsevier Ltd. http://dx.doi.org/10.1016/j.tics.2014.04.001

Decades of psychoacoustic research have led to two broadly contrasting hypotheses of the representation of the basic units of speech – phonemes. One view emphasized perception of the spectrotemporal patterns (acoustic features) of the speech [2]. The other favored a motor description of phonemes in terms of the articulatory gestures of the human vocal tract that generate them [3]. Furthermore, in either case, the observed variability of the speech signal contrasted strongly with the stability of its percepts, giving rise to the notion of categorical perception of phonemes as conducted by specialized phoneme detectors in the auditory cortex [4–6].

The cortical area that supplied most of the recordings discussed in this paper is the superior temporal gyrus (STG), better known as Brodmann area 22, a high-level processing region that lies outside the primary auditory cortex. The remarkable aspect of its responses is their agility and precise coincidence with the underlying features of the speech signal. By employing a database of phonetically labeled stimuli, the authors were able to carefully describe responses to thousands of phoneme samples from many speakers and thus garner detailed insights into the selectivity of each electrode’s recordings to one or a few phonemes.

Earlier studies had hinted that STG responses seemed rather selective to speech and perhaps even to specific phonemes or phoneme groups. Thus, with a widely distributed array of 256 fairly localized electrodes, responses from many recording sites to continuous speech appeared sparse and temporally associated with specific phoneme groups. Some electrodes responded only to fricatives, others only to back vowels or to nasals. Closer inspection and reorganization of the responses based on phonetic features derived from manner (degree of constriction in the vocal tract; e.g., obstruent versus sonorant) and place (location of constriction along the vocal tract; e.g., labial versus alveolar) of articulation yielded clear and spatially clustered patterns of selectivity and, more importantly, similarity matrices that were sensible and consistent with results known from a long history of perceptual studies.

The manner and place of a phoneme’s articulation also endow it with characteristic spectral and temporal features that are highly correlated with its articulatory features. Consequently, the observed organizational structure of the responses may have been attributable to a general pattern of spectrotemporal selectivity across the recording sites. A detailed analysis of the electrodes’ spectrotemporal tuning using reverse correlation methods revealed that many of the perceptual attributes of the speech, such as its pitch, could be accurately inferred from the distributed responses. Furthermore, although distinct phoneme groups often activated a broad pattern of electrodes, many electrodes were selective to specific phonemes and had distinctive spectrotemporal receptive fields that predicted well the average spectrotemporal structure of the target phonemes. These findings therefore seemed to support the authors’ conclusion that the ‘. . .
systematic organization of speech sounds is consistent with auditory perceptual models positing that distinctions are most affected by phonetic feature contrasts compared with other feature hierarchies (articulatory or gestural theories)’.

Following this fascinating and sweeping view of cortical responses to all phonemes, the study proceeded to tackle the critical unresolved question: given the distributed, yet still clustered, response patterns to different phonemes, are there unique adaptations in the STG for the representation of these phonemes? Two examples of such adaptations were found, in which the STG responses encoded complex acoustic features that would have been more difficult to capture with generic, simple spectrotemporal filtering.

The first feature was spectral; it concerned the well-known triangular distribution of the vowels’ first and second formants (F1 and F2) [7]. Specifically, the principal axis of variability in this distribution is the so-called F2–F1 axis, which traces the progression of vowels produced by front-to-back constrictions of the vocal tract. This acoustic feature has a complex spectrotemporal profile of doubly tuned regions that must progressively drift apart and shift their frequency patterns. Yet this feature is readily and explicitly mapped by the selectivity of the electrode responses.

The second acoustic feature is temporal: the voice-onset time (VOT), the length of time between the release of a burst for a plosive (/p, t, k, b, d, g/) and the onset of voicing for the following sonorant. The duration of this VOT has been shown to be a highly effective cue to the
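The reverse-correlation analysis of spectrotemporal tuning mentioned above can be illustrated with a minimal numerical sketch. Everything below is hypothetical and simplified – a simulated linear 'electrode' with a known receptive field driven by a white-noise spectrogram, not the authors' actual pipeline: for unit-variance white noise, correlating the response with the stimulus at each time lag recovers the spectrotemporal receptive field (STRF).

```python
import numpy as np

rng = np.random.default_rng(0)
n_freq, n_lag, n_t = 16, 10, 20000

# White-noise "spectrogram": frequency channels x time bins.
stim = rng.standard_normal((n_freq, n_t))

# Ground-truth STRF of the simulated site: a single excitatory patch.
true_strf = np.zeros((n_freq, n_lag))
true_strf[6:10, 2:5] = 1.0

# Linear response: each time bin sums the recent stimulus weighted by the STRF.
resp = np.zeros(n_t)
for lag in range(n_lag):
    resp[lag:] += true_strf[:, lag] @ stim[:, :n_t - lag]

# Reverse correlation: average the stimulus that preceded the response at each
# lag. For unit-variance white noise this estimates the STRF directly.
est = np.empty_like(true_strf)
for lag in range(n_lag):
    est[:, lag] = stim[:, :n_t - lag] @ resp[lag:] / n_t
```

With enough data the estimate converges on the true receptive field; for natural stimuli such as speech, which are strongly correlated in time and frequency, the raw cross-correlation must additionally be normalized by the stimulus autocorrelation.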


distinction between the unvoiced (long VOT) /p, t, k/ and the voiced (short VOT) /b, d, g/. This temporal cue is known to be perceived categorically, which suggested that it may be nonlinearly encoded in the STG responses [8]. Indeed, when examined across different electrodes, it was evident that there was a strong nonlinear bias for short VOTs to be represented in the voiced-plosive-selective electrodes and vice versa for the unvoiced electrodes. Hence, these two examples reinforce the view that phonemes are encoded as bundles of spectrotemporal features.

Aside from these few specific instances, however, most of the data and their elegant analyses are consistent with previous single-unit recordings in the human STG, which have not demonstrated strong invariant and local selectivity to single phonemes but rather a distributed response representing a multidimensional feature space for encoding the acoustic parameters of speech. Interestingly, recordings in the early auditory cortex of several mammalian species [9,10] have also indicated a more generalized description of the acoustic features of speech, in which the response organization and topography reflected the structure of the speech signal rather than that of a new cognitive framework imposed by the brain (e.g., at a higher stage where semantics are integrated with acoustics).

The human STG may be a transitional stage: early enough to still encode the acoustic features of speech, but high enough to exhibit response selectivity to complex spectrotemporal patterns, feature combinations, and nonlinear encoding of categories. If this is indeed the case, the STG should exhibit evidence of invariant responses to distorted and noisy speech that is otherwise intelligible, and of learned selectivity that encodes the native phonemes of the subjects better than non-native speech and non-speech sounds of similar spectrotemporal content.
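The nonlinear, near-categorical encoding of VOT described above can be caricatured in a few lines. The sigmoid and its parameters (a category boundary near 25 ms and a 0.5/ms slope) are illustrative assumptions, not values from the study; the point is only that a saturating nonlinearity maps a continuous VOT axis onto an almost binary voiced/unvoiced response.

```python
import numpy as np

def unvoiced_site_response(vot_ms, boundary_ms=25.0, slope=0.5):
    """Hypothetical unvoiced-plosive-selective site: a sigmoid of VOT that is
    near zero for short (voiced) VOTs and saturates for long (unvoiced) ones."""
    vot_ms = np.asarray(vot_ms, dtype=float)
    return 1.0 / (1.0 + np.exp(-slope * (vot_ms - boundary_ms)))

vots = np.array([0.0, 10.0, 20.0, 30.0, 40.0, 60.0])  # VOT in ms
unvoiced = unvoiced_site_response(vots)   # high for /p, t, k/-like tokens
voiced = 1.0 - unvoiced                   # complementary voiced-selective site
```

A linear encoder would instead track VOT proportionally across the whole range; the steep sigmoid is what produces the observed bias of short-VOT tokens toward voiced-selective sites and long-VOT tokens toward unvoiced-selective ones.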
These as yet unmet conceptual challenges are nonetheless now far more amenable to resolution with the sophisticated suite of analysis tools and approaches demonstrated in this paper.

Acknowledgments
This work was funded by National Institutes of Health (NIH) grant R01 DC007657 and an Advanced European Research Council (ERC) grant from the EU (295603).

References
1 Mesgarani, N. et al. (2014) Phonetic feature encoding in human superior temporal gyrus. Science 343, 1006–1010
2 Stevens, K.N. (2002) Toward a model for lexical access based on acoustic landmarks and distinctive features. J. Acoust. Soc. Am. 111, 1872–1891
3 Liberman, A.M. and Mattingly, I.G. (1985) The motor theory of speech perception revised. Cognition 21, 1–36
4 Cutting, J.E. and Rosner, B.S. (1974) Categories and boundaries in speech and music. Percept. Psychophys. 16, 564–570
5 Steinschneider, M. et al. (2003) Representation of the voice onset time (VOT) speech parameter in population responses within primary auditory cortex of the awake monkey. J. Acoust. Soc. Am. 114, 307–321
6 Phillips, C. et al. (2000) Auditory cortex accesses phonological categories: an MEG mismatch study. J. Cogn. Neurosci. 12, 1038–1055
7 Ladefoged, P. et al. (2010) A Course in Phonetics, Cengage Learning
8 Wood, C.C. (1976) Discriminability, response bias, and phoneme categories in discrimination of voice onset time. J. Acoust. Soc. Am. 60, 1381–1389
9 Mesgarani, N. et al. (2008) Phoneme representation and classification in primary auditory cortex. J. Acoust. Soc. Am. 123, 899–909
10 Engineer, C.T. et al. (2008) Cortical activity patterns predict speech discrimination ability. Nat. Neurosci. 11, 603–608
