Perception, 2014, volume 43, pages 509 – 526

doi:10.1068/p7531

Configural and featural discriminations use the same spatial frequencies: A model observer versus human observer analysis

Charles A Collin¹, Stéphane Rainville², Nicholas Watier¹, Isabelle Boutet¹

¹ School of Psychology, University of Ottawa, Ottawa, ON K1N 6N5, Canada; ² VizirLabs Consulting, Chelsea, QC J9B 1L8, Canada; e‑mail: [email protected]

Received 29 May 2013, in revised form 7 April 2014; published online 13 June 2014

Abstract. Previous work has shown mixed results regarding the role of different spatial frequency (SF) ranges in featural and configural processing of faces. Some studies suggest no special role of any given band for either type of processing, while others suggest that low SFs principally support configural analysis. Here we attempt to put this issue on a more rigorous footing by comparing human performance when making featural and configural discriminations with that of a model observer algorithm carrying out the same task. The model uses a simple algorithm that calculates the dot product of a stimulus image with each available potential match image to find the maximally likely match. It thus provides a principled way of analyzing available image information. We find human accuracy peaks at around 10 cycles per face (cpf) regardless of whether featural or configural manipulations are being detected. We also find accuracy peaks in the same part of the spectrum regardless of which feature is manipulated (ie eyes, nose, or mouth). Conversely, model observer performance, measured in terms of white noise tolerance, peaks at approximately 5 cpf, and this value again remains roughly constant regardless of the type of manipulation and feature manipulated. The ratio of the model’s noise tolerance to a derived equivalent noise tolerance value for humans peaks at around 10 cpf, similar to the accuracy data. These results provide evidence that the human performance maxima at 10 cpf are not due simply to the physical characteristics of face stimuli, but rather arise due to an interaction between the available information in face images and human perceptual processing.

Keywords: face perception, model observer, featural, configural, spatial frequency

1 Introduction
The mammalian visual system is known to decompose the complex patterns of light falling on the retina into a set of semi-independent spatial frequency (SF) channels (Campbell & Robson, 1968; De Valois & De Valois, 1990). That is, there are neural mechanisms in the early visual system that process only the fine details of a scene, others that process its overall large-scale structure, and others that deal with various intermediate detail levels. A possible advantage of this system is that it implements a ‘divide and conquer’ strategy whereby visual information is broken up into ranges of spatial detail that are most useful for various higher-order perceptual tasks (Ginsburg, 1978; Schyns, 1998; Schyns & Oliva, 1997).

One such task that has received a great deal of attention is face recognition. Many studies have shown that a particular range of middle relative SFs, between about 5 and 20 cycles per face (cpf) width, is most useful for face identification (Bachmann, 1991; Boutet, Collin, & Faubert, 2003; Collin, Therrien, Martin, & Rainville, 2006; Costen, Parker, & Craw, 1994, 1996; Fiorentini, Maffei, & Sandini, 1983; Näsänen, 1999; Parker & Costen, 2001; Tanskanen, Näsänen, Montez, Päällysaho, & Hari, 2005; Watier, Collin, & Boutet, 2010). This has been shown via behavioral studies (Bachmann, 1991; Boutet et al., 2003; Collin et al., 2006; Costen et al., 1994, 1996; Fiorentini et al., 1983; Parker & Costen, 2001; Watier et al., 2010), electrophysiological measurements (Collin, Therrien, Campbell, & Hamm, 2012; Hsiao, Hsieh, Lin, & Chang, 2005; Tanskanen et al., 2005), and ideal observer analyses (Gold, Bennett, & Sekuler, 1999; Keil, 2009; Keil, Lapedriza, Masip, & Vitria, 2008; Näsänen, 1999).

More recently, researchers have examined the possibility that different SF ranges might subserve specific aspects of face identification. In particular, the contribution of different parts of the SF spectrum to featural and configural analysis of faces has been explored in several studies (Boutet et al., 2003; Collin, Liu, Troje, McMullen, & Chaudhuri, 2004; Collin et al., 2012; Gaspar, Sekuler, & Bennett, 2008; Watier et al., 2010; Willenbockel et al., 2010). Featural analysis is an aspect of face processing that deals with the appearance of individual facial features (ie eyes, nose, and mouth), while configural analysis deals with the arrangement of those features. Configural analysis can be further subdivided into first-order and second-order types. The former involves processing the qualitative relations between facial features (ie eyes above nose, nose above mouth), while the latter involves a quantitative assessment of the distances between features and the ratios of those distances. Second-order configural processing is thought to be particularly important in face individuation, because first-order relations are the same for all human faces (Maurer, Le Grand, & Mondloch, 2002). The present study concerns itself mainly with second-order configural processing and featural processing.

The potential role of different SF bands in featural and configural processing has primarily been explored via two phenomena: the face inversion effect (FIE) (Valentine, 1988; Yin, 1969) and the configural effect (Freire, Lee, & Symons, 2000; Leder & Bruce, 2000). The FIE refers to the fact that rotating stimuli 180° in the picture plane reduces recognition performance with faces to a greater degree than with other objects. This is thought to occur because inversion alters the first-order configural properties of a face, which in some way makes it difficult or impossible for otherwise automatic face processing mechanisms to operate (Maurer et al., 2002). The configural effect refers to the fact that inversion causes greater impairment in detecting displacements of facial features within a face (ie configural modifications) than it does in detecting changes in the appearance of facial features (ie featural modifications). That is, inversion has a greater effect on subjects’ ability to notice (for example) that the eyes in a face have been moved up slightly than it does on their ability to notice that the eyes have been changed to the eyes from another face. This is thought to occur because second-order configural processing is accomplished by the same automatic face processing mechanisms that are impaired by inversion (Freire et al., 2000; Leder & Bruce, 2000).

Boutet et al. (2003) examined the effect of SF filtering on both the FIE and the configural effect. They did this to test the hypothesis that the advantage of middle SFs for face processing arises because they carry the information most relevant to configural processing. While they found an overall advantage of the middle band, their results showed the same degree of inversion effect whether faces were unfiltered or were filtered to low (1.25–5.00 cpf), middle (5.00–20.00 cpf), or high (20.00–80.00 cpf) SFs. Likewise, they found that the magnitude of the configural effect was not modulated by the SF band of the stimuli. These results suggested that the middle band does not play a preferential role in supporting configural processing. A number of more recent studies have found results compatible with those of Boutet et al. (2003).
For instance, Collin et al. (2004) found that altering the range of the SF spectrum shared by inspection and test stimuli in a sequential matching task affects face processing in the same way whether the images are shown upright or inverted. Because inversion is thought to disrupt configural processing, these findings suggest that no band is particularly relevant to that kind of processing. More recently, Gaspar et al. (2008) examined the effect of adding spatially filtered noise to face stimuli and found that the same band of SFs was used for both upright and inverted face identification. Similarly, Willenbockel et al. (2010), using an SF variant of the bubbles technique (Gosselin & Schyns, 2001), showed that the tuning curves for recognition of upright and inverted faces were virtually identical. Most recently, Watier et al. (2010) measured the low and high SF thresholds for making featural
and configural discriminations and found that these were the same, approximately 8 cpf, for upright faces. Together, these findings, using a wide variety of methods, support the notion that while middle SFs are most useful for face recognition, it is not because they play a preferential role in configural analysis.

While the above studies provide compelling evidence that the middle SF band is not preferentially used in configural processing, other research has suggested a special role for low SFs. For instance, Collishaw and Hole (2000) examined the effect of inversion on faces that had either been scrambled (ie their facial features had been randomly moved about the face, disrupting the low SF information in the face) or blurred (ie the high SF information had been removed from them). Their data showed that blurred faces could be recognized upright but not inverted, while scrambled faces could be recognized in both orientations. They argued that the effect of orientation on blurred faces was due to low SFs carrying primarily configural information, the processing of which is disrupted by inversion. Scrambled faces, on the other hand, had their low SF information disrupted and thus carried only featural information, which could be used in both upright and inverted faces. Their findings thus suggest that low SFs carry configural information and high SFs carry featural information. A similar conclusion emerges from work by Goffaux and colleagues (Goffaux, Hault, Michel, Vuong, & Rossion, 2005). They examined the configural effect and found that low SFs were superior to high when making configural discriminations between faces, but that high SFs were superior to low when making featural discriminations. As with Collishaw and Hole, they interpret their findings as suggesting that low SFs support configural processing. A more recent study by Goffaux (2008) adds to the controversy regarding which SFs are most useful for featural versus configural discriminations by showing that the role of SFs in configural processing depends on the direction in which features are displaced. For vertical displacements she found that middle SFs (8–32 cpf) were optimal, but that high SFs (> 32 cpf) yielded better performance for featural discriminations.

The literature discussed above clearly shows mixed findings regarding the role of different SF ranges in featural and configural processing. There are a number of reasons why such inconsistencies may have arisen. One is that featural and configural processing mechanisms in the visual system might not be completely independent. Indeed, evidence suggesting that featural and configural discriminations rely on the same visuospatial detail ranges (Boutet et al., 2003; Collin et al., 2004; Gaspar et al., 2008; Watier et al., 2010; Willenbockel et al., 2010) is compatible with the idea that both kinds of information are processed via a single mechanism. Another reason for inconsistencies in the literature may be that, even if configural and featural processing mechanisms are distinct, the methods used to tap into them might not access them in an orthogonal way. For instance, the FIE was originally thought to arise due to inversion disrupting configural face processing (Yin, 1969), but the degree to which this is the case was difficult to ascertain, as inversion also disrupts featural processing (McKone & Yovel, 2009).
The configural effect was originally conceived of as a more direct test of the theory that the FIE arises due to configural processing disruption (Freire et al., 2000). The logic of the paradigm is straightforward: to test which processes are being disrupted, make either configural or featural changes to faces and assess whether detection of one kind of change or the other is more impaired by inversion. However, in practice, making completely orthogonal changes to features or configuration is difficult if not impossible. Many researchers have pointed out that configural changes can affect the appearance of individual features, and that changes to features can change configuration (Haig, 1984; Rakover, 2002; Rhodes, Brake, & Atkinson, 1993). Compatible with this, McKone and Yovel’s (2009) metareview showed that inversion effects could be as large for featural changes as for configural ones if the former involved modifications in feature shape rather than feature surface coloration.

A related criticism of the configural effect paradigm is that it is difficult to equate the objective discriminability of featural versus configural changes. That is, in the stimuli used in this paradigm, configural and featural differences are generally created in an informal manner. Therefore, it is possible that configural differences between stimuli are produced such that they are simply more difficult to discern from one another than are featural differences. If this is the case, then the disproportionate effect of inversion on configural discriminations could be explained by this factor alone. Goffaux’s (2012) work shows that this factor is critical, as her findings indicate that the degree of holistic processing exhibited by subjects during face processing tasks is monotonically related to the degree of discriminability at the local featural level. It is important to note in this context that some studies have equated the difficulty of featural and configural conditions—for instance, showing no significant difference in accuracy between upright conditions (Leder & Bruce, 2000; Yovel & Kanwisher, 2004). These studies nevertheless show a disproportionate effect of inversion on configural discriminations as compared with featural ones. However, equating human performance on two tasks is not the same as equating their objective difficulty in terms of available visual information, and it therefore remains unclear to what degree the effects seen in these studies are due to informational factors versus human processing factors.

One tool that could bring clarity to the findings regarding spatial vision and configural versus featural processing is ideal observer analysis. We are not aware of previous efforts that leverage ideal observer theory to shed light on this issue, and this is unfortunate, as it has the potential to provide a more formal basis for this work. Our purpose in the present paper is to borrow from ideal observer theory and provide a systematic framework—that is, a ‘model observer’—to tease apart the roles that stimulus information and human vision play in featural and configural face processing. Although our approach deviates from strict ideal observer analysis, for reasons we explain below, it nevertheless offers a mathematically principled way of decoupling human vision from available stimulus information so as to assess the contribution of each of those two components to human performance given the data at hand. Among other things, this will provide a formal baseline against which to compare human performance across conditions, free from effects of discriminability (Goffaux, 2012).

The human data used for our model observer analysis come from a previous study by three of us (Watier et al., 2010). As described in subsection 2.3 below, our model observer performed exactly the same task as human participants and used the same stimuli as in the Watier et al. study. Unlike human observers, the model observer’s performance on featural and configural discriminations is dictated strictly by stimulus properties, and the ratio of human-to-model performance can be taken as a measure of relative efficiency. When considered in the context of face stimuli filtered into distinct SF bands, model and human performance allow for the following interpretations.
On the one hand, if the SF bands preferred by model and human observers are similar, then the human-to-model efficiency ratio should be flat, and any human preference for a given SF band should therefore be attributable to stimulus information rather than to an inherent selectivity of human vision. If, on the other hand, the model and human observers differ in their SF band preferences, then human performance modulations by SF manipulations should be interpreted as a product of human perception rather than as a property of the stimulus. On the basis of previous studies comparing human and ideal observer performance in face processing, we predicted a bandpass function of relative efficiency for humans, with a peak at around 10 cpf (Gold et al., 1999; Keil, 2009; Keil et al., 2008; Näsänen, 1999). However, these previous studies examined face identification or matching, so our prediction is based on the assumption that the same SF band will be used for featural and configural discriminations as for face recognition in general. On the basis of our own previous
work (Boutet et al., 2003; Watier et al., 2010) as well as that of others (Gaspar et al., 2008; Willenbockel et al., 2010), we predict that this same SF band will elicit the highest efficiency in humans for both featural and configural discriminations. However, this prediction rests on the assumption that the objectively most useful information for featural and configural discriminations lies in the same SF band, and this has not been previously shown. Predictions regarding the behavior of our model observer are more difficult to make. One reason for this is that while ideal observer analyses have been used to examine face recognition in several previous studies (Gold et al., 1999; Keil, 2009; Keil et al., 2008; Näsänen, 1999), none has attempted to elucidate the role of different SFs in featural and configural processing. Another reason is that ideal observer analyses have yielded different patterns of behavior across the SF spectrum, likely based on differences in task and stimulus sets.

The remaining sections of the paper are organized as follows. First, we present a brief overview of the research design and highlight key features of the stimulus set used by Watier et al. (2010). Second, we discuss our model observer’s motivation, its relationship to ideal observers, its mathematical implementation, and its simulation parameters. Third, we show results from our model observer simulation, compare the model’s performance with that of human observers, and describe how relative efficiency—the ratio of human-to-model performance—changes with SF. Lastly, in section 4 we interpret our findings and discuss their implications regarding the respective contributions of stimulus information and visual processing to featural and configural processing of stimuli at different spatial scales.

2 Method
We begin the description of our methodology with a brief overview of the Watier et al. (2010) study from which we take our human performance data. We then proceed to discuss the implementation of our model observer and the procedures used for calculating the ratio of human-to-model performance.

2.1  The Watier et al. (2010) study
In Watier et al. (2010) human observers performed a visual face-discrimination task in which they reported on each trial whether they perceived two faces as being either the same or different. The study investigated the effect of SF filtering on face discrimination as well as the effect of manipulating the configural and featural properties of faces. Configural manipulations involved changing the position of features (either the eyes, the nose, or the mouth) relative to the outline of the face. Featural manipulations involved replacing a facial feature (either the eyes, the nose, or the mouth) by a corresponding feature belonging to another face. On trials in which faces were the same, the two faces consisted of identical copies of a face image. On trials in which two faces were different, one face consisted of an unmodified face and the other face consisted of a configurally or featurally manipulated version of that unmodified face. The experimental design of Watier et al. (2010) can be summarized as a fully crossed factorial matrix consisting of 2 (featural vs configural) × 3 (eyes, nose, mouth) × 11 (10 SF filtering levels + 1 unfiltered level) conditions for a total of 66 conditions. Human performance was measured over 12 trials for each condition for a total of 66 × 12 = 792 trials per observer. Watier et al.
(2010) also examined the FIE across all aforementioned conditions, but this was done between subjects, with twenty observers participating in the upright condition and another twenty taking part in the inverted condition. On each trial a participant was presented with a pair of face images that were either identical or different. For half of the ‘different’ trials the faces differed configurally, and for the other half the faces differed featurally. The proportion of correct answers for each condition was recorded automatically.
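For readers who want the factorial structure spelled out, the short sketch below enumerates the 66 conditions and the resulting trial count. It is purely illustrative: the condition labels are our own and are not taken from the original study.

```python
from itertools import product

# Factors of the fully crossed design described above (labels are illustrative)
manipulations = ["featural", "configural"]                          # 2 levels
features = ["eyes", "nose", "mouth"]                                # 3 levels
sf_levels = [f"band_{k}" for k in range(1, 11)] + ["unfiltered"]    # 10 + 1 levels

conditions = list(product(manipulations, features, sf_levels))
trials_per_condition = 12

print(len(conditions))                         # 66 conditions
print(len(conditions) * trials_per_condition)  # 792 trials per observer
```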

2.2  Stimuli
Examples of our stimuli are shown in figure 1. The same stimuli were used in gathering both the human performance data and the model observer data. These consisted of grayscale images with overall dimensions of 256 × 256 pixels. Faces were centered in the image and were embedded in a uniform gray background of 16.7 cd m⁻². Faces had an average width of 144 pixels, or 2.8 deg, at a viewing distance of 57 cm. The SF content of human faces is typically expressed in cpf rather than in cycles deg⁻¹. This convention is especially relevant to the present study given that our model observer algorithm operates on image pixels and is unconcerned with viewing distance. A total of six original faces taken from six different male individuals were used to generate all the stimuli in the study.
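To make the cpf convention concrete, here is a worked conversion using the viewing geometry stated above (an illustration added here, not a value reported in the original study). A component at 10 cpf spans 10 cycles across the 2.8 deg face width, and the 144-pixel face sits within a 256-pixel image, so

\[
10\ \mathrm{cpf} = \frac{10\ \mathrm{cycles}}{2.8\ \mathrm{deg}} \approx 3.6\ \mathrm{cycles\ deg^{-1}},
\qquad
10\ \mathrm{cpf} \times \frac{256}{144} \approx 17.8\ \mathrm{cycles\ per\ image\ width}.
\]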

[Figure 1 near here. Panel labels: base face (unfiltered and unmodified); featural modification; configural modification.]
Figure 1. Example stimuli. The top image shows a base face from which the others were derived. Configural modifications involved moving a feature (eyes, nose, or mouth) up or down. Featural modifications involved changing a feature for one from another face. The number in the lower right of each cell shows the center frequency, in cycles per face width, of the bandpass spatial filter applied to each image. Note that contrast and brightness have been adjusted for publication.
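To make the two manipulation types in figure 1 concrete, the sketch below shows one way they could be implemented on an image array. The rectangular feature region, the shift size, and the crude gap filling are hypothetical placeholders for illustration only; they are not the procedures or coordinates used to build the actual stimuli.

```python
import numpy as np

def configural_change(face, region, dy):
    """Shift the feature inside region = (top, bottom, left, right) vertically by dy pixels.
    The vacated area is filled with the mean of the original patch, a crude stand-in for
    the retouching that real stimuli would require."""
    top, bottom, left, right = region
    out = face.copy()
    patch = face[top:bottom, left:right].copy()
    out[top:bottom, left:right] = patch.mean()        # erase the feature at its old location
    out[top + dy:bottom + dy, left:right] = patch     # paste it at the displaced location
    return out

def featural_change(face, donor_face, region):
    """Replace the feature inside region with the corresponding region of another face."""
    top, bottom, left, right = region
    out = face.copy()
    out[top:bottom, left:right] = donor_face[top:bottom, left:right]
    return out

# Hypothetical usage: move the eyes up by 4 pixels, or swap in another face's eyes.
# eyes_region = (90, 120, 60, 200)   # (top, bottom, left, right), illustrative only
# shifted = configural_change(face_img, eyes_region, dy=-4)
# swapped = featural_change(face_img, other_face_img, eyes_region)
```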

The SF contents of the stimuli were manipulated by processing faces with radially symmetric 2‑D bandpass filters of variable center frequency. Each bandpass filter consisted of the multiplicative combination of lowpass and highpass Butterworth filters with cutoffs two octaves apart. The resulting bandpass filter therefore has a two-octave bandwidth at half height, and center SFs fc took values of 1.0, 1.7, 3.0, 5.3, 9.2, 19.0, 26.9, 38.1, 53.8, and 76.1 cpf. The mean luminance of the filtered faces remained unaltered by constraining the filter’s DC level to 1.0. Filtered faces were not renormalized to a prescribed contrast value. While this reduces the overall contrast of the filtered image with respect to the original unfiltered image, it ensures that the contrast energy that remains in filtered images is not artificially amplified and is therefore equivalent to what would be available to the visual system from the unfiltered image. Examples of spatially filtered faces are shown in figure 1.

2.3  Model observer
We begin by outlining the limitations of applying strict ideal observer analysis to Watier et al.’s data and briefly motivate our choice to use a less stringent ‘model observer’ analysis instead. We then describe the mathematical formalism of our stimuli and model observer. Lastly, we address the conversion of human performance from an accuracy measure to an equivalent noise measure.

Our model observer is similar to those used in previous studies, such as Gold et al. (1999) and Näsänen (1999), in that it is based on an algorithm that compares face images via cross-correlation and searches through a database of candidate images for the best match to a target image. This algorithm is inspired by ideal observer theory: Pelli (1985) showed that this method of calculating dot products of images provides the ideal decision rule for many recognition tasks. Although our model observer includes many elements of ideal observer theory, we prefer to use the more conservative term ‘model observer’ for several reasons. First, unlike most tasks involved in ideal observer studies, our task involved noiseless stimuli for which psychophysical accuracy (ie proportion correct)—not psychophysical thresholds—was measured. In the absence of (external) stimulus noise, ideal observer performance would necessarily be at ceiling (100% correct) in all conditions and would therefore provide us with little insight as to which SF band is most physically informative for the task at hand. To remedy this issue, we added to our model observer a source of internal noise whose amplitude we manipulated to bring model performance in line with human psychophysical accuracy. Second, the Watier et al. task consists of comparing two visual stimuli and determining whether or not they are the same. In this respect, our task is unlike most tasks in ideal observer studies, where the objective is often to report the presence or absence of stimuli or the one‑shot recognition of a single stimulus from a set of stored templates. Our goal is not—and in fact cannot be—to apply a strict ideal observer analysis in the form generally adopted in studies where stimuli are presented at several intensity levels and accuracies can be converted into thresholds via psychometric functions. As we explain in a subsequent section, the Watier et al.
data are restricted to a single intensity-level measure of accuracy (ie proportion correct); and while this limitation can be overcome with a few assumptions, we use the term ‘model observer’ to differentiate our analysis from the narrower and stricter meaning of ‘ideal observer’.

2.4  Stimulus and model mathematics
First, we formalize the mathematics of our stimuli. Let A represent the original set of unfiltered face stimuli from Watier et al. (2010), where i indexes the identity of a specific face in A. Similarly, let B represent the original set of Watier et al.’s bandpass spatial filters, where j indexes a specific filter in B. Filters in B are isotropic, and their bandpass profile can be fully defined in the 2‑D spatial Fourier domain as the product of intersecting lowpass
and highpass Butterworth filters. Thus, the jth filter in B is defined as

\[
B_j(f) = \frac{1}{1 + \left(f / f_{\mathrm{high},j}\right)^{2n}} \times \frac{1}{1 + \left(f_{\mathrm{low},j} / f\right)^{2n}} ,
\]

where $f$ is radial SF, $f_{\mathrm{low},j}$ and $f_{\mathrm{high},j}$ are the highpass and lowpass cutoffs of the jth filter (set two octaves apart, so that $f_{\mathrm{high},j} = 4 f_{\mathrm{low},j}$ and the center frequency $f_{c,j}$ is their geometric mean), and $n$ is the order of the Butterworth filters.
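As an illustration of how such a filter could be constructed and applied in the discrete Fourier domain, a minimal sketch follows. The filter order (n = 2), the square-image assumption, and the function names are our own choices for illustration and are not taken from the original implementation.

```python
import numpy as np

def butterworth_bandpass(size, f_center, n=2):
    """Radially symmetric bandpass filter: the product of lowpass and highpass Butterworth
    filters with cutoffs one octave either side of f_center (ie two octaves apart).
    Frequencies are in cycles per image width; the DC gain is fixed at 1."""
    fy = np.fft.fftfreq(size)[:, None] * size        # vertical SF, cycles/image
    fx = np.fft.fftfreq(size)[None, :] * size        # horizontal SF, cycles/image
    f = np.hypot(fx, fy)                             # radial SF
    f_lo, f_hi = f_center / 2.0, f_center * 2.0      # highpass and lowpass cutoffs
    lowpass = 1.0 / (1.0 + (f / f_hi) ** (2 * n))
    highpass = 1.0 / (1.0 + (f_lo / np.maximum(f, 1e-9)) ** (2 * n))
    filt = lowpass * highpass
    filt[0, 0] = 1.0                                 # preserve mean luminance (DC = 1)
    return filt

def filter_face(image, f_center_cpf, face_width_px=144):
    """Bandpass filter a square image, with the center frequency given in cycles per face width."""
    f_center_img = f_center_cpf * image.shape[0] / face_width_px  # cpf -> cycles/image
    filt = butterworth_bandpass(image.shape[0], f_center_img, n=2)
    return np.real(np.fft.ifft2(np.fft.fft2(image) * filt))
```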

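Returning to the decision rule described in subsection 2.3 (dot-product template matching with added internal white noise), the following is a rough sketch of how it could be simulated; the noise level, the template set, and the function name are placeholders of our own rather than the implementation actually used in this study.

```python
import numpy as np

def model_observer_choice(target, templates, noise_sd, rng):
    """Return the index of the template most likely to have generated the noisy target.

    Under additive white Gaussian noise, the maximum-likelihood rule reduces to choosing
    the template t that maximizes its dot product with the stimulus minus half the
    template energy (the standard matched-filter decision rule)."""
    noisy = target + rng.normal(0.0, noise_sd, size=target.shape)   # internal noise
    scores = [np.dot(noisy.ravel(), t.ravel()) - 0.5 * np.dot(t.ravel(), t.ravel())
              for t in templates]
    return int(np.argmax(scores))

# Noise tolerance can then be estimated by increasing noise_sd over many simulated trials
# until the model's proportion correct falls to the level observed in human observers.
```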