Cognition 149 (2016) 104–120


Perspective-taking behavior as the probabilistic weighing of multiple domains

Daphna Heller a,⇑, Christopher Parisien b, Suzanne Stevenson a

a University of Toronto, Canada
b Nuance Inc., Canada

Article info

Article history: Received 27 August 2013; Revised 18 November 2015; Accepted 12 December 2015; Available online 4 February 2016

Keywords: Common ground; Perspective taking; Reference; Pragmatics; Definites; Bayesian modeling; Eye-tracking

Abstract

Our starting point is the apparently-contradictory results in the psycholinguistic literature regarding whether, when interpreting definite referring expressions, listeners process relative to the common ground from the earliest moments of processing. We propose that referring expressions are not interpreted relative solely to the common ground or solely to one’s private (or egocentric) knowledge, but rather reflect the simultaneous integration of the two perspectives. We implement this proposal in a Bayesian model of reference resolution, focusing on the model’s predictions for two prior studies: Keysar, Barr, Balin, and Brauner (2000) and Heller, Grodner, and Tanenhaus (2008). We test the model’s predictions in a visual-world eye-tracking experiment, demonstrating that the original results cannot simply be attributed to different perspective-taking strategies, and showing how they can arise from the same perspective-taking behavior. © 2015 Elsevier B.V. All rights reserved.

Conversation takes place under knowledge mismatch: the conversational partners naturally come to the conversation with different knowledge and beliefs. At the same time, keeping track of what information is or is not shared with one’s interlocutor is crucial for conducting felicitous conversation. For example, assertions typically contain information that is not already shared, and, similarly, questions normally ask about information that is not shared (Stalnaker, 1978). Thus, conversational partners must keep track of the shared information (i.e., the common ground) alongside maintaining their own private knowledge. To make an assertion, a speaker has to use both types of information simultaneously: coming up with the content of an assertion requires using one’s own privileged knowledge, whereas determining whether the assertion will be felicitous at the current state of the conversation requires consulting the common ground. For definite referring expressions (e.g., the candle), there has been a debate regarding whether they are interpreted relative to the private (or egocentric) perspective or relative to the common ground. Theoretical approaches suggest that definite reference depends on shared information (e.g., Clark & Marshall, 1981). For example, a felicitous use of the candle depends on there being a

⇑ Corresponding author at: Department of Linguistics, University of Toronto, 100 St. George Street, Toronto, ON M5S 3G3, Canada. E-mail address: [email protected] (D. Heller). 0010-0277/© 2015 Elsevier B.V. All rights reserved.

uniquely-identifiable candle in the common ground (Gundel, Hedberg, & Zacharski, 1993). However, in an influential study, Keysar, Barr, Balin, and Brauner (2000) showed that listeners do not, in fact, restrict their attention to objects in common ground. They take this result as indicating that listeners initially interpret definite referring expressions relative to their private (or egocentric) perspective, integrating common ground information only if this initial processing leads to failed reference (see also Keysar, Barr, Balin, & Paek, 1998; Keysar, Lin, & Barr, 2003). Other work, starting with Nadig and Sedivy (2002), has shown instead that listeners are not initially egocentric, but rather show sensitivity to the common ground perspective from the earliest moments of processing (Brown-Schmidt, Gunlogson, & Tanenhaus, 2008; Hanna & Tanenhaus, 2004; Hanna, Tanenhaus, & Trueswell, 2003; Heller, Grodner, & Tanenhaus, 2008). The literature on perspective taking has aimed to reconcile these apparently-contradictory results by pointing to causes that could have led listeners in the different studies to adopt different perspective-taking strategies (Bezuidenhout, 2013; Brown-Schmidt & Hanna, 2011; Hanna et al., 2003; Kuhlen & Brennan, 2013). These explanations assume that listeners choose a single perspective-taking strategy in response to situational factors, and interpret a definite noun phrase either relative to their egocentric perspective or relative to the common ground. The current paper takes a radically-different approach. We propose

D. Heller et al. / Cognition 149 (2016) 104–120

that taking perspective entails the simultaneous use of both the egocentric perspective and the common ground. We then demonstrate how this new approach can give a unified account of two sets of results that previously have been taken to arise from different underlying mechanisms. Before turning to our proposal, we review the relevant literature.

1. Introduction: Common ground and domains of reference

It has been widely accepted since the work of Russell (1905) that definite descriptions, like the triangle or the white candle, are used to identify a referent that is unique in satisfying the descriptive content of the definite, such as a unique triangle or a unique white candle (for one prominent theory, see Gundel et al., 1993). Uniqueness, however, is not absolute (there is clearly more than one white candle in the world), but is relativized to a contextually-restricted set, known as the domain of reference (Roberts, 2003). The domain of reference (or referential domain) is not explicitly given as part of the linguistic signal, and thus must be inferred from indirect situational cues: what objects are available in the physical surroundings, what has been said in prior conversation, and also from general world knowledge. Furthermore, domains of reference are not static, but change over time as information is updated over the course of a conversation. Indeed, a growing body of psycholinguistic evidence shows that listeners can quickly adapt domains of reference as language unfolds. Experiments have revealed that listeners use information in the linguistic signal, like the selectional restrictions of a verb, to restrict their attention to relevant entities. For example, listeners focus on edible things after hearing the verb eat, but not after hearing move (Altmann & Kamide, 1999). In addition, listeners adapt referential domains using non-linguistic information, such as the affordances of the objects in the physical context (Chambers, Tanenhaus, Eberhard, Filip, & Carlson, 2002; Chambers, Tanenhaus, & Magnuson, 2004). For example, Chambers et al.
(2004) demonstrate that when listeners interpret an instruction like Pour the egg in the bowl over the flour, they develop expectations about whether the noun egg will be followed by a modifier (e.g., in the bowl) depending on how many eggs in the context are in liquid form and can thus plausibly be poured.

Despite the remarkable ability to quickly adapt referential domains using linguistic and non-linguistic information, some psycholinguistic findings, such as Keysar et al. (2000), suggest that information about common ground is not used in initially restricting referential domains. This is surprising given Clark and Marshall’s (1981) theoretical proposal that definite descriptions are used to refer to objects in common ground. Specifically, Keysar et al. (2000) used a referential communication task in which a lab confederate instructed participants to manipulate real-world objects as their eye movements were recorded. Shared information, or common ground, was established by physical co-presence: the objects were placed in a vertical display of cubbyholes, with objects visible to both interlocutors assumed to be in common ground, and objects that were blocked from the confederate’s view with a barrier assumed to be in the listener’s privileged ground. On critical trials the display contained, for example, three candles that contrasted in size, the largest of which was privileged to the listener, and the confederate speaker instructed the listener to Pick up the big candle.1 The Triplet-Privileged display in Fig. 1 illustrates this situation (except the original display had more distractor objects).

1 The original example in Keysar et al. (2000) had the smallest candle privileged, and the instruction was pick up the small candle. The example was changed for ease of exposition.

Keysar et al. reasoned that if listeners restrict


the domain of reference to common ground, they would not consider the privileged biggest candle as a potential referent. They compared this situation to a control display where the privileged object was unrelated (e.g., an apple). Unlike in the control display, with the critical display they found that listeners did look at the privileged object (i.e., the biggest candle), and sometimes even reached for it and touched it. Keysar et al. (2000) interpreted this result as indicating that listeners process initially from their egocentric (or private) perspective, ignoring information about what is in common ground (see also Keysar et al., 2003). Hanna et al. (2003) proposed an alternative reason for Keysar et al. (2000) not finding an early effect of common ground: It is not that listeners are egocentric, but rather that, in Keysar et al.’s (2000) setup, the privileged object was always a better perceptual match to the descriptive content of the referring expression than any of the objects in common ground (in the example above, because it’s the biggest candle visible). To test this claim, Hanna et al. (2003, Experiment 1) examined a situation where the privileged object, a red triangle, was identical to the intended referent in common ground, a second red triangle. (Here shared status was established by linguistic mention, not physical co-presence.) Upon hearing an instruction like Put the blue circle above the red triangle, listeners were more likely to look at the red triangle in common ground than at the privileged red triangle, and were also faster to choose it, as compared with displays that contained two red triangles in common ground (see Nadig & Sedivy, 2002 for a similar result with young children). This result demonstrates – counter to Keysar et al.’s (2000) claim – that listeners do use common ground information from the earliest moments of processing. Hanna et al. 
(2003) adopt a constraint-based approach where interpretation is guided by multiple constraints that reflect continuous integration of evidence from multiple sources. Specifically, they propose two probabilistic constraints that can account for both their own results and those of Keysar et al. (2000). First, the common ground constraint prefers shared referents over privileged referents, with the strength of bias depending on the strength of the probabilistic cues in the situation that indicate what is in common ground. Second, the perceptual match constraint biases reference resolution toward an object whose perceptual properties best match the descriptive content of the referring expression (i.e., the noun and its modifiers) – this constraint is evaluated against all the objects perceptually available to the listener. These two constraints were able to account for both sets of results that were available at that time. First, they account for the pattern in Hanna et al. (2003, Experiment 1) because the common ground constraint favors the red triangle in common ground over the privileged red triangle; here the perceptual match constraint favors neither, as both objects have the same properties and thus match the definite referring expression (‘‘the red triangle”) equally well. Importantly, the same two constraints can also account for the pattern in Keysar et al. (2000): while the common ground constraint favors the intended referent in common ground, the perceptual match constraint strongly favors the privileged object, because it is a better perceptual match to big candle (it is the biggest candle visible to the listener). The same two constraints cannot, however, account for a more recent result from Heller et al. (2008). Using a similar setup to Keysar et al. (2000), Heller et al. examined the interpretation of an unfolding instruction such as ‘‘Pick up the big candle” at the point of processing the size adjective. 
Their experimental design was built on the finding that upon hearing a size adjective – and even before hearing the noun it modifies – listeners expect reference to an object for which there is a size-contrasting object (Sedivy, Tanenhaus, Chambers, & Carlson, 1999). Heller et al. (2008) examined displays that contained one pair of size-contrasting objects



Fig. 1. Two comprehension conditions shown from the perspective of the listener; the object in the covered cubbyhole was not visible from the speaker’s perspective. Triplet-Privileged mimics Keysar et al.’s (2000) critical condition, and Pairs-Privileged mimics Heller et al.’s (2008) critical condition. The instruction associated with these displays is “Pick up the big candle”; the intended referent is marked with a white circle in the figure.

that were both in common ground (e.g., a big and a small candle) and a second pair of size-contrasting objects, where the contrasting object was privileged to the listener (e.g., a big funnel in common ground and a small funnel that was hidden from the view of the speaker) – see the Pairs-Privileged display in Fig. 1. Results showed that, upon hearing the size adjective, listeners anticipated the referent to be the object with a size contrast in common ground (e.g., the big candle). This result cannot be accounted for by the two constraints proposed by Hanna et al. (2003) because neither constraint favors the object that has a size-contrast in common ground. First, the common ground constraint does not favor the big candle over the big funnel because they are both in common ground. Similarly, the perceptual match constraint does not favor the big candle over the big funnel, because they are a similar perceptual match to the partial referring expression the big. The more recent theory of “anticipation without integration” (Barr, 2008b) faces a similar problem. According to this theory, listeners anticipate objects in common ground to be more likely referents than objects in privileged ground, but the integration of linguistic information, such as the referring expression, proceeds relative to the egocentric perspective (which in this setup is comprised of all the objects in the display). For the Heller et al. (2008) result, this theory predicts that listeners will anticipate both big objects to a similar extent because they are both in common ground, and the processing of the adjective big will not distinguish them because both objects have a size-contrasting object in the egocentric perspective. Thus, although this proposal is conceptually very different from Hanna et al.’s, it faces a similar problem in accounting for the Heller et al. (2008) result. To account for the pattern in Heller et al.
(2008), information about common ground needs to affect not only which objects are expected to be referents (as in the common ground constraint), but also how objects are referred to in relation to other objects in the domain. In other words, we need a constraint that will reflect the listener’s referential expectations that the speaker will only say ‘‘the big. . .” to refer to the bigger candle, but not to the bigger funnel, when the size-contrast for the latter is not in common ground. The referential expectations in the set-up of Heller et al. (2008) could be captured by a strategy that assesses the referring expression solely against the objects in common ground: that is, two candles (that contrast in size) and one funnel. The problem with this strategy is that it cannot capture the finding in Keysar et al.

(2000), which indicated that listeners do not focus on common ground objects alone, but consider objects in the privileged domain. In other words, we seem to be forced to conclude that in the Keysar et al. (2000) study and in the Heller et al. (2008) study, listeners used a different strategy in interpreting referring expressions. Indeed, it has been proposed that a single perspective-taking strategy is unable to reconcile the experimental results, and that the different patterns arise because listeners in those studies used different perspective-taking strategies. For example, Brown-Schmidt and Hanna (2011) suggest that the perspective-taking strategy is chosen based on the strength of the cues to common ground information in the particular situation (see also Hanna et al., 2003). Alternatively, Kuhlen and Brennan (2013) suggest that the perspective-taking strategy depends on the real knowledge state of the confederate speaker, and thus listeners seem to behave egocentrically in those studies where the confederate might be assumed to know about the identity of objects that were supposed to be privileged to the listener. Finally, Bezuidenhout (2013) suggests that the perspective-taking strategy depends on whether listeners in the particular situation have been motivated to rely on common ground information. While these papers identify different causes as responsible for the perspective-taking strategy adopted by listeners, they share the assumption that the apparently-inconsistent results should be explained by attributing them to different strategies that arise because of situational cues that vary across the experiments. Here we take a different approach. We propose that these results arise from a single perspective-taking strategy, specifically, one where listeners develop referential expectations relative to both the common ground and the egocentric perspectives.
We make this proposal precise in the next section, where we present a Bayesian model of reference resolution. In exploring the predictions of this model, we focus on two of the sets of findings above that use similar experimental setups: Keysar et al. (2000) and Heller et al. (2008). First, we present model simulations that show how a single perspective-taking strategy is able to predict qualitatively different results in the two cases. We then conduct a visual-world comprehension experiment that shows the predictions of the model can indeed be obtained in a within-subjects design – that is, without a change in situational factors that could trigger different perspective-taking strategies.


2. Our model of reference resolution

We model reference resolution probabilistically by computing P(obj|RE,d), namely, the probability that a certain object (obj) is being referred to, given a referring expression (RE) and a domain of reference (d). While the probability P(obj|RE,d) directly captures the task of reference resolution for the listener, it is not a natural one to estimate from a listener’s exposure to language. This situation often arises in probabilistic approaches, and following standard practice (see, e.g., Frank & Goodman, 2012; Kehler & Rohde, 2013, in the area of referential interpretation), we use Bayes’ rule to rewrite our probability formula as the product of two components:

$$P(\mathrm{obj} \mid \mathrm{RE}, d) \propto P(\mathrm{RE} \mid \mathrm{obj}, d)\, P(\mathrm{obj} \mid d) \qquad (1)$$
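The proportionality in (1) can be made concrete with a minimal Python sketch that multiplies the two components for each object and then normalizes over the objects in the domain. The function name and all numbers here are hypothetical and chosen only for illustration; they are not values from the paper.

```python
def posterior_over_objects(likelihood, prior):
    """Return P(obj | RE, d) for one fixed RE and one fixed domain d.

    likelihood: dict mapping object name -> P(RE | obj, d)
    prior:      dict mapping object name -> P(obj | d)
    """
    # Unnormalized product of the two components, one entry per object.
    unnormalized = {obj: likelihood[obj] * prior[obj] for obj in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("RE has zero probability for every object in the domain")
    # Normalizing turns the proportionality into a distribution over objects.
    return {obj: value / total for obj, value in unnormalized.items()}


# Hypothetical two-object domain (illustrative numbers only).
likelihood = {"candle": 0.9, "funnel": 0.1}  # P("the big..." | obj, d)
prior = {"candle": 0.5, "funnel": 0.5}       # P(obj | d)
result = posterior_over_objects(likelihood, prior)
```

With equal priors, the posterior simply follows the likelihoods, so in this toy case the candle receives most of the probability mass.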


We start by considering the second component, P(obj|d), which represents the prior probability that a certain object (obj) will be a referent given a domain (d). This corresponds to the listener’s assessment of how likely each object is to be the referent, independent of any information from the referring expression (i.e., before the referring expression is even heard). The formula P(obj|d) is a distribution that sums to 1 (i.e., we consider all possible objects obj in the domain d); the interesting aspect is how the probabilities are distributed among the objects. While this distribution may, in principle, be affected by any source of relevant information (we come back to this in the General Discussion), here we use it to encode the ground status of objects, making the (simplified) assumption that objects in common ground are more likely referents than privileged objects (and that all the objects in common ground are equally likely). This component is reminiscent of Hanna et al.’s (2003) common ground constraint, as they both aim to capture how the ground status of objects affects their likelihood of being referents. The first component of (1), P(RE|obj,d), captures the likelihood of different referring expressions (RE), given a particular object (obj) that needs to be distinguished within a certain domain (d). This corresponds to the listener’s referential expectations: what referring expression is expected if the speaker intends to pick out a particular object from a designated referential domain? The referring expression will of course depend on the properties of the object itself; for example, a blue funnel is expected to be called the blue funnel and not the red triangle. But, critically, this component also reflects the fact that a referring expression also depends on the properties of other objects in the referential domain d. 
For example, the same object may be called the funnel if it is the only funnel in the referential domain, the small funnel if there is a bigger funnel in the domain, and the big funnel if there is a smaller one in the domain. Again, the formula P(RE|obj,d) reflects a probability distribution over all possible referring expressions: different referring expressions may be possible for a certain object, and together their likelihood sums to 1. In some situations, perhaps one referring expression is overwhelmingly preferred, while in other situations more variation might be expected. While this component may seem reminiscent of Hanna et al.’s (2003) perceptual match constraint in that they both aim to capture the relationship between the descriptive content of the referring expression and the properties of potential referents, the two differ in a crucial way. Specifically, the perceptual match constraint is evaluated against all the objects that are perceptually available, whereas our constraint is relativized to a referential domain. To evaluate the probability formula in (1), P(RE|obj,d)P(obj|d), we need to consider the values for RE, obj, and d. Identifying the first two is straightforward: the referring expression RE is the stated (partial) expression in the relevant trial (e.g., ‘‘the big. . .”), and obj is evaluated for all of the objects in the display. The important question then is how to set the domain d. As mentioned earlier, an


implicit assumption in the literature has been that, depending on situational factors, listeners either assess the referring expression relative to all the objects, namely using the egocentric domain (d = e), or they focus only on common ground objects and use the common ground domain (d = c). However, we note that a more careful consideration of the findings reviewed above suggests that listeners’ choice of referential domain is not, in fact, categorical. For example, listeners in Keysar et al. (2000) reached for the privileged object on 23% of trials, suggesting that on the majority of trials (i.e., 77%) they identified the intended referent immediately, indicating their use of common ground information. In Hanna et al. (2003), listeners looked at a privileged object that fitted the referring expression (i.e., a privileged red triangle when hearing the red triangle) more than at an unrelated privileged object, indicating that referential fit in the egocentric domain played a role in reference resolution. Nevertheless, they immediately chose a common ground referent, indicating their use of common ground information. Finally, in Heller et al. (2008) listeners showed an early preference for the intended referent over the same-size competitor indicating their use of referential fit relative to common ground, but, at the same time, they looked more at the privileged object when it contrasted in size with a shared object than when it was unrelated, indicating that referential fit in the egocentric domain was not completely ignored. These patterns have motivated us to pursue an approach where reference resolution is guided by more than one referential domain. Specifically, our model is novel in combining the values of our probability formula under the two different possible settings for the domain d: one in which d is set to the egocentric domain, e, and one in which it is set to the common ground domain, c. That is, we revise Eq. 
(1) as follows, so that the left-hand side expresses that d is not a given, and we define P(obj|RE) as the simultaneous calculation of the component probabilities with respect to each domain. The influence of the domains is combined using a weight alpha (α) and its complement (1 − α):

$$P(\mathrm{obj} \mid \mathrm{RE}) \overset{\mathrm{def}}{=} \alpha\, P(\mathrm{RE} \mid \mathrm{obj}, d = e)\, P(\mathrm{obj} \mid d = e) + (1 - \alpha)\, P(\mathrm{RE} \mid \mathrm{obj}, d = c)\, P(\mathrm{obj} \mid d = c) \qquad (2)$$
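As a rough illustration of how the weighted combination behaves, the following Python sketch computes the right-hand side of the two-domain formula for a display with two size-contrasting pairs, one of whose small members is privileged. The function name, the object labels, and every probability value are assumptions made for this example only; the score is left unnormalized, as in the formula.

```python
def weighted_score(obj, alpha, lik_e, prior_e, lik_c, prior_c):
    """Weighted two-domain score for one object and one fixed RE.

    alpha weights the egocentric domain (d = e); 1 - alpha weights the
    common ground domain (d = c).
    """
    egocentric = lik_e[obj] * prior_e[obj]
    common_ground = lik_c[obj] * prior_c[obj]
    return alpha * egocentric + (1 - alpha) * common_ground


# Hypothetical values for the partial expression "the big...".
# In d = e both big objects have a visible size contrast; in d = c the
# funnel's size contrast is hidden, so "big" is less expected for it.
lik_e = {"big_candle": 0.9, "small_candle": 0.0, "big_funnel": 0.9, "hidden": 0.0}
lik_c = {"big_candle": 0.9, "small_candle": 0.0, "big_funnel": 0.2, "hidden": 0.0}
prior_e = {"big_candle": 0.25, "small_candle": 0.25, "big_funnel": 0.25, "hidden": 0.25}
prior_c = {"big_candle": 0.3, "small_candle": 0.3, "big_funnel": 0.3, "hidden": 0.1}

args = (lik_e, prior_e, lik_c, prior_c)
candle_mixed = weighted_score("big_candle", 0.5, *args)
funnel_mixed = weighted_score("big_funnel", 0.5, *args)
candle_ego = weighted_score("big_candle", 1.0, *args)
funnel_ego = weighted_score("big_funnel", 1.0, *args)
```

Under these assumed numbers, an intermediate alpha favors the big candle over the big funnel, whereas a purely egocentric listener (alpha = 1) scores the two big objects identically.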


An alpha near 1 reflects a situation where the listener weighs the egocentric domain (d = e) more and the common ground domain (d = c) less, and is thus predicted to exhibit more egocentric behavior. In contrast, an alpha near 0 reflects a situation where the listener weighs the egocentric domain (d = e) less and the common ground domain (d = c) more, thus showing more adaptation to shared knowledge. An important claim of our approach is that perspective-taking is achieved by combining the influence of both domains rather than by categorically choosing one or the other, and thus the weight alpha is expected to be an intermediate value, rather than the extreme of 0 or 1. Before turning to model simulations, let us consider how this proposal differs from Hanna et al.’s (2003). First, our model incorporates referential fit for two referential domains, P(RE|obj,d = e) and P(RE|obj,d = c), whereas Hanna et al.’s perceptual match constraint is evaluated against all the perceptually-available objects, which is parallel to P(RE|obj,d = e) alone. Second, our model incorporates the likelihood of objects as referents for both referential domains, P(obj|d = e) and P(obj|d = c), whereas Hanna et al.’s common ground constraint is similar to the latter alone (for more details, see under Model Simulations). That is, in Hanna et al.’s model, sensitivity to common ground information is only reflected in the objects that are expected to be referents, and it is not incorporated into referential expectations (i.e., what the speaker is expected to call these objects). Barr’s (2008b) “anticipation-without-integration” differs from our approach in similar ways. Specifically, the “anticipation” stage, which determines the set of



potential referents, is sensitive to common ground information (equivalent to P(obj|d = c)), whereas the “integration” stage, which maps the linguistic signal onto potential referents, is evaluated relative to the egocentric domain (equivalent to P(RE|obj,d = e)). Thus, while the two are conceptually different, Barr’s (2008b) theory is similar to Hanna et al.’s (2003) constraint-based approach in how it compares to our model: Both of their approaches favor reference to objects in common ground, while evaluating fit of the referring expression egocentrically against all objects, whereas in our model both the common ground and the egocentric domains play a role both in the referents expected and in how these objects are expected to be described.

3. Model simulations

Our goal is to demonstrate that the proposed model allows us to consider findings that have been previously attributed to listeners adopting different perspective-taking strategies, and demonstrate that they can arise from a single strategy that integrates multiple referential domains. Specifically, we model reference resolution in two situations, modeled after the critical conditions from Keysar et al. (2000) and from Heller et al. (2008). We focus on these two studies because their results have been taken as evidence for different perspective-taking strategies (egocentricity vs. adaptation to common ground), despite the fact that they used similar materials (definites with size adjectives) and a similar setup (common ground established by physical co-presence), which should reasonably give rise to the same perspective-taking strategy. We investigate the interpretation of definite referring expressions with size adjectives (e.g., Pick up the big candle) in the two conditions from Fig. 1.
The Triplet-Privileged display mimics Keysar et al.’s (2000) critical condition: there are three objects contrasting in size, with the object that is the best referential fit to the referring expression (e.g., the biggest candle) in the listener’s privileged ground (the fourth object is unrelated). The Pairs-Privileged display is identical to Heller et al.’s (2008) critical condition: there are two pairs of objects contrasting in size, with the size-contrast of the intended referent in common ground and the other size-contrast in the listener’s privileged ground. Modeling reference resolution in these two cases involves the following steps. First, we identify the two referential domains for each case: the egocentric domain (d = e) and the common ground domain (d = c). Next, we estimate the probabilities for each of the two components in our model: P(RE|obj,d) will be estimated from production data, whereas P(obj|d) will be determined based on theoretical assumptions. The model’s predictions for P(obj|RE) – the probability of each object given the (partial) referring expression – will be determined using each of the two components in each of the two domains, as in Eq. (2).

3.1. Identifying the two domains

We start by identifying the egocentric and the common ground domains for each case. The left column of Fig. 2 shows the two domains for Triplet-Privileged (top panel), and for Pairs-Privileged (bottom panel). In each case, the egocentric domain (d = e) contains the objects that are visible to the listener (i.e., all four objects). The common ground domain (d = c) contains the (three) objects in common ground, but it also reflects the fact that there is an object that is privileged to the listener – that is, the identity of the object is not in common ground, but the fact that such an object exists is.

3.2. Estimating probabilities for P(RE|obj,d)

Recall that the component P(RE|obj,d) estimates the likelihood of a referring expression (RE) given an object (obj) and a domain (d). We assume that listeners’ expectations about what referring expressions speakers will use for a given object in a given referential domain arise from their prior experience. Therefore, to estimate these probabilities, we conducted a production experiment with the goal of eliciting referring expressions for each of the four objects (obj) in each of the four domains (d): Triplet-Privileged: egocentric, Triplet-Privileged: common ground, Pairs-Privileged: egocentric and Pairs-Privileged: common ground. Full details regarding the method in this norming are given in Appendix A. The middle column in Fig. 2 shows the most likely referring expression for each object in each of the four domains, along with the probability of that expression (while our example uses big, half of our data involved the adjective small). Overall, images that had a size-contrast were very likely to be referred to with a referring expression that contained a size adjective (over .90 in all cases). In contrast, images that did not have a size-contrast from the same nominal category were usually referred to with a bare noun, although a size adjective was nonetheless used about 17% of the time (possibly reflecting the salience of size contrast in the display overall). Finally, HIDDEN objects were most often referred to using an utterance like “not the big candle, not the small candle, not the funnel”; this is perhaps because we told participants not to use locations in their utterances. While the figure provides only a single probability for the most likely referring expressions, the model’s predictions were computed using the complete probability distribution obtained in the norming (i.e., a distribution over the observed referring expressions for the object, which sums to 1).

3.3. Estimating probabilities for P(obj|d)

Recall that the component P(obj|d) represents the prior probability that an object (obj) will be referred to given a referential domain (d). That is, this is the probability of objects being referents independent of any linguistic input (i.e., before the referring expression has been uttered or heard). The right column of Fig. 2 illustrates the specific probabilities we used to compute the model’s predictions, which we estimated based on theoretical assumptions. First, we assume that objects of the same ground status are equally likely to be referents. For d = e, which has four objects, this means each object is assigned a probability of .25. For d = c, we assume that a speaker is more likely to refer to objects in common ground (i.e., objects she is aware of) than to objects that are privileged to the listener (i.e., objects hidden from the view of the speaker) (cf. Clark & Marshall, 1981). We chose a small probability value of .1 for the privileged object to reflect the fact that this object is not a likely referent. Note that this value of 10% is quite high, given that privileged objects were never the intended referent in any of the relevant experiments (i.e., Heller et al., 2008; Keysar et al., 2000; see also under Materials below). This choice of a relatively high value is conservative: a smaller probability will only improve our model’s predictions (more below). The remaining probability of objects given d = c is distributed equally between the three shared objects (.3 each).

3.4. Model predictions

We now have all the components that allow us to simulate reference resolution in our two cases, Triplet-Privileged and Pairs-Privileged, from Fig. 1. We use these components to make predictions for a partial referring expression, such as the big (or the small). This models the processing of the speech signal as it unfolds in real time, before any information from the noun is heard.2 We model

2 To derive P(RE|obj,d) where RE is a partial referring expression such as the big, we sum the probabilities obtained from the norming study for any full RE that is compatible with the initial string the big.
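The summation described in footnote 2 can be sketched in a few lines of Python. This is only an illustration: the function name, the whole-word matching rule, and the example distribution are assumptions introduced here, not values or procedures taken from the norming study.

```python
from typing import Dict


def partial_re_likelihood(full_re_probs: Dict[str, float], prefix: str) -> float:
    """Sum P(RE|obj,d) over every full expression that begins with `prefix`.

    Matching is done on whole words, so "the bigger candle" is not treated
    as compatible with "the big" (an assumption of this sketch).
    """
    prefix_tokens = prefix.lower().split()
    total = 0.0
    for expression, prob in full_re_probs.items():
        tokens = expression.lower().split()
        if tokens[: len(prefix_tokens)] == prefix_tokens:
            total += prob
    return total


# Hypothetical norming distribution for one object in one domain
# (illustrative numbers only; they sum to 1).
example_distribution = {
    "the big candle": 0.80,
    "the big red candle": 0.12,
    "the candle": 0.05,
    "the bigger candle": 0.03,
}

p_the_big = partial_re_likelihood(example_distribution, "the big")
print(round(p_the_big, 2))
```

Whether comparative forms such as "bigger" should count as compatible with the initial string is a design choice that the footnote leaves open; the sketch takes the stricter reading.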



[Fig. 2 panels, top to bottom: Triplet-Privileged: Egocentric; Triplet-Privileged: Common Ground; Pairs-Privileged: Egocentric; Pairs-Privileged: Common Ground]

Fig. 2. Modeling components for Triplet-Privileged (top panel) and Pairs-Privileged (bottom panel). The left column is a visual representation of the referential domains. The middle column shows the probabilities for the referring expressions component, P(RE|obj,d), estimated from the norming production task. The right column shows the probabilities for the objects component, P(obj|d), which were determined based on theoretical assumptions.
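The prior P(obj|d) described in Section 3.3 is simple enough to write down directly. The sketch below is one possible encoding, using the values stated in the text (.25 per object in the egocentric domain; .1 for the privileged object and .3 for each shared object in the common ground domain). The function and variable names are hypothetical.

```python
from typing import Dict, List, Optional


def object_prior(
    objects: List[str],
    privileged: Optional[str] = None,
    privileged_mass: float = 0.1,
) -> Dict[str, float]:
    """Return P(obj|d) for one referential domain.

    With no privileged object (egocentric domain), all objects are equally
    likely. Otherwise the privileged object receives `privileged_mass` and
    the remainder is split equally among the shared objects.
    """
    if privileged is None:
        return {obj: 1.0 / len(objects) for obj in objects}
    shared = [obj for obj in objects if obj != privileged]
    prior = {obj: (1.0 - privileged_mass) / len(shared) for obj in shared}
    prior[privileged] = privileged_mass
    return prior


# Triplet-Privileged display: the biggest candle is privileged to the listener.
triplet_objects = ["big candle", "medium candle", "small candle", "funnel"]
prior_e = object_prior(triplet_objects)                           # d = e
prior_c = object_prior(triplet_objects, privileged="big candle")  # d = c
```

Keeping `privileged_mass` as a parameter makes it easy to check the claim that a value smaller than .1 only strengthens the predictions.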

referential expectations based on adjective information alone because this allows us to tap into the earliest moments in referential processing that would be revealing of influences of ground manipulation. Moreover, because the competitor object in the Pairs conditions is a different type of object (funnel vs. candle), looking at the full referring expression in those conditions would disambiguate the referent and obscure the influence of the ground status of the contrast object. The probabilities estimated for P(RE|obj,d) and P(obj|d) in Eq. (2) were entered into our model, and a was varied between 0 and 1 (indicating the relative weight of the two referential domains in the overall probability). The model’s predictions for P(obj|RE) (the probability yielded by Eq. (2)) are plotted in Fig. 3 as a function of the weight a. Thus, this figure shows, for different weightings of the two domains, how likely each of the four objects in the display is to be the referent of a partial referring expression like the big. At the extremes, the model’s predictions amount to the possibilities considered in the original studies, where the choice of referential domain was discussed as being categorical, even though the cues to the domain were assumed to be probabilistic. Let us first consider Pairs-Privileged (left panel). At a = 0 (i.e., using only the common ground domain), the model predicts the big candle to be preferred over the big funnel (.80 vs. .17), and at a = 1 (i.e., using only the egocentric domain) it predicts ambiguity between those two objects. These are the possibilities originally considered in Heller et al. (2008). Next, we consider Triplet-Privileged (right panel of Fig. 3). At a = 0 (i.e., common ground domain only), the medium candle is predicted to be the most likely referent (.89),

and the privileged big candle to be very unlikely. (This latter prediction is driven by the prior, P(obj|d): privileged objects are not expected to be referents; recall that this is with a conservative prior of p = .10 for privileged objects.) At a = 1 (i.e., egocentric domain only), the most likely referent is the biggest candle (.97), while the medium candle is very unlikely (with a probability close to 0, because when it is in a domain with the biggest candle it is not expected to be labeled big). Again, these are the two possibilities originally considered in Keysar et al. (2000). Our approach here is that the choice of domain is not categorical, and instead reference resolution is guided by both domains. That is, not only are the situational cues to the referential domain probabilistic, but the actual choice of domain is probabilistic as well. The relevant patterns therefore are those given for intermediate values of a. For Pairs-Privileged, there is a clear preference for the intended referent (e.g., the big candle) over the other big object (e.g., the big funnel) for virtually all intermediate values. For Triplet-Privileged, intermediate values of a predict a qualitatively different pattern: ambiguity between the preferred referent in domain e (e.g., the privileged biggest candle) and the preferred referent in domain c (e.g., the medium-size candle). We propose that a strategy of using both referential domains can account for the results of both Keysar et al. (2000) and Heller et al. (2008), which have been previously taken to support different perspective-taking strategies. For example, for a = .5, the Triplet-Privileged display will give rise to competition between the two potential referents, whereas the same value in the Pairs-Privileged display will entail a preference for the big object with a size-contrast in common ground over the big object whose size-contrast is privileged.
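To make the role of the weight a concrete, the following sketch computes P(obj|RE) for a Pairs-Privileged display at a few values of a. It assumes that Eq. (2), which is not reproduced in this section, combines the two domains as a weighted sum, a·P(RE|obj,e)·P(obj|e) + (1 − a)·P(RE|obj,c)·P(obj|c), normalized over objects. The priors are those given in Section 3.3; the likelihood values are hypothetical stand-ins chosen to resemble the reported norming pattern (over .90 with a visible size contrast, about .17 without), so the output will not reproduce the numbers in Fig. 3.

```python
from typing import Dict

Dist = Dict[str, float]


def posterior(a: float, lik_e: Dist, lik_c: Dist, prior_e: Dist, prior_c: Dist) -> Dist:
    """P(obj|RE) as a mixture of the egocentric (e) and common ground (c) domains.

    `a` is the weight on the egocentric domain; (1 - a) goes to common ground.
    The mixture form is an assumption about Eq. (2).
    """
    scores = {
        obj: a * lik_e[obj] * prior_e[obj] + (1.0 - a) * lik_c[obj] * prior_c[obj]
        for obj in prior_e
    }
    total = sum(scores.values())
    return {obj: score / total for obj, score in scores.items()}


# Pairs-Privileged: the small funnel is privileged to the listener.
prior_e = {"big candle": 0.25, "small candle": 0.25, "big funnel": 0.25, "small funnel": 0.25}
prior_c = {"big candle": 0.3, "small candle": 0.3, "big funnel": 0.3, "small funnel": 0.1}

# Hypothetical P("the big ..."|obj,d). In common ground the big funnel has no
# visible size contrast, so a size adjective is much less expected for it.
lik_e = {"big candle": 0.95, "small candle": 0.0, "big funnel": 0.95, "small funnel": 0.0}
lik_c = {"big candle": 0.95, "small candle": 0.0, "big funnel": 0.17, "small funnel": 0.0}

predictions = {a: posterior(a, lik_e, lik_c, prior_e, prior_c) for a in (0.0, 0.5, 1.0)}
for a, dist in predictions.items():
    print(a, {obj: round(p, 2) for obj, p in dist.items()})
```

Under these illustrative inputs the two big objects are tied at a = 1, and the big candle is preferred at a = .5 and more strongly at a = 0, which is the qualitative pattern described for Pairs-Privileged.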



Fig. 3. The model’s predictions (using P(obj|RE), Eq. (2)) for the probability that each of the display objects would be the referent of ‘‘the big . . .” as a function of the weight a (a towards 0 weights common ground more; a towards 1 weights egocentric more). The left panel plots the model’s prediction for Pairs-Privileged, and the right panel for Triplet-Privileged.

4. Experiment

Our model simulations demonstrate that a single mechanism where reference resolution is guided by both the egocentric and the common ground domains is able to predict qualitatively different patterns for Triplet-Privileged and Pairs-Privileged, modeled after the critical conditions in Keysar et al. (2000) and Heller et al. (2008), respectively. Our next step is to test these predictions in a visual-world eye-tracking experiment. There are two important motivations for re-testing these conditions from previous studies. First, we employ a different baseline condition than Keysar et al. (2000). The original baseline condition contained an unrelated privileged object in place of the competitor privileged object in the critical condition (e.g., an apple in the baseline instead of a big candle in the critical condition). Results showed that listeners looked more to the privileged object when it matched the referring expression as compared with looks to the unrelated privileged object. While Keysar et al. (2000) interpreted this result as indicating that listeners are egocentric, this comparison does not, in fact, license this conclusion; it only shows that listeners do not completely exclude privileged objects from consideration during reference resolution. To be able to test the effect of ground directly, in the current experiment we employ baseline conditions that differ from the critical conditions only in the ground status of objects. That is, our baseline conditions contain the same array of objects as the critical conditions, with all four objects in common ground. This allows us to investigate the more nuanced question of how the ground status of objects affects reference resolution. For example, if listeners are egocentric, we would predict that the earliest moments of processing will show no difference between the baseline and the privileged conditions, because both have the same array of objects visible to the listener.
The second motivation for our experiment is testing the Triplet-Privileged and the Pairs-Privileged conditions in a within-subjects design, which uses a single set of cues to common ground. This is important because, as already discussed in the introduction, a prominent explanation for why listeners seemed egocentric in Keysar et al. (2000) and adaptive to common ground in Heller et al. (2008) is that the two experimental situations contained different cues to common ground, leading listeners to adopt different perspective-taking strategies (Bezuidenhout, 2013; Brown-Schmidt & Hanna, 2011; Hanna et al., 2003; Kuhlen & Brennan, 2013). In a single session with strong cues to common ground, we can assume that listeners will use the same perspective-taking strategy throughout. Following the procedure of Heller et al. (2008), we created an experimental situation that

gave clear and consistent cues to what information is or is not shared. For example, the procedure did not allow the speaker to know what the privileged objects were, and participants experienced this first hand during practice trials. Furthermore, the confederate speaker indeed did not know the identity of the privileged objects and was, more generally, naïve to the goals of the experiment; this ensured that their knowledge state matched what we told participants – see Kuhlen and Brennan (2013) on best practices in using confederates. The goal was to create a set of cues to common ground that would be interpreted similarly by all participants (for the full details, see under Procedure). In our approach, this is expected to lead to a similar weighting of the two domains across all participants. The four conditions are given in Fig. 4. The baseline conditions had the same objects as the privileged conditions, and they were all visible and thus assumed to be in common ground. This situation is analogous to the egocentric domain in the privileged conditions, allowing us to use the same probabilities we used in modeling the egocentric domains – see again Triplet-Privileged: Egocentric and Pairs-Privileged: Egocentric in Fig. 2. Thus, the predictions for the baseline conditions are analogous to our model simulations shown in Fig. 3 with a = 1. More precisely, just as in the privileged conditions, in the baseline conditions there are two domains that will be weighted with a, representing the weighting in our experimental situation. But in the baseline conditions there is no knowledge mismatch between the conversational partners, and thus the egocentric domain and the common ground domain are identical. For simplicity of presentation, we refer to the predictions in each baseline condition as analogous to a = 1 in the respective privileged condition.3 For the privileged conditions, we expect processing to reflect an “intermediate” value of a (roughly, 0.3 < a < 0.7 in Fig.
3). Note that we cannot determine a specific value of a because we do not expect eye movements to translate quantitatively onto our model’s predictions – we come back to this issue when discussing the results. Thus, we will examine the effect of ground on each array, comparing the direction and strength of the effect for the two arrays. In the Pairs-Baseline condition, we expect that, during the processing of the adjective, there will be ambiguity between the two objects that fit the adjective (e.g., the two big objects) – this is analogous to a = 1 in the left panel of Fig. 3. (This ambiguity will go away as soon as the noun information becomes available in the

3 Formally, because P(RE|obj,d = c) = P(RE|obj,d = e) and P(obj|d = c) = P(obj|d = e) in the baseline conditions, the model’s predictions are constant across all values of a, and they are the same as the predictions for a categorical use of the egocentric domain (d = e) in the privileged conditions.
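The argument in footnote 3 can be spelled out in one short derivation. It assumes that Eq. (2), which is not reproduced in this section, has the form of a weighted sum over the two domains with weight a on d = e and 1 − a on d = c; under that assumption, identical domains make the weight drop out:

```latex
\begin{align*}
P(\mathit{obj}\mid \mathit{RE})
  &\propto a\,P(\mathit{RE}\mid \mathit{obj}, e)\,P(\mathit{obj}\mid e)
     + (1-a)\,P(\mathit{RE}\mid \mathit{obj}, c)\,P(\mathit{obj}\mid c) \\
  &= \bigl[a + (1-a)\bigr]\,P(\mathit{RE}\mid \mathit{obj}, e)\,P(\mathit{obj}\mid e) \\
  &= P(\mathit{RE}\mid \mathit{obj}, e)\,P(\mathit{obj}\mid e).
\end{align*}
```

The second line substitutes the baseline equalities P(RE|obj,c) = P(RE|obj,e) and P(obj|c) = P(obj|e). The result does not depend on a and matches the a = 1 case of the privileged conditions.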



Fig. 4. Sample displays in the four conditions in the eye-tracking comprehension experiment, as seen from the listener’s perspective (the confederate is shown seated on the other side of the display). In the baseline conditions, all four objects were visible to both partners, and thus assumed to be in common ground. In the privileged conditions, one object was visible only to the listener, and thus assumed to be in his privileged ground. In all four conditions, the instruction would be ‘‘pick up the big candle”, and the intended referent is marked with a white circle.

speech stream.) In the Pairs-Privileged condition, our approach of multiple domains predicts less ambiguity: this is because, for intermediate values of a in this condition, the intended referent (e.g., the big candle) is predicted to be preferred over its same-size competitor (e.g., the big funnel). In the triplet array, the intended referent changes across the ground manipulation – see the marked referents in Fig. 4. Nonetheless, because the model makes predictions for all the display objects across all values of a, we can still compare the patterns across the two conditions. In the Triplet-Baseline condition, we expect that, during the processing of the adjective, listeners will anticipate that the speaker will refer to the biggest candle (i.e., the intended referent) – again, analogous to a = 1 in the right panel of Fig. 3. In the Triplet-Privileged condition, the preferred referent depends on the specific value of a, but for all intermediate values the model predicts more ambiguity than with a = 1. (Here, the ambiguity will not go away when noun information becomes available.) In sum, our approach of multiple domains predicts an opposite effect of ground in the pairs array and the triplet array, as seen in the original studies of Heller et al. (2008) and Keysar et al. (2000) (the latter with an appropriate baseline), respectively, but within a single experiment with clear and consistent cues to ground. Alternatively, if listeners focus on common ground alone, we expect to see an overall facilitation of reference resolution in the privileged conditions compared to the baseline condition, as those displays have fewer shared objects and thus fewer potential referents. 
This design also allows us to assess whether listeners are initially egocentric: in such a case we expect no difference between the conditions in the pairs array, where (later) information from the noun allows identification of a unique referent, and we expect late integration of common ground in Triplet-Privileged, where ground is necessary to choosing a referent. Finding that listeners

are fully egocentric or fully adapted to common ground would provide (indirect) support for the proposal that listeners in the original studies indeed used different perspective-taking strategies (we discuss further possibilities under Discussion).

4.1. Method

4.1.1. Participants

Seventy native English speakers from the University of Toronto community participated in exchange for $15 each. We report data from sixty participants; the data from the other participants were not coded, because of equipment problems (n = 5), because the participant made multiple errors in manipulating objects (n = 2), because the participant did not complete the experiment (n = 2), or because it later turned out the participant was not actually a native speaker (n = 1).

4.1.2. Materials

Each trial contained four objects. Two factors were manipulated in a 2 × 2 within-subjects design: Array (pairs vs. triplet) and Ground (baseline vs. privileged). Array was manipulated by changing the array of objects in the display. In the pairs conditions, there were two pairs of objects that contrast in size, where both the big and small objects were matched for size (e.g., a big candle – the target, a small candle – the target-contrast, a big funnel – the competitor, and a small funnel – the competitor-contrast). In the triplet conditions, the display contained three objects contrasting in size (e.g., a big, medium, and small candle), and a fourth object from a different category (e.g., a funnel). The locations of the objects in the display, and, more specifically, the location of the intended referent, were systematically varied across displays. Ground was manipulated by changing whether all four objects in the display were visible to both interlocutors, and thus assumed



to be in common ground (the baseline conditions), or whether there was one object that was only visible to the listener, and thus assumed to be in their privileged ground (privileged conditions). The location of the covered cubbyhole was systematically varied across displays. This manipulation did not change the target in the pairs conditions, but it did cause the target to change in the triplet conditions (see again Fig. 4). Critical instructions had the form ‘‘Pick up the [scalar adjective] [noun], and . . .”. In half of the experimental trials the scalar adjective was big and in the other half it was small. Objects in the display were chosen such that their nominal labels were not phonologically similar to each other. Sixteen experimental displays were constructed for each of the four conditions. One condition was assigned to each of four lists, and rotated across participants using a modified Latin square design. Thus, each display was presented in all four conditions, but any one participant saw only one version of that display. We also created thirty-eight filler displays. Fourteen fillers had displays parallel to those in the critical trials, but with a different target: Six trials had displays like Pairs-Privileged with the target being (i) the object of a size of which there was only one in common ground (e.g., the small candle; two trials), or (ii) the object whose contrast is privileged (e.g., the big funnel; four trials: two with a size adjective and two with a bare noun). Four trials had displays like Triplet-Baseline, but in which the target was the medium object or the singleton object. Finally, four trials had displays like Triplet-Privileged with the target being one of the other objects in common ground. The goal of these fillers was to counteract the biases created by the critical trials, such that listeners could not learn to expect which object will be the referent. 
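The rotation of displays through conditions can be illustrated with a plain Latin square. The text specifies a modified Latin square without giving the modification, so the sketch below shows only the standard rotation, with sixteen displays and four lists as described above; the function name and list representation are hypothetical.

```python
from typing import Dict, List

CONDITIONS = [
    "Pairs-Baseline",
    "Pairs-Privileged",
    "Triplet-Baseline",
    "Triplet-Privileged",
]


def latin_square_lists(n_items: int = 16, conditions: List[str] = CONDITIONS) -> List[Dict[int, str]]:
    """Build one presentation list per condition by rotating item-condition pairings.

    In list k, item i appears in condition (i + k) mod 4, so every item
    occurs in every condition across lists, and each list contains an equal
    number of items per condition.
    """
    n_conditions = len(conditions)
    return [
        {item: conditions[(item + k) % n_conditions] for item in range(n_items)}
        for k in range(n_conditions)
    ]


presentation_lists = latin_square_lists()
```

Each participant would be assigned one of the four lists, so that they see every display once and four displays per condition.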
To divert attention from size contrasts, the remaining twenty-four filler displays had pairs of objects contrasting in color, eight with two pairs of objects contrasting in color, and sixteen with one such pair (eight of these had a privileged object). In all these displays, the target object was always referred to with a color description (i.e., the [color adjective] [noun]), whether it was part of a pair contrasting in color or it was a singleton object. Four of the fillers were used as practice trials. The remaining fifty trials were presented in a pseudo-random order, such that there were no adjacent trials with the same display type (in the four presentation lists). In all the trials, no object type was repeated.

4.1.3. Procedure

Participants performed the role of Matcher in a referential communication task, where the role of Director was performed by a lab confederate. Participants were truthfully told that the Director worked for the lab. Crucially, the confederate was naïve to the goals of the experiment, and generally unfamiliar with research in the lab. More specifically, the confederate did not know the identity of privileged objects, so the knowledge state participants were expected to attribute to the confederate corresponded to the real knowledge state of the confederate (see Kuhlen & Brennan, 2013). Finally, to ensure that the confederate did not become familiar with the task and the display objects over time, each confederate only participated in about twelve sessions (in total, five confederates were used). Our rationale for using a confederate rather than a naïve Director was to control the form of the referring expressions. The procedure was modeled after Heller et al. (2008). Participants sat across from the lab confederate at a table that held a 3 × 3 vertical wooden display.
The middle and the upper middle cubbyholes were covered throughout the experiment, so that the participant could not use the confederate’s gaze as a cue for her referential intention. At the beginning of each trial, the confederate placed covers over the four corner cubbyholes from her side of the display. The experimenter then handed the participant a bag with

the four objects, along with a photograph showing where to place them. When the participant was placing the objects, the confederate turned around so she could not see what objects were being placed. After that, the participant handed the confederate an envelope that contained a photograph showing the final position of objects, as well as how many covers needed to be removed (3 or 4) and which ones. Unknown to the participant, the confederate’s photograph also noted the referring expression to be used. Confederates were instructed to start with Pick up the [referring expression], and continue with an appropriate moving instruction that was improvised on a trial-by-trial basis (e.g. ... and move it to the bottom middle). Confederates were also instructed not to refer to objects other than the target, as we did not want them to produce additional referring expressions. The trial ended when the experimenter confirmed that the participant followed the instruction correctly. At the beginning of the experiment, participants were told that the experiment investigated how people collaborate on a task when their perspectives differ. Participants were explicitly told that the Director could not see the hidden objects, did not know their identity, and thus would not instruct them to manipulate a hidden object. If participants moved a hidden object, they were corrected by the experimenter. The first four trials were used to practice the task; during the practice trials, the Director’s cards did not provide a referring expression. To familiarize participants with the concept of hidden objects, the roles of Director and Matcher were switched on trials 3 and 4. Throughout the experiment, participants’ eye movements were monitored using a head-mounted EyeLink II eye-tracker. The gaze of the participant, superimposed on a video-record of the scene, and both voices were recorded onto a computer.

4.1.4. Statistical analysis

Because our dependent variables are all binary, we used mixed-effects logistic regression models with crossed, independent, random effects for participants and items (Baayen, Davidson, & Bates, 2008; Jaeger, 2008), as implemented in the lme4 package of the statistical software R 3.2.2 (Bates, Maechler, & Bolker, 2012; R Core Team, 2012). The independent variables were contrast coded as ±1 (pairs vs. triplet; baseline vs. privileged). Pairwise comparisons were conducted by re-coding the different levels of the independent variables (West, Aiken, & Krull, 1996). We report models with the maximal structure of random effects supported by the data. To determine the structure of random effects, we used a backwards-selection method (cf. Fine & Jaeger, 2013). Specifically, we started from a model with the maximal structure of random effects supported by the design, and eliminated those random effects that did not improve the performance of the model, starting with the highest interaction and following the “best path” backwards-selection procedure outlined in Barr, Levy, Scheepers, and Tily (2013). Minimally, models included random intercepts for both participants and items. Follow-up comparisons used the same random effects structure used in the full model. For each model, we report the random effects included in the final model.

4.2. Results

Participants’ eye movements were manually coded from the video records using Adobe Premiere. We coded eye movements for the interval beginning 200 ms before the onset of the adjective and ending when the participant touched an object. Fig. 5 plots, for each condition, the proportion of fixations to each of the four display objects over time (trials are aligned to the onset of the adjective; the average onset of the noun was 348 ms).
The shaded area marks the (average) interval of adjective processing; it is offset by 200 ms, in order to account for the time estimated to program and launch a saccade (Hallett, 1986).







Fig. 5. Proportion of fixations to the four objects over time for each of the four conditions. Trials are aligned to the onset of the scalar adjective (e.g., ‘‘big”) at 0 ms. The average length of the adjective was 348 ms and the average length of the noun was 449 ms. The shaded area is the interval of the processing of the adjective: the duration of the adjective offset by 200 ms, the estimated time it takes to program and launch an eye movement.
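The offset analysis window described in the caption amounts to shifting the adjective interval by the saccade-programming latency. A minimal sketch follows, using the averages reported above (adjective onset at 0 ms, average adjective duration 348 ms, 200 ms latency). The function names and the half-open treatment of the window boundaries are assumptions of this example.

```python
from typing import Tuple

SACCADE_LATENCY_MS = 200  # estimated time to program and launch a saccade


def analysis_window(
    adjective_onset_ms: int,
    adjective_duration_ms: int,
    latency_ms: int = SACCADE_LATENCY_MS,
) -> Tuple[int, int]:
    """Return (start, end) of the adjective-processing window, shifted by latency."""
    start = adjective_onset_ms + latency_ms
    end = adjective_onset_ms + adjective_duration_ms + latency_ms
    return start, end


def saccade_in_window(launch_ms: int, window: Tuple[int, int]) -> bool:
    """True if a saccade launched at `launch_ms` falls in [start, end)."""
    start, end = window
    return start <= launch_ms < end


window = analysis_window(adjective_onset_ms=0, adjective_duration_ms=348)
print(window)
```

With the average values this gives a window from 200 ms to 548 ms after adjective onset. A per-trial analysis would pass each trial's own adjective duration instead of the average.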

In the Pairs-Baseline condition, with all four objects in common ground, the processing of the adjective leads to the separation of looks to the two objects that fit the adjective (e.g., the two big objects) from looks to the two other objects (e.g., the small objects). Looks to the intended referent (e.g., the big candle) and the same-size competitor (e.g., the big funnel) separate later, upon hearing the disambiguating information from the noun. (The separation seems to start somewhat earlier than the average noun onset; this is expected given the variability in noun onset, 100–880 ms, median 332 ms.) Importantly, in the Pairs-Privileged condition, looks to the intended referent start separating from looks to the same-size competitor immediately upon hearing the adjective (see Heller, Grodner, & Tanenhaus, 2008, for similar plots). Qualitatively, comparing these two fixation profiles suggests that, for the array of pairs, changing the ground status of one object from common ground to privileged had a facilitatory effect. Turning to the triplet array, we find that when all the objects are in common ground in the Triplet-Baseline condition, looks to the intended referent (e.g., the biggest candle) separate from looks to all other objects immediately upon hearing the adjective. In the Triplet-Privileged condition, there is a tendency to look at the intended referent (e.g., medium candle) over the privileged object (e.g., big candle) even before hearing the referring expression. Nevertheless, upon hearing the adjective, looks to both objects start to rise, suggesting that the interpretation of the referring expression

leads listeners to consider both objects as referents. Thus, unlike in the pairs array, here changing the ground status of an object to privileged introduces more competition in reference resolution. To evaluate the model, we consider how eye-movement patterns map onto the different probabilities used in the model simulations. We first evaluate one of the component probabilities – P(obj|d) – that is, the prior probability that different objects will serve as the referent; recall that we determined the estimates of this probability based on theoretical considerations. Following that, we examine the output of the model, namely, the model’s simulation for the overall probability P(obj|RE) (compare Fig. 3). (Recall that the third probability – the referential fit P(RE|obj,d) – was empirically estimated from production data.)

4.2.1. Evaluating the prior P(obj|d)

One of the two component probabilities in our model, P(obj|d), is the prior probability that listeners will expect different objects as referents before any referring expression is uttered. Recall that we estimated these probabilities based on theoretical considerations, assuming that (i) privileged objects are less likely to be referents than objects in common ground, and (ii) all objects in common ground are equally likely (see again the right column of Fig. 2). In terms of eye movements, this component is expected to map onto anticipatory effects. To this end, we computed how likely participants were to already be looking at the intended referent



(the “target”) at the onset of the adjective, namely before hearing any descriptive content of the referring expression – see Fig. 6. We fitted a mixed-effects logistic regression model with ground and array as predictors; the random effect structure supported by the data was a random intercept for both participants and items. First, there was a main effect of ground (b = 0.26, SE = 0.08, z = 3.15, p = .002), indicating that participants were more likely to fixate the target in the privileged conditions. This finding is expected from the probabilities we used in the simulations: the target is a more likely referent when there are only three objects in common ground. There was also a main effect of array (b = 0.23, SE = 0.08, z = 2.78, p = .006), indicating that participants were more likely to fixate the target when the array contained a triplet. The main effect of array is unexpected given the experimental design (see again Materials), and it is also not reflected in the probabilities we chose for modeling, where we assumed all common ground objects to be equally likely. However, since our goal in choosing P(obj|d) was just to model effects of ground, and we assumed ground had the same effect on both arrays, what is important for our purposes here is that the Array × Ground interaction was not significant (b = 0.10, SE = 0.08, z = 1.21, p = .23), confirming that – as in our probabilities – the effect of ground was not different across arrays.
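One way to see why the chosen priors predict more anticipatory looks to the target in the privileged conditions is to combine the two domain priors with the weight a. The calculation below assumes, for illustration only, that the listener's overall prior expectation is the same weighted combination of the two domains that is used for P(obj|RE); the values .25 and .30 are the priors from Section 3.3 for a target in common ground.

```latex
\begin{align*}
P_{\text{privileged}}(\mathit{target})
  &= a\,P(\mathit{target}\mid d=e) + (1-a)\,P(\mathit{target}\mid d=c)
   = .25\,a + .30\,(1-a) = .30 - .05\,a, \\
P_{\text{baseline}}(\mathit{target})
  &= .25 .
\end{align*}
```

For any a below 1 the first quantity exceeds .25 (for example, .275 at a = .5), and the difference is the same in the pairs and triplet arrays, in line with a main effect of ground without an interaction.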

To examine how the eye-movement patterns map onto the model’s predictions given by P(obj|RE) (compare Eq. (2) and Fig. 3), we calculated the likelihood of saccades to the intended referent (the “target”) that were launched during the interval of adjective processing (a window of 348 ms) – see Fig. 7. We focused on those trials that had saccades during this time window (593 trials), which were distributed about evenly between the conditions (pairs-b 150; pairs-p 150; triplet-b 155; triplet-p 138). While focusing on saccades entails data loss (because not all trials had a saccade during this relatively short interval), this analysis has two important advantages over analyzing fixation probabilities. First, a dependent measure of saccades gives the same status to all eye movements during this interval independent of when they were generated during that window, which is a better reflection of the model’s predictions that were based on the entirety of the partial referring expression (e.g., “the big”), and not on an unfolding linguistic signal. In addition, because saccades reflect attention shift, they unambiguously reflect the integration of information from the referring expression, rather than the continuation of fixations that may have been generated before the referring expression was heard, which in our model would correspond to P(obj|d) alone. (To show the robustness of the observed pattern, we present an analysis of fixations in Appendix C.) We fitted a mixed-effects logistic regression model with array and ground as predictors. The random effect structure supported by the data was a random intercept for both participants and items. The main effect of array was not significant (b = 0.08, SE = 0.08, z = 0.92, p = .36), indicating that adjective information was not overall a better fit to one of the arrays.
More importantly, the effect of ground was also not significant (b = 0.01, SE = 0.08, z = 0.06, p = .95), indicating that the ground manipulation (i.e., adding a privileged object) did not overall facilitate reference resolution – an effect that would be expected if listeners simply focused on common ground information. The Array × Ground interaction, however, was significant (b = 0.24, SE = 0.08, z = 2.85, p = .004), indicating that the ground manipulation had a different effect depending on the array of objects. Follow-up comparisons indicated that participants were significantly more likely to launch a saccade to the target in the Pairs-Privileged than in the Pairs-Baseline condition (b = 0.24, SE = 0.12, z = 2.08, p = .037). The effect was in the opposite direction in the triplet array, with fewer saccades to the target in Triplet-Privileged than in Triplet-Baseline (b = 0.23, SE = 0.12, z = 1.95, p = .0509). Importantly, this cross-over interaction is the effect of ground on array predicted by our model simulation: less ambiguity for the Pairs-Privileged compared to its baseline, but more ambiguity for Triplet-Privileged compared to its baseline.

Finally, in addition to the eye-movement data, we also considered the object touched by listeners, which provides a later measure of referent choice. In the pairs array (either baseline or privileged), where the complete referring expression is compatible with only one object independent of ground, participants performed at ceiling (over 98% correct). Similarly, there were very few errors in the Triplet-Baseline condition (under 2%). However, in the Triplet-Privileged condition, where the ambiguity is not resolved by noun information, participants touched the privileged object on 15% of the trials, so accuracy was significantly lower than in the Triplet-Baseline condition (b = 3.60, SE = 1.47, z = 2.45, p = .01).
This result further shows that adding a privileged object in the triplet array introduces ambiguity, which is predicted by our model’s simulations.4
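As a reading aid for the statistics reported in this section, the sketch below shows how a reported z value maps onto a two-sided p value under a standard normal reference distribution. It assumes that the reported z values are Wald statistics, which is the usual output of mixed-effects logistic regression software but is not stated explicitly here; the function name `two_sided_p` is illustrative.

```python
import math


def two_sided_p(z):
    """Two-sided p value for a z statistic under a standard normal.

    Assumption: the reported z values are Wald statistics.
    """
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)) for Z ~ N(0, 1)
    return math.erfc(abs(z) / math.sqrt(2.0))


# z values copied from the analyses reported above
reported = {
    "ground, anticipatory looks": 3.15,
    "Array x Ground, anticipatory looks": 1.21,
    "Array x Ground, saccades": 2.85,
}
for label, z in reported.items():
    print(f"{label}: z = {z:.2f}, p = {two_sided_p(z):.4f}")
```

Small mismatches in the last digit relative to the reported p values are to be expected, because the z values in the text are themselves rounded.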

4.2.2. Evaluating the model’s predictions, P(obj|RE)

The output of the model is the probability that different objects will serve as referent for the referring expression heard. To

4 It is worth mentioning that our finding where the privileged object was touched on 15% of the trials is similar to Keysar et al.’s (2000) findings in the parallel situation where participants reached for the privileged object on 23% of the trials.

Fig. 6. The likelihood that participants were already looking at the intended referent at the beginning of the adjective (i.e., before any descriptive information was heard) across the four conditions.

Fig. 7. The likelihood of making a saccade to the target during the processing of the adjective, across conditions.


4.3. Discussion

Our results show that the effect of ground on reference resolution (specifically, on the processing of the adjective) depends on the specific properties of the array of objects. In the pairs array, making the competitor-contrast privileged facilitated reference resolution, replicating the critical finding from Heller et al. (2008). In the triplet array, making the object that best fits the referring expression privileged actually hindered reference resolution (because we used a different baseline condition, this finding goes beyond a replication of Keysar et al. (2000)). The fact that the pairs and triplet arrays of objects were affected in the opposite direction by the ground manipulation is predicted by our model’s simulations (see again Fig. 3), and thus supports our approach that reference resolution involves the consideration of multiple referential domains.

Our dependent measure of saccades to the target is indirectly affected by anticipatory looks to the target, because a saccade to the target is not possible if one is already looking at the target. However, this indirect effect cannot be responsible for our results: that is, the full pattern of saccades we observe here cannot simply arise as a by-product of the anticipatory effects. This is because the main effect of ground we found in the anticipatory analysis (see again Fig. 6) cannot indirectly generate the cross-over interaction observed with saccades (see again Fig. 7). Thus, the pattern of saccades indeed reflects the effect of reference resolution (see also the fixation analysis in Appendix C).

The full pattern of results cannot be explained by other theories of perspective taking. While previous theories (Barr, 2008b; Hanna et al., 2003; Keysar et al., 2000) can potentially account for the pattern of added ambiguity in Triplet-Privileged compared to Triplet-Baseline, they are unable at the same time to explain the facilitation in Pairs-Privileged compared to Pairs-Baseline.
First, it is not explained by “egocentric first” (Keysar et al., 2003), because the temporary ambiguity is not expected to trigger the use of common ground in Pairs-Privileged. This aspect of the result is also not predicted by the constraints proposed in Hanna et al. (2003), because they posit that the referring expression is evaluated perceptually against all the available objects, which does not distinguish the target (e.g., big candle) from its competitor (e.g., big funnel). Finally, this pattern is also not predicted by Barr’s (2008b) “anticipation-without-integration”, which posits that effects of common ground are only anticipatory (in focusing on the set of potential referents), and do not limit the interpretation of the linguistic signal to those objects.

In this context it should be emphasized that the pattern of saccades in the pairs array cannot be reduced to an anticipatory effect of ground. Because listeners were more likely to already be looking at the target in Pairs-Privileged than Pairs-Baseline, this might have created a by-product where listeners were subsequently less likely to make a saccade to the target during the processing of the adjective. However, here listeners were actually more likely to make a saccade to the target in Pairs-Privileged, indicating that this is an effect of integration, not anticipation. Instead, a theory that assumes that listeners simply focus on common ground objects could account for the facilitation in the pairs array, but such a theory would predict facilitation in the triplet array as well, where (contrarily) the ground manipulation hindered reference resolution. In conclusion, our model is successful in accounting for both patterns because it considers the labels (i.e., referring expressions) expected in both referential domains.

4.3.1. What alpha did listeners use?
The pattern in the pairs array indicates that in our experiment listeners were not completely egocentric, or, in other words, they employed an alpha that was smaller than 1. The pattern in the triplet array, in turn, shows that listeners were not completely adapting to common ground, or, in other words, employed an alpha that is bigger than 0. Finally, the fact that in the Triplet-Privileged condition, the medium candle in common ground was preferred over the biggest (privileged) candle suggests that in our experiment the common ground domain was weighed more than the egocentric domain. In other words, we conclude that listeners employed an alpha that was between 0 and .5. This weight of the two domains is expected given that our experimental procedure contained clear and consistent cues to common ground.

The reasoning above leaves us with a range of possible alpha values for modeling the results. Pinpointing a more specific alpha value is not warranted because the model’s predictions are not expected to directly map onto eye-movement probabilities, for two reasons. First, as noted earlier, we calculate the probability of partial referring expressions by summing the probabilities of complete referring expressions that are each compatible with the initial substring under consideration (e.g., the big). However, while these probabilities are likely to be indicative of the relative likelihoods of partial referring expressions, we do not expect them to precisely reflect incremental processing of a referring expression that unfolds over time. Second, we do not expect eye movements to solely reflect the identification of the intended referent, but also other scanning of the visual context that may contribute indirectly to reference resolution. For example, Heller and Chambers (2014) found that upon hearing a size adjective (e.g., big), listeners not only direct visual attention towards objects that fit the property encoded in the adjective (e.g., big objects), but also towards objects that serve as potential contrasting objects (e.g., small objects). For these two reasons, we take our results to suggest a range for alpha (i.e., 0 < alpha < .5), rather than a particular value.

4.3.2. The nature of simultaneity

The opposite effect of ground on reference resolution that is obtained in the two critical conditions reflects the simultaneous consideration of two domains. We note that there are two possible interpretations of simultaneity: (i) on each trial, the process of reference resolution is guided by both referential domains, which are weighed with alpha according to their relative contribution in the situation, and (ii) alpha determines the relative likelihood that each of the two domains will be used, but on each trial only one domain is selected to guide interpretation. While we favor the former proposal, both interpretations are compatible with our model: the former would mean that the probability formula captures the state of an individual on every single trial, whereas the latter would mean that the probability formula captures the overall behavior of an individual (i.e., across all trials). Note that since our statistical modeling incorporates by-participant dependencies, it rules out a third, population-level, interpretation where some participants were fully egocentric whereas others were fully adaptive to common ground.

Because the data patterns considered here require aggregating over trials and participants (the adjective interval is too short to allow for eye movements that would directly reflect the weighing of the two domains), we cannot conclusively choose between possibilities (i) and (ii) above based on the current evidence. Future work, likely using a different methodology, is needed to provide conclusive evidence that could distinguish these two alternatives. Nonetheless, we favor the simultaneous consideration of domains on every single trial for two reasons. First, we note that other work has used an aggregate measure in the visual world to infer the simultaneous consideration of multiple alternatives.
For example, in a classic study of spoken word recognition, Allopenna, Magnuson, and Tanenhaus (1998) found an equal likelihood of fixating a target (e.g., beaker) and its onset-sharing competitor (e.g., beetle) across trials and participants. They interpreted this aggregate finding as an indication that listeners accessed the two



representations of beaker and beetle simultaneously when hearing the first syllable of beaker. While the processing of contextual information may be different from processing at the lexical level, it is clear that aggregate data can be seen as supporting an interpretation of simultaneity. More importantly, we note that our data contains suggestive evidence that listeners were indeed considering both domains simultaneously, in the form of trials that contained multiple saccades to different potential referents. In the pairs conditions, the relevant window (namely, the length of the adjective: 348 ms) is too short to allow for multiple saccades on most trials, but in the triplet condition, the potential referents in the two domains are compatible with the full referring expression. Thus, we examined those trials in Triplet-Privileged on which listeners made a saccade to both the intended referent (i.e., the predicted referent for d = c) and to the privileged object (i.e., the predicted referent for d = e). Because we are interested in simultaneity not simply at the trial level but at the individual participant level, we counted how many participants had at least two such trials, finding 43 participants (or 72%) in Triplet-Privileged, compared to 27 participants (or 45%) in Triplet-Baseline. This suggests that this pattern is tied to the effect of ground, rather than reflecting a general pattern of scanning in this type of display. However, one may suggest that trials on which listeners first made a saccade to the privileged object and then to the intended referent in common ground instead reflect an egocentric behavior that is later corrected. To address this issue, we also calculated a more conservative measure, taking into account only those trials where the saccade to the intended referent occurred before the saccade to the privileged object. 
This did not change the pattern: 23 participants (or 38%) showed this behavior on two or more trials in Triplet-Privileged versus only 4 participants (or 7%) in Triplet-Baseline. While more conclusive evidence of within-trial simultaneity is needed, this pattern in our data is more compatible with the interpretation that reference is guided by both domains simultaneously, than that people are using both domains in a probabilistic way (weighted by alpha), on a trial-by-trial basis.
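To make the role of alpha discussed in Section 4.3.1 concrete, the following sketch mixes an egocentric and a common ground domain for a Triplet-Privileged display and the partial expression “the big”. It is a minimal illustration, not the simulation reported in this paper: it assumes that alpha is the weight of the egocentric domain (d = e) and 1 − alpha the weight of the common ground domain (d = c), that the posterior is proportional to the alpha-weighted sum of P(RE|obj,d) · P(obj|d) over the two domains, and all probability values and names (`PRIOR`, `LIKELIHOOD`, `posterior`) are hypothetical.

```python
# Hypothetical illustration of weighing two referential domains with alpha.
# Assumption: alpha weighs the egocentric domain ("e"), 1 - alpha the
# common ground domain ("c"). All numbers below are made up.

# P(obj | d): the privileged (biggest) candle is less likely in common ground.
PRIOR = {
    "e": {"biggest_candle": 0.25, "medium_candle": 0.25,
          "small_candle": 0.25, "funnel": 0.25},
    "c": {"biggest_candle": 0.10, "medium_candle": 0.30,
          "small_candle": 0.30, "funnel": 0.30},
}

# P(RE | obj, d) for the partial referring expression "the big".
LIKELIHOOD = {
    "e": {"biggest_candle": 0.90, "medium_candle": 0.05,
          "small_candle": 0.00, "funnel": 0.00},
    "c": {"biggest_candle": 0.00, "medium_candle": 0.90,
          "small_candle": 0.00, "funnel": 0.00},
}


def posterior(alpha, prior=PRIOR, likelihood=LIKELIHOOD):
    """Return P(obj | RE), mixing the two domains with weight alpha."""
    weight = {"e": alpha, "c": 1.0 - alpha}
    score = {
        obj: sum(weight[d] * likelihood[d][obj] * prior[d][obj] for d in weight)
        for obj in prior["e"]
    }
    total = sum(score.values())
    return {obj: s / total for obj, s in score.items()}


for alpha in (0.0, 0.25, 0.5, 1.0):
    p = posterior(alpha)
    print(alpha, round(p["medium_candle"], 3), round(p["biggest_candle"], 3))
```

With these made-up numbers, alpha = 0 puts all probability on the medium candle, alpha = 1 favors the privileged biggest candle, and intermediate values leave both objects in play, which is the qualitative pattern that the reasoning about the range of alpha relies on.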

5. General discussion

We set out to re-examine some results in the psycholinguistics literature regarding whether listeners use common ground information in reference resolution, focusing on two studies that seem to use similar methods and yet have led to opposite conclusions about the early use of common ground: Keysar et al. (2000) and Heller et al. (2008). We developed a new approach in which reference resolution is guided by the simultaneous consideration of multiple referential domains (both egocentric and common ground), which we formalized in a Bayesian probabilistic model. This allowed us to account for the aforementioned seemingly-contradictory results as arising from a single perspective-taking strategy. We took the first step towards providing empirical support for our model by testing the critical cases from Keysar et al. (2000) and Heller et al. (2008), along with appropriate baselines, using a within-subjects design with clear and consistent situational cues to common ground. The results support our model’s predictions. The fact that we obtained these results in a within-subjects design means that the findings in the original studies are unlikely to have been a result of listeners adopting different perspective-taking strategies, as has been suggested in the literature (Bezuidenhout, 2013; Brown-Schmidt & Hanna, 2011; Kuhlen & Brennan, 2013). Our model is currently the only theory of perspective-taking that can account for the full set of results.

What is novel about our approach to perspective taking is that the two possible perspectives – the egocentric and the common ground – contribute to reference resolution as referential domains.

Our model reveals that the apparent differences between the findings of Keysar et al. (2000) and Heller et al. (2008) are due to how reference resolution patterns change across the two referential domains. Importantly, our results demonstrate that listeners develop referential expectations based on the labels expected in each of the referential domains, probabilistically integrating these expectations from the two domains. The referential nature of the model lies in the fact that it doesn’t just take into account how likely objects are to be in common ground (which is captured in our P(obj|d)), but crucially also considers the referring expressions expected for objects, which takes into account what objects the referent needs to be distinguished from in the referential domain (this is captured in our P(RE|obj,d)). Our model contrasts, first, with the implicit assumption in the literature that the probabilistic cues to common ground in the experimental situation lead to a categorical choice of a single referential domain. Our model also contrasts with the more recent approach of Brown-Schmidt (2012) where the status of each object as common or privileged is probabilistic: this approach uses a nuanced P(obj|d), but does not capture how this will affect the referring expression expected (i.e., it does not represent P(RE|obj,d)). The goal of the current paper is to introduce the model and demonstrate the advances it allows, and also to set the stage for further research it opens up. One aspect of the model that is ripe for future research is the nature of the weight alpha that weighs the contribution of the two domains in our model. How is alpha determined? We believe that it can be decomposed into a number of different components. 
First, we expect alpha to be affected by the strength of the situational cues to what is in common ground, and how much a listener can trust these cues, which have been argued in past research to affect whether common ground is used (Brown-Schmidt & Hanna, 2011; Hanna et al., 2003; Kuhlen & Brennan, 2013). Factors of this type include, for example, the interactivity of the situation (Brown-Schmidt, 2009a) and the level of engagement in the task (Ferguson, Apperly, Ahmad, Bindeman, & Cane, 2015). Note that our model is not restricted to information that enters common ground by physical co-presence as in the situation investigated here, but can deal with different sources of common ground information, such as linguistic mention. To this end, we furthermore propose that alpha is not determined once per conversation, but rather can change over time, as more information is added into common ground, or more information is inferred about the knowledge state of the conversational partner. A different component of alpha may come not from the situation, but rather from the individual performing reference resolution. For example, it has been shown that the ability to use common ground information is correlated with executive function, specifically with inhibition control abilities (Brown-Schmidt, 2009b; Nilsen & Graham, 2009) and with working memory capacity (Lin, Keysar, & Epley, 2010). Since in our model the contributions of the two domains are dependent, weighing the common ground domain more requires suppressing the egocentric domain. Thus, in our model it follows naturally that the better an individual is at inhibiting their own perspective, the more influence will be given to common ground. Further research should also address other factors that determine the probability distributions used by the model: P(RE|obj,d) and P(obj|d). 
First, in relation to P(RE|obj,d), we note that the simple arrays of objects used in this paper (and in most other experiments) created situations where each object had a single referring expression that was overwhelmingly preferred. These relatively extreme probabilities lead to very sharp predictions in the model for the experimental conditions. In the real world, speakers exhibit considerable variability in referring to a single object, which, in addition to the composition of the referential domain, could also depend on other factors such as the identity of the speaker.


Importantly, since our model draws on the actual probabilities of referring expressions for each object, and not simply what is the most-preferred expression, our model readily applies to these more complex situations; in (more) realistic situations, the resulting probability distributions will simply be ‘‘flatter” because each object will not have a clearly preferred label. Turning next to the prior, P(obj|d), we made a number of simplifying assumptions in our simulations which have enabled us to make sharper predictions. First, we assumed that all shared objects have the same probability and all privileged objects have the same (lower) probability. However, Brown-Schmidt (2012) has already shown that different conversational moves render information in common ground to different degrees; this finding could be incorporated into our model by assigning different prior probabilities to different objects in common ground. In addition, it should be noted that in the common ground domain, privileged information may not be less likely in all situations. For example, certain utterances, such as an information question, may in fact render privileged information more likely than shared information (cf. Brown-Schmidt et al., 2008). Thus, the probabilistic mechanism we propose can readily generalize to the full range of factors at play in real world reference resolution. Our second simplifying assumption for the prior P(obj|d) was that we determined the probability of potential referents based on ground information alone. But our model can accommodate in a similar fashion differences that are due to other sources of information. For example, a verb like eat may raise the likelihood of edible objects to be referents (Altmann & Kamide, 1999), and a verb like pour may raise the likelihood of pourable objects (like a liquid rather than solid egg) to be a referent (Chambers et al., 2004). That is, the prior may not be sensitive only to situational cues such as the common vs. 
privileged status of information, but may also incorporate linguistic information coming from the unfolding utterance, such as utterance type and even more specific information associated with individual lexical items in the speech stream. We have presented this probabilistic model as capturing reference resolution in an individual at each point of reference resolution – that is, we propose that an addressee simultaneously weighs (according to alpha) evidence from both common ground and their egocentric perspective in developing expectations for the intended referent. Our experimental findings are also consistent with an alternative in which each individual probabilistically (in proportion to alpha) samples one or the other of the domains, and the selected domain categorically then guides reference resolution in that instance. We have noted that independent theoretical motivations (the necessity of conversational participants to continuously monitor and update common ground), as well as preliminary suggestive evidence (the pattern of saccades to multiple objects in our data), favor our proposal that individuals simultaneously weigh multiple domains. But much work is needed to consider how our proposal plays out over the time-course of an unfolding speech stream; for example, whether individuals are indeed simultaneously drawing on multiple domains, and how the various probabilities are tracked as evidence dynamically accrues. Further research, using alternative experimental methodologies, is necessary to consider effects at a finer-grained time scale than our current experiments could reveal, and such experiments should also help to illuminate the precise nature of simultaneity. Finally, while we developed our approach of multiple domains in order to shed new light on a controversy regarding perspective-taking behavior, this approach may prove relevant beyond common ground, as a general approach to referential domains. 
It is possible that reference resolution has been generally assumed to involve a single referential domain, because, in many situations, only one domain is salient in the context, giving the impression that it alone influences reference resolution. In some situations – like the case discussed in this paper where there is a clear knowledge mismatch between the conversational partners – there may be more than one salient domain. Such cases will be important in revealing that the interpretation of a referring expression is simultaneously affected by multiple domains.

Acknowledgements

We gratefully acknowledge the support to the first author from the Social Sciences and Humanities Research Council of Canada (SSHRC), and to the second and third authors from the Natural Sciences and Engineering Research Council of Canada (NSERC). We are extremely grateful to Assunta Ferrante and Michelle Scott for their assistance with the preparation of experimental material and with data collection and coding.

Appendix A. Estimating P(RE|obj,k) via production

A.1. Participants

Participants were twenty members of the University of Toronto community. All were native speakers of English, and were paid $10 for their participation.

A.2. Materials

Because of the large number of observations required, we used a computer-based display and not real objects like in the comprehension experiment. The visual materials consisted of a 3 × 3 grid with four images appearing in the four corners – similar to the vertical shelf. Sixty-four critical displays were created in four versions: pairs-e and pairs-c, and triplet-e and triplet-c. The pairs-e displays contained two pairs of objects contrasting in size, where the two bigger and two smaller objects were of equal size. The pairs-c displays contained three of those objects, with the fourth square marked HIDDEN (see again the left column of Fig. 2). The triplet-e displays contained three objects contrasting in size, and an unrelated object (that varied in size across displays). The triplet-c displays contained two of the triplet and the unrelated object; the fourth square was marked HIDDEN (see again Fig. 2). Overall, half of the displays were constructed to model a referring expression containing the adjective big and the other half to model the adjective small. To avoid a correlation between absolute size and the intended adjectives, four absolute sizes of images were created with different combinations used in different displays, creating different relative sizes across displays.

Sixteen trials were assigned to each display type using a list design. In each display type, one of the four images was the intended referent on four trials. The HIDDEN square was the intended referent only once for each display type. Thus, each display appeared in all four types, but any single participant only saw a display once, referring to just one of its images. This resulted in fifty-eight experimental trials in each list.

Eighty-four filler displays were added to each list. To draw attention away from size contrasts, thirty-two displays contained two objects that contrasted in color, thirty-two displays contained a pair of objects whose names share the first sounds (e.g., button–butter), and eight displays contained two images that contrasted along a different dimension (e.g., an open and a closed mailbox). In half of these seventy-two fillers, the referent was a member of a contrasting pair, and in the other half the referent was not a member of a contrasting pair. Finally, twelve displays contained four unrelated images. Overall, some of the filler displays contained a HIDDEN square.



Four presentation lists were created, with one hundred forty-two trials each. Trial order was pseudo-randomized, such that there were no adjacent trials of the same type.
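The trial counts in the Materials can be cross-checked with a few lines of arithmetic. The sketch below encodes one reading of the design, which is an assumption rather than something stated in this exact form: in the fully visible displays (pairs-e, triplet-e) each of the four images is the target on four trials, whereas in the displays with a HIDDEN square (pairs-c, triplet-c) each of the three visible images is the target on four trials and the HIDDEN square once. Variable and function names are illustrative.

```python
# Assumed reading of the design in A.2 (not spelled out in this form there).
TRIALS_PER_IMAGE = 4


def critical_trials(visible_images, hidden_target_trials):
    """Number of critical trials for one display type."""
    return visible_images * TRIALS_PER_IMAGE + hidden_target_trials


per_type = {
    "pairs-e": critical_trials(4, 0),
    "pairs-c": critical_trials(3, 1),
    "triplet-e": critical_trials(4, 0),
    "triplet-c": critical_trials(3, 1),
}
experimental = sum(per_type.values())

# Filler counts as listed in the text:
# color contrast, onset-sharing names, other contrast, unrelated images.
fillers = 32 + 32 + 8 + 12

print(per_type, experimental, fillers, experimental + fillers)
```

Under this reading, the per-type counts add up to the fifty-eight experimental trials, and together with the fillers to the one hundred forty-two trials per list reported above.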

A.3. Procedure

Participants performed the role of Director in a referential communication task with a (male) lab confederate. We used a confederate in order to avoid the (natural) situation where the matcher helps the director in choosing the referring expressions (cf. Clark & Wilkes-Gibbs, 1986). The confederate was naïve to the goals of the experiment and did not know what the intended referents were. In order for participants to believe that the confederate had real informational needs in the task, participants were made to believe that the confederate was a naïve participant (see Kuhlen & Brennan, 2013 for discussion). Participants sat at a computer, with the confederate sitting at a different computer in the same room, such that the participant and confederate could not view each other’s monitors.

On each trial, the participant saw a 3 × 3 grid with four images in the corner squares; some trials had only three images with the fourth square colored gray and marked HIDDEN. Participants were told that for their HIDDEN squares the Matcher (=the confederate) saw an image with a gray background, so the matcher would know that the Director could not see that image. The first three trials were considered practice, and participants had an opportunity to see the Matcher’s monitor with such an image (after the practice trials, the confederate did not actually see images in the HIDDEN location). On each trial, one of the images was highlighted, and the participant was asked to produce an instruction for the confederate to click on that image. Participants were free to say whatever they wanted, with the exception that they were instructed not to use image locations. After the participant instructed the confederate which image to click on, the confederate indicated his understanding by saying yes or ok. If the participant did not give sufficient information to identify the image, the confederate asked which one?, but did not offer a referring expression. The conversation was audio-recorded onto a computer.

A.4. Coding

A naïve research assistant transcribed all conversations, and classified every referring expression into one of seven categories: (i) BIG, which included referring expressions with size adjectives like big, bigger and biggest, as well as large and its derivations; (ii) MEDIUM, which included referring expressions with adjectives like medium or medium-sized; (iii) SMALL, which also included smaller and smallest, little and its derivations; (iv) OTHER-MODIFIER, for those referring expressions that included a modifier that did not encode size information; (v) NO-MODIFIER, for those referring expressions that included a bare noun (e.g., the candle); (vi) NOT-THIS, for referring expressions of the form not the candles and not the funnel which was used in referring to the hidden object; (vii) OTHER, for referring expressions that did not fit in any of the previous categories. If the speaker corrected themselves, we classified the referring expression according to the form used in the corrected version. This is because the corrected form should better reflect the expectation this speaker would have if they were the listener (this affected 3% of the data).

Appendix B. List of items in the eye-tracking experiment (in alphabetical order)

Referring expression     Pairs array                         Triplet array
the big candle           2 candles, 2 funnels                3 candles, small funnel
the big giftbag          2 giftbags, 2 (plastic) bottles     3 giftbags, small bottle
the big hairclip         2 hairclips, 2 staplers             3 hairclips, small stapler
the big lego             2 legos, 2 bowls                    3 legos, small bowl
the big lock             2 locks, 2 bottles of nailpolish    3 locks, small nailpolish
the big paintbrush       2 paintbrushes, 2 spatulas          3 paintbrushes, small spatula
the big scissors         2 pairs of scissors, 2 combs        3 pairs of scissors, small comb
the big sponge           2 sponges, 2 Tupperware             3 sponges, small Tupperware
the small block          2 blocks, 2 plastic eggs            3 blocks, big plastic egg
the small bow            2 bows, 2 tape dispensers           3 bows, big tape dispenser
the small can            2 cans, 2 deodorants                3 cans, big deodorant
the small coffee cup     2 coffee cups, 2 graters            3 coffee cups, big grater
the small duck           2 (rubber) ducks, 2 (toy) cars      3 ducks, big car
the small gluebottle     2 gluebottles, 2 mugs               3 gluebottles, big mug
the small jar            2 jars, 2 measuring cups            3 jars, big measuring cup
the small pot            2 (flower) pots, 2 8 balls          3 pots, big 8 ball

Appendix C. An alternative analysis

There has been an ongoing debate in the literature about how to analyze visual-world data – specifically, how to identify effects of the processing of the linguistic signal, separating them from anticipatory effects (i.e., where listeners are looking before they process the critical linguistic stimulus). One particular issue is how to understand trials on which listeners were already looking at the target before they hear the critical stimulus: some have argued that these trials are not informative (or less informative) because the ongoing fixation of the target cannot be reliably interpreted (Tanenhaus, Frank, Jaeger, Masharov, & Salverda, 2008). This line of reasoning has motivated us to use saccades as our main dependent variable (another solution is to focus on fixations, excluding trials on which the participant was already fixating the target at the critical onset: see Brown-Schmidt, 2009b; Heller et al., 2008). However, others have argued that all trials, including on-target trials, should be used in analysis, and that excluding trials can lead to spurious results (Barr, Gann, & Pierce, 2011). Since our model makes predictions for the processing of a partial referring expression up through the adjective (P(obj|RE)), and these predictions are different from the patterns expected before the referring expression is heard (P(obj|d)), it is important to show that the match of our results to our model predictions is not a by-product of the way our analysis deals with anticipatory looks. Thus, we also

D. Heller et al. / Cognition 149 (2016) 104–120 Table 1 Effects of array, ground and window on the proportion of fixating the target. b




Full model (intercept) array ground window array:ground array:window ground:window array:ground:window

0.46 0.14 0.09 0.09 –0.006 0.004 0.01 0.02

0.07 0.05 0.07 0.01 0.01 0.0035 0.0035 0.0035

8.44 1.69 34.006 0.27 1.62 11.49 36.39

.003 .19 5.49E09 .60 .20 .0007 1.61E09

Pairs model (intercept) ground window ground:window

0.59 0.09 0.09 0.009

.09 0.08 0.01 0.004

1.33 20.80 3.94

.25 5.09E06 .047⁄

Triplet model (intercept) ground window ground:window

0.32 0.08 0.09 0.03

0.09 0.10 0.01 0.004

0.70 42.16 46.98

.40 8.41E11 7.18E12

present an analysis in terms of fixations over time on all trials, using the method described in Barr et al. (2011). We analyzed fixations to the target, using a weighted empirical logit regression that approximates multilevel logistic regression (Barr et al., 2011; see also Barr, 2008a). In addition to the experimental fixed factors of array and ground, the statistical model also included the window of analysis as a fixed factor; this factor provides an estimate of the increase in the likelihood of fixating the target over the course of the critical analysis window. The critical interval was 440 ms (200–640 ms), with observations grouped into a series of temporal bins of 40 ms each, plus a 40 ms baseline window (160–200 ms); window was coded as the continuous variable. In this type of analysis, the interesting questions are how the experimental fixed factors of array and ground interact with window (i.e., whether and how the effects of array and ground change over time). This interaction should reflect the change in visual attention during the processing of the critical linguistic stimulus (here: the adjective) above and beyond any biases that existed


before the linguistic stimulus was heard. Main effects and interactions that do not involve the variable window reflect processing across the whole interval, and may therefore reflect biases that were present even before the linguistic stimulus (i.e., anticipatory effects). We fitted a mixed-effects linear regression model with ground, array and window as predictors; the random effect structure supported by the data was a random intercept and a random slope for each of the predictors, for both participants and items (a model with random slopes for interactions did not converge). Fixed effects were evaluated by performing likelihood ratio tests in which the deviance of a model containing the effect was compared to a second model without that fixed effect (but which had the same random effects structure). Table 1 (top panel) provides a summary of the model. The model revealed the expected strong effect of window, which simply means that fixations to the target were overall more likely as the linguistic stimulus was unfolding. There was also a main effect of array, meaning that, overall, listeners were more likely to gaze at the target in the triplet array. Notable are, first, the Ground  Window interaction, indicating that the effect of ground changed over time. Most important is the threeway interaction Array  Ground  Window, indicating that the effect of ground over time was different for the two arrays – this interaction is plotted in Fig. 8. This interaction is parallel to the critical interaction of Array  Ground that we observed in our saccade analysis. We further examined the critical three-way interaction by looking at the two types of array separately. In both cases (see again Table 1), the model revealed the expected main effect of window, no main effect of ground, and a significant Ground  Window interaction. In both cases, the interaction indicates that the effect of ground changed over the course of the window. 
Importantly, the effects of ground were in opposite directions for the two arrays: a facilitatory effect in the pairs array (b = 0.009) and a hindering effect in the triplet array (b = 0.03). Examining the effect of window in each condition separately showed that all slopes were significant (all ts > 3.5; Pairs-Baseline 0.07 vs. Pairs-Privileged 0.09; Triplet-Baseline 0.12 vs. Triplet-Privileged 0.06). That is, the slope was steeper in the privileged condition for the pairs array but shallower in the privileged condition for the triplet array. This is parallel to the pattern observed in our saccade analysis in the main body of the paper (see again Fig. 7), which is the cross-over interaction predicted by our model.
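The likelihood ratio tests described above can be illustrated with a short sketch. The Python below is not the analysis code used in this study. It assumes that each test removes a single fixed-effect term, so that the deviance difference is referred to a chi-square distribution with one degree of freedom; the function names are hypothetical. Under that assumption the upper-tail probability has a closed form in terms of the complementary error function.

```python
import math


def likelihood_ratio_statistic(deviance_reduced, deviance_full):
    """Deviance difference between the model without the effect and the model with it."""
    return deviance_reduced - deviance_full


def p_value_df1(chi_sq):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom.

    If X = Z**2 with Z standard normal, then P(X > x) = erfc(sqrt(x / 2)).
    The single degree of freedom is an assumption of this sketch.
    """
    if chi_sq < 0:
        raise ValueError("chi-square statistic must be non-negative")
    return math.erfc(math.sqrt(chi_sq / 2.0))


if __name__ == "__main__":
    # Test statistics listed for the Ground x Window term in the two array models.
    for label, chi_sq in [("pairs", 3.94), ("triplet", 46.98)]:
        print(label, p_value_df1(chi_sq))
```

For a statistic of 3.94, this returns a value close to the .047 listed for the Ground × Window term in the pairs model.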

Fig. 8. Likelihood of fixating the target as a function of analysis window, array and ground. The analyses were conducted on the log odds scale, but the figure is plotted in terms of raw proportions for ease of interpretation. The left panel plots the pairs array and the right panel plots the triplet array.
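The relation between the log odds scale used in the analysis and the raw proportions plotted in Fig. 8 can be sketched as follows. This is an illustrative Python sketch rather than the code used here. It assumes the usual empirical logit and weight formulas of the weighted empirical logit approach (Barr, 2008a), where y is the number of target-fixation samples out of n samples in a bin. The function names and the window numbering (baseline window as 0) are hypothetical.

```python
import math

BASELINE_START_MS = 160  # baseline window: 160-200 ms
INTERVAL_END_MS = 640    # critical interval: 200-640 ms
BIN_MS = 40              # width of each temporal bin


def window_index(t_ms):
    """Map a time stamp to its 40 ms window; 0 is the baseline (assumed numbering)."""
    if not BASELINE_START_MS <= t_ms < INTERVAL_END_MS:
        raise ValueError("time stamp outside the analysis interval")
    return (t_ms - BASELINE_START_MS) // BIN_MS


def empirical_logit(y, n):
    """Empirical logit of y target samples out of n; the 0.5 avoids log(0)."""
    return math.log((y + 0.5) / (n - y + 0.5))


def regression_weight(y, n):
    """Inverse of the approximate variance of the empirical logit."""
    variance = 1.0 / (y + 0.5) + 1.0 / (n - y + 0.5)
    return 1.0 / variance


def to_proportion(log_odds):
    """Inverse logit: convert a log odds value back to a proportion for plotting."""
    return 1.0 / (1.0 + math.exp(-log_odds))
```

With this numbering, the baseline and the eleven 40 ms bins of the critical interval give twelve windows, indexed 0 to 11.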



References

Allopenna, P. D., Magnuson, J. S., & Tanenhaus, M. K. (1998). Tracking the time course of spoken word recognition using eye movements: Evidence for continuous mapping models. Journal of Memory and Language, 38, 419–439.
Altmann, G. T. M., & Kamide, Y. (1999). Incremental interpretation at verbs: Restricting the domain of subsequent reference. Cognition, 73, 247–264.
Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59, 390–412.
Barr, D. J. (2008a). Analyzing ‘visual world’ eyetracking data using multilevel logistic regression. Journal of Memory and Language, 59, 457–474.
Barr, D. J. (2008b). Pragmatic expectations and linguistic evidence: Listeners anticipate but do not integrate common ground. Cognition, 109, 18–40.
Barr, D. J., Gann, T. M., & Pierce, R. S. (2011). Anticipatory baseline effects and information integration in visual world studies. Acta Psychologica, 137, 201–207.
Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68, 255–278.
Bates, D. M., Maechler, M., & Bolker, B. (2012). lme4: Linear mixed-effects models using S4 classes. R package version 0.999999-0.
Bezuidenhout, A. (2013). Perspective taking in conversation: A defense of speaker non-egocentricity. Journal of Pragmatics, 48, 4–16.
Brown-Schmidt, S. (2009a). Partner-specific interpretation of maintained referential precedents during interactive dialog. Journal of Memory and Language, 61, 171–190.
Brown-Schmidt, S. (2009b). The role of executive function in perspective-taking during on-line language comprehension. Psychonomic Bulletin and Review, 16, 893–900.
Brown-Schmidt, S. (2012). Beyond common and privileged: Gradient representations of common ground in real-time language use. Language and Cognitive Processes, 27, 62–89.
Brown-Schmidt, S., Gunlogson, C., & Tanenhaus, M. K. (2008). Addressees distinguish shared from private information when interpreting questions during interactive conversation. Cognition, 107, 1122–1134.
Brown-Schmidt, S., & Hanna, J. E. (2011). Talking in another person’s shoes: Incremental perspective-taking in language processing. Dialog and Discourse, 2, 11–33.
Chambers, C. G., Tanenhaus, M. K., Eberhard, K. M., Filip, H., & Carlson, G. N. (2002). Circumscribing referential domains during real-time language comprehension. Journal of Memory and Language, 47, 30–49.
Chambers, C. G., Tanenhaus, M. K., & Magnuson, J. S. (2004). Actions and affordances in syntactic ambiguity resolution. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30, 687–696.
Clark, H. H., & Marshall, C. R. (1981). Definite reference and mutual knowledge. In A. K. Joshi, B. Webber, & I. Sag (Eds.), Elements of discourse understanding (pp. 10–63). Cambridge: Cambridge University Press.
Clark, H. H., & Wilkes-Gibbs, D. (1986). Referring as a collaborative process. Cognition, 22, 1–39.
Ferguson, H. J., Apperly, I. A., Ahmad, J., Bindeman, M., & Cane, J. E. (2015). Task constraints distinguish perspective inferences from perspective use during discourse interpretation. Cognition, 139, 50–70.
Fine, A. B., & Jaeger, T. F. (2013). Evidence for error-based implicit learning in adult language processing. Cognitive Science, 37, 578–591.
Frank, M. C., & Goodman, N. D. (2012). Predicting pragmatic reasoning in language games. Science, 336, 998.
Gundel, J., Hedberg, N., & Zacharski, R. (1993). Cognitive status and the form of referring expressions in discourse. Language, 69, 274–307.
Hallett, P. E. (1986). Eye movements. In K. R. Boff, L. Kaufman, & J. P. Thomas (Eds.), Handbook of perception and human performance (pp. 10.1–10.112). New York: Wiley.
Hanna, J. E., & Tanenhaus, M. K. (2004). Pragmatic effects on reference resolution in a collaborative task: Evidence from eye movements. Cognitive Science, 28, 105–115.
Hanna, J. E., Tanenhaus, M. K., & Trueswell, J. C. (2003). The effects of common ground and perspective on domains of referential interpretation. Journal of Memory and Language, 49, 43–61.
Heller, D., & Chambers, C. G. (2014). Would a blue kite by any other name be just as blue? Effects of descriptive choices on subsequent referential behavior. Journal of Memory and Language, 70, 53–67.
Heller, D., Grodner, D., & Tanenhaus, M. K. (2008). The role of perspective in identifying domains of reference. Cognition, 108, 831–836.
Heller, D., Grodner, D., & Tanenhaus, M. K. (2009). The real-time use of information about common ground in restricting domains of reference. In U. Sauerland & K. Yatsushiro (Eds.), Semantics and pragmatics: From experiment to theory (pp. 228–248). Palgrave Studies in Pragmatics, Language & Cognition.
Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language, 59, 434–446.
Kehler, A., & Rohde, H. (2013). A probabilistic reconciliation of coherence-driven and centering-driven theories of pronoun interpretation. Theoretical Linguistics, 39, 1–37.
Keysar, B., Barr, D. J., Balin, J. A., & Brauner, J. S. (2000). Taking perspective in conversation: The role of mutual knowledge in comprehension. Psychological Science, 11, 32–37.
Keysar, B., Barr, D. J., Balin, J. A., & Paek, T. S. (1998). Definite reference and mutual knowledge: Process models of common ground in comprehension. Journal of Memory and Language, 39, 1–20.
Keysar, B., Lin, S., & Barr, D. J. (2003). Limits on theory of mind use in adults. Cognition, 89, 25–41.
Kuhlen, A. K., & Brennan, S. E. (2013). Language in dialogue: When confederates might be hazardous to your data. Psychonomic Bulletin and Review, 20, 54–72.
Lin, S., Keysar, B., & Epley, N. (2010). Reflexively mindblind: Using theory of mind to interpret behavior requires effortful attention. Journal of Experimental Social Psychology, 46, 551–556.
Nadig, A. S., & Sedivy, J. C. (2002). Evidence of perspective-taking constraints in children’s on-line reference resolution. Psychological Science, 13, 329–336.
Nilsen, E., & Graham, S. (2009). The relations between children’s communicative perspective-taking and executive functioning. Cognitive Psychology, 58, 220–249.
R Core Team (2012). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
Roberts, C. (2003). Uniqueness in definite noun phrases. Linguistics and Philosophy, 26, 287–350.
Russell, B. (1905). On denoting. Mind, 14, 479–493.
Sedivy, J. C., Tanenhaus, M. K., Chambers, C. G., & Carlson, G. N. (1999). Achieving incremental processing through contextual representation: Evidence from the processing of adjectives. Cognition, 71, 109–147.
Stalnaker, R. (1978). Assertion. In P. Cole (Ed.). Syntax and semantics: Pragmatics (Vol. 9). Academic Press.
Tanenhaus, M. K., Frank, A., Jaeger, T. F., Masharov, M., & Salverda, A. P. (2008). The art of the state: Mixed effect regression modeling in the visual world. In The 21st CUNY sentence processing conference. Chapel Hill, NC.
West, S. G., Aiken, L. S., & Krull, J. L. (1996). Experimental personality designs: Analyzing categorical by continuous variable interactions. Journal of Personality, 64, 1–48.
