The Quarterly Journal of Experimental Psychology

ISSN: 1747-0218 (Print) 1747-0226 (Online) Journal homepage: http://www.tandfonline.com/loi/pqje20

Latent semantic analysis cosines as a cognitive similarity measure: Evidence from priming studies

Fritz Günther, Carolin Dudschig & Barbara Kaup

To cite this article: Fritz Günther, Carolin Dudschig & Barbara Kaup (2015): Latent semantic analysis cosines as a cognitive similarity measure: Evidence from priming studies, The Quarterly Journal of Experimental Psychology, DOI: 10.1080/17470218.2015.1038280

To link to this article: http://dx.doi.org/10.1080/17470218.2015.1038280


Published online: 08 May 2015.


THE QUARTERLY JOURNAL OF EXPERIMENTAL PSYCHOLOGY, 2015 http://dx.doi.org/10.1080/17470218.2015.1038280

Latent semantic analysis cosines as a cognitive similarity measure: Evidence from priming studies

Fritz Günther, Carolin Dudschig, and Barbara Kaup

Department of Psychology, University of Tübingen, Baden-Württemberg, Germany


(Received 21 October 2014; accepted 26 March 2015)

In distributional semantics models (DSMs) such as latent semantic analysis (LSA), words are represented as vectors in a high-dimensional vector space. This allows for computing word similarities as the cosine of the angle between two such vectors. In two experiments, we investigated whether LSA cosine similarities predict priming effects, in that higher cosine similarities are associated with shorter reaction times (RTs). Critically, we applied a pseudo-random procedure in generating the item material to ensure that we directly manipulated LSA cosines as an independent variable. We employed two lexical priming experiments with lexical decision tasks (LDTs). In Experiment 1 we presented participants with 200 different prime words, each paired with one unique target. We found a significant effect of cosine similarities on RTs. The same was true for Experiment 2, where we reversed the prime-target order (primes of Experiment 1 were targets in Experiment 2, and vice versa). The results of these experiments confirm that LSA cosine similarities can predict priming effects, supporting the view that they are psychologically relevant. The present study thereby provides evidence for qualifying LSA cosine similarities not only as a linguistic measure, but also as a cognitive similarity measure. However, it is also shown that other DSMs can outperform LSA as a predictor of priming effects.

Keywords: Latent semantic analysis; Distributional semantics models; Semantic space models; Semantic priming.

Language is a powerful symbol system that plays a crucial role in communication and social interactions. The ability to transfer information between individuals allows humans to increase their amount of world knowledge without having to directly experience everything. This idea is expressed by Johnson-Laird (1983, p. 430) in the claim that “a major function of language is thus to enable us to experience the world by proxy”.

To fulfil this function, a language must consist of symbols that convey meaning. Only then can information be transferred from one mind to the other by means of using this symbol system. Interestingly, the question of how words constituting the basic symbols of a language obtain meaning is still highly debated. One proposed account is the so-called distributional hypothesis (Sahlgren, 2008), which can be

Correspondence should be addressed to Fritz Günther, University of Tübingen, Department of Psychology, Schleichstraße 4, 72076 Tübingen, Baden-Württemberg, Germany. Email: [email protected] We are grateful to Wolfgang Lenhard, University of Würzburg, for providing us with the software to create LSA semantic spaces, and to Ian Mackenzie, University of Tübingen, for further software support. Additionally, we would like to thank two anonymous reviewers for their valuable comments on previous versions of this paper. This project was supported by the Collaborative Research Centre 833 (SFB833) “The Construction of Meaning”/Z2 project appointed to Barbara Kaup by the German Research Foundation (DFG). © 2015 The Experimental Psychology Society


traced back to Harris (1954). According to this account, “words which are similar in meaning occur in similar contexts” (Rubenstein & Goodenough, 1965, p. 627). For example, words like hospital and clinic tend to be considered similar in terms of meaning because both words often occur in the context of other words like doctor, patient and medicine. This is even more striking for synonyms, which can be used interchangeably in different contexts. Distributional semantics models (DSMs) like latent semantic analysis (LSA; Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990; Landauer & Dumais, 1997) rely heavily upon this assumption that a word’s meaning is determined by the contexts in which it occurs. These models are algorithms working on a large corpus of text documents, and they strive to represent the meaning of words as vectors in a high-dimensional vector space, using the distribution of the words in the corpus as a basis. While LSA is only one amongst many DSMs that have been developed in the last two decades (for an overview, see Jones, Willits, & Dennis, in press), it has received by far the most attention of these models in the fields of psychology and cognitive science. Such a vector representation of word meanings as computed by LSA allows for determining word similarities by computing the cosine of the angle between two such vectors.

In the present article, we will investigate whether these LSA cosine similarities provide plausible measures for mental word similarities. For the purpose of our study, we define “cognitive plausibility” as a correspondence and correlation between the mental semantic similarity structure (for example, association strengths in some kind of mental lexicon) and the semantic similarity structure given by LSA. More specifically, we examine whether LSA cosine similarities predict reaction times (RTs) in a priming paradigm with a lexical decision task (LDT).
Previous studies have already addressed this question, but to our knowledge there exists no study that directly manipulated LSA cosine similarities as an independent variable and a numerical predictor. We want to resolve this issue by applying a different procedure in generating the item material than previous studies.
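The distributional hypothesis can be illustrated with a toy example: words are described by how often they co-occur with a small set of context words, and similarity falls out as the cosine between these count vectors. The counts below are invented for illustration and do not come from any real corpus:

```python
import math

# Toy illustration of the distributional hypothesis: each word is
# represented by its (invented) co-occurrence counts with five context words.
contexts = ["doctor", "patient", "medicine", "engine", "wheel"]
counts = {
    "hospital": [8, 9, 7, 0, 1],
    "clinic":   [7, 8, 6, 0, 0],
    "car":      [0, 1, 0, 9, 8],
}

def cosine(u, v):
    """Cosine of the angle between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Words occurring in similar contexts receive a high cosine.
sim_related = cosine(counts["hospital"], counts["clinic"])    # high
sim_unrelated = cosine(counts["hospital"], counts["car"])     # low
```

Here hospital and clinic end up highly similar because their context distributions nearly coincide, while hospital and car do not; real DSMs apply the same idea to count matrices with tens of thousands of words and contexts.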


The LSA algorithm is explained in more detail below, after which previous results on the cognitive relevance of LSA are presented, along with the potential problems for the interpretation of these results and our approach to solving these problems.

Latent semantic analysis

As stated above, the basis of the LSA algorithm is a reasonably large corpus of natural language, split into single text documents. The LSA algorithm is applied to such a corpus by means of four steps (for details of this algorithm, see Martin & Berry, 2007). Firstly, a word-by-document frequency matrix M is constructed from the corpus. In such a matrix, each word makes up a row vector, and each document a column vector; the cell entries then state how often a specific word occurs in a specific document. This matrix is the only input available to the LSA algorithm, so the algorithm relies entirely upon occurrences and, far more importantly, co-occurrences of words. In fact, the row vectors (which are the word vectors) already represent the distributions of words across documents. In LSA, a document is equated with a context. Words that often occur in the same contexts will therefore tend to have similar vectors. Secondly, a weighting scheme such as log-entropy weighting (Dumais, 1991) is applied to the matrix. Thirdly, a singular value decomposition (SVD) is performed on the weighted term-by-document frequency matrix Mw. This procedure is a two-mode generalization of the widespread factor analysis (Deerwester et al., 1990). Essentially, the SVD decomposes the matrix Mw into three components as follows:

Mw = U Σ V^T    (1)

With this decomposition, the semantic space is created by assigning a vector to each term and to each document: the row vectors of U are the term vectors or word vectors, and the column vectors of V are the document vectors. Σ is a diagonal matrix with the singular values of Mw as its


diagonal entries. Words and documents are thereby represented in one single semantic space. Lastly, dimension reduction is applied to the word and document vectors, a procedure that is well known from factor analysis. The number of dimensions k can be chosen arbitrarily, but it has been shown that choosing about 300 to 1000 dimensions results in good performance for LSA (Landauer & Dumais, 1997). This reduction of dimensionality is an important step in forming the LSA algorithm, and it is this step which seems to allow LSA to capture deeper semantic and associative structures than just the simple co-occurrence data in the term-by-document frequency matrix. Landauer, Foltz, and Laham (1998) demonstrate that it is possible to retrieve a very high cosine for words that actually never occurred together in the same document. In fact, Jones, Kintsch, and Mewhort (2006) state that about 70% of a word’s nearest neighbours never co-occurred in the same document. At the end of this four-step algorithm, each word is assigned a k-dimensional vector. This representation enables the key feature of LSA: the comparison of word meanings. Assuming that the word vectors represent word meanings, the more similar these vectors, the more similar the respective word meanings should be. The most commonly used method for computing the similarity of vectors is the cosine of the angle between two word vectors, a value ranging between −1 and 1. A cosine of 0 indicates orthogonal, i.e., unrelated vectors, while a cosine of 1 indicates identical vectors. Negative cosine similarities cannot be interpreted meaningfully and are often set to zero in the literature. The fact that every word is assigned a vector thus allows for the computation of meaning similarities for every word pair that is represented in the semantic space.
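The four steps above can be sketched in miniature. The following Python example uses an invented 6 × 4 count matrix rather than a real corpus, and k = 2 instead of the 300 to 1000 dimensions recommended for real applications; it applies log-entropy weighting, the SVD of Equation (1), and dimension reduction, and then computes cosine similarities with negative values set to zero:

```python
import numpy as np

# Step 1: an invented word-by-document count matrix M (6 words, 4 documents).
words = ["doctor", "hospital", "clinic", "patient", "car", "engine"]
M = np.array([
    [2, 3, 0, 0],
    [1, 2, 0, 0],
    [2, 1, 0, 0],
    [1, 3, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
], dtype=float)

# Step 2: log-entropy weighting (Dumais, 1991): a local log weight times a
# global weight based on the entropy of each word's distribution over documents.
p = M / np.maximum(M.sum(axis=1, keepdims=True), 1e-12)
ndocs = M.shape[1]
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(p > 0, p * np.log(p), 0.0)
global_w = 1.0 + plogp.sum(axis=1) / np.log(ndocs)
Mw = np.log(M + 1.0) * global_w[:, None]

# Step 3: singular value decomposition Mw = U S V^T, as in Equation (1).
U, S, Vt = np.linalg.svd(Mw, full_matrices=False)

# Step 4: dimension reduction to k dimensions (k = 2 for this toy example).
k = 2
term_vectors = U[:, :k] * S[:k]  # scaled row vectors of U = word vectors

def lsa_cosine(w1, w2):
    """Cosine between two word vectors; negatives are set to zero."""
    u = term_vectors[words.index(w1)]
    v = term_vectors[words.index(w2)]
    n1, n2 = np.linalg.norm(u), np.linalg.norm(v)
    if n1 == 0 or n2 == 0:
        return 0.0
    return max(float(u @ v / (n1 * n2)), 0.0)
```

With these toy counts, doctor and hospital receive a high cosine because they share documents, while doctor and car come out near zero; the variable names and data are illustrative only, not the SUMMA implementation used in the study.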

The cognitive relevance of LSA

A major question that arises at this point is whether the word similarities that can be computed via LSA are just abstract values that represent word similarities in a collection of texts, or whether they actually represent human mental similarities. In other

words, simply deriving a semantic space from a corpus of natural language by no means guarantees that this semantic space can be mapped onto the human mental lexicon or lexical network. In fact, Sahlgren (2008, p. 14) stated that DSMs do not capture “the meanings that are in our heads” or “the meanings that are out there in the world, but the meanings that are in the text”. In this context, the phrases “meanings in our heads” (or “meanings out there in the world”) refer to what most psychologists will associate with the word “meaning”: the reference of linguistic expressions (such as words) to objects, situations and other entities in the world, and to other concepts and ideas inside a speaker’s or listener’s mind. According to this view, words that refer to similar or associated entities or concepts (such as horses and donkeys, or doors and keys) have a similar meaning. As pointed out by Sahlgren (2008), DSMs have no possibility of taking into account the outside world or the minds of language users. The only information DSMs use is information that is present in the language itself. More specifically, they focus on syntagmatic relations (which words co-occur in texts) and paradigmatic relations (which words can occur interchangeably in the same context, but not at the same time) between words. In this view, meaning is defined as the syntagmatic or paradigmatic relations between words: words that often co-occur or can be used interchangeably have a similar meaning (Sahlgren, 2008). Obviously, “meanings in our heads” and “meanings in the text” are two very different definitions of meaning; the former is concerned with the relationship between language and the outside world, and the latter with language itself. However, the separation between language and the outside world may not be as strict, because language is part of the outside world, and language is used to describe and capture the outside world (Louwerse, 2011; Louwerse & Zwaan, 2009).
For example, the co-occurrence of door and key in the outside world is very likely to be reflected in language, so these words should be similar in meaning according to both views. DSMs such as LSA are explicitly designed to capture the “meanings in the text”, but from this


it does not necessarily follow that they do not capture the “meanings in our heads”, since the two definitions are not mutually exclusive. The statement that DSMs do not capture the “meanings in our heads” was not further investigated empirically by Sahlgren (2008). Assuming that there is a correspondence between language and the outside world as argued by Louwerse (2011), one might even expect the two types of meanings to be similar; not by definition, but empirically. Therefore, we investigate in the present study whether LSA provides cognitively plausible similarity measures (i.e., whether or not words with a similar meaning according to LSA also have a similar meaning in the minds of language users). For the fields of psychology and cognitive science, LSA can only provide a useful algorithm and tool if it captures similarities in word meanings as conceived by the human mind. This issue has already been addressed by various studies, some of which even included behavioural data and not only ratings data. Snyder and Munakata (2008) conducted an experiment in which participants had to generate words after they were given incomplete linguistic input. In a first experiment these words were either highly determined by a preceding sentence, as in, “he mailed a letter without a ___” (stamp would be the determined answer here), or highly underdetermined, as in “he couldn’t think of anyone less ___”. In a second experiment, participants had to generate a verb after being presented with a noun that was either highly determined (such as scissors) or highly underdetermined (such as cat). Snyder and Munakata found that the RTs on these tasks (the time until the answer was given) could be predicted by the LSA cosine between the input and the given answer. A rather extensive study was conducted by Jones et al. (2006). These authors re-analysed the data of several experiments on lexical priming, and compared the RTs in these studies with LSA cosines. 
They found that differences in RT between related and unrelated pairs could be predicted by differences in the pairs’ LSA cosines. This was the case for studies including different types of item material and experimental procedures, and


even held for studies concerned with mediated priming (for example, the relation between lion and stripes is mediated by the word tiger, to which both words are related). Another study dealing with the relation between LSA cosine similarities and RTs was conducted by Hutchison, Balota, Cortese, and Watson (2008). These authors examined the effect of several different factors on lexical priming effects (prime characteristics, target characteristics, and prime-target relatedness measures). In their experiments, participants were presented with a prime word which was followed by a target word after a short delay. The participants then either had to judge whether the target word was a real word or a non-word (LDT), or they had to read the target word out loud (naming task). The primes and targets were chosen in such a way that each target was preceded by either a related, an unrelated, or a neutral prime (the word blank). Hutchison et al. (2008) found that LSA cosines were significantly higher for related than for unrelated pairs, and RTs were shorter for related pairs. However, LSA cosines failed to predict RTs at the item level: in a regression analysis, LSA cosine similarities for prime-target pairs were not significant predictors of target RTs. The authors concluded: “a challenge therefore is for semantic space models to capture not only overall priming effects from factorial studies, but also item-by-item differences in magnitude in priming as a function of semantic similarity” (Hutchison et al., 2008, p. 1055).
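The item-level analysis that Hutchison et al. (2008) call for amounts to regressing target RTs on prime-target cosines as a numerical predictor. A minimal sketch with invented data follows; a real analysis would of course use the full trial-level data and additional covariates:

```python
import numpy as np

# Invented example data: per-item LSA cosines and mean target RTs (ms).
cosines = np.array([0.05, 0.12, 0.20, 0.33, 0.41, 0.55])
rts     = np.array([640., 632., 625., 610., 602., 588.])

# Simple linear regression RT = b0 + b1 * cosine via least squares.
X = np.column_stack([np.ones_like(cosines), cosines])
(b0, b1), *_ = np.linalg.lstsq(X, rts, rcond=None)

# A negative slope b1 means that higher cosines predict shorter RTs,
# which is the item-level pattern at issue.
```

The sign and size of b1 (rather than a mean difference between "related" and "unrelated" groups) is what an item-level test of LSA cosines evaluates.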

Associative and semantic priming

A distinction can be made between at least two types of priming. Associative priming occurs between words that are associated, i.e., that occur in similar contexts. For associated word pairs, there exists a high probability that one word will come to a person’s mind when the other word is presented to that person. On the other hand, semantic priming occurs between words that share many semantic features. Ferrand and New (2003) found that associative and semantic priming occur independently. They found priming effects for word pairs that are only


associatively but not semantically related (such as spider-web) and for word pairs that are only semantically but not associatively related (such as dolphin-whale). Thus, there exists purely associative as well as purely semantic priming (Hutchison, 2003; Lucas, 2000). Jones et al. (2006) found that LSA cosine similarities performed well in predicting associative priming effects, while they performed relatively poorly in predicting semantic priming effects. Semantic priming effects were captured better by HAL (Lund & Burgess, 1996), another DSM. This pattern of results makes perfect sense, considering that the LSA algorithm is designed to assign high similarities to words that often co-occur in the same context (i.e., it relies on syntagmatic relations), and therefore captures associative relationships. On the other hand, the HAL algorithm uses a moving-window technique that assigns high similarities to words that are often surrounded by the same words (i.e., it relies on paradigmatic relations), and therefore captures semantic relationships. A detailed discussion of this topic is given in Sahlgren (2008). The BEAGLE model (Jones & Mewhort, 2007), which was designed to capture both syntagmatic and paradigmatic relations, was found to predict both associative and semantic priming effects.

Potential problems with the previous studies

The aforementioned results seem promising for establishing LSA cosine similarities as a cognitively plausible similarity measure. However, they do not necessarily lead to the conclusion that LSA cosine similarities can actually predict priming effects. This is due to the fact that in none of these studies were LSA cosine similarities directly manipulated as an independent variable. Instead, they were only analysed post hoc, after the word material was created (or after the experiments were conducted). In Snyder and Munakata (2008), the participants’ task was to generate words after being presented with linguistic stimuli. The LSA cosine similarity between the presented and generated words was assessed after the experiment was

conducted, and then used to predict RTs. Therefore, the participants indirectly manipulated the LSA cosine similarity variable. Hutchison et al. (2008) used the association norms of Nelson, McEvoy, and Schreiber (1998) to create prime-target pairs that were either highly related or unrelated according to these norms. The independent variable in this study was therefore the word pairs’ association norm relatedness. The LSA cosine similarities were again assessed only after the item material had already been created. The same holds true for Jones et al. (2006), since many of the studies these authors analysed used the same or a similar procedure as Hutchison et al. (2008) to create their item material. These procedures cause some potential problems for the interpretation of the results. If the item material used in a study only contains word pairs that are either highly related or unrelated according to association norms, the results cannot necessarily be generalized to other, medium-related word pairs. Furthermore, the distribution of the LSA cosine similarities that are analysed in such a study cannot be controlled. This can result in distributions with large empty regions and/or densely populated clusters. In extreme cases, this could even result in almost no variance of the LSA cosine similarities. Although it is very unlikely, one cannot exclude the possibility that all word pairs have by chance almost the same cosine similarity. (This was not the case for the Hutchison et al. (2008) study, where the authors report a large variability in the LSA cosine similarities they examine; however, the very need for such a post hoc check is what should be avoided.) Therefore, being able to directly control the distribution of LSA cosine similarities is desirable if one wishes to analyse their predictive value.
Additionally, including LSA cosine similarities only as a post hoc variable in the analysis does not allow the experimenter to control for possible correlations with other variables, such as word lengths and frequencies. We argue that, in order to come to reliable conclusions on whether or not LSA cosine similarities predict RTs, these similarities need to be directly manipulated as an independent variable.


A second issue that we want to address is that LSA cosines were treated as a group variable in the studies examined by Jones et al. (2006): the mean cosine differences between highly related and only weakly related word pairs were examined. However, as argued by Hutchison et al. (2008), the LSA cosine similarity is a numerical variable, and it should therefore be examined as a numerical predictor.


Outlines for the experiments

In our experiments, a lexical priming paradigm was applied to examine the cognitive plausibility of LSA cosines. Since it is a very common finding that related word pairs result in shorter RTs on the target than unrelated word pairs (Hutchison, 2003), our hypothesis was that higher LSA cosines between prime and target predict shorter RTs on the target. Priming effects can be found for purely associatively related pairs, as well as for purely semantically related pairs (Ferrand & New, 2003). We can therefore expect that LSA cosines predict priming effects, even if they should only represent associative similarities. An LDT was employed to obtain RTs because priming effects are found to be greater in this paradigm, compared to a naming paradigm (Ferrand & New, 2003; Hodgson, 1991; Hutchison, 2003; Lucas, 2000). Since we are interested in whether LSA cosine similarities actually predict priming effects, we opted for a paradigm that more likely shows such effects if they are present. We also decided to use a long stimulus onset asynchrony (SOA) of 1000 ms. This allows for conscious processing of the target word, which may not be completed for short SOAs below 250 ms (Hutchison et al., 2008; Neely & Keefe, 1989). In general, however, priming effects seem to be quite independent of SOA (Lucas, 2000; but see also Hutchison et al., 2008), at least if the proportion of related prime-target pairs is not too high (de Groot, 1984). The main rationale of our study is to address the doubts raised by Hutchison et al. (2008) that LSA might not be capable of predicting priming effects at the individual item level. We hypothesize that, if examined with more suitable methods, LSA


can indeed predict priming effects even at the item level. As argued above, we think it is important (and more adequate) to directly manipulate the LSA cosine similarity as an independent, numerical variable. Therefore, we employed a pseudo-random procedure for creating the prime-target pairs used as item material. Additionally, since we want to investigate the empirical validity of LSA per se, and not only the validity of one specific LSA space, two different semantic spaces were employed for the data analysis (however, only one of them was used to create the item material).

EXPERIMENT 1

In this experiment, we examined whether LSA cosines predicted priming effects when manipulated as an independent variable. Therefore, we applied a pseudo-random procedure to generate the item material. In order to exclude possible interfering effects that can arise from crossing concrete and abstract words in priming studies, only concrete nouns were used (Bleasdale, 1987; Kroll & Merves, 1986).

Method

Corpus and semantic space

The corpus used for creating the first semantic space was part of the German Corpus December 2011, downloaded from HC Corpora (http://corpora.heliohost.org/), which was collected from publicly available online sources using a web crawler. From this corpus, we sampled a subset of 50,318 blog entries (4,896,832 tokens), because we assumed that blog entries reflect normal everyday language quite well, with a wide range of topics. A German corpus was used since we tested German participants. Umlauts and other special characters were replaced with their respective non-special equivalents (ä with ae, for example). For creating the first semantic space (the blogs LSA space), we used the SUMMA LSA-Server Architecture provided by the University of


Würzburg (Baier, Lenhard, Hoffmann, & Schneider, 2006). German stop-words such as und (and) and ein (a) were removed from the corpus, and all letters were set to lower case. Terms that appeared in fewer than two different documents were also removed from the corpus. We did not lemmatize the words, so words were not replaced with their stems. After applying these transformations, the term-by-document frequency matrix contained 97,838 different words. A log-entropy weighting scheme was applied to this matrix. Afterwards, an SVD was conducted to create a 300-dimensional semantic space. From this semantic space, we created our item material, as explained later. The second semantic space (the newspapers LSA space) was created using the same procedure, with the difference that 54,959 newspaper articles (5,103,270 tokens) were used as a text corpus. Those articles were also part of the German Corpus December 2011. This semantic space contained 109,480 different terms. Both LSA spaces are freely available in .rda format for R (R Core Team, 2014) at http://www.lingexp.uni-tuebingen.de/z2/LSAspaces/.

Participants

We tested 42 native German-speaking participants who gave informed consent to participate; 32 were female and 10 were male, and all were right-handed. The participants had a mean age of 25.86 years (SD = 6.09 years) and received either course credit or money for their participation.

Material

We used 300 different prime words. All of these words were German concrete singular nouns consisting of one to three syllables, with a length between four and nine letters and log-frequencies between 10 and 15 (according to http://wortschatz.uni-leipzig.de, indicating a medium-high frequency). Afterwards, a target word was selected for each prime word. We applied a pseudo-random procedure for this. First, we randomly selected 200 of our 300 primes to be paired with real target words, while the remaining 100 primes were

paired with non-words. Only a third of our pairs included non-word targets, in order to keep the experiment short. The non-words were pseudowords obeying German phonotactic constraints, created using Wuggy (Keuleers & Brysbaert, 2010). The real-word targets were chosen according to the following procedure. We defined four categories of similarity: the first category included cosine values between .00 and .10, the second between .10 and .25, the third between .25 and .40, and the fourth between .40 and .60. A total of 50 prime words were randomly assigned to each of these four categories. Then, the cosines between each prime word vector and all 97,838 word vectors in the blogs LSA space were computed. We then randomly sampled possible target words for each prime from its category of similarity, using the R package LSAfun (Günther, Dudschig, & Kaup, in press). For example, if a prime word was assigned to the first category, we randomly sampled words with a cosine similarity to the prime between .00 and .10. Then, for each prime, the first randomly sampled word that was a concrete noun (either singular or plural), consisted of one to three syllables and four to ten letters, and had a log-frequency between 10 and 16 was selected as a target word. Obviously, a prime word was never chosen to be its own target word. Thus, a pseudo-random selection of target words was applied, in which intuition did not play a role at any point; all the criteria and restrictions used for this selection were strictly formal. In cases where no target meeting all of these criteria was found for a prime, which sometimes occurred for the high-similarity category, this prime word switched its category assignment with a randomly selected word from another category.
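The selection procedure can be sketched as follows. The helper functions passed in (cosine_to_prime, is_concrete_noun, syllable_count, log_frequency) are hypothetical placeholders for the LSA-space and lexical-database lookups used in the study; this sketch illustrates the logic, not the original LSAfun implementation:

```python
import random

# Cosine bins used for the four similarity categories, as described above.
CATEGORIES = [(0.00, 0.10), (0.10, 0.25), (0.25, 0.40), (0.40, 0.60)]

def select_target(prime, vocabulary, category,
                  cosine_to_prime, is_concrete_noun,
                  syllable_count, log_frequency):
    """Return the first randomly sampled word in the prime's cosine bin
    that meets all formal criteria, or None if no such word exists."""
    lo, hi = CATEGORIES[category]
    # Candidates: all words (never the prime itself) whose cosine to the
    # prime falls into the assigned similarity category.
    candidates = [w for w in vocabulary
                  if w != prime and lo <= cosine_to_prime(prime, w) < hi]
    random.shuffle(candidates)
    # Take the first sampled word meeting the strictly formal criteria.
    for w in candidates:
        if (is_concrete_noun(w) and 1 <= syllable_count(w) <= 3
                and 4 <= len(w) <= 10 and 10 <= log_frequency(w) <= 16):
            return w
    return None  # no target found: in the study, the prime then switched category
```

Because every constraint is a formal predicate, no intuition enters the selection; a None result corresponds to the category-switching case described above.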
No categorical experimental design was intended by employing these categories; they were only used as auxiliary means to ensure that a broad range of cosine values was present in the item material. As can be seen in Figure 1, for any given word nearly all other words in the semantic space are hardly


Figure 1. A typical distribution of the number of matches as a function of LSA cosine for a given word. The word Blume (flower) and the LSA space described above were used for this example.

related, and highly related words are very rare. So by selecting completely random targets one would obtain mostly unrelated material (with a distribution of cosine similarities in the item material closely around zero, as illustrated by an example in Figure 1), which we wanted to avoid. However, to account for the fact that the lower cosine similarities are far more frequent than the higher ones, we decided to make the lowest similarity category narrower and the highest category broader than the others. We also did not include any negative cosines, since it is not entirely clear how to handle them in LSA (as stated above, the typical approach is to set them to zero). In the item material generated this way, LSA cosines between primes and targets did not correlate significantly with number of target syllables (r = .09), target length in letters (r = .06) or target frequency (r = −.09, all ps ≥ .20). Additionally, we controlled for correlations between cosines and prime characteristics. LSA cosines did not correlate significantly with number of prime syllables (r = −.01), prime length in letters (r = −.02) or prime frequency (r = −.05, all ps ≥ .52). Non-word targets did not differ from real-word targets in mean length


(t(222) = 0.078, p = .94) and variance of length (F(199, 99) = 1.292, p = .15). The complete item material for Experiment 1 can be found in the supplementary material. These prime-target pairs were ordered into six blocks using a Graeco-Latin square method. Each block contained 50 trials. Of these, one third consisted of non-word trials, and the rest were evenly distributed amongst the similarity categories (since 50 cannot be divided by 6, this distribution could not be completely even within the single blocks). Word stimuli were always presented in black letters in the centre of a white screen. Capital letters had a height of 9 mm, with a width between 3 mm and 9 mm; lower case letters had a height of 7 mm, with a width between 2 mm and 9 mm. All words began with a capital letter, while the other letters were in lower case, which is the standard German orthography for nouns. The experiment was presented on a 1024 × 768 pixel CRT monitor, with a refresh rate of 60 Hz, which was connected to the computer via a VGA connection. It was run in E-Prime 2.0 on a Windows XP Professional operating system. A standard keyboard with an orange sticker on the END key (right key) and an orange sticker on the TAB key (left key) was used for the participants’ responses.

Procedure
A standardized instruction was presented to the participants, telling them that word pairs would be presented on-screen. Of these word pairs, the first word would always be a real word, while the second could either be a real word or a non-existent word. The participants were further instructed that their task was to press the right key if the second word was a real word or the left key if it was a non-word. After reading these instructions, the experiment began. At the beginning of each trial, a 7 × 7 mm black fixation cross was presented in the centre of the monitor screen for 1000 ms.
Immediately after this fixation cross, the prime word was presented for 500 ms followed by a blank white screen for another 500 ms. After the presentation of the


white screen, the target word was presented in the centre of the screen. Therefore, we used an SOA of 1000 ms. The target remained on-screen until a response was received, that is, until either the left or the right key was pressed by the participant, or until 3000 ms had passed. If the response was correct, the feedback Richtig! (Correct!) was presented in the centre of the screen in green letters for 1000 ms. If the response was incorrect, the feedback Fehler! (Error!) was presented in the centre of the screen in red letters for 1000 ms. If no response was received within 3000 ms after the appearance of the target word, the target word disappeared and the feedback Zu langsam! (Too slow!) was presented in the centre of the screen for 1000 ms in red letters. The feedback was presented in the same letter size and font as the prime words and target words, and was followed by a blank white screen for 500 ms before the next trial started. The items were presented in six blocks of 50 items. Thus, every participant saw every prime-target pair exactly once, which implies a complete within-subjects design. The order of the blocks was balanced over participants using a Latin square design, and the order in which the items were presented within the blocks was randomized. Before the actual experiment started, a block of 20 practice trials was presented to the participants. The trial procedure in this block was identical to the experimental trials. None of the words used in these practice trials appeared in the actual experiment, either as prime words or as target words. The whole experiment took about 20 to 25 minutes to complete.
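The pseudo-random target selection described in the Materials section above can be sketched as follows. This is a minimal illustration, not the authors' code: the `space` matrix, the `is_candidate` filter (standing in for the formal criteria: concrete noun, syllable and letter counts, log-frequency) and the function names are assumptions; the actual study used the R package LSAfun.

```python
import random

import numpy as np

# Similarity category bounds from the Materials section.
CATEGORIES = [(0.00, 0.10), (0.10, 0.25), (0.25, 0.40), (0.40, 0.60)]

def cosine_row(space, words, prime):
    """Cosines between the prime's vector and every word vector in the space."""
    vecs = space / np.linalg.norm(space, axis=1, keepdims=True)
    return vecs @ vecs[words.index(prime)]

def sample_target(space, words, prime, category, is_candidate, rng):
    """Return the first randomly sampled word whose cosine with the prime
    falls into the given similarity category and which passes the formal
    word criteria (is_candidate); None if no such word exists."""
    lo, hi = CATEGORIES[category]
    cos = cosine_row(space, words, prime)
    pool = [w for w, c in zip(words, cos) if lo <= c < hi and w != prime]
    rng.shuffle(pool)
    for word in pool:
        if is_candidate(word):
            return word
    return None  # no target found: the prime switches category (see text)
```

Returning `None` corresponds to the case described above in which a prime switches category assignment because no qualifying target exists in its band.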

Results

Blogs LSA space
We only analysed trials in which a real-word target was presented. We excluded data for target words with error rates above 25% (this affected one target). Error trials were excluded from analysis

(2.33%), as well as trials with RTs under 100 ms or above 1500 ms (0.17%).1 The bivariate relation between LSA cosines and logarithmic RTs (logRTs) is depicted in Figure 2. LogRTs were used as, in the data for this study, their distribution is far closer to a normal distribution than the unmodified RT distribution (Baayen & Milin, 2010). However, as can be seen in Figure 2, the patterns look similar to those of normal RTs. For the data analysis, we employed a linear mixed effect model (LMEM) approach, using the package lme4 (Bates, Maechler, Bolker, & Walker, 2014) for the statistical computing environment R (R Core Team, 2014). We will use R syntax to describe those models. To test for the effects of LSA cosines on logRTs, we applied the following model:

[ModelB] log(RT) ~ Cosine + LengthTarget + FreqTarget + FreqPrime + LengthPrime + (1 + LengthTarget + FreqTarget | Subject) + (1 | Item).

In this model, Cosine is the LSA cosine between prime and target word, LengthPrime and LengthTarget are the word lengths of the prime and target (measured by number of letters), respectively, and FreqPrime and FreqTarget are the respective prime and target log-frequencies, as described above. Word lengths and frequencies were included as covariates that we wanted to control for in this model. This model assumes fixed effects for all of these parameters. Additionally, it assumes random intercepts for each subject (1 | Subject) and each item (1 | Item), defined as the prime-target pair, to account for potential differences in RTs between subjects or between items (Baayen, Davidson, & Bates, 2008; Barr, Levy, Scheepers, & Tily, 2013). Furthermore, we included random slopes for target lengths and target frequencies. The

1 The pattern of results in the experiments reported in this study is not restricted to these specific exclusion criteria for outliers. For example, excluding all trials with RTs below 200 ms or 300 ms gives the same pattern of results in all analyses reported in this study.


Figure 2. Left panel: Mean logRTs for each target in Experiment 1, plotted as a function of the prime-target cosine similarity. The dotted line shows the model predictions of ModelB in Experiment 1. Right panel: Mean non-transformed RTs for each target in Experiment 1, plotted as a function of the prime-target cosine similarity. The dotted line shows the exponentiated model predictions of ModelB in Experiment 1.

inclusion of random slopes was motivated by the reasoning of Barr et al. (2013).2 Note that this is a rather strict analysis, since we cannot exclude the possibility that the random item effect captures aspects of the cosine effect that we are interested in: each item was presented exactly once in the experiment, with a specific cosine similarity between its prime and target. We then compared ModelB to ModelA, using a likelihood-ratio test:

[ModelA] log(RT) ~ LengthTarget + FreqTarget + FreqPrime + LengthPrime + (1 + LengthTarget + FreqTarget | Subject) + (1 | Item).

ModelB and ModelA differ only in one parameter: ModelB includes a Cosine parameter, while ModelA does not. If ModelB turns out to have a higher predictive value than ModelA, this parameter has a significant effect in explaining the data. As indicated by a likelihood-ratio test, ModelB explains the data significantly better than ModelA (χ²(1) = 3.95, p = .047). Higher cosines hereby predict shorter RTs (b = −0.062). Additionally, including a random cosine slope for subjects does not improve ModelB (χ²(6) = 3.204, p = .783). The model parameters for the fixed effects in ModelB are shown in Table 1, as well as .95-Wald confidence intervals for these parameters. As can be seen, the confidence interval for the cosine parameter does not include the value of

2 Note that the model does not contain random slopes for prime lengths and prime frequencies. This is due to the fact that some models in the analyses did not converge with the high number of random parameters to be estimated, which caused serious problems for the analyses. It was decided not to include random slopes for prime lengths and frequencies, since these parameters have no fixed effects on the RTs in the data. Furthermore, in cases where models containing these random slopes did converge, the results including the random slopes were the same as without them.


Table 1. Fixed effect parameters for ModelB (and ModelC) in Experiment 1

                  Blogs (ModelB)               Newspapers (ModelB)          Newspapers (ModelC)
Parameter         β        CI                  β        CI                  β        CI
LengthPrime        0.006   [−0.002, 0.014]      0.007   [−0.001, 0.015]      0.007   [−0.001, 0.015]
LengthTarget       0.013   [0.006, 0.020]       0.011   [0.004, 0.018]       0.011   [0.004, 0.018]
FreqPrime         −0.002   [−0.010, 0.005]     −0.003   [−0.011, 0.005]     −0.003   [−0.011, 0.005]
FreqTarget         0.022   [0.016, 0.029]       0.022   [0.015, 0.029]       0.022   [0.015, 0.029]
Cosine            −0.062   [−0.123, −0.0004]   −0.098   [−0.169, −0.027]    −0.098   [−0.172, −0.024]

zero in ModelB, further confirming the significant negative effect of cosine similarities on logRTs.

Newspapers LSA space
We recalculated the cosine similarities for the word pairs used in Experiment 1 within the newspapers LSA space. Negative values were set to zero, and 24 word pairs had to be omitted because at least one word in the pair could not be found in the semantic space. Figure 3 shows the relation between word pair cosines in the blogs LSA space and word pair cosines in the newspapers LSA space. There is a significant positive correlation between these cosines (r(188) = .414, p < .001). We then re-analysed the data with the same linear mixed effect models as described above, with the only difference being that cosine similarity values from the newspapers LSA space were used. In this case, ModelB also explained our data significantly better than ModelA (χ²(1) = 7.39, p = .007). Adding a by-subject random slope for the cosine parameter further improved the model (χ²(6) = 33.84, p < .001). Thus, the following model performs best with the values from the newspapers LSA space:

[ModelC] log(RT) ~ Cosine + LengthTarget + FreqTarget + FreqPrime + LengthPrime + (1 + Cosine + LengthTarget + FreqTarget | Subject) + (1 | Item).

Figure 3. Word pair cosine similarities for the word pairs used in Experiment 1 in the newspapers LSA space plotted against their respective cosine similarities in the blogs LSA space.

Parameter values for both models in this analysis are shown in Table 1. For both ModelB and ModelC, the .95-Wald confidence intervals do not overlap the value of zero, which is in line with the model comparison results.
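The model comparisons above are likelihood-ratio tests: the test statistic is twice the log-likelihood difference between the nested models, referred to a χ² distribution. As a generic sketch (not the authors' code), for a one-parameter difference the χ² survival function reduces to erfc(√(s/2)):

```python
import math

def lrt_pvalue_df1(loglik_reduced, loglik_full):
    """Likelihood-ratio test for nested models differing in one parameter,
    e.g. ModelA (reduced) vs ModelB (full). For df = 1, the chi-square
    survival function is P(X > s) = erfc(sqrt(s / 2))."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, math.erfc(math.sqrt(stat / 2.0))
```

For example, a statistic of 3.95 (as in the blogs-space comparison of ModelB and ModelA) yields p ≈ .047, matching the value reported above.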

Discussion

In Experiment 1, LSA cosines were found to have a significant effect on RTs. Since higher LSA cosines predicted shorter RTs, these effects can be interpreted as priming effects. This seems to qualify LSA as a useful tool for the representation of word meanings and LSA cosines as a cognitively plausible measure of meaning similarities. We found that two different measures, the blogs cosine similarities and the newspapers cosine


similarities, both succeeded in predicting the RTs we observed (note that, since logRTs were analysed, we found a non-linear effect on RTs). These results support the claim that LSA per se can model mental similarities, and not just the results from one single, specifically customized LSA space. Blogs cosine similarities and newspapers cosine similarities are de facto independent measures, since not a single document was part of both corpora used to create the respective LSA spaces. The fact that both LSA spaces, the one used to create the experimental material and an independent one, can be used to predict RTs allows us to assume that the LSA algorithm is able to derive word meanings from a suitable representative corpus. The coefficients for the parameters displayed in Table 1 refer to logRTs. From these coefficients, the effects on non-logarithmic RTs can be computed. The −0.062 coefficient for the blogs LSA cosines corresponds to a priming effect of 20 ms between prime-target pairs with a cosine similarity of .00 and those with a high cosine similarity of .58, while the −0.098 coefficient for the newspapers LSA cosines corresponds to a priming effect of 31 ms for this range of cosines. The priming effects found in our study are therefore located at the lower end of the range of priming effects for LDTs. In most studies, these effects range between 10 ms and 45 ms (Hutchison, 2003; Jones et al., 2006; Lucas, 2000). However, there are a variety of studies which found priming effects comparable to those in our study, especially for associative priming (Ferrand & New, 2003; Hodgson, 1991; McKoon & Ratcliff, 1992; McNamara & Altarriba, 1988; Thompson-Schill, Kurtz, & Gabrieli, 1998).3 In the procedure we applied for generating our material, we selected prime words and pseudo-randomly generated targets. However, this was a completely arbitrary decision. One could also think of an identical procedure where the primes are generated pseudo-randomly for fixed targets. 
Therefore, we decided to run a second experiment where this option was realized.
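The millisecond priming effects quoted above (20 ms and 31 ms) follow from the log-RT coefficients: since the model predicts log(RT), a cosine difference Δc scales RT by a factor exp(b · Δc). A minimal numerical sketch, assuming the Experiment 1 grand mean RT of 579.43 ms as the baseline (the exact baseline used by the authors is not stated in this excerpt):

```python
import math

def priming_effect_ms(b_cosine, delta_cos, baseline_rt_ms):
    """RT difference (ms) between a low- and a high-similarity pair implied
    by a log-RT regression coefficient b_cosine over a cosine range delta_cos."""
    return baseline_rt_ms * (1.0 - math.exp(b_cosine * delta_cos))

# Blogs space: b = -0.062 over the cosine range .00 to .58 -> roughly 20 ms
blogs = priming_effect_ms(-0.062, 0.58, 579.43)
# Newspapers space: b = -0.098 over the same range -> roughly 31-32 ms
news = priming_effect_ms(-0.098, 0.58, 579.43)
```

With this baseline the computation reproduces the reported effects to within about 1 ms.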

EXPERIMENT 2

In Experiment 1, we arbitrarily decided to pseudo-randomly generate targets for previously selected primes. To account for this, we reversed the prime-target order of Experiment 1 to obtain item material for Experiment 2. Reversing the order of primes and targets might in some cases cause some issues related to the so-called backward priming effect. This effect was first introduced by Koriat (1981), who argued that the associations for some word pairs are asymmetrical. For example, when participants are presented with the word stork, they are very likely to come up with the word baby. However, this is not true when they are presented with baby, where they rarely come up with stork. Nevertheless, it has also been shown that for asymmetrical word pairs, priming effects can under some circumstances occur in both directions; stork primes baby (forward priming), but baby can also prime stork (backward priming) (Hutchison, 2003; Koriat, 1981). The general pattern of results for such asymmetrical word pairs is that, in naming tasks, forward priming occurs at short and long SOAs, while backward priming only occurs at short SOAs. In an LDT, however, forward and backward priming occur at both long and short SOAs (Hutchison, 2003; Kahan, Neely, & Forsythe, 1999; Peterson & Simpson, 1989). For symmetrical word pairs, priming effects always occur in both directions. We therefore expect to find priming effects in Experiment 2. For symmetrical word pairs included in the material, priming effects should occur regardless of the order in which the two words are presented. For potential asymmetrical word pairs included in the material, one would expect backward priming effects to occur since we used an LDT, and Experiment 2 is the reversed version of Experiment 1. Furthermore, since we pseudo-randomly generated our item material, potential asymmetrical word pairs are expected to be evenly distributed between both experiments.

3 Some of these studies are concerned with mediated priming. However, we compared our results to the results from these studies which were obtained from their direct priming control groups.


Half of these word pairs should be in their “forward priming order” in Experiment 1, and half should be in their “forward priming order” in Experiment 2. These are two reasons to assume that asymmetrical priming effects should not play a role in this experiment, and we expect to find the same priming effects in Experiment 2 that were found in Experiment 1.


Participants
A total of 44 participants (36 female, 8 male), all native German speakers, with a mean age of 23.95 years (SD = 5.37 years), were tested in Experiment 2. All of them were right-handed. None of the participants had taken part in Experiment 1. Participants gave informed consent and were rewarded with either course credit or money.

Method
We used exactly the same item material as in Experiment 1, with the difference that we switched primes and targets. Thus, if X-Y was a prime-target pair in Experiment 1, this pair was now presented as Y-X. Prime-target pairs that had a non-word as the target word were not switched. Since in Experiment 1 we had come up with the prime words and pseudo-randomly created the targets, this order was also reversed in Experiment 2 (at least for real-word trials). As follows from the description of the material in Experiment 1, LSA cosine similarities were not correlated with prime and target lengths and frequencies. Non-word targets did not differ from real-word targets in mean length (t(172) = −1.877, p = .062) or variance of length (F(199, 99) = 0.724, p = .057). The experimental procedure and set-up were identical to those of Experiment 1.

Results

Blogs LSA space
We applied the same exclusion criteria as in the previous experiment. Two participants were removed from analysis since they had error rates higher than 25% in the non-word trials. Thus, the data from 42 participants were analysed. Additionally, we excluded data from 2.08% of the remaining trials because of erroneous answers, and 0.2% in which an RT below 100 ms or above 1500 ms was observed. For data analysis, we employed the same LMEMs as in Experiment 1. The bivariate relation between LSA cosines and RTs is depicted in Figure 4. As can be seen in Figure 4, there is one outlier item with a mean RT far off the remaining distribution (the item Teigwaren-Sirene [pasta-siren]), with a mean RT of 687 ms, which is 41 ms higher than the second-highest mean RT. We therefore decided to run an influence analysis on the items, using the R package influence.ME (Nieuwenhuis, te Grotenhuis, & Pelzer, 2012). Figure 5 shows the Cook’s Distances for the items. In order to calculate the Cook’s Distance of an item, the item is deleted from the data set and the change in model parameters is taken as a measure of its influence (Cook, 1977). As can be seen, one (and only one) item has an extremely high influence on the model parameters compared to all the other items. This item is indeed the pair Teigwaren-Sirene, with a Cook’s Distance of more than double the value of the second largest (Dn = 0.072 vs Dn−1 = 0.031). Highly influential data points clearly pose a problem for the interpretation of regression analysis (and hence LMEM analysis) results (Cook, 1977; Stevens, 1984). According to the recommendation of Stevens (1984), we therefore ran our analysis with the item Teigwaren-Sirene excluded from our data set.4 In this analysis, ModelB (which contains a cosine fixed effect) explained our data significantly

4 We also ran influence analyses for the other experiments reported in this paper. We found no other such influential items in the other experiments. There were in fact some items that also had relatively high Cook’s Distances in the other experiments, but they were not such extreme outliers. Excluding items with a high Cook’s Distance did not change the pattern of results in the other experiments.


Figure 4. Left panel: Mean logRTs for each target in Experiment 2, plotted as a function of the prime-target cosine similarity. The dotted line shows the model predictions of ModelB in Experiment 2. Right panel: Mean non-transformed RTs for each target in Experiment 2, plotted as a function of the prime-target cosine similarity. The dotted line shows the exponentiated model predictions of ModelB in Experiment 2.

better than ModelA (which does not contain a cosine fixed effect) (χ²(1) = 5.45, p = .020). LSA cosine similarities thereby had a negative effect on logRTs (b = −0.052). Further including a by-subject random slope for cosine values did not improve ModelB (χ²(6) = 1.57, p = .955). The model parameters for ModelB in this analysis are shown in Table 2. In line with the model comparison results, the .95-Wald confidence interval for the cosine parameter does not overlap the value of zero.

For the sake of completeness, we also report the results of this analysis with the item Teigwaren-Sirene included in the data. In this analysis, ModelB failed to explain our data significantly better than ModelA (χ²(1) = 3.49, p = .061). Including a by-subject random slope for cosine values also did not improve ModelB (χ²(6) = 2.21, p = .900). The model parameters and .95-Wald confidence intervals for these parameters are also shown in Table 2.

Figure 5. Cook’s Distances for all items in Experiment 2.
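Cook's Distance, as used in the influence analysis above, measures how strongly the fitted parameters change when one observation (here: one item) is deleted. A generic sketch for ordinary least-squares regression (the influence.ME package applies the analogous leave-one-out computation to mixed models; this is not its implementation):

```python
import numpy as np

def cooks_distances(X, y):
    """Cook's Distance for each observation of an OLS fit:
    D_i = e_i^2 / (p * s^2) * h_ii / (1 - h_ii)^2,
    where h_ii is the leverage and e_i the residual of observation i."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat (projection) matrix
    h = np.diag(H)                           # leverages
    e = y - H @ y                            # residuals
    s2 = e @ e / (n - p)                     # residual variance estimate
    return e**2 / (p * s2) * h / (1.0 - h) ** 2
```

An observation that is both far from the fitted line and high in leverage, like the Teigwaren-Sirene item in Figure 5, dominates this statistic.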

Newspapers LSA space
Since the same word pairs as those of Experiment 1 were used, 24 word pairs had to be omitted from this analysis, since at least one word in the pair was not part of the newspapers LSA space. The critical item of the previous analysis (Teigwaren-Sirene) was one of these omitted word pairs. Negative cosine similarity values were set to zero, as before. With these cosine similarities, ModelB performed significantly better than ModelA (χ²(1) = 5.56, p = .018), with a negative influence of cosine similarities on logRTs (b = −0.061). Including a by-subject random slope for the cosine parameter did not further improve the model (χ²(6) = 0.74, p = .994).


Table 2. Fixed effect parameters for ModelB and ModelC in Experiment 2

                  Blogs (ModelB) − TS          Blogs (ModelB) + TS          Newspapers (ModelC)
Parameter         β        CI                  β        CI                  β        CI
LengthPrime        0.002   [−0.002, 0.006]      0.003   [−0.002, 0.007]      0.000   [−0.005, 0.005]
LengthTarget       0.013   [0.007, 0.019]       0.013   [0.007, 0.019]       0.013   [0.007, 0.019]
FreqPrime         −0.003   [−0.007, 0.002]     −0.002   [−0.007, 0.002]     −0.003   [−0.007, 0.002]
FreqTarget         0.014   [0.009, 0.020]       0.015   [0.010, 0.021]       0.011   [0.005, 0.017]
Cosine            −0.052   [−0.095, −0.008]    −0.042   [−0.087, 0.002]     −0.061   [−0.113, −0.009]

Note: In the − TS model, the item Teigwaren-Sirene is excluded, while it is included in the + TS model.

The parameter values for ModelB in this analysis are shown in Table 2.

Comparison with Experiment 1
We compared the effects of the cosine parameter in Experiment 1 and Experiment 2 by pooling the data of both experiments. For the LMEMs we employed, we regarded the word pairs of Experiment 1 to be the same items as their reversed counterparts in Experiment 2 (the same pattern of results we report in this section emerges when these are regarded as different items). For the blogs LSA space, the results are as follows. In the pooled data, ModelB (which contained a cosine fixed effect) performed better than ModelA (without the cosine fixed effect) (χ²(1) = 8.26, p = .004), which is in line with the previous results. Again, adding a by-subject random slope for the cosine parameter did not further improve this model (χ²(4) = 5.01, p = .286). In the next step, we included a fixed effect parameter for the experiment (Experiment 1 vs Experiment 2) in the model, which significantly improved ModelB (χ²(1) = 147.99, p < .001), indicating a significant difference in the mean logRTs between Experiment 1 (MlogRT = 6.33 log(ms), MRT = 579.43 ms) and Experiment 2 (MlogRT = 6.28 log(ms), MRT = 546.78 ms). Crucially, additionally including an interaction effect between the cosine parameter and the experiment parameter did not significantly improve this model (χ²(1) = 0.82, p = .364). Thus, the effect of the cosine parameter on RTs does not differ significantly between Experiment 1 and Experiment 2.

Discussion

The results obtained in Experiment 2 further confirm those of Experiment 1: LSA cosine similarities are a significant predictor of priming effects, in that higher cosine similarities predict shorter RTs. Furthermore, the comparison between the results from Experiment 1 and Experiment 2 shows that there is no difference in priming effects between the two experiments. In Experiment 2, the −0.052 coefficient for the blogs LSA cosines corresponds to a priming effect of 16 ms between prime-target pairs with a cosine similarity of .00 and those with a high cosine similarity of .58, and the −0.061 coefficient for the newspapers LSA cosines corresponds to a priming effect of 19 ms for this range of cosines. These findings are an important extension and completion of the results of Experiment 1. In Experiment 1, we made the arbitrary choice to create primes and then randomly sample targets, but not vice versa. Experiment 2 shows that this choice plays no role when it comes to predicting priming effects by means of cosine similarities. For DSMs such as LSA, it is appealing to interpret these results in favour of generally symmetrical (associative) relationships between words. A mathematical property of cosines (and therefore cosine similarities) is that they are symmetrical, i.e., cos(a, b) = cos(b, a). With their


typical similarity measures, DSMs therefore assume the relations and similarities between words to be symmetrical. This can also be seen as a shortcoming, since some phenomena are hard to account for with these symmetrical similarities. For example, it cannot be explained why free associations include asymmetrical pairs such as the baby-stork pair described above. One could also argue that these asymmetrical pairs are just an artefact of free association tasks, and that the “true mental” word similarities are indeed symmetrical. This argument however is inconsistent with the results from the backward priming literature. It has been repeatedly shown (Kahan et al., 1999; Peterson & Simpson, 1989) that backward priming for asymmetrical word pairs cannot be found in naming tasks with long SOAs. This at least demonstrates that these word pairs differ from symmetrical word pairs, which always show priming effects in both directions (for both LDTs and naming tasks, and for both long and short SOAs). In summary, we do not conclude that the results from Experiment 2 speak in favour of there being generally symmetrical relations and similarities between words, mainly since this study is not designed to answer this question. In a recent study, Kintsch (2014) proposed a method for computing asymmetrical word similarities in an LSA framework by taking into account the lengths of the LSA vectors (which directly translates to the amount of information LSA has about the respective words). Using this method, future studies might explicitly generate asymmetrical item material for priming studies such as the ones presented here, and further investigate the symmetry of word similarities.
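The symmetry property discussed above follows directly from the definition of the cosine, because the dot product is commutative. A minimal demonstration with arbitrary example vectors:

```python
import numpy as np

def cosine(a, b):
    # cos(a, b) = (a . b) / (|a| |b|); symmetric because a . b = b . a
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.3, 0.1, 0.5])
b = np.array([0.2, 0.4, 0.1])
assert cosine(a, b) == cosine(b, a)
```

Asymmetric measures such as the one proposed by Kintsch (2014) have to bring in additional information, for example the lengths of the two vectors, precisely because the cosine itself cannot distinguish direction.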

ANALYSIS OF OTHER SIMILARITY MEASURES

The focus of this study lay in examining whether similarities as predicted by LSA can predict priming data. However, as stated above, LSA is


just one of many DSMs that can derive word similarities from corpus data (Jones et al., in press). We chose to focus on LSA since it is by far the most prominent of those models, and has had the highest impact on the psychological literature (the prominent article on LSA by Landauer and Dumais, 1997, alone has been cited almost 1500 times in Web of Science, while the main article on HAL by Lund and Burgess, 1996, has been cited about 400 times). The prominence of LSA, however, does not necessarily mean that LSA is the best model for computing word similarities. As pointed out before, LSA is designed to capture syntagmatic, and therefore associative, relationships between words, which is only one way of defining word similarities. In the following section, we test whether other models of semantic similarity, namely HAL and BEAGLE, predict our data from Experiment 1 and 2 better than LSA does. The HAL algorithm (Lund & Burgess, 1996) does not construct a term-by-document matrix as LSA does, but a word-by-word matrix. This matrix is populated by moving an n-word window over the text, word by word. For each word in the text, the other words which happen to be in that window are registered. For example, take the sentence “only happy people were invited to the party”. If the window size is set to n = 5, then the window for people contains only, happy, were and invited, while the window for invited contains people, were, to and the. For each word, a count is undertaken, over the whole corpus, of which words occur in the window and how often they occur. Typically, these raw counts are weighted so that words nearer to the target word are assigned a higher value (in our example, the words happy and were would get higher values than only and invited in the window for people).
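The moving-window counting just described can be sketched as follows. The linear ramp weighting (nearer words weighted higher) is one common choice; the exact HAL/S-Space weighting scheme is not reproduced here, so treat this as an illustration:

```python
from collections import defaultdict

def hal_counts(tokens, window=5):
    """Weighted co-occurrence counts for an n-word window (n odd):
    (window - 1) / 2 words on each side, with words at distance 1
    receiving the highest weight."""
    half = (window - 1) // 2
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for d in range(1, half + 1):
            weight = half + 1 - d            # distance 1 -> highest weight
            if i - d >= 0:
                counts[(word, tokens[i - d])] += weight
            if i + d < len(tokens):
                counts[(word, tokens[i + d])] += weight
    return counts
```

For the example sentence, the counts for people give happy and were twice the weight of only and invited, matching the description above.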
The HAL model heavily focuses on paradigmatic relations between words: if two words are constantly surrounded by the same words in the same word order then HAL will assign very similar vectors to them, which will in turn result in high word similarities. This


is the reason why HAL has been shown to capture semantic priming effects far better than LSA, but not associative priming effects (Jones et al., 2006). The BEAGLE model (Jones & Mewhort, 2007) is explicitly designed to capture both syntagmatic and paradigmatic (i.e., context and order) information. In this model, a random vector e is initially assigned to each word. The memory vector m for each word is then updated for each document in the corpus via two mechanisms. The (syntagmatic) context information c is captured by summing up the random vectors e for each word in the document except the target word. Therefore, the context information for car in “I drove home with my car” will be exactly identical to the context information for bike in “with my bike, I drove home”. The (paradigmatic) order information o is captured by applying a circular convolution technique across different sized windows around the target word (more specifically, across the random vectors e for the words in those windows). After the processing of each document, the memory vectors m for the words in that document get updated by applying the formula mnew = mold + c + o (for a more detailed description of the BEAGLE algorithm, see Jones & Mewhort, 2007). In Jones et al. (2006), BEAGLE is shown to be capable of predicting both associative and semantic priming effects, since it is designed to capture both associative and semantic word relations. We decided to test the HAL and BEAGLE model for three reasons. Firstly, HAL and BEAGLE are arguably the most prominent DSMs in psychology after LSA. Secondly, these models are the exact models also examined in the study by Jones et al. (2006), which is a major foundation for our study. Thirdly, there are interesting differences between LSA, HAL and BEAGLE on a theoretical level which affect the algorithms’ outcomes and the kind of similarity information they capture. 
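The circular convolution used for BEAGLE's order information, and the memory update m_new = m_old + c + o, can be sketched as follows. This is a generic implementation of the two operations, not of the full BEAGLE n-gram binding scheme:

```python
import numpy as np

def circular_convolution(x, y):
    """(x * y)_k = sum_j x_j * y_((k - j) mod n), computed via the FFT
    (the convolution theorem for circular convolution)."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))

def update_memory(m, c, o):
    """BEAGLE memory update after one document: m_new = m_old + c + o,
    with c the summed context vectors and o the bound order information."""
    return m + c + o
```

Circular convolution compresses an ordered pair of vectors into a single vector of the same dimensionality, which is what lets BEAGLE store order information in a fixed-size memory vector.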
Additionally, we also tested whether human judgements of word similarities as obtained from ratings data predict our data better than the DSMs we examined.

Obtaining HAL, BEAGLE, and ratings similarities
We obtained the similarity judgements for the word pairs used in Experiment 1 in a web-based questionnaire. This questionnaire was sent to all students and employees of the University of Tübingen, and participants had a chance of winning coupons for books. In total, we obtained ratings data from 1352 native German speakers (469 male, 883 female, mean age = 27.18 years, SD = 9.12 years). The participants’ task was to indicate the similarity for each of the word pairs used in Experiment 1 on a 7-point scale (1 = not similar, 7 = very similar). We did not provide any criterion as to the basis for these similarity ratings, and requested that for each word pair the participants decide as quickly and intuitively as possible. Every participant only saw one half of the 200 word pairs in random order. Since some participants did not complete the questionnaire, we obtained an average of 567 ratings values per word pair. The similarity judgements for the word pairs used in Experiment 2 (i.e., the same word pairs as in Experiment 1, but in reverse order) were collected in another web-based questionnaire, sent to the students and employees of the University of Tübingen. The participants again had the chance to win coupons for books. We obtained ratings data from 883 native German speakers (247 male, 636 female, mean age = 26.40 years, SD = 9.52 years). In this questionnaire, every participant saw all 200 word pairs in random order. Since some participants did not complete the questionnaire, we obtained an average of 687 ratings values per word pair. The HAL and BEAGLE spaces were both created from the same corpus of blog entries that was used to create our original LSA space. As for the LSA space, all words that appeared less than two times were eliminated from the corpus. We used the S-Space Package (Jurgens & Stevens, 2010) to set up both models. For the HAL model, we used a 10-word moving window to set up the model, and the

THE QUARTERLY JOURNAL OF EXPERIMENTAL PSYCHOLOGY, 2015

17

Downloaded by [Universite Laval] at 21:03 05 November 2015

FRITZ GÜNTHER ET AL.

dimensionality of the word vectors was reduced to 1000 (by keeping only the 1000 highest entropy columns of the word-by-word matrix), and the vectors were normalized. From these word vectors, the Euclidean distances between the word pairs used in Experiment 1 and 2 were computed (in HAL, one typically uses Euclidean distances, which are measures of dissimilarity, instead of cosines, which are measures of similarity). All the parameter values for these models (dimensionality, window size, type of similarity measure etc.) were chosen to be identical with the values used by Jones et al. (2006). Additionally, we set up another HAL-based model with a different parameter set, which Bullinaria and Levy (2007) found optimizes HAL’s performance in some tasks, which is henceforth referred to as the Positive Pointwise Mutual Information with Cosine (PPMIC) model. For this model, the window size was set to 1 (both behind and ahead of the word in question), a positive pointwise mutual information weighting was applied to the co-occurrence matrix, the dimensionality was set to 10,000, and cosine similarities were used instead of Euclidean distances. This model was set up using the HiDEx software (Shaoul & Westbury, 2006, 2010). For the BEAGLE model, the vector dimensionality was set to 1024, and the cosine similarities between the word pairs used in Experiment 1 and Experiment 2 were computed.
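The HAL-style setup described above — window-based co-occurrence counts, PPMI weighting, and the contrast between cosine similarity and Euclidean distance — can be sketched in a few lines. The following is a minimal illustrative sketch of our own (not the S-Space or HiDEx implementations); the function names and toy data are hypothetical:

```python
import numpy as np

def cooccurrence_matrix(tokens, vocab, window=10):
    """HAL-style word-by-word co-occurrence counts inside a
    symmetric moving window of the given size."""
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        if w not in idx:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in idx:
                M[idx[w], idx[tokens[j]]] += 1
    return M

def ppmi(M):
    """Positive pointwise mutual information weighting:
    max(0, log(p(i, j) / (p(i) * p(j)))), as in the PPMIC model."""
    total = M.sum()
    pi = M.sum(axis=1, keepdims=True) / total
    pj = M.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((M / total) / (pi * pj))
    return np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors (a similarity)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    """Euclidean distance between two word vectors (a dissimilarity)."""
    return float(np.linalg.norm(u - v))
```

On a real corpus one would additionally reduce the dimensionality (e.g., by keeping only the highest-entropy columns) and normalize the row vectors, as described above.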

Results

The correlations between the different similarity measures for our word pairs are shown in Table 3. These correlations are all significant (ps < .035), except for the correlation between the PPMIC and LSA cosines (p = .114). The mean similarity judgements for the word pairs used in Experiment 2 correlated almost perfectly with those for the word pairs of Experiment 1

Table 3. Correlations between the different measures of semantic similarity for the word pairs used in Experiments 1 and 2

          LSA      HAL      PPMIC    BEAGLE
HAL       −.213
PPMIC     .113     −.454
BEAGLE    .188     −.814    .393
Ratings   .563     −.271    .150     .228

Note: Correlations including the HAL model are negative, since HAL uses Euclidean distances and therefore a dissimilarity measure.

(t(198) = 85.80, p < .001, r = .987). Therefore, they are not reported separately in Table 3.

We first tested whether the different similarity measures actually succeed in predicting the RT data we obtained in Experiments 1 and 2. To do so, we employed the same LMEM analysis as in the results sections of these experiments. For Experiment 1, adding a HAL fixed effect to ModelA (which only contains length and frequency effects for primes and targets) significantly improved the model (χ²(1) = 7.66, p = .006). Higher Euclidean distances (i.e., lower similarities) predicted longer RTs (b = 0.088). Adding a BEAGLE fixed effect also improved ModelA (χ²(1) = 16.40, p < .001), where higher BEAGLE cosine similarities predicted shorter RTs (b = −0.1527). The same was true for the mean similarity judgements (χ²(1) = 8.54, p = .003), where higher similarity judgements predicted shorter RTs (b = −0.009). The PPMIC cosines, however, did not significantly improve ModelA (χ²(1) = 0.07, p = .798). The relations between these different similarity measures and RTs in Experiment 1 can be seen in Figure 6.⁵ The same pattern emerged for Experiment 2. A HAL distance fixed effect significantly improves ModelA (χ²(1) = 4.65, p = .031, b = 0.048), as does a BEAGLE cosine fixed effect

5 It seems as though there is one item for the PPMIC model with a cosine of about .3 that has a great influence on the regression (for both experiments). Removing this item however does not change the pattern of results in the analyses reported here.


Figure 6. Relation between (non-logarithmic) RTs and HAL Euclidean distances (top left), PPMIC cosine similarities (top right), BEAGLE cosine similarities (bottom left) and mean human similarity ratings (bottom right) for Experiment 1.

(χ²(1) = 8.54, p = .003, b = −0.008) and a fixed effect for human similarity judgements (χ²(1) = 4.75, p = .029, b = −0.004). A fixed effect for PPMIC cosines did not improve the model (χ²(1) = 0.72, p = .393). The relations

between these different similarity measures and RTs in Experiment 2 can be seen in Figure 7. For both experiments, a by-subject random slope does not improve the model further for any of the similarity measures examined.
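The t statistic reported above for the correlation between the two sets of mean ratings follows from the standard test of a Pearson correlation against zero. A minimal sketch (the small discrepancy from the reported t(198) = 85.80 reflects rounding of r):

```python
import math

def t_for_correlation(r, n):
    """t statistic testing a Pearson correlation against zero,
    with n - 2 degrees of freedom: t = r * sqrt(n - 2) / sqrt(1 - r^2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

# r = .987 across the n = 200 word pairs, giving df = 198
t = t_for_correlation(0.987, 200)
```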


Figure 7. Relation between (non-logarithmic) RTs and HAL Euclidean distances (top left), PPMIC cosine similarities (top right), BEAGLE cosine similarities (bottom left) and mean human similarity ratings (bottom right) for Experiment 2.

These results show that HAL, BEAGLE, and ratings similarities each predict the priming data of our experiments, while the PPMIC model fails to do so.


We then tested whether some similarity measures give better predictions of our data than others. We employed the following method for these tests. To test whether, for example, the HAL distances improve the predictions made by the LSA cosines, we compared

[ModelB] log(RT) ~ LSA Cosine + LengthTarget + FreqTarget + FreqPrime + LengthPrime + (1 + LengthTarget + FreqTarget | Subject) + (1 | Item)


with

[ModelX] log(RT) ~ LSA Cosine + HAL Distance + LengthTarget + FreqTarget + FreqPrime + LengthPrime + (1 + LengthTarget + FreqTarget | Subject) + (1 | Item).

If ModelX provides a significantly better fit than ModelB, then the HAL distances improve the predictions made by the LSA cosines. This type of model comparison can be used to test the predictions made by all four of our predictive similarity measures (LSA, HAL, BEAGLE and ratings) against each other. The results of those model comparisons are summarized in Table 4. As can be seen, LSA improves none of the predictions made by the other similarity measures. HAL improves the predictions made by LSA and by the human similarity judgements. BEAGLE and the human similarity judgements
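Comparing two such nested models amounts to a likelihood-ratio test. A minimal sketch with hypothetical log-likelihood values (illustrative only, not estimates from our data):

```python
import math

def lr_test_1df(loglik_reduced, loglik_full):
    """Likelihood-ratio test for adding one fixed effect (1 df):
    chi2 = 2 * (LL_full - LL_reduced); for 1 df the chi-square
    survival function equals erfc(sqrt(chi2 / 2))."""
    chi2 = 2.0 * (loglik_full - loglik_reduced)
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p

# Hypothetical log-likelihoods for the reduced model (ModelB) and
# the full model (ModelX); the values are made up for illustration.
chi2, p = lr_test_1df(-1523.4, -1520.5)
```

In practice these comparisons are carried out by the mixed-model software itself when refitting both models.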

both improve the predictions made by all other measures. These results suggest that BEAGLE and the human similarity judgements are best at predicting the priming data of Experiment 1. Interestingly, both measures improve the predictions made by the other, so the best model for describing the data of Experiment 1 is one that contains both a BEAGLE fixed effect and a fixed effect for human similarity judgements.

We performed the same analysis for Experiment 2, the results of which are shown in Table 5. For Experiment 2, the results are less clear than for Experiment 1, since many of the model comparisons just failed to reach significance. Maintaining our significance level of α = .05, the results are as follows: LSA, HAL and the human similarity judgements do not improve any of the other similarity measures, while BEAGLE significantly improves the predictions made by all other measures. As in the comparisons for Experiment 1, BEAGLE emerges as the best predictor of the priming data. These results speak clearly in favour of the BEAGLE model, while the relationships between the other variables are far less clear (especially considering that many comparisons are very close to significance, which makes it somewhat difficult to make definitive statements).

Discussion

In this section, we compared four different measures of word similarity (LSA, HAL, BEAGLE and human similarity judgements). A

Table 4. Results of the model comparisons for the different similarity measures for Experiment 1

          LSA             HAL             BEAGLE          Ratings
LSA       –               2.05 (.152)     1.57 (.210)     0.16 (.685)
HAL       5.83 (.016)     –               0.39 (.531)     5.05 (.025)
BEAGLE    14.09 (< .001)  9.13 (.003)     –               14.04 (< .001)
Ratings   4.82 (.028)     5.92 (.015)     6.17 (.013)     –

Note: The cell values show the χ² values (1 df) and the p-values (in parentheses) for comparisons between a model containing both the row and the column similarity measure of the cell and a model containing only the column similarity measure (i.e., they indicate whether adding the similarity measure in the row improves the predictions made by the measure in the column alone).


Table 5. Results of the model comparisons for the different similarity measures for Experiment 2

          LSA             HAL             BEAGLE          Ratings
LSA       –               3.48 (.062)     3.08 (.079)     1.64 (.200)
HAL       2.94 (.087)     –               0.04 (.843)     3.10 (.078)
BEAGLE    6.43 (.011)     3.93 (.047)     –               7.17 (.007)
Ratings   1.20 (.273)     3.21 (.073)     3.38 (.066)     –

Note: The cell values show the χ² values (1 df) and the p-values (in parentheses) for comparisons between a model containing both the row and the column similarity measure of the cell and a model containing only the column similarity measure (i.e., they indicate whether adding the similarity measure in the row improves the predictions made by the measure in the column alone).

first interesting result is to be found in the intercorrelations of these variables: of the three DSMs examined, LSA had the highest correlation with the human judgements obtained. This can in itself be taken as evidence that LSA cosine similarities capture word similarities as perceived by humans. The relatively high correlation might reflect that humans focus more on associative relations between words than on semantic relations when they determine word similarities without any specific instructions as to which criteria to use for the task.

For both Experiment 1 and Experiment 2, all four similarity measures succeeded in predicting the RTs found in the experiments. In all cases, the model parameters indicate that the measures actually predict priming effects (higher similarities/lower dissimilarities predict faster RTs). This is a generally promising result for the psychological validity of DSMs.

Comparisons between the different similarity measures showed that, for Experiment 1, BEAGLE and the human similarity judgements together best predicted the data, while for Experiment 2, BEAGLE alone best predicted the data. These results are interesting in at least two ways. Firstly, LSA did not predict our data best, although it is far more prominent and widely used than BEAGLE. Secondly, human similarity judgements are not the single best similarity measure when it comes to predicting priming effects. In Experiment 1, the BEAGLE cosine similarities (as well as the HAL distances) provide additional information, and in


Experiment 2, the BEAGLE cosine similarities even outperform the human similarity measures. Additionally, human similarity judgements do not improve the predictions made by LSA in Experiment 2, which provides further evidence that LSA can indeed be a useful variable for predicting RTs. These results demonstrate that DSMs are not just models that struggle to accurately capture human similarity judgements, but can provide complementary, comparable or maybe even more useful results. However, further research on the exact relation between human judgements and DSMs is needed to validate this statement.

The fact that BEAGLE emerges as the best of our DSMs in both experiments can easily be explained with regard to how the employed models work. While the LSA algorithm concentrates heavily on syntagmatic relations between words, the HAL algorithm emphasizes paradigmatic relations. Therefore, both algorithms take into account only one type of information present in the text corpora they analyse. The BEAGLE algorithm, on the other hand, is designed to capture both types of relation, and thus takes into account more information than either LSA or HAL does. This allows BEAGLE to capture associative priming as well as semantic priming (Jones et al., 2006), which in turn should allow it to better explain our data, where semantically, associatively, or both semantically and associatively related pairs might be present in the material.

Another initially surprising finding of our analyses is that the HAL-based PPMIC model, which Bullinaria and Levy (2007) found to outperform a standard HAL model, fails


to predict the priming effects found in our data. However, this finding might not be as surprising on closer examination, if one considers the criteria for which the HAL-based models in that study were optimized. These criteria were the TOEFL synonym test (Test of English as a Foreign Language, introduced as a validation criterion by Landauer & Dumais, 1997), a distance comparison test in which a semantically related word was the correct target word, as well as a semantic and a syntactic categorization task. All of these tests clearly require a model to capture semantic similarities between words and to focus on paradigmatic relationships between them (which is especially true for a synonym test and syntactic categorization), while associative, syntagmatic relations are not important. Therefore, one can assume that models optimized for these tasks are models optimized for capturing semantic, paradigmatic relations. As stated above, it has been shown that associative and semantic priming exist independently of each other, and both types can occur in a lexical priming task. This can explain why both LSA and HAL predict priming effects, and why BEAGLE outperforms them. Furthermore, this offers a potential explanation as to why the PPMIC model does not predict the priming effects: it uses a very small window size of one word ahead and one word behind the word of interest. This surely improves the capturing of paradigmatic relations and probably makes the model particularly effective in synonym tests, but at the cost of almost all syntagmatic information (if a larger window size such as 10 is used, a HAL model obtains some information about which other words co-occur in the broader context of a given word).
Assuming that the item material used in our study contained neither purely associatively related pairs nor purely semantically related pairs (which is a rather plausible assumption), one can expect both kinds of priming, associative and semantic, to occur in our experiment. If a model like the PPMIC model is optimized only for semantic similarities, and largely ignores associative similarities, one cannot expect it to capture any kind of associative boost (Hutchison, 2003;

Lucas, 2000) for priming effects (which a HAL model with a larger window size probably can). This might explain why the PPMIC model does not perform as well as the other, "standard" HAL model for our data.

However, one critical point concerning the interpretation of the results of this section has to be considered. In the introduction to this article, we argued that, up to now, LSA cosine similarities were only analysed as post hoc variables in experiments whose item material was generated mainly with association norms and human intuition. Our concern was to make clear that, in order to truly examine the effects of LSA cosine similarities, they have to be directly manipulated as an independent variable. The exact same argumentation applies to the similarity measures examined in this section, which were all included only as post hoc variables in a post hoc analysis. Because LSA is so prominent amongst DSMs, and often seen as the prototypical model for computing word similarities from corpus data, the present study was designed to investigate the effects of LSA cosine similarities on RTs. This is why these specific similarities were manipulated as an independent variable in our study. It is left to future research to further investigate whether other models such as HAL and BEAGLE really predict priming data better than LSA. We argue that, in order to do so, it is necessary to directly manipulate these similarity measures as independent variables as well. In this way, the results from different models can actually be compared, and it can be investigated whether the same pattern of results as found in this section emerges.

GENERAL DISCUSSION

In Experiment 1, we conducted a priming experiment using an LDT. The item material was generated by creating a set of prime words and pseudo-randomly selecting target words with varying cosine similarities to the primes. We found a priming effect in that higher cosine similarities predicted shorter RTs toward the target words.


In Experiment 2, we reversed the prime-target order of Experiment 1, since it was an arbitrary decision in that experiment to create primes and pseudo-randomly select targets rather than vice versa. We also found priming effects in this experiment, confirming the results of Experiment 1. The findings of Experiments 1 and 2 also extend those of previous studies to a language other than English (namely German).

The results of our experiments provide evidence that LSA cosine similarities can predict priming effects in an LDT, in that higher LSA cosine similarities predict shorter RTs. These findings can be explained by assuming that LSA cosine similarities are a measure of mental similarities between words. However, LSA cosine similarities are a symmetrical similarity measure, and the existence of asymmetrical word pairs (Kahan et al., 1999; Peterson & Simpson, 1989) does not fit this property. This is a potential restriction for assuming a direct correspondence between LSA cosine similarities and mental similarities, and has to be dealt with in future research (but see Kintsch, 2014).

Comparisons with other DSMs (HAL, PPMIC and BEAGLE) as well as human similarity judgements showed that LSA was outperformed by some measures in Experiment 1. In Experiment 2, however, only the BEAGLE model explained the data better than LSA, which in turn explained the data as well as the HAL model and the human similarity judgements did. The relation between LSA and other measures of semantic similarity therefore has to be further examined in future research. Furthermore, since BEAGLE emerged as the best similarity measure for predicting our data in both experiments, it remains to be shown that this pattern of results can also be found when BEAGLE cosine similarities are directly manipulated as independent variables.
While it may at first seem surprising that the optimally parameterized version of HAL, the PPMIC model, fails to predict our priming data, we have argued that this can be explained by the fact that this model is designed to perform specific semantic tasks. This, however, shows that it can be important to validate models against various kinds of data, including behavioural data as


obtained by priming studies. Models that are tailored to perform well for a specific range of tasks are not necessarily the models that also perform well for other tasks with different requirements. It might also turn out that there is no single best model for measuring semantic similarity, but that different models capture different aspects of language and of the similarities between words.

Some methodological issues of our study should also be addressed. The corpora we used to set up our models (the blogs corpus and the newspaper corpus) were not very large, both consisting of about 5 million words in about 50,000 documents. Although there is evidence that a corpus of this size can produce quite reasonable results (Bullinaria & Levy, 2007), it has been argued that a larger corpus gives better results (Brill, 2003; Bullinaria & Levy, 2007, 2012). Therefore, one can expect better predictions of priming effects from DSMs if one uses larger corpora to set them up.

Our findings have some further restrictions, so we are cautious not to overgeneralize them. For instance, we only investigated the effects of relatively frequent concrete nouns of medium length. We did not include abstract nouns or emotion words in the material. There are findings that priming effects for word pairs consisting of one concrete and one abstract noun are weaker (but nevertheless existent) compared to more homogeneous pairs consisting of only one type of word (Bleasdale, 1987). In LSA, on the other hand, each word is represented as a vector, and the similarity of each word pair can be computed, regardless of its type (for a similar argumentation on HAL, see Lund & Burgess, 1996). We therefore consider it likely that our results will replicate with abstract or emotion nouns. In principle, similar effects should also be observed for other word categories, such as verbs or adjectives.
However, there is a growing body of linguistic research, mostly on composition in DSMs (Baroni, Bernardi, & Zamparelli, 2014; Baroni & Zamparelli, 2010; Guevara, 2010), where property-describing words such as verbs and adjectives are conceptualized as linear functions, modifying the meaning of a noun vector. If this representation of adjectives and verbs as a


linear function is their "main" representation, then we cannot expect the cosine similarities between their vectors to predict priming effects (it is, however, possible to compute cosine similarities between linear functions in their matrix form). Future experiments should thus investigate priming effects with other word categories such as adjectives and verbs.

Taken together, our findings extend previous results that found relations between LSA cosine similarities and behavioural data in priming experiments (Jones et al., 2006). In those studies, LSA cosines only predicted priming effects in factorial designs, while our results show that they also predict priming effects at the item level; the magnitude of priming is a function of the LSA cosine similarity between prime and target. In this regard, our results differ from those obtained by Hutchison et al. (2008), who found that LSA cosine similarities did not predict priming effects at the item level. There are at least two possible explanations for this difference. The first main difference between our study and the study by Hutchison et al. is that we directly manipulated the LSA cosines as an independent variable, whereas Hutchison et al. analysed them post hoc, after their item material was already created. Therefore, we found that LSA cosine similarities predicted priming effects in experiments that focused explicitly on this variable, whereas it was only one of many post hoc variables in the Hutchison et al. study. This leads directly to the second main difference: in our data analysis, the regression models that were estimated contained far fewer parameters than the models Hutchison et al. employed. There, the LSA parameter was significantly correlated with four other model parameters, which might have led to flawed parameter estimates that cannot be reliably interpreted. In our experiments, we took care to ensure that the LSA cosine parameter was not correlated with any other model parameters.
Furthermore, the use of LMEMs made it possible to account for both by-subject and by-item random effects (Baayen et al., 2008). Thus, we argue that LSA is indeed capable of capturing by-item differences in the magnitude of priming.

On the basis of the results obtained in the two experiments reported in this article, we propose the use of DSMs such as LSA as a tool for automatically computing word similarities. These word similarities seem to resemble word similarities in the human mental lexicon, thereby disagreeing with the statement in Sahlgren (2008) that the word meanings obtained via DSMs such as LSA are not the same as the word meanings "in our head". Furthermore, the distributional hypothesis provides a psychologically plausible framework for these algorithms, and there is evidence to suggest that humans can learn word meanings through distributional patterns and co-occurrences (Ouyang, Boroditsky, & Frank, 2012).

Word similarities (such as cosine similarities) are valuable not only for priming experiments or as a control variable for creating experimental linguistic material; they are also an important free parameter in computational models of language comprehension (see, for example, Kintsch, 1988; Kintsch & van Dijk, 1978), as argued by Frank, Koppen, Noordman, and Vonk (2008). Such models assume words (or concepts) to be nodes in a semantic or associative network, interconnected with weighted links that specify the strength of each connection. This assumption is very similar to those made in network theories of the mental lexicon, as proposed by Quillian (1967) and Collins and Loftus (1975). In our view, DSMs offer an elegant and rather objective way of fixing these interconnection-strength parameters in computational models of language comprehension. Which type of DSM models these word similarities best, however, or whether different models capture complementary types of word similarities, is yet to be determined.
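As an illustration of this proposal, the following is a minimal sketch of our own hypothetical construction (not any published comprehension model) that fixes the link weights of a small word network from pairwise cosine similarities:

```python
import numpy as np

def similarity_network(words, vectors):
    """Build the weighted, symmetric link structure of a word network,
    setting each link weight to the cosine similarity between the two
    word vectors, as proposed in the text."""
    unit = {w: v / np.linalg.norm(v) for w, v in zip(words, vectors)}
    weights = {}
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            # cosine of unit vectors is just their dot product
            weights[(a, b)] = float(unit[a] @ unit[b])
    return weights

# Toy 2-dimensional "word vectors" for illustration only
words = ["dog", "cat", "car"]
vectors = [np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([0.0, 1.0])]
links = similarity_network(words, vectors)
```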

CONCLUSION

Taken together, our experiments show that LSA cosine similarities are a useful word similarity measure that predicts lexical priming effects. Our findings thereby replicate and extend previous studies. We employed a new, pseudo-random method for generating the item material, we


examined LSA cosines as a continuous, numerical, independent variable, and we analysed the data at the item level while considering both participant and item random effects simultaneously. A comparison with other models of semantic similarity (HAL and BEAGLE), however, indicated that, despite its prominence in the psychological literature, LSA does not emerge as the best model for predicting priming effects. In particular, the more complex BEAGLE model, which takes into account more information in the corpus data than LSA does, outperformed LSA in predicting our experimental data. However, our experiments were explicitly designed to test the predictive power of LSA, and the other similarity measures were only analysed as post hoc variables. Therefore, future research is needed in which different models of semantic similarity (especially DSMs) are compared, in order to examine which model best captures "the meanings in our heads".

REFERENCES

Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59, 390–412.
Baayen, R. H., & Milin, P. (2010). Analyzing reaction times. International Journal of Psychological Research, 3(2), 12–28.
Baier, H., Lenhard, W., Hoffmann, J., & Schneider, W. (2006). Summa LSA-Server Architecture [Software]. Available from http://www.summa.psychologie.uniwuerzburg.de
Baroni, M., Bernardi, R., & Zamparelli, R. (2014). Frege in space: A program for compositional distributional semantics. Linguistic Issues in Language Technologies, 9(6), 5–110.
Baroni, M., & Zamparelli, R. (2010). Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In H. Li & L. Màrquez (Eds.), EMNLP 2010: Conference on Empirical Methods in Natural Language Processing, proceedings (pp. 1183–1193). Stroudsburg, PA: ACL.
Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68, 255–278.
Bates, D., Maechler, M., Bolker, B., & Walker, S. (2014). lme4: Linear mixed-effects models using Eigen and S4 [Computer software manual]. Available from http://CRAN.R-project.org/package=lme4 (R package version 1.1–5)
Bleasdale, F. A. (1987). Concreteness-dependent associative priming: Separate lexical organization for concrete and abstract words. Journal of Experimental Psychology: Learning, Memory, and Cognition, 13, 582–594.
Brill, E. (2003). Processing natural language without natural language processing. In A. Gelbukh (Ed.), Computational linguistics and intelligent text processing (pp. 360–369). Berlin: Springer.
Bullinaria, J. A., & Levy, J. P. (2007). Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39, 510–526.
Bullinaria, J. A., & Levy, J. P. (2012). Extracting semantic representations from word co-occurrence statistics: Stop-lists, stemming, and SVD. Behavior Research Methods, 44, 890–907.
Collins, A. M., & Loftus, E. F. (1975). A spreading-activation theory of semantic processing. Psychological Review, 82, 407–428.
Cook, R. D. (1977). Detection of influential observations in linear regression. Technometrics, 19, 15–18.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41, 391–407.
Dumais, S. T. (1991). Improving the retrieval of information from external sources. Behavior Research Methods, Instrumentation, and Computers, 23, 229–236.
Ferrand, L., & New, B. (2003). Associative and semantic priming in the mental lexicon. In P. Bonin (Ed.), The mental lexicon: Some words to talk about words (pp. 26–43). New York, NY: Nova Science.
Frank, S., Koppen, M., Noordman, L., & Vonk, W. (2008). World knowledge in computational models of discourse comprehension. Discourse Processes, 45, 429–463.
de Groot, A. M. B. (1984). Primed lexical decision: Combined effects of the proportion of related prime-target pairs and the stimulus-onset asynchrony of prime and target. The Quarterly Journal of Experimental Psychology Section A: Human Experimental Psychology, 36, 253–280.


Guevara, E. (2010). A regression model of adjective-noun compositionality in distributional semantics. In R. Basili & M. Pennacchiotti (Eds.), GEMS '10: Proceedings of the 2010 Workshop on Geometrical Models of Natural Language Semantics (pp. 33–37). Stroudsburg, PA: ACL.
Günther, F., Dudschig, C., & Kaup, B. (in press). LSAfun: An R package for computations based on latent semantic analysis. Behavior Research Methods.
Harris, Z. (1954). Distributional structure. Word, 10, 146–162.
Hodgson, J. M. (1991). Informational constraints on pre-lexical priming. Language and Cognitive Processes, 6, 169–205.
Hutchison, K. A. (2003). Is semantic priming due to association strength or feature overlap? A micro-analytic review. Psychonomic Bulletin & Review, 10, 785–813.
Hutchison, K. A., Balota, D. A., Cortese, M., & Watson, J. M. (2008). Predicting semantic priming at the item level. The Quarterly Journal of Experimental Psychology, 61, 1036–1066.
Johnson-Laird, P. N. (1983). Mental models: Towards a cognitive science of language, inference, and consciousness. Cambridge, MA: Harvard University Press.
Jones, M. N., Kintsch, W., & Mewhort, D. J. K. (2006). High-dimensional semantic space accounts of priming. Journal of Memory and Language, 55, 534–552.
Jones, M. N., & Mewhort, D. J. K. (2007). Representing word meaning and order information in a composite holographic lexicon. Psychological Review, 114, 1–37.
Jones, M. N., Willits, J. A., & Dennis, S. (in press). Models of semantic memory. In J. R. Busemeyer, Z. Wang, J. T. Townsend, & A. Edels (Eds.), Oxford handbook of mathematical and computational psychology. New York, NY: Oxford University Press.
Jurgens, D., & Stevens, K. (2010). The S-Space Package: An open source package for word space models. In S. Kübler (Ed.), ACL 2010: 48th Annual Meeting of the Association for Computational Linguistics, proceedings of system demonstrations (pp. 30–35). Stroudsburg, PA: ACL.
Kahan, T. A., Neely, J. H., & Forsythe, W. J. (1999). Dissociated backward priming effects in lexical decision and pronunciation tasks. Psychonomic Bulletin & Review, 6, 105–110.
Keuleers, E., & Brysbaert, M. (2010). Wuggy: A multilingual pseudoword generator. Behavior Research Methods, 42, 627–633.

Kintsch, W. (1988). The use of knowledge in discourse processing: A construction-integration model. Psychological Review, 95, 163–182.
Kintsch, W. (2014). Similarity as a function of semantic distance and amount of knowledge. Psychological Review, 121, 559–561.
Kintsch, W., & van Dijk, T. A. (1978). Toward a model of text comprehension and production. Psychological Review, 85, 363–394.
Koriat, A. (1981). Semantic facilitation in lexical decision as a function of prime-target association. Memory & Cognition, 9, 587–598.
Kroll, J. F., & Merves, J. S. (1986). Lexical access for concrete and abstract words. Journal of Experimental Psychology: Learning, Memory, and Cognition, 12, 92–107.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to latent semantic analysis. Discourse Processes, 25, 259–284.
Louwerse, M. M. (2011). Symbol interdependency in symbolic and embodied cognition. Topics in Cognitive Science, 3, 273–302.
Louwerse, M. M., & Zwaan, R. A. (2009). Language encodes geographical information. Cognitive Science, 33, 51–73.
Lucas, M. (2000). Semantic priming without association. Psychonomic Bulletin & Review, 7, 618–630.
Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28, 201–208.
Martin, D. I., & Berry, M. W. (2007). Mathematical foundations behind latent semantic analysis. In T. K. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of latent semantic analysis (pp. 35–56). Mahwah, NJ: Erlbaum.
McKoon, G., & Ratcliff, R. (1992). Spreading activation versus compound cue accounts of priming: Mediated priming revisited. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18, 1155–1172.
McNamara, T. P., & Altarriba, J. (1988). Depth of spreading activation revisited: Semantic mediated priming occurs in lexical decisions. Journal of Memory and Language, 27, 545–559.
Neely, J. H., & Keefe, D. E. (1989). Semantic context effects on visual word recognition: A hybrid prospective-retrospective processing theory. In G. H. Bower (Ed.), The psychology of learning and motivation: Advances in research and theory (pp. 207–248). New York, NY: Academic Press.
Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (1998). The University of South Florida word association, rhyme, and word fragment norms. Available from http://w3.usf.edu/FreeAssociation
Nieuwenhuis, R., te Grotenhuis, M., & Pelzer, B. (2012). influence.ME: Tools for detecting influential data in mixed effects models. R Journal, 4, 38–47.
Ouyang, L., Boroditsky, L., & Frank, M. C. (2012). Semantic coherence facilitates distributional learning of word meaning. In N. Miyake, D. Peebles, & R. P. Cooper (Eds.), Proceedings of the 34th Annual Meeting of the Cognitive Science Society (pp. 602–607). Austin, TX: Cognitive Science Society.
Peterson, R. R., & Simpson, G. (1989). Effect of backward priming on word recognition in single-word and sentence contexts. Journal of Experimental Psychology: Learning, Memory, and Cognition, 15, 1020–1032.
Quillian, M. R. (1967). Word concepts: A theory and simulation of some basic semantic capabilities. Behavioral Science, 12, 410–430.

R Core Team. (2014). R: A language and environment for statistical computing [Computer software manual]. Available from http://www.R-project.org/
Rubenstein, H., & Goodenough, J. (1965). Contextual correlates of synonymy. Communications of the ACM, 8, 627–633.
Sahlgren, M. (2008). The distributional hypothesis. Rivista di Linguistica, 20, 33–53.
Shaoul, C., & Westbury, C. (2006). Word frequency effects in high-dimensional co-occurrence models: A new approach. Behavior Research Methods, 38, 190–195.
Shaoul, C., & Westbury, C. (2010). Exploring lexical co-occurrence space using HiDEx. Behavior Research Methods, 42, 393–413.
Snyder, H. R., & Munakata, Y. (2008). So many options, so little time: The roles of association and competition in underdetermined processing. Psychonomic Bulletin & Review, 15, 1083–1088.
Stevens, J. P. (1984). Outliers and influential data points in regression analysis. Psychological Bulletin, 95, 334–344.
Thompson-Schill, S. L., Kurtz, K. J., & Gabrieli, J. D. (1998). Effects of semantic and associative relatedness on automatic priming. Journal of Memory and Language, 38, 440–458.