JID:PLREV AID:486 /REV

[m3SC+; v 1.191; Prn:29/04/2014; 13:22] P.1 (1-21)

Available online at www.sciencedirect.com

ScienceDirect Physics of Life Reviews ••• (••••) •••–••• www.elsevier.com/locate/plrev

Review

Approaching human language with complex networks

Jin Cong a, Haitao Liu a,b,∗

a School of International Studies, Zhejiang University, Hangzhou, CN-310058, China
b Ningbo Institute of Technology, Zhejiang University, Ningbo, CN-315100, China

Received 16 March 2014; received in revised form 7 April 2014; accepted 15 April 2014

Communicated by L. Perlovsky

Abstract

Interest in modeling and analyzing human language with complex networks has been rising in recent years, and a considerable body of research in this area has already accumulated. We survey three major lines of linguistic research from the complex network approach: 1) characterization of human language as a multi-level system with complex network analysis; 2) linguistic typological research with the application of linguistic networks and their quantitative measures; and 3) relationships between the system-level complexity of human language (determined by the topology of linguistic networks) and microscopic linguistic (e.g., syntactic) features (as the traditional concern of linguistics). We show that the models and quantitative tools of complex networks, when exploited properly, can constitute an operational methodology for linguistic inquiry, which contributes to the understanding of human language and the development of linguistics. We conclude our review with suggestions for future linguistic research from the complex network approach: 1) relationships between the system-level complexity of human language and microscopic linguistic features; 2) expansion of research scope from the global properties to other levels of granularity of linguistic networks; and 3) combination of linguistic network analysis with other quantitative studies of language (such as quantitative linguistics).

© 2014 Elsevier B.V. All rights reserved.

Keywords: Human language; Complex networks; Network topology; Linguistics; Linguistic typology

* Corresponding author at: Ningbo Institute of Technology, Zhejiang University, No. 1 Xuefu Road, Ningbo, Zhejiang, CN-315100, PR China. Tel.: +86 571 88981923. E-mail address: [email protected] (H. Liu).

http://dx.doi.org/10.1016/j.plrev.2014.04.004 1571-0645/© 2014 Elsevier B.V. All rights reserved.

1. Network models and measures of human language

We live in a world pervaded by networks, i.e., systems which can be represented by graphs, with the system elements as vertices (nodes) and the relations between the elements as edges (links) [1,2]. The great majority of real-world networks (biological, social, technological, etc.) are complex networks [3], which are neither regular (as in the case of regular lattices) nor random (with any pair of vertices having a fixed probability to be linked) [4,5] and exhibit emergent properties which cannot be inferred on the basis of their component parts [6, p. 47]. The recent decade has witnessed the boom of networks science and an explosion of interest in complex networks across


J. Cong, H. Liu / Physics of Life Reviews ••• (••••) •••–•••

a multitude of disciplines ranging from natural sciences to social sciences and humanities [1,2,7–17]. In complement to the reductionist approach as commonly used in modern science, this new science of networks makes it possible to probe into the complexity of real-world systems in their entirety and thus constitutes one, if not the only, solution to the challenge of “reassembling” complex systems and capturing their holistic properties [18, p. 93]. Indebted substantially to graph theory and statistical physics, the models and quantitative tools employed by networks science provide a unifying framework for the structure and dynamics of real-world networks of various natures and thus facilitate communication between different disciplines. Language is “one of the wonders of the natural world” [19, p. 15] and “what makes us human” [20, p. 4]. Recognition is increasing that human language can also be modeled and analyzed with complex networks [17,21–29]. Among the growing enthusiasm for complex networks in recent years, the inquiry into human language from the complex network approach has arisen as a highly productive area, which is characterized by the convergence of disciplines such as statistical physics, systems science, linguistics, cognitive science, and natural language processing. This interdisciplinary endeavor contributes both methodological and substantive insights into human language as a system. That language is a system is a central assumption of modern linguistics [30, p. 1]. According to Saussure, the father of modern linguistics, language is a system in which each linguistic unit is defined by, and only by, its relations with the other units [31]. The Saussurean conception of language as a system is generally consistent with the modern definition of a system [32] and is manifested to varying degrees in a number of subsequent linguistic theories and schools (e.g. [33–38]). 
In the absence of operational methodology, the system perspective on human language has not been carried any further and only amounts to a metaphor. Instead, linguists are preoccupied with detailed structural features of human language, which can be easily handled with the reductionist approach. Inspired by complexity theory, it is recently acknowledged that language is a complex system [39,40]. If language is conceived of as a complex system of linguistic units and their relations, it is expected to exhibit emergent properties at the system-level due to the microscopic-level interactions between the system elements. Complex networks provide appropriate modeling for human language as a complex system and powerful quantitative measures for its complexity at the system-level. The flourishing research of linguistic networks has introduced a holistic and quantitative approach to the understanding of human language as a system. In addition, the unifying framework of complex networks places linguistic research in a broader and interdisciplinary context. This context is what linguistic research intrinsically deserves. Appropriate use of network analysis depends on the right choice of network representation [41]. Linguistic networks are network models for human language as a system. As there is no one network model which can cover the multi-faceted nature of human language, researchers rely on network models of various language sub-systems, each of which is a particular aspect or level of language. Like the network model of any other type of system, the basic form of a linguistic network N is a pair of sets N = (V , E), whereby V is the set of vertices representing the linguistic units and E the set of edges representing the pairwise relations of a particular type between these linguistic units in the language sub-system in question. Some language sub-systems are inventories of linguistic units (such as words, morphemes and phonemes) as found in a dictionary. 
For instance, the inventory of words (usually termed as lexicon) of a language may be organized by semantic relations (hyponymy, meronymy, antonymy, synonymy, etc.) between the words. Language sub-systems as inventories of linguistic units are modeled by static linguistic networks [42]. Analysis of static linguistic networks can shed light on the complex organization of different inventories of linguistic units of human language. A representative example of static linguistic networks is static semantic networks [43–48], which are based on such resources as word associations, WordNet, thesaurus, and Semantic Web. A static semantic network models the lexicon of a language, with the words as vertices and their semantic relations as edges. Other static linguistic networks may model language sub-systems pertaining to the formation of particular linguistic units. A network of this type can be constructed so that two linguistic units (such as morphemes and phonemes) as vertices are joined by an edge if they form a larger linguistic unit (e.g., a word) [49–54]. Another way is to capture the similarities of the linguistic units in terms of formation so that two linguistic units (such as words) as vertices are joined by an edge if they are (morphologically or phonologically) similar [55–57]. Other language sub-systems, on the other hand, are those of linguistic units and their relations as found in actual language use. These language sub-systems are modeled by dynamic linguistic networks [42]. Dynamic linguistic networks, unlike static linguistic networks, are based on naturally-occurring language data and thus can reflect the complexity of actual language use. The relations between linguistic units in actual language use can be observed at


Fig. 1. Two undirected unweighted linguistic networks based on Sentences (1) through (3). (a) Word co-occurrence network; (b) syntactic dependency network.

different language levels along the meaning-form dimension of actual language use. Some of these levels are characterized by the linearity of linguistic expression, while the others are characterized by the non-linear underlying linguistic structures. For instance, Tesnière [58] drew the distinction between l’ordre linéaire (the linearity of linguistic expression) and l’ordre structural (the non-linear structures underlying linguistic expression). Take the English sentence John put a book on the table as an example. A surface level of the sentence is that of the linear ordering of its words [59,60]. This level pertains to how the sentence unfolds spatially and temporally. Underlying this linear level is the level of syntactic structures, which can be analyzed as syntactic relations between the words in the sentence [59,60]. Syntactic relations are hierarchical [61], i.e., non-linear, by nature. An even deeper level is that of semantic structures [59,60,62–64], which can be analyzed as non-linear relations between the lexical concepts (words) (John as agent, put as action, book as patient, and table as goal) [59,60]. Accordingly, dynamic linguistic networks can be constructed and analyzed as models of different language levels as sub-systems along the meaning-form dimension.

The language sub-system of linguistic units (such as words) and their linear ordering in linguistic expression is modeled by a co-occurrence network [65–74]. A co-occurrence network is converted from the co-occurrence pairs of linguistic units (such as words) extracted from a body of authentic language data. Two vertices are joined by an edge if the corresponding linguistic units occur together in a window of x (x ≥ 2) words in at least one sentence. Fig. 1(a) illustrates a word co-occurrence network based on Sentences (1) through (3):

(1) John put an envelope on the table.
(2) The envelope on the table fell to the floor.
(3) The address on the envelope is wrong.

In the network of Fig.
1(a), two vertices are joined by an edge if the corresponding words co-occur in a window of two words (i.e., they are adjacent) in at least one sentence. The language sub-system of syntactic structures is modeled by a syntactic dependency network [75–79]. A syntactic dependency network is converted from a dependency treebank, which in turn is obtained through manual annotation of the authentic language data with dependency grammar. Dependency grammar [36,58–60,80,81] is a formalism of syntactic analysis which involves determination of the asymmetric pairwise relations, i.e., syntactic dependency relations, between the words in a sentence. The head word in a syntactic dependency relation is the governor, whereas the modifier or complement of the head word is the dependent. The vertices of a syntactic dependency network are usually word forms (sometimes lemmas, see, e.g., [79]) and two vertices are joined by an edge if they form a syntactic dependency relation in at least one sentence. Fig. 1(b) illustrates a syntactic dependency network based on Sentences (1) through (3). The language sub-system of semantic structures can be modeled by a dynamic semantic network [82,83], although it has been rarely examined from the complex network approach. A dynamic semantic network is converted from a semantic dependency treebank, which in turn is obtained by manually annotating the authentic language data with semantic roles (essentially pairwise predicate–argument relations between content words in the sentences). In modeling and analyzing a language sub-system as a linguistic network, the relations between the linguistic units are often treated as being symmetric and meanwhile equal in strength. In this case, the edges in the linguistic network constructed are undirected and unweighted (see the two networks in Fig. 1). Directed and/or weighted linguistic


networks (e.g., [70,75]) can be constructed if the relations between linguistic units are treated as asymmetric (e.g., the ordering of linguistic units in a co-occurrence pair and the governor-dependent distinction in a syntactic dependency relation) and/or different in strength (e.g., the frequency of a certain type of relations). Although undirected unweighted networks contain less information than their directed and/or weighted counterparts, they can be analyzed with more convenience than the latter. There is a wide range of quantitative measures [10,12,13,15,16,23,42,84–86] available for the characterization of the topological properties of a linguistic network, which cover different aspects of the complex organization of the language sub-system in question. Space precludes a detailed account of these measures. Instead, we focus on the commonly-used measures, namely, those concerning vertex connectivity, degree distribution, small-world property, and centrality. Unless otherwise specified, the complex network measures are of an undirected unweighted linguistic network N = (V , E) with n vertices and m edges as the model of a particular language sub-system. The connectivity of a linguistic unit in a language sub-system is measured by its degree as a vertex in the linguistic network. Given a vertex i ∈ V , its degree (ki ) equals to the number of edges which join it with its neighboring vertices. In the network of Fig. 1(a), for instance, the degree of the is 6 and that of envelope is 4. In a directed network, a vertex has both out-degree (the number of outgoing edges) and in-degree (the number of ingoing edges). For a dynamic linguistic network, the degree of a vertex can be generally interpreted with the theory of Probabilistic Valency Pattern (PVP) [77,81,87]. This theory views valency as the capacity of any linguistic unit (e.g., a word) to combine with other units in the formation of structural relations of a particular type. 
For instance, a word’s syntactic valency is the sum of its centripetal (input) and centrifugal (output) forces in the formation of syntactic dependency relations with other words. The former force is a word’s capacity to be governed by other words and the latter the capacity to govern other words. In a syntactic dependency network, the degree of any given vertex is the range of the corresponding word’s possible syntactic dependency relations with other words [88] and thus a measure for the corresponding word’s combinatorial capacity to form syntactic dependency relations, i.e., its syntactic valency. An estimator of any given linguistic unit’s connectivity in the sub-system is provided by the average degree ($\langle k \rangle$) of the linguistic network, which is the mean of the degrees of all its vertices, i.e., $\langle k \rangle = 2m/n$ for an undirected network.

In a language sub-system, some pairs of linguistic units are involved in a relation (co-occurrence, syntactic dependency, semantic relation, etc.) while the others are not. The probability that any given pair of linguistic units is involved in a relation is the density (ρ) of the linguistic network, which equals the ratio of the actual number of edges in the network to the maximal possible number of edges, i.e., $\rho = 2m/[n(n-1)]$ for an undirected network. Like other real-world networks, language sub-systems as linguistic networks tend to be sparse [26], implying a low probability that any given pair of linguistic units is involved in a relation.

A summary of the connectivity of linguistic units in a language sub-system is provided by the degree distribution ($P(k)$) of the linguistic network, which is the probability that any given vertex has degree k. Real-world networks are generally scale-free in that their degree distributions generally follow a power law $P(k) \sim k^{-\gamma}$ [89,90], implying that only a small number of vertices have extremely high degrees while most vertices have rather low degrees. The few vertices with high degrees usually behave as hubs of their networks [6].
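The connectivity measures defined so far can be checked on the toy network of Fig. 1(a). The following is a minimal sketch rather than the procedure used in the studies cited: the function name `cooccurrence_network`, the whitespace tokenization, and the lower-casing of word forms (so that The and the share one vertex) are assumptions made for the example.

```python
from collections import Counter

# Sentences (1) through (3), punctuation stripped (an assumption of this sketch).
SENTENCES = [
    "John put an envelope on the table",
    "The envelope on the table fell to the floor",
    "The address on the envelope is wrong",
]

def cooccurrence_network(sentences, window=2):
    """Undirected, unweighted word co-occurrence network as {word: set of neighbours}."""
    adjacency = {}
    for sentence in sentences:
        words = sentence.lower().split()  # assumed normalization of word forms
        for i, w in enumerate(words):
            adjacency.setdefault(w, set())
            # The other words of a window of `window` words that starts at w.
            for v in words[i + 1 : i + window]:
                if v != w:  # no self-loops
                    adjacency[w].add(v)
                    adjacency.setdefault(v, set()).add(w)
    return adjacency

adjacency = cooccurrence_network(SENTENCES)
n = len(adjacency)                                   # number of vertices
degree = {w: len(nb) for w, nb in adjacency.items()}
m = sum(degree.values()) // 2                        # each edge is counted twice

average_degree = 2 * m / n                           # <k> = 2m / n
density = 2 * m / (n * (n - 1))                      # rho = 2m / [n(n - 1)]
degree_distribution = {
    k: count / n for k, count in sorted(Counter(degree.values()).items())
}

print(degree["the"], degree["envelope"])             # 6 4, as stated for Fig. 1(a)
print(n, m, round(average_degree, 3), round(density, 3))  # 13 15 2.308 0.192
print(degree_distribution)
```

With only 13 vertices the degree distribution is far too small to test for a power law; the sketch only shows how $P(k)$ is tabulated before any such fit is attempted on a real corpus.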
Like other real-world networks, language sub-systems of various types, when modeled as linguistic networks, generally exhibit scale-free degree distributions [43,44,46,47,49–54,65,67–75,77–79,82]. For a language sub-system modeled as a dynamic linguistic network, for instance, scale-free degree distribution means that only a small number of linguistic units have extremely great combinatorial capacity. In a syntactic dependency network, for instance, the hubs tend to be function words (e.g., articles and prepositions) [26]. In a language sub-system, where the linguistic units are connected by relations of a particular type, some pairs of linguistic units are immediately connected while the rest are only connected through one or more other units. The degree of separation between a pair of linguistic units is measured by the shortest path length between them as network vertices, which equals to the number of edges in the shortest path between them. In Fig. 1(a), for instance, the shortest path length between an and envelope is 1 and that between an and table is 3. The maximal shortest path length in a linguistic network is its diameter (D). The average path length (L) of a linguistic network is the shortest path length averaged over all possible pairs of vertices. It is shown that hubs play an important role in the reduction of the level of L of a network [4,91]. For instance, the low level of L in a random network is largely due to the existence of hubs as “short cuts” [4]. The L’s of syntactic dependency networks (with word forms as vertices) of morphologically richer languages (such as Russian and Czech) tend to be greater than those of more analytic languages (such as Swedish and Danish) [92], for the latter rely more on function words, which behave as hubs in their syntactic dependency networks.
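The path-based measures can likewise be computed for the network of Fig. 1(a) with a breadth-first search from every vertex. This is a minimal sketch under stated assumptions: the helper names `build_adjacency` and `shortest_path_lengths` are hypothetical, word forms are lower-cased, and the network is assumed to be connected (which holds for this toy example).

```python
from collections import deque
from itertools import combinations

SENTENCES = [
    "John put an envelope on the table",
    "The envelope on the table fell to the floor",
    "The address on the envelope is wrong",
]

def build_adjacency(sentences):
    """Join two words by an undirected edge if they are adjacent in a sentence."""
    adjacency = {}
    for sentence in sentences:
        words = sentence.lower().split()
        for w, v in zip(words, words[1:]):
            if w != v:
                adjacency.setdefault(w, set()).add(v)
                adjacency.setdefault(v, set()).add(w)
    return adjacency

def shortest_path_lengths(adjacency, source):
    """Breadth-first search: number of edges on a shortest path from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

adjacency = build_adjacency(SENTENCES)
dist = {v: shortest_path_lengths(adjacency, v) for v in adjacency}

pairs = list(combinations(sorted(adjacency), 2))      # all unordered vertex pairs
diameter = max(dist[u][v] for u, v in pairs)          # D: maximal shortest path length
average_path_length = sum(dist[u][v] for u, v in pairs) / len(pairs)  # L

print(dist["an"]["envelope"], dist["an"]["table"])    # 1 3, as stated for Fig. 1(a)
print(diameter, round(average_path_length, 3))
```

For a large corpus the all-pairs search becomes costly, so dedicated network software is the more practical choice; the sketch is only meant to make the definitions of D and L concrete.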


In a language sub-system, the neighbors of a given linguistic unit may be neighbors themselves. This tendency is measured by a probability termed the clustering coefficient of the linguistic unit as a network vertex. In Fig. 1(a), for instance, the clustering coefficient of envelope is 1/6 ≈ 0.1667, since only one of the six possible pairs of its four neighbors is joined by an edge. The clustering coefficient of a linguistic network (C) is the mean of the clustering coefficients of all its vertices. It is a measure of the tendency of a network “to form cliques in the neighborhood of any given vertex” [93]. Morphologically richer languages, when modeled as syntactic dependency networks (with word forms as vertices), are expected to have smaller C’s than those of more analytic languages [92], for the former have more inflected word forms and thus a low probability of the neighbors of the same vertex being neighbors themselves.

Like other real-world systems [4,94], language sub-systems of various types generally have the small-world property when they are modeled and analyzed as linguistic networks [43,44,46,47,49–52,54–57,65,67–75,77–79,82]. A network is a small-world network if its average path length L is almost as small as that of its random network counterpart ($L_{\mathrm{random}} \approx \ln(n)/\ln\langle k \rangle$) and its clustering coefficient C is far greater than that of its random network counterpart ($C_{\mathrm{random}} \approx \langle k \rangle/n$) [4]. In other words, a small-world network exhibits a low degree of separation between vertices and a high level of clustering. For a language sub-system, the small-world topology of its linguistic network model facilitates communication between the vertices and thus facilitates mental navigation (e.g., [21,44,56,65,75]), if the linguistic network can be viewed as a model of the mental representation of linguistic knowledge.

As previously mentioned, only a small number of linguistic units in a language sub-system are highly connected (i.e., as network hubs).
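The clustering coefficient and the two random-network baselines can be sketched in the same way for Fig. 1(a). The helper names below are hypothetical, and assigning a clustering coefficient of 0 to vertices with fewer than two neighbors is a convention assumed for this example, not one stated in the text.

```python
import math
from itertools import combinations

SENTENCES = [
    "John put an envelope on the table",
    "The envelope on the table fell to the floor",
    "The address on the envelope is wrong",
]

def build_adjacency(sentences):
    """Join two words by an undirected edge if they are adjacent in a sentence."""
    adjacency = {}
    for sentence in sentences:
        words = sentence.lower().split()
        for w, v in zip(words, words[1:]):
            if w != v:
                adjacency.setdefault(w, set()).add(v)
                adjacency.setdefault(v, set()).add(w)
    return adjacency

def local_clustering(adjacency, vertex):
    """Share of pairs of neighbours of `vertex` that are themselves joined by an edge."""
    neighbours = adjacency[vertex]
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumed convention for vertices with fewer than two neighbours
    linked = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return linked / (k * (k - 1) / 2)

adjacency = build_adjacency(SENTENCES)
n = len(adjacency)
average_degree = sum(len(nb) for nb in adjacency.values()) / n

C = sum(local_clustering(adjacency, v) for v in adjacency) / n
L_random = math.log(n) / math.log(average_degree)   # ln(n) / ln<k>
C_random = average_degree / n                       # <k> / n

print(round(local_clustering(adjacency, "envelope"), 4))  # 0.1667, i.e. 1/6
print(round(C, 4), round(C_random, 4), round(L_random, 4))
```

A three-sentence network is much too small to show the small-world property; on a real corpus one would compare C with `C_random` and the measured L with `L_random` as described above.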
If the connectivity (measured by vertex degree) of these linguistic units relative to the others in the language sub-system is great, the linguistic network will have a strong tendency to exhibit a star-like (i.e., highly centralized) topology. This tendency can be measured by a centrality measure termed as network centralization (NC) [95]. For a dynamic linguistic network, NC reflects the relative combinatorial capacity of the linguistic units behaving as hubs. For other centrality measures see, e.g., [96–98]. A large amount of effort has been devoted to characterizing the complex organization of various language subsystems by means of topological analysis of their linguistic network models [43,44,46–57,65–79,82,99]. Topological analysis of various language sub-systems as linguistic networks has provided a characterization of the complex organization of these sub-systems and thus a picture of the system-level complexity, i.e., the macro structure, of human language. This macro structure is a necessary complement to the wealth of findings concerning the micro structure of human language in traditional linguistic research. The non-trivial statistical patterns (especially small-world and scale-free properties) of various language sub-systems as linguistic networks are candidates for language universals at the system-level [21,26,75]. Meanwhile, the discovery of these non-trivial statistical patterns in various language sub-systems has extended our knowledge of the universality of these patterns in real-world networks to the case of human language. There have been a number of reviews of research on linguistic networks (e.g., [21,23–26]). The studies reviewed are mainly those concerning the topology of various linguistic networks. We argue that linguistic networks, together with the quantitative measures, are tools for understanding human language. 
Mere network analysis of linguistic networks themselves may not lend much to the understanding of human language and the development of linguistics. In the above-mentioned studies concerning the network topology of various language sub-systems, the linguistic networks and their quantitative properties were often not treated as tools for the inquiry into human language but the research objects themselves. Meanwhile, little effort (e.g., [44,56,65,75, 77,99]) has been made to connect results of topological analysis with the theories and findings in linguistics and its neighboring disciplines. It is necessary to exploit the possibility of adopting complex networks as a methodology for linguistic inquiry, so that investigations of linguistic networks can contribute to the understanding of human language and the development of linguistics. This review focuses on three major lines of research devoted to this end (henceforth, Line 1, Line 2, and Line 3). For Line 1, we review Liu and Cong’s [83] recent work, which is a multi-level complex network analysis of human language, with modern Chinese as a case study. This study has obtained a global characterization of human language as a multi-level system and contributed to the interdisciplinary perspective on human language. Line 2 concerns linguistic typological studies with complex networks [92,100–102], which have shown that it is possible to classify different languages on the basis of major parameters of their linguistic networks. Line 3 consists of studies concerning the relationships between the system-level complexity of human language (determined by topological analysis of linguistic networks) and microscopic linguistic (e.g., syntactic) features [103–108]. 
We conclude our review with an outline of directions of future research: 1) relationships between the system-level complexity of human language and microscopic linguistic features; 2) expansion of research scope from the global properties to other levels of granularity of linguistic networks; and 3) combination of linguistic network analysis with other quantitative studies of language (such as quantitative linguistics).

2. Human language as a multi-level system: a complex network analysis

As previously mentioned, a central assumption of modern linguistics is that language is a system. Moreover, language as a system is generally regarded as consisting of a series of levels along the meaning-form dimension (e.g., [33–36]). This multi-level approach is in line with the view that a complex system is usually hierarchical (i.e., consisting of sub-systems) [109]. A number of frameworks for linguistic description (e.g., [59,60,63]) emphasize a multi-level characterization of the structure of human language. Psycholinguistics also generally assumes that language processing occurs at different language levels [110]. It follows that human language is a system of multiple levels, each of which can be defined as a sub-system of relevant linguistic units and their relations of a particular type.

As mentioned in Section 1, different language levels (those of the linearity of linguistic expression, syntactic structures, semantic structures, etc.) as language sub-systems along the meaning-form dimension of actual language use can be modeled and analyzed as dynamic linguistic networks. However, previous studies concerning the complex organization of language sub-systems as dynamic linguistic networks generally neglect the fact that language is a multi-level system, for they usually focus on a single language level as a language sub-system. Such a mono-level approach to human language is far from sufficient for the understanding of human language as a multi-level system. Therefore, it is necessary to adopt a multi-level approach to human language as a system. This involves analysis and comparison of dynamic linguistic network models of different language levels.
Although the previous studies of dynamic linguistic networks cover different language levels, their results obtained from the different language levels are hardly comparable. This is because the linguistic networks they constructed and analyzed are often based on different language data. Such being the case, it is uncertain whether the differences between the linguistic networks of different language levels are due to the differences between the language levels or those between the language data. A desirable way of multi-level network analysis of a language is to analyze and compare the topology of dynamic linguistic network models constructed at different levels of the same language based on the same body of language data. In addition, as discussed in Section 1, the previous studies of the complexity of various language sub-systems, although modeling and analyzing human language with complex networks, often lack sufficient linguistics-related motivation and linguistics-related interpretation of their results. On the one hand, they do not seem to show much interest in the language levels (as language sub-systems) modeled by the linguistic networks. In this way, analysis of linguistic networks is not treated as a means to an end for the understanding of human language but becomes the end itself. On the other hand, as mentioned in Section 1, little effort is made to build a bridge between the results of linguistic network analysis and the theories and findings in linguistic and its neighboring disciplines. An important message from these previous studies is that various language sub-systems, when modeled and analyzed as linguistic networks, generally share in common with other real-world networks a series of non-trivial statistical patterns (especially small-world and scale-free properties). 
However, these statistical patterns are not what linguists are traditionally preoccupied with, although they are candidates for system-level language universals and a necessary complement to the abundant findings concerning the microscopic linguistic features. Without sufficient linguistics-related explanation, the results of linguistic network analysis can hardly contribute to the understanding of human language. In sum, the characterization of the system-level complexity of human language with dynamic linguistic networks should not only consider the multi-level nature of human language but also have linguistics-related motivation and explanation.

The work of Liu and Cong [83] is an attempt to characterize human language as a multi-level system with complex network analysis, with modern Chinese as a case study. The data source for the construction of linguistic networks is a modern Chinese corpus of 46,685 word tokens transcribed from the news (xīnwénliánbō) of China Central Television. On the basis of this corpus, four linguistic networks were constructed at four different levels along the meaning-form dimension of modern Chinese. The four network models, from deep (meaning) to surface (form), are a dynamic semantic network (Network1), a syntactic dependency network (Network2), a word co-occurrence network (Network3), and a Chinese-character co-occurrence network (Network4). Network1 and Network2 were constructed at the levels of non-linear semantic and syntactic structures, respectively, whereas Network3 and Network4 were constructed at the levels characterized by the linearity of linguistic expression. The following three Chinese sentences can be used as the


Fig. 2. Four linguistic networks based on Sentences (4) through (6). (a) Dynamic semantic network; (b) syntactic dependency network; (c) word co-occurrence network; (d) Chinese-character co-occurrence network.

language data for the demonstration of how the four networks were constructed (PFV = perfective marker; CLF = classifier; REL = relative marker). Fig. 2 illustrates the four types of networks based on the following three sentences:

(4) yuēhàn zài zhuōzi shàng fàng le běn shū
    John at table up put PFV CLF book
    ‘John put a book on the table.’
(5) zhuōzi shàng de shū diào dào le dì shàng
    table up REL book fall to PFV floor up
    ‘The book on the table fell onto the floor.’
(6) shū de fēngmiàn pò le
    book REL cover break PFV
    ‘The cover of the book is broken.’

The language sub-system of semantic structures is modeled by Network1, which captures the lexical concepts (words) and their possible combinatorial patterns. Semantic structures are encoded by syntactic structures, a level which is modeled by Network2, which captures different words and their possible combinatorial patterns. Syntactic structures underlie the linear ordering of words in linguistic expression, a level which is modeled by Network3. Network3 was constructed so that two words as vertices are joined by an edge if they are adjacent in at least one sentence. This network model captures the words and their possible linear ordering in actual language use.

All languages have syntax, which encodes the relations between concepts (semantic structures) and which underlies the linear sequencing of words. Therefore, the language levels of semantic structures, syntactic structures, and their linear realization (modeled by Network1 through Network3, respectively) are universal to human language. In the Chinese language, however, there is yet another level of analysis, which is characterized by the linear sequencing of Chinese characters and which is modeled by Network4 in this study. Note that Sentences (4) through (6) are presented with the words in them separated by spaces.
However, unlike other languages such as English, the word separators do not exist in actual use of written Chinese and Chinese sentences generally exhibit themselves as continuous streams of

Chinese characters. The original form of Sentence (4), for instance, is 约翰在桌子上放了本书 (John put a book on the table), a stream of 10 continuous characters. A Chinese character can be generally conceived of as a morpheme, for it corresponds to strictly one syllable, one logographic symbol, and usually a certain amount of meaning [111–113]. Chinese characters form Chinese words, which in turn are sequences of one, two, or sometimes more Chinese characters. For instance, the word 桌子 (zhuōzi, table) in Sentence (4) consists of two characters, 桌 (zhuō) and 子 (zi). The user of Chinese has to decode the streams of characters into streams of words. Network4 was constructed in such a way that two vertices are joined by an edge if the two corresponding characters are adjacent within at least one sentence. By focusing on modern Chinese as a multi-level system, we can examine the language levels as sub-systems universal to human language and the level peculiar to the Chinese language. The findings of this study, therefore, have implications for language sub-systems which are universal to human language and the sub-system which is rather peculiar to the Chinese language.

The major parameters of the four linguistic networks were calculated with Network Analyzer [114]. These cover vertex connectivity, degree distribution, small-world topology, correlations in connectivity pattern [115], and centrality of the four network models. The theory of PVP [77,81,87] can be adopted in the interpretation of vertex degree of the four language sub-systems modeled as linguistic networks. As mentioned previously, the degree of a vertex in Network2 as a syntactic dependency network is an indicator of the corresponding word’s syntactic valency. As the notion of valency can be extended to the level of semantic structures [116], the degree of a vertex in Network1 measures the corresponding lexical word’s semantic valency.
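The adjacency-based construction described above for the word co-occurrence network and the character co-occurrence network can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure actually used in the reviewed study: the sentences are written in toneless pinyin, hyphens are a hypothetical notation for character boundaries inside a word, and each syllable is assumed to stand for one distinct character (which holds for this toy sample but not for Chinese in general).

```python
from collections import defaultdict

# Toneless pinyin stand-ins for Sentences (4)-(6). Hyphens mark character
# (syllable) boundaries inside a word; this notation is an assumption made
# for illustration only.
SENTENCES = [
    "yue-han zai zhuo-zi shang fang le ben shu",
    "zhuo-zi shang de shu diao dao le di shang",
    "shu de feng-mian po le",
]


def cooccurrence_network(sequences):
    """Undirected adjacency map: units adjacent in at least one sequence are linked."""
    adjacency = defaultdict(set)
    for units in sequences:
        for left, right in zip(units, units[1:]):
            if left != right:  # ignore self-loops
                adjacency[left].add(right)
                adjacency[right].add(left)
    return adjacency


def edge_count(adjacency):
    """Each undirected edge is stored twice, once per endpoint."""
    return sum(len(neighbors) for neighbors in adjacency.values()) // 2


# Word level: vertices are whole words.
word_sequences = [[w.replace("-", "") for w in s.split()] for s in SENTENCES]
word_net = cooccurrence_network(word_sequences)

# Character level: vertices are single syllables standing in for characters.
char_sequences = [s.replace("-", " ").split() for s in SENTENCES]
char_net = cooccurrence_network(char_sequences)

# Vertex degree, read in the text as an indicator of combinatorial capacity.
word_degree = {w: len(n) for w, n in word_net.items()}
char_degree = {c: len(n) for c, n in char_net.items()}
```

In this toy sample the perfective marker le ends up with the largest degree at both levels, which is in line with the role of monosyllabic function words as hubs.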
It has been found that usually over 50% of the syntactic dependency relations in Chinese sentences are between adjacent words [117]. It follows that the degree of any given vertex in Network3 largely reflects the corresponding word’s syntactic valency. Considering the fact that most words in modern Chinese are disyllabic (i.e., consisting of two characters) [111,118,119], a considerable part of a given vertex’s edges in Network4 are the corresponding character’s combinatorial patterns in word formation with other characters. Therefore, without considering the small number of vertices representing characters which usually form monosyllabic (i.e., one-character) words, the degree of a given vertex in Network4 largely reflects the corresponding character’s combinatorial capacity in word formation. This combinatorial capacity can be termed as the character’s lexical valency. According to the usage-based approach to human language [120–122], the actual use of language constitutes the basis of the cognitive representation of a language. The four linguistic networks constructed in the present study, which capture the linguistic units and their relations in actual language use, can be treated as models of how the knowledge of different levels (as sub-systems) of modern Chinese is represented in the human mind. Spreading activation [36, 123,124] is proposed as a basic mechanism of information retrieval in the human mind. The loss of activation energy during the course of spreading is inevitable. All of the four networks are found to be small-world, implying the high efficiency of communication between their vertices. Given the small-world topology, whereby each pair of vertices can be generally connected by a short path, the loss of activation energy can be minimized and the success of retrieval is thus maximized. 
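The two quantities behind the small-world notion used above, the clustering coefficient C and the average shortest path length L, can be computed as in the following sketch. It is a plain-Python illustration under stated assumptions: the graph is a hypothetical four-vertex toy example rather than one of the four networks, C is taken as the mean of the local clustering coefficients with vertices of degree below 2 counted as 0, and L is averaged over connected pairs only. Network Analyzer may follow different conventions.

```python
from collections import deque


def clustering_coefficient(adj):
    """Mean local clustering coefficient (vertices with degree < 2 count as 0)."""
    total = 0.0
    for vertex, neighbors in adj.items():
        k = len(neighbors)
        if k < 2:
            continue
        nbrs = list(neighbors)
        # Count edges that actually exist among the neighbors of this vertex.
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if nbrs[j] in adj[nbrs[i]]
        )
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over connected ordered pairs, via breadth-first search."""
    dist_sum, pair_count = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        dist_sum += sum(dist.values())
        pair_count += len(dist) - 1
    return dist_sum / pair_count


# Hypothetical toy graph: a triangle a-b-c with a pendant vertex d attached to c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
C = clustering_coefficient(toy)
L = average_path_length(toy)
```

A network is usually called small-world when its C is much higher than that of a random graph of the same size and density while its L remains comparably short; the toy graph is far too small for such a comparison and only shows how the two values are obtained.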
Bordag and Bordag [125], through their simulation, have demonstrated the effectiveness of spreading activation in word sense disambiguation when it took place across a small-world word co-occurrence network.

In each of the four language sub-systems, there are only a small number of highly-connected linguistic units, which is indicated by the scale-free degree distributions of their linguistic network models. The scale-free property of a language sub-system as a dynamic linguistic network is readily interpreted as the heterogeneity of the distribution of the linguistic units’ combinatorial capacity, although it is closely related to Zipf’s law (essentially a power-law decay) [126] of the linguistic units’ rank-frequency distribution (e.g., [65,75]). As previously discussed, the vertex degree of any given linguistic unit in the four language sub-systems as linguistic networks corresponds, directly (for Network1 and Network2) or indirectly (for Network3 and Network4), to the linguistic units’ combinatorial capacity from the perspective of PVP. Therefore, the scale-free property of each language sub-system suggests that there are only a small number of linguistic units which have extremely great combinatorial capacity. Monosyllabic function words such as 的 (de, a relative marker), 和 (hé, a conjunction with a meaning close to and), 在 (zài, a preposition with a meaning close to at) and 了 (le, a perfective marker) are among the most important hubs in Network2, Network3 and Network4.

The scale-free property found in the four language sub-systems as linguistic networks points to the principle of least effort [126], which is in essence a balance between the demand of speaker/writer and hearer/reader to minimize the effort in language production and comprehension. The combinatorial capacity of a linguistic unit generally corresponds to the richness of its contexts of use (or, the flexibility of its use). On the one hand, the linguistic units

(although few in number) used in extremely rich contexts reduce the encoding effort of the speaker/writer in language production. On the other hand, with most linguistic units used rather unambiguously due to their relatively limited ranges of contexts, the decoding effort of the hearer/reader in language comprehension can also be effectively reduced. Therefore, with only a small number of linguistic units used in extremely rich contexts, the needs for effort saving on both sides of speaker/writer and hearer/reader can be satisfied in communication. Note that spreading activation as a mechanism of information retrieval is a brute-force process, for the activation of a vertex, once reaching a certain threshold, always spreads to all the neighboring vertices. In other words, spreading activation is a mental navigation process without global knowledge of the network in which it is taking place. Therefore, the organizational patterns of the four language sub-systems must be responsible for the success of this process. Boguñá et al. [127] have shown that the co-existence of high clustering and scale-free degree distribution generally found in real-world networks increases the navigability of these networks and thus facilitates communication between the vertices, which takes place without global knowledge of the networks. As network models of the four language sub-systems all exhibit high clustering and scale-free degree distribution, their navigability facilitates the brute-force mechanism of information retrieval without global knowledge of the cognitive representations of the language. As shown by the above discussion, the system-level language universals, namely, small-world and scale-free properties have been confirmed in all of the four language sub-systems along the meaning-form dimension of modern Chinese. 
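Whether a degree distribution is scale-free is commonly checked by fitting P(k) to a power law. The sketch below shows one simple way of doing this, an ordinary least-squares fit on log-log axes that returns the exponent and the coefficient of determination. It is an assumption that such a regression approximates the fitting performed in the reviewed studies; the text does not specify the procedure, and log-log regression is known to be a rough estimator. The degree sample is synthetic and constructed to follow k^-2 exactly.

```python
import math
from collections import Counter


def fit_power_law(degrees):
    """Least-squares fit of log P(k) = a - gamma * log k.

    Returns (gamma, r_squared). Requires at least two distinct positive degrees.
    """
    counts = Counter(d for d in degrees if d > 0)
    n = sum(counts.values())
    xs = [math.log(k) for k in counts]
    ys = [math.log(c / n) for c in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return -slope, r_squared


# Synthetic degree sample (not data from any study): counts proportional to k ** -2.
synthetic_degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
gamma, r_squared = fit_power_law(synthetic_degrees)
```

For real linguistic networks the fit is never exact, and a coefficient of determination close to 1 is read as a good fit to the power law.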
In addition, the four language sub-systems as linguistic networks generally exhibit disassortative mixing [115], implying that linguistic units with strong combinatorial capacity tend to be combined with those with weak combinatorial capacity. This non-trivial statistical pattern is often found in language sub-systems as dynamic linguistic networks (e.g., [66,67,69,70,75,78,82]). Meanwhile, the four sub-systems generally exhibit hierarchical organization [115,128], an organizational pattern whereby linguistic units with weak combinatorial capacity tend to form dense sub-networks whereas those with great combinatorial capacity join these sub-networks into a connected whole. This statistical pattern is often found in various language sub-systems [51,66,67,69,70,75,78].

In order to obtain a holistic view of the similarities and differences of the four language sub-systems in terms of organizational patterns, a cluster analysis of their network models was conducted with their network parameters as input. As indicated by the result of clustering, Network1 is rather distinct from the other three as one cluster in terms of network topology (similarity = 28.45). This suggests that the sub-system of Chinese semantic structures is rather distinct from the other three sub-systems and the Chinese semantic structures may be a separate level in language processing. This finding supports particular arguments about semantic structures in the linguistic literature. For instance, in Givón’s [129, p. 7] terms, the sub-system of semantic structures is a system of cognitive representation rather than a system of communicative codes, to which the language sub-systems modeled by Network2 through Network4 are related. Jackendoff [62, p. 18] claimed that the level of semantic structures does not belong exclusively to language but is an interface between language and other cognitive capacities.
Semantic structures serve as a representation of the events and states in the world, for which linguistic units (such as function words with strong combinatorial capacity) expressing the relations between the concepts are unnecessary. As linguistic units behaving as network hubs play an important role in the topology of a linguistic network (e.g., [43,75,104,105]), the topological uniqueness of the sub-system of Chinese semantic structures is largely attributed to the absence of such powerful linguistic units. The topological distinctiveness of the sub-system of semantic structures may also indicate the distinctiveness of this sub-system from an evolutionary perspective. It is noteworthy that Network1 exhibits some non-trivial statistical patterns weakly (especially small-world topology and disassortative mixing) and has a lower degree of network centralization. As the structure of a complex network is the result of the corresponding system’s evolution [16], the weaker exhibition of particular non-trivial statistical patterns in Network1 may indicate the low degree of evolution of the Chinese semantic structures sub-system compared with the other three sub-systems. This supports the claim that semantic structures are pre-linguistic [64,130]. Lacking highly evolved devices for explicit encoding of the relations between the concepts, semantic structures can be viewed as the basis of a primitive mode of communication termed as proto-language [131] or proto-grammar [129]. The sub-system of (non-linear) syntactic structures and the sub-system of their linear realization are found to be highly resemblant in terms of organizational patterns, as indicated by the great similarity (90.43) of their network models in network topology. This great resemblance also confirms the feasibility of applying word co-occurrence networks as a more convenient alternative to syntactic dependency networks in studies such as Liu and Cong [102] (see Section 3). 
Some non-trivial statistical patterns are more strongly exhibited by Network2 and Network3 than Network1. According to the insights from previous studies (e.g., [26,105]), this is due to the existence of powerful

hubs, the most important of which are function words. The importance of function words to the topology of Network2 and Network3 also indicates the importance of these linguistic units to the syntax of modern Chinese. The resemblance between the sub-system of syntactic structures and that of its linear realization is largely due to the significant overlap between syntactic dependency relations and word adjacency relations in the sentences [117]. This overlap is also manifested by the tendency of mean dependency distance (MDD) of sentences to be minimized, which in turn is an important characteristic of natural language syntax [132]. Dependency distance [81,132–135] is the linear distance (in the number of words) between the governor and dependent. For instance, the dependency distance between John and put is 1 and that between put and on is 3 in Sentence (1) of Section 1. MDD is the dependency distance averaged over all syntactic dependency relations in a sentence. MDD is found to be minimized in various different languages [135]. In other words, there is a cross-linguistic tendency of the governor and the dependent to be minimally separated. This tendency can be accounted for by adopting the processing-based approach to linguistic complexity [135–138]. The strong cross-linguistic tendency of MDD to be minimized means that the difficulty of sentence processing of human language is generally kept to a minimum. The shorter the dependency distance, the less the load on working memory in the processing of the dependency relation (e.g., [133,135]). The significant overlap between non-linear syntactic structures and their linear realization is also expected to facilitate the acquisition of syntax on the basis of (linear) authentic language data. This overlap, therefore, supports the learnability of syntactic knowledge on the basis of linguistic experience [137,139]. 
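The definitions of dependency distance and MDD given above can be made concrete with a short sketch. The sentence and its dependency annotation below are hypothetical, chosen only so that the distances 1 (between John and put) and 3 (between put and on) mentioned in the text are reproduced; the root word, which has no governor, is left out of the average.

```python
def dependency_distances(heads):
    """Absolute linear distance between each word and its governor.

    heads[i] is the 1-based position of the governor of word i + 1;
    0 marks the root, which has no governor and is skipped.
    """
    return [abs(pos - head) for pos, head in enumerate(heads, start=1) if head != 0]


def mean_dependency_distance(heads):
    """Dependency distance averaged over all dependency relations in a sentence."""
    distances = dependency_distances(heads)
    return sum(distances) / len(distances)


# Hypothetical example sentence with an assumed dependency analysis.
words = ["John", "put", "the", "book", "on", "the", "table"]
heads = [2, 0, 4, 2, 2, 7, 5]  # e.g. "John" depends on word 2 ("put")

distances = dependency_distances(heads)
mdd = mean_dependency_distance(heads)
```

Four of the six dependencies in this example hold between adjacent or nearly adjacent words, which illustrates on a very small scale the overlap between dependency relations and adjacency relations discussed above.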
In sum, the great similarity between the sub-system of Chinese syntactic structures and that of their linear realization is an indicator of the convenience which the language provides for the processing and acquisition of its syntax. The great similarity between the two sub-systems should be the result of an evolutionary adaptation of the language so that the cognitive effort in both syntactic processing and acquisition can be minimized. It validates, in quantitative terms, the claim that syntax has evolved to be easy to process and learn [140,141].

The sub-system of Chinese characters is found to be distinct in organization from the sub-system of syntactic structures and that of their linear realization. This indicates that the role of the linear ordering of Chinese characters is different from syntactic encoding. As mentioned previously, several monosyllabic (one-character) Chinese function words are among the most important hubs in Network4. Meanwhile, Network4 shows the highest value of NC, implying that these hubs are more powerful than those in the other three sub-systems. The high frequencies of these monosyllabic function words in actual language use lead to their abundant co-occurrence with neighboring characters in the utterances and thus make them powerful network hubs. Their importance in Network4 suggests that they play an important role in Chinese utterances as continuous streams of characters. However, this important role is not about syntactic encoding. As discovered by relevant research, in addition to being a syntactic device, function words also contribute considerably to lexical processing. Function words “tend to be productively and perceptually minimal” [142] and are easily distinguished from content words. For instance, the Chinese function words are characterized by such properties as smaller word length and neutral tone [143].
These make function words an important cue in lexical segmentation and categorization out of the speech streams for both adults and pre-grammatical children [143,144]. For written Chinese, these monosyllabic function words are especially important considering that the words in the utterances are not separated by spaces as in the alphabetic languages.

Topological analysis of the network models of four levels (as language sub-systems) along the meaning-form dimension of modern Chinese has yielded a rather complete picture of the language as a multi-level system. More importantly, topological comparison of the network models has revealed the meaning-form relations in the language. The results of topological analysis help connect network science, linguistics, and cognitive science. The findings of this study highlight the harmony between the Chinese language and human cognition. This accounts for the complex organization of modern Chinese as a multi-level system. The conclusions about semantic structures, syntactic structures, and the linear realization of syntactic structures are expected to be generalizable to other languages, although these need to be confirmed by further studies.

From a quantitative and holistic approach, this line of research has provided empirical insights into our understanding of human language as a multi-level system. In addition to characterizing what kind of system language is, it is also necessary to examine the applicability of linguistic networks and their quantitative measures to particular branches in modern linguistics. Such is the case of Line 2 to be reviewed next.

3. Linguistic typological research from the complex network approach

The parameters of linguistic networks can potentially reflect the typological properties of languages [71,72,74,79]. Typological (or genetic) classification of languages [145] is one of the major interests in modern linguistics. Linguistic typology [146–148] classifies languages according to various linguistic features. Holistic typology [149] characterizes human language in its entirety. Most typological studies, however, address partial typology [149], which relies on particular microscopic linguistic features such as word order. Even though holistic typology could be of more interest, the microscopic linguistic features can be dealt with more easily than the macroscopic ones. Another pitfall of most typological studies is that they usually do not use naturally occurring texts. Complex networks, by focusing on the global properties of real-world systems, can be potentially translated into a methodology for holistic typology on the basis of large-scale authentic language data [92,100–102].

The work of Liu and Li [100] was the first to attempt language classification using complex networks. Syntactic dependency networks were converted from the syntactic dependency treebanks of 15 languages, namely, Arabic, Catalan, modern Greek, ancient Greek, English, Basque, Hungarian, Italian, Japanese, Portuguese, Romanian, Spanish, Turkish, Latin, and Chinese. For the 15 networks to be comparable, a body of language data of the same size (in the number of words) was selected from the corpus for each language. The typological characterization of a language is a matter of degree rather than all or nothing (e.g., [150,151]). Parameters of the 15 syntactic dependency networks constitute a potential data source for the characterization of the holistic typological features of the corresponding languages and their classification.
With Network Analyzer [114], seven parameters of the 15 networks were calculated, namely, k, C, L, NC, D, power-law exponent of P (k), and power-law determination coefficient of P (k) (a measure of how well the distribution is fitted to a power law, with a value ranging from 0 to 1). Cluster analysis was conducted of the 15 networks with their parameters as input. With Euclidean minimum distance as clustering method, the 15 networks were first clustered with five parameters (with k and D excluded) as input. The result of clustering yielded close similarities between the Romance languages (90.17 between Portuguese and Spanish, 92.39 between Catalan and Italian, and 88.45 between Romanian, Catalan, Italian, and Latin). Hungarian, Japanese and Turkish are separated from the other languages, implying their different typological characteristics. However, this result still has some pitfalls. For instance, Portuguese and Spanish are not clustered with the other Romance languages. In order to improve the result of clustering, all of the seven network parameters were used as input for cluster analysis of the 15 networks. A better result of clustering was obtained with all seven parameters as input. For instance, the Romance languages are clustered together with a high similarity of 83.17. Japanese, Turkish, Basque, and Hungarian as non-Indo-European languages are separated from most of the Indo-European languages. This result is comparable with that obtained based on word-order parameters in Liu [152]. For instance, they both capture the resemblance of the Romance languages and that of Chinese and English. In sum, this study has preliminarily confirmed the feasibility of applying complex networks to typological language classification. The syntactic networks used in Liu and Li [100], with word forms as vertices, contain morphological as well as syntactic information of the corresponding languages. 
The result of language clustering in [100] largely corresponds to the variation of the languages in terms of morphological features, implying that morphological variation may emerge as a determinant in the system-level similarities and differences of the languages. However, syntactic dependency networks can also take lemmas as vertices (e.g., [79]) and thus only contain the syntactic information. A question thus arises as to whether such syntactic networks with no morphological information can be used for language classification. The work of Liu and Xu [101] examined the effect of language classification with two types of syntactic dependency networks, namely, word-form networks and lemma networks. The 15 languages involved in language clustering are Catalan, Czech, modern Greek, ancient Greek, Basque, Hungarian, Italian, Portuguese, Spanish, Turkish, Latin, Dutch, French, Slovenian, and Russian. On the basis of the dependency treebank of each language, a word-form network and a lemma network were constructed. The same set of parameters as in [100] was calculated for the 15 word-form networks and their lemma network counterparts. With these parameters as input, cluster analysis was conducted of the 15 word-form networks. The five Romance languages (Catalan, Italian, French, Portuguese, and Spanish), together with Dutch fall into the same cluster (similarity = 79.65). The close similarity (81.74) between Czech, Russian, Latin, modern Greek, and ancient Greek corresponds to the rich inflectional morphology that these languages share in common. This result is similar to that obtained with word-order parameters in Liu [152]. With the same set of parameters as input, cluster analysis yielded better result of language clustering than their lemma

network counterparts. This is attributed to the morphological information contained in the former. As the word-form networks contain both morphological and syntactic information of the corresponding languages whereas their lemma network counterparts contain only the syntactic information, it is intuitively reasonable to check the effect of language clustering using the difference values between the parameters of the two types of syntactic networks. An even better clustering result was obtained with these difference values. For instance, the five Romance languages now form one cluster exclusive of any non-Romance language. By examining the effect of language classification with the parameters of word-form and lemma networks and the differences of their parameters, this study has not only contributed new methods for network-based language classification but also confirmed the important role of the morphological variation of languages in the similarities and differences of their syntactic dependency networks.

The statistics of the word-form and lemma networks in Liu and Xu [101] can be adopted as an empirical response to particular hypotheses made by Čech and Mačutek [79] concerning the relationships between the typological (morphological) characteristics of a language and the discrepancy of its word-form network and lemma network in particular parameters. For instance, the word-form network of a language with low inflection (e.g., English) is expected to have k equal to or higher than that of its lemma network counterpart. For a highly inflectional language, the discrepancy of its word-form network and lemma network in k is expected to be unpredictable. However, according to the results of Liu and Xu [101], all of the 15 lemma networks exhibit higher k than their word-form network counterparts, although the corresponding languages vary in inflectional richness.
Although the statistics of word-form and lemma networks of the 15 languages do not completely support the hypotheses of Čech and Mačutek [79], the clustering results of Liu and Xu [101] have shown that the differences between word-form and lemma networks in major network parameters do reflect the typological characteristics of the corresponding languages.

Abramov and Mehler [92] examined the possibility of language classification under the framework of quantitative network analysis [153], the basic assumption of which is that complex networks can be classified on the basis of their topological properties. This assumption is generally consistent with that in [100] and [101]. Of the 11 languages involved in the clustering experiment, Dutch, Danish, and Swedish belong to the Germanic branch; Romanian, Italian, Catalan, and Spanish to the Romance branch; and Russian, Slovene, Czech, and Bulgarian to the Slavic branch. Twenty-one parameters of each network were calculated and clustering experiments were carried out with different combinations of these parameters as input. The results of classification were tested against the genealogical grouping of these languages according to language families. The results of the clustering experiments show that it is possible to cluster the languages correctly into their respective branches. This study is a more systematic attempt to examine the possibility of applying complex networks to typological language classification. Firstly, it calculated more parameters of the networks and tested different parameter combinations in the clustering experiment. Secondly, the results of clustering with different parameter combinations can be easily compared with the application of F-measures [154] to the evaluation of classification. Finally, the roles of different network parameters in language clustering were compared.
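The general procedure shared by the studies reviewed above, namely clustering languages with vectors of network parameters as input, can be sketched as follows. Everything specific in this sketch is an assumption made for illustration: the language labels and parameter values are made-up placeholders (not results from [92], [100], or [101]), the parameters are standardized column by column before distances are computed, and single linkage with Euclidean distance is used as one possible reading of clustering by minimum distance.

```python
import math


def zscore_columns(rows):
    """Standardize each parameter (column) so that no single scale dominates."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def single_linkage(names, vectors):
    """Agglomerative clustering; returns merge steps as (cluster_a, cluster_b, distance)."""
    clusters = {(name,): [vec] for name, vec in zip(names, vectors)}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                # Single linkage: distance between the two closest members.
                d = min(
                    math.dist(p, q)
                    for p in clusters[keys[i]]
                    for q in clusters[keys[j]]
                )
                if best is None or d < best[0]:
                    best = (d, keys[i], keys[j])
        d, a, b = best
        clusters[a + b] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
    return merges


# Placeholder input: one row per language with made-up values for
# (average degree, clustering coefficient, average path length).
names = ["lang_A", "lang_B", "lang_C", "lang_D"]
parameters = [
    [8.0, 0.20, 3.0],
    [8.2, 0.21, 3.1],
    [4.0, 0.05, 4.0],
    [4.1, 0.06, 3.9],
]
merges = single_linkage(names, zscore_columns(parameters))
```

The sequence of merges corresponds to a dendrogram read from the leaves upwards: pairs merged early at small distances are the most similar networks, and the quality of the result can then be judged against a known genealogical grouping.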
It is noteworthy that the above-mentioned studies [92,100,101] are generally satisfied with rather coarse-grained language clustering, that is, the classification of languages into their respective branches (e.g., Romance, Germanic, and Slavic) without considering the subdivision of these language branches. If the complex-network approach is to become a well-established methodology for language classification, it is necessary to find out whether it can yield more fine-grained language classification. Meanwhile, these studies have two major methodological limitations. On the one hand, the corpus data for the dependency treebanks employed in these studies are usually inconsistent in semantic content and genre, which may have an impact on the results of language clustering. A more desirable type of language data is parallel texts (e.g., a novel and its translations in different languages). On the other hand, the resource of syntactic dependency networks is limited due to the considerable manpower and material resources it takes to build dependency treebanks. Word co-occurrence networks, which can be constructed automatically, are a potential alternative to syntactic dependency networks.

The study of Liu and Cong [102] investigated the feasibility of fine-grained language classification with word co-occurrence networks based on parallel texts. Based on parallel texts of the Russian novel Kak Zakaljalas’ Stal’ (How the Steel was Tempered) [155], 14 word co-occurrence networks (with co-occurrence defined as adjacency of two words in at least one sentence) were constructed. These networks are of 12 Slavic languages and two non-Slavic languages, namely, Chinese and English. The 12 Slavic languages are subdivided into the Eastern (Russian, Belarusian, and Ukrainian), Western (Czech, Slovak, Polish, and Upper-Sorbian), and Southern (Serbian, Croatian, Slovenian, Bulgarian, and Macedonian) sub-branches. Ten parameters of the networks were calculated.
The combination of k, L, C and NC was selected as a base set. Altogether 64 parameter combinations (obtained by adding other parameters

Fig. 3. Clustering of the 14 languages with eight network parameters as input.

to the base set) were tested as input of cluster analysis with the Ward method and Manhattan distance. Fifteen parameter combinations yielded a result which could distinguish the Slavic languages from the non-Slavic and correctly cluster the 12 Slavic languages into their respective sub-branches. One of the results is shown in Fig. 3. As can be seen from the figure, the result captures not only the grouping of the Slavic languages but also the similarities of some of these languages in their sub-branches. For instance, although Serbian and Croatian use different writing systems, it is commonly accepted that they are the same language [156], which is reflected by a short distance of 1.70 between them. The close similarity between Bulgarian and Macedonian is also captured (distance = 3.57) by the result. This result is also generally comparable with those achieved by other methods including lexicostatistics [157]. This also inspires our consideration of the relationship between language and writing system. For instance, some of the Slavic languages adopt the Cyrillic alphabet and the others the Latin alphabet. However, the difference in writing system does not contribute to their clustering. The writing systems of Chinese and English are totally different, but their difference turns out to be smaller than intuitively expected as indicated by the result of this study and those of [100,152]. It is also noteworthy that the effect of classification for Slavic languages in this study is significantly better than that based on such word-order parameters as dependency direction [152]. This is because the method adopted in this study relies on global characterization of language as a system, instead of a set of microscopic linguistic features, which can hardly capture the wholeness of the language system.
This also indicates that word order may not be the most appropriate basis for the classification of languages with richer inflectional morphology as in the case of Slavic languages [158]. In sum, word co-occurrence networks based on parallel texts are applicable to fine-grained language classification and they constitute a more convenient substitute for syntactic dependency networks in complex-network-based language classification.

Studies of language classification with complex networks have extended the application of complex networks to more specific fields in social sciences and humanities. It has been shown that topological analysis of linguistic networks can constitute a potential methodology for language classification from the approach of holistic typology. By focusing on system-level complexity of human language, it is an important complement to the methods of partial typology. In addition, with the application of quantitative measures, this potential methodology captures the continuous rather than discrete similarities and differences of languages.

The parameters of dynamic linguistic networks (e.g., [69,71–73,77]) are also potential indicators of the variation of texts with respect to different stylistic variables. With the application of the models and measures of complex networks, linguistic typological studies and stylistic studies can be conducted under a unifying framework. Linguistic typology can be possibly explored through comparison of linguistic networks based on texts of different languages. Likewise, a “text typology” can be potentially revealed through comparison of linguistic networks based on texts of the same language. A number of studies (e.g., [159–162]) are devoted to the stylistic classification of texts of the same language with complex networks. In the study of Antiqueira et al.
[159], the Portuguese texts on the same topic produced by high-school students of approximately the same age and academic background were modeled as word co-occurrence networks. It is found that the quality of these texts (measured by three types of scores assigned to them by human judges) is strongly correlated with particular parameters of their network models. The results of [160] and [161] show that translated texts by human translators and machine translation tools, when modeled as word co-occurrence networks, can be distinguished by some of their network parameters. Amancio et al. [162] represented books published from 1590 to 1922 as word co-occurrence networks. The parameters of these networks, when analyzed with multivariate techniques, generated clusters which correspond to relevant literary movements over the last five centuries.

Note that research on human language from the complex network approach depends on the system-level complexity of human language. As the complexity of the human language system is emergent at the macroscopic scale due to the interactions between linguistic units at the microscopic scale, a natural question which arises is: how do the microscopic linguistic features (the traditional concern of modern linguistics) affect the system-level complexity of human language? Line 3 to be reviewed next is devoted to the macro–micro relationships in human language.

4. Macro vs. micro structures of human language

The development of syntax is found to affect the complex organization of the syntactic sub-system [103,104]. On the basis of the language data in the Manchester corpus from CHILDES [163], Ke and Yao [103] constructed directed word co-occurrence networks (as approximate models for the syntactic sub-system) of children and their caretakers at different stages of language acquisition. Hubs-authorities analysis [164] was conducted to determine the globally important vertices in the networks, namely, hubs and authorities (similar to vertices with high out- and in-degrees, respectively, but determined with a more complicated procedure, see [164]). The hubs and authorities in the adult networks in different stages of language acquisition of their children are rather consistent, with the most important hubs and authorities being closed-class words such as pronouns and prepositions. By comparison, the hubs and authorities in child networks exhibit significant change during the course of language acquisition.
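To make the hubs-authorities procedure mentioned above more concrete, the following is a minimal sketch of the standard iterative scheme on a directed word co-occurrence network. It is an illustration only: the toy sentences, the function names `hits` and `cooccurrence_edges`, the unweighted adjacent-word edge definition, and the fixed number of iterations are all assumptions made here, not the settings used in [103].

```python
from collections import defaultdict


def cooccurrence_edges(sentences):
    """Directed edges from each word to the word that immediately follows it.

    Assumption: adjacency within a sentence defines co-occurrence; repeated
    pairs are collapsed later, so the network is unweighted.
    """
    edges = []
    for sentence in sentences:
        words = sentence.lower().split()
        edges.extend(zip(words, words[1:]))
    return edges


def hits(edges, iterations=50):
    """Return (hub, authority) score dictionaries for a directed edge list."""
    nodes = {v for edge in edges for v in edge}
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for src, dst in edges:
        successors[src].add(dst)
        predecessors[dst].add(src)

    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        auth = {v: sum(hub[u] for u in predecessors[v]) for v in nodes}
        norm = sum(s * s for s in auth.values()) ** 0.5 or 1.0
        auth = {v: s / norm for v, s in auth.items()}
        # A good hub points to good authorities.
        hub = {v: sum(auth[w] for w in successors[v]) for v in nodes}
        norm = sum(s * s for s in hub.values()) ** 0.5 or 1.0
        hub = {v: s / norm for v, s in hub.items()}
    return hub, auth


# Hypothetical toy data, chosen only to exercise the code.
sentences = [
    "the dog sees the cat",
    "a dog sees a bird",
    "the bird sees a cat",
    "a cat sees the dog",
]
hub, auth = hits(cooccurrence_edges(sentences))
top_hub = max(hub, key=hub.get)
top_auth = max(auth, key=auth.get)
print("top hub:", top_hub, "| top authority:", top_auth)
```

In this toy network the articles point to many nouns, so they come out as hubs and the nouns as authorities. Comparing such rankings across networks built from data at different acquisition stages is the kind of analysis the study describes, although real corpora would of course be far larger.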
In earlier stages, the most important hubs and authorities in the child networks contain a number of content words. These gradually give way to function words in later stages. A more interesting finding concerns the change in hub/authority status of articles a and the in the child networks. The two articles appear constantly as authorities in the adult networks. In the child networks, however, the two articles behave as hubs in earlier stages of acquisition but as authorities in later stages. The results of this study suggest that the development of syntax can lead to change in the global importance of particular words in the sub-system of syntactic structures. Corominas-Murtra et al. [104] investigated the topological change of syntactic dependency networks of children during the course of language acquisition. An abrupt reorganization of network topology occurred around two years of age. Non-trivial statistical patterns such as small-world and scale-free properties appeared around this time and were maintained ever since. This time approximately corresponds to that of the syntactic spurt, which marks the transition of children’s language from the non-grammatical stage (with function words and inflectional morphology absent) to grammatical stage (with function words and inflectional morphology present). The findings of Corominas-Murtra et al. [104] provide immediate evidence for the fundamental importance of function words (as network hubs) to the complex organization of the sub-system of syntactic structures. The role of function words as network hubs in the complex organization of the syntactic sub-system can also be determined by examining the topological change of the corresponding syntactic network with the removal of these hubs. 
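The hub-removal idea just described can be sketched as follows: compute a few global measures of an undirected network, delete one vertex, and compute them again. The toy edge list, the helper names, and the choice of measures (average shortest-path length over connected pairs, density, and the list of isolated vertices) are assumptions for illustration, not the exact procedure of any study reviewed here.

```python
from collections import deque


def build_graph(edges):
    """Undirected adjacency sets from an edge list."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def remove_vertex(graph, target):
    """Copy of the graph without `target` and without its incident edges."""
    return {u: nbrs - {target} for u, nbrs in graph.items() if u != target}


def density(graph):
    """Ratio of existing edges to all possible edges."""
    n = len(graph)
    m = sum(len(nbrs) for nbrs in graph.values()) / 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def isolated(graph):
    """Vertices that are not joined to any other vertex."""
    return sorted(u for u, nbrs in graph.items() if not nbrs)


def average_path_length(graph):
    """Mean shortest-path length over all connected vertex pairs (BFS)."""
    total = pairs = 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in graph[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical toy network in which "the" acts as the hub.
edges = [
    ("the", "dog"), ("the", "cat"), ("the", "park"), ("the", "ball"),
    ("dog", "chased"), ("chased", "cat"), ("dog", "ran"),
]
before = build_graph(edges)
after = remove_vertex(before, "the")
for label, g in (("before", before), ("after", after)):
    print(label, round(average_path_length(g), 3), round(density(g), 3), isolated(g))
```

Restricting the path-length average to connected pairs is one of several possible conventions once a removal fragments the network; a different convention would change the reported values but not the overall procedure.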
The work of Chen and Liu [105] focuses on the behavior of the three most frequently used function words, namely, 的 (de, a relative marker), 了 (le, a perfective marker) and 在 (zài, a preposition with a meaning close to at), in two Chinese syntactic dependency networks. All three are central vertices of the syntactic dependency networks, as indicated by a series of centrality measures. The removal of each of the three vertices led to significant change in network topology. For instance, the removal of one of them caused a slight decrease in L and ρ and an abrupt increase in the number of isolated vertices (i.e., vertices which are not joined to any other vertex in the network), whereas the removal of the other two caused a significant increase in L and D and an abrupt drop in ρ. The results of this study, therefore, constitute another source of immediate evidence for the vital role of function words as network hubs in the complex organization of the sub-system of syntactic structures. Moreover, the results also suggest that different function words may affect the complexity of the syntactic sub-system in different ways.

Liu and Hu [106] attempted to determine the role of syntax in the complexity of the syntactic sub-system by comparing one syntactic dependency network and two non-syntactic networks (with the pairwise relations between words in the same sentence determined at random with different algorithms) based on the same body of language data. It is found that the syntactic network has higher L but lower k and C than the two non-syntactic networks. However, their difference in these parameters is not sufficient to consider them as networks of different types, for all of them are small-world and scale-free (with slightly different power-law exponents of P(k)). The results indicate that although syntax affects the parameters of the syntactic sub-system as a syntactic network, the non-trivial statistical patterns such as small-world and scale-free properties can hardly explain the role of syntax in the complexity of the syntactic sub-system. It is claimed by some that the scale-free architecture is of essential importance to the syntax of human language [165,166]. However, as shown by the results of Liu and Hu [106], both syntactic and non-syntactic networks are scale-free, implying that the scale-free property might be only a necessary condition for a language sub-system to be a syntactic one.

For the analysis of the same type of syntactic structure, dependency grammar usually provides one definite solution. However, there are still particular controversial issues which are dealt with differently by different schemes for dependency analysis. Such being the case, the topology of a syntactic dependency network might be affected by the scheme for dependency analysis adopted. Liu et al. [107] dealt with how the schemes of dependency analysis affect the topological properties of the syntactic dependency networks by focusing on coordinating structures (with such forms as X, Y and Z), which constitute a great challenge to dependency analysis. The language data consist of 1000 sentences with coordinating structures selected from People’s Daily, China’s leading newspaper. The language data were analyzed with three different schemes for dependency analysis, namely, conjunction as head, the first conjunct as head, and all conjuncts as heads. The three syntactic dependency networks are all small-world and scale-free. The results of statistical analysis show that they exhibit no significant difference in L, k, and C. Moreover, no significant difference is found in their distributions of L, k_nn(k), and c(k) (see [115] for introduction of the latter two distributions).
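As a rough illustration of how the three schemes named above could yield different dependency edges for the same coordinating structure, the sketch below encodes one possible reading of each scheme for a structure of the form X, Y and Z. The exact attachment rules are assumptions introduced here (for example, where the conjunction attaches when it is not the head), and the function and scheme names are hypothetical; the annotation guidelines actually used in [107] may differ.

```python
from collections import Counter


def coordination_edges(governor, conjuncts, conjunction, scheme):
    """Return (head, dependent) edges for one coordinating structure.

    Assumed attachment rules (hypothetical, for illustration only):
    - conjunction_as_head: governor -> conjunction -> every conjunct
    - first_conjunct_as_head: governor -> first conjunct, which governs the
      remaining conjuncts and the conjunction
    - all_conjuncts_as_heads: governor -> every conjunct; the conjunction
      attaches to the last conjunct
    """
    first, last = conjuncts[0], conjuncts[-1]
    if scheme == "conjunction_as_head":
        return [(governor, conjunction)] + [(conjunction, c) for c in conjuncts]
    if scheme == "first_conjunct_as_head":
        return (
            [(governor, first)]
            + [(first, c) for c in conjuncts[1:]]
            + [(first, conjunction)]
        )
    if scheme == "all_conjuncts_as_heads":
        return [(governor, c) for c in conjuncts] + [(last, conjunction)]
    raise ValueError(f"unknown scheme: {scheme}")


def out_degrees(edges):
    """Number of dependents per head; vertices without dependents count as 0."""
    return Counter(head for head, _ in edges)


# Hypothetical example: "bought apples, pears and plums".
schemes = ("conjunction_as_head", "first_conjunct_as_head", "all_conjuncts_as_heads")
results = {
    s: coordination_edges("bought", ["apples", "pears", "plums"], "and", s)
    for s in schemes
}
for name, edges in results.items():
    print(name, len(edges), dict(out_degrees(edges)))
```

Under these assumed rules every scheme produces the same number of edges for the structure, while the out-degree of individual words, most visibly the conjunction, changes from scheme to scheme.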
Although the different schemes for dependency analysis did not cause significant difference of the three networks in topological properties, they are found to have an impact on the centrality measures of particular words such as conjunctions. As suggested by the findings of this study, the currently available complex network parameters may be insufficient to capture the topological difference of syntactic dependency networks due to the difference in the structural features at the microscopic scale.

Čech et al. [108] approached the role of syntax in the complexity of the syntactic sub-system by focusing on the relationship between the local and global importance of verbs. The local importance of a verb refers to its ability to structure the sentence. The global importance of a verb, on the other hand, is its importance as a vertex in a syntactic dependency network, which was measured by its out-degree (i.e., the number of its possible dependents) in this study. From the decisive role played by verbs in the sentences, Čech et al. [108] deduced that the local importance of verbs in the sentences should affect their global importance in the syntactic dependency network. A hypothesis was formulated that verbs should occur among the most important vertices in the syntactic dependency network. Syntactic dependency networks (with lemmas as vertices) of six different languages (Catalan, Czech, Dutch, Hungarian, Italian, and Portuguese) were constructed and analyzed to test the hypothesis. It is found that for each network, the proportions of verbs in histogram bins of the ranked distribution of out-degree tend to decrease, in other words, a vertex with a higher value of out-degree in a syntactic dependency network is more likely to be a verb. The proportions of verbs in histogram bins of the rank-frequency distribution of lemmas, on the other hand, tend to be more or less constant.
In addition, they are generally lower than the proportions of verbs in histogram bins of the ranked distribution of out-degrees. This suggests that the local importance of verbs does play a role in their global importance in the syntactic dependency network. The results of this study, therefore, show that syntax does play a role in the complexity of the syntactic sub-system, at least in the case of the global importance of verbs.

Compared with topological analysis of various language sub-systems as linguistic networks, the relationships between the system-level complexity of human language and microscopic linguistic features have only been marginally studied. Despite the limited findings, the research of these macro–micro relationships in human language is a worthy effort in that it helps to better understand both the macro and micro structures of human language. On the one hand, this line of research has shed preliminary light on how particular microscopic linguistic features affect the emergence of system-level properties of human language. On the other hand, this line of research suggests the possibility of a system perspective on particular linguistic (e.g., syntactic) features, which are usually investigated at the microscopic scale in traditional linguistic research.

5. Conclusion and suggestions for future research

The models and quantitative tools of complex networks have provided an operational methodology for the system-level complexity of human language. When different sub-systems of human language are modeled as networks with linguistic units as vertices and their relations as edges, they can be analyzed like other real-world networks. More importantly, linguistic networks, when exploited properly, can constitute an operational methodology for linguistic inquiry, which contributes to the understanding of human language and the development of linguistics.

In this review, we have surveyed three lines of linguistic research from the complex network approach. Line 1 has obtained a rather complete picture of human language as a multi-level system from the complex network approach and interpreted the complex organization of different language levels (as language sub-systems) and their relationships from the perspectives of network science, linguistics, and cognitive science. Line 2 concerns the application of linguistic networks and their quantitative measures in linguistic typological research and has shown that quantitative analysis of dynamic linguistic networks can be translated into a potential methodology for linguistic typology. Line 3 attempts to determine the relationships between the system-level complexity of human language (determined by the topology of linguistic networks) and microscopic linguistic (e.g., syntactic) features and has obtained a preliminary knowledge of the macro–micro relationships in human language.

Linguistic research from the complex network approach is a relatively young domain of scientific endeavor. Although research interest in this area is on the rise and abundant findings have already been made, researchers need to have a clear knowledge of the limitations of the existing studies in order to determine directions of further research. As we have argued in Section 1, the models and measures of linguistic networks are tools for understanding human language. Therefore, the need to deepen our understanding of human language through investigations of linguistic networks should be the basic consideration underlying future studies. We suggest three directions of future investigation, which we believe can fill this need.
1) Relationships between the system-level complexity of human language and microscopic linguistic features. Research of the relationships between the system-level complexity of human language (determined by the topology of linguistic networks) and microscopic linguistic features helps to deepen our insights into both the macro and micro structures of human language. However, the existing studies [103–108] concerning this issue have only provided a limited picture of the relationships between the two scales of human language. For instance, the capacity of particular linguistic units (such as function words and verbs) to form syntactic structures is found to play an important role in the complex organization of particular language sub-systems as linguistic networks [104,105] or the global importance of these linguistic units in the sub-systems [108]. However, the structural behavior of these linguistic units only accounts for a limited part of the intricacy and delicateness of linguistic features at the microscopic scale. Therefore, more systematic research is needed in order to show how different linguistic features may affect the system-level complexity of human language. Linguistic features (such as syntactic structures) are of diverse types and they generally co-exist in the language data on the basis of which the linguistic networks are constructed. Such being the case, determining the roles of different linguistic features in the complex organization of various language sub-systems might be a big challenge for future studies. One possible solution, as can be found in Liu et al. [107], is to select sentences which share one particular linguistic feature in common to form the data source and examine the role of this feature in the topology of the linguistic network obtained. Meanwhile, the macro–micro relationships already found in previous studies may need more in-depth analysis. 
For instance, although Ke and Yao [103] have shown that the development of syntax affects the global importance of particular words in the syntactic sub-system, further research is needed to determine what types of syntactic structure or patterns of language use lead to these changes. Finally, it is noteworthy that the currently available network measures may not be sharp enough for the purpose of determining the macro–micro relationships in human language. For instance, no significant difference in network topology was found either between syntactic networks and their non-syntactic counterparts in Liu and Hu [106] or between syntactic networks based on language data with different annotation schemes in Liu et al. [107], although slight differences were found in particular parameters. Intuitively speaking, any change in the microscopic structural features of the language data will make the linguistic networks obtained different. However, it may be that the quantitative measures are not sufficiently sensitive to these changes in the linguistic networks. Therefore, much sharper measures for complex network analysis are needed for future studies of the macro–micro relationships in human language. Meanwhile, these sharper network measures may also help to quantify particular subtle aspects of language use. For instance, some languages seem to be more emotional (e.g., [167]) and some seem to be more redundant. These subtle differences in language use can hopefully be captured at the system level with the sharper network measures.

2) Expansion of research scope from the global properties to other levels of granularity of linguistic networks. According to Brandes and Erlebach [168], a network can be analyzed at three major levels of granularity, namely, network-level, group-level, and element-level. The studies of human language from the complex network approach generally fall under the heading of network-level analysis, in that they focus on the global properties of a linguistic network. By comparison, the group-level and element-level of linguistic networks have rarely been studied. In addition to network-level patterns such as small-world and scale-free properties, real-world networks may also exhibit regularities at the other two levels. For instance, many real-world networks are found to have such group-level properties as community structure [169,170], i.e., the presence of cohesively connected sub-networks. While it might be challenging to determine the relationships between the system-level complexity of human language (as network-level phenomena) and microscopic linguistic features (as the major concern of modern linguistics), an easier link might be drawn between group- or element-level analysis of language sub-systems as linguistic networks and the microscopic scale of human language. For instance, Ferrer i Cancho et al. [76] applied spectral methods to the community detection in a syntactic dependency network. It is found that the spectral methods clustered words of the same class. Siew [171] examined the communities detected with the Louvain optimization method from the phonological network described in Vitevitch [56]. A correlation is found between the size of community and the length, frequency, and age of acquisition rating of the words in it. In future research, more studies at the group-level and element-level of linguistic networks are needed not only as a complement to those at the network-level but also as a potentially new approach to investigating the linguistic features as traditionally researched in linguistics.

3) Combination of linguistic network analysis with other quantitative studies of language. Research of human language from the complex network approach depends on quantitative models and methods from network science.
With the application of these quantitative tools, linguistic network analysis has the same advantage as other quantitative studies of human language, that is, the exactness and precision of the observation of various phenomena of language. The complex network approach to human language should not develop only in parallel with other quantitative research of human language. It also needs to be combined with the latter. Ferrer i Cancho [172] claims that research of linguistic networks is a young field within quantitative linguistics. Quantitative linguistics [173–175] concerns itself with various quantitative properties of human language and attempts to find universal laws of language which are formulated mathematically. Linguistic network analysis and quantitative linguistics are similar in more than one way. For instance, both fields emphasize the application of scientific methods and both have uncovered a series of statistical laws of human language. However, more effort is needed for better communication and cooperation between these two fields. Firstly, linguistic network analysis can take into consideration particular quantitative properties which are dealt with in quantitative linguistics. In a linguistic network model, all the linguistic units are treated as identical in that they are all modeled as vertices. In quantitative linguistics, however, linguistic units (such as words) can be measured in terms of such indices as frequency, length, polysemy, and polytextuality. The analysis of relevant linguistic networks can incorporate these quantitative measures of linguistic units as additional information in order to find out how they correlate with the quantitative measures of the networks. Secondly, attempts can be made to correlate the statistical laws uncovered by linguistic network analysis with those in quantitative linguistics. 
For instance, some researchers (e.g., [65,77]) have already discussed the relationship between the scale-free distribution of linguistic networks and Zipf’s law [126], which in turn is perhaps the most important law in quantitative linguistics. More attempts need to be made to determine whether there are correlations between the statistical universals of linguistic networks and those in quantitative linguistics, and if possible, whether these correlations can be formulated in mathematical terms. Finally, the statistical laws discovered with linguistic network analysis need to be explained as done in quantitative linguistics. In quantitative linguistics, universal laws of language are building blocks of linguistic theories for the explanation of various language phenomena. However, quantitative linguistics, like all other sciences, seeks a hierarchy of explanations and the universal laws of language themselves also need explanation. Likewise, the non-trivial statistical patterns of linguistic networks as potentially new types of universal laws of human language are in need of much deeper explanation so that they can contribute to the construction of linguistic theories and can be better appreciated by researchers of linguistics and its neighboring disciplines.

Acknowledgement

This work was supported by the National Social Science Foundation of China (11&ZD188).


References

[1] Barabási A-L. Linked: the new science of networks. Cambridge, MA: Perseus Publishing; 2002. [2] Watts DJ. Six degrees: the science of a connected age. New York: WW Norton & Company; 2003. [3] Dorogovtsev SN, Goltsev AV, Mendes JFF. Critical phenomena in complex networks. Rev Mod Phys 2008;80(4):1275–335. [4] Watts DJ, Strogatz SH. Collective dynamics of “small-world” networks. Nature 1998;393(6684):440–2. [5] Sinha S. From network structure to dynamics and back again: relating dynamical stability and connection topology in biological complex systems. In: Ganguly N, Deutsch A, Mukherjee A, editors. Dynamics on and of complex networks. Boston: Birkhäuser; 2009. p. 3–17. [6] Barrat A, Barthélemy M, Vespignani A. Dynamical processes on complex networks. New York: Cambridge University Press; 2008. [7] Dorogovtsev SN, Mendes JFF. Evolution of networks: from biological nets to the Internet and WWW. New York: Oxford University Press; 2003. [8] Newman MEJ, Barabási A-L, Watts DJ, editors. The structure and dynamics of networks. Princeton, NJ: Princeton University Press; 2006. [9] Pastor-Satorras R, Vespignani A. Evolution and structure of the Internet: a statistical physics approach. New York: Cambridge University Press; 2007. [10] Caldarelli G, Vespignani A, editors. Large scale structure and dynamics of complex networks: from information technology to finance and natural science. Singapore: World Scientific; 2007. [11] Ganguly N, Deutsch A, Mukherjee A, editors. Dynamics on and of complex networks: applications to biology, computer science, and the social sciences. Boston: Birkhäuser; 2009. [12] Newman MEJ. Networks: an introduction. Oxford, UK: Oxford University Press; 2009. [13] Estrada E. The structure of complex networks: theory and applications. New York: Oxford University Press; 2011. [14] Strogatz SH. Exploring complex networks. Nature 2001;410(6825):268–76. [15] Albert R, Barabási A-L. Statistical mechanics of complex networks. Rev Mod Phys 2002;74(1):47–97. [16] Boccaletti S, Latora V, Moreno Y, Chavez M, Hwang D. Complex networks: structure and dynamics. Phys Rep 2006;424:175–308. [17] Costa LDF, Oliveira Jr ON, Travieso G, Rodrigues FA, Villas Boas PR, Antiqueira L, et al. Analyzing and modeling real-world phenomena with complex networks: a survey of applications. Adv Phys 2011;60(3):329–412. [18] Wilson EO. Consilience: the unity of knowledge. New York: Vintage Books; 1999. [19] Pinker S. The language instinct. New York: Harper Collins; 1994. [20] Bickerton D. Adam’s tongue: how humans made language, how language made humans. New York: Hill and Wang; 2009. [21] Ferrer i Cancho R. The structure of syntactic dependency networks: insights from recent advances in network theory. In: Altmann G, Levickij V, Perebyinis V, editors. The problems of quantitative linguistics. Chernivtsi: Ruta; 2005. p. 60–75. [22] Markošová M. Network model of human language. Physica A: Stat Mech Appl 2008;387:661–6. [23] Mehler A. Large text networks as an object of corpus linguistic studies. In: Lüdeling A, Kytö M, editors. Corpus linguistics: an international handbook, vol. 1. Berlin: Walter de Gruyter; 2008. p. 328–82. [24] Choudhury M, Mukherjee A. The structure and dynamics of linguistic networks. In: Ganguly N, Deutsch A, Mukherjee A, editors. Dynamics on and of complex networks: applications to biology, computer science, and the social sciences. Boston: Birkhäuser; 2009. p. 145–66. [25] Borge-Holthoefer J, Arenas A. Semantic networks: structure and dynamics. Entropy 2010;12:1264–302. [26] Solé RV, Corominas-Murtra B, Valverde S, Steels L. Language networks: their structure, function and evolution. Complexity 2010;15(6):20–6. [27] Baronchelli A, Ferrer i Cancho R, Pastor-Satorras R, Chater N, Christiansen MH. Networks in cognitive science. Trends Cogn Sci 2013;17(7):348–60. [28] Mihalcea R, Radev D. Graph-based natural language processing and information retrieval. New York: Cambridge University Press; 2011. [29] Biemann C. Structure discovery in natural language. Berlin, Heidelberg: Springer; 2012. [30] Kretzschmar Jr WA. The linguistics of speech. New York: Cambridge University Press; 2009. [31] Saussure FD. Course in general linguistics [Baskin W, Trans.]. New York: Philosophical Library; 1959. [32] Bunge M. Semiotic systems. In: Altmann G, Koch WA, editors. Systems: new paradigms for the human sciences. Berlin: Walter de Gruyter; 1998. p. 337–49. [33] Hjelmslev L. Prolegomena to a theory of language [Whitfield FJ, Trans.]. Madison: University of Wisconsin Press; 1961. [34] Halliday MAK, Matthiessen C. An introduction to functional grammar. 3rd edition. London: Hodder Arnold; 2004. [35] Lamb SM. Pathways of the brain: the neurocognitive basis of language. Amsterdam, Philadelphia: John Benjamins; 1999. [36] Hudson R. An introduction to word grammar. New York: Cambridge University Press; 2010. [37] Goldberg AE. Constructions at work: the nature of generalization in language. New York: Oxford University Press; 2006. [38] Langacker RW. Cognitive grammar: a basic introduction. New York: Oxford University Press; 2008. [39] Larsen-Freeman D, Cameron L. Complex systems and applied linguistics. Oxford: Oxford University Press; 2008. [40] Beckner C, Blythe R, Bybee J, Christiansen MH, Croft W, Ellis NC, et al. Language is a complex adaptive system: position paper. Lang Learn 2009;59(Suppl 1):1–26. [41] Butts CT. Revisiting the foundations of network analysis. Science 2009;325:414–6. [42] Liu H. Linguistic complex networks: a new approach to language exploration. Grundlagenstud Kybern Geisteswiss (grkg/Humankybernetik) 2011;52(4):151–70. [43] Sigman M, Cecchi GA. Global organization of the WordNet lexicon. Proc Natl Acad Sci USA 2002;99(3):1742–7. [44] Motter AE, de Moura APS, Lai YC, Dasgupta P. Phys Rev E, Stat Nonlinear Soft Matter Phys 2002;65(6 Pt 2):065102. [45] Holanda AJ, Torres Pisa I, Kinouchi O, Souto A, Seron Ruiz E. Thesaurus as a complex network. Physica A 2004;344:530–6.


[46] Steyvers M, Tenenbaum JB. The large-scale structure of semantic networks: statistical analyses and a model of semantic growth. Cogn Sci 2005;29:41–78. [47] Gil R, García R. Measuring the semantic web. In: Sanchez-Alonso S, editor. Advances in metadata research. Proceedings of MTSR 2005. Princeton, NJ: Rinton Press; 2006. p. 72–7. [48] Gravino P, Servedio VDP, Barrat A, Loreto V. Complex structures and semantics in free word association. Adv Complex Syst 2012;15:1250054. [49] Medeiros Soares M, Corso G, Lucena LS. The network of syllables in Portuguese. Physica A: Stat Mech Appl 2005;355(2–4):678–84. [50] Peng G, Minett JW, Wang WS-Y. The networks of syllables and characters in Chinese. J Quant Linguist 2008;15(3):243–55. [51] Li J, Zhou J, Luo X, Yang Z. Chinese lexical networks: the structure, function and formation. Physica A: Stat Mech Appl 2012;391(21):5254–63. [52] Li J, Zhou J. Chinese character structure analysis based on complex networks. Physica A: Stat Mech Appl 2007;380:629–38. [53] Li Y, Wei L, Niu Y, Yin J. Structural organization and scale-free properties in Chinese phrase networks. Chin Sci Bull 2005;50(13):1305–9. [54] Yamamoto K, Yamazaki Y. A network of two-Chinese-character compound words in the Japanese language. Physica A: Stat Mech Appl 2009;388:2555–60. [55] Li Y, Wei L, Li W, Niu Y, Luo S. Small-world patterns in Chinese phrase networks. Chin Sci Bull 2005;50(3):286–8. [56] Vitevitch MS. What can graph theory tell us about word learning and lexical retrieval? J Speech Lang Hear Res 2008;51(2):408–22. [57] Arbesman S, Strogatz SH, Vitevitch MS. The structure of phonological networks across multiple languages. Int J Bifurc Chaos 2010;20(3):679–85. [58] Tesnière L. Éléments de syntaxe structurale. Paris: Klincksieck; 1959. [59] Sgall P, Hajičová E, Panevová J. The meaning of the sentence in its semantic and pragmatic aspects. Dordrecht: Reidel Publishing Company; 1986. [60] Mel’čuk I. Dependency syntax: theory and practice.
Albany: State University of New York Press; 1988. [61] Ferreira F, Engelhardt PE. Syntax and production. In: Traxler MJ, Gernsbacher MA, editors. Handbook of psycholinguistics. 2nd edition. New York: Elsevier; 2006. p. 61–91. [62] Jackendoff R. Semantic structures. Cambridge, MA: MIT Press; 1990. [63] Jackendoff R. Foundations of language: brain, meaning, grammar, evolution. New York: Oxford University Press; 2002. [64] Moulton J, Robinson GM. The organization of language. New York: Cambridge University Press; 1981. [65] Ferrer i Cancho R, Solé RV. The small world of human language. Proc - Royal Soc, Biol Sci 2001;268:2261–5. [66] Masucci AP, Rodgers GJ. Network properties of written human language. Phys Rev E, Stat Nonlinear Soft Matter Phys 2006;74:026102. [67] Zhou S. An empirical study of Chinese language networks. Physica A: Stat Mech Appl 2008;387(12):3039–47. [68] Shi Y, Liang W, Liu J, Tse CK. Structural equivalence between co-occurrences of characters and words in the Chinese language. In: International symposium on nonlinear theory and its applications. 2008. p. 94–7. [69] Brede M, Newth D. Patterns in syntactic dependency networks from authored and randomised texts. Complex Int 2008;12:msid23. [70] Sheng L, Li C. English and Chinese languages as weighted complex networks. Physica A: Stat Mech Appl 2009;388(12):2561–70. [71] Liang W, Shi Y, Tse CK, Liu J, Wang Y, Cui X. Comparison of co-occurrence networks of the Chinese and English languages. Physica A: Stat Mech Appl 2009;388(23):4901–9. [72] Grabska-Gradzińska I, Kulig A, Kwapień J, Drożdż S. Complex network analysis of literary and scientific texts. Int J Mod Phys C 2012;23(07):1250051. [73] Liang W, Shi Y, Tse CK, Wang Y. Study on co-occurrence character networks from Chinese essays in different periods. Sci China Inf Sci 2012;55(11):2417–27. [74] Gao Y, Liang W, Shi Y, Huang Q. Comparison of directed and weighted co-occurrence networks of six languages. Physica A: Stat Mech Appl 2014;393:578–89.
[75] Ferrer i Cancho R, Solé RV, Köhler R. Patterns in syntactic dependency networks. Phys Rev E, Stat Nonlinear Soft Matter Phys 2004;69(5 Pt 1):051915.
[76] Ferrer i Cancho R, Capocci A, Caldarelli G. Spectral methods cluster words of the same class in a syntactic dependency network. Int J Bifurc Chaos 2007;17:2453–63.
[77] Liu H. The complexity of Chinese syntactic dependency networks. Physica A: Stat Mech Appl 2008;387(12):3048–58.
[78] Liu Z, Zheng Y, Sun M. Complex network properties of Chinese syntactic dependency network. Complex Syst Complex Sci 2008;5(2):37–45 [in Chinese].
[79] Čech R, Mačutek J. Word form and lemma syntactic dependency networks in Czech: a comparative study. Glottometrics 2009;19:85–98.
[80] Ágel V, Eichinger LM, Eroms H-W, Hellwig P, Heringer HJ, Lobin H, editors. Dependenz und Valenz: Ein internationales Handbuch der zeitgenössischen Forschung (Dependency and valency: an international handbook of contemporary research). Berlin: Walter de Gruyter; 2003.
[81] Liu H. Dependency grammar: from theory to practice. Beijing: Science Press; 2009 [in Chinese].
[82] Liu H. Statistical properties of Chinese semantic networks. Chin Sci Bull 2009;54:2781–5.
[83] Liu H, Cong J. Empirical characterization of modern Chinese as a multi-level system from the complex network approach. J Chin Linguist 2014;42:1–38.
[84] van Steen M. Graph theory and complex networks: an introduction. Maarten van Steen; 2010.
[85] Dehmer M, editor. Structural analysis of complex networks. Boston: Birkhäuser; 2011.
[86] Costa LDF, Rodrigues FA, Travieso G, Villas Boas PR. Characterization of complex networks: a survey of measurements. Adv Phys 2007;56(1):167–242.
[87] Liu H, Feng Z. Probabilistic valency pattern theory for natural language processing. Lang Sci 2007;6(3):32–41 [in Chinese].

[88] Chen X, Xu C, Li W. Extracting valency patterns of word classes from syntactic complex networks. In: Gerdes K, Hajičová E, Wanner L, editors. Proceedings of the international conference on dependency linguistics. Barcelona. 2011. p. 165–72.
[89] Barabási A-L, Albert R. Emergence of scaling in random networks. Science 1999;286:509–12.
[90] Caldarelli G. Scale-free networks: complex webs in nature and technology. New York: Oxford University Press; 2007.
[91] Nishikawa T, Motter AE, Lai Y-C, Hoppensteadt FC. Smallest small-world network. Phys Rev E, Stat Nonlinear Soft Matter Phys 2002;66(4 Pt 2):046139.
[92] Abramov O, Mehler A. Automatic language classification by means of syntactic dependency networks. J Quant Linguist 2011;18(4):291–336.
[93] Caldarelli G, Vespignani A. Preliminaries and basic definitions in network theory. In: Caldarelli G, Vespignani A, editors. Large scale structure and dynamics of complex networks: from information technology to finance and natural science. Singapore: World Scientific; 2007. p. 5–16.
[94] Watts DJ. Small worlds: the dynamics of networks between order and randomness. Princeton, NJ: Princeton University Press; 1999.
[95] Horvath S, Dong J. Geometric interpretation of gene coexpression network analysis. PLoS Comput Biol 2008;4(8):27.
[96] Freeman LC. A set of measures of centrality based on betweenness. Sociometry 1977;40:35–41.
[97] Freeman LC. Centrality in social networks: I. Conceptual clarification. Soc Netw 1979;1:215–39.
[98] Wasserman S, Faust K. Social network analysis: methods and applications. New York: Cambridge University Press; 1994.
[99] Yu S, Liu H, Xu C. Statistical properties of Chinese phonemic networks. Physica A: Stat Mech Appl 2011;390:1370–80.
[100] Liu H, Li W. Language clusters based on linguistic complex networks. Chin Sci Bull 2010;55(30):3458–65.
[101] Liu H, Xu C. Can syntactic networks indicate morphological complexity of a language? Europhys Lett 2011;93(2):28005.
[102] Liu H, Cong J. Language clustering with word co-occurrence networks based on parallel texts. Chin Sci Bull 2013;58:1139–44.
[103] Ke J, Yao Y. Analysing language development from a network approach. J Quant Linguist 2008;15(1):70–99.
[104] Corominas-Murtra B, Valverde S, Solé R. The ontogeny of scale-free syntax networks: phase transitions in early language acquisition. Adv Complex Syst 2009;12(03):371–92.
[105] Chen X, Liu H. Central nodes of the Chinese syntactic networks. Chin Sci Bull 2011;56(10):735–40 [in Chinese].
[106] Liu H, Hu F. What role does syntax play in a language network? Europhys Lett 2008;83(1):18002.
[107] Liu H, Zhao Y, Huang W. How do local syntactic structures influence global properties in language networks? Glottometrics 2010;20:38–58.
[108] Čech R, Mačutek J, Žabokrtský Z. The role of syntax in complex networks: local and global importance of verbs in a syntactic dependency network. Physica A: Stat Mech Appl 2011;390(20):3614–23.
[109] Simon HA. The architecture of complexity. Proc Am Philos Soc 1962;106:467–82.
[110] Traxler MJ, Gernsbacher MA, editors. Handbook of psycholinguistics. 2nd edition. New York: Elsevier; 2006.
[111] Packard JL. The morphology of Chinese: a linguistic and cognitive approach. Cambridge, UK: Cambridge University Press; 2000.
[112] Sun C. Chinese: a linguistic introduction. New York: Cambridge University Press; 2006.
[113] Chen P. Modern Chinese: history and sociolinguistics. Cambridge, UK: Cambridge University Press; 1999.
[114] Assenov Y, Ramírez F, Schelhorn S-E, Lengauer T, Albrecht M. Computing topological parameters of biological networks. Bioinformatics 2008;24(2):282–4.
[115] Serrano MÁ, Boguñá M, Pastor-Satorras R, Vespignani A. Correlations in complex networks. In: Caldarelli G, Vespignani A, editors. Large scale structure and dynamics of complex networks: from information technology to finance and natural science. Singapore: World Scientific; 2007. p. 35–65.
[116] Götz-Votteler K. Describing semantic valency. In: Herbst T, Götz-Votteler K, editors. Valency: theoretical, descriptive and cognitive issues. Berlin: Mouton de Gruyter; 2007. p. 37–50.
[117] Liu H, Zhao Y, Li W. Chinese syntactic and typological properties based on dependency syntactic treebanks. Poznań Stud Contemp Linguist 2009;45(4):495–509.
[118] Duanmu S. Stress and the development of disyllabic words in Chinese. Diachronica 1999;16(1):1–35.
[119] Dronjic V. Mandarin Chinese compounds, their representation, and processing in the visual modality. Writing Syst Res 2011;3(1):5–21.
[120] Langacker RW. A dynamic usage-based model. In: Barlow M, Kemmer S, editors. Usage-based models of language. Stanford, CA: CSLI Publications; 2000. p. 1–63.
[121] Tomasello M. Constructing a language: a usage-based theory of language acquisition. Cambridge, MA: Harvard University Press; 2003.
[122] Bybee J. Language, usage and cognition. New York: Cambridge University Press; 2010.
[123] Collins AM, Loftus EF. A spreading-activation theory of semantic processing. Psychol Rev 1975;82(6):407–28.
[124] Dell GS. A spreading-activation theory of retrieval in sentence production. Psychol Rev 1986;93(3):283–321.
[125] Bordag S, Bordag D. Advances in automatic speech recognition by imitating spreading activation. In: Matoušek V, Mautner P, editors. Text, speech and dialogue. Heidelberg: Springer; 2003. p. 158–64.
[126] Zipf GK. Human behaviour and the principle of least effort: an introduction to human ecology. Cambridge, MA: Addison-Wesley Press; 1949.
[127] Boguñá M, Krioukov D, Claffy KC. Navigability of complex networks. Nat Phys 2008;5(1):74–80.
[128] Barabási A-L, Ravasz E. Hierarchical organization in complex networks. Phys Rev E, Stat Nonlinear Soft Matter Phys 2003;67(2):1–7.
[129] Givón T. Bio-linguistics: the Santa Barbara lectures. Amsterdam & Philadelphia: John Benjamins; 2002.
[130] Pinker S, Jackendoff R. The components of language: what’s specific to language. In: Christiansen MH, Collins C, Edelman S, editors. Language universals. New York: Oxford University Press; 2009. p. 126–51.
[131] Bickerton D. Language and species. Chicago: University of Chicago Press; 1990.
[132] Liu H. Probability distribution of dependency distance. Glottometrics 2007;15:1–12.
[133] Hudson R. The psychological reality of syntactic dependency relations. In: Kahane S, Nasr A, editors. Proceedings of the first international conference on meaning-text theory. Paris: École Normale Supérieure; 2003. p. 181–92.

[134] Ferrer i Cancho R. Euclidean distance between syntactically linked words. Phys Rev E, Stat Nonlinear Soft Matter Phys 2004;70:056135.
[135] Liu H. Dependency distance as a metric of language comprehension difficulty. J Cogn Sci 2008;9(2):159–91.
[136] Gibson E. Linguistic complexity: locality of syntactic dependencies. Cognition 1998;68(1):1–76.
[137] Hawkins JA. Efficiency and complexity in grammars. New York: Oxford University Press; 2004.
[138] Temperley D. Minimization of dependency length in written English. Cognition 2007;105(2):300–33.
[139] Seidenberg MS. Language acquisition and use: learning and applying probabilistic constraints. Science 1997;275(5306):1599–603.
[140] Christiansen MH, Chater N. Language as shaped by the brain. Behav Brain Sci 2008;31(5):489–508.
[141] Kirby S, Christiansen MH, Chater N. Syntax as an adaptation to the learner. In: Bickerton D, Szathmáry E, editors. Biological foundations and origin of syntax. Cambridge, MA: MIT Press; 2009. p. 325–43.
[142] Morgan JL, Shi R, Allopenna P. Perceptual bases of rudimentary grammatical categories: toward a broader conceptualization of bootstrapping. In: Morgan JL, Demuth K, editors. From signal to syntax. Mahwah, NJ: Lawrence Erlbaum Associates Inc; 1996. p. 263–81.
[143] Christophe A, Guasti T, Nespor M. Reflections on phonological bootstrapping: its role for lexical and syntactic acquisition. Lang Cogn Process 1997;12(5):585–612.
[144] Hicks JP. The impact of function words on the processing and acquisition of syntax. Ph.D. dissertation. Evanston, IL: Northwestern University; 2006.
[145] Ruhlen M. A guide to the world’s languages 1: classification. Stanford: Stanford University Press; 1991.
[146] Shibatani M, Bynon T, editors. Approaches to language typology. New York: Oxford University Press; 1995.
[147] Croft W. Typology and universals. 2nd edition. Cambridge, UK: Cambridge University Press; 2003.
[148] Song JJ, editor. The Oxford handbook of linguistic typology. Oxford, UK: Oxford University Press; 2013.
[149] Shibatani M, Bynon T. Approaches to language typology: a conspectus. In: Shibatani M, Bynon T, editors. Approaches to language typology. New York: Oxford University Press; 1995. p. 1–26.
[150] Altmann G, Lehfeldt W. Allgemeine Sprachtypologie: Prinzipien und Messverfahren. Munich: Fink; 1973.
[151] Greenberg JH. A quantitative approach to the morphological typology of language. In: Method and perspective in anthropology. Minneapolis: University of Minnesota Press; 1954. p. 192–220.
[152] Liu H. Dependency direction as a means of word-order typology: a method based on dependency treebanks. Lingua 2010;120:1567–78.
[153] Mehler A. Structural similarities of complex networks: a computational model by example of Wiki graphs. Appl Artif Intell 2008;22:619–83.
[154] Hotho A, Nürnberger A, Paaß G. A brief survey of text mining. J Lang Technol Comput Linguist 2005;20(1):19–62.
[155] Ostrovsky N. How the steel was tempered [Prokofieva R, Trans.]. London: Central Books Ltd.; 1973.
[156] Katzner K. The languages of the world [new edition]. London, New York: Routledge; 1995.
[157] Novotná P, Blažek V. Glottochronology and its application to the Balto-Slavic languages. Baltistica 2007:185–210.
[158] Comrie B, Corbett GG. Introduction. In: Comrie B, Corbett GG, editors. The Slavonic languages. London: Routledge; 2002. p. 1–19.
[159] Antiqueira L, Nunes MGV, Oliveira Jr ON, Costa LDF. Strong correlations between text quality and complex networks features. Physica A: Stat Mech Appl 2007;373:811–20.
[160] Amancio DR, Antiqueira L, Pardo TAS, Costa LDF, Oliveira Jr ON, Nunes MGV. Complex networks analysis of manual and machine translations. Int J Mod Phys C 2008;19(4):583–98.
[161] Amancio DR, Nunes MGV, Oliveira Jr ON, Pardo TAS, Antiqueira L, Costa LDF. Using metrics from complex networks to evaluate machine translation. Physica A: Stat Mech Appl 2011;390:131–42.
[162] Amancio DR, Oliveira Jr ON, Costa LDF. Identification of literary movements using complex networks to represent texts. New J Phys 2012;14(4):043029.
[163] Theakston AL, Lieven EVM, Pine JM, Rowland CF. The role of performance limitations in the acquisition of verb-argument structure: an alternative account. J Child Lang 2001;28:127–52.
[164] de Nooy W, Mrvar A, Batagelj V. Exploratory social network analysis with Pajek. Revised and expanded 2nd edition. New York: Cambridge University Press; 2011.
[165] Ferrer i Cancho R, Riordan O, Bollobás B. The consequences of Zipf’s law for syntax and symbolic reference. Proc - Royal Soc, Biol Sci 2005;272(1562):561–5.
[166] Solé R. Syntax for free? Nature 2005;434:289.
[167] Perlovsky L. Language and emotions: emotional Sapir–Whorf hypothesis. Neural Netw 2009;22(5–6):518–26.
[168] Brandes U, Erlebach T. Introduction. In: Brandes U, Erlebach T, editors. Network analysis: methodological foundations. Berlin, Heidelberg: Springer; 2005. p. 1–6.
[169] Girvan M, Newman MEJ. Community structure in social and biological networks. Proc Natl Acad Sci USA 2002;99(12):7821–6.
[170] Newman MEJ, Girvan M. Finding and evaluating community structure in networks. Phys Rev E, Stat Nonlinear Soft Matter Phys 2004;69(2 Pt 2):026113.
[171] Siew CSQ. Community structure in the phonological network. Front Psychol 2013;4:article 553.
[172] Ferrer i Cancho R. Network theory. In: Hogan PC, editor. The Cambridge encyclopedia of the language sciences. New York: Cambridge University Press; 2011. p. 555–7.
[173] Best KH. Quantitative Linguistik. Göttingen: Peust & Gutschmidt; 2006.
[174] Köhler R, Altmann G, Piotrowski RG, editors. Quantitative linguistics: an international handbook. Berlin: Walter de Gruyter; 2005.
[175] Köhler R. Quantitative syntax analysis. Berlin: Walter de Gruyter; 2012.
