Journal of Theoretical Biology 365 (2015) 226–237

Contents lists available at ScienceDirect

Journal of Theoretical Biology journal homepage: www.elsevier.com/locate/yjtbi

Comparing and distinguishing the structure of biological branching Timothy O. Lamberton a, James Lefevre a, Kieran M. Short b, Ian M. Smyth b,c, Nicholas A. Hamilton a,d,n a

Division of Genomics of Development and Disease, Institute for Molecular Biosciences, The University of Queensland, Brisbane QLD 4072, Australia Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia c Department of Anatomy and Developmental Biology, Monash University, Clayton, Melbourne, Victoria, Australia d Division of Cell Biology and Molecular Medicine, Institute for Molecular Biosciences, The University of Queensland, Brisbane QLD 4072, Australia b

H I G H L I G H T S

    

A graph alignment metric for quantifying similarity in branching structures is developed. It shows increased ability to structurally distinguish ureteric trees across development. We developed a consensus metric to align multiple trees simultaneously. Its use to create an atlas of stages and across stages of development is shown. Strict stereotypic branching across development in control and mutant kidneys is shown.

art ic l e i nf o

a b s t r a c t

Article history: Received 27 May 2014 Received in revised form 29 September 2014 Accepted 1 October 2014 Available online 13 October 2014

Bifurcating developmental branching morphogenesis gives rise to complex organs such as the lung and the ureteric tree of the kidney. However, a few quantitative methods or tools exist to compare and distinguish, at a structural level, the critical features of these important biological systems. Here we develop novel graph alignment techniques to quantify the structural differences of rooted bifurcating trees and demonstrate their application in the analysis of developing kidneys from in normal and mutant mice. We have developed two graph based metrics: graph discordance, which measures how well the graphs representing the branching structures of distinct trees graphs can be aligned or overlayed; and graph inclusion, which measures the degree of containment of a tree graph within another. To demonstrate the application of these approaches we first benchmark the discordance metric on a data set of 32 normal and 28Tgfβ þ / mutant mouse ureteric trees. We find that the discordance metric better distinguishes control and mutant mouse kidneys than alternative metrics based on graph size and fingerprints – the distribution of tip depths. Using this metric we then show that the structure of the mutant trees follows the same pattern as the normal kidneys, but undergo a major delay in elaboration at later stages. Analysis of both controls and mutants using the inclusion metric gives strong support to the hypothesis that ureteric tree growth is stereotypic. Additionally, we present a new generalised multi-tree alignment algorithm that minimises the sum of pairwise graph discordance and which can be used to generate maximum consensus trees that represent the archetype for fixed developmental stages. These tools represent an advance in the analysis and quantification of branching patterns and will be invaluable in gaining a deeper understanding of the mechanisms that drive development. All code is being made available with documentation and example data with this publication. Crown Copyright & 2014 Published by Elsevier Ltd. All rights reserved.

Keywords: Kidney Development Graph alignment Algorithms

1. Introduction 1.1. Biology and software The effective delivery and removal of biologically important fluids and gases is paramount to homoeostasis in multicellular

n

Corresponding author. Tel.: þ 61 7 3346 2033; fax: þ61 7 3346 2101. E-mail address: [email protected] (N.A. Hamilton).

http://dx.doi.org/10.1016/j.jtbi.2014.10.001 0022-5193/Crown Copyright & 2014 Published by Elsevier Ltd. All rights reserved.

organisms. Tubular branched networks are one structure by which this occurs, and branching morphogenesis is a fundamental process underlying the development of many organs of animals such as the vascular system, lung, kidney and mammary gland. In organs such as the kidney and lung, repeated branching of epithelial tissue efficiently fills a confined space while creating a large functional surface area at specialised tips to allow exchange of waste or oxygen. Recent reviews of the roles of branching morphogenesis in organ development can be found in Affolter et al. (2009) and Costantini and Kopan (2010).

T.O. Lamberton et al. / Journal of Theoretical Biology 365 (2015) 226–237

Branching morphogenesis in developing mouse lungs has been suggested to follow a remarkable predictable programme with three distinct modes, which suggests hard-wired, genetic control of the branching process (Metzger et al., 2008). Similarly, the kidney is a heavily branched organ and shows genetically controlled branching and patterning mechanisms (Short et al., 2014). As such, branching morphogenesis of the kidney has been an area of considerable study with the cellular mechanisms of control have being of particular interest (Dressler, 2006; Kopan et al., 2007; Quaggin and Kreidberg, 2008). Much early kidney work focused on in vitro studies, and looked at development modes (al-Awqati and Goldberg, 1998) and quantifying spatial data (Cullen-McEwen et al., 2002; Watanabe and Costantini, 2004). The majority of kidney branching events are bifurcations, with trifurcations occurring relatively rarely (Short et al., 2014). Often, as the internal branches grow, the trifurcated branches will reform into a pair of bifurcated branches, with a pair of the trifurcated branches growing away from the third (Watanabe and Costantini, 2004). This suggests that trifurcations may be cases of a bifurcation followed by a rapid successive bifurcation (al-Awqati and Goldberg, 1998), which cannot be distinguished until the branches extend over time. More recent work in mouse kidneys suggests that a self-avoidance mechanism of branching structures could explain the growth modes seen in the lung and other studies (Davies et al., 2014). The use of optical projection tomography (OPT) combined with quantitative imaging methods (Short et al., 2010) have demonstrated that ex vivo culture methods often poorly reflect in vivo development. Further refinements to this approach (Short et al., 2013) have seen the development of software packages which are able to extract and analyse skeletons that describe the branching ureteric tree, and reducing the 3D branching volumes of the ureteric tree of the kidney to a set of points and edges joining them. These methods and others were then applied to large scale data sets in Short et al. (2014) to give a comprehensive, quantitative, multiscale analysis of mammalian kidney development. An important aim of in vivo studies is to objectively and quantitatively assess kidney development under a range of genetic conditions and determine individual genetic contributions. For instance, it has been demonstrated that mouse embryos heterozygous for a mutation in the gene encoding transforming growth factor-beta 2 (Tgfβ2 þ /  ) exhibit decreased branching, a significant developmental delay, changes to the branching programme, and a significantly smaller renal pelvis compared to control embryos (Short et al., 2013, 2010). At early developmental stages, Tgfβ2 þ /  and control kidneys are similar as measured by number of branch points and (terminal) tips (Fig. 1). However, as development progresses the increase in complexity of the Tgfβ2 þ /  ureteric tree (as measured by tip number) kidney growth lags the controls. While tip number is a simple metric that can distinguish later stage mutant kidneys from controls it provides no information about the architecture of the distinct organs. It is therefore unclear as to whether the smaller later stage Tgfβ2 þ /  kidneys are similar in structure to the earlier stage controls but have retarded growth, or if their branch patterning is fundamentally different. Although earlier stage 7–10 controls and mutant kidneys have the same number of tips, the branching pattern could ostensibly be different, precipitating a later stage tip deficit. A broader question is whether a fundamental set pattern exists which controls ureteric tree formation. For instance, if two ureteric trees are examined that have the same number of tips will their branching structures be identical or is the bifurcation of branches to some extent random in direction but at a constant rate giving rise to trees of the same size but with very different structures? Moreover, it is unknown whether the growth of the ureteric tree is

227

Fig. 1. Detecting difference between ureteric trees of control and Tgfβ2 þ /  kidneys across early development. Shown are the mean numbers of the size (number of branch points and tips) for ureteric trees of control and mutant (Tgfβ þ /  ) mouse kidneys at distinct developmental stages, adapted from Short et al. (2010) plot showing number of tips. While between stages 7 (12 dpc) to 10 (13.75 dpc) the controls and mutants are statistically indistinguishable, by stages 11 (14.5 dpc) and 12 (15.5 dpc), Tgfβ2 þ /  kidneys show notably smaller size than control kidneys at the same stage. Error bars are standard errors of the mean (SEM).

structurally stereotypical to the point where the branching pattern of any later stage normal kidney can be seen to be an extension of the structure of earlier stage kidneys. In order to address these questions, metrics that are refined to be structurally aware are necessary for the comparison of the branching patterns. Here we describe the development of a number of such metrics for quantifying, structurally comparing and distinguishing rooted bifurcating structures. We evaluate the utility of these metrics and compare their performance against readily available metrics such as tip number. The new metrics are then used to detect and understand the structural nature of the Tgfβ2 þ /  mutant kidney phenotype. Finally, we give a detailed analysis of the patterning of the normal mouse kidney from stages from 12 dpc to 15.5 dpc (stages 7 to 12) and show that the growth is highly controlled and stereotypic. Supporting software implementing the algorithms and methods, as well as example data, are available as Supplementary material. In order to improve readability, mathematical proofs and some details are included in Section 4. Note: In the following, the term tree is used in the sense of a connected, acyclic, undirected graph T with a vertex set V ðT Þ and an edge set EðT Þ, derived from the physical kidney ureteric tree. 1.2. Quantitative tree metrics This work considers four distance metrics which provide measures of the similarity between trees. As they are distance metrics they may also be used in other analyses such as hierarchical clustering (for example). 1.2.1. Graph size difference The simplest metric for tree comparison, which we include as a benchmark for the more sophisticated methods, is to compare the

228

T.O. Lamberton et al. / Journal of Theoretical Biology 365 (2015) 226–237

graph size, here defined as the number of tree vertices using the convention of Gupta and Nishimura (1998), i.e. jTj ¼ jV ðT Þj for a tree T with a vertex set VðTÞ. Other common definitions of a tree size are: number of tips (sometimes called leaves or terminal nodes), and number of edges (branches). For a bifurcating tree, these measures are linearly related and hence we consider them equivalent. In terms of physical kidney trees, this metric represents a measure of the degree of arborisation or ramification of the kidney structure, and not necessarily the physical size although there is a general correspondence. With this metric the distance between two trees T 1 and T 2 is measured by difference in vertex number, expressed as dif f ðT 1 ; T 2 Þ ¼ jjT 1 j  jT 2 jj. Fig. 2A shows two simple trees which would have a graph size difference of |1111|¼0. 1.2.2. Fingerprint Another proposed method for rooted tree comparison that takes into account the distribution of branching depth is a fingerprint distance. First, for any tip, the depth of the tip is defined as the number of edges on a path between it and the root vertex. The fingerprint of a rooted tree is then the vector of tip depth frequencies, that is, for a tree T let t d be the number of tips at depth d. The fingerprint of tree T is denoted f print ðT Þ ¼ ðt 1 ; …; t maxdepthðT Þ Þ, where maxdepth(T) is the maximum depth of any tip in a tree (often called tree height in literature). An example of the fingerprint of a tree is given in Fig. 2A. The fingerprint distance of two trees T 1 ; T 2 is defined as the Euclidean distance between two points in the vector space of dimension maxðmaxdepthðT 1 Þ; maxdepthðT 2 ÞÞ, expressed as f dist ðT 1 ; T 2 Þ ¼ f print ðT 1 Þ  f print ðT 2 Þ. For example, the trees in Fig. 2A have fingerprints ½0; 0; 2; 4; 0 and ½0; 0; 3; 1; 2 respectively and fingerprint distance qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi pffiffiffiffiffiffi 2 2 2 2 2 ð0  0Þ þ ð0 0Þ þ ð2  3Þ þ ð4  1Þ þ ð0 2Þ ¼ 14. It is clear that the fingerprint and the fingerprint distance encode more structural information about the trees and as such we would expect these to be more structurally descriptive than the simple graph size metric. 1.2.3. Discordance and inclusion tree metrics The next level of structural comparison is to examine the common substructures of the trees. The wide body of work involving the comparison of tree graphs can be applied to the connectivity graphs of the skeletonised ureteric trees of kidneys. One method of comparing the similarity of two trees is to consider

Fig. 2. Metrics and graph measures. (A) Graph Size Difference and Fingerprint. The figure shows two tree graphs with vertices marked with cyan dots. Both trees have a graph size of eleven and the size difference of the trees is then 0. Tips are marked with x’s and the tip count difference at each depth are shown, and the Euclidean norm of the vector of differences gives the tree fingerprint distance as ((0  0)2 þ (0  0)2 þ (2  3)2 þ (4  1)2 þ(0–2)2)1/2 ¼ 141/2. (B) Common subtrees and supertrees. (Left) The trees from (A) are overlayed. (Right) graph representation showing the maximum common subgraph (cyan) and minimum common supertree (magenta) of the two trees.

the maximum common connected subgraph (MCCS) (Gupta and Nishimura, 1998). Consider two trees T 1 ; T 2 . A common connected subgraph S  is a tree such that S  is isomorphic to subgraphs T 1'  T 1 and T 2'  T 2 . A maximum common connected subgraph is then a common connected subgraph S_ of maximum size. The idea here is that S  is the largest graph for which there is a copy of it in both trees T 1 ; T 2 . An example is given in Fig. 2B. Finding the maximum common connected subgraph is equivalent to finding the smallest common supertree (Gupta and Nishimura, 1998). A common supertree S þ of two trees T 1 and T 2 is a tree such that T 1 and T 2 are isomorphic to subgraphs S0þ  S þ and S00þ  S þ . The idea here is to find the smallest tree which contains copies of both trees T 1 and T 2 . The equivalence with the common connected subgraph may be seen by noting that the size of the supertree plus the size of the common subgraph (the intersection of S0þ and S00þ ) is at least equal to the sum of the input tree graph sizes, as demonstrated by Fig. 2B. Hence, as the size of the common subgraph increases, the size of the minimum supertree (consisting of the union S0þ [ S″þ ) decreases. Effectively, the smallest common supertree and the maximum common connected subgraph are both measures of how the branches of one tree can be matched to the branches of another, minimising unmatched branches. We define a distance metric, the graph discordance, based on the common subgraph and supertree of two trees as the difference between the common supertree size and the common subgraph size, written as discordðT 1 ; T 2 Þ ¼ ðjS þ j jS  jÞ=jS þ j. The discordance gives a measure of the closeness of the tree structures with respect to their shapes, with two trees that have a high level of alignment having a small discordance value. For instance, for identical trees the minimum common super tree and the minimum common subtree are identical to one another and hence the discordance is zero. The discordance will always be less than 1 since non-empty trees will have a common subtree of at least one vertex. The size jS  j of the common subgraph may also be used to produce a graph inclusion measure, where the common subgraph jS  j is normalised to the size of the smaller of the trees i.e. inclusionðT 1 ; T 2 Þ ¼ jS  j= minðjT 1 j; jT 2 jÞ. The inclusion measure, which is not a distance measure, gives the fraction the structure of the smaller tree that is a subset of the larger. Hence the inclusion measure can be thought to represent the degree to which a smaller tree is following a growth pattern consistent with a larger one, perhaps suggesting a shared stereotypical growth programme. To understand the difference between these metrics, consider the case of a smaller tree which can be completely mapped to a subgraph of a larger tree. The inclusion metric is equal to its maximum value, 1, indicating complete consistency with a growth pattern in which the smaller tree develops into the larger. In contrast, the discordance is equal to the difference in tree graph sizes relative to the larger tree; in general this ratio forms a lower bound on the discordance, representing difference due to size alone (captured in a sense by the graph size difference metric). Hence, discordance only approaches zero for trees of the same size and structure.

1.2.4. Maximum consensus trees Multi-tree alignments allow the visualisation and quantification of all uretic trees from kidneys of a single developmental age, or even between kidneys at different developmental ages. Archetypal representations of kidneys (an “atlas”) within stages can be produced by determining shared structures and areas of difference across multiple kidneys at the same developmental age. Once determined, an atlas can be used as a baseline to test against ureteric trees from mutant kidneys to see if they fit within the range of structures typically seen

T.O. Lamberton et al. / Journal of Theoretical Biology 365 (2015) 226–237

in the control kidney atlas. In addition, multi-tree alignments can be used to show the common structures across development to build a global picture of the entire timeline of kidney development (Fig. 7). This work developed a multi-tree alignment metric, the maximum consensus supertree, for these purposes. The metric and algorithms are discussed in Section 4. 1.3. Graph alignment algorithms A range of tree comparison problems have been shown to be algorithmically soluble in a reasonable (meaning polynomial) time (Aho et al., 1974; Akutsu, 2010; Chung, 1987; Dubiner et al., 1994; Farach et al., 1995; Jiang et al., 1995; Kilpeläinen and Mannila, 1995; Lingas, 1983; Matula, 1968; Shamir and Tsur, 1999). The Maximum Common Connected Subgraph Problem with root-to-root embedding is of most biological relevance for kidney morphogenesis, reflecting the fact that kidney ureteric tress have a natural root and grow and branch in a continuous manner which preserves the partial-ordering of branch points. In other words, the generation and parental relationships of branch points are expected to be preserved during the development process. The general Maximum Common Subgraph Problem for non-trees has been shown to a subset of the Maximal Clique Problem which is known to be NP-hard (Koch, 2001). When considering trees, the Largest Common Subforest Problem (also called the Largest Common Topologically Embeddable Subtree Problem), which involves finding the size of the largest subtree of one tree that can be topologically embedded in the other, has been shown to have a polynomial time solution (Gupta and Nishimura, 1998). Topological embedding requires that edges are mapped to vertex-disjoint paths. An adapted algorithm that forces connectedness of the subtree can solve the Maximum Common Connected Subgraph Isomorphism Problem with the same complexity (Gupta and Nishimura, 1998). A complimentary problem is the Smallest Common Supertree Problem (Gupta and Nishimura, 1998). Note that extending the Largest Common Subforest Problem to multiple trees has been shown to be NP-hard (Shamir and Tsur, 1999). Lingas (1983) developed an algorithm for the root-to-root subtree isomorphism problem that involves reducing trees to labelled directed acyclic graphs (ldags) and finding subgraph isomorphisms between ldag representations of trees. This process avoids repeated calculations for trees with multiple isomorphic subtrees (Lingas, 1983). In this work, a simplified version of the Gupta and Nishimura (1998) algorithm for binary trees and ldags was used to compute maximum common connected subgraphs and corresponding supertrees for the analysis. This algorithm is detailed in Section 4. Also detailed in the Methods section is a new approach for extending the algorithm for multiple binary trees using a measure involving minimising the sum of pair-wise common connected subgraph sizes, which we have called the Maximum Consensus Supertree. There it is shown that determining the Maximum Consensus Supertree is computationally tractable by presenting a novel algorithm of com  putational order O nk for it computation. We also argue that this measure is superior to other measures such as the multi-tree smallest common supertree.

comparison metrics. Summary information is shown in Table 1. Details of the data sets can be found in Short et al. (2014,2013). Each kidney ureteric tree was imaged using optical projection tomography (OPT), skeletonised, and graphs of the branching structure of the ureteric trees created using Tree Surveyor (Short et al., 2013). These trees were output in GraphML format which contained details of segment IDs and parent segment IDs. In Short et al. (2014), it was shown that stage 7 to 12 mouse kidneys all exhibit a rooted branching structure with major 6 lobes. For each of the kidneys, the Tree Surveyor GraphML files were imported into IllouraTM (McComb et al., 2009) (a software tool for visualisation and analysis of spatial relationship data), and the lobe structure manually annotated for each (see Section 4 for details).

2.1.2. Comparison and statistical power of the discordance measure The first test of the new graph metrics was to determine their capability to distinguish distinct stages of normal ureteric tree development. To quantify the similarity between the branching structures of kidney ureteric trees at various stages of development, the smallest common supertree (SCS) algorithm (described in Section 4) was applied to graph representations of the kidney ureteric trees for pairs of kidneys. This determines the common subgraph S  , which was then used to calculate a discordance and inclusion score for the pair. Pairs of kidneys were put into classes distinguished by the stages and mutant/control status of the kidneys, and mean and standard errors for each class were calculated. For example a pair of control kidneys from stage 7 would be denoted as a C7-C7 pair, and a control stage 8 matched with a mutant (Tgfβ2 þ /  ) stage 9 would be denoted as a C8T9 pair. In Fig. 3 the branching structures for control kidneys for Stages 7 to 12 are compared within stages and between subsequent stages using the discordance measure. It can be clearly seen that the mean discordance between trees at a given stage (shown centrally above markers C7, …, C12) is generally much lower than the inter-stage discordances (shown between markers). Hence the branching structures of the ureteric trees are readily distinguished between subsequent stages using the graph discordance metric.

Table 1 Summary of the control and Tgfβ2 þ /- mutant kidney ureteric tree graph data. Stage Approximate age (dpc)

Samplesa ntree

7

12

6/3

267 4/267 1

8

12.5

5/3

49 7 3/32 73

9

13.25

7/11

10

13.75

5/6

11

14.5

5/2

12

15.5

4/3

1707 14/ 188 76 230 7 14/ 2167 7 4487 40/ 239 7 40 12417 35/ 3587 35

Total

2. Results 2.1. Data and analysis 2.1.1. Imaging, skeletonisation and graph generation A data set of 32 control and 28Tgfβ2 þ /  mutant mouse kidneys over five stages of development was used for evaluate the tree

229

a

nldag a

nldag x% ntree

100a 137 1/ 147 1 157 1/ 147 1 237 1/ 297 1 31 71/ 31 71 49 72/ 337 3 947 1/ 447 3

49%/ 55% 30%/ 45% 14%/15% 14%/14% 11%/14% 8%/12%

32/28

The developmental stage is defined by observation of embryonic limb morphology and the approximate corresponding age, days post coitum (dpc), are shown. The number of kidneys at each stage as well as corresponding embryo ages is shown as well as the average kidney tree graph size and variability for control and mutant kidneys, respectively. Also shown is the average number of vertices of the labelled directed acyclic graph (ldag) reductions of the kidney trees at each stage. Reduction of kidney tree graphs to ldags is a computational step used to speed up the alignment algorithm, and is detailed in Section 4. a

values ¼control/ Tgfβ2 þ /-, n values include standard errors.

230

T.O. Lamberton et al. / Journal of Theoretical Biology 365 (2015) 226–237

Fig. 3. Kidney stages are distinguishable with discordance metric. Mean and standard error in pairwise graph discordance between control kidneys within and between consecutive development stages are shown. Each colour represents a distinct stage. Firstly, all distinct pairs of trees for a stage are compared against each other with the discordance measure and the mean and the standard error of discordances are marked centrally above 7, 8, …, 12 in the plot. Each stage is then compared pairwise against the previous stage kidneys, and then against next stage kidneys. These discordance comparisons between subsequent stages are plotted between the stage markers. It can be seen that is all cases that the within-group means are lower than the inter-group means indicating that kidneys are structurally less different to other kidneys at within the same stage than adjacent stages. Standard errors bars shown for within- and -between-group means are all nonoverlapping indicating that within-stage variation is much smaller than the differences between stages.

Indeed, in all cases the means were statistically significantly different in all cases using a two tailed t-test. In order to further test the statistical power of the discordance metric in comparison to the fingerprint and graph size difference metrics, a number of tests to distinguish control and mutant trees were performed. For each mutant stage, a two-sample t-test was performed against each control stage to test if the mutant-control group to the within-control group had significantly different means. This tested the hypothesis that mutant-control and control pair metrics have the same distribution. In Fig. 4, the ratios of the t-test p-values for the discordance metric against both graph size difference and fingerprint metrics are plotted. In most cases, the ratio is much less than one, indicating that the discordance measure found a more significant difference between Tgfβ2 þ /  mutant kidneys and control kidneys than the other metrics. The discordance measure performance showed less power in some cases at distinguishing the tree graphs of mutant kidneys of stage 9 or higher from much smaller stage 7 control tree graphs, although absolute values were still much smaller than the 5% threshold of significance. Additionally, p-values for the discordance metric were lower for the mutant stage 12 comparisons with some control stages. This may be due to the fact that discordance values are normalised, which may weaken size-dominated differences compared to the un-normalised metrics. It also suggests that much of the difference between closely related kidneys (e.g. within a stage) could be due to the variation in graph size within these groups, as when size difference is accounted for (by normalisation) closely related kidneys are less statistically distinct. The metric showed lower levels of distinction between

Fig. 4. Discordance metric generally shows greater statistical power to distinguish between Tgfβ2 þ /  and control kidneys than fingerprint and graph size metrics. In each plot, the ureteric trees graphs of one Tgfβ2 þ /  mutant stage, T7, …, T12 is compared against all of the control trees pairwise for stages 7 to 12. For each control stage – Tgfβ þ /− stage set of trees, two sample t-tests were calculated for the means calculated by each of the metrics. The ratio of the p-values, pdiscord/pdiff (discordance p-value over graph size difference p-value) and pdiscord/pfprint (discordance p-value over graph size difference p-value) are then shown. In the majority of cases the statistical power of the discordance metric is similar to or better than that of the other measures, though there are exceptions in some of the cases where the tree graph sizes are very different such as the C7-T11 and C7-T12 comparisons.

certain mutants and the much smaller stage 7 controls. It is also interesting to note that it can be seen from the plot that the fingerprint metric has a generally lower power to distinguish trees than the graph size difference metric, which may suggest that variation in distribution of tip depths is not highly correlated with graph size, and hence could obscure variation in size difference.

2.1.3. Understanding the nature of the Tgfβ2 þ /  mutant mouse kidneys While it was clear that the mutant Tgfβ2 þ /  mouse kidneys had significantly reduced numbers of tips at stages 11 and 12 in comparison to control kidneys (Fig. 1), the structural nature of the difference at these stages, and similarities or differences at earlier stages, was less obvious. In order to probe this, we first used the discordance measure to examine which developmental stage control trees the mutant trees were structurally most similar to. Fig. 5 shows the mean discordance for pairwise comparison between mutant Tgfβ2 þ /  and control kidney ureteric trees at different stages of development. We can determine the most similar control stage for each mutant stage by finding the control stage at which the discordance is a minimum. Within-group

T.O. Lamberton et al. / Journal of Theoretical Biology 365 (2015) 226–237

231

Fig. 5. Mutant Tgfβ2 þ /  kidneys match best to control kidneys of the same or earlier development stages. Mean pairwise discordance comparisons between Tgfβ2 þ /  (labelled by T) and control (C) kidney ureteric trees at different stages of development. Lower values indicate a higher similarity between the Tgfβ2 þ /  and index kidney groups. Circles indicate the minimum discordance for each Tgfβ þ /− group, corresponding to the control group to which each Tgfβ þ /− group is most similar. The within-group means for control groups are shown by the black line. Error bars are SEMs.

means for each control stage are indicated by black diamonds, allowing us to compare how well the mutant kidneys match to an average pair of controls of that stage. It can be seen that the stage 11 mutant trees have a discordance minimum when compared to the control stage 10 trees. Similarly, the stage 12 mutant trees have a discordance minimum when compared to the control stage 11 trees. Further, the inter-stage discordance values comparing from T11 to C10 are almost indistinguishable from the intra-stage average discordance of C10 pairs to itself. And similarly, the inter-stage discordance values comparing from T12 to C11 are almost indistinguishable from the intra-stage average discordance of C11 pairs to itself. The mutant stages 7, 9 and 10 are most similar to control stages 7, 9 and 10, respectively, and somewhat surprisingly the mutant stage 8 is most similar to the control stage 7, though given that these two stages are not greatly dissimilar in graph size a small effect might lead to such an observation. Hence overall, the discordance metric is provides strong evidence that the early stages of development the control and mutant kidneys trees structurally track each other quite closely, however at stages 11 and 10 essentially a whole stage delay has occurred in the mutant kidneys.

2.1.4. Stereotypy in normal and Tgfβ2 þ /  mutant development Next we wished to determine the degree to which the branching pattern of ureteric tree growth is stereotypic, that is all trees at a given stage have approximately the same branch structure and one stage extends into another over time. To do this the inclusion metric, which measures how well one tree will “fit” inside another, and hence one tree is an extension of another, is considered. This differs from the discordance measure which will only give a low value when trees are of a similar graph size. Fig. 6A shows the inclusion metric used to compare each control stage against later stage controls, each colour corresponding to a distinct stage. Note that an inclusion score of 1 corresponds to one tree being fully contained within another. It can be seen that the average inclusion score for any stage against any later stage is high, ranging from 0.93 to 1. In other words, a very high proportion of each earlier stage tree will fit within each later stage tree. Observing the initial data points of each line, which represent

Fig. 6. Stereotypy of ureteric tree structure across stages 7 to 12. (A) Normal mouse kidney ureteric tree development is stereotypic. Each coloured line denotes the set of trees of a distinct stage of normal (control) development. This set of trees is then compared against each of the stages C7 to C12 using the discordance metric and a mean discordance between the stages calculated and plotted. It can be seen that earlier development stages show progressively increasing values when compared to later stages, indicating that the generally smaller early stage trees become contained within the larger, later stage trees. Generally, comparisons within a stage show some degree of variation that is not explained just by a graph size difference, indicating variability in the growth rates of different parts of the kidney structure while maintaining a stereotypical growth pattern over longer time periods. (B) Early-to-mid stage mutant Tgfβ2 þ /  kidney growth programme is consistent with control kidneys while Tgfβ2 þ /  late stage kidneys show delayed growth. (Left) Graph inclusion measures for control and mutant kidney ureteric trees stages at early and middle stages of development are shown. Mutant growth is consistent with and tracks the controls. (Right) Here graph inclusion is used to compare late stage (11-12) mutant trees to control stages. It can be seen that mutant stages 11 and 12 show similar percentage of inclusion to earlier stage 10 and 11 control kidneys, thus suggesting that timing but not structural changes are occurring in the mutant growth programme. Error bars are all SEMs.

the within-group inclusion means, and the following points shows that there is some variation in the growth pattern within a stage, with the variation gradually reduced over following stages as the kidney structures converge. Overall, the inclusion metric shows that all earlier stages are largely contained within later kidneys, suggesting that all kidneys follow a relatively stereotypical predetermined pattern of growth. Examining the mutant Tgfβ2 þ / trees against controls shows a similar pattern. In Fig. 6B comparisons are made between the inclusion metric values for mutant stage 7 to 10 trees and control trees at all stages, and show a pattern convergence that track the control comparisons against themselves and hence follow the same stereotypical growth programme. Fig. 6C then shows that stages 11 and 12 mutants track closely to earlier stage controls as was to be expected, given the similarity between mutants and their closest controls shown

232

T.O. Lamberton et al. / Journal of Theoretical Biology 365 (2015) 226–237

Fig. 7. Maximum consensus tree alignment shows stereotypical growth pattern of developing kidneys. Overlay of normal kidney ureteric tree structures from each of Stages 7 to 12, with each stage being completely contained within the later stages.

in Fig. 5. Hence the mutant Tgfβ2 þ / trees follow a pattern that is still structurally stereotypic, but is delayed at later stages. As was noted above and is fully described in Section 4, the inclusion (and discordance) metrics may be used algorithmically on multiple trees simultaneously to find a consensus tree that minimises the sum of the discordance between all of the trees in the consensus tree (the maximum consensus tree). This maximum consensus tree can be used, for instance, as an ‘atlas’ for all of the trees of a given stage and to some extent represent the fundamental structure of that stage. Another use is to consider multiple trees, each at a different stage, in order to visualise the growth pattern and stereotypy across stages. In Fig. 7, the maximum consensus tree for 5 control trees, one each from stages 7 to 12. In this case, the trees each fit perfectly within each other and so each colour represents the extension of that stage tree over the earlier stages.

control ureteric trees by showing that while the structure of the mutants was the same as that of controls, there was a delay in development. In order to understand growth and stereotypy of trees, the graph inclusion metric was introduced. There it was shown how the inclusion metric could be used to present strong evidence for stereotypical growth during development, with earlier stage kidney structures being fully contained within later stages. It was also used to show that mouse kidneys heterozygous for a mutation in Tgfβ remained stereotypical while showing a slowing of development at around stage 10, confirming the results of Short et al. (2010). This work also presented a novel generalised multi-tree alignment algorithm for finding the maximum consensus supertree that minimises the sum of pairwise discordances between multiple trees. As well enabling the visualisation and inclusion between stages as was shown in Fig. 7, we believe that this algorithm will have utility in creating “median tree atlases” that represent an archetype for each development stage, and multi-tree overlays that could be used to visualise the variation in growth within a stage. More generally, the graph alignment-based metrics will allow fine-grained analysis of internal kidney structures. One area of interest is to apply graph metrics to internal lobes or major branches in kidneys, as identified in Short et al. (2014). This could allow the general spatial layout of graphs to be investigated, to see if differences are localised for instance to particular lobes, or are evenly distributed. In the future, this type of analysis also has the capability to detect subtle details of the kidney development programme such as localised changes in branching patterns after genetic perturbation, and therefore presents a valuable tool for the evaluation of genes which play a role in patterning. As well as allowing quantitative measurements of kidney stereotypy, graph alignments are a useful visualisation tool, displaying areas of differential growth within and development across stages. This is demonstrated by the graph alignments shown in Fig. 8. The graph alignment metrics developed could readily be applied to other biological systems that have a bifurcating branched structure, such as respiratory, salivary and circulatory systems. In particular, the metrics and algorithms could be applied to quantitatively test claims that lung structure and other branching patterned organs in development are stereotypical process (Metzger et al., 2008). Graph metrics might also be used to compare between different biological systems, for example, investigating similarities and differences in the growth programs of lungs and kidneys, and thus perhaps hint at underlying common regulatory mechanisms.

3. Conclusions Quantitative analysis of developing kidney structures is important for many purposes such as analysing the range of natural growth variability, identifying and classifying important stages of kidney development, and allowing the detection of subtle phenotypic changes caused by perturbing growing conditions during development. Here we have presented several new metrics and methodologies towards these goals. We have demonstrated that graph alignment-based metrics, in particular the discordance of the common supertree, can be used to measure differences in kidney tree structures. The discordance metric was shown to be able to distinguish ureteric tree structure of mouse kidneys at differing stages of in-utero development. The relative discordance measure was shown to generally distinguish differences between kidneys of Tgfβ þ /  mutant and control mouse kidneys at different stages to a greater degree than metrics based on tree graph size or fingerprint/tip depth distribution. It was then shown how the discordance measure could be used to dissect and determine the nature of the difference between mutant and

4. Methods 4.1. Lobe annotation The lobe structure of the kidney data set was annotated as described in Short et al. (2014), and is summarised as follows: for each kidney, the anterior–posterior axis (A–P axis) can be identified as the anterior more rounded. The two child vertices of the kidney root are then annotated as the roots of the anterior (A) and posterior (P) lobes by their relative position on the A–P axis. The procedure for annotating the sub-lobes involves identifying tips of the kidney which are furthest from the root along the A–P axis. Then, for the children of the anterior and posterior lobe roots, the children that are ancestors of the furthest tips on the A–P axis is labelled anterior major (AM) and posterior major (PM), with the other children being labelled anterior minor (Am) and posterior minor (Pm). If practical, this can be repeated with the children of

T.O. Lamberton et al. / Journal of Theoretical Biology 365 (2015) 226–237

233

Fig. 8. Graph alignments of control kidneys. Shows a stage 9 and 10 pair (left) and two stage 10 s (right). Both graphs show overlapping branches in cyan. In each graph, branches of the graph which map to distinct kidneys are shown in green for one tree and blue for the other. For the left figure, the absence of blue tips indicates that stage 9 graph is completely contained by the stage 10 graph. The graph alignment technique can be applied to investigate the regional distribution of differences between kidneys of a similar level of development, and the rate of growth in different regions shown as development progresses. Node labels correspond to the main lobes of the mouse ureteric tree as defined in Short et al. (2014): anterior (A); anterior minor (Am); anterior major (AM); anterior major lateral (AMl); anterior major anterior (AMA); posterior (P); posterior minor (Pm); posterior major (PM); posterior major lateral (PMl); posterior major posterior (PMP);

the AM and PM branches, to determine the anterior major anterior (AMA), anterior major lateral (AMl), posterior major posterior (PMP) and posterior major lateral (PMl) lobes. Additional divisions of the AMA and PMP lobes can be made to determine anterior major anterior anterior (AMAA), anterior major anterior lateral (AMAl), posterior major posterior posterior (PMPP) and posterior major posterior lateral (PMPl) lobes. For kidneys at early stages of development, it is often not meaningful to consider divisions beyond the highest six (A, P, AM, PM, Am, Pm), or ten (A, P, AM, PM, Am, Pm, AMA, AMl, PMP, PMl) lobe divisions. The annotations described above represent a visually observed stereotypic growth behaviour, and by labelling the trees, matches between corresponding lobes were forced when matching (SCS and MCS) algorithms were applied. In addition to the lobe structure, the front–back (dorsal– ventral) positioning of the lobes was also annotated. 4.2. Algorithms The results presented in this work for the most part consider metrics comparing pairs of trees and how well they align with one another. Algorithms for pairwise tree alignment are well-studied in the literature (Aho et al., 1974; Gupta and Nishimura, 1998; Lingas, 1983; Matula, 1968). However, these comparison methods can be considered as part of a larger class of problems in which we compare multiple trees simultaneously, which to our knowledge has not been previously analysed in detail. Such alignments based on minimising a metric associated with multiple simultaneous comparisons we call consensus trees. Below we will prove a number of general results on consensus trees, which when N ¼ 2 corresponds to pairwise tree comparisons. A consensus tree utilising five trees (N ¼5) at distinct stages was generated algorithmically using the below methods and is shown in Fig. 7. 4.2.1. Algorithm implementation: Supplementary Material The algorithms described below were implemented as a library of Matlab functions, provided as Supplementary material to this

document. For more information on how to use the provided library, open the file “README.md” from the Supplementary material in a text editor or Matlab. A short demo of the library is provided in the “demo” folder (to follow the demo, open “demo. md” in a text editor or Matlab). For the actual algorithm implementation see the function source file “gmr.m”. 4.2.2. Consensus supertree definition For a multiple tree generalisation of the maximum common connected subgraph/smallest common supertree problem, we introduce the multi-tree generalisation of a (pairwise) common supertree: a common supertree S þ such that a set of trees T 1 ; T 2 ; …; T N , are isomorphic to subgraphs SðþiÞ DS þ ; i ¼ 1; …; N. In the following we denote the associated isomorphic tree-tosubgraph mappings f i : T i -SðþiÞ ; i ¼ 1; 2; …; N. A straight-forward metric to measure S þ is, analogous to the smallest common supertree problem, to use the size of the supertree jS þ j. This metric has some undesirable properties, as changing any alignments that do not change the size of the supertree, preserves the metric score, so two smaller trees can be aligned in an arbitrary way, if also aligned to a larger tree that contains both smaller trees and doesn’t have lower weighted branches. Metrics that consider only the common connected subgraph of all trees have similar problems. A metric that avoids these problems and nicely generalises the MCCS problem, which we will call the tree consensus, is defined as the sum of pairwise common connected subgraph sizes. For the supertree problem as defined above, let Lði; jÞ be the size of the pairwise common connected subgraph of T i and T j under the mappings f i and f j , that is Lði; jÞ is the size of the subgraph of S þ consisting of vertices shared by SðþiÞ and SðþjÞ . We can state the size of a common connected subgraph as a sum over all supertree vertices u A V ðS þ Þ of the product of a piecewise alignment function 8   < 1; if u A V SðþiÞ xi ðuÞ ¼ ð1Þ : 0; otherwise

234

T.O. Lamberton et al. / Journal of Theoretical Biology 365 (2015) 226–237

calculated for each tree (see Fig. 9). It can be seen that by summing over pairs of input trees that ! 1 1 ∑ ∑ xi ðuÞxj ðuÞ : L ¼ ∑ ∑ Lði; jÞ ¼ ∑ 2 i jai u A V ðS þ Þ 2 i j a i The term in brackets on the L.H.S is simply the number of pairs of trees that map to a vertex u. The number of ways to choose pairs   N from N objects is given by the binomial coefficient ¼ 2 1 2N ðN  1Þ. Defining wðuÞ as the number of trees that map to a vertex u, we can hence express the concensus score of a supertree as L¼

1 ∑ wðuÞðwðuÞ 1Þ: 2u A V ðS þ Þ

ð2Þ

From this expression it follows that consensus trees can be considered as weighted binary trees, with a set of vertex weights wðS þ Þ ¼ wðV ðS þ ÞÞ. For these weighted trees (where weights represent branch repetition) we can generalise the common supertree metric by defining the weight of a common supertree branch to be the sum of the weights of mapped tree branches. The metric L ¼ ∑u A V ðS þ Þ ðw1 ðuÞ þ w2 ðuÞÞðw1 ðuÞ þw2 ðuÞ 1Þ, where S þ is the common supertree of weighted trees T 1 and T 2 , thus generalises the pairwise supertree size metric for weighted trees. It is easily seen that the concensus metric is exactly the common connected subgraph metric for a pair of trees with branch weights being 1 or zero. A consensus tree is a maximum consensus tree if it attains the maximum value of L over all possible consensus trees on the set of trees.

4.2.3. Discordance metric definition For a given consensus supertree, we can define a (sum pairwise) discordance that is equivalent to the consenus score, as follows. The discordance score is found by summing the difference in the size of the common supertree and common connected subgraph for each pair of trees. We can apply the following identity jT i jþ jT j j ¼ jSðþiÞ [ SðþjÞ jþ jSðþiÞ \ SðþjÞ j which states that the size of the common supertree plus the common connected subgraph of an alignment of two trees equals the sum of the tree graph sizes. Noting that Lði; jÞ ¼ jSðþiÞ \ SðþjÞ j the difference in common supertree and common connected subgraph size is given by Dði; jÞ ¼ jSðþiÞ [ SðþjÞ j  jSðþiÞ \ SðþjÞ j ¼ jT i j þ jT j j  2Lði; jÞ:

Fig. 9. Piece-wise alignment function is used to determine consensus score. (Left) The alignment of the trees from Fig. 2B; and (Right) the common supertree showing tuples [x1  2] of values of the alignment functions (Eq. 1) at each vertex. The tuples contain 1's for each tree that is mapped to the corresponding supertree vertex. For example a tuple of [1 0] for a supertree vertex indicates that tree 1 (black on left) is mapped to that vertex and tree 2 (red on left) is not. The size of the common subgraph can be found by summing the product of the vertex alignment function values for each associated tree over all vertices. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Summing over all pairs of i and j returns the score 1 D ¼ ∑ ∑ Dði; jÞ ¼ ðN  1Þ∑jT i j 2L: 2 i jai i As both N and the tree graph sizes jT i j are independent of the alignment, we can see that the minimum discordance corresponds to the maximum consensus value due to the negative linear relationship. 4.2.4. Labelled directed acyclic graphs (ldags) definition In order to develop algorithms to find maximum consensus tree/minimum discordance for a set of trees, a useful computational tool is to find labelled directed acyclic graph (ldag) representations (Lingas, 1983), which give a compact form for trees which have multiple isomorphic subtrees (see Fig. 10). The process of forming an ldag representation merges subtrees that are isomorphic and links parents to the merged subtree. For the ldag representation, each vertex corresponds to the root of a unique subtree of the original tree, with the subgraph induced by descendants of a vertex forming the ldag representation of the subtree (in the following, depending on the context, we use subtree to refer to both the subtree itself and the induced subgraph representation). In the ldag representation, edge weights correspond to the number of isomorphic child subtrees of the parent vertex. Note that the ldag representation is not necessarily a tree, as vertices can have multiple parents. In the following, the size of an ldag refers to the size of its tree representation, and not necessarily the number of ldag vertices. Considering the similarity of ldags, we find analogous common subtree problems to those for trees. Lingas (1983) defines an rembedding of ldag A1 in ldag A2 if the tree representation T ðA1 Þ of A1 is root-to-root isomorphic to a subgraph of T ðA2 Þ. We can similary define a common r-embedding of A1 and A2 , which is the ldag representation of the common connected subgraph of T ðA1 Þ and T ðA2 Þ. Analogous to a common supertree is a common supergraph into which A1 and A2 can be r-embedded. 4.2.5. Algorithm for tree reduction to labelled directed acyclic graph (ldag) An algorithm for finding a ldag reduction LDAGðTÞ of a tree T is described by Lingas (1983), and can be summarised as follows: 1. Define the maximum descendent distance maxddistðVÞ of a vertex V as the number of edges along the path from V to a descendent of V of maximum depth (also sometimes known as the vertex height). Go to step 2. 2. Find leaves of T, assign leaves to groups based on identical labels (and weights if T is weighted), and for each group create a corresponding vertex u A V ðLDAGðT ÞÞ. Let i ¼ 0. Go to step 3. 3. If i ¼ maxdepthðT Þ stop, else go to step 4. 4. Find all internal (non-leaf) vertices maxddist ðV Þ ¼ i þ 1. For each vertex, create a list of the LDAG vertices of its children.

Fig. 10. Tree and ldag representation of a graph. (Left) Tree and (Right) ldag representations of a rooted tree. Each vertex of the ldag corresponds to the root of a unique subtree (up to isomorphism) in the tree representation, indicated by a shared colour. Ldag edges are weighted by the number of isomorphic child subtrees.

T.O. Lamberton et al. / Journal of Theoretical Biology 365 (2015) 226–237

Groups vertices that have identical labels (and weights) and identical sorted child lists, and for each group create a corresponding vertex u A V ðLDAGðT ÞÞ. Create directed edges e A EðLDAGðT ÞÞ pointing towards the LDAG vertices in the group’s child list, weighted by the number of times that vertex appears on the child list. Set i-iþ 1 and return to step 3. 4.2.6. Algorithm for finding a smallest common supertree (SCS) The approach to find the smallest common (root-to-root) supertree between two (binary, rooted) trees was via ldag reduction and a recursive algorithm to find the maximum common connected subgraph (shown to be equivalent to SCS above), as described by the following. See Supplementary material for source code for the algorithm. Assume for the moment that both T 1 and T 2 are full binary trees. 1. Find A1 ¼ LDAGðT 1 Þ and A2 ¼ LDAGðT 2 Þ via ldag reduction. Go to step 2. 2. With the following notation:  p ¼ ðv1 ; v2 Þ a pair of vertices with v1 A V ðA1 Þ; v2 A V ðA2 Þ;  rðAi Þ the root of Ai ;  Avi i the subgraph of Ai induced by the descendants of a vertex vi A V ðAi Þ;    MCCSðpÞ ¼ MCCS Av11 ; Av22 the maximum common connected subgraph of subgraphs Av11 and Av22 ;  LðpÞ ¼ Lðv1 ; v2 Þ the size of maximum common connected subgraph MCCSðpÞ;    CSðpÞ ¼ CS Av11 ; Av22 the common supertree; and p v  mappings f i : Ai i -CSðpÞ;

3.

4. 5.

6.

7.

determine Lðr ðA1 Þ; r ðA2 ÞÞ, CSðA1 ; A2 Þ and mappings f 1 : A1 -CSðA1 ; A2 Þ and f 2 : A2 -CSðA1 ; A2 Þ by evaluating the recursive algorithm beginning at step 3 with p ¼ ðr ðA1 Þ; r ðA2 ÞÞ. Once evaluated, go to step 9. (Recursive algorithm) For p ¼ ðv1 ; v2 Þ, if the labels of v1 and v2 are incompatible (that is, if both vertices have labels and are labelled differently), return LðpÞ ¼ NaN (not a number), p p CSðpÞ ¼ ∅ (empty set), and f 1 ¼ f 2 ¼ ∅ else go to step 4. If p has been previously traversed, then return cached results p p for LðpÞ; CSðpÞ; f 1 ; f 2 , else go to step 5. If v1 is a leaf then create CSðpÞ C Av22 (i.e. form the common supertree so that it is isomorphic to the subgraph of other input p p tree) and return LðpÞ ¼ 1; CSðpÞ; f 1 ðv1 Þ ¼ f 2 ðv2 Þ ¼ r ðCSðpÞÞ, else if v1 v2 is a leaf then create CSðpÞ C A1 and return LðpÞ ¼ 1; p p CSðpÞ; f 1 ðv1 Þ ¼ f 2 ðv2 Þ ¼ r ðCS

ðpÞÞ, else go to step 6. Let C i ¼ ci : ðvi ; ci Þ A EðAi Þ be the set of children of vi . Find C 1 and C 2 . For all pairs of children pc ¼ ðc1 ; c2 Þ, if unknown, p p determine L pc (and the corresponding CS pc ; f 1c ; f 2c ) by evaluating  recursive function with p ¼ pc beginning at step 3. When L pc is known for all pairs of children then go to step 7.

We define a pairing P ðC 1 ; C 2 Þ ¼ pc as a set of pairs such that the total number of pairs with a child ci is equal to the weight wðeÞ of the child-parent edge e ¼ ðvi ; ci Þ A EðAi Þ (see example below). For each possible pairing P ðC 1 ; C 2 Þ, determine the sum of MCCS size for all child pairs, i.e. SLðP Þ ¼ ∑ Lðpc Þ. If p AP

c   L pc ¼ NaN for any pair pc A P, then SLðP Þ ¼ NaN. Go to step 8. 8. If SLðP Þ ¼ NaN for all possible pairings P ðC 1 ; C 2 Þ, then return p p LðpÞ ¼ NaN; CSðpÞ ¼ ∅; f 1 ¼ f 2 ¼ ∅. Otherwise, find the maxmis n n ing pairing P such that SL P ¼ max SLðP Þ (i.e. the pairing such P that the sum of the MCCS size for all pairs in set is maximum). If more than one such pairing exists, then choose one (see also Algorithmic modifications section below). Form the common supertree CS  ðp  Þ by creating a root vertex and appending the subtree CS pc for each pair pc A P n (see example below). The  size of the common subtree is LðpÞ ¼ SL P n þ 1. The mappings

235

p

f i are formed by concatenating the mappings for the child pairs p f i c : pc A P n and the mapping of vi to the root of CSðpÞ, i.e. p p p f i ðv1 Þ ¼ r ðCSðpÞÞ. Return LðpÞ; CSðpÞ; f 1 ; f 2 . 9. Undo the ldag reduction algorithm to get tree form treeðCSÞ. 1 2 Example. If both trees have two children, i.e. C 1 ¼ fc1 ; c1g; C 2 ¼ fc12 ; c22 g, then the two possible pairings are P ¼ c11 ; c12 ;  2 2    

n c1 ; c2 g and P ¼ c11 ; c22 ; c21; c12  . The  maximising    pairing  P is thus P : SLðP Þ ¼ max L c11 ; c12 þ L c21 ; c22 ; L c11 ; c22 þL c21 ; c12 .

4.2.7. Finding a maximum consensus supertree (MCNS) Generalising the approach used for finding the smallest common root-to-root supertree for two trees, we can define a recursive algorithm to find the maximum root-to-root consensus supertree of N trees. The algorithm adapts the SCS algorithm above in the following ways:

 (Step 1) Ldag reductions A1 ¼ LDAGðT 1 Þ; …; AN ¼ LDAGðT N Þ are found for trees T 1 ; …; T N .





 (Step 2) Pairs are replaced by M-tuples p ¼ vi ; …; vNi , for 1

 

 

M

some subset of M trees, i A ð1; N Þ; 2 rM r N. Recursive algorithm returns maximum consensus score LðpÞ, common superp tree CSðpÞ and mapping f ib ; b ¼ 1; …; M. (Step 5) If at most one vi is internal, then create CSðpÞ CAvi i and p return LðpÞ ¼ 1; CSðpÞ; f i ðvi Þ ¼ r ðCSðpÞÞ, else go to (Step 6). (Step 6) Let vj ; j A ð1; NÞ be the interval vertices of vi . Sets of children C j are found for internal vertices vj . The recursive   algorithm is applied for all child M-tuples pc ¼ cj1 ; …; cjM , where M is the number of internal vertices.   (Step 7) Pairings are replaced by alignments P C j1 ; …; C jM , which are sets of child M-tuples such that cj appears   in a number of sets equal to the weight wðeÞ; e ¼ vj ; cj A E Aj . n n (Step  n  8) The maximising alignment P is P such that SL P ¼ max SL  ðP Þ. The size of the concensus common subtree P is LðpÞ ¼ SL P n þ N ðN  1Þ=2 (note that the N ðN  1Þ=2 factor follows from the root supertree vertex contribution to Eq. (2), with N being the weight of the vertex).

4.2.8. Algorithm properties and running time We show that for a set of trees T 1 ; …; T N that the MCNS algorithm will return the maximum consensus score and supertree, with the condition that the supertree is itself a binary tree. First, we note that an alignment of subtrees where at most a single subtree has more than one vertex, the consensus score is a fixed value, and the smallest common supertree is isomorphic to the largest-size subtree. Next, an important feature of tree graphs is that any pair of subtrees with roots that are not an ancestordescendent pair can be aligned independently. In particular for root-to-root alignments, all subtrees with roots at a fixed depth are independent, so that for a set of trees, once an alignment of subtrees has been determined that maximises the consensus score, this holds for any alignment of other set of subtrees at that depth. Hence for an aligned set of subtrees, with two sets of matched children corresponding to the branches of the binary supertree, changing an internal alignment within one set of child subtrees does not affect the alignment or score of the other set of child subtrees. Finally, we note that the consensus score of the parent subtree alignment is given by a fixed value added to the sum of the child subtree alignments, as is made clear by Eq. (2) which shows that the consensus score of a graph is the addition of its subgraphs' scores. It follows the parent subtree alignment can be maximised by trying all possible sets of child groupings. The maximum consensus alignment for set of child subtrees can be

236

T.O. Lamberton et al. / Journal of Theoretical Biology 365 (2015) 226–237

determined recursively, with the base case of an alignment with a single internal vertex as described above. It is shown below that the following statements hold. Theorem. The MCS algorithm for trees T i ; i ¼ 1; 2; …; N, will run in Oð∏N i ¼ 1 ni Þ time, where ni ¼ jV ðAi Þj. Corollary. The SCS algorithm for trees T 1 ; T 2 will run in Oðn1 n2 Þ time, where ni ¼ jV ðAi Þj; 8 i ¼ 1; 2. Proof. Let nD be the number of vertices of depth D in j T j ; j : height   T j Z D. Letting MD be the number of trees T j with height T j Z D, the algorithm requires that all sets of MD vertices at D D each depth be aligned. There are exactly nD ways to j1 nj2 ; …; njM D align all vertices at depth D of the MD trees: This is the number of ways one can choose MD objects, one from a group of size nD j1 , one from a group of size nD j2 , and so on. For groups of vertices with at most one internal vertex, the MCS can be computed in constant time. For groups with more than one internal vertex, if the MC of all possible subtree alignments at depth D þ 1 are known, then determining the MC of all alignments at depth D can hence be D D found in Oð∏M i ¼ 1 nji Þ time as described by the algorithm. The maximum depth of a leaf of any maximum common subtree of a pair of trees T i ; T j will be given by     max min maxdepthðT i Þ; maxdepth T j , which will be the maxij imum depth of the second highest tree. By summing the mappings of vertices ðminðmaxdepthðT i Þ;   at depths D ¼ 0; 1; …; max ij maxdepth T j ÞÞ; the total running time will be 0 1 ! ! maxðminðmax depthðT i Þ; max depthðT j ÞÞÞ ij MD N B D C O ∏ n ji A o O ∏ n i ∑ @ D¼0

i¼1

i¼1

where we have used the fact that maxðminðmax depthðT i Þ; max depthðT j ÞÞÞ ij



D¼0

nD ar

max depthðT a Þ



D¼0

nD a ¼ na ; a ¼ 1; 2; …; N:

The corollary follows for the case N ¼ 2. 4.2.9. Algorithmic modifications Tiebreak functions: If for a set of subgraphs being aligned, multiple pairings to children evaluate to the same smallest common supergraph size (in step 8 of SCS and MCNS algorithms), then a tie is said to occur. In these cases, a tiebreak function can be used to choose the pairing. The following describes some possible options, with option 2 used in our code, which is provided in Supplementary material: Option 1 “random”: Breaks ties randomly. Option 2 “treedepth”: Chooses the alignment such that the treedepth of largest of the child consensus subtrees is smallest. This attempts to model a ‘synchronised growth pattern’ by finding the alignment that minimises the number of bifurcations between the two trees. Option 3 “subgraph”: Maps the children such that size of the largest child MCS is smallest. This is similarly an attempt to model a ‘synchronised growth pattern’ where the alignment that distributes differences in growth between two trees between as many branches as possible is found. Trifurcations: In some biological systems, some trifurcations may occur. In the case of the kidney ureteric tree, as noted above, these may be cases of a bifurcation followed by a rapid successive bifurcation (al-Awqati and Goldberg, 1998), which cannot be distinguished at the time of measurement. The SCS and MCS algorithms may be adapted to detect trifurcated branches (ldag vertices of out degree equal to 3). When this occurs, the algorithm

may be modified to iterate over each child vertex and return as the children of the trifurcated vertex, the child vertex, and the trifurcating parent vertex with the corresponding child edge weight reduced by 1. In effect this treats the trifurcated vertex subtree by attaching two of the children to a new child vertex of the original trifurcating parent. This approach can be thought of as simulating the remodelling of developing kidney's trifurcated branch points into successive bifurcations as development progresses. The process for this addition to the SCS algorithm in more detail is as follows: Let u A V ðAÞ be a trifurcated branch point, with children TC ¼ fu0 jðu; u0 Þ A EðAÞg. Define pruneðu; u0 Þ for some u' A TC as an operation that returns the vertex u as the root of a pruned subtree induced by u and its descendents but with the weight of edge ðu; u0 Þ reduced by 1 (and removed if becomes zero). When finding the pairing that results in the smallest common supergraph, the algorithm looks at pairings with the children of u being

C ¼ u0 ; pruneðu; u0 Þ ju0 A TC. For the MCS algorithm, trifurcating branch points are handled in a similar manner     as with the  SCS algorithm. Let vertices uj1 A V Aj1 ; uj2 A V Aj2 ; …; ujM A V AjM be M trifurcating branch   

 points with children TC j ¼ uj'j uj ; uj' A E' Aj . Try a pairing, for each jA j1 ; j2 ; …; jM , for each u A TC j ; with C j ¼ uj';   prune uj ; uj' g. This increases the total number of alignments to test by jTC j1 jnjTC j2 jn…njTC jM j. 4.2.10. Greedy consensus supertree While still polynomial for bound N, the MCNS algorithm is computationally intensive for large trees. We can define an approximate greedy algorithm that will run in Oðmax ðni ÞÞ time i where ni ¼ jV ðAi Þj, that recursively maximises an uppper bound function on the sum of pairwise common connected subgraph sizes, by pairing children such that one group contains the larger children (having at least as many descendants than their siblings) together, and the other contains the siblings. To see how this algorithm approximates the MCNS algorithm, first note that for a pair of trees T 1 and T 2 an upper bound on the connected subtree size is given by minðn1 ; n2 Þ. We use the following definitions:

 nðuÞ for the number of descendents of a vertex u (including u);      c1i ; c2i children of the root of tree T i , such that n c1i Zn c2i ; Let T s be the tree such that c2s has the least descendents of and let the other  tree be T l . The  two possible   pairing   of the children cji are thus c1s ; c1l ; c2s ; c2l and c1s ; c2l ; c2s ; c1l . The upper bound of the common connected subgraph size for T 1 ; T 2 with these pairings will be given by the sum of the upper bounds for the common connected subgraph of each of the aligned child subtree pairs plus one. We can reason about the upper bounds for the common connected subgraphs of the aligned child subtree pairs thus: the pair that contains the child with the least descendents c2s will have a common connected subgraph of at   most size n c2s . To maximise the upper bound of the common connected subgraph for the other pair (containing c1s ), the child of T l that has more descendents, c1l should be grouped with c1s , with a resulting upper bound for the common connected subgraph                sizeof min n c1s ; n c1l . As min n c1s ; n c1l Z min n c1s ; n c2l 1 2 if n cl Z n cl , this is the maximum upper bound for common connected subgraph of the child subtree pair. The upper bound of the common connected subgraph of T 1 ; T 2 with the maximal        alignment is min n c1s ; n c1l þ n c2s þ 1. For a set of trees T 1 ; …; T N , the sum of pairwise upper bounds becomes c21 ; c22 ,

  N 1 1 ∑ ∑ min ni ; nj ¼ ∑ ðN  iÞnki 2 i jai i¼1

T.O. Lamberton et al. / Journal of Theoretical Biology 365 (2015) 226–237

where nki r nki þ 1 ; i ¼ 1; …; N. This follows as the i-th smallest tree will be the smallest in N  i pairwise comparisons. As grouping children into siblings with more and less descendents maximises each pairwise upper bound independently, it follows that sum of pairwise upper bounds will be maximised. Generalising the algorithm for weighted trees that represent consensus trees can be done trivially by ignoring branch weights, with the assumption that the branches of the consensus tree represent tree branches that are aligned by descendent number, with the larger branches (more descendents) aligned to the larger consensus tree branch, and so forth. As there is a correspondence between the children of a tree vertex and the children of the corresponding ldag vertex, the alignment strategy also applies to ldags. To confirm that this algorithm runs in Oðmax ðni ÞÞ time, note i that the algorithm is called only once for each vertex of any ldag Ai until a leaf is encountered. As the number of descendents nðuÞ of  all subtrees T Au ; u A V ðT Þ of a tree of size n can be found in OðnÞ time, it can be easily seen that Oðmaxðni ÞÞ is the order of the i algorithm. It can be seen that this greedy algorithm is exact for identical trees where no branches are equally weighted. Under a constant growth model (i.e. branching occurring at a rate proportional to the number of tips), the child descendent number ratio nLi =nSi of a vertex will remain constant. Hence a greedy alignment of a tree that has been allowed to grow will align with a previous form of the tree, and should theoretically match well trees at different development levels. Acknowledgements The authors would like thank Professor Melissa Little and Dr Alexander Combes for interesting and useful discussions during the preparation of this paper. This work was supported by the Australian Research Council (DP10310086). I.M.S. holds a Future Fellowship from the Australian Research Council. Appendix A. Supplementary material Supplementary data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.jtbi.2014.10.001. References Affolter, M., Zeller, R., Caussinus, E., 2009. Tissue remodelling through branching morphogenesis. Nat. Rev. Mol. Cell Biol. 10, 831–842. http://dx.doi.org/10.1038/ nrm2797. Aho, A.V., Hopcroft, J.E., Ullman, J.D., 1974. The Design and Analysis of Computer Algorithms. Addison-Wesley Pub. Co., Reading, MA.

237

Akutsu, T., 2010. Tree edit distance problems: algorithms and applications to bioinformatics. IEICE Trans. 93-D, 208–218. al-Awqati, Q., Goldberg, M.R., 1998. Architectural patterns in branching morphogenesis in the kidney. Kidney. Int. 54, 1832–1842. http://dx.doi.org/10.1046/ j.1523-1755.1998.00196.x. Chung, M.J., 1987. O(n2.5) time algorithms for the subgraph homeomorphism problem on trees. J. Algorithms 8, 106–112. http://dx.doi.org/10.1016/0196-6774 (87)90030-7. Costantini, F., Kopan, R., 2010. Patterning a complex organ: branching morphogenesis and nephron segmentation in kidney development. Dev. Cell 18, 698–712. http://dx.doi.org/10.1016/j.devcel.2010.04.008. Cullen-McEwen, L.A., Fricout, G., Harper, I.S., Jeulin, D., Bertram, J.F., 2002. Quantitation of 3D ureteric branching morphogenesis in cultured embryonic mouse kidney. Int. J. Dev. Biol. 46, 1049–1055. Davies, J.A., Peter, H., Chang, C-Hong, Berry, Rachel, 2014. A self-avoidance mechanism in patterning of the urinary collecting duct tree. BMC Dev. Biol., 14. http://dx.doi.org/10.1186/s12861-014-0035-8. Dressler, G.R., 2006. The cellular basis of kidney development. Annu. Rev. Cell Dev. Biol. 22, 509–529. http://dx.doi.org/10.1146/annurev.cellbio.22.010305.104340. Dubiner, M., Galil, Z., Magen, E., 1994. Faster tree pattern matching. J. ACM 41; pp. 205–213. http://dx.doi.org/10.1145/174652.174653. Farach, M., Przytycka, T.M., Thorup, M., 1995. On the agreement of many trees. Inf. Process. Lett. 55, 297–301. http://dx.doi.org/10.1016/0020-0190(95)00110-X. Gupta, A., Nishimura, N., 1998. Finding the largest subtrees and smallest supertrees. Algorithmica 21, 183–210. http://dx.doi.org/10.1007/PL00009212. Jiang, T., Wang, L., Zhang, K., 1995. Alignment of trees — an alternative to tree edit. Theor. Comput. Sci. 143, 137–148. http://dx.doi.org/10.1016/0304-3975(95) 80029-9. Kilpeläinen, P., Mannila, H., 1995. Ordered and unordered tree inclusion. SIAM J. Comput. 24, 340–356. http://dx.doi.org/10.1137/S0097539791218202. Koch, I., 2001. Enumerating all connected maximal common subgraphs in two graphs. Theor. Comput. Sci. 250, 1–30. http://dx.doi.org/10.1016/s0304-3975 (00)00286-3. Kopan, R., Cheng, H.T., Surendran, K., 2007. Molecular insights into segmentation along the proximal-distal axis of the nephron. J. Am. Soc. Nephrol. 18, 2014–2020. http://dx.doi.org/10.1681/asn.2007040453. Lingas, A., 1983. An application of maximum bipartite c-matching to subtree isomorphism'. In: Ausiello, G., Protasi, M. (Eds.), CAAP'83, 159. Springer, Berlin Heidelberg, pp. 284–299. Matula, D.W., 1968. An Algorithm for Subtree Identification. SIAM Review 10, pp. 273–274. http://dx.doi.org/10.1137/1010054. McComb, T., Cairncross, O., Noske, A.B., Wood, D.L., Marsh, B.J., Ragan, M.A., 2009. Illoura: a software tool for analysis, visualization and semantic querying of cellular and other spatial biological data. Bioinformatics 25, 1208–1210. http: //dx.doi.org/10.1093/bioinformatics/btp125. Metzger, R.J., Klein, O.D., Martin, G.R., Krasnow, M.A., 2008. The branching programme of mouse lung development. Nature 453, 745–750. Quaggin, S.E., Kreidberg, J.A., 2008. Development of the renal glomerulus: good neighbors and good fences. Development 135, 609–620. http://dx.doi.org/ 10.1242/dev.001081. Shamir, R., Tsur, D., 1999. Faster subtree isomorphism. J. Algorithms 33, 267–280. http://dx.doi.org/10.1006/jagm.1999.1044. Short, K., Hodson, M., Smyth, I., 2013. Spatial mapping and quantification of developmental branching morphogenesis. Development 140, 471–478. http: //dx.doi.org/10.1242/dev.088500. Short, K.M., Hodson, M.J., Smyth, I.M., 2010. Tomographic quantification of branching morphogenesis and renal development. Kidney Int. 77, 1132–1139. http: //dx.doi.org/10.1038/ki.2010.42. Short, Kieran M., Combes, Alexander N., Lefevre, J., Ju, Adler L., Georgas, Kylie M., Lamberton, T., Cairncross, O., Rumballe, Bree A., McMahon, Andrew P., Hamilton, Nicholas A., Smyth, Ian M., Little, Melissa H., 2014. Global quantification of tissue dynamics in the developing mouse kidney. Dev. Cell, 29; , pp. 188–202. http://dx.doi.org/10.1016/j.devcel.2014.02.017. Watanabe, T., Costantini, F., 2004. Real-time analysis of ureteric bud branching morphogenesis in vitro. Dev. Biol. 271, 98–108. http://dx.doi.org/10.1016/j. ydbio.2004.03.025.

Comparing and distinguishing the structure of biological branching.

Bifurcating developmental branching morphogenesis gives rise to complex organs such as the lung and the ureteric tree of the kidney. However, a few qu...
2MB Sizes 0 Downloads 5 Views