# Mapping Quantitative Trait Loci onto a Phylogenetic Tree

Karl W. Broman^{*,1}, Sungjin Kim^{†,2}, Śaunak Sen^{‡}, Cécile Ané^{†,§} and Bret A. Payseur^{**}

^{*}Department of Biostatistics and Medical Informatics, ^{†}Department of Statistics, ^{§}Department of Botany, and ^{**}Laboratory of Genetics, University of Wisconsin, Madison, Wisconsin 53706, and ^{‡}Department of Epidemiology and Biostatistics, University of California, San Francisco, California 94107

^{1}Corresponding author: Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, 1300 University Ave., Room 4710 MSC, Madison, WI 53706. E-mail: kbroman{at}biostat.wisc.edu

## Abstract

Despite advances in genetic mapping of quantitative traits and in phylogenetic comparative approaches, these two perspectives are rarely combined. The joint consideration of multiple crosses among related taxa (whether species or strains) not only allows more precise mapping of the genetic loci (called quantitative trait loci, QTL) that contribute to important quantitative traits, but also offers the opportunity to identify the origin of a QTL allele on the phylogenetic tree that relates the taxa. We describe a formal method for combining multiple crosses to infer the location of a QTL on a tree. We further discuss experimental design issues for such endeavors, such as how many crosses are required and which sets of crosses are best. Finally, we explore the method’s performance in computer simulations, and we illustrate its use through application to a set of four mouse intercrosses among five inbred strains, with data on HDL cholesterol.

- quantitative trait loci (QTL)
- phylogenetic tree
- evolution
- multiple crosses
- combining crosses

The analysis of experimental crosses to identify the genetic loci (called quantitative trait loci, QTL) that contribute to variation in quantitative traits has become a standard approach in evolutionary biology. The properties of the QTL responsible for phenotypic differences between populations or species (including the number of QTL, their effect sizes, and their modes of action) provide insights into the mechanisms of evolution. QTL data have been brought to bear on a wide range of evolutionary processes, including adaptation (Doebley and Stec 1991; Bradshaw *et al.* 1998; Orr 1998; Mauricio 2001; Peichel *et al.* 2001; Rieseberg *et al.* 2002; Mitchell-Olds *et al.* 2007; Steiner *et al.* 2007; Hall *et al.* 2010) and speciation (Bradshaw *et al.* 1995; Moehring *et al.* 2006; Oka *et al.* 2007; Shaw *et al.* 2007; Moyle and Nakazato 2008; McDermott and Noor 2011; White *et al.* 2011).

By modeling the distribution of trait values across a tree, phylogenetic comparative methods also help to reconstruct the dynamics of phenotypic evolution. These approaches address several key issues, including the values of traits in ancestors (Schluter *et al.* 1997; Pagel 1999; Garland and Ives 2000; Pagel *et al.* 2004), rates of phenotypic evolution (Garland 1992; Venditti *et al.* 2011), the connection between trait evolution and speciation/extinction (Maddison *et al.* 2007; Fitzjohn *et al.* 2009), and the role of natural selection *vs.* genetic drift (Hansen 1997; Freckleton and Harvey 2006).

Despite the successful application of QTL mapping and phylogenetic comparative methods to fundamental questions in evolutionary biology, the two frameworks are rarely integrated. Methods that combine the portraits of genetic architecture obtained from QTL mapping with the logic of phylogenetic comparisons would offer several benefits. First, QTL data would provide a mechanistic basis for the dynamics of phenotypic evolution uncovered by phylogenetic comparative approaches. Although trait shifts along trees are caused by mutations, the methods for reconstructing these shifts do not currently incorporate genetic information.

Second, situating QTL data within a phylogenetic framework would directly account for the statistical dependencies that accompany any mapping comparison among three or more taxa. The tree connecting the species used in genetic mapping constrains the configurations of shared and divergent QTL that are possible, but this information is currently ignored by most QTL mapping methods.

Most importantly, a combined method could reveal the history of genetic differences between species. The mutations that underlie QTL occur along a phylogeny. Assigning these mutations to branches of the tree would pinpoint their evolutionary origins and allow testable predictions regarding the temporal accumulation of mutations (Moyle and Payseur 2009).

The ability to assign QTL to branches of phylogenetic trees would benefit genetic research beyond evolutionary biology. Collectively or individually, researchers often map QTL for the same phenotype in multiple sets of strains, especially in agricultural and biomedical model organisms. In addition to refining QTL position (Li *et al.* 2005), joint analysis of these crosses can pinpoint the genetic backgrounds (strains) on which QTL arose, providing further insights into the genetic architecture of traits involved in food quality or disease.

To envision the problem, consider the tree in Figure 1, and imagine the presence of a single diallelic QTL. The mutant allele at the QTL could have arisen in one of five possible locations on the tree, and each location is associated with a particular partition of the four taxa into two groups (those with the “high” allele and those with the “low” allele). For each such partition, the QTL will segregate in a different subset of the possible crosses between pairs of taxa. Throughout, we focus solely on unrooted trees. The two edges on either side of the root in Figure 1, labeled 5, cannot be distinguished. Also, a mutation arising above the root cannot be distinguished from the null model of no QTL.

With data on multiple crosses, the simplest approach to identifying the location on the tree at which a QTL arose is to compare the pattern of presence and absence of the QTL in the individual crosses and match that to the ideal (see the table in Figure 1). We describe a more formal approach, combining ideas from Li *et al.* (2005), regarding the joint analysis of multiple crosses, with ideas from MacDonald and Long (2007), regarding partitioning multiple QTL alleles into two groups.

We discuss experimental design issues for such endeavors, such as how many crosses are required and which sets of crosses are best, explore the method’s performance in computer simulations, and illustrate its use through application to a set of four mouse intercrosses among five inbred strains, with data on HDL cholesterol.

## Methods

To develop methods for mapping a QTL to a phylogenetic tree, we begin with several simplifying assumptions: The taxa are represented by inbred lines, the tree relating the taxa is known without error, the quantitative trait of interest is affected by a single diallelic QTL, and there are no background effects (*i.e.*, the effect of the QTL is the same in the different crosses in which it is segregating). We consider the case of intercrosses among pairs of taxa, consider only autosomal loci, and assume a common genetic map.

The basic idea, illustrated in Figure 1, is that each possible location for the origin of a diallelic QTL on the tree corresponds to a different partition of the taxa into two groups, with the two groups corresponding to the two QTL alleles. For different partitions, the QTL will segregate in different sets of crosses. In the case of very large crosses, with each having high power to detect the QTL, if present, we could simply consider the crosses individually and use the pattern of presence/absence of QTL to identify the correct partition of the taxa. Note that one does not need data on all possible crosses. For the case illustrated in Figure 1, with four taxa, it would be sufficient to consider the crosses A × B, A × C, and B × D, as with just these three crosses, the five possible partitions have distinct patterns of presence/absence of the QTL. In the following, we focus on partitions of the taxa into two groups, in place of locations of the QTL on the tree.
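The correspondence between partitions and presence/absence patterns can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are ours, and rather than assuming the topology of the tree in Figure 1 we enumerate every non-null partition of four taxa into two groups (the five tree-induced partitions are a subset of these).

```python
from itertools import combinations

def two_group_partitions(taxa):
    """All non-null partitions of the taxa into two groups.

    Each partition is labeled by the group containing the first taxon,
    so every partition appears exactly once.
    """
    first, rest = taxa[0], taxa[1:]
    parts = []
    for k in range(len(rest) + 1):
        for extra in combinations(rest, k):
            side = frozenset((first,) + extra)
            if len(side) < len(taxa):  # skip the null partition (all taxa together)
                parts.append(side)
    return parts

def segregation_pattern(side, crosses):
    """True for each cross whose two parents carry different QTL alleles."""
    return tuple((a in side) != (b in side) for a, b in crosses)

def patterns_are_distinct(taxa, crosses):
    """Do all non-null partitions give different presence/absence patterns?"""
    parts = two_group_partitions(taxa)
    patterns = {segregation_pattern(p, crosses) for p in parts}
    return len(patterns) == len(parts)

taxa = ["A", "B", "C", "D"]
good_crosses = [("A", "B"), ("A", "C"), ("B", "D")]
bad_crosses = [("A", "B"), ("A", "C"), ("B", "C")]  # taxon D is never crossed
```

With the three crosses A × B, A × C, and B × D, `patterns_are_distinct` reports that all seven partitions are distinguishable; with A × B, A × C, and B × C it does not, because the allele carried by D is never probed.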

Given limited resources and crosses of limited size, there will be incomplete power to detect the QTL in a given cross, and so the naive approach based on the presence or absence of the QTL in the different crosses will likely be misleading. A more formal approach, in which the likelihoods for the different possible partitions are evaluated and compared, will provide a clear assessment of the evidence for the different locations for the QTL on the tree.

Consider a particular location in the genome as the site of a putative QTL, and consider a particular partition of the taxa into two QTL alleles. We assume a linear model with normally distributed errors,

$$y_{ij} = \mu_i + \alpha\, a_{ij} + \delta\, d_{ij} + \varepsilon_{ij},$$

where $y_{ij}$ is the phenotype for individual $j$ in cross $i$, $\mu_i$ is the average phenotype in cross $i$, $\alpha$ and $\delta$ are the additive and dominance effects of the QTL, respectively, and the $\varepsilon_{ij}$ are independent and identically distributed normal(0, $\sigma^2$). The $a_{ij}$ and $d_{ij}$ denote encodings of the QTL genotypes, with $a_{ij} = d_{ij} = 0$ if the QTL is not segregating in cross $i$. For convenience, we call the two QTL alleles defined by the partition the high allele (H) and the low allele (L), although we do not actually constrain the high allele to increase the phenotype. If the QTL is segregating in cross $i$, then we take $a_{ij} = -1$, 0, or +1 if individual $j$ has QTL genotype $g_{ij}$ = LL, HL, or HH, respectively, and $d_{ij} = 1$ if individual $j$ has QTL genotype HL and $d_{ij} = 0$ otherwise.

For most putative QTL locations, the QTL genotypes are not observed, but we may calculate (*e.g.*, by a hidden Markov model) the conditional probabilities of the QTL genotypes given the available multipoint marker genotype data. We can then estimate $\mu_i$, $\alpha$, $\delta$, and $\sigma^2$, and calculate a LOD score for each pair ($\pi$, $\lambda$), where $\pi$ denotes the partition of the taxa and $\lambda$ denotes the location of the putative QTL. The LOD score is the log$_{10}$ likelihood ratio comparing the hypothesis of a single QTL at that location to the null hypothesis of no QTL but with the multiple crosses allowed to have separate phenotypic means, that is, $y_{ij} \sim \text{normal}(\mu_i, \sigma^2)$.

This analysis is just as in Li *et al.* (2005), in that one recodes the genotypes in the crosses in which the QTL is segregating, stacks them on top of one another, as if they were a single intercross, and performs interval mapping with cross indicators as additive covariates. The only difference is that we are considering all possible partitions of the taxa, while Li *et al.* (2005) assumed a particular one. There is one technicality: The crosses in which the QTL does not segregate also need to be included in the likelihood, and they contribute to the estimate of the residual variance.
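As a rough illustration of the stacked analysis, the sketch below computes a LOD score at a single putative QTL position for one partition. It is not the interval mapping procedure itself: we substitute Haley-Knott regression, which replaces the unobserved genotype codes by their expected values given the marker data, and the function name `hk_lod` and its arguments are our own. The genotype probabilities are assumed to have been computed already and recoded to low/high alleles for the partition under study.

```python
import numpy as np

def hk_lod(y, cross, geno_prob, segregating):
    """LOD score at one putative QTL position by Haley-Knott regression.

    y           : phenotypes, shape (n,)
    cross       : cross label for each individual, shape (n,)
    geno_prob   : P(LL), P(HL), P(HH) for each individual, shape (n, 3)
    segregating : dict mapping cross label -> True if the QTL segregates there
    """
    y = np.asarray(y, dtype=float)
    cross = np.asarray(cross)
    geno_prob = np.asarray(geno_prob, dtype=float)
    labels = np.unique(cross)

    # Cross indicators as additive covariates: one mean per cross
    # under both the null and the QTL model.
    X0 = (cross[:, None] == labels[None, :]).astype(float)

    # Expected additive (-1, 0, +1) and dominance (0, 1, 0) codes,
    # set to zero in crosses where the QTL does not segregate.
    seg = np.array([segregating[c.item()] for c in cross], dtype=float)
    a = seg * (geno_prob[:, 2] - geno_prob[:, 0])
    d = seg * geno_prob[:, 1]
    X1 = np.column_stack([X0, a, d])

    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return float(resid @ resid)

    # All individuals, including those in non-segregating crosses,
    # contribute to the residual sums of squares.
    return (len(y) / 2.0) * np.log10(rss(X0) / rss(X1))
```

Scanning the genome for a given partition amounts to calling such a function at each position, with `segregating` fixed by the partition and `geno_prob` varying along the chromosome.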

We thus consider each possible partition, *π*, one at a time, and scan the genome to obtain a set of LOD curves, one for each partition. For each partition *π*, we record the maximum LOD score on chromosome *i*.

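To evaluate the relative support of the different partitions, we use an approximate Bayes procedure. Assuming the presence of a single diallelic QTL on chromosome *i*, we assign equal prior probabilities to the different possible partitions, *π*, treat the profile log likelihoods (the maximum LOD scores for each partition on that chromosome) as if they were true log likelihoods, and so obtain approximate posterior probabilities for the partitions.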

We further use these approximate posterior probabilities to form a 95% Bayesian credible set of partitions. One could assign unequal prior probabilities to the partitions, for example, based on the branch lengths in the assumed phylogenetic tree, giving more weight to longer branches. One might also use a prior on partitions that assigns greater weight to partitions induced by the tree and lesser (but nonzero) weight to the other (possibly more numerous) partitions.
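A minimal sketch of this calculation follows. It rests on an assumption of ours about the normalization: because LOD scores are on the log10 scale relative to a common null model, we take the approximate posterior probability of a partition to be proportional to its prior weight times 10 raised to its maximum LOD score on the chromosome. The function names and the partition labels in the usage are hypothetical.

```python
import numpy as np

def partition_posterior(max_lod, prior=None):
    """Approximate posterior probabilities of partitions.

    max_lod : dict mapping partition label -> maximum LOD on the chromosome
    prior   : optional dict of prior weights (equal weights if omitted)

    Assumes posterior is proportional to prior * 10**LOD, i.e. the profile
    likelihood is treated as if it were a true likelihood.
    """
    names = list(max_lod)
    lod = np.array([max_lod[k] for k in names], dtype=float)
    if prior is None:
        w = np.ones(len(names))
    else:
        w = np.array([prior[k] for k in names], dtype=float)
    # Subtract the largest LOD before exponentiating, for numerical stability.
    unnorm = w * 10.0 ** (lod - lod.max())
    return dict(zip(names, unnorm / unnorm.sum()))

def credible_set(posterior, level=0.95):
    """Smallest set of partitions whose posterior probabilities sum to >= level."""
    ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
    chosen, total = [], 0.0
    for name, p in ranked:
        chosen.append(name)
        total += p
        if total >= level:
            break
    return chosen

# Hypothetical maximum LOD scores for three partitions on one chromosome.
example_post = partition_posterior({"P1": 5.0, "P2": 4.0, "P3": 1.0})
example_set = credible_set(example_post)
```

Passing a `prior` dictionary gives the unequal-weight variants discussed above, for example weights proportional to branch lengths.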

The 95% credible set of partitions is relevant only if there is sufficient evidence for a QTL on that chromosome. To evaluate the evidence for a QTL, we consider the maximum LOD score on chromosome *i*, across partitions, and derive a significance threshold, adjusting for the genome scan, by a stratified permutation test (Churchill and Doerge 1994). The permutation test is stratified in that we permute the phenotype data, relative to the genotype data, separately in each cross. For each permutation replicate, we calculate the LOD curve for each possible partition and then take the maximum LOD score across the genome and across partitions. The 95th percentile of these permutation results may be used as a significance threshold, or we may calculate a *P*-value that accounts for the search across partitions and across the genome.
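The stratified permutation scheme can be sketched as follows. The genome scan itself is abstracted into a caller-supplied function `max_stat` (a hypothetical name) that takes a phenotype vector and returns the maximum LOD score across the genome and across partitions; only the within-cross shuffling and the percentile calculation are shown.

```python
import numpy as np

def stratified_permutation_threshold(y, cross, max_stat, n_perm=1000,
                                     level=0.05, seed=None):
    """Significance threshold from a permutation test stratified by cross.

    y        : phenotypes, shape (n,)
    cross    : cross label for each individual, shape (n,)
    max_stat : callable mapping a phenotype vector to the maximum LOD score
               across the genome and across partitions (supplied by the caller)
    Returns the (1 - level) quantile of the permutation maxima and the maxima.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    cross = np.asarray(cross)
    null_max = np.empty(n_perm)
    for r in range(n_perm):
        y_perm = y.copy()
        # Shuffle phenotypes relative to genotypes, separately in each cross.
        for c in np.unique(cross):
            idx = np.flatnonzero(cross == c)
            y_perm[idx] = y[rng.permutation(idx)]
        null_max[r] = max_stat(y_perm)
    return float(np.quantile(null_max, 1.0 - level)), null_max
```

A genome-scan-adjusted *P*-value for an observed maximum LOD score can then be estimated as the proportion of entries in `null_max` that equal or exceed it.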

One may restrict the analyses to the set of partitions induced by the assumed phylogenetic tree, or one may consider all possible partitions of the taxa into two groups. For example, for the four-taxon tree in Figure 1, one may consider only the five partitions that correspond to QTL locations on the tree, as in the accompanying table, or one may also consider the two additional partitions,

## Theory

In this section, we address a theoretical question of considerable interest: Which subsets of crosses are sufficient to identify the location of a QTL on the phylogenetic tree? With very large crosses, we can exactly determine which crosses are segregating a QTL and which are not. As discussed in the Introduction, one need not perform all possible crosses. For example, for the case in Figure 1, if one performs only the crosses A × B, A × C, and A × D, the ideal results perfectly discriminate among the possible locations of the QTL on the tree. However, if one performs only the crosses A × B, A × C, and B × C, several of the possible partitions of strains exhibit the same pattern of presence/absence of QTL and so are confounded. Clearly, all taxa must be involved in the chosen crosses.

It is useful, in considering this problem, to represent a set of crosses by a graph, with nodes corresponding to taxa and edges indicating a cross between two taxa. For example, consider Figure 2. A phylogenetic tree relating six taxa is shown in Figure 2A. Three possible choices of a subset of five crosses among the six taxa are displayed in Figure 2, B–D.

A sufficient condition for identifying the true partition of the taxa is the use of a set of crosses that *connects* all of the taxa, as in Figure 2B. Choose an arbitrary taxon (*e.g.*, A) and assign it an arbitrary QTL allele. With sufficient numbers of individuals in each cross, we may determine whether the QTL is segregating in a cross, which indicates that the two taxa have different QTL alleles, or is not segregating, which indicates that the two taxa have the same QTL allele. Thus, one may move between taxa connected by a cross and assign QTL alleles; if the set of crosses connects all of the taxa, one can assign QTL alleles to all taxa and so identify the correct partition of taxa.
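The allele-assignment argument is a graph traversal, sketched below with a breadth-first search. The function name and the example cross results are hypothetical; the results are assumed to be ideal, that is, each cross is large enough that segregation is determined without error.

```python
from collections import deque

def assign_alleles(taxa, cross_results, start=None):
    """Propagate QTL alleles across crosses, given ideal presence/absence results.

    cross_results : dict mapping (taxon1, taxon2) -> True if the QTL segregates
                    in that cross, False otherwise
    Returns a dict taxon -> 0 or 1; taxa not connected to `start` are omitted.
    """
    neighbors = {t: [] for t in taxa}
    for (a, b), seg in cross_results.items():
        neighbors[a].append((b, seg))
        neighbors[b].append((a, seg))

    start = taxa[0] if start is None else start
    allele = {start: 0}  # arbitrary allele for the starting taxon
    queue = deque([start])
    while queue:
        t = queue.popleft()
        for other, seg in neighbors[t]:
            if other not in allele:
                # Segregating cross: the parents differ; otherwise they match.
                allele[other] = allele[t] ^ int(seg)
                queue.append(other)
    return allele

taxa = ["A", "B", "C", "D"]
connected = {("A", "B"): False, ("A", "C"): True, ("B", "D"): True}
disconnected = {("A", "B"): True, ("C", "D"): False}
```

With the connected set of crosses every taxon receives an allele, whereas with the disconnected set only A and B do, which is the confounding described next.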

If the set of crosses are not connected (as in Figure 2, C and D), then some partitions of taxa will be confounded. For example, for the crosses in Figure 2C, the partition

If one is considering all possible partitions of the taxa (and not just those induced by the tree), then graph connectivity is also a *necessary* condition for identifying the true partition: If the crosses do not connect all taxa there will always be some partitions that are confounded.

However, if one focuses solely on those partitions induced by the tree (that is, partitions that result from a split on an edge in the tree), then it is *not* necessary that the crosses connect all taxa. An example is shown in Figure 2D. For the pairs of partitions that are confounded with this choice of crosses, no more than one of each pair corresponds to a split on the tree in Figure 2A; each possible partition induced by the tree gives a distinct set of QTL results for these crosses. Moreover, in this case one may omit any one of the three crosses, B × C, B × E, C × E: Only four crosses are necessary to distinguish among the nine partitions induced by the tree in Figure 2A.

That the crosses connect all taxa is a necessary and sufficient criterion to distinguish among all possible partitions, but it is not a necessary condition to distinguish among the partitions induced by the tree. Note that a cross between two taxa corresponds to a path along the tree from one leaf to another. Further, the QTL will be segregating in crosses whose paths go through the edge with the QTL, but it will not be segregating in crosses whose paths do not go through that edge. A necessary and sufficient criterion for a set of crosses to distinguish the partitions induced by the tree (*i.e.*, to distinguish the possible locations of the QTL on the tree) is that each edge is covered by at least one cross and that no two edges appear only together.

If an edge were not covered by a cross, then a QTL on that edge could not be distinguished from the null model of no QTL. If two edges appear only together in crosses, then those two QTL locations cannot be distinguished. Thus, the criterion is *necessary*. For *sufficiency*, note that a cross in which the QTL is segregating will limit the possible QTL locations to the edges on the corresponding path through the tree. As every pair of edges along such a path will appear separately in different crosses, we see that the specific edge containing the QTL may be identified.
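The criterion can be checked mechanically. In the sketch below, each edge of the tree is represented by the set of taxa on one side of it, so that a cross covers an edge exactly when its two parents lie on opposite sides. The function name is ours, and the four-taxon topology used in the example, with A and B on one side of the internal edge, is an assumption made for illustration.

```python
def crosses_identify_tree_edges(splits, crosses):
    """Check whether a set of crosses distinguishes all edges of a tree.

    splits  : dict mapping an edge label -> set of taxa on one side of that edge
    crosses : list of (taxon1, taxon2) pairs
    """
    signatures = {}
    for edge, side in splits.items():
        # A cross covers an edge when its parents lie on opposite sides of it.
        signatures[edge] = tuple((a in side) != (b in side) for a, b in crosses)
    # Every edge must be covered by at least one cross ...
    covered = all(any(sig) for sig in signatures.values())
    # ... and no two edges may be covered by exactly the same crosses.
    separable = len(set(signatures.values())) == len(signatures)
    return covered and separable

# Assumed topology for illustration: four pendant edges and one internal edge.
splits = {"A": {"A"}, "B": {"B"}, "C": {"C"}, "D": {"D"}, "AB|CD": {"A", "B"}}
star_crosses = [("A", "B"), ("A", "C"), ("A", "D")]
paired_crosses = [("A", "B"), ("C", "D")]
```

Here the crosses A × B, A × C, and A × D satisfy the criterion, while A × B and C × D do not: the edges leading to A and to B appear only together, and the internal edge is not covered at all.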

For *n* taxa (with

As discussed in the previous section, we recommend that one not restrict oneself to the partitions induced by the tree but rather always consider all possible partitions, possibly with different prior weights. As a result, we recommend that one use, at a minimum, a set of crosses that connect all taxa. However, this is based on the assumption of a small number of taxa. If the number of taxa, *n*, is large, the total number of non-null partitions (2^{*n*−1} − 1) will vastly exceed the number of partitions induced by the tree (2*n* − 3), and so there is great potential advantage in focusing on the tree partitions.
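To give a sense of scale, the two counts can be tabulated directly from the formulas above; the helper names are ours.

```python
def n_all_partitions(n):
    """Number of non-null partitions of n taxa into two groups."""
    return 2 ** (n - 1) - 1

def n_tree_partitions(n):
    """Number of partitions induced by the tree, as given in the text (2n - 3)."""
    return 2 * n - 3

# Counts for a few values of n: (n, all partitions, tree-induced partitions).
counts = [(n, n_all_partitions(n), n_tree_partitions(n)) for n in (4, 5, 6, 10, 20)]
```

For four taxa the counts are 7 and 5, while for 20 taxa they are 524,287 and 37.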

Of course, in practice crosses are of finite size and so one cannot identify the true partition of the taxa without some degree of uncertainty. In the next section we explore, via computer simulation, the relative performance of the proposed method with different possible choices of crosses.

## Simulations

In this section, we investigate the performance of our approach via computer simulation. We begin by comparing our proposed method to the naive approach of considering the crosses individually and comparing the pattern of presence/absence of a QTL in the crosses to what is expected for different possible partitions. We then compare the performance of our approach with all possible crosses to different choices of a minimal set of crosses.

### Comparison to naive approach

We consider the case of four taxa and use of all six possible intercrosses among pairs of taxa, with 75 individuals per cross (a total sample size of 450). We consider a single autosome of length 127 cM, with markers at an approximately 10-cM spacing, and with a single diallelic QTL placed in the center of an interval between two markers, near the middle of the chromosome. The QTL alleles were assumed to act additively (that is, no dominance), and the percentage phenotypic variance explained by the QTL, in the crosses in which it was segregating, was 10%. We assumed either the partition

For the naive approach, we applied a given significance threshold and inferred the presence or absence of a QTL in a cross if the maximum LOD score on the chromosome was above or below the threshold, respectively. If the presence/absence pattern matched that for a possible partition, that partition was inferred.
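The naive approach can be written down in a few lines, which also makes its weakness visible: a single cross that misses the threshold produces a pattern that may match no partition at all. The function name, the LOD scores, and the threshold in the example are hypothetical.

```python
def naive_partition(max_lod_by_cross, threshold, candidate_partitions):
    """Infer a partition by thresholding each cross separately.

    max_lod_by_cross     : dict (taxon1, taxon2) -> maximum LOD in that cross
    candidate_partitions : dict label -> set of taxa carrying one of the alleles
    Returns the labels of the partitions whose expected presence/absence
    pattern matches the observed one (possibly an empty list).
    """
    crosses = list(max_lod_by_cross)
    observed = tuple(max_lod_by_cross[c] > threshold for c in crosses)
    matches = []
    for label, side in candidate_partitions.items():
        expected = tuple((a in side) != (b in side) for a, b in crosses)
        if expected == observed:
            matches.append(label)
    return matches

candidates = {
    "A|BCD": {"A"}, "B|ACD": {"B"}, "C|ABD": {"C"}, "D|ABC": {"D"},
    "AB|CD": {"A", "B"}, "AC|BD": {"A", "C"}, "AD|BC": {"A", "D"},
}
# Hypothetical maximum LOD scores in all six crosses among four taxa.
all_detected = {("A", "B"): 5.1, ("A", "C"): 4.2, ("A", "D"): 3.9,
                ("B", "C"): 0.8, ("B", "D"): 1.1, ("C", "D"): 0.5}
one_missed = dict(all_detected)
one_missed[("A", "D")] = 2.0  # the QTL is present but not detected in A x D
```

With a threshold of 3.0, `all_detected` yields the single partition A|BCD, while `one_missed` yields no partition, because the observed pattern is internally inconsistent.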

For the proposed approach, we applied a given significance threshold on

The results, based on 10,000 simulations, are displayed in Figure 3 as receiver operating characteristic (ROC) curves: the power (the rate of true positives) *vs.* the false positive rate, for varying significance thresholds. We display two sets of curves for the proposed method: For the dashed curves, the power indicates that

The ROC curves for the naive method form interesting shapes, with the lower part of each corresponding to low thresholds and the upper part corresponding to high thresholds, and indicate terrible performance: The false positive rate is well controlled, but power is low. The problem is that, with only moderate power to detect the QTL in a given cross, one has low power to detect the QTL in all of the crosses in which it is segregating, which is necessary to identify the correct partition of the taxa. Lowering the significance threshold below the 5% level helps somewhat, but the power to detect the true partition is no higher than 21%. The naive approach might actually perform better if one considered a smaller set of crosses, but we have not explored this further.

The proposed method performs reasonably well, and the false positive rate is well controlled at the nominal 5% significance threshold (the points in Figure 3). Lowering the threshold could give some improvement in power while maintaining the false-positive rate below the target level, at least in the simulated situations.

### All crosses *vs.* minimal crosses

In the previous section, we noted that it is not necessary to use all possible crosses among taxa. To distinguish among all possible partitions, one need only choose a set of crosses that connect all taxa. Sets of crosses that connect all taxa and are of minimal size (*i.e.*, *n* − 1 crosses for *n* taxa) are called minimal sets. We now turn to the question of whether it is better to use all crosses, with a smaller number of individuals per cross, or a minimal set of crosses, with a larger number of individuals per cross. We use the same general settings as for the simulations comparing the proposed method to the naive approach, with four taxa and the true partition being either

Figure 4 displays the simulation results, as a function of the effect of the QTL, for the case that the total sample size was 450 (*i.e.*, 75 individuals per cross when considering all crosses and 150 individuals per cross when considering a minimal set of three crosses) and when all possible partitions were considered. The results with other sample sizes and with analysis restricted to the five partitions induced by the tree in Figure 1 are shown in Figure S1, Figure S2, Figure S3, Figure S4, Figure S5, and Figure S6. The top of each figure indicates the power (the chance that

In choosing among the possible minimal sets of crosses, power is highest when a larger number of crosses are segregating the QTL. For a fixed total sample size, the use of all possible crosses (with fewer individuals per cross) has better performance than the worst of the possible minimal sets of crosses, but is not as good as the best of the possible minimal sets of crosses. The use of all possible crosses has greater power when the true partition is

The use of a total sample size of 300 or 600 gives qualitatively similar results (see Figure S1, Figure S2, Figure S3, Figure S4, Figure S5, and Figure S6; Figure S7 and Figure S8 contain the false negative rates), although we note that while a larger sample size results in a great improvement in power, it gives only a slight improvement in the chance that the credible set includes only the true partition.

Restricting the analysis to the five partitions induced by the tree has little effect on power (compare Figure S1 and Figure S2), but improves the chance that the credible set includes only the true partition (compare Figure S3 and Figure S4), and results in a somewhat lower false-positive rate (compare Figure S5 and Figure S6).

The performance of the proposed method with different possible choices of minimal crosses is largely predicted by the number of crosses that are segregating a QTL: The solid curves of a given color (which indicates the number of crosses segregating a QTL) are largely coincident, but there are some differences (red curves in Figure 4, middle right). To explore this further, the results for the individual choices of crosses, when the percentage phenotypic variance explained by the QTL is 10% and the total sample size is 450, are displayed in Figure 5. (For other sample sizes and for the analyses restricted to the partitions induced by the tree in Figure 1, see Figure S9, Figure S10, Figure S11, Figure S12, Figure S13, Figure S14, Figure S15, and Figure S16.)

In the case that the true partition is

To understand the difference, we need to consider the sign of the QTL effect in different crosses for the true partition and the best alternative partition; these are shown in Table 1. If the true partition is

While no such differences among the choices of minimal crosses are seen when the true partition is

## Application

To illustrate our approach, we consider the data from Li *et al.* (2005), originally reported in Lyons *et al.* (2003a,b,c) and Wittenburg *et al.* (2003, 2005) and available at the QTL Archive (http://www.qtlarchive.org). These data concern four intercrosses among five inbred mouse strains, CAST/Ei (C), DBA/2 (D), I/LnJ (I), PERA/Ei (P), and 129S1/SvImJ (S). The four intercrosses performed were C × D, C × S, D × P, and I × P. The C × D and C × S crosses consisted entirely of males and had 277 and 275 mice, respectively. The D × P and I × P crosses had approximately equal numbers of males and females and had a total of 282 and 322 mice, respectively. As in Li *et al.* (2005), we focus on a single phenotype, the square root of plasma HDL cholesterol. Note that the four intercrosses form a daisy chain, S × C × D × P × I, and so satisfy the connectedness condition necessary for inference of the correct partition of the strains at a diallelic QTL.

We used the genetic map from Cox *et al.* (2009), with marker locations obtained using the Mouse Map Converter at the Jackson Laboratory (http://cgd.jax.org/mousemapconverter). We used standard interval mapping (Lander and Botstein 1989) and considered all 15 possible partitions of the five strains, without attempting to infer a phylogenetic tree relating the strains. To handle the two sexes, we included sex as an additive covariate (that is, we allowed for a shift in the average phenotype between the sexes and assumed no QTL × sex interaction). We used permutation tests with 10,000 replicates to obtain 5% significance thresholds for the individual crosses and for

Following Li *et al.* (2005), we focused on chromosomes 1, 2, 4, 5, 6, and 11. The LOD curves for the individual crosses are displayed in Figure 6, left. The LOD curves for the top five partitions on each chromosome are in Figure 6, middle. The posterior probabilities of the different partitions, assuming the presence of a single diallelic QTL, are on the right. In all cases, the 95% credible set of partitions contains either two or three partitions.

For chromosome 1, significant evidence for a QTL is seen in the crosses C × S and D × P but not in C × D or I × P. By the naive approach, we would infer the partition *et al.* (2005) assumed. Our proposed method does give this partition the highest posterior probability (57%), but also gives reasonable weight to the alternative

For chromosome 2, we see a QTL just in cross C × D. By the naive approach (given the set of crosses performed), we would infer the partition *et al.* (2005) assumed. However, by the proposed method,

For chromosome 4, we have evidence for a QTL in all four crosses (although in the cross I × P, the maximum LOD score was 3.42, just missing the threshold of 3.44). If we assume that there is no QTL segregating in I × P, we would infer the partition *et al.* (2005) assumed. The latter is the partition with the highest posterior probability (78%), while the former has posterior probability 7%, and a third partition,

For chromosome 5, we see a QTL only in cross I × P, and so by the naive approach we would infer the partition *et al.* (2005) assumed. But the maximum LOD score for this partition was 3.98, which doesn’t meet the 5% significance threshold. (The genome-scan-adjusted *P*-value was 0.37.) Thus, by our proposed approach, we would not infer the presence of a QTL. But if we do allow that there is a QTL, two other partitions are contained within the 95% credible set:

For chromosome 6, we have significant evidence for a QTL only in cross C × D (the other three crosses have maximum LOD scores of 1.5–1.9 on chromosome 6), and so the naive method would give the partition CS|DIP, which has posterior probability <0.01% and is not contained in the 95% credible set. The partitions with highest posterior are *et al.* (2005) had assumed the partition

For chromosome 11, there was significant evidence for a QTL only in the cross I × P, although the cross D × P has a maximum LOD score of 3.16 (corresponding to a genome-scan-adjusted *P*-value of 0.093). The naive approach would give the partition *et al.* (2005) assumed. The partition with highest posterior probability is *P*-value was 0.14.)

## Discussion

We have described a formal approach for the joint analysis of multiple crosses to map the origin of QTL alleles to a position on a phylogenetic tree. Our approach unites QTL mapping with phylogenetic comparative methods to provide a view of the genetic mechanism underlying phenotypic evolution. Further, our approach partitions taxa according to their QTL allele, facilitating haplotype analyses for the fine mapping of QTL. In addition, as part of this work, we have begun to evaluate a variety of experimental design issues for such research, which provides some guidance to researchers seeking to take advantage of this approach.

The goal of the work in Li *et al.* (2005) was to combine multiple related crosses to more precisely map QTL. The key difficulty in applying this idea is that one must define a unique partition of the strains into the two QTL alleles, *a priori*. In the presence of multiple QTL, the phenotypes of the strains cannot be trusted for inferring the QTL alleles, and in the current application, the six QTL partition the five strains in diverse ways. Li *et al.* (2005) used the pattern of QTL in the different crosses to infer the appropriate partition, which we have (perhaps overly harshly) characterized as the naive approach. We have proposed a formal method for comparing the different possible partitions. For two of the six loci, we find that the partition with strongest support is different from that assumed by Li *et al.* (2005), and for all six loci there are multiple partitions with reasonable support.

Our approach thus provides an important improvement on the method of Li *et al.* (2005). As seen in Figure 6, middle, the different partitions can have quite different LOD curves and so provide different information on the likely location of the QTL. Thus, our formal approach to identifying the well-supported partitions can improve localization of a QTL. Moreover, one could combine the information from the multiple partitions to better define the location of the QTL, taking account of the uncertainty in the partition.

Furthermore, while the application of these ideas to evolutionary studies remains our primary interest, the more straightforward application is in biomedical or agricultural research, as in Li *et al.* (2005), for the combined use of multiple crosses to more precisely map a QTL and, subsequently, with an inferred partition (or partitions) of strains in hand, to inform the analysis of the haplotypes of the strains (see, for example, Burgess-Herbert *et al.* 2008) in the search for the underlying causal polymorphism. The results are also valuable for the design of future experiments, if additional crosses are to be performed.

Our approach has some similarities to the use of local phylogenetic trees to define possible partitions of multiple alleles (Pan *et al.* 2009; Zhang *et al.* 2012) and to coalescent-based approaches (Zöllner and Pritchard 2005) for genome-wide association studies. The key distinction of our method is that we seek not just to establish association but also to identify the appropriate partition and so define the origin of the mutant QTL allele on the local phylogenetic tree. In our approach, the QTL location on the tree is not a nuisance parameter but rather is the target of inference.

In our simulation studies, we compared the use, for a fixed total sample size, of all possible crosses to different choices of a minimal set of crosses. Depending on the underlying true partition of taxa at a QTL, one can choose a minimal set of crosses with considerably higher power. However, given the prior uncertainty in the true partition, and the possibility of multiple QTL that each partition the taxa differently, it is prudent to consider all or at least a larger number of possible crosses. An even more important experimental design question, which we have not considered here, is how to choose which taxa, out of a large number of related taxa, to consider, in the effort to characterize the genetic architecture of a quantitative trait.

We have focused on a set of intercrosses. The approach could be adapted for the analysis of a set of backcrosses, although these would likely need to be of a special form, with the F_{1} hybrids all crossed to a common parent.

There are a number of additional ways in which our analytical framework could be extended. Most quantitative traits are affected by multiple QTL, rather than single QTL as assumed here. The restriction that a QTL has a common effect in all crosses in which it segregates might be relaxed, particularly for traits that are heavily shaped by epistasis, such as hybrid sterility and hybrid inviability (Coyne and Orr 2004). Prior distributions of QTL partitions could incorporate phylogenetic branch lengths (taxa separated by shorter evolutionary distances are more likely to share QTL alleles) as well as topologies. Finally, future developments might account for variation in the tree. This variation includes both statistical uncertainty associated with phylogenetic inference and real phylogenetic discordance across the genome, which results from incomplete lineage sorting and introgression in recently diverged taxa (Pamilo and Nei 1988; Maddison 1997; Pollard *et al.* 2006; White *et al.* 2009). The power of reconstructing QTL evolution as well as the increasing capacity for genetic mapping of complex traits and phylogenetic reconstruction should provide motivation for these extensions in the evolutionary, biomedical, and agricultural communities.

Software incorporating the proposed methods is available as part of R/qtl (Broman *et al.* 2003, http://www.rqtl.org), an add-on package to the general statistical software R (R Development Core Team 2010).

## Acknowledgments

The authors thank Beverly Paigen for making her data publicly available, Gary Churchill for developing the QTL Archive, from which the data were drawn, and an anonymous reviewer for identifying a gap in the Theory section and encouraging its closure. This work was supported in part by National Institutes of Health grant GM074244 (to K.W.B.) and by National Science Foundation grant DEB 0918000 (to B.A.P.).

## Appendix

In this section, we demonstrate that the minimal number of crosses to distinguish among the *n* taxa is ⌈2*n*/3⌉. Recall, from the *Theory* section, that such a set of crosses must cover each edge in the tree, and no two edges can appear only together in crosses.

First note that such a minimal set of crosses defines a graph on the set of taxa. The connected components in this graph must each contain three or more taxa. If a component has just one taxon, then it is not covered by any cross. If a component has just two taxa, say A and B, then the edges in the tree leading to A and B will only appear together (in the cross A × B) and so cannot be distinguished. Thus, a minimal set of crosses must divide the taxa into groups of at least three, and so the minimal number of crosses is ≥2*n*/3, since every three taxa will need two crosses to connect them.
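The bound can be written out explicitly. As a sketch, in notation that is ours rather than the original text's, suppose the graph of crosses has *k* connected components containing *n*_{1}, …, *n*_{k} taxa, each with at least three taxa, and let *c* be the number of crosses. A connected component on *n*_{j} taxa requires at least *n*_{j} − 1 crosses, and *k* ≤ *n*/3 because every component has at least three taxa:

```latex
c \;\ge\; \sum_{j=1}^{k} (n_j - 1) \;=\; n - k \;\ge\; n - \frac{n}{3} \;=\; \frac{2n}{3},
\qquad\text{and so}\qquad c \;\ge\; \left\lceil \frac{2n}{3} \right\rceil
\quad\text{since } c \text{ is an integer.}
```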

We now describe an algorithm to construct a minimal set of crosses, through which we show that the tree partitions can be distinguished with ⌈2*n*/3⌉ crosses. If there are *n* < 6 taxa, form any set of *n* − 1 crosses that connect all taxa. Otherwise, we pull out three taxa at a time, to form nonoverlapping subtrees, *T*_{1}, *T*_{2}, …, *T _{p}*, with |*T _{i}*| = 3 for *i* < *p* and |*T _{p}*| = 3, 4, or 5.

Define a *cherry* to be a pair of taxa that form a clade, and let *T* denote the full tree. Any unrooted tree contains at least two cherries. If there are exactly two cherries, then pick one taxon from each cherry and a third arbitrarily from among taxa not part of a cherry. If there are more than two cherries, then choose one taxon each from three different cherries. The three taxa chosen form *T*_{1}. The same selection is then repeated on the subtree formed by the remaining *n* − 3 taxa, stopping once fewer than six taxa remain.

As an illustration, consider Figure S17; the top depicts a tree with nine taxa, and with four cherries. We first pick three taxa, one each from three different cherries: A, C, and F. We then consider the subtree of the remaining six taxa, displayed in Figure S17, bottom. We now pick one taxon from each of the two cherries and one of the two remaining taxa: B, E, and H. There are three taxa remaining (D, G, and I), and so we stop. We may then pick any two crosses among each group of three taxa, say A × C, C × F, B × E, E × H, D × G, and G × I. These six crosses are sufficient to distinguish all 15 partitions induced by the tree of nine taxa.
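The two conditions on a set of crosses (every edge covered, and no two edges appearing only together) can be checked mechanically. The following is a minimal sketch, not the authors' software. Because the topology of Figure S17 is not reproduced in the text, the tree below is a hypothetical one chosen only to be consistent with the description (nine taxa, four cherries, and the same choices of taxa at each step). The function names and the nested-tuple representation are likewise assumptions made for illustration.

```python
def leaves(node):
    """Return the set of taxon labels below a nested-tuple node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def splits(tree):
    """Return one side of the bipartition induced by each edge.

    The tree is a nested tuple written with three subtrees at the top
    level, so that each subtree below the top corresponds to one edge
    of the unrooted tree.
    """
    found = set()

    def visit(node, is_top):
        if not is_top:
            found.add(leaves(node))
        if not isinstance(node, str):
            for child in node:
                visit(child, False)

    visit(tree, True)
    return found


def coverage(split, crosses):
    """Crosses whose two parents lie on opposite sides of the edge."""
    return frozenset(c for c in crosses if (c[0] in split) != (c[1] in split))


def distinguishes(tree, crosses):
    """True if every edge is covered and no two edges share a pattern."""
    patterns = [coverage(s, crosses) for s in splits(tree)]
    return all(patterns) and len(set(patterns)) == len(patterns)


# Hypothetical nine-taxon tree with cherries (A,B), (C,D), (F,G), (E,I).
tree = ((("A", "B"), ("C", "D")), ("F", "G"), ("H", ("E", "I")))
crosses = [("A", "C"), ("C", "F"), ("B", "E"),
           ("E", "H"), ("D", "G"), ("G", "I")]

print(len(splits(tree)))                  # 15 edges, one partition each
print(distinguishes(tree, crosses))       # True
print(distinguishes(tree, crosses[:-1]))  # False: taxon I is uncovered
```

On this assumed tree, the six crosses named in the text give each of the 15 edges a distinct, nonempty set of crosses in which it segregates; dropping any one cross leaves five, below the bound of six for nine taxa.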

We turn to the proof. Again, let *T* denote the full tree, let *T*_{1} denote the subtree formed by the first three taxa chosen by the algorithm above, and let *n* − 3 taxa. Note that any internal edge in *T* is covered by *T*_{1} that form a cherry. Further, any internal edge in *T* is either represented in both *T*_{1} and *f* is covered by *T*_{1}. The other five internal edges are each covered by both *T*_{1} and *T* that is collapsed with another branch in forming *T*_{1}: Any internal edge that gets collapsed with another edge in forming *T*_{1}.

Consider any pair of crosses that cover *T*_{1} and any set of crosses that are sufficient to distinguish the edges in *T*_{1} or *T* will appear only together in these crosses. There are four classes of edges to consider: the external edges that become part of *T*_{1}, the external edges that become part of *T*_{1} and

Now consider two distinct edges from *T*. If one is an external edge in *T*_{1}, then the other is either another external edge in *T*_{1}, in which case they are distinguished by the crosses within *T*_{1}, or it is represented in *T*_{1}, in which case there is a cross in *T*_{1} containing the latter but not the former, or it is a distinct external or internal edge in *T*_{1} and *T*_{1} or *T*_{1} or

Finally, we consider the recursion: If all edges in *T* are distinguished by *T*_{1} and *T* are identified by

At the risk of sounding pedantic, note that our algorithm partitions *T* into subtrees *T*_{1}, *T*_{2}, …, *T _{p}*, with |*T _{i}*| = 3 for *i* < *p* and |*T _{p}*| = 3, 4, or 5. Since subtrees with three, four, and five taxa will be covered by two, three, and four crosses, respectively, we see that for any tree with *n* taxa, there is a minimal set of ⌈2*n*/3⌉ crosses.
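The counting step can be checked numerically. The sketch below assumes only what the appendix states: groups of three taxa are pulled out while at least six taxa remain, and a group of *m* taxa is connected by *m* − 1 crosses. The function names are ours, chosen for illustration.

```python
def subtree_sizes(n):
    """Sizes of the groups formed by pulling out three taxa at a time."""
    if n < 3:
        raise ValueError("need at least three taxa")
    sizes = []
    while n >= 6:
        sizes.append(3)
        n -= 3
    sizes.append(n)  # the final group has three, four, or five taxa
    return sizes


def n_crosses(n):
    """Total crosses when each group of m taxa uses m - 1 crosses."""
    return sum(m - 1 for m in subtree_sizes(n))


for n in (5, 9, 10):
    # (2n + 2) // 3 is the integer form of ceil(2n / 3)
    print(n, subtree_sizes(n), n_crosses(n), (2 * n + 2) // 3)
```

For *n* = 9 this gives three groups of three taxa and six crosses, matching the illustration with Figure S17.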

## Footnotes

*Communicating editor: K. M. Nichols*

- Received May 29, 2012.
- Accepted June 28, 2012.

- Copyright © 2012 by the Genetics Society of America

Available freely online through the author-supported open access option.