Fast Dating Using Least-Squares Criteria and Algorithms
Our algorithms apply to serial data, where the tips of the tree have been sampled through times. They estimate the substitution rate and the dates of all ancestral nodes. When the input tree is unrooted, they can provide an estimate for the root position, thus representing a new, practical alternative to the standard rooting methods e.
Our algorithms exploit the tree (recursive) structure of the problem at hand, and the close relationships between least-squares and linear algebra. We distinguish between an unconstrained setting and the case where the temporal precedence constraint i.
With rooted trees, the former is solved using linear algebra in linear computing time i. With unrooted trees the computing time becomes nearly quadratic i. Using simulated data, we show that their estimation accuracy is similar to that of the most sophisticated methods, while their computing time is much faster. We apply these algorithms on a large data set comprising strains of Influenza virus from the pdm09 H1N1 Human pandemic.
Again the results show that these algorithms provide a very fast alternative with results similar to those of other computer programs. The explosion of genetic data and progress in phylogenetic reconstruction algorithms has resulted in increasing utility and popularity of phylogenetic analyses. Data sets with thousands of taxa are becoming more and more common, especially amongst virus evolution studies.
Moreover, a number of studies have used molecular-dating techniques to tackle a wide range of biological questions, for example, in systematics for timing the tree of life Hedges and Kumar ; Jetz et al. Currently, the most popular dating approaches are based on sophisticated probabilistic models, most often implemented in the Bayesian framework and able to account for complex priors Thorne and Kishino ; Rannala and Yang ; Drummond and Rambaut ; Guindon et al.
Maximum-likelihood methods have also been designed to deal with simpler models (Rambaut). Corresponding computer programs take a sequence alignment and a set of known dates as input and return a time-scaled tree, with estimates of the substitution rate(s) and of the dates of all tree nodes. Some programs e. These programs typically contain several submodels, which describe the substitution process e.
We distinguish the strict molecular clock (SMC) model, where the substitution rate is assumed to be constant across all tree branches, and uncorrelated and correlated relaxed-clock models. With uncorrelated models, the rate associated with each branch is drawn independently from a common underlying distribution; these models are commonly used with fast-evolving species over short time periods, typically with viruses for which there is no strong evidence of rate correlation among branches Drummond et al.
With correlated (also called autocorrelated) models, the rate distribution for a particular branch depends on the rate value of the neighboring branches; the use of correlated models seems to be the preferred choice with large groups of slowly evolving species, for example mammals, where it has been demonstrated that some subgroups evolve faster than others e.
However, the advantages and limitations of this large variety of models are still a matter of debate Drummond et al. All these models and methods have proven useful in a number of studies, but they are computationally intensive, making it virtually impossible to deal with the larger data sets available today, even when using sophisticated implementations and powerful computers Ayres et al.
Typically, days of computations are required to analyze a few hundred taxa, although faster approaches are available, using complex algorithmic approaches Akerborg et al. Here we are interested in dating very large phylogenies, typically with a thousand tips or more, a need that is becoming increasingly common, for example, in molecular epidemiology. We propose distance-based algorithms to estimate rates and dates, a mathematical and computational framework that has proven to produce fast and fairly accurate tools in phylogenetics e.
Several distance-based (as opposed to sequence-based, see above) dating methods have already been proposed. Most of these methods deal with time calibration points, where the dates of certain ancestral nodes in the tree are known, possibly with uncertainty e.
These methods input a rooted tree with time calibration points, and return a time-scaled, ultrametric tree.
PATHd8 Britton et al. Xia and Yang's method assumes a SMC or two different local clocks, and achieves least-squares estimations under these assumptions. Sanderson's approach is based on a penalized-likelihood criterion to account for the autocorrelation of rates, combined with standard optimization techniques (see also TreePL, Smith and O'Meara). Based on computer simulations, these fast methods were shown to be accurate by their authors, producing time-scaled trees similar to those obtained using sequence-based approaches.
The focus of the present study is on serial phylogenies, where the tips of the tree have been sampled through times. Such phylogenies are common with fast-evolving organisms e. Serial phylogenies are also used with ancient DNA Lambert et al. Moreover, close relationships exist between the calibration-points and dated-tips approaches Ronquist et al.
Several methods have been proposed in this framework.
One of the very first is root-to-tip regression (RTT; Shankarappa et al.). This method is very fast and can be extended to unrooted trees by searching among all tree branches for the best root position, according to some numerical criterion e. However, this method does not provide estimates for the dates of internal nodes, and thus does not output time-scaled trees. To obtain date estimates of the internal nodes, sUPGMA (Drummond and Rodrigo) first uses a regression method to estimate the substitution rate, then corrects the non-contemporaneous tips into contemporaneous tips, and finally uses UPGMA (Sokal and Michener) to compute the tree.
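As a rough illustration of root-to-tip regression on a rooted tree, the sketch below regresses root-to-tip distances on sampling dates with ordinary least squares. The slope estimates the substitution rate and the point where the fitted line reaches zero distance estimates the root date. The function name `root_to_tip_regression` and the toy numbers are assumptions made for this example; they do not come from any published implementation.

```python
import numpy as np

def root_to_tip_regression(dates, distances):
    """Fit distance = rate * (date - root_date) by ordinary least squares."""
    dates = np.asarray(dates, dtype=float)
    distances = np.asarray(distances, dtype=float)
    # np.polyfit with deg=1 returns (slope, intercept).
    rate, intercept = np.polyfit(dates, distances, 1)
    # The fitted line crosses zero distance at the estimated root date.
    root_date = -intercept / rate
    return rate, root_date

# Hypothetical tips: sampling dates (years) and root-to-tip distances
# (substitutions per site), generated with rate 0.01 and root date 0.
tip_dates = [8.0, 10.0, 12.0]
tip_distances = [0.08, 0.10, 0.12]
rate, root_date = root_to_tip_regression(tip_dates, tip_distances)
```

Note that the regression yields only the rate and the root date, which matches the limitation stated above: no internal node dates are produced.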
Unlike the former approaches, Langley and Fitch's (LF) method uses an explicit model. The LF method assumes a SMC with a constant substitution rate, and models the number of substitutions along each branch of the tree by a Poisson distribution.
The estimates of the global substitution rate and of the internal node dates are then obtained by maximizing the likelihood of the input, rooted tree. LF is implemented in r8s (Sanderson). In this article, we study a model analogous to LF's, but using a normal approximation that allows for a least-squares approach, and show that this model is robust to uncorrelated violations of the molecular clock.
Using the tree (recursive) structure of the problem at hand, and the close relationships between least-squares and linear algebra, we propose very fast algorithms to estimate the substitution rate and the dates of all internal tree nodes. With rooted trees, the time complexity is nearly linear i.
The article is organized as follows: we first define the model and show its ability to handle uncorrelated rate variations among tree branches, as is commonly assumed with virus data. We then present our two main algorithms, distinguishing the unconstrained setting and the case where the temporal precedence constraints i.
Last, we compare these algorithms to standard approaches using simulated data and a large influenza data set. Our algorithms take as input a binary phylogenetic tree with branch lengths, inferred by any tree building program, and sampling dates associated with the taxa. As our algorithms are very fast, it is consistent to combine them with fast tree-building methods, for example distance-based methods e.
However, we shall see that results obtained with both approaches are close. The algorithms accept a rooted or unrooted tree, and for unrooted trees we propose a method to estimate the root position, though simulations show that the use of an outgroup is generally preferable. Given a set of n serially dated sequences, let R be the input rooted binary phylogenetic tree on these sequences with known branch lengths.
Node 1 corresponds to the root. The date of node i is denoted by t_i. For every internal node i, let s_1(i) and s_2(i) be the two direct descendants of i. With a SMC, the substitution rate i. We use a Gaussian model, which is closely related to that proposed by Langley and Fitch. Due to sampling noise and estimation errors, the branch length estimate b_i available in input tree R can be expressed as:.
We thus assume:. A limitation of this model is that short branches may be negative according to Equation 1, but we impose positivity using temporal precedence constraints (see below). As evolution is independent from one branch to another, we consistently assume that the noise terms are mutually independent.
The weighted least-squares (WLS) criterion to be minimized (proportional to the log-likelihood assuming this model) is given by:. Fitch and Margoliash's tree inference method uses the square of the pairwise evolutionary distance estimate. We use here another standard approach (for discussion, see Gascuel) derived from the Poisson nature of the substitution process, where. However, the limitation of such variance estimates is that overconfidence is given to very short branches, while their short length may be due to sampling randomness or estimation errors.
To avoid this problem, we use the following additive smoothing for the variance estimates:. The higher c is, the closer we are to equal variances, that is, ordinary least squares (OLS).
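The equations themselves did not survive in this copy of the text, so the following is only a sketch under stated assumptions. It assumes that each residual is the branch length minus the rate times the time elapsed along the branch, and that the smoothed variance has the form (b_i + c/s)/s, with s the sequence length. Both forms, and the names `smoothed_variances` and `wls_criterion`, are assumptions of this example. The point is to show how a larger c flattens the variances toward the OLS case.

```python
import numpy as np

def smoothed_variances(branch_lengths, seq_length, c):
    """Assumed form: var_i = (b_i + c / s) / s, with s the sequence length."""
    b = np.asarray(branch_lengths, dtype=float)
    return (b + c / seq_length) / seq_length

def wls_criterion(rate, child_dates, parent_dates, branch_lengths, variances):
    """Sum over branches of (b_i - rate * (t_child - t_parent))^2 / var_i."""
    elapsed = np.asarray(child_dates, float) - np.asarray(parent_dates, float)
    residuals = np.asarray(branch_lengths, float) - rate * elapsed
    return float(np.sum(residuals ** 2 / np.asarray(variances, float)))

# Hypothetical branches: one very short, one long.
b = [0.001, 0.1]
var_small_c = smoothed_variances(b, seq_length=1000, c=0.0)
var_large_c = smoothed_variances(b, seq_length=1000, c=1000.0)
# Ratio of the largest to the smallest variance: close to 1 means near-OLS.
ratio_small_c = var_small_c.max() / var_small_c.min()
ratio_large_c = var_large_c.max() / var_large_c.min()
```

With c = 0 the short branch receives one hundred times the weight of the long one in this toy case, whereas with a large c the two weights are nearly equal.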
This model accommodates some violations of the molecular clock. Assume a simple model similar to Drummond et al. In other words, b_i follows a normal distribution having a similar form as Equation 1, but the error term incorporates an additional factor i.
Moreover, the variance term is an increasing function of b_i, as in Equation 3, meaning that using our algorithms with uncorrelated violations of the molecular clock is still well founded. To summarize, our model Eq. This corresponds to the default option in several programs e. We certainly do not pretend that this model depicts all the complexity of sequence evolution, but it makes very efficient calculations possible with little loss in terms of estimation accuracy, as described later.
This is an obvious requirement, analogous to the positivity of branch lengths in phylogenetic trees. However, not all dating methods comply with this requirement e. The reasons for this are mostly computational. Imposing positivity constraints has a computational cost, as we shall see below in our dating context.
This function is a convex quadratic form O'Meara and has a unique minimum (see proof in the Online Appendix). Therefore, Equation 2 also has a unique minimum. We propose two different algorithms. One takes into account the temporal precedence constraints, while the other does not.
We present the weighted versions in the following, as the unweighted versions are simply obtained by fixing the w_i to 1. The objective function Eq.
The latter equation is equivalent to the following system of equations:. The resolution of Equations 5 can be achieved in linear time i. The technical details of the LD algorithm are given in the Online Appendix.
The main idea is to simplify progressively this system Eq. After the first, bottom-up set of replacements, we have. This algorithm can be extended to non-binary trees. However, nothing guarantees that the date estimates satisfy the temporal precedence constraints.
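To make the unconstrained problem concrete, here is a minimal sketch that solves it with a dense least-squares call. This is deliberately not the linear-time LD algorithm, whose details are in the Online Appendix. It relies on one observation: if the unknowns are taken to be the rate w and the products u_j = w * t_j for internal nodes, every residual becomes linear in the unknowns. The function names, the node encoding and the toy tree are all assumptions of this example.

```python
import numpy as np

def unconstrained_dates(branches, tip_dates, n_internal, weights=None):
    """Dense least-squares solve; NOT the linear-time LD algorithm.

    branches: list of (parent, child, length). Internal nodes are the
    integers 0..n_internal-1 (0 is the root); tips are keys of tip_dates.
    Unknowns are the rate w and u_j = w * t_j for each internal node j,
    which makes every residual b - (w * t_child - w * t_parent) linear.
    """
    if weights is None:
        weights = np.ones(len(branches))
    A = np.zeros((len(branches), 1 + n_internal))
    y = np.zeros(len(branches))
    for row, (parent, child, length) in enumerate(branches):
        scale = np.sqrt(weights[row])
        A[row, 1 + parent] = -scale                # -u_parent
        if child in tip_dates:
            A[row, 0] = scale * tip_dates[child]   # w * t_child (known date)
        else:
            A[row, 1 + child] = scale              # +u_child
        y[row] = scale * length
    solution, *_ = np.linalg.lstsq(A, y, rcond=None)
    rate = solution[0]
    internal_dates = solution[1:] / rate           # back-transform u_j -> t_j
    return rate, internal_dates

def violates_precedence(branches, tip_dates, internal_dates, tol=1e-9):
    """True if some child is estimated to be older than its parent."""
    def date(node):
        return tip_dates[node] if node in tip_dates else internal_dates[node]
    return any(date(c) < date(p) - tol for p, c, _ in branches)

# Hypothetical clock-like tree: root 0 -> (internal 1, tip "C"); 1 -> ("A", "B").
tip_dates = {"A": 10.0, "B": 12.0, "C": 8.0}
branches = [(0, 1, 0.05), (1, "A", 0.05), (1, "B", 0.07), (0, "C", 0.08)]
rate, internal_dates = unconstrained_dates(branches, tip_dates, n_internal=2)
```

On this noise-free toy tree the solve recovers the generating rate and dates. With noisy branch lengths nothing in the solve enforces that children are younger than their parents, which is what `violates_precedence` checks after the fact.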
This is why we designed the QPD (quadratic programming dating) algorithm, which we describe now. QPD is based on an active-set method, which is commonly used to solve optimization problems with linear constraints (Nocedal and Wright). With strictly convex quadratic functions, this method is guaranteed to converge to the unique global minimum (Nocedal and Wright). Although Equation 2 does not comply with these requirements, a proof of QPD convergence to the unique minimum is provided in the Online Appendix.
The active-set method is especially efficient here, because we can find the stationary point of the Lagrange function Eq. In our experiments (described below), QPD performs 3 iterations on average with simulated trees of taxa, and 69 iterations with an H1N1 influenza data set of taxa.
Although it is difficult to extrapolate from these experiments, it seems that in practice f is much smaller than n, and thus the computing time of QPD appears to be nearly linear.
Given an unrooted tree, we estimate the root position by searching for the point in the tree that minimizes the objective function Eq.
In essence, this is the point that makes the tree the most molecular clock-like. Note that we do not use weights (variances) in the objective function, since weights depend on their associated branch lengths, which are unknown for the two branches containing the assumed root. Optimizing this function without and with constraints can be done by slightly modifying the LD and QPD algorithms, without changing their time complexities. The technical details are given in the Online Appendix.
Since LD is linear, the corresponding rooting algorithm is quadratic. For QPD, to avoid exploring all branches, which could be time consuming with large trees, we pre-estimate the position of the root using LD, and then we use QPD to perform a greedy search for the local minimum around that position.
This rooting method is also applicable when all tips are contemporaneous, thus representing a new alternative to the standard rooting methods (midpoint, minimum-variance, etc.). We implemented a tree generator based on a simple birth-death model with periodic sampling times, mimicking typical intrahost studies with yearly sampling, or interhost epidemic surveillance through time. Let us start with SMC.
This process is continued until we have individuals. Then we proceed with sampling and death: the evolution of a number of individuals e.
The process continues with the nonculled and nonsampled individuals (in our example), which are further divided using the same Yule-type rule until we again have individuals to be sampled, culled, or conserved for the next step. The whole process is continued until we attain the desired number of sampling times. The final set of sampled individuals is exactly the taxon set or leaves of the final tree.
This tree is then rescaled so that the time between the first and the last sampling time is 20 years, with the root date being zero.
An advantage of this scheme is that the time elapsed from one sampling time to the next one is constant, thus emulating the sampling of DNA sequences from an evolving population on a regular basis, as opposed to standard birth-death tree generators (Stadler). Moreover, with birth-death trees the divergence times vary among replicates, while here we use fixed divergence times for easy estimation of method accuracy and presentation of the results.
We generated two kinds of trees, intended to simulate interhost and intrahost HIV evolution Volz et al. For each, we used 3 sampling times separated by 10 years with 25 selected individuals at each time, and 11 sampling times separated by 2 years with 10 selected individuals at each time. See Figure 1 for examples of trees. Additionally, we added one outgroup to simulate the search for the root position using the standard outgroup-based approach.
The length of the branch from the ingroup root to the outgroup was three times the length from the ingroup root to the nearest ingroup leaf. With each combination of these parameters, trees were randomly generated.
Figure 1. Examples of simulated trees. Four examples of trees extracted from our simulated data sets. Trees a and b each have 3 sampling dates with 25 sampled strains each. Trees c and d each have 11 sampling dates with 10 sampled strains each. See text and Volz et al.
For this purpose, we reused the previous trees, but multiplied every branch length by a random variable following a lognormal distribution with mean 1 and standard deviation 0.
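The branch-length perturbation described above can be sketched as follows. A lognormal variable with a prescribed mean and standard deviation requires converting those two moments into the parameters of the underlying normal distribution, which is what `lognormal_params` does. The standard deviation used in the study is truncated in this copy of the text, so `placeholder_sd` is a stand-in value chosen for illustration only; the function names are also assumptions of this example.

```python
import numpy as np

def lognormal_params(mean, sd):
    """Convert the mean and sd of a lognormal variable to the (mu, sigma)
    of the underlying normal distribution."""
    sigma2 = np.log(1.0 + (sd / mean) ** 2)
    mu = np.log(mean) - 0.5 * sigma2
    return mu, np.sqrt(sigma2)

def relax_branch_lengths(branch_lengths, sd, seed=0):
    """Multiply each branch length by an independent lognormal factor with mean 1."""
    rng = np.random.default_rng(seed)
    mu, sigma = lognormal_params(1.0, sd)
    factors = rng.lognormal(mean=mu, sigma=sigma, size=len(branch_lengths))
    return np.asarray(branch_lengths, dtype=float) * factors

# The sd value below is a placeholder, not the value used in the study.
placeholder_sd = 0.4
mu, sigma = lognormal_params(1.0, placeholder_sd)
relaxed = relax_branch_lengths([0.05, 0.05, 0.07, 0.08], placeholder_sd)
```

Because the factors have mean 1, the perturbed branch lengths keep the same expected value as the clock-like ones, and each branch is drawn independently, in line with an uncorrelated relaxed clock.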
This value is between the estimates we obtained for pol and env HIV genes (unpublished results). These parameter values are similar to estimates already observed with the env region of HIV (Posada and Crandall). To assess the accuracy of the distance-based dating methods, we inferred trees from these alignments.
These algorithms are implemented in the LSD software (least-squares dating), which can be downloaded from this web page, along with all our data sets.
All these trees were used in two ways: (i) the outgroup was used to produce rooted trees, from which the outgroup was deleted; (ii) we simply removed the outgroup to obtain unrooted trees. All of our data sets model trees, alignments, distance matrices, inferred trees, etc. For RTT, we re-implemented the linear regression method, which takes both rooted and unrooted trees as input.
Given unrooted trees, it estimates the position of the root by minimizing the sum of squared residuals. For dozens of data sets, we checked that our implementation gives the same result as Path-O-Gen v1. Unlike other methods used here, RTT does not estimate the dates of internal nodes but only the root date and the substitution rate. We used a SMC with an uninformative prior (clock rate had a uniform distribution between 0 and 1). For the relaxed-clock data, we also used a lognormal relaxed-clock model i.
These parameter values are standard and default options were used in all of our analyses.
Additional runs with several alternative priors were also performed uniform prior in a much more narrow interval [0, 0. Moreover, other runs of BEAST were carried out to assess the accuracy of internal node date estimations. We then used the true rooted tree topology (otherwise date comparisons are meaningless) and forced it to be constant in BEAST, so that only the branch lengths were re-estimated, just as with PhyML (see above).
In all of our analyses, we used the meanRate estimator for rate estimations with BRMC, since it was more accurate than ucld. With simulated data, the true values of the parameters (substitution rate, root and node dates) are known. We used standard quadratic error measures to compare the true and estimated values and assess the accuracy of the methods being compared. An advantage of these measures is that they can be decomposed into variance and bias terms, thus indicating whether the estimation method shows some tendency to over- or underestimate the true parameter value, and whether the main source of errors is, or is not, the variance of the estimates.
The accuracy of that method in estimating the substitution rate is measured by the relative error:. A basic result in estimation theory is that the square of the bias plus the variance of the estimates is equal to the mean square error. It follows that our relative bias is less than the relative error and that their difference corresponds to the relative standard deviation of the estimates.
We calculated the confidence intervals of these error measures using the bootstrap method; for each data set of trees, we re-sampled 10, times with replacement the set of the estimated values and computed the corresponding error; then, the 2. For the dates of internal nodes, we used the absolute error (measured in years and thus easily interpreted) defined by:. Again we used the bootstrap to build confidence intervals.
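The error formulas are missing from this copy of the text, so the sketch below assumes common definitions: relative error as the root mean square error divided by the true value, and relative bias as the absolute difference between the mean estimate and the true value, divided by the true value. Under these assumed definitions the decomposition mentioned above holds exactly, and a percentile bootstrap gives an interval for the error. The function names, the number of resamples and the toy estimates are all hypothetical.

```python
import numpy as np

def relative_error(estimates, true_value):
    """Assumed definition: root mean square error divided by the true value."""
    est = np.asarray(estimates, dtype=float)
    return np.sqrt(np.mean((est - true_value) ** 2)) / abs(true_value)

def relative_bias(estimates, true_value):
    """Assumed definition: |mean estimate - true value| / true value."""
    return abs(np.mean(estimates) - true_value) / abs(true_value)

def bootstrap_interval(estimates, true_value, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the relative error."""
    rng = np.random.default_rng(seed)
    est = np.asarray(estimates, dtype=float)
    errors = [
        relative_error(rng.choice(est, size=est.size, replace=True), true_value)
        for _ in range(n_resamples)
    ]
    tail = 100.0 * (1.0 - level) / 2.0
    return np.percentile(errors, [tail, 100.0 - tail])

# Hypothetical rate estimates around a true rate of 0.01.
true_rate = 0.01
estimates = np.array([0.009, 0.011, 0.012, 0.010, 0.013])
err = relative_error(estimates, true_rate)
bias = relative_bias(estimates, true_rate)
rel_sd = np.std(estimates) / true_rate   # population sd (ddof=0)
low, high = bootstrap_interval(estimates, true_rate)
```

In this form the squared relative error equals the squared relative bias plus the squared relative standard deviation, so a method with a small bias but a large error is dominated by the variance of its estimates.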
For a fair comparison, we also have to account for tree building, as BEAST infers both the tree and the dates. However, PhyML is much faster, requiring 8 min for the largest taxon trees. The computing time difference between distance-based approaches and BEAST is thus very large (see Online Appendix, Supplementary Table S1, for details) but does not correspond to gains in estimation accuracy, as discussed below.
With SMC data Fig. As a general tendency Fig. Surprisingly, the accuracy of rate and root date estimations is not significantly affected by topological errors: although the FastME and PhyML trees contain a substantial number of erroneous branches, we see very little difference in accuracy between the results obtained with the true and inferred topologies.
This suggests the use of the much faster FastME rather than PhyML, when the aim is not to obtain a fully correct tree topology but to quickly estimate rates and dates, or to perform bootstrap analyses.
Summary results with simulated data. Panels a and b show the relative error of the substitution rate estimates, panels c and d show the relative error of the root date estimates, panels e and f show the average error in years of the date estimates of all tree nodes. See text for the definitions of these measures.
With RMC data Fig. Again, the topological errors have little impact on the accuracy of rate and date estimations, and cannot explain the differences among the various methods, especially with BEAST, the topological accuracy of which is still slightly better than PhyML's (Supplementary Table S4).
As expected, the main factor is root positioning, which has a high impact on root date estimations. If the root is misplaced, the tree cannot be dated precisely. Among the methods directly inferring the root position i.
tothuhien/lsd2: a phylogeny dating method using least-squares algorithms and criteria. Please cite: "Fast dating using least-squares criteria and algorithms", T-H. To, M. Jung, S. Lycett, O. Gascuel, Syst Biol. Jan;65(1)