
Advances in understanding tumour evolution through single-cell sequencing


Jack Kuipers¹, Katharina Jahn¹, Niko Beerenwinkel

Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
Swiss Institute of Bioinformatics, Basel, Switzerland

¹ Equal contributors

PII: S0304-419X(17)30039-2
DOI: 10.1016/j.bbcan.2017.02.001
Reference: BBACAN 88136
To appear in: BBA - Reviews on Cancer
Received date: November 2016
Revised date: February 2017
Accepted date: February 2017

Please cite this article as: Jack Kuipers, Katharina Jahn, Niko Beerenwinkel, Advances in understanding tumour evolution through single-cell sequencing, BBA - Reviews on Cancer (2017), doi:10.1016/j.bbcan.2017.02.001

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

Preprint submitted to BBA Reviews on Cancer, February 10, 2017

Abstract

The mutational heterogeneity observed within tumours poses additional challenges to the development of effective cancer treatments. A thorough understanding of a tumour's subclonal composition and its mutational history is essential to open up the design of treatments tailored to individual patients. Comparative studies on a large number of tumours permit the identification of mutational patterns which may refine forecasts of cancer progression, response to treatment and metastatic potential. The composition of tumours is shaped by evolutionary processes. Recent advances in next-generation sequencing offer the possibility to analyse the evolutionary history and accompanying heterogeneity of tumours at an unprecedented resolution, by sequencing single cells. New computational challenges arise when moving from bulk to single-cell sequencing data, leading to the development of novel modelling frameworks. In this review, we present the state-of-the-art methods for understanding the phylogeny encoded in bulk or single-cell sequencing data, and highlight future directions for developing more comprehensive and informative pictures of tumour evolution.

Keywords: Single-cell sequencing, Cancer evolution, Tumour heterogeneity, Phylogenetics

1 Tumour evolution and heterogeneity

Cancerous cells experience complex and diverse genomic aberrations which may induce characteristic hallmarks [1, 2] and allow tumour progression. The view of a sequence of genetic changes providing a fitness advantage and leading to a clonal expansion of cells inheriting those characteristics was crystallised by Nowell [3], and exemplified for colon cancer [4]. The consequences of an evolutionary model of competing clones in a Darwinian framework are complex and heterogeneous tumours, as were also initially observed [5] and seen as a founder of metastases [6]. Tumour heterogeneity was quickly established and examined (as reviewed in [7]), but the evolutionary view of competing populations of tumour cells came back into focus at the turn of the millennium [8, 9, 10] with the arrival of genome sequencing.

The collection of large amounts of genetic data with next-generation sequencing (NGS), spearheaded by the compilation of large public databases by consortia like The Cancer Genome Atlas (TCGA) [11] or the International Cancer Genome Consortium (ICGC) [12], cemented the view of cancer as a dynamic evolutionary process with clones arising and expanding, and descendant cells differentiating into further competing subclones [13, 14, 15]. Detailed genomic data have also uncovered the clonal complexity and heterogeneity across many cancer types, as recently reviewed [16].

The negative effects of clonal diversity on tumour progression were observed clinically for esophageal adenocarcinoma [17], allowing the use of diversity as a biomarker [18]. This example spurred the examination of the clinical implications of the genetic diversity resulting from tumour heterogeneity [19]. Heterogeneity or diversity is also a cause of drug resistance or relapse [15, 20, 21, 22]. Treatment may target the most common clone; its remission, together with the new selective pressures of treatment, may then allow smaller subclones to emerge, develop resistance and progress [23, 24, 25]. Subclones may also cooperate [26], which connects back to the ideas of Heppner [7], who emphasised that subclones belong to a complex tumour ecosystem. The order of mutations can also affect disease progression and response to treatment [27]. The large amounts of genomic data have therefore not only shed light on the complex makeup of tumours, but now highlight how a deeper understanding of their diversity and evolutionary history is needed for more effective and precise cancer therapies [15, 16, 25, 28, 29, 30].

1.1 Decoding heterogeneity and evolutionary histories

Typically, approaches to study heterogeneity and clonal evolution have looked at bulk samples, which mix the DNA of thousands or millions of cells before sequencing. The resulting output is an estimate of the frequencies of the variants in each sample. To understand the diversity and subclone structure, one needs to be able to decode the evolutionary history from such bulk data. The problem of moving from variant frequencies to evolutionary histories reduces to one of deconvolving the mutations in the mixture into clones and their phylogenetic relationships. We review methods developed for resolving this problem in Section 2. As depicted in Figure 1, there are situations where the frequencies alone cannot distinguish between different histories. This can be improved by taking multiple samples [31, 32], or samples at different times [33]. The results from bulk data, however, tend to provide rather low-resolution indications of the evolutionary history and heterogeneity [34, 35], because low-frequency mutations cannot be reliably separated into new clones and tend to be placed together or in existing clones. Again, multiple samples can help to improve the resolution.

To arrive at the highest possible resolution of a tumour's history, the sequencing of individual cells has been advocated [35]. All cells in the body and in tumours descend along a binary genealogical tree of which the cells themselves are the taxa, as depicted in Figure 2. Reconstructing the tree then requires no deconvolution. It does, though, require that mutations, once they arise, are preserved from generation to generation and that they occur only once in the evolutionary tree, which is known as the infinite sites assumption. With this assumption and perfect calling of the mutations in each cell, the phylogeny can be reconstructed very efficiently [36].

The challenge with single-cell data, though, is that the errors in mutation calling can be very large and unbalanced. In particular, when the single copy of a cell's DNA is amplified to allow it to be sequenced, the coverage may be rather uneven, so that some genome positions cannot be called and are effectively missing. Due to feedback in the amplification, one allele may happen to predominate at certain genomic positions, so that mutations on the other allele do not appear in the sequencing data. Algorithms have therefore been developed to deal specifically with single-cell data, which we review in a later section after discussing the advances in single-cell sequencing in Section 3. An overview of the sequencing and phylogenetic reconstruction processes for both bulk and single-cell samples is presented in Figure 3.

2 Bulk sequencing phylogeny approaches

Due to the higher prevalence of
bulk-sequencing data, most approaches to reconstruct evolutionary histories of individual tumours are based on this data type. Sequencing the admixed cell populations of hundreds of thousands or even millions of cells that compose a bulk sample only reveals the allele frequencies of the individual mutations in the mixture, leaving the number of subclones present, their prevalences, their individual mutation profiles and their genealogy undetermined [35]. Phrased in terms of classic phylogeny reconstruction, this is a situation where the number of taxa, their relative population sizes, their individual character states and their phylogenetic relationships all need to be established, while the only information available is the set of characters and an estimate of their relative frequencies across the admixed populations. This constitutes a highly underdetermined problem for which classic approaches to phylogeny reconstruction are not suited. Hence many tools customised to this problem have been developed in the past years.

Figure 1: (a) Schematic representation of the clonal expansion that shaped the heterogeneous tumour depicted in (b). The colours of the cells represent their belonging to the different subclones. The small stars inside the cells represent the mutations present. (c) Two bulk samples admixed with normal cells (empty grey circles) taken from the tumour in (b). The bar plots depicted next to the samples can be derived from variant allele frequency data obtained by bulk sequencing. Each bar represents the estimated cellular prevalence of one mutation present in the sample. Note that the dark purple mutation on the bottom left of (a) is absent from the frequency plots because its frequency is too low to be detected. (d) Mutation histories compatible with the cell prevalences of one sample or the other (not all compatible trees are depicted). The two trees in the intersection are compatible with both samples. It cannot be inferred from the given data that the left one is the true history that matches the clonal expansion in (a).

Figure 2: From the heterogeneous tumour of Figure 1, depicted in (a), which has evolved following the schematic representation in (b), the 10 single cells shown in (b) are selected for sequencing. One cell is normal tissue, while the remaining nine cells from the tumour contain additional mutations represented by the stars in the cells. The cells belong to a binary genealogical tree as in (c), where they are connected at their common ancestors. The exact nature of the branch points cannot necessarily be determined by the mutations each cell possesses; for example, the three cells on the left can have any arrangement as long as they are all below the purple mutation which distinguishes them from other cells. The representation in (c) is a sample genealogical tree focussing on the relationship between the cells themselves, while an equivalent representation is presented in (d). Here the mutations are encapsulated in nodes on a tree, with the samples attached as leaves, to create a mutation tree. This representation emphasises the ordering and evolutionary history of the mutations.

Figure 3: Left: Overview of the typical workflow for the reconstruction of mutation histories from bulk tumour samples. DNA is extracted from a bulk sample and sequenced to reveal the admixed mutation profile. Clustering mutations by variant allele frequencies reveals possible subclones and their relative frequency in the admixed sample. Based on this information, compatible mutation histories are inferred. Right: Overview of the typical workflow for the reconstruction of mutation histories from single-cell samples. The DNA is extracted from the individual cells and amplified due to the limited starting material. This process does not amplify all genomic sites equally well. The amplified DNA material is then sequenced and mutations are called. The mutation profiles of the individual cells are then combined into a single (noisy) character state matrix that is used for tree inference.

2.1 Phylogeny reconstruction from SNV data

An overview of software tools for reconstructing tumour evolution based on single-nucleotide variant (SNV) data is given in Table 1. We discuss in the following the shared and distinctive features of the underlying methods.

Table 1: Clonal reconstruction methods based on SNV bulk data. Abbreviations: EM, Expectation Maximisation; MCMC, Markov Chain Monte Carlo; MILP, Mixed Integer Linear Programming; QIP, Quadratic Integer Programming.

Software    Year  Reference  Phylogeny  Multiple samples  Inference
TrAp        2013  [37]       Y          N                 Exhaustive search
Clomial     2014  [31]       N          Y                 Binomial / EM
PhyloSub    2014  [32]       Y          Y                 Tree-structured stick-breaking / MCMC
PyClone     2014  [38]       N          Y                 Dirichlet process, beta-binomial / MCMC
RecBTP      2014  [39]       Y          N                 Approximation algorithm
SciClone    2014  [40]       N          N                 Beta mixture model
AncesTree   2015  [41]       Y          Y                 Optimisation / MILP
CITUP       2015  [42]       Y          Y                 Optimisation / QIP
LICHeE      2015  [43]       Y          Y                 Heuristic
BayClone    2015  [44]       N          Y                 Gibbs sampling / Metropolis-Hastings
CTPsingle   2016  [45]       Y          N                 Dirichlet process, beta-binomial / MCMC
Cloe        2016  [46]       Y          Y                 Metropolis-coupled MCMC

An important preprocessing step for reconstructing tumour phylogenies from SNV data is the correction of allele frequencies for ploidy aberrations, due to copy number alterations (CNAs) or loss of heterozygosity (LOH), to estimate the cellular prevalences of the mutations [38, 47]. In practice many SNV-based approaches focus on mutations at copy-number-neutral sites [39, 40, 41, 42, 45], in which case the cellular prevalence of a heterozygous mutation is just twice its relative allele frequency.

A key assumption shared by nearly all approaches focusing on phylogeny reconstruction from SNV data is that of infinite sites, which restricts the space of possible mutation histories in two ways: first, no genomic site is hit by more than one mutation throughout the entire evolutionary history of a tumour, and second, once present, a mutation persists in the whole lineage founded by the cell where it initially occurred. The motivation for this assumption is mainly its plausibility given the size of the genome and the relatively low number of mutations observed in tumour samples. However, it also has the welcome side-effect of reducing the underdetermination of the deconvolution problem and the tree search space.

The next step common to most SNV-based approaches is a clustering of mutations with approximately equal allele frequencies. Some approaches use Bayesian mixture models for this step [47, 48]. The assumption behind the clustering is that variants with identical frequency are either both present or both absent in every subpopulation. A scenario for such a connection to arise could be a driver mutation occurring in a cell with a pre-existing passenger mutation. Then the increased fitness of the cell with the driver and its descendants may have led to the extinction of all cells carrying only the passenger mutation. For mutation sets with a shared cell prevalence > 50%, such a connection is the only way they can fit on a single tree. This follows from the infinite sites assumption, which prevents mutations from being split onto separate tree parts, and from the pigeonhole principle, by which some cell population of the tumour has to carry both mutations, as the sum of cell prevalences cannot exceed 100%. For smaller cell prevalences, especially for low-frequency mutations, it is less obvious why the assumption should be generally true. Two low-frequency mutations could have the same approximate cell prevalence by chance, without the driver/passenger link described above, and could still be erroneously clustered together.

It has been shown that the deconvolution problem can be solved without grouping mutations by cellular prevalence [37]. However, the complexity of the problem increases significantly with increasing numbers of subclones, and indeed Strino et al. could only solve instances of up to 25 aberrations [37], such that tree inference would in most cases be restricted to a selection of mutations.

Once the clustering is fixed, the remaining task is to arrange the mutations in a tree consistent with the cell prevalences of the mutations. The mutation states of the subclones and their relative frequencies in the sample follow immediately from the consistent tree. Consistency here means that the cellular prevalence of each node is at least as large as the sum of the prevalences of its child nodes. This is necessary as the nodes are then interpreted as subclones that contain all the mutations along the path from the root to this node, such that the prevalence of a mutation at a node has to be shared with the whole subtree below the node. This constraint is also referred to as the 'sum rule' [32]. While it substantially restricts the solution space, it is typically not enough to find a unique solution. For example, a linear chain of mutations sorted by decreasing prevalence is always consistent with a single sample. Biologically motivated constraints, such as minimising the number of populated subclones or the tree depth, can be used to pick plausible topologies [37, 39].

Here it is also advantageous that studies increasingly analyse multiple samples per patient. These could be from spatially distinct tumour parts [49], tumour/metastasis pairs, longitudinal studies such as tumour/relapse pairs [20], or xenograft models [50]. When multiple samples of the same tumour are available, there is a second constraint, the 'fork rule', which states that if among two mutations the first is more prevalent in one sample and the second in another sample, they need to be placed in separate branches [32]. In general, the more samples available, the more topologies can be excluded, as long as their subclone composition differs sufficiently. However, in practice this process is complicated by inaccuracies in the estimated cell prevalences and possible errors in the clustering, due to which no tree may be consistent with all data. One solution here is to find the tree that requires the smallest corrections to the estimated cell prevalences in order to fit them [32, 42], or to exclude some mutations from the tree [41].

While all SNV-based reconstruction approaches make use of the combinatorial constraints, they employ vastly different methodologies. Three major lines can be identified. Some perform an exhaustive search enumerating all trees that fulfil the combinatorial constraints plus additional biological restrictions [37], or an approximation thereof [39]. Others represent the constraints via a directed ancestry graph, which contains the optimal solutions in the form of spanning trees [41, 43]. Finally, there is a group of Bayesian approaches that give a posterior distribution over the tree space, thereby quantifying uncertainty in the inference [32, 45]. Recently another Bayesian approach for tree inference has been proposed that merely penalises trees for violations of the infinite sites assumption instead of generally excluding them [46].

For high-frequency subclones, tree reconstruction from SNV bulk data has sufficient discriminative power to reveal their evolutionary relationships. However, for low-frequency populations, the signal in the admixed variant allele frequencies seems to be too weak for a reliable reconstruction [35]. Also, the clustering by allele frequency is less convincing for low-frequency mutations, leaving their correct placement in the tree a largely unsolved problem. Advances in sequencing technology towards longer reads may provide further constraints in the future, as mutations located on a single read cannot be placed in different tree branches.

2.2 Phylogeny reconstruction from SNV and CNA data

There exist a few approaches such as THetA [51], THetA2 [52] and TITAN [53] that use CNA data alone to infer subclones, but none of them reconstructs tumour phylogenies. More recently, CNA and SNV data have been combined to increase the discriminative power in the reconstruction process. A summary of methods following this strategy and their key features is given in Table 2. The methods CHAT [54] and CloneHD [55] estimate cellular prevalences of both SNVs and CNAs but do not set them into a phylogenetic context. SubcloneSeeker infers trees based on cellular prevalences of both SNV and CNA data [56]. However, it relies on other tools to accurately estimate these prevalences in a preprocessing
step and is restricted to two samples such as tumour/relapse pairs. SCHISM [57] also relies on pre-established cellular prevalences. The inference is then a two-step process: it first uses a hypothesis testing framework to establish subclones and their pairwise relationships and then applies a genetic algorithm to find a matching phylogeny. PhyloWGS [58] extends the probabilistic framework of PhyloSub [32] to integrate copy number information. It is also the first approach to model overlaps between CNA and SNV data. Estimates of CNA copy number status and population frequencies are required as input; these are used to transform sites affected by a CNA, or by a CNA and an SNV, into pseudo-SNV sites, so that the SNV-based probabilistic tree inference method of PhyloSub can be applied.

All of the tree inference approaches discussed so far make the infinite sites assumption, which should be revisited in the context of copy number changes. Since these events typically affect larger segments, the likelihood of two of them overlapping is not negligible. Likewise, the chance of a mutated allele being lost by a segmental loss is much higher than that of a point mutation reverting it back to its original state. Neither scenario is compatible with the infinite sites model, so it is debatable whether the assumption is still safe to make. SPRUCE [59] relaxes the assumption to a model where a mutation can change its state multiple times but cannot attain the same state twice independently in the tree. This restriction is known as the infinite alleles assumption or multi-state perfect phylogeny. While this is a step in the right direction, it still overlooks many plausible scenarios, such as a site undergoing a copy number change that is later reverted. CANOPY [60] solves the issue of recurrent mutation states in a different way: while it nominally keeps the infinite sites assumption, it restricts the scenarios in which it could be violated to such a small number that the assumption becomes reasonable
again. For example, a mutation event would only be considered as recurrent when it sets the exact same genomic segment to the exact same copy number state in different parts of the phylogeny. As the endpoints of the segments are defined at the resolution of nucleotide positions, such a recurrence is unlikely to be observed. In contrast to the other methods discussed so far, CANOPY is also the only one to recognise that copy number alterations are interdependent and should be modelled as sequences of events rather than as independent changes of chromosome segments. This view on genome evolution will become even more useful once tree inference models start to consider structural rearrangements and their potential for confounding read-depth data. Pioneering work in this direction was performed by Greenman et al. [61] and Purdom et al. [62]. Neither of these two studies focuses on tree construction, but they estimate the order of genomic rearrangement events. Many of the concepts introduced in these works, such as the use of external linkage information (e.g. HapMap data) for phasing and the assignment of copy numbers to one of the physical alleles [61], may be worthwhile to integrate in future approaches to reconstruct mutation histories of tumours from bulk sequencing data. An approach for phasing using only major and minor allele copy number profiles was recently suggested by Schwarz et al. [63]. Besides the phasing, it computes the tree topology and assigns genomes to ancestral states based on the minimum evolution criterion.

3 Single-cell advances

After the arrival of NGS and the accompanying drop in the price of obtaining genomic information, efforts to understand tumour diversity were epitomised by the collection and archiving of thousands of tumour samples by TCGA [11] and the ICGC [12]. Efforts were later also underway to understand intra-tumour diversity at full resolution by sequencing individual tumour cells. The technical advances are reviewed for example in [64, 65] and expounded in
[66], and here we focus on their use to uncover tumour heterogeneity from a modelling perspective 3.1 Single-cell sequencing The first results for single-cell genomics were for mRNA sequencing of a mouse blastomere [67] where ACCEPTED MANUSCRIPT Reference Phylogeny CHAT CloneHD SubcloneSeeker PhyloWGS SCHISM SPRUCE CANOPY 2014 2014 2014 2015 2015 2016 2016 [54] [55] [56] [58] [57] [59] [60] N N Y Y Y Y Y Multiple samples N Y Y Y Y Y Y Inference Dirichlet process Gaussian mixture model / MCMC HMM / local optimisation Exhaustive enumeration Tree-structured stick-breaking / MCMC Likelihood ratio tests / genetic algorithm Exhaustive enumeration MCMC T Year IP Software NU S a very important component for any modelling of SCS data Although the false positive error rates are low ( 10−5 ) many base positions can be tested across the whole exome or genome so that the total number of falsely detected SNVs may still be in the hundreds or thousands per cell For cells from the same tumour sample, a simple consensus of SNVs across two or more cells reduces the error rates back to low values, which is fortuitous from a modelling perspective because mutations observed in only one cell are also uninformative for reconstructing the evolutionary history of the tumour Since SNVs are selected for analysis when they are detected, the false positive rate among them may be enriched compared to the per base pair error rate of the SCS technique An exciting alternative to Whole Exome Sequencing (WES), or whole genome sequencing, of each single cell to reduce the cost while offering low error rates was to first perform deep bulk sequencing and to liberally select sites which may possess a mutation A personalised panel was then developed for leukaemia patients to use for the final sequencing and mutation calling [80] The preselection of sites to test reduces the enrichment of false positives, but AD and other false negatives still occur during the amplification A further alternative to 
amplifying the DNA of single cells is to culture individual cells (as done for organoids [81, 82]) before harvesting a large number and performing standard bulk sequencing with the downside that culturing will bias the sample by selecting for viable cells, and may introduce new mutations Before individual cells can have their DNA amplified and sequenced, the cells themselves need to be isolated first One approach has been to collect Circulating Tumour Cells (CTCs) from blood samples which for DNA experiments first had low coverage for CNA calling [83, 84, 85] and later with WES [86] For primary tumour cells, early experiments focussed on micropipetting [69, 70, 73, 74, 87] or nuclei sorting [68, 78, 88] Higher throughput experiments, combined with panel sequencing, have turned to microfluidics [80] or FACS AC CE P TE D MA the major challenge was to have sensitive enough sequencing for the small amount of primary material For DNA this involves amplifying the initial single copy enough to be passed on to sequencers The first successful results [68] used a modified version of PCR for the initial amplification, before further PCR amplification and sequencing The low resulting coverage (≈ 10%) allowed for the identification of copy number variations, but not high confidence mutation calling Higher coverage was then quickly achieved through the use of Multiple-Displacement Amplification (MDA) [69, 70, 71, 72] allowing the identification of SNVs The MDA process involves the attachment of randomly primed Φ29 enzymes which synthesise DNA to create additional and displaced strands, which may then themselves be further amplified From a modelling perspective the amplification of the two original alleles is more akin to a P´olya urn model: starting with two balls representing the genomic base on each allele, repeatedly one ball is selected at random, duplicated and returned with the duplicate to the urn This feedback in the MDA process can also lead to rather non-uniform 
coverage Sites with low coverage cannot be reliably used for SNV calling, leading to high levels of missing data in early experiments (≈ 60% in [69]) To obtain higher uniformity, although at the cost of higher error rates, hybrid amplification methods have also been developed and utilised [73, 74, 75, 76, 77] Using cells where the DNA had just duplicated [78] reduced the amount of early amplification needed leading to lower error and missing data rates and can be part of the single nucleus exome sequencing (SNES) protocol of [79] With current techniques, Single-Cell Sequencing (SCS) provides high coverage and low false positive rates, but the largest source of uncertainty comes from allelic dropout (AD) where one strand (or part of it) does not get amplified (or not sufficiently) in the early stages and is not detectable in the final sequencing Although AD, which leads to false negatives, has fallen from highs of 40% or more [69], currently they are in the range of 10–20% False negatives therefore remain CR Table 2: Clonal reconstruction methods based on SNV and CNA bulk data Abbreviations: HMM, Hidden Markov Model; MCMC, Markov Chain Monte Carlo ACCEPTED MANUSCRIPT NU S CR IP T highly prevalent in colon cancer, but they were missing in the minor clone pointing to it having a distinct origin and separate development Advances in SCS technology led to better coverage and lower error rates for two breast cancer samples [78] Phylogenetic histories were reconstructed with NJ Since copy number analysis was also performed on the same single cells, they could uncover an early phase of aneuploid rearrangements followed by clonal expansion dominated by point mutations For one sample they saw a linear progression of clonal expansions, while for the second sample the clones separated into subclones, with one subclone founded by another aneuploidy event This combination of copy number and SNV calling on the same individual cells highlighted how both sets of information can be 
combined to improve the understanding of the phylogenetic history Single cells were analysed from three leukaemia patients [77] In particular they compared different SNV callers, opting for joint calling across samples, and specifically sequenced doublets samples to test for their contamination in the single-cell data To infer the phylogenetic history, they learnt a maximum likelihood tree from the genetic distances between each pair of single cells The evolution was mostly linear (with major subclones for one patient sample) but also exhibited low frequency heterogeneity and branching Since SNV callers (like [99, 100, 101, 102, 103, 104, 105]) are aimed at uncovering variants of different frequencies from bulk sequencing data, they are less applicable to single-cell data where the underlying number of copies of any variant is a (low) integer but the amplification and sequencing is much more noisy To account particularly for the non-uniform coverage of SCS [106] clustered the reads to correct for errors More recently a mutation caller designed for single-cell data has been developed [107] which treats the underlying mutation states in a single cell allowing it to outperform bulk SNV callers For single cell samples from leukaemia patients (from targeted panel sequencing), [80] looked in the other direction of modifying the phylogenetic reconstruction to account for the particularities of single-cell data With high dropouts from the MDA step before sequencing the error rates in single-cell data are highly unbalanced The distance based approaches employed before (whether in constructing a tree, in hierarchical clustering or NJ) implicitly weigh both kinds of errors equally, which can adversely affect the reconstruction Instead [80] introduced a binomial mixture model to cluster the single-cell genotypes, where the probability of a mutation or its absence varies for each cluster MA [89, 90] Barcoding methods [91] are also promising to increase the scope of SCS at lower 
costs. Microwells or drops combined with barcoded beads [92, 93] now allow the parallel RNA sequencing of thousands of cells. A more recent version of barcoding for DNA sequencing [94] offers the possibility to sequence 48–96 cells simultaneously, broadening the scope of single-cell sequencing experiments. High-throughput protocols also offer the joint RNA and DNA sequencing of single cells [95].

Regardless of how the individual cells are isolated, a key point in SCS experiments is to verify that the cells are indeed unique. Any doublet samples obviously break the single-cell assumption at the heart of methods designed specifically to analyse single-cell data. Some cell-isolating techniques may have high rates of doublet sampling, in the range of 10–40% [96], which are important to control experimentally and to bear in mind when modelling.

3.2 Single-cell histories

Once the single cells have been sequenced, and the mutations or copy number events uncovered with standard bioinformatics pipelines, one focus is on understanding the evolutionary history of tumours and their diversity. We highlight some of the key datasets, with their characteristics summarised in Table 3, and how the single-cell phylogenetic history informed their analysis.

One of the first single-cell datasets comes from a JAK2-negative myeloproliferative neoplasm [69], where PCA was employed to uncover a likely monoclonal origin of the tumour. They also found that the patient-specific mutations did not coincide with the commonly implicated genes for that tumour type. Back-to-back, a kidney cancer sample [70] was published, and no real evidence of clonal subpopulations was uncovered using neighbour-joining (NJ) [98]. However there was large diversity in mutations, suggesting an accumulation of passenger mutations. The cancer cells were also close to the non-tumour controls, indicating a short time frame for the cancer's progression.

The first evidence for a branching mutation history in single-cell data was discovered in a
bladder cancer [71] using hierarchical clustering. This revealed two main subclones which seemed to be outgrowing the ancestral clone, since they appeared late in the tumour development but still made up sizeable proportions of the tumour itself. Hierarchical clustering was also employed on a colon cancer sample [87], which uncovered a minor clone alongside a much larger main clone. The main clone possessed early mutations in TP53 and APC, which are

Cancer type                  Year and reference  Number of mutations  Number of cells  False positive rate  Allelic dropout rate  Missing data
Myeloproliferative neoplasm  (2012) [69]         712                  58               6.04 × 10^-5         0.4309                58%
Kidney                       (2012) [70]         35                   17               2.67 × 10^-5         0.1643                22%
Bladder                      (2012) [71]         443                  44               6.7 × 10^-5          0.4                   55%
Colon                        (2014) [87]         176                  63               < × 10^-4            > 0.5                 –
Breast                       (2014) [78]         40 / 519             47 / 16          1.24 × 10^-6         0.0973                1%
Leukemia                     (2014) [77]         ≤ 1953†              11 – 12          –                    0.12                  28%
Leukemia                     (2014) [80]         10 – 105             96 – 150         –                    < 0.3                 –
Breast (and xenografts)      (2015) [50]         37 / 45‡             120 / 90         –                    ≈ 0.2                 7–12%
Ovarian (intraperitoneal)    (2016) [97]         23 – 33‡             420 – 672        –                    –                     –

Number of patients and number of samples: only partly legible; the samples per patient read 2/3 for [50] and 4–5 for [97].

second xenograft generation to then vanish compared to further generations of the first lineage.

Likewise utilising SCS to enrich bulk sequencing data, the intraperitoneal spread of high-grade ovarian cancer was examined over 68 samples from 7 patients in [97]. For three patients, each with 4 or 5 spatially distinct samples, a total of 1680 single cells were isolated and subjected to targeted sequencing of a small number of genomic sites. The clonal composition of those tumours was inferred from the single cells using the clustering method of [108]. This augmented the bulk clustering analysis by providing higher quality genotypes. From the phylogenetic analysis of the multiple spatial samples for each of the patients, the nature of the clonal spread from the ovaries to the intraperitoneal sites could be uncovered [97]. Particularly striking was that
along with the five patients exhibiting monoclonal seeding, two patients exhibited reseeding and polyclonal spread. As well as indicating different possible modes of peritoneal spread, this could also suggest that the different microenvironment of the peritoneal cavity leads to novel selective pressures on heterogeneous tumours.

according to the data. Once clustered, the phylogeny can be found as the minimum spanning tree, which for five of the six patient samples featured coexisting high-frequency clones. Often the ancestral clones were also still present in the population. Along with the phylogenies, the clustering highlighted cells sharing mutations from different lineages, indicating that they were the result of doublet sampling.

Table 3: Characteristics of some single-cell sequencing datasets. The number of samples is per patient. The number of cells, also per patient, only includes those which passed quality control and were used for mutation calling. The false positive and allelic dropout rate estimates are per genomic position. The number of mutations excludes those which only occur in one cell, which are uninformative for the phylogenetic reconstruction. They may however include mutations occurring (or with missing data) in all cells, which are also uninformative. These have been removed from the count of [70] and do not occur for the ER+ tumour of [78] or in any of the patient samples from [80]. † The number of mutations listed for [77] refers to the number of loci sequenced. ‡ The number of mutations only indicates those uncovered in targeted panels of 40 / 45 SNVs for [50] and of 43 – 84 SNVs for [97].

More recently, the clustering in [80] was refined to a variational Bayes approach [108] which could also explicitly model the presence of doublet samples. The clustering however, like in [80], was performed without enforcing a phylogeny.

After performing deep bulk sequencing on primary tumours and derived xenograft lines from 15 patients, and studying their
clonal composition and dynamics with PyClone [38], two examples were selected in [50] for high-resolution follow-up with SCS: one with strong initial selection upon transplantation, and one with complex clonal evolution through the xenograft generations. For the SCS a targeted panel was designed for each example based on mutations detected with the bulk sequencing. For inferring the tree structure of the single cells, the Bayesian phylogenetic approach of [109] was employed. The resulting single-cell phylogenies were mainly used to corroborate the genotype clusters found by PyClone from the bulk sequencing, but with the advantage of also providing the ancestral histories of the clones. For the example with strong initial selection, the single-cell data indicated complete separation between the primary tumour and a late xenograft sample and that the xenograft clone was founded by a very minor clone of the original tumour. The other example showed complex clonal evolution with two main lineages. The second lineage expanded heavily during the

Single-cell phylogenetic reconstruction

Along with approaches to call mutations in single cells [107] and cluster them [80, 108], a different direction has been to modify the phylogenetic inference to account for the specifics of single-cell data. All cells in a tumour live on a genealogical tree, Figure 2(c), where they connect with each other at their common ancestors. If we take the infinite sites assumption that the genome is essentially so long that there is no chance that the same position may mutate more than

dealing with the vast number of trees that exist and in finding optimal trees, or a good set of them.

The first probabilistic single-cell approach of [113] considered three mutation states for the data of [69]: wildtype, and heterozygous and homozygous variants. Homozygous variants are presumed to be the result of an allelic dropout of the normal allele so that only the alternative is amplified.
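The probabilistic methods discussed in this section all score a candidate tree by comparing the genotypes it predicts with the observed calls, charging each mismatch to a false positive or a false negative (allelic dropout). The following is a minimal Python sketch of that idea only, under simplifying assumptions: states are collapsed to binary presence/absence rather than the three states of [113], the tree is a mutation tree, and the cell attachments are given. The names `expected_genotype`, `log_likelihood`, `parent`, `ROOT`, `alpha` and `beta`, as well as the toy data, are illustrative and not taken from any of the cited tools.

```python
import math

ROOT = -1  # hypothetical marker for the mutation-free root of the tree


def expected_genotype(parent, attach_node):
    """Mutations predicted for a cell attached at `attach_node` of a mutation tree.

    parent[i] is the parent mutation of mutation i, or ROOT if mutation i
    sits directly below the root. Under the infinite sites assumption the
    cell carries the mutation it is attached to and all of its ancestors.
    """
    genotype = [0] * len(parent)
    node = attach_node
    while node != ROOT:
        genotype[node] = 1
        node = parent[node]
    return genotype


def log_likelihood(observed, expected, alpha, beta):
    """Log-probability of the observed calls given the predicted genotypes.

    alpha is the false positive rate and beta the false negative rate.
    Entries of `observed` are 1, 0, or None for missing data (skipped).
    """
    total = 0.0
    for obs_cell, exp_cell in zip(observed, expected):
        for obs, exp in zip(obs_cell, exp_cell):
            if obs is None:
                continue  # missing calls carry no information here
            if exp == 1:
                total += math.log(1 - beta if obs == 1 else beta)
            else:
                total += math.log(alpha if obs == 1 else 1 - alpha)
    return total


# Toy example (made-up data): linear tree 0 -> 1 -> 2 and two cells.
parent = [ROOT, 0, 1]
expected = [expected_genotype(parent, 1), expected_genotype(parent, 2)]
observed = [[1, 0, 0], [1, 1, None]]
score = log_likelihood(observed, expected, alpha=1e-5, beta=0.2)
```

Because the two error rates enter separately, a missed mutation (dropout) is penalised far less than a spurious one, which is the asymmetry that distance-based methods ignore. A search scheme such as a greedy search or MCMC would then compare this score across candidate trees and attachments; that part is omitted here.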
The likelihood of [113] consisted of the probability of the three observable states given either of the two underlying states and the allelic dropout and false positive rates. For the trees themselves, the representation in terms of mutation trees, Figure 2(d), was employed with the aim of uncovering the mutation ordering and evolutionary history. Rather than examining the tree as a whole, first the pairwise ordering of each pair of mutations was considered [113]. In particular the likelihood of the data when the pair of mutations are in the same or different lineages was computed. By simulating genealogical trees [Figure 2(c)], Monte Carlo estimates of the prior probability of mutations sharing a lineage were obtained, resulting in a posterior estimate of the probability of different relationships between each pair of mutations. In simulating genealogical trees, a parameter was introduced to model the relative time of the first branching event. This parameter, which influences the prior distribution, was inferred from the data (an approach known as empirical Bayes).

In order to build the mutation tree, first estimates for the pairwise ancestral relationships of all mutations were obtained. The maximal posterior ordering between each pair was encoded as an edge in a directed graph, weighted by the posterior probability. The mutation tree is then defined as the maximum spanning tree. Specifically, edges were removed to achieve a tree which maximised the remaining weights. Although this procedure returns a tree, it is not necessarily the tree with the highest likelihood as a whole model, since the ancestral relations inferred earlier behave more like parent-child relationships when embedded in the directed graph. For the 18 cancer-related mutations in the 58 single cells of [69], for example, the empirical Bayes estimate of the prior tree structure is highly linear while the resulting maximum spanning tree is rather branched.

BitPhylogeny [114] works on the sample tree
representation, but rather than using the single cells as leaves they are clustered together into clones. Since the number of clones and their composition is unknown, the number of nodes and branches in the cluster tree is also unknown. BitPhylogeny therefore considers in its search space all trees with an arbitrary number of clones. A prior for the trees is derived from a nested

once in the entire tumour's history (which also means that no mutations are lost once they arise), then the mutations in the cells form a perfect phylogeny [36]. However, fast and straightforward phylogenetic algorithms, like hierarchical clustering, NJ, perfect phylogeny or distance-based tree constructions like a minimum spanning tree, can struggle or fail completely when presented with noisy data. Extensions of the perfect phylogeny problem exist to handle imperfect data, but typically aim to remove data to eliminate any inconsistencies. For example they may find the minimum number of mutations to remove [110, 111] or the minimum number of sampled cells [112]. A further difficulty with single-cell data, and where these approaches still struggle, is that the errors are very unbalanced. In single-cell data AD or false negative rates are generally over 10% while false positives are of the order of 10^-5 or less.

To account for this fully, probabilistic approaches have been introduced which select possible phylogenetic trees by how well they explain the single-cell data and which consider the full dataset with all of its inconsistencies and the errors due to the technical challenges of sequencing single cells. In particular the methods start with a given tree, which allows one to check which cells should exhibit which mutations. If a cell is supposed to possess a mutation under the tree model, but it is absent in the observed data, this would be considered a false negative, with a probability of occurrence given by the false negative rate. Conversely if the tree model predicts no mutations, but
one is observed, the model would indicate a false positive. Repeating this for all cells provides the joint probability of observing the data for that particular tree and error rates. This is the likelihood of obtaining the observed data under the tree model and naturally accounts for differences in the error rates. A common approach is to find the tree which maximises the likelihood and fits the data most closely. Alternatively, Bayes' theorem may be employed to find the probability of the tree from the data as a measure of fit of the tree to the data. These underlying ideas link the methods developed for single-cell phylogenetic inference [113, 114, 115, 116], although the exact details of the models and their inference vary, as we summarise in Table 4 and now explore in some detail.

Despite the elevated error rates, an advantage of single-cell data is that, assuming diploid cells and the infinite sites assumption, mutations should be present in either none or one of the alleles, rather than at arbitrary frequencies, and these are the only two cases that need to be tested. Of course the presence of mutations across single-cell samples is not independent, but related by the phylogenetic history, and in general the challenge is

Method                     Phylogenetic representation  Inference
Kim & Simon (2014) [113]   Mutation tree                Pairwise ordering and maximum spanning tree
BitPhylogeny (2015) [114]  Clonal tree                  Tree-structure stick-breaking MCMC
OncoNEM (2016) [115]       Sample/clonal tree           Greedy structure search
SCITE (2016) [116]         Mutation tree†               MCMC

ing over the attachment of sampled cells. The averaging serves to vastly simplify and speed up the tree inference, but a complete tree can be obtained from both approaches. For the phylogenetic inference, both methods utilise a search-and-score framework: OncoNEM with a greedy search and SCITE with a stochastic MCMC scheme. The latter can either provide a single maximum likelihood estimate or a full posterior sample accounting
for uncertainty in the inferred trees. After the greedy search in the sample tree space, OncoNEM [115] then attempts to cluster similar cells together into clones in a second step to provide a clone tree like BitPhylogeny [114]. Both of the more recent methods [115, 116] allow error rates to be learnt from the data and significantly outperform previous single-cell approaches and bulk data methods applied to single-cell data.

The different choice of representation between sample and mutation trees as in Figure 2(d) is mainly one of interest: if the key question concerns the clonal composition of the tumour then a sample tree is more appropriate, while questions concerning the order and evolutionary history of the mutations are better answered with the mutation trees. The choice is also partly dictated by the nature of the single-cell data. Mutations which occur in only one cell, or in all of them, are not informative for the tree reconstruction (although they may still inform the inferred error rates). If the number of remaining mutations is much larger than the number of cells, then the sample tree representation can be much more computationally efficient. When the number of sampled cells dominates, mutation tree inference is much faster. This occurs for example with the leukemia datasets of [80] and especially when a targeted panel is utilised as in [50, 97]. SCITE [116] offers the option to change the representation depending on the data.

In reanalysing previous data, both OncoNEM and SCITE were applied to the 58 sequenced cells of [69], with OncoNEM considering the full set of 712 SNVs and SCITE looking at the 18 cancer-related mutations or the set of 78 non-synonymous ones, due to the different representations. Both found highly linear or sequential trees suggesting monoclonal evolution, and trees with much higher likelihoods than those found previously in

stick-breaking process following [117]. A stick, or unit interval, is chopped into many parts. Each
part is then further divided with the same process, and this is repeated at all scales. At each stage the first part denotes a clone which is a child of the clone at the previous stage, providing the tree structure. The stick-breaking process involves parameters which influence the shape and number of clones in the prior distribution. The process has also been applied to bulk data [32, 58], and BitPhylogeny also includes a model for methylation data [114].

Returning to the single-cell treatment, BitPhylogeny employs the Markov chain Monte Carlo (MCMC) inference scheme of [117]. Essentially one component, like the composition of the clones or the division of the stick at a particular stage, is updated while keeping the rest fixed. In the phylogenetic model of [114] the mutations occur along the edges of the clonal tree with the same rate. This leads to a transition probability of mutations accumulating across the phylogeny so that the appearance of mutations in descendant clones is treated probabilistically. For the inference of the tree itself these probabilistic appearances are averaged over so that the mutations become marginalised out.

The combining of cells into clones can be seen as a way of correcting for the high error rates of SCS (like [80]) while respecting a phylogeny enforced by the tree framework. The MCMC sampling also provides a posterior distribution of trees and parameters, better representing the uncertainty in the phylogeny than a single maximum likelihood estimate. However the inference scheme is relatively computationally costly, which might cause convergence issues for more intricate or larger clonal trees. For the example of the full 712 mutations uncovered in the data of [69], BitPhylogeny [114] finds one large clone consisting of 70% of the cells and some smaller clones that branch off near the root of the tree.

The more recent approaches [115, 116] returned to the full tree model with likelihoods given by the false positives and negatives. From there they
take complementary paths: OncoNEM [115] focuses on the sample tree, Figure 2(c), by marginalising or averaging over the placement of mutations along the edges; SCITE [116] focuses on the mutation tree, Figure 2(d), by averag-

Table 4: Overview of single-cell phylogenetic methods. † SCITE [116] provides the option of using the sample tree representation. Abbreviation: MCMC, Markov Chain Monte Carlo.

[113, 114] with the same data. OncoNEM [115] additionally considered the bladder cancer data set of [71], finding very similar results to the original paper but refining the clonal composition. SCITE [116] found another highly linear tree for the kidney cancer data of [70], again suggesting monoclonal expansion, but a tree with a long trunk region followed by complex branching lower down for the higher quality ER+ breast tumour sample of [78]. This would be consistent with an early build-up of mutations which fixate in the tumour before a more recent division into competing subclones.

Discussion

Studying the evolutionary history of tumours and their heterogeneity covers computational aspects from processing raw sequencing data to resolving the phylogeny. For bulk data, the discovery of the prevalence of mutations in the sample is reasonably accurate, except for low-frequency events. However low-frequency mutations are common and could account for much of a tumour's diversity and be relevant for treatment. Deeper sequencing can help give better accuracy on distinguishing their prevalence and so in resolving their evolutionary history [118]. Apart from the difficulties in resolving low-frequency mutations, the main issue is with untangling the clonal structure from the mixture of DNA from a large number of cells. Computational approaches started focusing on the clustering [31, 38, 47] or the phylogenetic [37, 39, 41] aspects before considering their inference jointly [32, 42, 58, 60].

For single-cell data, the deconvolution is no longer needed, but the need for extensive amplification of the initial DNA material, and feedback within the amplification process, introduce more noise in the sequencing data and make uncovering mutations harder. Computational approaches have each so far focused separately on one facet of single-cell data: mutation calling designed for the specifics of SCS [107], clustering to correct for errors in the calling [80, 108], or probabilistic phylogenetic methods tailored for those high (and unbalanced) errors [113, 114, 115, 116]. Mirroring the advances for bulk data, we can expect the next advances for single-cell based approaches to offer a holistic treatment for the process from sequencing to phylogeny, while also considering a larger range of mutation types. A first step would be to account for the uncertainty in the mutation calling (as performed by [111] for bulk data and as can be extracted from [107] for single cells) in the input for the phylogenetic inference [115, 116], but overall the aim would be joint inference of the mutations and their phylogenetic structure.

Along with combining the raw sequencing data with the tree reconstruction, models will also need to account for further technical errors in single-cell data, like the inadvertent sampling of doublets (as was recently considered in the clustering approach of [108]). Another aspect concerns copy number and aneuploidy changes, which often occur in cancer evolution and can inform the tumour phylogeny. These raise a number of interesting challenges for single-cell data, both for the mutation calling, where the underlying frequencies can differ from {0, 1/2}, and for the tree reconstruction, where such events can impact several mutations at once. For copy number variations in single cells this problem also arises for copy number changes at the different scales of the gene and chromosome level. Algorithms have been developed to find the most parsimonious set of aberration events consistent with the data
[119]. The data concerned were obtained using fluorescent imaging rather than sequencing, but sequencing data will only add higher resolution of small-scale events down to SNVs. Since it has already been shown that CNAs and SNVs can be discovered from the same SCS data [78], we expect further and corresponding modelling frameworks to arise to deal with such data.

A further aspect that CNAs thrust into the spotlight is the infinite sites assumption, that mutations or aberrations only occur once in the evolutionary history and persist afterwards. Although a priori reasonable for sparse point mutations, this is not compatible with back mutations due to LOH. Indeed, developing and employing a probabilistic model allowing for deletions and loss of mutations, bulk sequencing of ovarian cancer uncovered different CNAs affecting the same genomic regions, providing routes to convergent evolution [97]. The copy number changes were still assumed to only occur once, a generalisation of the infinite sites assumption to infinite alleles [59]. Convergent evolution has also been observed at a gene level, with the same driver gene affected in different evolutionary lineages and spatial areas of tumours [120, 121], albeit with mutations at distinct genomic sites consistent with the infinite sites assumption.

At the level of point mutations, the resolution of SCS actually allows one to test the persistence of mutations and for convergent recurrence [122]. Results from SCS datasets strongly indicate that the infinite sites assumption is frequently violated [122]. Although employed in the current single-cell phylogenetic methods [113, 114, 115, 116], and bulk methods, as it greatly simplifies the inference, this will need to be relaxed for more general models which capture the full complexity of tumour evolution. These can build on models allowing (and penalising) a single recurrence [122], allowing

indeed single cells before sequencing to avoid contamination from
doublets.

A related aspect is to consider the spatial resolution and heterogeneity of tumours, as recently performed by [124], and the temporal evolution, for example by following tumour progression through xenograft generations [50]. Spatiotemporal dynamics also play a key role for the spread of tumours [97] and the link between the primary tumour and metastases [111]. Here a key question, and one with great treatment relevance, is whether the metastases were seeded early in the tumour's development or are derived from later cells. Again we can consider which sorts and combinations of data would best help to answer such questions. To understand where metastases fit in the evolutionary history of the primary tumour and their origin, ideally we would possess a high-resolution understanding of the primary tumour with single-cell and deep bulk data. Assuming a single seeding event of each metastasis suggests that their bulk sequencing would suffice (as in the data of [111]), but to test this assumption would also require high resolution of the heterogeneity within the metastases themselves.

As well as answering the question of the origin of metastases, SCS and its ability to provide clear understanding of a tumour's evolutionary history offer great potential for examining tumour development under the action of clinical therapies through serial biopsies or even time course collection of CTCs.

the loss of mutations [97], or with substitution models allowing arbitrary recurrence and loss as in [123] and the methylation model of BitPhylogeny [114]. Alternatively phylogenetic clustering approaches which do not need to enforce the infinite sites assumption, like [46], can be further explored. Important when relaxing the infinite sites assumption will be to account for and appropriately penalise the increase in complexity of more general models.

One general limitation of SCS is that from a relatively small sample of cells it is difficult to obtain an accurate picture of the
prevalence of clones and their mutations, especially for highly heterogeneous tumours. Low-frequency clones are unlikely to be sampled, and those which happen to be sampled would appear more frequent than they really are. Sequencing more cells obviously gives a clearer picture, but at a higher cost, and is likely to recapitulate high-frequency clones while providing little extra information about the low-frequency ones. Deep sequencing of bulk samples, however, can give complementary information on these frequencies, which could also inform the phylogenetic reconstruction. This is highlighted by [50, 97], where selected and targeted SCS was employed to enrich bulk analyses.

The challenge would be to combine both single-cell and bulk data, with their individual characteristics, into a coherent modelling framework. Several bulk samples may help in particular (as for the bulk phylogeny problem [32, 41, 42, 43, 56, 57, 58, 59, 60]), and importantly this sort of framework could inform experiments on which combinations of bulk and single-cell data would offer the most detailed picture of the tumour's history and heterogeneity.

For single-cell data with high coverage and current error rates [78] we can expect a good reconstruction of the mutation order and history with a couple of cells sampled per relevant mutation [116]. For the 40 mutations uncovered in an ER+ breast tumour, even the 47 cells sequenced by [78] offer a detailed picture of the clonal expansion and subsequent separation into subclones [116], since probabilistic phylogenetic models account for the uncertainties in the mutations observed or missed in each cell and combine this information when inferring the tree structure. By considering current single-cell datasets, it would seem that sequencing 50–100 single cells should give a high-resolution picture of the tumour. Sequencing more cells obviously improves the resolution, but at a higher cost, and may be of less marginal value than several very deep bulk sequences. Better
estimates will however arise once methods arrive to combine single-cell and bulk data. Experimentally it is also worthwhile verifying that samples are

Looking to a future where high quality single-cell (and bulk) data is available across many patient samples, as is currently the case for the TCGA and ICGC databases for bulk samples, such data and its analysis will not only help in the identification of further driver mutations but will also allow the identification of recurring mutational patterns. These may be informative for cancer treatment and in predicting cancer progression. Furthermore, combining evolutionary histories from real patient data with evolutionary models (like [125, 126, 127]) offers the possibility to infer the fitness landscape of the tumour's aberrations. Different evolutionary models result in different phylogenetic patterns, so that single-cell analysis could further help to distinguish between different models of tumour evolution like clonal expansion [15], neutral evolution [124, 128], 'Big Bang' models [129] of a sudden selective change followed by mostly neutral evolution, and punctuated evolution [130] of flurries of aberrations followed by clonal expansion.

List of abbreviations

AD     Allelic Dropout
CNA    Copy Number Alteration
CTCs   Circulating Tumour Cells
EM     Expectation Maximisation
FACS   Fluorescence-activated Cell Sorting
ICGC   International Cancer Genome Consortium
LOH    Loss of Heterozygosity
MCMC   Markov Chain Monte Carlo
MDA    Multi-Displacement Amplification
MILP   Mixed Integer Linear Programming
NGS    Next Generation Sequencing
QIP    Quadratic Integer Programming
PCA    Principal Component Analysis
PCR    Polymerase Chain Reaction
SCS    Single Cell Sequencing
SNES   Single Nucleus Exome Sequencing
SNV    Single Nucleotide Variant
TCGA   The Cancer Genome Atlas
WES    Whole Exome Sequencing

Author contribution

JK, KJ and NB wrote the manuscript.

Funding

JK was supported by ERC Synergy Grant 609883 (http://erc.europa.eu/). KJ was supported by SystemsX.ch RTD Grant 2013/150 (http://www.systemsx.ch/).

References

[1] D Hanahan, R A Weinberg, The hallmarks of cancer, Cell 100 (2000) 57–70.
[2] D Hanahan, R A Weinberg, Hallmarks of cancer: the next generation, Cell 144 (2011) 646–674.
[3] P C Nowell, The clonal evolution of tumor cell populations, Science 194 (1976) 23–28.
[4] B Vogelstein, E R Fearon, S R Hamilton, S E Kern, A C Preisinger, M Leppert, Y Nakamura, R White, A M Smits, J L Bos, Genetic alterations during colorectal tumor development, New England Journal of Medicine 319 (1988) 525–532.
[5] D L Dexter, H M Kowalski, B A Blazar, Z Fligiel, R Vogel, G H Heppner, Heterogeneity of tumor cells from a single mouse mammary tumor, Cancer Research 38 (1978) 3174–81.
[6] I J Fidler, Tumor heterogeneity and the biology of cancer invasion and metastasis, Cancer Research 38 (1978) 2651–60.
[7] G H Heppner, Tumor heterogeneity, Cancer Research 44 (1984) 2259–65.
[8] F Michor, Y Iwasa, M A Nowak, Dynamics of cancer progression, Nature Reviews Cancer (2004) 197–205.
[9] L M Merlo, J W Pepper, B J Reid, C C Maley, Cancer as an evolutionary and ecological process, Nature Reviews Cancer (2006) 924–935.
[10] J W Pepper, C Scott Findlay, R Kassen, S L Spencer, C C Maley, SYNTHESIS: cancer research meets evolutionary biology, Evolutionary Applications (2009) 62–70.
[11] R McLendon, et al., Comprehensive genomic characterization defines human glioblastoma genes and core pathways, Nature 455 (2008) 1061–1068.
[12] T J Hudson, et al., International network of cancer genome projects, Nature 464 (2010) 993–998.
[13] L R Yates, P J Campbell, Evolution of the cancer genome, Nature Reviews Genetics 13 (2012) 795–806.
[14] S Nik-Zainal, P Van Loo, D C Wedge, L B Alexandrov, C D Greenman, K W Lau, K Raine, D Jones, J Marshall, M Ramakrishna, et al., The life history of 21 breast cancers, Cell 149 (2012) 994–1007.
[15] M Greaves, C C Maley, Clonal evolution in cancer, Nature 481 (2012) 306–313.
[16] R A Burrell, C Swanton, Re-evaluating clonal dominance in cancer evolution, Trends in Cancer (2016) 263–276.
[17] C C Maley, P C Galipeau, J C Finley, V J Wongsurawat, X Li, C A Sanchez, T G Paulson, P L Blount, R A Risques, P S Rabinovitch, B J Reid, Genetic clonal diversity predicts progression to esophageal adenocarcinoma, Nature Genetics 38 (2006) 468–73.
[18] L M Merlo, N A Shah, X Li, P L Blount, T L Vaughan, B J Reid, C C Maley, A comprehensive survey of clonal diversity measures in Barrett's esophagus as biomarkers of progression to esophageal adenocarcinoma, Cancer Prevention Research (2010) 1388–1397.
[19] A Marusyk, K Polyak, Tumor heterogeneity: causes and consequences, BBA Reviews on Cancer 1805 (2010) 105–117.
[20] L Ding, T J Ley, D E Larson, C A Miller, D C Koboldt, J S Welch, J K Ritchey, M A Young, T Lamprecht, M D McLellan, et al., Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing, Nature 481 (2012) 506–510.
[21] A Marusyk, V Almendro, K Polyak, Intra-tumour heterogeneity: a looking glass for cancer?, Nature Reviews Cancer 12 (2012) 323–334.
[22] R A Burrell, N McGranahan, J Bartek, C Swanton, The causes and consequences of genetic heterogeneity in cancer evolution, Nature 501 (2013) 338–345.
[23] R J Gillies, D Verduzco, R A Gatenby, Evolutionary dynamics of carcinogenesis and why targeted therapy does not work, Nature Reviews Cancer 12 (2012) 487–493.
[24] R A Burrell, C Swanton, Tumour heterogeneity and the evolution of polyclonal drug resistance, Molecular Oncology (2014) 1095–1111.
[25] N McGranahan, C Swanton, Biological and therapeutic impact of intratumor heterogeneity in cancer evolution, Cancer Cell 27 (2015) 15–26.
[26] R Bonavia, M M Inda, W K Cavenee, F B Furnari, Heterogeneity maintenance in glioblastoma: a social network, Cancer Research 71 (2011) 4055–4060.
[27] C A Ortmann, D G Kent, J Nangalia, Y Silber, D C Wedge, J Grinfeld, E J Baxter, C E Massie, E Papaemmanuil, S Menon, et al., Effect of mutation order on myeloproliferative neoplasms, New England Journal of Medicine 372 (2015) 601–612.
[28] M R Stratton, P J Campbell, P A Futreal, The cancer genome, Nature 458 (2009) 719–724.
[29] C Swanton, Intratumor heterogeneity: evolution through space and time, Cancer Research 72 (2012) 4875–4882.
[30] K H Allison, G W Sledge, Heterogeneity and cancer, Oncology 28 (2014) 772–8.

renal cell carcinomas defined by multiregion sequencing, Nature Genetics 46 (2014) 225–233.
[50] P Eirew, A Steif, J Khattra, G Ha, D Yap, H Farahani, K Gelmon, S Chia, C Mar, A Wan, et al., Dynamics of genomic clones in breast cancer patient xenografts at single-cell resolution, Nature 518 (2015) 422–426.
[51] L Oesper, A Mahmoody, B J Raphael, THetA: inferring intra-tumor heterogeneity from high-throughput DNA sequencing data, Genome Biology 14 (2013) R80.
[52] L Oesper, G Satas, B J Raphael, Quantifying tumor heterogeneity in whole-genome and whole-exome sequencing data, Bioinformatics 30 (2014) 3532–3540.
[53] G Ha, A Roth, J Khattra, J Ho, D Yap, L M Prentice, N Melnyk, A McPherson, A Bashashati, E Laks, et al., TITAN: inference of copy number architectures in clonal cell populations from tumor whole-genome sequence data, Genome Research 24 (2014) 1881–1893.
[54] B Li, J Z Li, A general framework for analyzing tumor subclonality using SNP array and DNA sequencing data, Genome Biology 15 (2014) 473.
[55] A Fischer, I Vázquez-García, C J Illingworth, V Mustonen, High-definition reconstruction of clonal composition in cancer, Cell Reports (2014) 1740–1752.
[56] Y Qiao, A R Quinlan, A A Jazaeri, R G Verhaak, D A Wheeler, G T Marth, SubcloneSeeker: a computational framework for reconstructing tumor clone structure for cancer variant interpretation and prioritization, Genome Biology 15 (2014) 443.
[57] N Niknafs, V Beleva-Guthrie, D Q Naiman, R Karchin, Subclonal hierarchy inference from somatic mutations: automatic reconstruction of cancer evolutionary trees from multiregion next generation sequencing, PLoS Computational Biology 11 (2015) e1004416.
[58] A G Deshwar, S Vembu, C K Yung, G H Jang, L Stein, Q Morris, PhyloWGS: Reconstructing subclonal composition and evolution from whole-genome sequencing of tumors, Genome Biology 16 (2015) 35.
[59] M El-Kebir, G Satas, L Oesper, B J Raphael, Inferring the mutational history of a tumor using multi-state perfect phylogeny mixtures, Cell Systems (2016) 43–53.
[60] Y Jiang, Y Qiu, A J Minn, N R Zhang, Assessing intratumor heterogeneity and tracking longitudinal and spatial clonal evolutionary history by next-generation sequencing, Proceedings of the National Academy of Sciences 113 (2016) E5528–E5537.
[61] C D Greenman, E D Pleasance, S Newman, F Yang, B Fu, S Nik-Zainal, D Jones, K W Lau, N Carter, P A Edwards, et al., Estimation of rearrangement phylogeny for cancer genomes, Genome Research 22 (2012) 346–361.
[62] E Purdom, C Ho, C S Grasso, M J Quist, R J Cho, P Spellman, Methods and challenges in timing chromosomal abnormalities within cancer samples, Bioinformatics 29 (2013) 3113–20.
[63] R F Schwarz, A Trinh, B Sipos, J D Brenton, N Goldman, F Markowetz, Phylogenetic quantification of intra-tumour heterogeneity, PLoS Computational Biology 10 (2014) e1003535.
[64] Y Wang, N E Navin, Advances and applications of single-cell sequencing technologies, Molecular Cell 58 (2015) 598–609.
[65] C Gawad, W Koh, S R Quake, Single-cell genome sequencing: current state of the science, Nature Reviews Genetics 17 (2016) 175–188.
[66] N E Navin, The first five years of single-cell cancer genomics and beyond, Genome Research 25 (2015) 1499–1507.
[67] F Tang, C Barbacioru, Y Wang, E Nordman, C Lee, N Xu, X Wang, J Bodeau, B B Tuch, A Siddiqui, et al., mRNA-seq
[31] H Zare, J Wang, A Hu, K Weber, J Smith, D Nickerson, C Song, D Witten, C A Blau, W S Noble, Inferring clonal composition from multiple sections of a breast cancer, PLoS Computational Biology
10 (2014) e003703 W Jiao, S Vembu, A G Deshwar, L Stein, Q Morris, Inferring clonal evolution of tumors from single nucleotide somatic mutations, BMC Bioinformatics 15 (2014) 35 A Schuh, J Becq, S Humphray, A Alexa, A Burns, R Clifford, S M Feller, R Grocock, S Henderson, I Khrebtukova, et al., Monitoring chronic lymphocytic leukemia progression by whole genome sequencing reveals heterogeneous clonal evolution patterns, Blood 120 (2012) 4191–4196 P Van Loo, T Voet, Single cell analysis of cancer genomes, Current Opinion in Genetics & Development 24 (2014) 82–91 N E Navin, Cancer genomics: one cell at a time, Genome Biology 15 (2014) D Gusfield, Algorithms on strings, trees and sequences: computer science and computational biology, Cambridge university press, Cambridge, 1997 F Strino, F Parisi, M Micsinai, Y Kluger, TrAp: a tree approach for fingerprinting subclonal tumor composition, Nucleic Acids Research 41 (2013) e165–e165 A Roth, J Khattra, D Yap, A Wan, E Laks, J Biele, G Ha, S Aparicio, A Bouchard-Cˆot´e, S P Shah, PyClone: statistical inference of clonal population structure in cancer, Nature Methods 11 (2014) 396–398 I Hajirasouliha, A Mahmoody, B J Raphael, A combinatorial approach for analyzing intra-tumor heterogeneity from highthroughput sequencing data, Bioinformatics 30 (2014) i78– i86 C A Miller, B S White, N D Dees, M Griffith, J S Welch, O L Griffith, R Vij, M H Tomasson, T A Graubert, M J Walter, et al., SciClone: inferring clonal architecture and tracking the spatial and temporal patterns of tumor evolution, PLoS Computional Biology 10 (2014) e1003665 M El-Kebir, L Oesper, H Acheson-Field, B J Raphael, Reconstruction of clonal trees and tumor composition from multisample sequencing data, Bioinformatics 31 (2015) i62–i70 S Malikic, A W McPherson, N Donmez, C S Sahinalp, Clonality inference in multiple tumor samples using phylogeny, Bioinformatics 31 (2015) 1349–1356 V Popic, R Salari, I Hajirasouliha, D Kashef-Haghighi, R B West, S Batzoglou, Fast 
and scalable inference of multisample cancer lineages, CoRR, abs/1412.8574 (2014) S Sengupta, J Wang, J Lee, P Măuller, K Gulukota, A Banerjee, Y Ji, Bayclone: Bayesian nonparametric inference of tumor subclones using NGS data., in: Pacific Symposium on Biocomputing, volume 20, 2015, pp 467–478 N Donmez, S Malikic, A W Wyatt, M E Gleave, C C Collins, S C Sahinalp, Clonality inference from single tumor samples using low coverage sequence data, in: International Conference on Research in Computational Molecular Biology, Springer, 2016, pp 83–94 F Marass, F Mouliere, K Yuan, N Rosenfeld, F Markowetz, A phylogenetic latent feature model for clonal deconvolution, Annals of Applied Statistics 10 (2016) 2377–2404 S P Shah, A Roth, R Goya, A Oloumi, G Ha, Y Zhao, G Turashvili, J Ding, K Tse, G Haffari, et al., The clonal and mutational evolution spectrum of primary triple-negative breast cancers, Nature 486 (2012) 395–399 N B Larson, B L Fridley, PurBayes: estimating tumor cellularity and subclonality in next-generation sequencing data, Bioinformatics 29 (2013) 1888–1889 M Gerlinger, S Horswell, J Larkin, A J Rowan, M P Salm, I Varela, R Fisher, N McGranahan, N Matthews, C R Santos, et al., Genomic architecture and evolution of clear cell [59] [60] [61] [62] [63] [64] [65] [66] [67] 15 ACCEPTED MANUSCRIPT [73] [74] [75] [76] [77] [78] [79] [80] [81] [82] [83] [84] CR IP T number variation patterns among single circulating tumor cells of lung cancer patients, Proceedings of the National Academy of Sciences 110 (2013) 21083–21088 A E Dago, A Stepansky, A Carlsson, M Luttgen, J Kendall, T Baslan, A Kolatkar, M Wigler, K Bethel, M E Gross, et al., Rapid phenotypic and genomic change in response to therapeutic pressure in prostate cancer inferred by high content analysis of single circulating tumor cells, PloS One (2014) e101777 J G Lohr, V A Adalsteinsson, K Cibulskis, A D Choudhury, M Rosenberg, P Cruz-Gordillo, J Francis, C.-Z Zhang, A K Shalek, R Satija, et al., Whole exome 
sequencing of circulating tumor cells provides a window into metastatic prostate cancer, Nature Biotechnology 32 (2014) 479 C Yu, J Yu, X Yao, W K Wu, Y Lu, S Tang, X Li, L Bao, X Li, Y Hou, et al., Discovery of biclonal origin and a novel oncogene SLC12A5 in colon cancer by single-cell sequencing, Cell Research 24 (2014) 701–712 M J McConnell, M R Lindberg, K J Brennand, J C Piper, T Voet, C Cowing-Zitron, S Shumilina, R S Lasken, J R Vermeesch, I M Hall, F H Gage, Mosaic copy number variation in human neurons, Science 342 (2013) 632–637 N E Potter, L Ermini, E Papaemmanuil, G Cazzaniga, G Vijayaraghavan, I Titley, A Ford, P Campbell, L Kearney, M Greaves, Single-cell mutational profiling and clonal phylogeny in cancer, Genome Research 23 (2013) 2115–2125 E Papaemmanuil, I Rapado, Y Li, N E Potter, D C Wedge, J Tubio, L B Alexandrov, P Van Loo, S L Cooke, J Marshall, et al., RAG-mediated recombination is the predominant driver of oncogenic rearrangement in ETV6-RUNX1 acute lymphoblastic leukemia, Nature Genetics 46 (2014) 116–125 T Baslan, J Kendall, B Ward, H Cox, A Leotta, L Rodgers, M Riggs, S D’Italia, G Sun, M Yong, et al., Optimizing sparse sequencing of single cells for highly multiplex copy number profiling, Genome Research 25 (2015) 714–724 H C Fan, G K Fu, S P A Fodor, Combinatorial labeling of single cells for gene expression cytometry, Science 347 (2015) E Z Macosko, A Basu, R Satija, J Nemesh, K Shekhar, M Goldman, I Tirosh, A R Bialas, N Kamitaki, E M Martersteck, et al., Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets, Cell 161 (2015) 1202–1214 M L Leung, Y Wang, C Kim, R Gao, J Jiang, E Sei, N E Navin, Highly multiplexed targeted DNA sequencing from single nuclei, Nature Protocols 11 (2016) 214–235 I C Macaulay, M J Teng, W Haerty, P Kumar, C P Ponting, T Voet, Separation and parallel sequencing of the genomes and transcriptomes of single cells using G&T-seq, Nature Protocols 11 (2016) 2081–2103 
Fluidigm, Doublet rate and detection on the C1 IFCs, 2016 White Paper PN 101–2711 A1 A McPherson, A Roth, E Laks, T Masud, A Bashashati, A W Zhang, G Ha, J Biele, D Yap, A Wan, et al., Divergent modes of clonal spread and intraperitoneal mixing in high-grade serous ovarian cancer, Nature Genetics 48 (2016) 758–767 N Saitou, M Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees., Molecular biology and evolution (1987) 406–425 M Gerstung, C Beisel, M Rechsteiner, P Wild, P Schraml, H Moch, N Beerenwinkel, Reliable detection of subclonal single-nucleotide variants in tumour cell populations, Nature Communications (2012) 811 M A DePristo, E Banks, R Poplin, K V Garimella, J R Maguire, C Hartl, A A Philippakis, G Del Angel, M A Ri- NU S [72] [87] [88] MA [71] [86] [89] [90] D [70] TE [69] [85] [91] AC CE P [68] whole-transcriptome analysis of a single cell, Nature Methods (2009) 377–382 N Navin, J Kendall, J Troge, P Andrews, L Rodgers, J McIndoo, K Cook, A Stepansky, D Levy, D Esposito, et al., Tumour evolution inferred by single-cell sequencing, Nature 472 (2011) 90–94 Y Hou, L Song, P Zhu, B Zhang, Y Tao, X Xu, F Li, K Wu, J Liang, D Shao, et al., Single-cell exome sequencing and monoclonal evolution of a JAK2-negative myeloproliferative neoplasm, Cell 148 (2012) 873–885 X Xu, Y Hou, X Yin, L Bao, A Tang, L Song, F Li, S Tsang, K Wu, H Wu, et al., Single-cell exome sequencing reveals single-nucleotide mutation characteristics of a kidney tumor, Cell 148 (2012) 886–895 Y Li, X Xu, L Song, Y Hou, Z Li, S Tsang, F Li, K M Im, K Wu, H Wu, et al., Single-cell sequencing analysis characterizes common and cell-lineage-specific mutations in a muscle-invasive bladder cancer, GigaScience (2012) 1–14 J Wang, H C Fan, B Behr, S R Quake, Genome-wide single-cell analysis of recombination activity and de novo mutation rates in human sperm, Cell 150 (2012) 402–412 C Zong, S Lu, A R Chapman, X S Xie, Genome-wide detection of single-nucleotide and 
copy-number variations of a single human cell, Science 338 (2012) 1622–1626 S Lu, C Zong, W Fan, M Yang, J Li, A R Chapman, P Zhu, X Hu, L Xu, L Yan, et al., Probing meiotic recombination and aneuploidy of single sperm cells by whole-genome sequencing, Science 338 (2012) 1627–1630 Y Hou, W Fan, L Yan, R Li, Y Lian, J Huang, J Li, L Xu, F Tang, X S Xie, J Qiao, Genome analyses of single human oocytes, Cell 155 (2013) 1492–1506 T Voet, P Kumar, P Van Loo, S L Cooke, J Marshall, M.L Lin, M Zamani Esteki, N Van der Aa, L Mateiu, D J McBride, et al., Single-cell paired-end genome sequencing reveals structural variation per cell cycle, Nucleic Acids Research 41 (2013) 6119–6138 A E Hughes, V Magrini, R Demeter, C A Miller, R Fulton, L L Fulton, W C Eades, K Elliott, S Heath, P Westervelt, et al., Clonal architecture of secondary acute myeloid leukemia defined by single-cell sequencing, PLoS Genetics 10 (2014) e1004462 Y Wang, J Waters, M L Leung, A Unruh, W Roh, X Shi, K Chen, P Scheet, S Vattathil, H Liang, et al., Clonal evolution in breast cancer revealed by single nucleus genome sequencing, Nature 512 (2014) 155–160 M L Leung, Y Wang, J Waters, N E Navin, SNES: single nucleus exome sequencing, Genome Biology 16 (2015) 1–10 C Gawad, W Koh, S R Quake, Dissecting the clonal origins of childhood acute lymphoblastic leukemia by single-cell genomics, Proceedings of the National Academy of Sciences 111 (2014) 17947–17952 N Sachs, H Clevers, Organoid cultures for the analysis of cancer phenotypes, Current Opinion in Genetics & Development 24 (2014) 68–73 S F Boj, C.-I Hwang, L A Baker, I I C Chio, D D Engle, V Corbo, M Jager, M Ponz-Sarvise, H Tiriac, M S Spector, et al., Organoid models of human and mouse ductal pancreatic cancer, Cell 160 (2015) 324–338 E Heitzer, M Auer, C Gasch, M Pichler, P Ulz, E M Hoffmann, S Lax, J Waldispuehl-Geigl, O Mauermann, C Lackner, et al., Complex tumor genomes inferred from single circulating tumor cells by array-CGH and next-generation 
sequencing, Cancer Research 73 (2013) 2965–2975 X Ni, M Zhuo, Z Su, J Duan, Y Gao, Z Wang, C Zong, H Bai, A R Chapman, J Zhao, et al., Reproducible copy [92] [93] [94] [95] [96] [97] [98] [99] [100] 16 ACCEPTED MANUSCRIPT [105] [106] [107] [108] [109] [110] [111] [112] [113] [114] [115] [116] T IP CR NU S MA [104] D [103] TE [102] single-cell data, Genome Biology 17 (2016) 86 [117] R P Adams, Z Ghahramani, M I Jordan, Tree-structured stick breaking for hierarchical data, in: Advances in Neural Information Processing Systems, volume 23, 2010, pp 19–27 [118] M Griffith, C A Miller, O L Griffith, K Krysiak, Z L Skidmore, A Ramu, J R Walker, H X Dang, L Trani, D E Larson, et al., Optimizing cancer genome sequencing and analysis, Cell Systems (2015) 210–223 [119] S A Chowdhury, E M Gertz, D Wangsa, K HeselmeyerHaddad, T Ried, A A Schffer, R Schwartz, Inferring models of multiscale copy number evolution for single-tumor phylogenetics, Bioinformatics 31 (2015) i258–i267 [120] M Gerlinger, A J Rowan, S Horswell, J Larkin, D Endesfelder, E Gronroos, P Martinez, N Matthews, A Stewart, P Tarpey, et al., Intratumor heterogeneity and branched evolution revealed by multiregion sequencing, New England Journal of Medicine 366 (2012) 883–892 [121] M Kovac, C Navas, S Horswell, M Salm, C Bardella, A Rowan, M Stares, F Castro-Giner, R Fisher, E C De Bruin, et al., Recurrent chromosomal gains and heterogeneous driver mutations characterise papillary renal cancer evolution, Nature Communications (2015) 6336 [122] J Kuipers, K Jahn, N Beerenwinkel, A statistical test on single-cell data reveals widespread recurrent mutations in tumor evolution, bioRxiv (2016) 094722 [123] H Zafar, A Tzen, N Navin, K Chen, L Nakhleh, SiFit: A method for inferring tumor trees from single-cell sequencing data under finite-site models, bioRxiv (2016) 091595 [124] S Ling, Z Hu, Z Yang, F Yang, Y Li, P Lin, K Chen, L Dong, L Cao, Y Tao, et al., Extremely high genetic diversity in a single tumor points to 
prevalence of non-Darwinian cell evolution, Proceedings of the National Academy of Sciences 112 (2015) E6496–E6505 [125] N Beerenwinkel, T Antal, D Dingli, A Traulsen, K W Kinzler, V E Velculescu, B Vogelstein, M A Nowak, Genetic progression and the waiting time to cancer, PLoS Computional Biology (2007) e225 [126] I Bozic, T Antal, H Ohtsuki, H Carter, D Kim, S Chen, R Karchin, K W Kinzler, B Vogelstein, M A Nowak, Accumulation of driver and passenger mutations during tumor progression, Proceedings of the National Academy of Sciences 107 (2010) 18545–18550 [127] B Waclaw, I Bozic, M E Pittman, R H Hruban, B Vogelstein, M A Nowak, A spatial model predicts that dispersal and cell turnover limit intratumour heterogeneity, Nature 525 (2015) 261–264 [128] M J Williams, B Werner, C P Barnes, T A Graham, A Sottoriva, Identification of neutral tumor evolution across cancer types, Nature Genetics 48 (2016) 238–244 [129] A Sottoriva, H Kang, Z Ma, T A Graham, M P Salomon, J Zhao, P Marjoram, K Siegmund, M F Press, D Shibata, et al., A Big Bang model of human colorectal tumor growth, Nature Genetics 47 (2015) 209–216 [130] R Gao, A Davis, T O McDonald, E Sei, X Shi, Y Wang, P.-C Tsai, A Casasent, J Waters, H Zhang, et al., Punctuated copy number evolution and clonal stasis in triple-negative breast cancer, Nature Genetics 48 (2016) 1119–1130 AC CE P [101] vas, M Hanna, et al., A framework for variation discovery and genotyping using next-generation DNA sequencing data, Nature Genetics 43 (2011) 491–498 A Roth, J Ding, R Morin, A Crisan, G Ha, R Giuliany, A Bashashati, M Hirst, G Turashvili, A Oloumi, et al., JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data, Bioinformatics 28 (2012) 907–913 K Cibulskis, M S Lawrence, S L Carter, A Sivachenko, D Jaffe, C Sougnez, S Gabriel, M Meyerson, E S Lander, G Getz, Sensitive detection of somatic point mutations in impure and heterogeneous cancer 
samples, Nature Biotechnology 31 (2013) 213–219 H Li, A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data, Bioinformatics 27 (2011) 2987–2993 D E Larson, C C Harris, K Chen, D C Koboldt, T E Abbott, D J Dooling, T J Ley, E R Mardis, R K Wilson, L Ding, SomaticSniper: identification of somatic point mutations in whole genome sequencing data, Bioinformatics 28 (2012) 311–317 D C Koboldt, Q Zhang, D E Larson, D Shen, M D McLellan, L Lin, C A Miller, E R Mardis, L Ding, R K Wilson, Varscan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing, Genome Research 22 (2012) 568–576 S I Nikolenko, A I Korobeynikov, M A Alekseyev, BayesHammer: Bayesian clustering for error correction in single-cell sequencing, BMC Genomics 14 (2013) S7 H Zafar, Y Wang, L Nakhleh, N Navin, K Chen, Monovar: single-nucleotide variant detection in single cells, Nature Methods 13 (2016) 505–507 A Roth, A McPherson, E Laks, J Biele, D Yap, A Wan, M A Smith, C B Nielsen, J N McAlpine, S Aparicio, A Bouchard-Cote, S P Shah, Clonal genotype and population structure inference from single-cell tumor sequencing, Nature Methods 13 (2016) 573–576 F Ronquist, M Teslenko, P van der Mark, D L Ayres, A Darling, S Hăohna, B Larget, L Liu, M A Suchard, J P Huelsenbeck, MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space, Systematic biology 61 (2012) 539–542 D Chen, O Eulenstein, D Fernandez-Baca, M Sanderson, Minimum-flip supertrees: Complexity and algorithms, IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) (2006) 165–173 J G Reiter, A P Makohon-Moore, J M Gerold, I Bozic, K Chatterjee, C A Iacobuzio-Donahue, B Vogelstein, M A Nowak, Reconstructing phylogenies of metastatic cancers, bioRxiv (2016) 048157 D Gusfield, Y Frid, D Brown, Integer programming formulations and computations solving phylogenetic and 
population genetic problems with missing or genotypic data, in: Computing and Combinatorics, Springer, Berlin, 2007, pp 51–64 K I Kim, R Simon, Using single cell sequencing data to model the evolutionary history of a tumor, BMC Bioinformatics 15 (2014) 27 K Yuan, T Sakoparnig, F Markowetz, N Beerenwinkel, BitPhylogeny: a probabilistic framework for reconstructing intratumor phylogenies, Genome Biology 16 (2015) 36 E Ross, F Markowetz, OncoNEM: Inferring tumour evolution from single-cell sequencing data, Genome Biology 17 (2016) 69 K Jahn, J Kuipers, N Beerenwinkel, Tree inference for 17 ... Integer Linear Programming Next Generation Sequencing Quadratic Integer Programming Principal Component Analysis Polymerase Chain Reaction Single Cell Sequencing Single Nucleus Exome Sequencing. .. apply to the journal pertain ACCEPTED MANUSCRIPT T Advances in understanding tumour evolution through single- cell sequencing Jack Kuipers1 , Katharina Jahn1 , Niko Beerenwinkel IP Department of... resolution, by sequencing single cells New computational challenges arise when moving from bulk to single- cell sequencing data, leading to the development of novel modelling frameworks In this review,

Date posted: 19/11/2022, 11:40