Lee et al. BMC Genomics (2019) 20:833
https://doi.org/10.1186/s12864-019-6215-y

RESEARCH ARTICLE Open Access

Complete chloroplast genomes of all six Hosta species occurring in Korea: molecular structures, comparative, and phylogenetic analyses

Soo-Rang Lee1, Kyeonghee Kim2, Byoung-Yoon Lee2 and Chae Eun Lim2*

Abstract

Background: The genus Hosta is a group of economically appreciated perennial herbs consisting of approximately 25 species that is endemic to eastern Asia. Due to considerable morphological variability, the genus has been well recognized as a group with taxonomic problems. The chloroplast is a cytoplasmic organelle with its own genome, which is the genome most commonly used for phylogenetic and genetic diversity analyses of land plants. To understand the genomic architecture of Hosta chloroplasts and examine the level of nucleotide and size variation, we newly sequenced four species (H. clausa, H. jonesii, H. minor, and H. venusta) and analyzed six Hosta species (the four plus H. capitata and H. yingeri) distributed throughout South Korea.

Results: The average size of the complete chloroplast genomes for the Hosta taxa was 156,642 bp, with a maximum size difference of ~300 bp. The overall gene content and organization across the six Hosta were nearly identical, with a few exceptions. There was a single tRNA gene deletion in H. jonesii, and four genes were pseudogenized in three taxa (H. capitata, H. minor, and H. jonesii). We did not find major structural variation, but there were minor expansions and contractions of the IR region in three species (H. capitata, H. minor, and H. venusta). Sequence variation was higher in non-coding regions than in coding regions. Four genic and intergenic regions, including two coding genes (psbA and ndhD), exhibited the largest sequence divergence, showing potential as phylogenetic markers. We found compositional codon usage bias toward A/T at the third position. The Hosta plastomes had a comparable number of dispersed and tandem repeats (simple sequence repeats)
to the ones identified in other angiosperm taxa. The phylogeny of 20 Agavoideae (Asparagaceae) taxa, including the six Hosta species, inferred from complete plastome data showed well-resolved monophyletic clades for closely related taxa with high node support.

Conclusions: Our study provides detailed information on the chloroplast genomes of the Hosta taxa. We identified nucleotide diversity hotspots and characterized types of repeats, which can be used for developing molecular markers applicable in various research areas.

Keywords: Hosta, Chloroplast genome, Repeats, Codon usage, Sequence divergence, Phylogeny

* Correspondence: chaelim@korea.kr
National Institute of Biological Resources, 42 Hwangyeong-ro, Seo-gu, Incheon 22689, South Korea
Full list of author information is available at the end of the article

© The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

The genus Hosta Tratt. (Asparagaceae) is a group of economically important perennial herbs distributed exclusively in eastern Asia [1–3]. As the plants have showy flowers and foliage, many Hosta species and cultivars (~2500) are heavily exploited for gardening throughout temperate regions [4]. The plants in Hosta are commonly called plantain lilies (bibichu in Korean) and have grown in popularity in gardens because they are easy to cultivate, tolerating shade and high soil moisture content [5, 6].
Coupled with their horticultural importance, Hosta species are of considerable medical value. Recent studies revealed that the species are rich in saponins and amaryllidaceae alkaloids that inhibit tumor-related and inflammatory activities [7, 8]. Hosta plants have also been used in China and Japan as a folk medicine for treating multiple symptoms, including inflammatory diseases such as urethritis and pharyngolaryngitis [8].

The genus Hosta has been placed in the family Asparagaceae since it was moved from Liliaceae in the 1930s based on its cytological characteristics (2n = 60) [5]. There are approximately 22–25 species in the genus [1, 4], although the number of species (43 in Schmid) [5] and the relationships among the taxa have been problematic due to the extensive variability in morphology. The challenges in the taxonomy of Hosta are also attributed to the confusion caused by the abundance of cultivars (number of cultivars reported > 2500) [2, 4]. The taxonomic difficulties are further complicated by the dearth of diagnostic characters as well as the lack of comparative investigations of taxonomic keys between dried herbarium specimens and living plants from natural populations across varying environments [9]. In Korea, approximately 14 Hosta taxa (11 species, varieties, cultivar) have been reported thus far; however, the number of species varies from to 11 depending on the scholars working on the genus [10].

The organization of chloroplast (CP) genomes is conserved throughout higher plants at the structural and genic levels [11, 12]. Generally, in nearly all land plants, CP genomes consist of a single circular DNA molecule [11] and show a quadripartite structure, i.e., a large single-copy region (LSC) and a small single-copy region (SSC) separated by inverted repeats (IRs). Although the extent of variation is not very large across flowering plants, the genome sizes of chloroplasts differ between species, ranging from 107 kb (Cathaya argyrophylla) to
218 kb (Pelargonium) [11, 12]. There are approximately 120 to 130 genes in chloroplast genomes, contributing to photosynthesis, transcription and translation [12]. The CP genomes are usually transmitted from one of the parents (supposedly with no recombination), mostly the mother in angiosperms [13]. The sequences of the CP genomes are conserved among taxa; thus the genomes often provide robust markers for phylogenetic analysis and divergence time estimation, particularly at higher taxonomic levels [14]. Over a dozen regions within the CP genome, e.g., ndhF, matK, and trnS-trnG, have been widely amplified for species identification, barcoding and phylogenies [15, 16]. Certainly, there is no universal region of the CP genome that works best for all plant taxa. Also, despite the wide utility of CP markers for taxonomic studies, the taxonomy of the most closely related taxa based on those markers often remains unresolved due to the limited variation [15].

With the advent of next generation sequencing (NGS) technology, sequencing whole CP genomes (plastomes) for multiple taxa is feasible at a low cost. Recently, complete plastome sequences have been applied to reconstruct phylogenies of problematic taxa and have successfully resolved enigmatic relationships [14, 17, 18]. Currently, four Hosta plastomes have been sequenced and two of those are publicly available in NCBI Organelle Genome Resources (http://www.ncbi.nlm.nih.gov/genomes) [3, 19, 20]. In this study, we investigated the plastomes of all six Korean Hosta summarized by Chung and Kim [2]. We newly sequenced and assembled the whole plastomes of four species (H. clausa, H. jonesii, H. minor, and H. venusta). The plastomes of H. yingeri (MF990205.1) [19] and H. capitata (MH581151) [20] were downloaded and added to the comparative analysis. The aims of our study were: 1) to determine the complete structure of the plastomes of the four Korean Hosta species; 2) to compare sequence variation
and molecular evolution among the six Korean Hosta; 3) to infer the phylogenetic relationships among the six Korean Hosta and reconstruct the phylogeny of the six species within the subfamily Agavoideae.

Results

Chloroplast genome assembly

The genomic libraries from the four Korean Hosta species sequenced in our study produced ~7.8 to 13 GB. The average number of reads after quality-based trimming was about 8.6 million, and the mean coverage of the four plastome sequences is ~222 (Table 1). The GC content did not vary much across the four plastome sequences, and the average was 37.8% (Table 1). The complete CP genome size of the four Hosta ranged from 156,624 bp (H. clausa) to 156,708 bp (H. jonesii). As in most CP genomes, the four Hosta plastomes assembled in the study exhibited the typical quadripartite structure comprising four regions: a pair of inverted repeats (IRs, 26,676–26,698 bp), the LSC (85,004–85,099 bp) and the SSC (18,225–18,244 bp; Fig. 1; Table 1).

Table 1 Sample information and summary of chloroplast genome characteristics for four Hosta species in Korea. The species acronyms are as follows: CLA, H. clausa; MIN, H. minor; VEN, H. venusta; JON, H. jonesii.
Category | CLA | MIN | VEN | JON
Collection site | Mt. Daeam, Gangwon-do | Mt. Gaejwa, Busan-si | Seoguipo, Jeju-do | Yanga-ri, Gyeongsangnam-do
Voucher No. | NIBR-VP 000063279 | NIBR-VP 0000538762 | NIBR-VP 0000632798 | NIBR-VP 0000538843
NCBI accession No. | MK732315 | MK732316 | MK732314 | MK732318
Reads after trimming | 6,690,938 | 12,171,518 | 5,497,667 | 10,194,165
Mean coverage | 258.7 | 246.7 | 278.1 | 166.9
Total length (bp) | 156,624 | 156,671 | 156,676 | 156,708
LSC length (bp) | 85,004 | 85,094 | 85,099 | 85,088
SSC length (bp) | 18,228 | 18,225 | 18,225 | 18,244
IRa length (bp) | 26,696 | 26,676 | 26,676 | 26,698
IRb length (bp) | 26,696 | 26,676 | 26,676 | 26,698
Total GC content (%) | 37.81 | 37.80 | 37.80 | 37.80
Total number of genes | 132 | 132 | 132 | 131

Chloroplast genome annotation

Including H. yingeri and H. capitata (the CP genome sequences
were downloaded from GenBank), the Korean Hosta plastomes contained 132 genes, which consisted of 78 protein coding genes, 31 tRNA- and 4 rRNA-coding genes (Table 2). There was a single tRNA gene (trnT-UGA) deletion found in H. jonesii, resulting in 131 genes with 30 tRNAs for the species. Except for the one tRNA gene, all remaining genes and the gene composition found in the H. jonesii plastome were identical to those of the other five species. Of the 132 genes, 20 genes (all rRNAs, 8 tRNAs, 6 ribosomal protein coding genes and 2 of the other genes) were duplicated and placed in the IR regions (Table 2). Fifteen genes, including nine protein coding genes (atpF, ndhA, ndhB, petB, petD, rpoC1, rpl2, rpl16, rps12) and six tRNAs, contained one intron, while two genes (clpP and ycf3) contained two introns (Table 2). About 42% of the plastome sequence of the six Korean Hosta species was coding region, encoding tRNAs, rRNAs, and proteins. We found four pseudogenes (infAψ, ycf15ψ, rps16ψ and rps11ψ) in three species, H. capitata, H. minor and H. jonesii (Table 2).

Comparative chloroplast genome structure and polymorphism

The comparative sequence analysis of the six Korean Hosta revealed that the plastome sequences were fairly conserved across the six taxa, with a few variable regions (Fig. 2). Overall, the sequences were more conserved in the coding regions, whereas most of the variation detected was found in non-coding sequence (CNS in Fig. 2) areas. The sequences of exons and UTRs were nearly identical throughout the six taxa except for ycf1 in H. capitata, H. minor and H. venusta (Fig. 2). There was slight variation in rps19 for H. minor and H. venusta. We found the most prominent sequence polymorphism in H. capitata, in the intergenic region between trnK-UUU and trnQ-UUG, due to a 278 bp sequence deletion (Fig. 2). The amplicon size of H. capitata for the region was 231 bp, whereas the size of the amplicons for the remaining five taxa was 509 bp (Additional file 1: Figure S1). The length
difference between H. capitata and the other five Hosta taxa was 278 bp. We further examined sequence variability by computing nucleotide polymorphism (pi) among the six taxa. The average sequence diversity was 0.0007 and pi ranged from to 0.012 (Fig. 3). Overall, the IRs were more conserved (average pi = 0.0002) than the LSC (average pi = 0.0008) and SSC regions (average pi = 0.0016; Fig. 3). The average pi for non-coding regions (0.0011) was higher than that estimated for coding sequences (0.0006). The most highly variable regions (pi > 0.05) include one tRNA (trnL-UAG: 0.012), two protein coding genes (psbA: 0.010, ndhD: 0.012) and one intergenic region (ndhF/rpl32 IGS: 0.12). Based on the DNA sequence polymorphism we examined, the intra-specific polymorphisms were nearly zero except for the ndhD gene in H. clausa (Additional file 1: Table S3 and Table S4). Overall, the ndhD gene showed the highest sequence polymorphism (pi = 0.01033), whereas the remaining three genes exhibited limited variation (Additional file 1: Table S3 and Table S4).

Fig. 1 Chloroplast map of six Hosta species in Korea. The colored boxes represent conserved chloroplast genes. Genes shown inside the circle are transcribed clockwise, whereas genes outside the circle are transcribed counter-clockwise. The small grey bar graph in the inner circle shows the GC content.

We compared the IR and SC boundaries of the six Korean Hosta. Overall, the organization of gene content and the size of genes were highly similar among the six taxa, although there were some distinctive variations. We found expansion and contraction of the IR regions. The largest IR was found in H. capitata despite its having the smallest overall plastome size (Table 1). Although the rps19 genes of all six taxa were placed in the IR region, the location of the gene in H. capitata was the most distant from the boundary between the IR and LSC (Fig. 4). The rpl22 gene
was positioned within the LSC with a 28 bp overlap with the IRa in the five Korean Hosta species other than H. capitata (Fig. 4). The overlap was 14 bp longer in H. capitata, indicating expansion of the IR in the species. The border between IRb and SSC was placed in the ycf1 gene, with a 926–928 bp tail section of the gene located in the IRb for most of the Korean Hosta (Fig. 4). However, the tail section was reduced by ~20 bp in H. minor and H. venusta, suggesting contraction of the IR in the two taxa (Fig. 4).

Codon usage pattern

According to the codon usage analysis, overall 64 codons were present across the six Korean Hosta species, encoding 20 amino acids (AAs). The total number of codons for protein coding genes was 26,505 in all six Korean Hosta. The effective numbers of codons were as follows: 3158 (H. clausa); 4002 (H. capitata); 4006 (H. minor); 5007 (H. venusta); 5018 (H. yingeri) and 4004 (H. jonesii). The most abundant AA among the 20 AAs was leucine (number of codons encoding leucine = 2735, 10.3%), followed by isoleucine (number of codons encoding isoleucine = 2287, 8.6%). Cysteine was the least frequent AA in the Korean Hosta, encoded by only 309 codons (1.2%). The codon usage based on relative synonymous codon usage (RSCU) values did not vary among the six Korean Hosta species except for some decreases found in three AAs of H. venusta and H. yingeri (Additional file 1: Figure S2). Of the six Hosta species, H. venusta and H. yingeri had 47 codons used more frequently than expected at equilibrium (RSCU > 1), while the other four Hosta species showed codon usage bias (RSCU > 1) in 59 codons. All six Hosta had 59 codons used less frequently than expected at equilibrium (RSCU < 1). Codons with A and/or U in the third position take up ~30% and ~24% of all codons, respectively. The codons AUG and UGG, encoding methionine and tryptophan, respectively, showed no bias (RSCU = 1) in all Korean Hosta taxa.

Table 2 List of genes within chloroplast genomes of six Hosta species in Korea. ×2 refers to genes duplicated in the IR regions.
Category | Group of genes | Names of genes
Transcription & Translation | Ribosomal protein, LSU | rpl33, rpl20, rpl36, rpl14, rpl16(×2)^a, rpl22, rpl2(×2)^a, rpl23(×2), rpl32
Transcription & Translation | Ribosomal protein, SSU | rps16, rps2, rps14, rps4, rps18, rps12(×2)^a, rps11, rps8, rps3, rps19(×2), rps7(×2), rps15
Transcription & Translation | RNA polymerase | rpoC2, rpoC1^a, rpoB, rpoA
Transcription & Translation | Ribosomal RNAs | rrn16(×2), rrn23(×2), rrn4.5(×2), rrn5(×2)
Transcription & Translation | Transfer RNAs | trnL-UAA^a, trnF-GAA, trnV-UAC^a, trnM-CAU, trnW-CCA, trnP-UGG, trnH-GUG(×2), trnI-CAU(×2), trnL-CAA(×2), trnV-GAC(×2), trnI-GAU(×2)^a, trnA-UGC(×2)^a, trnR-ACG(×2), trnN-GUU(×2), trnL-UAG, trnR-UCU, trnD-GUC, trnC-GCA, trnQ-UUG, trnE-UUC, trnG-UCC, trnK-UUU^a, trnfM-CAU, trnS-GCU, trnS-UGA, trnS-GGA, trnT-GGU, trnT-UGA, trnY-GUA, trnG-GCC^a, trnT-UGU
Photosynthesis | Photosystem I | psaB, psaA, psaI, psaJ, psaC
Photosynthesis | Photosystem II | psbA, psbK, psbI, psbM, psbD, psbC, psbZ, psbJ, psbL, psbF, psbE, psbB, psbT, psbN, psbH, petN
Photosynthesis | NADH dehydrogenase | ndhJ, ndhK, ndhC, ndhB(×2)^a, ndhF, ndhD, ndhE, ndhG, ndhI, ndhA^a, ndhH
Photosynthesis | Cytochrome b6/f complex | petN, petA, petL, petG, petB^a, petD^a
Photosynthesis | ATP synthase | atpA, atpF^a, atpH, atpI, atpE, atpB
Photosynthesis | Rubisco large subunit | rbcL
Other genes | ATP-dependent protease subunit P | clpP^a
Other genes | Chloroplast envelope membrane protein | cemA
Other genes | Maturase | matK
Other genes | c-type | ccsA
Other genes | Subunit Acetyl-CoA-Carboxylate | accD
Other genes | Photosystem I assembly & stability | ycf3^b, ycf4
Other genes | Conserved ORFs | ycf1, ycf2(×2)
Other genes | Pseudogenes | infAψ (MIN/CAP), ycf15ψ (MIN/CAP), rps16ψ (JON), rps11ψ (JON)
Abbreviations: LSU rRNA, large subunit ribosomal ribonucleic acid; SSU rRNA, small subunit ribosomal ribonucleic acid. Genes marked with a superscript contain a single (a) or two (b) introns. ψ Pseudogenes are present in the species indicated in parentheses. See Table 1 legend for the species acronyms.

Tandem repeat and SSR

The total number of simple
sequence repeats (SSRs) found in the six Korean Hosta ranged from 51 to 59 (Table 3). Of these, the most abundant type of SSR was the mono-nucleotide repeat, with sizes of 10 to 16. Except for the mono-nucleotide SSR with C located in the ndhF gene, nearly all mono-nucleotide repeats were composed of A or T in all six taxa. Over 60% of di-nucleotide SSRs were in the form of "AT", and the repeat number variation ranged from 10 to 18. We found four types of tetra-nucleotide SSRs in four of the six taxa, whereas H. venusta and H. minor had five different types of tetra-nucleotide SSRs (Table 3). There were no tri- and hexa-nucleotide SSRs in the six Korean Hosta. The types of compound SSRs differ across the six Hosta taxa. In addition to the SSRs, we further investigated long repeats and identified 49 repeats consisting of on average 26 palindromic, 15 forward, reverse and complement repeats (Additional file 1: Table S1). The smallest unit size of the repeats was 18, while the largest unit size was 46. The majority of the repeats (ca. 88%) were less than 30 in size, and nearly half of the repeats (ca. 47%) were situated in or at the border of genic regions. Among the repeats within coding regions, palindromic and forward repeats were located on ycf2 (Additional file 1: Table S1).

Fig. 2 Plots of percent sequence identity of the chloroplast genomes of six Korean Hosta species with H. ventricosa (NCBI accession number: NC_032706.1) as a reference. The percentages of sequence identity were estimated and the plots were visualized in mVISTA.

Fig. 3 Plot of sliding window analysis on the whole chloroplast genome for nucleotide diversity (pi) compared among six Hosta species in Korea. The dashed lines are the borders of the LSC, SSC and IR regions.

Fig. 4 Comparisons of LSC, SSC and IR region boundaries among the chloroplast genomes of six Korean Hosta species.

Table 3 Distribution of simple sequence repeats (SSRs) in six Hosta species in Korea. c denotes a compound SSR comprising more than two SSRs adjacent to each other. An SSR was counted as polymorphic when it is polymorphic in at least one species.
Number of SSRs (No. of polymorphic SSRs) Species Unit size Total c
Hosta clausa 34 (11) 10 (1) (1) 51 (12)
Hosta capitata 36 (14) 12 (4) (1) (3) 57 (22)
Hosta minor 35 (14) 10 (2) (1) (1) (1) 53 (19)
Hosta venusta 36 (15) 10 (2) (1) (1) 53 (19)
Hosta yingeri 40 (19) 10 (2) (2) (2) 59 (25)
Hosta jonesii 39 (18) 10 (2) (2) (1) 57 (23)

Phylogenetic inference

We examined the phylogenetic relationships among 20 taxa in the subfamily Agavoideae, including the six Korean Hosta species, using the whole plastome sequences. The overall topologies of the phylogenies computed from Maximum likelihood (ML) and Neighbor joining (NJ) were identical (Fig. 5). On average, the statistical support for each node was fairly high except for a few tip nodes (Fig. 5). In the phylogeny, all seven Hosta taxa (see Table 1 and Additional file 1: Table S2 for the taxa names and GenBank accessions) formed a monophyletic group that is sister to the group of most taxa in Agavoideae (Fig. 5). The genus Anemarrhena (A. asphodeloides) was positioned at the basal node. Among the seven Hosta taxa, H. capitata was the most closely related to H. ventricosa, while H. minor formed another ...