Scientific report: Human-blind probes and primers for dengue virus identification: exhaustive analysis of subsequences present in the human and 83 dengue genome sequences
Human-blind probes and primers for dengue virus identification
Exhaustive analysis of subsequences present in the human and 83 dengue genome sequences

Catherine Putonti1, Sergei Chumakov2, Rahul Mitra3, George E Fox4, Richard C Willson4,5 and Yuriy Fofanov1,4

1 Department of Computer Science, University of Houston, Houston, TX, USA
2 Department of Physics, University of Guadalajara, Guadalajara, Jalisco, Mexico
3 Genomics USA, Houston, TX, USA
4 Department of Biology and Biochemistry, University of Houston, Houston, TX, USA
5 Department of Chemical Engineering, University of Houston, Houston, TX, USA

Keywords
dengue; diagnostic assay; flavivirus; microarray; pathogen identification

Correspondence
C Putonti, University of Houston, 218 PGH, Houston, TX 77204–3058, USA
Fax: +1 713 7431250
Tel: +1 713 7433992
E-mail: putonti@bioinfo.uh.edu

(Received September 2005, revised 22 November 2005, accepted 23 November 2005)

doi:10.1111/j.1742-4658.2005.05074.x

Reliable detection and identification of pathogens in complex biological samples, in the presence of contaminating DNA from a variety of sources, is an important and challenging diagnostic problem for the development of field tests. The problem is compounded by the difficulty of finding a single, unique genomic sequence that is present simultaneously in all genomes of a species of closely related pathogens and absent in the genomes of the host or the organisms that contribute to the sample background. Here we describe 'host-blind probe design', a novel strategy of designing probes based on highly frequent genomic signatures found in the pathogen genomes of interest but absent from the host genome. Upon hybridization, an array of such informative probes will produce a unique pattern that is a genetic fingerprint for each pathogen strain. This multiprobe approach was applied to 83 dengue virus genome sequences, available in public databases, to design and perform in silico microarray experiments. The resulting patterns allow one to
unequivocally distinguish the four major serotypes, and within each serotype to identify the most similar strain among those that have been completely sequenced. In an environment where dengue is indigenous, this would allow investigators to determine if a particular isolate belongs to an ongoing outbreak or is a previously circulating version. Using our probe set, the probability that misdiagnosis at the serotype level would occur is approximately $1 : 10^{150}$.

FEBS Journal 273 (2006) 398–408 © 2005 The Authors Journal compilation © 2005 FEBS

Members of the Flavivirus genus are responsible for a number of diseases, including yellow fever, West Nile, St Louis encephalitis, and dengue fever. One or more of the four serotypes of the dengue virus are endemic in many parts of the world, including all of south-east Asia, parts of Africa, and Southern and Central America. The Aedes aegypti mosquito, which prefers to feed on humans, is a carrier of the dengue virus and is commonly found on the US Gulf Coast according to the CDC (Centers for Disease Control and Prevention) (http://www.cdc.gov/ncidod/dvbid/dengue/index.htm). Although the USA has had relatively few reported cases of dengue, epidemics have occurred in northern Mexico and hence dengue is a growing concern for bordering states. As no vaccine or treatments are available for dengue, early detection of the viral infection is critical to avoid a potential epidemic.

Dengue diagnosis has historically relied on either (a) isolation and growth of the virus in cell cultures in vitro or (b) serological tests. The former, while able to provide a more definitive diagnosis, is time consuming and ill-suited for use in the field. Thus, serology has emerged as the primary method for dengue diagnosis. Serological tests are easy to use and able to accommodate a great number of samples, both necessities when confronting an epidemic. These benefits, however, come at a cost; tests such as hemagglutination inhibition, IgG-ELISA and
MAC-ELISA cannot easily distinguish dengue at the serotype level and are likely to misidentify other flaviviruses as dengue [1,2].

Recently, specific tests have been developed for dengue identification using nucleic acid-based technologies [3] such as the PCR [4–14] and nucleic acid sequence-based amplification (NASBA) assays [15,16], and microarrays of cDNA [17] and oligonucleotides [18]. These methods are both quick and easy to use, while offering reliable serotype-specific detection. The probability of false positives, however, still remains a concern [4,15,16]. Regardless of which technology is used, identification is typically based on the presence of one or a few unique subsequences [15,16,19] as indicators of the target of interest.

Several inherent problems exist in basing detection and/or identification on recognition of unique sequences. First, to select a candidate one must know the pathogen's genomic sequence. Moreover, even if appropriate unique sequences can be found for the entire group, they will not be able to distinguish the various subgroups of the target organism. This would require unique sequences for every subgroup of interest. However, an important observation was made previously, by McGill et al. [20], in that sequences need not be universally present in a group of interest or always absent from other groups to be informative about phylogenetic relationships. Recently it was shown that large numbers of such 'characteristic' sequences exist in the 16S ribosomal RNA [21]. Hence, an alternative approach [21] is to rely on multiple sequences that may individually not be uniquely or universally found in any particular grouping, but which are highly characteristic of particular groups. Recognition is then based on a set of such characteristic sequences that together form a signature [19] for a particular organism or grouping.

In either approach, analysis is further complicated because viruses are obligate intracellular parasites; they are found in
conjunction with host cells whose DNA might contain sequences that would interfere with the test. As separation of viral from host nucleic acids is quite difficult, it is important that the sequences used for virus detection are absent from any potentially contaminating DNA.

We have recently developed a set of novel algorithms that make it possible to efficiently calculate the frequency of all subsequences (n-mers) of length 5–25+ nucleotides in any sequenced genome within no more than a few hours, depending on the genome size. This allows exclusion of all subsequences that are present in a selected host/background genome (e.g. human) in the PCR primer/microarray probe design step, which has greatly increased speed, predictability and effectiveness compared with current design methods. The microarray format is particularly attractive as it permits testing for multiple pathogens simultaneously (e.g. the set of viral pathogens causing similar symptoms in hosts or those rampant in the same regions in which the infection has occurred).

We refer to the sequences that are present in the genome of interest and absent from the host genome as being 'host-blind' (human-blind, mosquito-blind, mouse-blind, rat-blind, etc.) sequences. The greater the number of changes necessary to 'convert' such a host-blind sequence to a sequence found in the host genome, the less likely the host-blind sequence, when used as a PCR primer and/or microarray probe, is to mispair with the host's genomic sequence. Thus, our algorithms can also exclude, in the design step, all host-blind sequences one, two, three, etc. changes away from the nearest host sequence. We refer to such sequences as being host-blind and one [two, three, etc.]
change away from the nearest host sequence (Fig. 1). This new approach can readily be extended to develop assays that are insensitive to the background of a host, such as a food (animal or plant) species, pathogen vector, or any other environmental background for which genomic sequence information is available. By using sequences three or four changes away, we can reduce and possibly eliminate false positives in the presence of background contaminating genomes.

Fig. 1. Sensitivity of host-blind sequences. The pathogen sequences are examples of host-blind sequences that are one possible change (left) and two possible changes (right) away from the nearest human host sequence (above).

In the work presented here, 83 complete dengue virus genomes (representative of all four serotypes) were analyzed in conjunction with the available draft sequence of the human genome in order to find all potential probes/primers that could be used to detect or identify this pathogen in a human-derived sample. The analysis was conducted for all n-mers up to 22 nucleotides long. Our analysis focuses on those n-mers present in dengue and human-blind for all possible changes of one, two, three or four nucleotides. Several hundred human-blind sequences were identified, including those that were (a) present in each individual viral strain's genome, (b) present in all 83 dengue strains regardless of their serotype, (c) unique to each serotype of the virus (present in all strains of the serotype), and (d) unique to each individual viral strain's genome (present in the strain and absent from all other strains).

The results demonstrate that any method of identification based solely on hybridization with a particular unique sequence or a small set (typically less than six) of sequences, as used in the existing tests of dengue diagnosis, would not be able to
reliably accommodate potential mispriming. To minimize the probability of misdiagnosis, sequences that require three, four or more bases to be altered for a mispriming to occur are considered ideal for identification purposes. A multiple probe approach was taken in which detection and identification of any dengue virus strain in the presence of human DNA was developed using characteristic sequences. A sample probe set that could be used in a microarray format was developed and tested by in silico hybridization. This probe set was designed to contain the minimal number of probes necessary to detect and identify dengue at the strain level and to unequivocally distinguish between the four major serotypes.

Results

Human n-mers

In order to generate the set of all probes/primers that are present in the dengue virus genome(s) and absent in (and distant from) the human genome, analysis of the human genome was first necessary. An analytical model was designed to provide us with an estimate of the absence of subsequences from the complete human genome given any one, two, three, or four changes. In addition, calculations were performed for the complete human genomic sequence for n < 18. (The results are included in the Supplementary material.) For n-mers of size less than 15 nucleotides, sequences present in the human genome lie within two changes of all possible sequences of 14 nucleotides. It is only when n = 16 that the human genome does not include a sequence within three changes of any selected n-mer. Therefore, in our search for human-blind n-mers, only values of n for which some sequences are actually absent from the human genome should be considered. Furthermore, it is important that the number of sequences absent from the human genome is large enough that there is a reasonable probability that some will occur within the much smaller viral genome. For large values of n, however, specificity becomes a greater concern. Thus, calculations and analysis were confined to n-mers of size 16–22.

Human-blind sequences present in each individual viral genome

For each of the 83 dengue virus genomes, calculations identified each n-mer, with 16 ≤ n ≤ 22, that is at least one, two, three, or four changes away from the nearest human sequence (Fig. 2); the results are provided in the Supplementary data. The presence or absence of each n-mer was calculated, rather than the frequency of occurrence. There were no 16-mers three changes away from the nearest human sequence in any of the viral genomes and only a single 17-mer (and its complementary 17-mer), which was found in just two of the dengue strains. It is not until 19-mers were considered that all of the dengue genomes were found to have some human-blind sequences at least three changes away from the nearest human sequence. The sequences four changes away are ideal candidates for use in recognizing dengue because it is unlikely that mispriming will occur and a false positive will be reported. Each of the 83 strains had sequences at least four changes away when considering n-mers for n ≥ 21.

Fig. 2. Number of unique human-blind sequences found in each of the 83 complete dengue virus strain genomes considering different sizes of n and number of changes away from the nearest human sequence. Also listed is the average number of 22-mers present in an individual genome and absent from the human sequence given any one, two, three or four changes. The ideal set of probes would be 22-mers that are four changes away; 16-, 17- and 18-mers with one change away will lead to false-positive results because the mismatches could be tolerated in the hybridization between the host target and dengue probes.

Sequences present in all 83 dengue genomes regardless of serotype

Prior to our calculations, we hypothesized that there would be some human-blind n-mers that are present in all 83 dengue genomes. Such sequences could then serve as a reliable indicator of the presence of the virus in a complex sample (e.g. an infected individual). Using all 83 genomic sequences, the number of such n-mers was calculated (Table 1). The number of unique sequences was quite small. There appear to be several reasons for this. First, as n increases, the number of common n-mers decreases. Second, because the number of human-blind sequences in general is smaller for small n-mer sizes (no human-blind n-mers for n < 11 and no human-blind n-mers two changes away from the nearest human sequence for n ≤ 15), the number of human-blind sequences decreases rapidly. It is also obvious that by requiring characteristic sequences be at least two, three, etc., changes away from the nearest human sequence, one dramatically reduces the number of available sequences (Table 1). Such sequences are ideal primers/probes for identification of dengue because of their decreased probability of a false positive, yet there are no n-mers present in all of the 83 dengue sequences and absent from human given any 3+ changes. There is only one 18-mer (and its complement) two changes away from the nearest human sequence and shared by all 83 dengue genomes. This 18-mer could be used to detect the presence of dengue in a human sample; however, it is possible, and even likely, that this sequence could mispair to the host sequence or related flavivirus genomes. Our results lead us to the conclusion that there are no human-blind sequences common to all 83 dengue strains that are at least three changes away from the nearest human sequence.

Table 1. The number of n-mers present simultaneously in all 83 dengue genomes. The first row does not consider if the sequences are absent from the human genome, just that they are present in all of the dengue genomes.
Columns: n = 16, 17, 18, 19, 20, 21, 22
Rows: Present in all dengue (absence in human not considered); Human-blind one change away; Human-blind two changes away; Human-blind three changes away; Human-blind four changes away
Cell values recovered (incomplete): 20 14 0 0 12 0 0 0 0 0 0 0 0

Sequences unique for serotype 1 and 2

We also calculated the number of unique sequences for each dengue type 1 (DENV-1) or DENV-2 serotype, as these types comprise the great majority of the 83 genomes considered (Table 2). It is likely that when a more extensive sample of DENV-3 and DENV-4 genomes becomes available the results will be similar. It is observed that while there are far more human-blind n-mers shared within each group, as the sequence length and stringency increase the number of common n-mers decreases. In the case of DENV-2, there are no n-mers four changes away from the nearest human sequence shared amongst all 46 virus genomes. Further analysis of all serotype-specific sequences is required to verify that they are unique with respect to other flavivirus genomes as well.

Table 2. Human-blind sequences present simultaneously in all DENV-1 and DENV-2 genomes. DENV-3 and DENV-4 are not included, because the few sequences that are available are so similar to each other that the vast majority of the n-mers present in the sequence are unique to the serotype.
Values are listed for n = 16, 17, 18, 19, 20, 21, followed by n = 22.
DENV-1 (28 genomes)
Present in all dengue; absence in human not considered: 664, 558, 458, 392, 336, 284; n = 22: 254
Human-blind one change away: 218, 372, 408, 382, 334, 284; n = 22: 254
Human-blind two changes away: 38, 94, 172, 250, 252 (one value for n = 16–21 missing); n = 22: 242
Human-blind three changes away: 0, 0, 12, 54, 118 (one value for n = 16–21 missing); n = 22: 182
Human-blind four changes away: 0, 0, 0, 0 (two values for n = 16–21 missing); n = 22: 38
DENV-2 (46 genomes)
Present in all dengue; absence in human not considered: 62, 54, 44, 34, 24, 16; n = 22: 10
Human-blind one change away: 24, 38, 40, 34, 24, 16; n = 22: 10
Human-blind two changes away: 0, 0, 16, 16 (two values for n = 16–21 missing); n = 22: 10
Human-blind three changes away: 0, 0, 0, 0, 0, 0; n = 22: (missing)
Human-blind four changes away: 0, 0, 0, 0, 0, 0; n = 22: (missing)

Selecting host-blind primers/probes that are unique to the serotype and host-blind with the most changes possible from the
nearest human sequence would ensure a more reliable method of detection with a lower false-positive rate than the currently available techniques.

Sequences unique for each individual viral genome

For each n-mer present in a dengue genome, the number of other dengue genomes that also contain this particular n-mer was calculated. On average, 4.4% (16-mers) to 8.3% (22-mers) of the viral genome is comprised of n-mers that are not present in any of the other dengue genomes. For example, in the genome of the DENV-4 China Guangzhou B5 strain (AF289029), 75.4% of the 22-mers are unique to this genome. Three genomes (one DENV-1, two DENV-2, and one DENV-4) do not have any 16- to 22-mers that do not occur in any other dengue strain's genomic sequence. Thus, no single sequence could be used as a primer/probe to identify one of these strains. Figure 3 shows the distribution of the percentage of unique n-mers per genome for 16- to 22-mers.

Fig. 3. Distribution of the percentage of n-mers per genome that are unique (i.e. not contained in any of the other dengue genomes considered).

This analysis was next extended to those n-mers that are human-blind. The average number of host-blind n-mers that are unique to a particular genome is less than 8%, and many genomes have no human-blind n-mers at least two changes away from the nearest human sequence. Despite this low average, there are several genomes that have a higher number of human-blind sequences than would be expected. In AF289029, 30 of the 34 16-mers that are two changes away from the nearest human sequence are unique, and 16 014 of its 21 248 22-mers one change away from the nearest human sequence are unique. Figure 4 reflects the distribution of host-blind 22-mers one, two, three or four changes away from the nearest human sequence.

Fig. 4. Distribution of the percentage of human-blind 22-mers per genome that are unique (i.e. not contained in any of the other dengue genomes considered).

A complete report of our calculations of unique n-mers for all 83
genomes is available in the Supplementary data.

In silico array hybridization studies

To reduce the likelihood of false positives, host-blind sequences that are 3+ changes away from the nearest human sequence are ideal for diagnostic purposes. The fact that there is no single sequence meeting this criterion for all of the 83 dengue strains considered suggests the use of multiple probes in a parallel (e.g. array) assay in which a particular unique subset will hybridize with each dengue virus strain. Thus, in the case of a microarray assay, identification would be based not on a single unique sequence but rather on a unique pattern. The set of probes can be designed such that each serotype, or even each strain, will produce a unique pattern. The proposed approach can easily be extended to unsequenced strains of dengue because novel patterns can be compared with all known patterns and the affinity of the new isolate to the known strains can be inferred by clustering techniques. Identification of the particular strain of infection would, for example, allow epidemiologists and public health officials to rapidly determine if an isolate causing hemorrhagic fever represents a new outbreak or belongs to known circulating versions of the virus. The ability to quickly, inexpensively, and reliably diagnose dengue at the strain level in such a manner is not possible with existing techniques.

Based upon the results, presented above, for human-blind n-mers in the 83 dengue genomes, a set of 216 probes (22-mers, at least three changes away from the nearest human sequence) was designed for in silico experiments. The 216-probe set was computed as the minimum number of probes possible to uniquely identify each of the 83 genomes such that each genome was required to contain a subset of at least 28% (in this case 61) of the 216 22-mers or probe sequences. Furthermore, for any two strains' genomes,
the subsets contained in each must differ by at least two sequences. If two strains differ only by two sequences and mutations occur in these two sequences, the strains will be indistinguishable. The likelihood of such an occurrence can be reduced by demanding more sequences for distinguishing between any two individual strains; this, however, will necessitate a larger probe set size. We further stipulated that serotypes are distinguishable from each other such that any strain in one serotype differs significantly from any strain in any of the other three serotypes. To this end, it was required that serotypes be distinguishable by at least 20% of the 216 22-mers or probe sequences contained in any of their strain members. For the 216-probe set, the minimum number of probes differentiating a DENV-1 strain from all other strains belonging to one of the three other serotypes was 70; the corresponding minima were 56 for DENV-2, 65 for DENV-3, and 56 for DENV-4. Thus, in the event that identification is not possible at the strain level as a result of mutations, identification at the serotype level is possible.

To estimate the probability that a misdiagnosis occurs at the serotype level, we assume, in the worst case scenario, that a target sequence in dengue will no longer hybridize with its complementary probe if just one point mutation occurs. For a given sequence of length $l$, there are $l - n$ n-mers. As the length of the dengue genomic sequence is significantly larger than the sizes of $n$ considered here, $l - n \approx l$, such that the probability that $m$ specific n-mers are mutated can be estimated as $m!/l^{m}$. To misdiagnose the infection at the serotype level in the 216-probe set would require at least 56 mutations ($m = 56$) to occur within a dengue genome of approximately 10 000 bp ($l = 10\,000$). Thus, the probability that such an event would occur is approximately $1 : 10^{150}$.

The microarray of 216 probes represents what many researchers can produce in-house at low cost. We determined the pattern that would appear on the microarray given a particular genome's ability to hybridize with the probe sequences. Figure 5 shows the overlapping expression patterns for two pairs of genomes for the set of 216 probes. The distribution of the number of probes present on the 216-probe set microarray for each of 83 genomes ranges from 61 to 95.

Fig. 5. Overlapping in silico expression patterns obtained using the 216 probes with pairs of dengue virus genomes. (A) DENV-1 strain BR/90 AF226685 (green) and DENV-2 M29095 (red); (B) DENV-2 from Cambodia AF309641 (green) and DENV-3 DENCME (red). Probes present in both genomes are shown in black, while probes absent in both genomes are shown in white.

Because dengue infections occur in regions in which other flaviviruses are also prevalent, it is imperative that a diagnostic tool is able to discriminate between the different viruses [2]. Considering the close relative of dengue virus, West Nile virus, we computed the number of probes expected to be present in 26 publicly available strains. Of the 216 dengue probes, at most only three would hybridize with a West Nile strain. In fact, 24 of the West Nile virus strains share these same three 22-mers with the dengue virus strains. In the event that the clinical sample contained West Nile virus and not dengue virus, the expression pattern is expected to show only 1% of the probes hybridized, far less than the 28% required during the set design. Thus, it is highly unlikely that a misidentification of the presence of dengue will be made using the 216-probe set, even in the presence of another
flavivirus.

To estimate the ability of such arrays to distinguish between different strains of a virus, as well as its possible genomic modifications, we introduce the distance ($D$) between any two patterns:

$$D = 1 - \frac{n_{12}}{\min(n_1, n_2)},$$

where $n_1$ and $n_2$ are the numbers of probes present in each of the genomes being compared and $n_{12}$ is the number of probes present in both genomes simultaneously. While there are many different ways to define such a distance between patterns, we chose this definition because of its simplicity; the distance is 0 if both genomes produce the same pattern and 1 if they do not share any of the same probes. By computing the distances between each pair of 83 patterns (the distance matrix), we were able to group virus isolates using PHYLIP's KITSCH (University of Washington, Seattle, WA, USA) [22] and visualize these groups using publicly available software packages [23,24] based on the distances between the patterns observed on the microarray (Fig. 6). The trees generated clearly separate DENV-1 strains from the remainder of the serotypes. DENV-3 and DENV-4 are most closely clustered within their own respective serotypes but are nested within the DENV-2 branch. While this may be attributed to the fact that there are far fewer DENV-3 and DENV-4 genomes available to be included in this analysis, it is much more probable that it is a result of the design process itself. Because each strain must contain a percentage of the overall probe set, sequences that are unique to a strain are, in essence, selected against.

Fig. 6. Dengue groupings based on the similarity of the observed hybridization patterns for the 216-probe hypothetical microarray of 22-mers at least three changes away from the nearest human sequence.

A second probe set was designed containing a random sampling of 4000 sequences (18-mers, two changes away from the nearest human
sequence). Because members of this set were chosen at random, many more sequences unique to a single strain or to just a few strains are included. A tree displaying similarity between isolates was also created using this set (Fig. 7). This allows the dengue strains to be grouped by their origin and the time at which the samples were taken, as well as by serotype. Although a true evolutionary history of the virus can probably not be obtained in this way, the results suggest that an unknown isolate can be characterized with respect to its closest relatives by a comparison of hybridization patterns. Well-developed methods such as k-means [25], self-organising map (SOM) [26] and hierarchical clustering [27] might improve determinations of how a new isolate compares to the previously studied isolates.

Discussion

We found no single sequence 16–22 bases in length present in all 83 dengue sequences and absent given three base changes from the human genome. Therefore, our approach was to use a unique pattern made by a group of oligos (minimum 216) to identify a particular virus strain. A probe set of 216 human-blind sequences was designed that can both diagnose and identify the most similar strain among those whose genome has previously been sequenced. The tests currently being used in the field can at best only distinguish between serotypes. With this decreased specificity, the probability of misdiagnosis remains a major concern. The assay proposed here will essentially reduce the error in misdiagnosing the serotype of dengue to approximately $1 : 10^{150}$. With microarray technology, the 216-probe set can easily be accommodated in a single diagnostic device. Here, just one experiment provides both diagnosis and phylogenetic tree construction. This assay will be able, without necessitating viral isolation, to quickly detect a new pattern signifying a new strain of dengue, almost akin to sequencing the genome.

The ability to identify strains very similar to an unknown isolate in the data set of sequenced dengue
genomes may be especially valuable in epidemiological studies where one would like to rapidly understand the origins of an outbreak of hemorrhagic fever. For example, if such an outbreak were to occur in a location where dengue fever is indigenous, it may be the result of a new variant of the virus which is common in that region, a re-emergence of an earlier version, a continuation of an outbreak from the previous season or the introduction of a new strain as the result of travel. If the needed complete sequences are obtained initially, hybridization arrays will allow these alternative explanations to be monitored on an ongoing basis. Such monitoring might be conducted routinely to detect changes in the local virus population before cases of hemorrhagic fever occur. If reduction of cost and size of this test are critical, the ability to identify dengue at the strain level can be sacrificed such that specificity is available only at the serotype level.

Fig. 7. Dengue groupings obtained from the similarity of the observed hybridization patterns for the hypothetical 4000-probe microarray of randomly sampled 18-mers at least two changes away from the nearest human sequence.

The host-blind technology provides a much more reliable solution than is currently available by greatly decreasing the likelihood that a primer/probe sequence will mispair with the host sequence. We are confident in the ability of this technology for reliable detection. Human-blind sequences have been successfully used as PCR primers generating amplicons matching those predicted computationally (M Anez, R C Willson, et al., unpublished results). The development of host-blind diagnostic microarrays is underway. The human-blind dengue primer/probe sequences can be additionally improved to not only be blind to humans but also to single nucleotide polymorphisms and
organisms known to be associated with humans (e.g. microflora), in addition to pathogens known to be transmitted by the same vector. Furthermore, the computation-based host-blind approach can easily be extended to include not only human hosts but also mouse, rat, chicken, chimpanzee, mosquito, and any other sequenced host genome.

Experimental procedures

Data

Version 3.2.2 of the human genome was used. This partially assembled human genome, located in 944 files containing 860 215 662 base pairs of sequence, is available from GenBank (http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=9606). This version contains 794 007 unknown/unidentified bases. For simplicity, all n-mers containing such characters were excluded from the calculations. Moreover, because the file structure of the genome assembly does not allow the assembly of each chromosome without gaps, all n-mers having a subsequence belonging to one file and the remaining sequence in another file were not included in our calculations. All calculations on the human genome utilized both the original and complementary strand sequences.

Eighty-three complete sequences of the dengue virus (28 DENV-1, 46 DENV-2, two DENV-3, and seven DENV-4) were considered. This set of sequences, including their accession numbers, is provided in the Supplementary data. The dengue genome is approximately 10 kb with minor variations in length. Although dengue is a single-stranded RNA positive-strand virus with no DNA stage, both the original and complementary strand sequences were used in our calculations as a precautionary measure.

Calculations

We have recently developed a set of novel algorithms that make it possible to analyze the occurrence frequency of all short subsequences (n-mers) of length 5–25+ nucleotides in any sequenced genome within a reasonable time (hours) [28–30]. The unique
properties of this new approach are:
• exact consideration of each subsequence of size n (n-mer), in contrast to traditional BLAST-based approaches (no approximate heuristics, so no missed cases);
• consideration of all sequences that can be derived from each n-mer by up to four changes;
• extremely good time efficiency: calculations for up to 19-mers can be performed on a regular desktop PC; calculations for 20 ≤ n < 25+ can be performed using a standard high-performance cluster;
• a large number of background (or host) genomic sequences can be taken into consideration in one run, as needed to avoid possible false positives; and
• it can be used for genomic sequences of all sizes of practical interest, including the human genome (3 Gb).

The basic idea is to associate each of the 4^n possible n-mers with a particular element of a counting array, A, and to define a procedure that converts the n-mer character sequence to the index of that element. It currently takes less than one minute to find the set of all 16-mers present in dengue and absent in the human genome. For PCR assay applications, the spacing of pairs and sets of primers is important; however, these can be designed much faster if consideration can be limited to only the subset of unique n-mers present in the genome of interest. Extension of the algorithms to PCR primer set design is now in progress.

Probe selection

It is our intent to define the minimum optimal set of subsequences, s_min, that can both identify the presence of a particular pathogen and distinguish between different strains of the pathogen. To ensure the sensitivity needed to properly identify a genomic sequence, each genome under consideration must contain at least a subsequences from s_min, where a is a chosen threshold. If applicable, each subclass or type must be distinguishable from any other subclass or type by at least b subsequences. The set of subsequences present in each genome must differ by at least c subsequences from the set present in every other
genome. Furthermore, for each element k in s_min, its complement k′ must not be a member of s_min. In designing this optimal set, an evolutionary programming approach was taken. While many sets, s, may meet the criteria above, a fitness function is needed to measure how ‘good’ a particular set s is in order to determine whether it is, in fact, the optimal solution. For instance, a particular set may exceed the minimum values required of a, b and c and, in fact, have values A, B and G, where A ≥ a, B ≥ b and G ≥ c. While these values contribute to the fitness of a set, the size of the set plays a much greater role, because the aim is to reduce the number of probes needed and thus the cost of the array. Therefore, we chose to evaluate the fitness of a particular set as f(s) = (A + B + G) / (set size), such that for the optimal set, s_min, there exists no other set with a greater fitness value. For sets consisting of host-blind sequences, the number of changes away from the host genome must also be integrated into the assessment of fitness.

Acknowledgements

We would like to express our gratitude to the Texas Learning and Computation Center (TLCC) and to NASA (Grant NNJ04HF43G to GEF and RCW) for partial support of this work. CP’s work was supported by a training fellowship from the Keck Center for Computational and Structural Biology of the Gulf Coast Consortia (NLM Grant no. 5T15LM07093). The authors would also like to thank Dr R Pad Padmanabhan for many interesting discussions and suggestions.

References

1 De Paula SO & da Fonseca BAL (2004) Dengue: a review of the laboratory tests a clinician must know to achieve a correct diagnosis. Braz J Infect Dis 8, 390–398.
2 Kao CL, King CC, Chao DY, Wu HL & Chang GJJ (2005) Laboratory diagnosis of dengue virus infection: current and future perspectives in clinical diagnosis and public health. J Microbiol Immunol Infect 38, 5–16.
3 Relman DA (1998) Detection and identification of previously unrecognized microbial pathogens. Emerg Infect Dis 4, 382–389.
4 Lanciotti RS, Calisher CH, Gubler DJ, Chang GJ & Vorndam AV (1992) Rapid detection and typing of dengue viruses from clinical samples by using reverse transcriptase-polymerase chain reaction. J Clin Microbiol 30, 545–551.
5 Harris E, Roberts TG, Smith L, Selle J, Krammer LD, Valle S, Sandoval E & Balmaseda A (1998) Typing of dengue viruses in clinical specimens and mosquitoes by single-tube multiplex reverse transcriptase PCR. J Clin Microbiol 36, 2634–2639.
6 De Paula SOD, Lima CDM, Torres MP, Pereira MR & da Fonseca BAL (2004) One-step RT-PCR protocols improve the rate of dengue diagnosis compared to two-step RT-PCR approaches. J Clin Virol 30, 297–301.
7 Wang WK, Sung TL, Tsai YC, Kao CL, Chang SM & King CC (2002) Detection of dengue virus replication in peripheral blood mononuclear cells from dengue virus type 2-infected patients by a reverse transcription-real-time PCR assay. J Clin Microbiol 40, 4472–4478.
8 Sudiro TM, Zivny J, Ishiko H, Green S, Vaughn DW, Kalayanorooj S, Nisalak A, Norman JE, Ennis FA & Rothman AL (2001) Analysis of plasma viral RNA levels during acute dengue virus infection using quantitative competitor reverse transcription-polymerase chain reaction. J Med Virol 63, 29–34.
9 Houng HH, Hritz D & Kanesa-thasan N (2000) Quantitative detection of dengue virus using fluorogenic RT-PCR based on 3′-noncoding sequence. J Virol Methods 86, 1–11.
10 Drosten C, Gottig S, Schilling S, Asper M, Panning M, Schmitz H & Gunther S (2002) Rapid detection and quantification of RNA of Ebola and Marburg viruses, Lassa virus, Crimean-Congo hemorrhagic fever virus, Rift Valley fever virus, dengue virus, and yellow fever virus by real-time reverse transcription-PCR. J Clin Microbiol 40, 2323–2330.
11 Shu PY, Chang SF, Kuo YC, Yueh YY, Chien LJ, Sue CL, Lin TH & Huang JH (2003) Development of group- and serotype-specific one-step SYBR green I-based real-time reverse transcription-PCR assay for dengue virus. J Clin Microbiol 41, 2408–2416.
12 Tanaka M (1993) Rapid identification of flavivirus using the polymerase chain reaction. J Virol Methods 41, 311–322.
13 Figueiredo LT, Batista WC, Kashima S & Nassar ES (1998) Identification of Brazilian flaviviruses by a simplified reverse transcription-polymerase chain reaction method using flavivirus universal primers. Am J Trop Med Hyg 59, 357–362.
14 Ito M, Takasaki T, Yamada KI, Nerome R, Tajima S & Kurane I (2004) Development and evaluation of fluorogenic TaqMan reverse-transcriptase PCR assays for detection of dengue virus types 1–4. J Clin Microbiol 42, 5935–5937.
15 Wu SJL, Lee EM, Pubatana R, Shurtliff RN, Porter KR, Suharyono W, Watts DM, King CC, Murphey GS, Hayes CG et al. (2001) Detection of dengue viral RNA using a nucleic acid sequence-based amplification assay. J Clin Microbiol 39, 2794–2798.
16 Baeumner AJ, Schlesinger NA, Slutzki NS, Romano J, Lee EM & Montagna RA (2002) Biosensor for dengue virus detection: sensitive, rapid, and serotype specific. Anal Chem 74, 1442–1448.
17 Schena M, Shalon D, Davis RW & Brown PO (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470.
18 Lipshutz RJ, Fodor SP, Gingeras TR & Lockhart DJ (1999) High density synthetic oligonucleotide arrays. Nat Genet 21, 20–24.
19 Woese CR, Maniloff J & Zablen LB (1980) Phylogenetic analysis of the mycoplasmas. Proc Natl Acad Sci USA 77, 494–498.
20 McGill TR, Jurka J, Sobieski JM, Pickett MH, Woese CR & Fox GE (1986) Characteristic Archaebacterial 16S rRNA Oligonucleotides. Syst Appl Microbiol 7, 194–197.
21 Zhang Z, Willson RC & Fox GE (2002) Identification of characteristic oligonucleotides in the 16S ribosomal RNA sequence dataset. Bioinformatics 18, 244–250.
22 Felsenstein J (2005) PHYLIP (Phylogeny Inference Package), Version 3.6. Distributed by the Author. Department of Genome Sciences, University of Washington, Seattle.
23 Choi J-H, Jung H-Y, Kim H-S & Cho HG (2000) PhyloDraw: a phylogenetic tree drawing system. Bioinformatics 16, 1056–1058.
24 Perrière G & Gouy M (1996) WWW-Query: An on-line retrieval system for biological sequence banks. Biochimie 78, 364–369.
25 MacQueen J (1967) Methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (LeCam LM & Neyman J, eds), pp. 281–297. California Press, Berkeley, California.
26 Dopazo J & Carazo JM (1997) Phylogenetic reconstruction using an unsupervised growing neural network that adopts the topology of a phylogenetic tree. J Mol Evol 44, 226–233.
27 Ward JH (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58, 236–244.
28 Fofanov Y, Belapurkar C, Luo Y, Katili C, Wang J, Belosludtsev Y, Powdrill T, Fofanov V, Li T-B, Chumakov S et al. (2004) How independent are the appearances of n-mers in different genomes? Bioinformatics 20, 2421–2428.
29 Chumakov S, Putonti C, Pettitt BM, Fox GE, Willson RC & Fofanov Y (2004) Using statistical properties of short subsequences in microbial identification. In Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (Valafar F & Valafar H, eds), pp. 363–367. CSREA Press, Las Vegas, NV.
30 Fofanov V, Putonti C, Chumakov S, Pettitt BM & Fofanov Y (2005) Fast Algorithm for the Analysis of the Presence of Short Oligonucleotide Sequences in Genomic Sequences. UH Technical Report #UH-CS-05–11, University of Houston, Houston, Texas [Online http://www.cs.uh.edu/Preprints/preprint/uh-cs-05-11.pdf].

Supplementary material

The following material is available online:
Table S1 Data set of publicly available dengue strains considered.
Table S2 Estimated number of sequences absent in a genome of size Gb
(the approximate size of the human genome excluding highly repeated elements).
Table S3 Number of n-mers absent from the human genome.
Table S4 Number of human-blind n-mers, one, two, three or four changes away, present in each dengue genome.
Table S5 Number of unique human-blind n-mers, one, two, three or four changes away.

This material is available as part of the online article at http://www.blackwell-synergy.com
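
The counting-array idea described under Calculations (each of the 4^n possible n-mers is mapped to one element of an array A) can be illustrated with a minimal Python sketch. This is not the authors' implementation, which is described in refs [28–30]. The function names (`nmer_index`, `presence_array`, `host_blind_nmers`) are hypothetical, the base-4 encoding A=0, C=1, G=2, T=3 is an assumed convention, presence flags are stored instead of occurrence counts, and the "complementary strand" is taken here to mean the reverse complement. Windows containing unknown bases are skipped, as in the Data section. A plain byte array of length 4^n is only practical for small n; the n-mer lengths used in the paper would need a more compact representation.

```python
# Assumed encoding of bases as base-4 digits (illustrative convention).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def nmer_index(nmer):
    """Read an n-mer as a base-4 number, giving an index in [0, 4**n)."""
    index = 0
    for base in nmer:
        index = index * 4 + BASE_CODE[base]
    return index


def reverse_complement(seq):
    """Reverse complement; characters other than A, C, G, T are left as-is."""
    return seq.translate(COMPLEMENT)[::-1]


def valid_windows(seq, n):
    """Yield every length-n window that contains only A, C, G or T."""
    seq = seq.upper()
    for start in range(len(seq) - n + 1):
        window = seq[start:start + n]
        if all(base in BASE_CODE for base in window):
            yield window


def presence_array(sequences, n):
    """Build array A of length 4**n; A[i] is 1 if n-mer i occurs on either strand."""
    present = bytearray(4 ** n)
    for seq in sequences:
        seq = seq.upper()
        for strand in (seq, reverse_complement(seq)):
            for window in valid_windows(strand, n):
                present[nmer_index(window)] = 1
    return present


def host_blind_nmers(pathogen_seqs, host_seqs, n):
    """Return the n-mers found in the pathogen sequences but absent from the host."""
    host = presence_array(host_seqs, n)
    blind = set()
    for seq in pathogen_seqs:
        for window in valid_windows(seq, n):
            if not host[nmer_index(window)]:
                blind.add(window)
    return blind


# Toy example with made-up sequences (not real genome data).
print(sorted(host_blind_nmers(["ACGGTT"], ["ACGTACGT"], 3)))
# prints ['CGG', 'GGT', 'GTT']
```

In this toy example the 3-mer ACG is dropped because it also occurs in the host sequence, while the other three pathogen 3-mers are retained as host-blind candidates.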