NGS Big Data Reference Architecture Considerations

Part of the document: Big data analysis algorithms society 5425 (pages 283-296)

Figure 3 depicts the next generation sequencing big data architecture that summarizes the above discussion. The bottom layer presents the computing nodes, whose resources can be virtualized with either hardware or operating-system-level solutions. A two-step approach (first hardware, then OS virtualization) is also possible, and is beneficial when a kernel or operating system version different from the one installed on the host is required. The next layer constitutes the abstraction of resources: storage (the distributed filesystem, HDFS), computing resources (Apache YARN) and other resources (e.g. network interfaces and the operating system, via Docker). One layer above is dedicated to the NGS data access components and file formats: Hadoop-BAM and the ADAM family of file formats. The two upper layers are the computing/storage engines and the user interfaces. The architecture layout is accompanied by a data exchange layer (i.e. NFS and HTTPFS gateways) at the bottom, together with management and security components to the right of the diagram.

[Figure 3 diagram not reproduced. Recoverable labels: NFS Gateway/HTTPFS Gateway; NameNode; DataNode (four instances); Node Manager (four instances).]

Fig. 3 Scalable cloud ready reference architecture for big data from next generation sequencing

This is by no means a complete picture of all the components involved, but the most important ones are highlighted. For many of the components there is more than one option to choose from: e.g. instead of HBase one can opt for Cassandra, instead of Hadoop YARN one can use Apache Mesos, etc.
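The layering described for Fig. 3 can be written down as a small data structure, which makes the "more than one option" remark concrete. This is only an illustrative sketch: the `Layer` class and the `STACK` tuple are hypothetical names, the placement of HBase in the computing/storage engines layer is an assumption (the text names HBase but not its layer), and the user interface layer is left empty because the text does not itemize it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Layer:
    name: str
    components: tuple
    # Maps a component to an alternative that the text names for it.
    alternatives: dict = field(default_factory=dict)

    def with_alternatives(self):
        """Return the component list with every named alternative swapped in."""
        return tuple(self.alternatives.get(c, c) for c in self.components)


# Ordered bottom to top, following the description of Fig. 3.
STACK = (
    Layer("computing nodes", ("hardware virtualization", "OS-level virtualization")),
    Layer("resource abstraction", ("HDFS", "Apache YARN", "Docker"),
          {"Apache YARN": "Apache Mesos"}),
    Layer("NGS data access and file formats", ("Hadoop-BAM", "ADAM")),
    # Assumption: HBase is placed here; the text does not state its layer.
    Layer("computing/storage engines", ("HBase",), {"HBase": "Cassandra"}),
    Layer("user interfaces", ()),
)

for layer in STACK:
    shown = ", ".join(layer.with_alternatives()) or "(not itemized in the text)"
    print(f"{layer.name}: {shown}")
```

Keeping the alternatives next to the component they replace mirrors the way the text presents them: as drop-in substitutions within a layer rather than as changes to the layering itself.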

5 Open Challenges and Near Future Direction for Cloud-Based Genomics

Currently, systems and APIs such as SparkSeq, SeqPig or the ADAM standards are mostly prototypical. Still, the experience gained with them has shown that only scaling the data processing out to multiple machines in clusters and computational clouds may answer the unmet need for more efficient and more precise analysis of genomic data. The cost of sequencing is significant; however, the cost and effort of data analysis are an even more important issue to solve. The emerging application areas of NGS, such as personalized medicine, but also pharmacogenomics, biomaterials, genomics of plants and animals in agriculture, or biosafety, are waiting for solutions for efficient data analysis.

Scalable solutions that can be run reliably on single machines, on clusters, or in proprietary or public clouds should be the future of genomic data analysis, which is shaped by the growing size of datasets but also by the need for precise and confidential processing of highly valuable patient data. To successfully apply existing cloud-based software in genome bioinformatics, many open issues still need to be addressed. The bioinformatic software scene is rich with various types of software, but at the moment they have little in common with the Hadoop ecosystem and cloud-based technologies. To make these two technology worlds meet, many of the algorithms must be re-implemented with the use of modern computing frameworks, and academic and commercial bioinformatic software needs to switch to emerging NGS big data formats, such as the ADAM family of formats.

In this chapter many challenges at various stages of NGS data processing were discussed. They may be addressed with various tools and solutions. Cloud-based solutions are among the novel and promising ones; still, for a large part of genome bioinformatics the workhorse techniques remain, at the moment, those rooted in classic high-performance computing. The software for quality control, alignment and genomic feature extraction is mainly run in a multi-threaded way under the control of distributed resource management tools such as SGE or LSF. The complexity of genomic data processing pipelines can be managed, and multi-node scalability achieved, using bioinformatics-oriented workflow management systems such as Snakemake [72] or BigDataScript [73]. It is likely, and will be interesting to see, that in the near future those or similar tools will be used to integrate data processing across both HPC and cloud environments.
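What a workflow manager contributes can be illustrated with a dependency graph over the stages named above. The sketch below uses only the Python standard library (`graphlib`, Python 3.9 or later) and is not how Snakemake or BigDataScript are implemented or configured; the job naming scheme, the sample identifiers and the final `merge_results` job are hypothetical.

```python
from graphlib import TopologicalSorter

# Stages taken from the text; each sample passes through them in this order.
STEPS = ("quality_control", "alignment", "feature_extraction")


def build_pipeline(samples):
    """Map every job to the set of jobs it depends on."""
    graph = {}
    for sample in samples:
        previous = None
        for step in STEPS:
            job = f"{step}:{sample}"
            graph[job] = {previous} if previous else set()
            previous = job
    # Hypothetical final job that needs the last stage of every sample.
    graph["merge_results"] = {f"{STEPS[-1]}:{s}" for s in samples}
    return graph


def ready_batches(graph):
    """Group jobs into batches whose members have no unmet dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        batch = sorted(sorter.get_ready())
        batches.append(batch)
        sorter.done(*batch)
    return batches


for number, batch in enumerate(ready_batches(build_pipeline(["s1", "s2"])), start=1):
    print(number, batch)
```

Jobs within one batch are independent of each other, so a scheduler is free to dispatch them to different nodes, whether those are HPC queue slots or cloud instances.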

It is also important that, at the level of human skills, there is a need for truly cross-disciplinary expertise. Bioinformaticians should become aware of cloud technologies, biological and medical experts should know the full potential of the informational value of big genomic datasets, and medicine and life sciences should educate a new generation of physicians and researchers able to formulate requirements for data scientists. All these processes have already been initiated, and they are directed towards the synergy between big data, cloud computing and genomics, which will in turn give a boost to novel medical and biotechnology applications.

References

1. Shendure, J., Ji, H. (eds.): Next-generation DNA sequencing. In: Shendure, J., Ji, H., (eds.) Nature Biotechnology, vol. 26. Nature Publishing Group (2008)

2. DePristo, M.A., Banks, E., Poplin, R., Garimella, K.V., Maguire, J.R., Hartl, C., Philippakis, A.A., del Angel, G., Rivas, M.A., Hanna, M., McKenna, A., Fennell, T.J., Kernytsky, A.M., Sivachenko, A.Y., Cibulskis, K., Gabriel, S.B., Altshuler, D., Daly, M.J.: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43, 491–498 (2011)

3. Duitama, J., Quintero, J.C., Cruz, D.F., Quintero, C., Hubmann, G., Foulquié-Moreno, M.R., Verstrepen, K.J., Thevelein, J.M., Tohme, J.: An integrated framework for discovery and genotyping of genomic variants from high-throughput sequencing experiments. Nucleic Acids Res. 42, e44 (2014)

4. Ozsolak, F., Milos, P.M.: RNA sequencing: advances, challenges and opportunities. Nat. Rev. Genet. 12, 87–98 (2011)

5. Anders, S., McCarthy, D.J., Chen, Y., Okoniewski, M., Smyth, G.K., Huber, W., Robinson, M.D.: Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nat. Protoc. 8, 1765–1786 (2013)

6. Bird, A.P.: CpG-rich islands and the function of DNA methylation. Nature 321, 209–213 (1985)

7. Suzuki, M.M., Bird, A.: DNA methylation landscapes: provocative insights from epigenomics. Nat. Rev. Genet. 9, 465–476 (2008)

8. Tatusova, T.A., Madden, T.L.: BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol. Lett. 174, 247–250 (1999)

9. Pearson, W.R., Lipman, D.J.: Improved tools for biological sequence comparison. In: Proceedings of the National Academy of Sciences of the United States of America (1988)

10. DNA sequencing with chain-terminating inhibitors. In: Proceedings of the National Academy of Sciences of the United States of America, National Academy of Sciences of the United States of America (1977)

11. Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012)

12. Li, R., Zhu, H., Ruan, J., Qian, W., Fang, X., Shi, Z., Li, Y., Li, S., Shan, G., Kristiansen, K., et al.: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20, 265–272 (2010)

13. Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008)

14. Frazee, A.C., Sabunciyan, S., Hansen, K.D., Irizarry, R.A., Leek, J.T.: Differential expression analysis of RNA-seq data at single-base resolution. Biostatistics (Oxford, England) (2014)

15. Anders, S., Huber, W.: Differential expression analysis for sequence count data. Nature Precedings (2010)

16. Robinson, M.D., McCarthy, D.J., Smyth, G.K.: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010)

17. Anders, S., McCarthy, D.J., Chen, Y., Okoniewski, M., Smyth, G.K., Huber, W., Robinson, M.D.: Count-based differential expression analysis of RNA sequencing data using R and Bioconductor (2013)

18. Barrett, T., Troup, D.B., Wilhite, S.E., Ledoux, P., Evangelista, C., Kim, I.F., Tomashevsky, M., Marshall, K.A., Phillippy, K.H., Sherman, P.M., Muertter, R.N., Holko, M., Ayanbule, O., Yefanov, A., Soboleva, A.: NCBI GEO: archive for functional genomics data sets-10 years on. Nucleic Acids Res. 39, D1005–D1010 (2011)

19. Kodama, Y., Shumway, M., Leinonen, R.: The sequence read archive: explosive growth of sequencing data. Nucleic Acids Res. 40, D54–D56 (2012)

20. Cochrane, G., Akhtar, R., Bonfield, J., Bower, L., Demiralp, F., Faruque, N., Gibson, R., Hoad, G., Hubbard, T., Hunter, C., Jang, M., Juhos, S., Leinonen, R., Leonard, S., Lin, Q., Lopez, R., Lorenc, D., McWilliam, H., Mukherjee, G., Plaister, S., Radhakrishnan, R., Robinson, S., Sobhany, S., Hoopen, P.T., Vaughan, R., Zalunin, V., Birney, E.: Petabyte-scale innovations at the European Nucleotide Archive. Nucleic Acids Res. 37, D19–25 (2009)

21. Kwok, P.Y.: Single Nucleotide Polymorphisms. Humana, Totowa, NJ (2003)

22. Okoniewski, M.J., Meienberg, J., Patrignani, A., Szabelska, A., Mátyás, G., Schlapbach, R.: Precise breakpoint localization of large genomic deletions using PacBio and Illumina next-generation sequencers. BioTechniques 54, 98–100 (2013)

23. Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L., et al.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009)

24. Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009)

25. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R.: 1000 Genome Project Data Processing Subgroup: The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England) 25, 2078–2079 (2009)

26. Li, H., Ruan, J., Durbin, R.: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18, 1851–1858 (2008)

27. Saunders, C.T., Wong, W.S.W., Swamy, S., Becq, J., Murray, L.J., Cheetham, R.K.: Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics (Oxford, England) 28, 1811–1817 (2012)

28. Thomas, M.F., Ansel, K.M.: Construction of small RNA cDNA libraries for deep sequencing. Methods Mol. Biol. (Clifton, N.J.) 667, 93–111 (2010)

29. Kornblihtt, A.R., Schor, I.E., Allo, M., Dujardin, G., Petrillo, E., Muñoz, M.J.: Alternative splicing: a pivotal step between eukaryotic transcription and translation. Nat. Rev. Mol. Cell Biol. 14, 153–165 (2013)

30. Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., Salzberg, S.L.: TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013)

31. Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., Gingeras, T.R.: STAR: ultrafast universal RNA-seq aligner. Bioinformatics (Oxford, England) 29, 15–21 (2013)

32. Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D.R., Pimentel, H., Salzberg, S.L., Rinn, J.L., Pachter, L.: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7, 562–578 (2012)

33. Li, B., Dewey, C.N.: RSEM: accurate transcript quantification from RNA-seq data with or without a reference genome. BMC Bioinf. 12, 323 (2011)

34. Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q., Chen, Z., Mauceli, E., Hacohen, N., Gnirke, A., Rhind, N., di Palma, F., Birren, B.W., Nusbaum, C., Lindblad-Toh, K., Friedman, N., Regev, A.: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29, 644–652 (2011)

35. White, T.: Hadoop: The Definitive Guide. O’Reilly Media, Inc. (2012)

36. Franklin, M.: Spark Becomes Top Level Apache Project

37. Ousterhout, K., Wendell, P., Zaharia, M., Stoica, I.: Sparrow: Distributed, low latency scheduling. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pp. 69–84. SOSP ’13, New York, NY, USA, ACM (2013)

38. Bykov, S., Geller, A., Kliot, G., Larus, J., Pandya, R., Thelin, J.: Orleans: Cloud computing for everyone. In: ACM Symposium on Cloud Computing (SOCC 2011), ACM (2011)

39. O’Driscoll, A., Daugelaite, J., Sleator, R.D.: ‘Big data’, Hadoop and cloud computing in genomics. J. Biomed. Inf. 774–781 (2013)

40. Dove, E.S., Joly, Y., Tassé, A.M.: Genomic cloud computing: legal and ethical points to consider. Eur. J. Hum. Genet. (2014)

41. Kuo, A.M.H.: Opportunities and challenges of cloud computing to improve health care services. J. Med. Internet Res. 13 (2011)

42. Dai, L., Gao, X., Guo, Y., Xiao, J., Zhang, Z.: Bioinformatics clouds for big data manipulation. J. Med. Internet Res. 36(6), 4031–4036 (2012)

43. Jimerson, B.: Software Architecture for High Availability in the Cloud

44. Apache: Spark programming guide (2014)

45. Xie, J., Yin, S., Ruan, X., Ding, Z., Tian, Y., Majors, J., Manzanares, A., Qin, X.: Improving MapReduce performance through data placement in heterogeneous Hadoop clusters. In: 2010 IEEE International Symposium on Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), pp. 1–9 (2010)

46. Kumar, V.: Running Hadoop in the Cloud

47. Apache: Spark SQL programming guide (2014)

48. Apache: Parquet (2014)

49. He, Y., Lee, R., Huai, Y., Shao, Z., Jain, N., Zhang, X., Xu, Z.: RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems. In: 2011 IEEE 27th International Conference on Data Engineering (ICDE), pp. 1199–1208 (2011)

50. Niemenmaa, M., Kallio, A., Schumacher, A., Klemelä, P., Korpelainen, E., Heljanko, K.: Hadoop-BAM: directly manipulating next generation sequencing data in the cloud. Bioinformatics 28, 876–877 (2012)

51. Schumacher, A., Pireddu, L., Niemenmaa, M., Kallio, A., Korpelainen, E., Zanetti, G., Heljanko, K.: SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop. Bioinformatics 30, 119–120 (2014)

52. Wiewiórka, M.S., Messina, A., Pacholewska, A., Maffioletti, S., Gawrysiak, P., Okoniewski, M.J.: SparkSeq: fast, scalable, cloud-ready tool for the interactive genomic data analysis with nucleotide precision. Bioinformatics 2652–2653 (2014)

53. Nordberg, H., Bhatia, K., Wang, K., Wang, Z.: BioPig: a Hadoop-based analytic toolkit for large-scale sequence data. Bioinformatics 29, 3014–3019 (2013)

54. Leo, S., Santoni, F., Zanetti, G.: Biodoop: bioinformatics on Hadoop. In: IEEE International Conference on Parallel Processing Workshops, 2009. ICPPW’09, pp. 415–422 (2009)

55. Massie, M., Nothaft, F., Hartl, C., Kozanitis, C., Schumacher, A., Joseph, A.D., Patterson, D.A.: ADAM: Genomics formats and processing patterns for cloud scale computing. Technical Report UCB/EECS-2013-207, EECS Department, University of California, Berkeley (2013)

56. McCabe, C.: How Improved Short-Circuit Local Reads Bring Better Performance and Security to Hadoop. http://blog.cloudera.com/blog/2013/08/how-improved-short-circuit-local-reads-bring-better-performance-and-security-to-hadoop/ (2013)

57. Callaghan, B., Pawlowski, B., Staubach, P.: NFS version 3 protocol specification. Technical report, RFC 1813, Network Working Group (1995)

58. Dove, E.S., Joly, Y., Tassé, A.M., Burton, P., Chisholm, R., Fortier, I., Goodwin, P., Harris, J., Hveem, K., Kaye, J., et al.: Genomic cloud computing: legal and ethical points to consider. Eur. J. Hum. Genet. (2014)

59. Beck, M., Haupt, V.J., Roy, J., Moennich, J., Jäkel, R., Schroeder, M., Isik, Z.: GeneCloud: Secure cloud computing for biomedical research. In: Trusted Cloud Computing, pp. 3–14. Springer (2014)

60. Hortonworks. Manage Security Policy for Hive & HBase with Knox & Ranger. http://hortonworks.com/hadoop-tutorial/manage-security-policy-hive-hbase-knox-ranger/ (2014)

61. Sharma, P.P., Navdeti, C.P.: Securing big data Hadoop: a review of security issues, threats and solution. Int. J. Comput. Sci. Inf. Technol. 5 (2014)

62. Merelli, I., Pérez-Sánchez, H., Gesing, S., D’Agostino, D.: Managing, analysing, and integrating big data in medical bioinformatics: open problems and future perspectives. BioMed Res. Int. 2014 (2014)

63. Cock, P.J., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., et al.: Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009)

64. Holland, R.C., Down, T.A., Pocock, M., Prlić, A., Huen, D., James, K., Foisy, S., Dräger, A., Yates, A., Heuer, M., et al.: BioJava: an open-source framework for bioinformatics. Bioinformatics 24, 2096–2097 (2008)

65. Wadkar, S., Siddalingaiah, M.: Apache Ambari. In: Pro Apache Hadoop, pp. 399–401. Springer (2014)

66. Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., et al.: Apache Hadoop YARN: Yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing, p. 5. ACM (2013)

67. Franklin, M.: The Berkeley data analytics stack: Present and future. In: 2013 IEEE International Conference on Big Data, pp. 2–3 (2013)

68. Xiao, W., Ji, C.L., Li, J.D.: Design and implementation of massive data retrieving based on cloud computing platform. Appl. Mech. Mater. 303, 2235–2240 (2013)

69. Turnbull, J.: The Docker Book: Containerization is the new virtualization. James Turnbull (2014)

70. Team, R.C., et al.: R: A language and environment for statistical computing (2012)

71. Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., et al.: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80 (2004)

72. Köster, J., Rahmann, S.: Snakemake - a scalable bioinformatics workflow engine. Bioinformatics 28, 2520–2522 (2012)

73. Cingolani, P., Sladek, R., Blanchette, M.: BigDataScript: a scripting language for data pipelines. Bioinformatics 31, 10–16 (2015)

Features in High-Dimensional Problems

Michał Dramiński, Michał J. Dąbrowski, Klev Diamanti, Jacek Koronacki and Jan Komorowski

Abstract The availability of very large data sets in Life Sciences, provided earlier by technological breakthroughs such as microarrays and more recently by various forms of sequencing, has created both challenges in analyzing these data and new opportunities. A promising, yet underdeveloped approach to Big Data, not limited to Life Sciences, is the use of feature selection and classification to discover interdependent features. Traditionally, classifiers have been developed for the best quality of supervised classification. In our experience, more often than not, rather than obtaining the best possible supervised classifier, the Life Scientist needs to know which features contribute best to classifying observations (objects, samples) into distinct classes and what the interdependencies are between the features that describe the observations. Our underlying hypothesis is that the interdependent features and rule networks not only reflect some syntactical properties of the data and classifiers but may also convey meaningful clues about true interactions in the modeled biological system. In this chapter we develop further our method of Monte Carlo Feature Selection and Interdependency Discovery (MCFS and MCFS-ID, respectively), which is particularly well suited for high-dimensional problems, i.e., those where each observation is described by very many features, often many more features than the number of observations. Such problems are abundant in Life Science applications. Specifically, we define Inter-Dependency Graphs (termed, somewhat confusingly, ID Graphs), which are directed graphs of interactions between features extracted by aggregating information from the classification trees constructed by the MCFS algorithm. We then proceed with modeling interactions on a finer level with rule networks. We discuss some of the properties of the ID graphs and make a first attempt at validating our hypothesis on a large gene expression data set for CD4+ T-cells. MCFS-ID and ROSETTA, including the Ciruvis approach, offer a new methodology for analyzing Big Data: from feature selection, through identification of feature interdependencies, to classification with rules according to decision classes, to construction of rule networks. Our preliminary results confirm that MCFS-ID is applicable to the identification of interacting features that are functionally relevant, while rule networks offer a complementary picture with finer resolution of the interdependencies on the level of feature-value pairs.

Keywords MCFS-ID · ROSETTA · Ciruvis · High-dimensional problems · Gene expression data

We thank the reviewer for providing valuable and detailed comments.

M. Dramiński · M.J. Dąbrowski · J. Koronacki
Institute of Computer Science, Polish Acad. Sci, Ordona 21, Warsaw, Poland
e-mail: Michal.Draminski@ipipan.waw.pl

M.J. Dąbrowski
e-mail: Michal.Dabrowski@ipipan.waw.pl

J. Koronacki
e-mail: Jacek.Koronacki@ipipan.waw.pl

K. Diamanti
Department of Cell and Molecular Biology, Uppsala University, Box 596, Uppsala, Sweden
e-mail: Klev.Diamanti@icm.uu.se

J. Komorowski (corresponding author)
Department of Cell and Molecular Biology, Uppsala University and Institute of Computer Science, Polish Acad. Sci, Uppsala, Sweden
e-mail: Jan.Komorowski@icm.uu.se

© Springer International Publishing Switzerland 2016
N. Japkowicz and J. Stefanowski (eds.), Big Data Analysis: New Algorithms for a New Society, Studies in Big Data 16, DOI 10.1007/978-3-319-26989-4_12

1 Introduction

Technical developments of the last decades that enable researchers to deal with Big Data allow one to look much deeper into the workings of complex systems, in particular those from Life Sciences. Specifically, within genomic studies, the Encyclopedia of DNA Elements (ENCODE; cf. [1,2]), genome-wide association studies (GWAS; cf. [3]), the NIH Roadmap project (cf. [4]) and the 1000 Genomes project (A Deep Catalog of Human Genetic Variation; cf. [5]) are ongoing projects that offer continuously updated information about transcribed and non-transcribed genomic regions, epigenetic marks, RNA-seq, SNPs, the biological traits they are associated with, and integrated maps of genetic variation from 1092 human genomes. Combining, processing and analyzing these data provides a conceptual bridge to investigating features and functions of the human genome. Associating the genetic background of a specific disease with, for instance, histone modifications, transcription factors, chromatin states and mutations has become one of the major streams of investigation in genomics today. However, it requires a deep understanding of the mechanisms and the development of complex computational techniques for managing and merging data sources so as to enable biological interpretation of the results.

Another major challenge in the analysis of such biological data is due to their sizes: a small number of objects (records, samples) versus a number of attributes or features per record that is greater by several orders of magnitude. Such problems are usually referred to as “small n large p problems”, which we renamed to “small n large d problems” (where n stands for the number of records and p or d for that of features).

It is by no means only in Life Sciences that problems of this type appear and have to be dealt with. Indeed, in our own work we have met challenging problems of commercial origin, including transactional data from a major multinational fast-moving consumer goods company and geological data from oil wells operated by a major oil company.

Independently of whether the data are to explain a quantitative trait (as in regression) or a categorical one (as in classification), such problems are quite different from typical data mining ones, in which the number of features is much smaller than the number of samples. In a sense, these are ill-posed problems. This fact is immediately clear in the case of linear regression fitted by ordinary least squares, where one gets a few linear equations with many more unknowns.
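The ill-posedness in the least-squares case can be seen on a toy example with two records and three features. The numbers below are made up for illustration; the point is only that infinitely many coefficient vectors fit the data exactly, so the fit alone cannot single one out.

```python
def residual_sum_of_squares(X, y, w):
    """Sum of squared residuals of the linear model X w against targets y."""
    return sum(
        (sum(x_ij * w_j for x_ij, w_j in zip(row, w)) - y_i) ** 2
        for row, y_i in zip(X, y)
    )


# Two records (equations), three features (unknowns): n = 2 < d = 3.
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
y = [1.0, 2.0]


def exact_fit(t):
    """One member of the solution family: (1, 2, 0) + t * (-1, -1, 1)."""
    return [1.0 - t, 2.0 - t, t]


for t in (0.0, 1.0, -3.0):
    w = exact_fit(t)
    print(w, residual_sum_of_squares(X, y, w))
```

Every value of `t` gives a zero residual, which is why some additional constraint (for example sparsity, as discussed later in the text) is needed to obtain a meaningful model.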

For two-class classification, at least from the geometrical point of view, the task is trivial, since in a d-dimensional space as many as d + 1 points can be divided into two arbitrary and disjoint subsets by some hyperplane, provided that these points do not lie in a proper affine subspace of the d-dimensional space. This is a well-known result on the Vapnik-Chervonenkis dimension for the class of halfspaces in $\mathbb{R}^d$. It is another matter that the found hyperplane, or any other classification rule, should have the ability to generalize.
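The geometric statement can be checked by brute force in the plane (d = 2). The sketch below tries every labeling of a point set and searches for a separating line with the classical perceptron rule; this is a swapped-in illustration, not a method from the chapter. The function names and the `max_epochs` cut-off are assumptions, and a return value of `None` only means that no separator was found within that budget.

```python
from itertools import product


def find_separating_hyperplane(points, labels, max_epochs=1000):
    """Perceptron search for (w, b) with label * (w . x + b) > 0 for all points."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in zip(points, labels):
            activation = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            if label * activation <= 0:
                # Move the hyperplane towards the misclassified point.
                w = [w_i + label * x_i for w_i, x_i in zip(w, x)]
                b += label
                mistakes += 1
        if mistakes == 0:
            return w, b
    return None  # no separator found within the epoch budget


def is_shattered(points):
    """True if every +1/-1 labeling of the points was separated."""
    return all(
        find_separating_hyperplane(points, labels) is not None
        for labels in product((-1, 1), repeat=len(points))
    )


triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]            # d + 1 = 3 points
square = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]  # d + 2 = 4 points

print("triangle shattered:", is_shattered(triangle))
print("square shattered:", is_shattered(square))
```

The three corner points can be split in all eight ways, whereas the four corners of the square fail on the XOR-style labeling, which no line can realize. With many more features than samples, perfect separation of the training data is therefore expected and says nothing about generalization.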

In conclusion, since it is the rule rather than the exception that most features in the data are not informative, being essentially noise or redundant, it is of utmost importance to select the few that are informative and that may form a basis for class prediction or for a proper regression model. Accordingly, before building a classifier or a regression model, or while building either of them, we would like to find out which features are specifically linked to the problem at hand and should be included in the solution.

Mathematically, properly formulated sparsity constraints should be included when seeking a solution. This requirement can be fulfilled by regularization or randomization. In the later sections of this exposition, we shall confine ourselves to the latter approach.
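To give a feel for the randomization route, the sketch below scores features by repeatedly drawing small random feature subsets, evaluating a simple classifier on each subset, and crediting the subset's accuracy to its members. It is a generic illustration under stated assumptions and not the MCFS algorithm, which according to the abstract aggregates information from classification trees. The nearest-centroid classifier, the subset size, the number of subsets and the synthetic data are all hypothetical choices.

```python
import random


def loo_nearest_centroid_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a nearest-centroid classifier on a feature subset."""
    correct = 0
    for i, row in enumerate(rows):
        best_label, best_dist = None, float("inf")
        for label in sorted(set(labels)):
            members = [r for j, r in enumerate(rows) if j != i and labels[j] == label]
            dist = 0.0
            for f in features:
                centroid = sum(r[f] for r in members) / len(members)
                dist += (row[f] - centroid) ** 2
            if dist < best_dist:
                best_label, best_dist = label, dist
        correct += best_label == labels[i]
    return correct / len(rows)


def randomized_feature_scores(rows, labels, subset_size=4, n_subsets=300, seed=0):
    """Average accuracy of the random subsets in which each feature took part."""
    rng = random.Random(seed)
    d = len(rows[0])
    totals, counts = [0.0] * d, [0] * d
    for _ in range(n_subsets):
        subset = rng.sample(range(d), subset_size)
        accuracy = loo_nearest_centroid_accuracy(rows, labels, subset)
        for f in subset:
            totals[f] += accuracy
            counts[f] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


# Synthetic "small n large d" data: 8 samples, 20 features.
# Feature 0 carries the class signal; features 1..19 are uniform noise.
data_rng = random.Random(1)
labels = [0, 0, 0, 0, 1, 1, 1, 1]
rows = [[10.0 * y + data_rng.random()] + [data_rng.random() for _ in range(19)]
        for y in labels]

scores = randomized_feature_scores(rows, labels)
best = max(range(len(scores)), key=scores.__getitem__)
print("highest-scoring feature:", best)
```

Each individual subset is small compared with d, so every classifier is fitted in a better-conditioned setting, and the sparsity arises from keeping only the features whose aggregated score stands out.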

Regarding classification, and from now on we shall deal with classification only, one more important issue should be emphasized. More often than not, rather than obtaining the best possible classifier, the Life Scientist and, we claim, any other user of classifiers needs to know which features contribute best to classifying observations (samples) into distinct classes and what the interdependencies between such informative features are.

In the area of feature ranking and selection, very significant progress has been achieved. For a brief account, up to 2002, see [6] and for an extensive survey and somewhat later developments see [7].

Without going into details, let us note that feature selection can be wrapped around the classifier construction or directly built (embedded) into it, rather than performed prior to addressing the classification task per se by filtering out noisy features first and keeping only informative ones for building a classifier. An early and successful method with embedded feature selection, not mentioned by [7], was developed by Tibshirani et al. (see [8,9]). More recently and within non-filter approaches, a Bayesian technique of automatic relevance determination,
