Advanced statistical methods for the analysis of large data sets

DOCUMENT INFORMATION

Basic information

Format
Pages	402
File size	31.24 MB

Content

Studies in Theoretical and Applied Statistics
Selected Papers of the Statistical Societies

For further volumes: http://www.springer.com/series/10104

Series Editors
Spanish Society of Statistics and Operations Research (SEIO): Ignacio Garcia Jurado
Société Française de Statistique (SFdS): Avner Bar-Hen
Società Italiana di Statistica (SIS): Maurizio Vichi
Sociedade Portuguesa de Estatística (SPE): Carlos Braumann

Agostino Di Ciaccio, Mauro Coli, Jose Miguel Angulo Ibáñez
Editors

Advanced Statistical Methods for the Analysis of Large Data-Sets

Editors

Agostino Di Ciaccio
University of Roma “La Sapienza”
Dept. of Statistics
P.le Aldo Moro
00185 Roma, Italy
agostino.diciaccio@uniroma1.it

Mauro Coli
Dept. of Economics
University “G. d’Annunzio”, Chieti-Pescara
V.le Pindaro 42
Pescara, Italy
coli@unich.it

Jose Miguel Angulo Ibáñez
Departamento de Estadística e Investigación Operativa, Universidad de Granada
Campus de Fuentenueva s/n
18071 Granada, Spain
jmangulo@ugr.es

This volume has been published thanks to the contribution of ISTAT - Istituto Nazionale di Statistica.

ISBN 978-3-642-21036-5
e-ISBN 978-3-642-21037-2
DOI 10.1007/978-3-642-21037-2
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2012932299
© Springer-Verlag Berlin Heidelberg 2012

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of
a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Editorial

Dear reader, on behalf of the four Scientific Statistical Societies: SEIO, Sociedad de Estadística e Investigación Operativa (Spanish Society of Statistics and Operations Research); SFdS, Société Française de Statistique (French Statistical Society); SIS, Società Italiana di Statistica (Italian Statistical Society); SPE, Sociedade Portuguesa de Estatística (Portuguese Statistical Society), we inform you that this is a new book series of Springer entitled Studies in Theoretical and Applied Statistics, with two lines of books published in the series: “Advanced Studies” and “Selected Papers of the Statistical Societies.”

The first line of books offers constantly up-to-date information on the most recent developments and methods in the fields of Theoretical Statistics, Applied Statistics, and Demography. Books in this series are solicited in constant cooperation among Statistical Societies and need to show high-level authorship, formed by a team preferably drawn from different groups, so as to integrate different research points of view.

The second line of books proposes a fully peer-reviewed selection of papers on specific relevant topics organized by editors, also on the occasion of conferences, to show their research directions and developments in important topics, quickly and informally, but with high quality. The explicit aim is to summarize and communicate current knowledge in an accessible way. This line of books will not include proceedings of conferences and aims to become a premier communication medium in the scientific statistical community by obtaining an impact factor, as is the case for other book series such as, for example, “Lecture Notes in Mathematics.”

The volumes of Selected Papers of the Statistical Societies will cover a broad
scope of theoretical, methodological as well as application-oriented articles, surveys, and discussions. A major purpose is to show the intimate interplay between various, seemingly unrelated domains and to foster the cooperation among scientists in different fields by offering well-based and innovative solutions to urgent problems of practice.

On behalf of the founding statistical societies, I wish to thank Springer, Heidelberg and in particular Dr. Martina Bihn for the help and constant cooperation in the organization of this new and innovative book series.

Maurizio Vichi

Preface

Many research studies in the social and economic fields involve the collection and analysis of large amounts of data. These data sets vary in their nature and complexity: they may be one-off or repeated, and they may be hierarchical, spatial, or temporal. Examples include textual data, transaction-based data, medical data, and financial time series.

Today most companies use IT to support all automated business functions, so thousands of billions of digital interactions and transactions are created and carried out by various networks daily. Some of these data are stored in databases; most end up in log files that are discarded on a regular basis, losing valuable information that is potentially important but often hard to analyze. The difficulties could be due to the data size, for example thousands of variables and millions of units, but also to the assumptions about the generation process of the data, the randomness of the sampling plan, the data quality, and so on. Such studies are subject to the problem of missing data when enrolled subjects do not have data recorded for all variables of interest. More specific problems may relate, for example, to the merging of administrative data or the analysis of a large number of textual documents.

Standard statistical techniques are usually not well suited to manage this type of data, and many authors have proposed extensions of classical techniques or completely new
methods. The huge size of these data sets and their complexity require new strategies of analysis, sometimes subsumed under the terms “data mining” or “predictive analytics.” The inference uses frequentist, likelihood, or Bayesian paradigms and may utilize shrinkage and other forms of regularization. The statistical models are multivariate and are mainly evaluated by their capability to predict future outcomes.

This volume contains a peer-reviewed selection of papers whose preliminary versions were presented at the meeting of the Italian Statistical Society (SIS), held 23–25 September 2009 in Pescara, Italy. The theme of the meeting was “Statistical Methods for the analysis of large datasets,” a topic that is attracting increasing interest from the scientific community. The meeting brought together a large number of scientists and experts, especially from Italy and European countries, with 156 papers and a large number of participants. It was a highly appreciated opportunity for discussion and mutual knowledge exchange.

This volume is structured in 11 chapters according to the following macro topics:

• Clustering large data sets
• Statistics in medicine
• Integrating administrative data
• Outliers and missing data
• Time series analysis
• Environmental statistics
• Probability and density estimation
• Application in economics
• WEB and text mining
• Advances on surveys
• Multivariate analysis

In each chapter, we included only three to four papers, selected after a careful review process carried out after the conference, thanks to the valuable work of a good number of referees. Selecting only a few representative papers from the interesting program proved to be a particularly daunting task.

We wish to thank the referees who carefully reviewed the papers. Finally, we would like to thank Dr. M. Bihn and A. Blanck from Springer-Verlag for the excellent cooperation in publishing this volume.

It is worth noting the wide range of different topics
included in the selected papers, which underlines the large impact of the theme “statistical methods for the analysis of large data sets” on the scientific community. This book aims to offer new ideas, methods, and original applications to deal with the complexity and high dimensionality of data.

Agostino Di Ciaccio, Sapienza Università di Roma, Italy
Mauro Coli, Università G. d’Annunzio, Pescara, Italy
José Miguel Angulo Ibáñez, Universidad de Granada, Spain

Contents

Part I Clustering Large Data-Sets

Clustering Large Data Set: An Applied Comparative Study
Laura Bocci and Isabella Mingo

Clustering in Feature Space for Interesting Pattern Identification of Categorical Data
Marina Marino, Francesco Palumbo and Cristina Tortora

Clustering Geostatistical Functional Data
Elvira Romano and Rosanna Verde

Joint Clustering and Alignment of Functional Data: An Application to Vascular Geometries
Laura M. Sangalli, Piercesare Secchi, Simone Vantini, and Valeria Vitelli

Part II Statistics in Medicine

Bayesian Methods for Time Course Microarray Analysis: From Genes’ Detection to Clustering
Claudia Angelini, Daniela De Canditiis, and Marianna Pensky

Longitudinal Analysis of Gene Expression Profiles Using Functional Mixed-Effects Models
Maurice Berk, Cheryl Hemingway, Michael Levin, and Giovanni Montana

A Permutation Solution to Compare Two Hepatocellular Carcinoma Markers
Agata Zirilli and Angela Alibrandi

... is often used for analyzing and summarizing information within these large data sets. The growing size of data sets and databases has led to increased demand for good clustering methods for analysis ...

Date posted: 08/08/2018, 16:54