Studies in Theoretical and Applied Statistics
Selected Papers of the Statistical Societies

For further volumes: http://www.springer.com/series/10104

Series Editors
Spanish Society of Statistics and Operations Research (SEIO): Ignacio Garcia Jurado
Société Française de Statistique (SFdS): Avner Bar-Hen
Società Italiana di Statistica (SIS): Maurizio Vichi
Sociedade Portuguesa de Estatística (SPE): Carlos Braumann

Agostino Di Ciaccio, Mauro Coli, José Miguel Angulo Ibáñez
Editors

Advanced Statistical Methods for the Analysis of Large Data-Sets

Editors

Agostino Di Ciaccio
University of Roma “La Sapienza”
Dept. of Statistics
P.le Aldo Moro
00185 Roma
Italy
agostino.diciaccio@uniroma1.it

Mauro Coli
Dept. of Economics
University “G. d’Annunzio”, Chieti-Pescara
V.le Pindaro 42
Pescara
Italy
coli@unich.it

José Miguel Angulo Ibáñez
Departamento de Estadística e Investigación Operativa, Universidad de Granada
Campus de Fuentenueva s/n
18071 Granada
Spain
jmangulo@ugr.es

This volume has been published thanks to the contribution of ISTAT - Istituto Nazionale di Statistica

ISBN 978-3-642-21036-5
e-ISBN 978-3-642-21037-2
DOI 10.1007/978-3-642-21037-2
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2012932299

© Springer-Verlag Berlin Heidelberg 2012

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a
specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Editorial

Dear reader, on behalf of the four scientific statistical societies: SEIO, Sociedad de Estadística e Investigación Operativa (Spanish Society of Statistics and Operations Research); SFdS, Société Française de Statistique (French Statistical Society); SIS, Società Italiana di Statistica (Italian Statistical Society); and SPE, Sociedade Portuguesa de Estatística (Portuguese Statistical Society), we inform you that this is a new Springer book series entitled Studies in Theoretical and Applied Statistics, with two lines of books published in the series: “Advanced Studies” and “Selected Papers of the Statistical Societies.”

The first line of books offers constantly up-to-date information on the most recent developments and methods in the fields of Theoretical Statistics, Applied Statistics, and Demography. Books in this series are solicited in constant cooperation among the Statistical Societies and need to show high-level authorship, formed by a team preferably drawn from different groups so as to integrate different research points of view.

The second line of books proposes a fully peer-reviewed selection of papers on specific relevant topics organized by editors, also on the occasion of conferences, to show their research directions and developments in important topics, quickly and informally, but with high quality. The explicit aim is to summarize and communicate current knowledge in an accessible way. This line of books will not include proceedings of conferences and wishes to become a premier communication medium in the scientific statistical community by obtaining an impact factor, as is the case for other book series such as, for example, “Lecture Notes in Mathematics.”

The volumes of Selected Papers of the Statistical Societies will cover a broad scope
of theoretical, methodological, and application-oriented articles, surveys, and discussions. A major purpose is to show the intimate interplay between various, seemingly unrelated domains and to foster cooperation among scientists in different fields by offering well-founded and innovative solutions to urgent problems of practice.

On behalf of the founding statistical societies, I wish to thank Springer, Heidelberg, and in particular Dr. Martina Bihn, for their help and constant cooperation in the organization of this new and innovative book series.

Maurizio Vichi

Preface

Many research studies in the social and economic fields involve the collection and analysis of large amounts of data. These datasets vary in their nature and complexity: they may be one-off or repeated, and they may be hierarchical, spatial, or temporal. Examples include textual data, transaction-based data, medical data, and financial time series.

Today most companies use IT to support all automated business functions, so thousands of billions of digital interactions and transactions are created and carried out by various networks daily. Some of these data are stored in databases; most end up in log files that are discarded on a regular basis, losing information that is potentially valuable but often hard to analyze. The difficulties may be due to the data size, for example thousands of variables and millions of units, but also to the assumptions about the generation process of the data, the randomness of the sampling plan, the data quality, and so on. Such studies are subject to the problem of missing data when enrolled subjects do not have data recorded for all variables of interest. More specific problems may relate, for example, to the merging of administrative data or the analysis of a large number of textual documents.

Standard statistical techniques are usually not well suited to manage this type of data, and many authors have proposed extensions of classical techniques or completely new methods. The huge
size of these datasets and their complexity require new strategies of analysis, sometimes subsumed under the terms “data mining” or “predictive analytics.” The inference uses frequentist, likelihood, or Bayesian paradigms and may utilize shrinkage and other forms of regularization. The statistical models are multivariate and are mainly evaluated by their capability to predict future outcomes.

This volume contains a peer-reviewed selection of papers whose preliminary versions were presented at the meeting of the Italian Statistical Society (SIS), held 23–25 September 2009 in Pescara, Italy. The theme of the meeting was “Statistical Methods for the analysis of large datasets,” a topic that is gaining increasing interest from the scientific community. The meeting brought together a large number of scientists and experts, especially from Italy and other European countries, with 156 papers and a large number of participants. It was a highly appreciated opportunity for discussion and mutual exchange of knowledge.

This volume is structured in 11 chapters according to the following macro topics:

• Clustering large datasets
• Statistics in medicine
• Integrating administrative data
• Outliers and missing data
• Time series analysis
• Environmental statistics
• Probability and density estimation
• Application in economics
• WEB and text mining
• Advances on surveys
• Multivariate analysis

In each chapter, we included only three to four papers, selected after a careful review process carried out after the conference, thanks to the valuable work of a good number of referees. Selecting only a few representative papers from the interesting program proved to be a particularly daunting task.

We wish to thank the referees who carefully reviewed the papers. Finally, we would like to thank Dr. M. Bihn and A. Blanck of Springer-Verlag for their excellent cooperation in publishing this volume.

It is worth noting the wide range of different topics included in the selected papers,
which underlines the large impact of the theme “statistical methods for the analysis of large data sets” on the scientific community. This book aims to offer new ideas, methods, and original applications for dealing with the complexity and high dimensionality of data.

Agostino Di Ciaccio, Sapienza Università di Roma, Italy
Mauro Coli, Università G. d’Annunzio, Pescara, Italy
José Miguel Angulo Ibáñez, Universidad de Granada, Spain

Contents

Part I Clustering Large Data-Sets

Clustering Large Data Set: An Applied Comparative Study
Laura Bocci and Isabella Mingo

Clustering in Feature Space for Interesting Pattern Identification of Categorical Data
Marina Marino, Francesco Palumbo, and Cristina Tortora

Clustering Geostatistical Functional Data
Elvira Romano and Rosanna Verde

Joint Clustering and Alignment of Functional Data: An Application to Vascular Geometries
Laura M. Sangalli, Piercesare Secchi, Simone Vantini, and Valeria Vitelli

Part II Statistics in Medicine

Bayesian Methods for Time Course Microarray Analysis: From Genes’ Detection to Clustering
Claudia Angelini, Daniela De Canditiis, and Marianna Pensky

Longitudinal Analysis of Gene Expression Profiles Using Functional Mixed-Effects Models
Maurice Berk, Cheryl Hemingway, Michael Levin, and Giovanni Montana

A Permutation Solution to Compare Two Hepatocellular Carcinoma Markers
Agata Zirilli and Angela Alibrandi

... is often used for analyzing and summarizing information within these large data sets. The growing size of data sets and databases has led to increased demand for good clustering methods for analysis ...