
Computational intelligence in data mining


DOCUMENT INFORMATION

Basic information

Format
Number of pages: 895
File size: 28.22 MB

Contents

Advances in Intelligent Systems and Computing 711

Himansu Sekhar Behera, Janmenjoy Nayak, Bighnaraj Naik, Ajith Abraham (Editors)

Computational Intelligence in Data Mining
Proceedings of the International Conference on CIDM 2017

Advances in Intelligent Systems and Computing, Volume 711

Series editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland, e-mail: kacprzyk@ibspan.waw.pl

The series “Advances in Intelligent Systems and Computing” contains publications on theory, applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all disciplines such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, life science are covered. The list of topics spans all the areas of modern intelligent systems and computing such as: computational intelligence, soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms, social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and society, cognitive science and systems, Perception and Vision, DNA and immune based systems, self-organizing and adaptive systems, e-Learning and teaching, human-centered and human-centric computing, recommender systems, intelligent control, robotics and mechatronics including human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, knowledge management, intelligent agents, intelligent decision making and support, intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia.

The publications within “Advances in Intelligent Systems and Computing” are primarily proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. An important characteristic feature of the series is the short publication
time and world-wide distribution. This permits a rapid and broad dissemination of research results.

Advisory Board

Chairman
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India, e-mail: nikhil@isical.ac.in

Members
Rafael Bello Perez, Universidad Central “Marta Abreu” de Las Villas, Santa Clara, Cuba, e-mail: rbellop@uclv.edu.cu
Emilio S. Corchado, University of Salamanca, Salamanca, Spain, e-mail: escorchado@usal.es
Hani Hagras, University of Essex, Colchester, UK, e-mail: hani@essex.ac.uk
László T. Kóczy, Széchenyi István University, Győr, Hungary, e-mail: koczy@sze.hu
Vladik Kreinovich, University of Texas at El Paso, El Paso, USA, e-mail: vladik@utep.edu
Chin-Teng Lin, National Chiao Tung University, Hsinchu, Taiwan, e-mail: ctlin@mail.nctu.edu.tw
Jie Lu, University of Technology, Sydney, Australia, e-mail: Jie.Lu@uts.edu.au
Patricia Melin, Tijuana Institute of Technology, Tijuana, Mexico, e-mail: epmelin@hafsamx.org
Nadia Nedjah, State University of Rio de Janeiro, Rio de Janeiro, Brazil, e-mail: nadia@eng.uerj.br
Ngoc Thanh Nguyen, Wroclaw University of Technology, Wroclaw, Poland, e-mail: Ngoc-Thanh.Nguyen@pwr.edu.pl
Jun Wang, The Chinese University of Hong Kong, Shatin, Hong Kong, e-mail: jwang@mae.cuhk.edu.hk

More information about this series at http://www.springer.com/series/11156

Editors
Himansu Sekhar Behera, Department of Computer Science and Engineering & Information Technology, Veer Surendra Sai University of Technology, Sambalpur, Odisha, India
Janmenjoy Nayak, Department of Computer Science and Engineering, Sri Sivani College of Engineering (SSCE), Srikakulam, Andhra Pradesh, India
Ajith Abraham, Machine Intelligence Research (MIR) Lab, Auburn, WA, USA, and Technical University of Ostrava, Ostrava, Czech Republic
Bighnaraj Naik, Department of Computer Application, Veer Surendra Sai University of
Technology, Sambalpur, Odisha, India

ISSN 2194-5357, ISSN 2194-5365 (electronic)
Advances in Intelligent Systems and Computing
ISBN 978-981-10-8054-8, ISBN 978-981-10-8055-5 (eBook)
https://doi.org/10.1007/978-981-10-8055-5
Library of Congress Control Number: 2017964255

© Springer Nature Singapore Pte Ltd 2019

This book was advertised with a copyright holder The Editor(s)/The Author(s) in error, whereas the publisher holds the copyright.

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

In the next decade, the growth of data both structured and unstructured will present challenges as well as
opportunities for industries and academia. The amount of data stored in modern databases is huge, owing to the availability and popularity of the Internet. Thus, the information needs to be summarized and structured in order to maintain effective decision-making. With the explosive growth of data volumes, it is essential that real-time information of use to the business can be extracted to deliver better insights to decision-makers, understand complex patterns, etc. When the quantity of data, the dimensionality and the complexity of the relations in the database are beyond human capacities, there is a requirement for intelligent data analysis techniques that can discover useful knowledge from data. While data mining evolves with innovative learning algorithms and knowledge discovery techniques, computational intelligence harnesses the results of data mining to become more intelligent than ever. In the present scenario of computing, computational intelligence tools offer adaptive mechanisms that enable the understanding of data in complex and changing environments.

The Fourth International Conference on “Computational Intelligence in Data Mining (ICCIDM 2017)” is organized by Veer Surendra Sai University of Technology (VSSUT), Burla, Sambalpur, Odisha, India, during 11–12 November 2017. ICCIDM is an international forum for the presentation of research and developments in the fields of data mining and computational intelligence. More than 250 prospective authors submitted their research papers to the conference. After a thorough double-blind peer review process, the editors have selected 78 papers. The proceedings of ICCIDM is a mix of papers presenting some of the latest findings and research of the authors. It has been a great honour for us to edit the proceedings. We have greatly enjoyed working in cooperation with the international advisory, programme and technical committees to call for papers, review papers and finalize papers to be included in the
proceedings. This international conference on CIDM aims at encompassing a new breed of engineers and technologists, making it a crest of global success. All the papers are focused on the thematic presentation areas of the conference, and they have provided ample opportunity for presentation in different sessions. Research in data mining has its own history. But there is no doubt that the tips and further advancements in the data mining areas will be the main focus of the conference. This year’s programme includes exciting collections of contributions resulting from a successful call for papers. Apart from those, two special sessions named “Computational Intelligence in Data Analytics” and “Applications of Computational Intelligence in Power and Energy Systems” have been proposed for more discussions on the theme-related areas. The selected papers have been divided into thematic areas including both review and research papers and highlight the current focus on computational intelligence techniques in data mining. We hope the authors’ own research and opinions add value to it.

First and foremost are the authors of papers, columns and editorials whose works have made the conference a great success. We had a great time putting together these proceedings. The ICCIDM conference and proceedings are a credit to a large group of people, and everyone should be congratulated for the outcome. We extend our deep sense of gratitude to all those for their warm encouragement, inspiration and continuous support for making it possible. We hope all of us will appreciate the good contributions made and justify our efforts.

Sambalpur, India: Himansu Sekhar Behera
Srikakulam, India: Janmenjoy Nayak
Sambalpur, India: Bighnaraj Naik
Auburn, USA/Ostrava, Czech Republic: Ajith Abraham

Conference Committee

Chief Patron and President
Prof. E. Saibaba Reddy, Vice Chancellor, VSSUT, B.Tech., M.E. (Hons.)
(Roorkee), Ph.D. (Nottingham, UK), Postdoc (Halifax, Canada), Postdoc (Birmingham, UK)

Honorary Advisory Chair
Prof. S. K. Pal, Sr. Member IEEE, LFIEEE, FIAPR, FIFSA, FNA, FASc, FNASc, FNAE, Distinguished Scientist and Former Director, Indian Statistical Institute, India
Prof. V. E. Balas, Sr. Member IEEE, Aurel Vlaicu University, Romania

Honorary General Chair
Prof. Rajib Mall, Indian Institute of Technology (IIT) Kharagpur, India
Prof. P. K. Hota, Dean, CDCE, VSSUT, India

General Chair
Prof. Ashish Ghosh, Indian Statistical Institute, Kolkata, India
Prof. B. K. Panigrahi, Indian Institute of Technology (IIT) Delhi, India

Programme Chair
Dr. H. S. Behera, Veer Surendra Sai University of Technology (VSSUT), Burla, Odisha, India

Chairman, Organizing Committee
Prof. Amiya Ku. Rath, HOD, Department of CSE and IT, Veer Surendra Sai University of Technology (VSSUT), Burla, Odisha, India

Vice-Chairman, Organizing Committee
Dr. S. K. Padhy, HOD, Department of Computer Application, Veer Surendra Sai University of Technology (VSSUT), Burla, Odisha, India

Convenor
Dr. Bighnaraj Naik, Department of Computer Application, Veer Surendra Sai University of Technology (VSSUT), Burla, Odisha, India

International Advisory Committee
Prof. A. Abraham, Machine Intelligence Research Labs, USA
Prof. Dungki Min, Konkuk University, Republic of Korea
Prof. Francesco Marcelloni, University of Pisa, Italy
Prof. Francisco Herrera, University of Granada, Spain
Prof. A. Adamatzky, Unconventional Computing Centre, UWE, UK
Prof. H. P. Proença, University of Beira Interior, Portugal
Prof. P. Mohapatra, University of California
Prof. S. Naik, University of Waterloo, Canada
Prof. George A. Tsihrintzis, University of Piraeus, Greece
Prof. Richard Le, La Trobe University, Australia
Prof. Khalid Saeed, AUST, Poland
Prof. Yew-Soon Ong, Singapore
Prof. Andrey V. Savchenko, NRU HSE, Russia
Prof. P. Mitra, P.S University, USA
Prof. D. Sharma, University of Canberra, Australia
Prof. Istvan Erlich, University of Duisburg-Essen, Germany
Prof. Michele Nappi, University of Salerno, Italy
Prof. Somesh Jha, University of Wisconsin, USA
Prof. Sushil Jajodia, George Mason University, USA
Prof. S. Auephanwiriyakul, Chiang Mai University, Thailand
Prof. Carlos A. Coello Coello, Mexico
Prof. M. Crochemore, University de Marne-la-Vallée, France
Prof. T. Erlebach, University of Leicester, Leicester, UK
Prof. T. Baeck, Universiteit Leiden, Leiden, The Netherlands
Prof. J. Biamonte, ISI Foundation, Torino, Italy
Prof. C. S. Calude, University of Auckland, New Zealand
Prof. P. Degano, Università di Pisa, Pisa, Italy
Prof. Raouf Boutaba, University of Waterloo, Canada
Prof. Kenji Suzuki, University of Chicago
Prof. Raj Jain, WU, USA
Prof. D. Al-Jumeily, Liverpool J. Moores University, UK
Prof. M. S. Obaidat, Monmouth University, USA
Prof. P. N. Suganthan, NTU, Singapore
Prof. Biju Issac, Teesside University, UK
Prof. Brijesh Verma, CQU, Australia
Prof. Ouri E. Wolfson, University of Illinois, USA
Prof. Klaus David, University of Kassel, Germany
Prof. M. Dash, NTU, Singapore
Prof. L. Kari, Western University, London, Canada
Prof. A. S. M. Sajeev, Australia
Prof. Tony Clark, MSU, UK
Prof. Sanjib Ku. Panda, NUS, Singapore
Prof. R. C. Hansdah, IISC Bangalore
Prof. G. Chakraborty, Iwate Prefectural University, Japan
Prof. Atul Prakash, University of Michigan, USA
Prof. Sara Foresti, University of degli Studi di Milano, Italy
Prof. Pascal Lorenz, University of Haute Alsace, France
Prof. G. Ausiello, University di Roma “La Sapienza”, Italy
Prof. X. Deng, University of Liverpool, England, UK
Prof. Z. Esik, University of Szeged, Szeged, Hungary
Prof. A. G. Barto, University of Massachusetts, USA
Prof. G. Brassard, University de Montréal, Montréal, Canada
Prof. L. Cardelli, Microsoft Research, England, UK
Prof. A. E. Eiben, VU University, The Netherlands
Prof. Patrick Siarry, Université de Paris, Paris
Prof. R. Herrera Lara, EEQ, Ecuador
Prof. M. Murugappan, University of Malaysia

National Advisory Committee
Prof. P. K. Pradhan, Registrar, VSSUT,
Burla
Prof. R. P. Panda, VSSUT, Burla
Prof. A. N. Nayak, Dean, SRIC, VSSUT, Burla
Prof. D. Mishra, Dean, Students’ Welfare, VSSUT, Burla
Prof. P. K. Kar, Dean, Faculty & Planning, VSSUT, Burla

S. K. Sahoo et al.: A Dynamic Bottle Inspection Structure

where σ is the standard deviation of the distribution, and µ is the expectation of the distribution.

From an engineering point of view, discrete wavelet analysis is a two-channel digital filter bank composed of low-pass and high-pass filters, iterated on the low-pass output. The low-pass filtering yields an approximation of a signal (at a given scale), while the high-pass filtering yields the details that constitute the difference between two successive approximations. The figures show a typical two-level discrete wavelet filter bank and a typical two-level DWT for de-noising of a bottle sample. The bottle image is then segmented from the image background by applying segmentation methods. After all adaptive features have been extracted, they are combined to form a dataset using mathematical concepts such as the average grayscale two-dimensional feature vector, the wavelet transform, and principal component analysis (PCA). These extracted features are taken as input variables, and the types of defective bottles are taken as output variables, in the classification of defect-free bottles. In the classification stage, intelligent techniques such as an artificial neural network (ANN) trained by the back propagation (BP) algorithm, the differential evolution algorithm (DEA) and the support vector machine are used to classify the images as per predefined dimensions. All three adaptive features and the two training methods of the ANN are employed one by one on both defective and defect-free bottle images. The number of features is reduced from 5000 to 2500 by principal component analysis (PCA) for better computation.

Fig. A typical discrete wavelet filter bank: the original signal X(n) passes through a low-pass filter (LPF) and a high-pass filter (HPF), each followed by down-sampling, giving A1 and D1 at level 1; A1 is filtered and down-sampled again, giving A2 and D2 at level 2

Fig. Typical two-level DWT for de-noising of a bottle sample: (a) original bottle image, (b) 2-D DWT of bottle image

During training and testing, the reduced features are applied as input to the classifier. The classification is also performed through the vision builder simulation window for validation purposes. This vision builder (VB), along with the laboratory view window, is used for automated inspection to solve visual inspection tasks including inspection, parts presence, and counting. The results of both schemes are compared and the final outcomes are reported. During the experiment, factors such as illumination, focal length and magnification factor of the camera, and workpiece position are kept constant throughout the training and testing phases.

Experimental Setup for the Bottle Inspection Model and Study of Its Outcome

A vision-centered inspection arrangement offers innovative capabilities for contact-less measurement and inspection. The apparatus incorporates a multitude of methodologies including digital imaging, integrated circuit technology, embedded systems and software. A photographic view of the machine vision inspection system is shown in the figure. The inspection of bottles using adaptive feature extraction with the wavelet transform, trained by a support vector machine and an adaptive model, has been considered in this experimental work. The image of the bottle is grabbed by a high-resolution camera and sent to the personal computer through frame buffers. The acquired image is processed for noise reduction and feature extraction using analytical tools such as wavelet transforms. For analysis purposes, the extracted features obtained from the wavelet scheme are trained again and classified for the desired response.

Fig. Photographic view of AI-based bottle inspection system

Since no suitable standard database is available for bottles, 5000 images have been considered and resized in a
standard dimension to form a big database. During the inspection interval, improper illumination and changes in the position of an object can occur due to different factors, so a sensory arrangement is implemented. Generally, the investigation is performed indoors under 60 W bulb lighting with illumination intensities of 50 lx and 500 lx. The figure indicates the error variation at the classification stage with the proposed arrangement.

Fig. Error response of MVIS at classification stage (% error versus number of iterations; series: ANN using BP and DEA, proposed MVI)

The term classification here refers to the technique of categorizing an object depending on its attributes. This process needs good-quality training data, or information about the inspected object obtained earlier. Considering the extracted features of the bottle object, the proper classification as defective or defect free is decided. In this classification stage, the computational speed of the different approaches is compared with that of the proposed sensor-centered MVI system. The comparison of computational speed is discussed here for proper evaluation.

The capability of the proposed vision system is analyzed by comparing performance in terms of the three adaptive features. Single and multiple images are considered for the experiment. The table for ANN trained with BP indicates that the neural networks with the sensor module have a higher success ratio than those without the sensor module for all three types of feature extraction method. In multiple zone images, wavelet-based feature extraction has produced a success ratio of 97.5%, but the real-time inspection rate is 1.48 bottles/s. The average grayscale 2D feature extraction method has produced a success ratio of 95%, but the real-time inspection rate is 1.73 bottles/s. The table for ANN trained with DEA reveals that the neural
networks with the sensor module have a higher success ratio than those without the sensor module for all three types of feature extraction method. With the sensor module, wavelet-based feature extraction has produced a success ratio of 97.5%, but the real-time inspection rate is 2.22 bottles/s. The average grayscale 2D feature extraction method has produced a success ratio of 95%, but the real-time inspection rate is 2.5 bottles/s. The figure reveals that the proposed model provides a better response than the MVI system without the sensory arrangement.

A vision inspection system using machine vision with adaptive feature extraction is very useful for inspecting the quality level of bottles. It examines the defective bottles using an artificial neural network and the vision builder simulation platform. In this work, the defective and defect-free bottles have been inspected with and without the sensor module scheme, with three feature extraction methods, and on the simulation platform.

3.1 Comparison of Overall Performance

The overall performance of the developed vision systems with and without the sensor module is compared in terms of the three adaptive features. The complete comparison is shown in the overall-performance table and figure. Ultimately, the performance comparisons between all methods provide information about the variations due to the selection of the feature extraction method, the training method of the decision-making algorithm, and the types of image. From the figure, it is concluded that the MVI system with sensor implementation has better efficiency than the one without for the estimated values of the extracted features.

Table. Performance analysis of proposed vision inspection system by ANN using BP for defective bottle scrutiny

| Parameters | Without sensor module: AVG grayscale 2D feature vector | Without sensor module: PCA-based features | Without sensor module: Wavelet-based features | With sensor module: AVG grayscale 2D feature vector | With sensor module: PCA-based features | With sensor module: Wavelet-based features |
|---|---|---|---|---|---|---|
| Failure rate (%) | 6.256 | | 3.75 | 3.75 | 2.25 | 2.5 |
| False alarm rate (%) | 1.254 | 1.25 | 1.25 | 1.25 | 1.25 | |
| Success ratio (%) | 92.5 | 93.75 | 95 | 95 | 96.25 | 97.5 |
| Real-time inspection rate (detected/s) | 1.9 | 1.77 | 1.63 | 1.73 | 1.63 | 1.48 |
| Classification success ratio (%) | 92.57 | 93.75 | 95 | 95 | 96.25 | 97.5 |

Table. Performance analysis of proposed vision inspection system by ANN using DEA for bottle inspection

| Parameters | Without sensor module: AVG grayscale 2D feature vector | Without sensor module: Gaussian-based features | Without sensor module: PCA-based features | With sensor module: AVG grayscale 2D feature vector | With sensor module: Gaussian-based features | With sensor module: PCA-based features |
|---|---|---|---|---|---|---|
| Failure rate (%) | 6.256 | | 3.75 | 3.75 | 3.75 | 2.5 |
| False alarm rate (%) | 1.254 | 1.25 | 1.25 | 1.25 | 1.25 | |
| Success ratio (%) | 92.5 | 93.75 | 95 | 95 | 95 | 97.5 |
| Real-time inspection rate (detected/s) | 2.85 | 2.58 | 2.5 | 2.5 | 2.42 | 2.22 |
| Classification success ratio (%) | 92.57 | 93.75 | 95 | 95 | 95 | 97.5 |

Table. Comparison of overall performance of VBIS

| Feature extraction method | Without sensor (%) | With sensor (%) |
|---|---|---|
| 2D feature vector | 92.50 | 95.00 |
| PCA features | 93.25 | 96.00 |
| Wavelet features | 95 | 98 |

Fig. Performance comparisons of proposed vision inspection system by ANN using BP and DEA for defective bottle inspection with and without sensor implementation in MVI system

3.2 Comparison of Computational Time

The overall computational time of the developed vision-based inspection system model is compared across the three adaptive features. MVI models with and without the sensor are considered for this evaluation on the bottle images, along with a simulation platform such as vision builder. The complete comparisons are shown in the computational-time table and figure.

Table. Comparison of computational time of proposed vision inspection system for bottle inspection

| Feature extraction method | BP: without sensor | BP: with sensor | DEA: without sensor | DEA: with sensor | Gained computational time (s): without sensor | Gained computational time (s): with sensor |
|---|---|---|---|---|---|---|
| 2D feature vector | 42 | 46 | 28 | 32 | 33.33 | 30.43 |
| PCA features | 45 | 49 | 31 | 32 | 31.11 | 34.69 |
| Wavelet features | 49 | 54 | 33 | 36 | 28.65 | 30.33 |
| Average gained computational time (s) | | | | | 33.04 | 29.80 |

Fig. Comparison of computational time of proposed vision inspection system for bottle inspection

Conclusions

The AI-centered MVI system is one of the most suitable schemes for proper inspection of defective bottles. This vision inspection system provides a good technology for inspecting the imperfections in bottles. Based on the comparison of the computational times of all methods, it may be asserted that the computational time of ANN using DEA is comparatively less than that of ANN using BP. The average percentages of computational time gained by the model for bottle images without and with sensor implementation are 33.04 and 29.80, respectively. The computational time for bottle classification is higher with sensor implementation than without it. Inspection of bottles using the MVI system also enables the user to examine the imperfections in the bottles in a natural manner that is very similar to standard inspection methods. The MVI system categorizes the different types of bottles and classifies the bottle imperfections as per standards. The classification success ratio ranges from 91.25% to 97.5%. Finally, it is concluded that the AI-based MVIS is most suitable for inspecting the quality level for imperfections in bottles, with an overall efficiency of 98% as shown in the performance-comparison figure.

References

Takayuki Kanda, Masahiro Shiomi, Zenta Miyashita, Hiroshi Ishiguro: A communication robot in a shopping mall. IEEE Transactions on Robotics (2010) Vol. 26(5), 897–913
Mahdi Abbasgolipour, Mahmoud Omid, Alireza Keyhani: Sorting raisins by machine vision system. Modern Applied Science (2010) Vol. 4(2), 49–60
Niko Herakovic, Marko Simic, Francelj Trdic, Jure Skvarc: A machine vision system for automated quality control of welded rings. Machine Vision and Applications, Springer (2010) 1–15
J. G. Victores, S.
Martinez, A. Jardon, and C. Balaguer: Robot aided tunnel inspection and maintenance system by vision and proximity sensor integration. Automation in Construction, Elsevier (2011) Vol. 20, 629–636
Sergio Cubero, Nuria Aleixos, Enrique Molto: Advances in machine vision applications for automatic inspection and quality evaluation of fruits and vegetables. Food Bioprocess Technology, Springer (2011) Vol. 4, 487–504
Tobias Andersson, Matthew J. Thurley, Johan E. Carlson: A machine vision system for estimation of size distributions by weight of limestone particles. Minerals Engineering, Elsevier (2012) Vol. 25, 38–46
Yongyu Li, Sagar Dhakal, Yankun Peng: A machine vision system for identification of micro crack in egg shell. Journal of Food Engineering, Elsevier (2012) Vol. 109, 127–134

Feature Selection-Based Clustering on Micro-blogging Data

Soumi Dutta, Sujata Ghatak, Asit Kumar Das, Manan Gupta and Sayantika Dasgupta

Abstract The growing popularity of micro-blogging has opened up a flexible platform for the public to use as a communication medium. For any trending or non-trending topic, thousands of posts are published daily on micro-blogs. During any important event, such as a natural calamity or an election, or a sports event, such as the IPL or the World Cup, a huge number of messages (micro-blogs) are posted. This fast and voluminous exchange of messages causes information overload, so clustering, that is, grouping similar messages, is an effective way to reduce it. The short length and noisy nature of the messages are challenging factors in micro-blog data clustering. The incremental growth of the data is another challenge for clustering. So, in this work, a novel clustering approach is proposed for micro-blogs, combined with a feature selection technique. The proposed approach has been applied to several experimental datasets and compared with several existing clustering techniques, and it yields better outcomes than the other methods.

Keywords Clustering ⋅ Feature selection ⋅ Micro-blogs

1 Introduction

In recent
times, micro-blogging has opened up a huge source of real-time information for researchers. During any important event, such as a natural calamity or an election, or a sports event, such as the IPL or the World Cup, thousands of messages are posted on micro-blogging sites. The fast information exchange on micro-blogging sites causes information overload. A micro-blogging post contains at most 140 characters. It also contains noisy and redundant data. For our experiments, we have considered Twitter streaming datasets that were crawled using the Twitter API [14]. Twitter facilitates searching by keyword or topic name to identify related tweets. A user who wants to know the outline of a topic cannot go through all the posts/tweets. In such a scenario, an effective way to reduce the information load on the user is to group similar posts, so that the user need see only a few messages in each cluster.

S. Dutta (✉) ⋅ S. Ghatak ⋅ M. Gupta ⋅ S. Dasgupta
Institute of Engineering & Management, Kolkata 700091, India
e-mail: soumi.it@gmail.com

S. Dutta ⋅ A. K. Das
Indian Institute of Engineering Science and Technology Shibpur, Howrah 711103, India

© Springer Nature Singapore Pte Ltd 2019
H. S. Behera et al. (eds.), Computational Intelligence in Data Mining, Advances in Intelligent Systems and Computing 711, https://doi.org/10.1007/978-981-10-8055-5_78

Another challenge of micro-blogging data is its large volume, which means more time is needed to cluster the data into subgroups. So selected features that represent the characteristics of a cluster can be identified from each cluster. Feature selection focuses on the reduction of overfitting. Feature selection performs dimension reduction for a given dataset, where the selected features are the important features that are sufficient to represent the dataset independently. This reduced dimension of the dataset can also reduce the clustering time effectively. Clustering also eases the data
summarization task, which is another well-established problem in information retrieval.

In the proposed clustering approach, the dataset is preprocessed first. The Latent Dirichlet Allocation (LDA) [1] topic modeler is a generative process that can be used to identify the features capable of identifying a topic in the dataset; it returns a list of reduced features that can be used to represent each topic in the entire dataset. Then, for cluster identification, the Hamming distance is measured between each topic-feature vector and a data tuple, and the cluster with the minimum distance value is identified as the destination cluster for that data tuple. Using this approach, the entire dataset can be clustered into multiple groups. The proposed clustering approach is applied to micro-blog datasets related to four disaster events. A few classical clustering methods, such as K-means and hierarchical clustering, are also applied to the same datasets. The performance of the different algorithms is evaluated using standard clustering index measures. As a whole, the proposed approach achieves better performance than the classical methods on the micro-blogging datasets.

The rest of the paper is organized as follows. A short literature survey on clustering micro-blogging datasets is presented in Sect. 2. Section 3 describes the proposed clustering approach. The micro-blog datasets used for the algorithm are described in Sect. 4, while Sect. 5 discusses the results of the comparison among the various clustering algorithms. The paper is concluded in Sect. 6 with some potential future research directions.

2 Related Work

Many prior works have addressed micro-blogging data clustering. Hill et al. [6] discuss how social network-based clusters can capture homophily, along with the possibility that a network-based attribute approach might not only capture homophily but could also be used instead of demographic attributes to determine the similarity in user behavior, thus preserving the privacy of the user base.
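The cluster-assignment step outlined in the Introduction (minimum Hamming distance between a data tuple and each topic-feature vector) can be sketched as follows. This is a minimal illustration, assuming that every tweet and every topic-feature vector is a binary list over the same reduced vocabulary. The names `hamming` and `assign_clusters` are hypothetical and do not come from the paper, and ties are broken here by taking the first topic with the minimum distance, a detail the text does not specify.

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_clusters(tweet_vectors, topic_vectors):
    """Return, for each tweet vector, the index of the nearest topic vector.

    Assumption: one cluster per topic-feature vector; ties go to the
    lowest topic index because list.index returns the first match.
    """
    labels = []
    for vec in tweet_vectors:
        distances = [hamming(vec, topic) for topic in topic_vectors]
        labels.append(distances.index(min(distances)))
    return labels


if __name__ == "__main__":
    # Toy data: two topic-feature vectors and three tweets over four terms.
    topics = [[1, 1, 0, 0], [0, 0, 1, 1]]
    tweets = [[1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
    print(assign_clusters(tweets, topics))
```

Because each tweet is compared only with the topic vectors, the cost of this step grows with the number of tweets times the number of topics, rather than with the number of tweet pairs.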
Cheong [2] attempts to detect intra-topic user and message clusters in Twitter by incorporating an unsupervised self-organizing feature map (SOM) as a machine learning-based clustering tool. Thomas et al. [13] propose an efficient text classification scheme that uses semi-supervised clustering as a complementary step to text classification. The method provides better accuracy than the similarity measure for text processing (SMTP) used for distance calculation. Yang and Leskovec [16] have proposed a clustering method that uses temporal patterns of propagation. Karypis et al. [8] propose a hierarchical clustering algorithm using dynamic modeling, which takes into account the dynamic model of clusters and an adaptive merging decision; that is, depending upon the difference in the clustering model, it can discover natural clusters of various shapes and sizes. It supports a two-phase framework that has been built effectively using various graph representations suitable for various application domains. Dueck et al. [3] propose an affinity propagation algorithm for clustering tweets. Dutta et al. [4] use a graph-based community detection algorithm for clustering tweets and later use the clustering output for summarization. Recently, Rangrej et al. [10] conducted a comparative study of three clustering algorithms (K-means, affinity propagation, and singular value decomposition) and compared their performance in clustering short text documents.

3 Micro-blog Clustering Algorithms

3.1 Data Preprocessing

Micro-blogging data often contain non-textual characters such as smileys, @usernames, and exclamation/question marks, which act as noise and degrade clustering performance. The dataset therefore needs to be preprocessed. Initially, stopwords, URLs, numerals, addressing, user mentions, e-mails, and special characters are removed from the dataset, and the remaining tokens are stemmed. This section describes the proposed methodology in detail. Each tuple in the dataset represents a single message or post or
tweet (document) in Twitter. The entire dataset is tokenized first, and a list of unique tokens is identified. Then, a document-term matrix is generated, where the rows (M) represent individual tweets and the columns (Z) represent distinct terms/tokens. The entries in the matrix represent the presence (1) or absence (0) of a particular term/token in the post/tweet. The table below shows the corresponding matrix for the toy set of tweets.

Table: Document-term matrix for the toy dataset. Columns: Tweet ID, attr1, attr2, attr3, ..., attrZ; rows: T1, T2, T3, T4, T5, ..., TM; each entry is 1 (term present) or 0 (term absent).

Next, the dimension of the document-term matrix is reduced using an Information Theoretic Approach. For each term/token, a conditional probability (p value) is evaluated using Bayes's Rule. The standard formula for Bayes's Rule is shown in Eq. 1:

\[ P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)} \tag{1} \]

Here, H represents the occurrence of a token in the entire dataset, and E represents the appearance of a token across all tweets. Then, the mean p value is computed and compared with the p value of each individual term/token. Terms/tokens whose p value is higher than the mean p value are discarded from the document-term matrix. According to Shannon's theory of communication, the mean p value represents the average information yield. Shannon's method states that a token with a lower p value yields higher self-information content, as shown in Eq. 2:

\[ I(W_n) = f(P(W_n)) \tag{2} \]

Considering the new subset of terms/tokens, the document-term matrix (M × G) is regenerated as shown in Table 2.

The proposed algorithm aims to cluster the dataset. The methodology is briefly outlined in the algorithm listing given later in this section. Before clustering, we use a topic modeler to identify the probable number of clusters. The LDA (Latent Dirichlet Allocation) topic modeler is used here; it is a generative process that can be used to identify the features capable of identifying a topic in the dataset. LDA is a statistical generative
process that takes three inputs: n (the number of topics), alpha, and theta (alpha and theta are hyperparameters, i.e., parameters of the prior distribution). It returns n tuples, where each tuple represents a topic and its corresponding features are used to identify the topic.

Table 2: Reduced document-term matrix. Columns: Tweet ID, attr1, attr2, attr5, ..., attrG; rows: T1, T2, T3, T4, T5, ..., TM; each entry is 1 (term present) or 0 (term absent).

The overall runtime complexity [12] of the LDA method is

\[ O\left((NT)^{t}\,(N + t)^{3}\right) \]

The expression is polynomial when the total number of topics is constant. The inference problem is NP-hard when the number of topics is large. The number of topics is relatively small in our experimental datasets, and thus LDA performs well. If the number of topics is large for a given dataset, the performance of the LDA algorithm may decline. The inference function thus has to be augmented so that LDA performs better even when the number of topics becomes significantly large for a dataset. To apply the LDA approach to the experimental dataset, we need to estimate alpha and theta using the Bayesian inference function shown in Eq. 3:

\[ p(\tilde{x} \mid X, \alpha) = \int_{\theta} p(\tilde{x} \mid \theta)\, p(\theta \mid X, \alpha)\, d\theta \tag{3} \]

With the help of the Information Theoretic Approach (Shannon's approach), the optimal number of groups into which a text dataset can be divided is determined. Using Shannon's formula for calculating the information yield of a particular message, we have reduced the total number of features initially generated by extracting the corpus from the dataset, i.e., the number of unique features that can be used to represent the entire dataset. This number can be used as the optimal number of topics to be detected from the dataset. We compute this value as an inverse probability of zero elements. That is, if the total number of elements in the document-term matrix is N and the total number of zeros is K, the probability
of occurrence of zeros can be calculated as I = K/N, as the event is an independent one. Thus, the probability of occurrence of nonzero elements is T = 1 − I. The optimal number of topics is then derived as ceiling[Length(Corpus)/T]. The LDA method not only identifies the topics but also returns a list of reduced features (terms/tokens) that can be used to represent the entire dataset and to reduce the size of the data later processed by the clustering algorithm. LDA returns the list of K reduced features. Then, a topic-feature matrix (F) of dimension T × K is generated, where T is the total number of topics and K is the total number of features. If a feature f is present in a topic, the attribute is marked as 1; otherwise, it is marked as 0. The corresponding matrix is given in tabular form.

Algorithm: Micro-blogging data clustering algorithm based on feature selection
Input: L number of tweets
Output: Tweets partitioned into T number of clusters
Pre-process all tweets in the dataset by removing stopwords, special characters, user mentions, URLs, e-mails;
Stem all the tokens;
Create the corpus C as the list of unique tokens that forms the entire dataset;
Let N = number of distinct tokens in corpus;
Compute DM = document-term matrix of order L × N;
for each token in C
    Compute p-value using conditional probability of occurrence of each term by Bayes rule;
end for
Calculate the mean p-value (mp);
for all Ci, i ∈ [1, L]
    Remove token if p > mp
end for
Reform the document-term matrix (DM) based on the new reduced corpus;
Use Bayesian inference to compute the parameters alpha and theta from DM;
z = number of zero elements in DM;
N = total number of elements in DM;
P(z) = z/N;
P(Nz) = 1 − P(z), where Nz denotes the nonzero elements;
Calculate number of topics T = Ceiling[Length(C) / P(Nz)];
Run LDA(T, alpha, theta), store feature/topic in list FL;
Prepare document-feature matrix (D);
Prepare F, a feature matrix that maps each feature per topic with respect to total features returned by LDA;
i, j = 0
while data in D
    for all vector_i, i ∈ [1, F]
        Calculate Hamming distance between data and vector as d = Hamming(data, vector_i);
        Update the Distances[i][j++] vector with the Hamming distance d;
    end for
    for all row_i, i ∈ [1, Distances]
        Calculate and store the mean value of the present row in list row-mean
    end for
end while
while data in D
    for all x, x ∈ [0, distance]
        for all v, v ∈ [0, x]
            if v
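
As an illustration of the token-filtering step in the algorithm, the following Python sketch discards tokens whose p value exceeds the mean p value. It is a minimal sketch under a stated assumption: the text does not fully specify how the p value is obtained from Bayes's Rule, so here it is taken to be the fraction of tweets that contain the token. The function names and the toy tokens are hypothetical and do not come from the paper.

```python
def token_p_values(dtm):
    """Return one value per token (column) of a binary document-term matrix.

    ASSUMPTION: the p value of a token is the share of tweets containing it.
    The paper only says it is a conditional probability from Bayes's Rule.
    """
    n_tweets = len(dtm)
    n_tokens = len(dtm[0])
    return [sum(row[j] for row in dtm) / n_tweets for j in range(n_tokens)]


def filter_tokens(dtm, tokens):
    """Drop every token whose p value is higher than the mean p value."""
    p = token_p_values(dtm)
    mean_p = sum(p) / len(p)
    # Keep the column indices whose p value does not exceed the mean.
    keep = [j for j, p_j in enumerate(p) if p_j <= mean_p]
    reduced = [[row[j] for j in keep] for row in dtm]
    kept_tokens = [tokens[j] for j in keep]
    return reduced, kept_tokens, mean_p


# Hypothetical toy data: 4 tweets (rows) and 4 tokens (columns).
tokens = ["flood", "help", "the", "rescue"]
dtm = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
]

reduced, kept, mean_p = filter_tokens(dtm, tokens)
print(kept)
print(reduced)
```

On this toy matrix, the frequent tokens "flood" and "the" are dropped, and only "help" and "rescue" remain in the reduced matrix.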

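
The cluster-assignment step described in the introduction, where each data tuple goes to the topic-feature vector at minimum Hamming distance, can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: it leaves out the row-mean step of the algorithm listing, breaks ties by taking the lowest topic index (an assumption), and uses hypothetical function names and toy vectors.

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_clusters(doc_feature, topic_feature):
    """Assign each document row to the topic row at minimum Hamming distance.

    ASSUMPTION: ties go to the lowest topic index; the paper does not say
    how ties are resolved.
    """
    labels = []
    for doc in doc_feature:
        distances = [hamming(doc, topic) for topic in topic_feature]
        labels.append(distances.index(min(distances)))
    return labels


# Hypothetical topic-feature matrix F: 2 topics over 4 reduced features.
topic_feature = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

# Hypothetical document-feature matrix D: 4 tweets over the same features.
doc_feature = [
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
]

labels = assign_clusters(doc_feature, topic_feature)
print(labels)
```

The last toy tweet is equally far from both topics, so the result for it depends entirely on the tie-breaking rule chosen here.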
Date posted: 02/03/2019, 10:31
