Learning Data Mining with Python, Second Edition

DOCUMENT INFORMATION

Basic information

Format
Pages	503
Size	4.91 MB

Content

Title Page

Learning Data Mining with Python
Second Edition

Use Python to manipulate data and build predictive models

Robert Layton

BIRMINGHAM - MUMBAI

Copyright

Learning Data Mining with Python
Second Edition

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2015
Second edition: April 2017
Production reference: 1250417

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK

ISBN 978-1-78712-678-7

www.packtpub.com

Credits

Author: Robert Layton
Copy Editor: Vikrant Phadkay
Reviewer: Asad Ahamad
Project Coordinator: Nidhi Joshi
Commissioning Editor: Veena Pagare
Proofreader: Safis Editing
Acquisition Editor: Divya Poojari
Indexer: Mariammal Chettiyar
Content Development Editor: Tejas Limkar
Graphics: Tania Dutta
Technical Editor: Danish Shaikh
Production Coordinator: Aparna Bhagat

About the Author

Robert Layton is a data scientist investigating data-driven applications to businesses across a number of sectors. He received a PhD investigating cybercrime analytics from the Internet Commerce Security Laboratory at Federation University Australia, before moving into
industry, starting his own data analytics company dataPipeline (www.datapipeline.com.au). Next, he created Eureaktive (www.eureaktive.com.au), which works with tech-based startups on developing their proof-of-concepts and early-stage prototypes. Robert also runs www.learningtensorflow.com, which is one of the world's premier tutorial websites for Google's TensorFlow library.

Robert is an active member of the Python community, having used Python for more than years. He has presented at PyConAU for the last four years and works with Python Charmers to provide Python-based training for businesses and professionals from a wide range of organisations.

Robert can be best reached via Twitter @robertlayton.

Thank you to my family for supporting me on this journey, thanks to all the readers of revision for making it a success, and thanks to Matty for his assistance behind-the-scenes with the book.

Clustering News Articles

It won't hurt to read a little on the following topics.

Clustering Evaluation

The evaluation of clustering algorithms is a difficult problem: on the one hand, we can sort of tell what good clusters look like; on the other hand, if we really know that, we should label some instances and use a supervised classifier!
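To make the idea of "what good clusters look like" concrete, here is a minimal sketch of one label-free measure, the silhouette coefficient, written in plain Python. The toy points, the two labelings, and the helper names `euclidean` and `silhouette` are assumptions made up for this illustration and do not come from the book; scikit-learn provides the same measure as `sklearn.metrics.silhouette_score`, which is the better choice for real data.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (between -1 and 1)."""
    # Group point indices by their cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the visible grouping
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the two groups together

good_score = silhouette(points, good_labels)
bad_score = silhouette(points, bad_labels)
print(f"good clustering: {good_score:.3f}")
print(f"bad clustering:  {bad_score:.3f}")
```

A labeling that matches the visible grouping scores close to 1, while one that mixes the groups scores much lower. That makes a measure like this usable as the quantity to maximize when comparing parameter settings, although a high score only shows that the clusters are compact and well separated, not that they are meaningful.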
Much has been written on this topic. One slideshow that is a good introduction to the challenges follows: http://www.cs.kent.edu/~jin/DM08/ClusterValidation.pdf

In addition, a very comprehensive (although now a little dated) paper on this topic is here: http://web.itu.edu.tr/sgunduz/courses/verimaden/paper/validity_survey.pdf

The scikit-learn package does implement a number of the metrics described in those links, with an overview here: http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation

Using some of these, you can start evaluating which parameters need to be used for better clusterings. Using a Grid Search, we can find parameters that maximize a metric, just like in classification.

Temporal analysis

Larger exercise! The code we developed here can be rerun over many months. By adding some tags to each cluster, you can track which topics stay active over time, getting a longitudinal viewpoint of what is being discussed in the world news. To compare the clusters, consider a metric such as the adjusted mutual information score, which is covered in the scikit-learn documentation linked earlier. See how the clusters change after one month, two months, six months, and a year.

Real-time clusterings

The k-means algorithm can be iteratively trained and updated over time, rather than run as discrete analyses at given time frames. Cluster movement can be tracked in a number of ways; for instance, you can track which words are popular in each cluster and how much the centroids move per day. Keep the API limits in mind: you probably only need to do one check every few hours to keep your algorithm up-to-date.

Classifying Objects in Images Using Deep Learning

The following topics are also important when considering deeper study into classifying objects.

Mahotas

URL: http://luispedro.org/software/mahotas/

Another package for image processing is Mahotas, including better and more complex image processing techniques that can help achieve better accuracy,
although they may come at a high computational cost. However, many image processing tasks are good candidates for parallelization. More techniques on image classification can be found in the research literature, with this survey paper as a good start: http://ijarcce.com/upload/january/22-A%20Survey%20on%20Image%20Classification.pdf

Other image datasets are available at http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html

There are many datasets of images available from a number of academic and industry-based sources. The linked website lists a bunch of datasets and some of the best algorithms to use on them. Implementing some of the better algorithms will require significant amounts of custom code, but the payoff can be well worth the pain.

Magenta

URL: https://github.com/tensorflow/magenta/tree/master/magenta/reviews

This repository contains a few high-quality deep learning papers that are worth reading, along with in-depth reviews of the papers and their techniques. If you want to go deep into deep learning, check out these papers first before expanding outwards.

Working with Big Data

The following resources on Big Data would be helpful.

Courses on Hadoop

Both Yahoo and Google have great tutorials on Hadoop, which go from beginner to quite advanced levels. They don't specifically address using Python, but learning the Hadoop concepts and then applying them in Pydoop or a similar library can yield great results.

Yahoo's tutorial: https://developer.yahoo.com/hadoop/tutorial/
Google's tutorial: https://cloud.google.com/hadoop/what-is-hadoop

Pydoop

URL: http://crs4.github.io/pydoop/tutorial/index.html

Pydoop is a Python library to run Hadoop jobs. Pydoop also works with HDFS, the Hadoop File System, although you can get that functionality in mrjob as well. Pydoop will give you a bit more control over running some jobs.

Recommendation engine

Building a large recommendation engine is a good test of your big data skills. A great blog post by Mark
Litwintschik covers an engine using Apache Spark, a big data technology: http://tech.marksblogg.com/recommendation-engine-spark-python.html

W.I.L.L

URL: https://github.com/ironman5366/W.I.L.L

Very large project! This open source personal assistant can be your next JARVIS from Iron Man. You can add to this project using data mining techniques to allow it to learn to do some tasks that you need to do regularly. This is not easy, but the potential productivity gains are worth it.

More resources

The following would serve as a really good resource for additional information:

Kaggle competitions

URL: www.kaggle.com/

Kaggle runs data mining competitions regularly, often with monetary prizes. Testing your skills on Kaggle competitions is a fast and great way to learn to work with real-world data mining problems. The forums are nice, sharing environments; often, you will see code released for a top-10 entry during the competition!

Coursera

URL: www.coursera.org

Coursera contains many courses on data mining and data science. Many of the courses are specialized, such as big data and image processing. A great general one to start with is Andrew Ng's famous course: https://www.coursera.org/learn/machine-learning/

It is a bit more advanced than this book and would be a great next step for interested readers. For neural networks, check out this course: https://www.coursera.org/course/neuralnets

If you complete all of these, try out the course on probabilistic graphical models at https://www.coursera.org/course/pgm

Preface

The second revision of Learning Data Mining with Python was written with the...
GitHub at https://github.com/PacktPublishing/Learning-Data-Mining-with-Python-Second-Edition The benefit of the GitHub repository is that any issues with the code, including problems relating to

Date posted: 04/03/2019, 13:44