www.pwc.com

Big Data Analytics Learning Lab
UN Data Innovation Lab
University of Nairobi
March 13-14, 2017

Agenda
I. Introduction to Big Data: What it is and why it matters
II. Big Data Analytics: Putting Big Data to work
III. Creating a Big Data-Enabled Organization: Bringing Big Data Analytics home
IV. Case Study: ‘Nowcasting’ economic activity in Colombia

PwC

01 Introduction to Big Data: What it is and Why it Matters

What is Big Data?
“Big Data” exceeds the capacity of traditional analytics and information management paradigms across what is known as the 4 V’s: Volume, Variety, Velocity, and Veracity.
• Volume (Scale of Data): Reflects the size of a data set. New information is generated daily, and in some cases hourly, creating data sets that are measured in terabytes and petabytes.
• Variety (Different Forms of Data): Represents the diversity of the data. Data sets will vary by type (e.g. social networking, media, text) and they will vary in how well they are structured.
• Velocity (Analysis of Streaming Data): The speed at which data is generated and used. New data is being created every second, and in some cases it may need to be analyzed just as quickly.
• Veracity (Uncertainty of Data): With exponential increases of data from unfiltered and constantly flowing data sources, data quality often suffers and new methods must find ways to “sift” through junk to find meaning.

The Promise of Big Data
Even more important than its definition is what Big Data promises to achieve: intelligence in the moment. Compared across Volume, Variety, Velocity, and Veracity:

Traditional Techniques & Issues
• Analysis is limited to small data sets
• Analyzing large data sets = High Costs & High Memory
• Compatibility issues
• Advanced analytics struggle with non-numerical data
• No real-time analysis
• Does not account for biases, noise and abnormality in data

Big Data Differentiators
• Scalable for huge amounts of multi-sourced data
• Facilitation of massively parallel processing
• Low-cost data storage
• Frameworks accommodate varying data types and data models
• Insightful analysis with very few parameters
• In real time: dynamically analyze data; consistently integrate new information; automatically delete unwanted data to ensure optimal storage
• Data is stored and mined in ways meaningful to the problem being analyzed
• Keeps data clean, with processes to keep ‘dirty data’ from accumulating in your systems

Types of Big Data
Variety is the most distinctive aspect of Big Data. New technologies and new types of data have driven much of the evolution around Big Data.
• Social Media: Twitter, LinkedIn, Facebook, Tumblr, Blog, SlideShare, YouTube, Google+, Instagram, Flickr, Pinterest, Vimeo, WordPress, IM, RSS, Review, Chatter, Jive, Yammer, etc.
• Media: Images, videos, audio, Flash, live streams, podcasts, etc.
• Docs: XLS, PDF, CSV, email, Word, PPT, HTML, HTML 5, plain text, XML, JSON, etc.
• Sensor Data: Medical devices, smart electric meters, car sensors, road cameras, satellites, traffic recording devices, processors found within vehicles, video games, cable boxes, assembly lines, office buildings, cell towers, jet engines, air conditioning units, refrigerators, trucks, farm machinery, etc.
• Public Web: Government, weather, competitive, traffic, regulatory, compliance, health care services, economic, census, public finance, stock, OSINT, the World Bank, SEC/Edgar, Wikipedia, IMDb, etc.
• Machine Log Data: Event logs, server data, application logs, business process logs, audit logs, call detail records (CDRs), mobile location, mobile app usage, clickstream data, etc.
• Archive: Archives of scanned documents, statements, insurance forms, medical records and customer correspondence, paper archives, and print stream files that contain original systems of record between organizations and their customers.
• Business Apps: Project management, marketing automation, productivity, CRM, ERP, content management system, HR, storage, talent management, procurement, expense management, Google Docs, intranets, portals, etc.

“Single sources of data are no longer sufficient to cope with the increasingly complicated problems in many policy arenas.” (1)
Big data “is not notable because of its size, but because of its relationality to other data. Due to efforts to mine and aggregate data, Big Data is fundamentally networked.” (2)
(1) M. Milakovich, “Anticipatory Government: Integrating big data for Smaller Government”, in Oxford Internet Institute “Internet, Politics, Policy 2012” Conference, Oxford, 2012
(2) D. Boyd and K. Crawford, “Six Provocations for big data,” in A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society, 2011

Why is Big Data valuable?
We have identified key areas where Big Data is uniquely valuable:
• Accessibility to Data: Enhanced visibility of relevant information and better transparency to massive amounts of data. Improved reporting to stakeholders.
• Decision Making: Next generation analytics can enable automated decision making (inventory management, financial risk assessment, sensor data management, machinery tuning).
• Marketing Trends: Segmentation of population to customize offerings and marketing campaigns (consumer goods, retail, social, clinical data, etc.).
• Performance Improvement: Exploration for, and discovery of, new needs can drive organizations to fine-tune for optimal performance and efficiency (employee data).
• New Business Models/Services: Discovery of trends will lead organizations to form new business models to adapt by creating new service offerings for their customers. Intermediary companies with big data expertise will provide analytics to 3rd parties.

$1 Trillion: One study estimated the potential value of big data in U.S. health care, European public sector administration, global personal location data, U.S. retail, and global manufacturing to be over $1 trillion U.S. dollars per year. (1)
$41 Billion: Another study estimated the value of big data in the areas of customer intelligence, supply chain intelligence, performance improvements, fraud detection, and quality and risk management to be $41 billion per year in the UK alone. (2)
(1) J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh and A. H. Byers, “Big data: The next frontier for innovation, competition, and productivity,” McKinsey & Company, 2011
(2) Centre for Economics and Business Research, “Data equity: unlocking the value of big data,” SAS, 2012

Not to be confused with…
• Big Data: Structured, semi-structured or unstructured information distinguished by one or more of the four “V”s: Veracity, Velocity, Variety, Volume
• Open Data: Public, freely available data
• Crowdsourced Data: Data collected through contributions from a large number of individuals
Graphic and definitions based on “Big Data in Action for Development,” World Bank, worldbank.org

Resources
Tutorials, Tools, Applications, and Research Groups (Link and Description)

Tutorials
• Introduction for Beginners
• Introductory Lecture
• Paper on Business Applications
• Network Analysis Process
• Online Introductory Book
• Introduction to Network Analysis Application and Theory (Open source book)

Research Groups and Papers
• SNA Group at Stanford: Tool, Lectures, and Papers
• Complex Networks and Systems Research Collaboration
• SNA Group and Indiana University: Lectures, Papers, and Tools
• Reality Mining at MIT
• Papers from International Conference on Advances in Social Networks Analysis and Mining

Tools and Data Sets
• Wiki List of Social Network Analysis Software
• Review of 100+ Social Network Analysis Tools
• List of Tools from The SAGE Handbook of Social Network Analysis
• More Tool Reviews
• Twitter Data Sets
• Web/Blog Data Sets
• Facebook Data Sets

Additional Case Studies

Example
Advanced natural language processing and deep question-answering technology are being applied to address clinical decision-making

Memorial Sloan-Kettering Cancer Center
• Memorial Sloan-Kettering Cancer Center is applying DeepQA technology (technology that relies on advanced analytics powered by IBM’s Watson) to develop a decision-support application for cancer treatment
• Doctors will be able to generate and evaluate hypotheses on evidence and treatment, and the Cancer Center will be able to better identify and personalize cancer therapies for individual patients

WellPoint and Cedars-Sinai
• WellPoint and the Cedars-Sinai Samuel Oschin Comprehensive Cancer Institute will work together to help improve patient care and support physicians in their efforts to make the most informed, personalized treatment decisions possible
• It is estimated that new clinical research and medical information doubles every five years, and nowhere is this knowledge advancing more quickly than in the complex area of cancer care
• The WellPoint health care solutions will use DeepQA technology to draw from vast libraries of information including medical evidence-based scientific and health care data, and clinical insights from institutions like Cedars-Sinai

Source: Memorial Sloan-Kettering Cancer Institute Press Release, March 2012; WellPoint Press Release, December 2011

Example
Large volumes of real-time sensor data are empowering individuals to take more control of their health

Quantified Health – P4 Medicine (Predictive, Preventive, Personalized, Participatory)
• Non-invasive wearable sensors are creating a new ‘Quantified Health’ movement and one of the fastest growing sectors in the tech industry, let alone in the field of Big Data Analytics
• The number of connected industrial and medical devices is projected to reach 16 billion by 2015
• The mHealth market is estimated to reach a value of $23 billion by 2017

Source: Bruce Bigelow, Big Data, Big Biology, and the ‘Tipping Point’ in Quantified Health: Takeaways from Xconomy’s On-the-Record Dinner, Xconomy, April 26, 2012

Example
Advanced machine learning and visualization techniques are being used to model drug interactions

Modeling Adverse Drug Reactions
When biological and phenotypic features were integrated alongside chemical structures to predict adverse drug reactions, prediction accuracy increased from 0.9054 to 0.9524.

Source: Liu M, Wu Y, Chen Y, et al. Large-scale prediction of adverse drug reactions by integrating chemical, biological, and phenotypic properties of drugs. J Am Med Inform Assoc 2012;19:e28–35

Other Examples
Companies in other sectors are also pursuing various applications of ‘Big Data’ and ‘Smart Analytics’

Allianz
Allianz is ‘mashing’ satellite data, third-party street-level data, images, and other internal data to better understand risk concentrations and manage concentration risk in commercial property insurance.
(Graphic labels: Satellite Data, Location, Map Data, Property-Specific Data)

Hartford Steam Boiler
Hartford Steam Boiler is using sensors and real-time sensor data to monitor assets, reduce losses and manage risks better. Hartford Steam Boiler has been able to manage concentration risks and reduce losses, having one of the lowest combined ratios for a commercial insurer.

Procter & Gamble
Procter & Gamble is investing in analytics talent for quicker decision making, with the CIO planning to increase fourfold the number of staff with expertise in business analytics. Executives are currently using big data to uncover what is going on in their business, to understand why, to predict future performance and to understand what actions P&G should take.

Source: “Procter & Gamble – Business Sphere and Decision Cockpits”, Ravi Kalakota, Practical Analytics WordPress, Feb 2012; mskcc.org/cancer-care; eWeek.com, Healthcare IT News, IBM Watson to Aid Sloan-Kettering With Cancer Research, March 2012

Big Data Analytics Technology & Vendor Mappings

Big Data Analytics – Technology & Vendor Mappings
Layer: Infrastructure; L1: Cloud
Technology (Vendor), grouped by L2:
Private
• EMC Private Cloud (EMC)
• HP Private Cloud (HP)
• Teradata Private Cloud (Teradata)
• Dell Private Cloud (Dell)
Public
• Azure SQL (Microsoft)
• Amazon Web Services (Amazon)
• Google Cloud Platform (Google)
Hybrid
• EMC Hybrid Cloud (EMC)
• HP Helion (HP)
• IBM Hybrid Cloud (IBM)

Big Data Analytics – Technology & Vendor Mappings
Layer: Data Ingestion & Integration; L1: Data Acquisition; L2: Batch/Micro, Real time/Streaming
Technology (Vendor):
• Apache Kafka (Apache Software Foundation)
• Fluentd (Open Source)
• Sqoop (Apache Software Foundation)
• RabbitMQ (RabbitMQ)
• AWS Kinesis (Amazon Web Services)
• Apache Spark (Apache Software Foundation)
• Apache Storm (Apache Software Foundation)
• Apache Spark Streaming (Apache Software Foundation)
• Samza (Apache Software Foundation)
• NiFi (Apache Software Foundation)

Big Data Analytics – Technology & Vendor Mappings
Layer: Data Ingestion & Integration
Technology (Vendor), grouped by L1 and L2:
Data Quality
• Data Profiling/Cleansing: *Need assistance in locating
• Data Matching/Duplication: *Need assistance in locating
• Standardization/Normalization: *Need assistance in locating
Data Integration
• ETL/ELT: Hadoop (Apache Hadoop), Talend (Talend), Hive (Apache Software Foundation), Drill (Apache Software Foundation)
• Staging: *Need assistance in locating
• Persistent Staging: *Need assistance in locating
• File Exchange: *Need assistance in locating
• File Storage: *Need assistance in locating

Big Data Analytics – Technology & Vendor Mappings
Layer: 3.5 Execution/Data Processing
Technology (Vendor), grouped by L1 and L2:
Execution/Data Processing
• Custom Compilers: *Need assistance in locating
• Batch MapReduce: MapReduce (Apache Hadoop), Spark (Apache Software Foundation), AWS EMR (Amazon Web Services), Tez (Apache Software Foundation)
• In-Memory Processing: Spark (Apache Software Foundation)
• Computing Framework: *Need assistance in locating
Resource Management
• Cluster Management: YARN (Apache Hadoop), Mesos (Apache Software Foundation), ZooKeeper (Apache Software Foundation), Oozie (Apache Software Foundation)

Big Data Analytics – Technology & Vendor Mappings
Layer: 3.5 Execution/Data Processing; L1: Resource Management
• Workflow Management: Hue (Open Source), Ambari (Apache Software Foundation), Lipstick (Netflix), Ganglia (The Ganglia Project)
Layer: Data Repositories; L1: Relational Database
• Traditional Database: SQL Server (Microsoft), Oracle 10g (Oracle)
• Parallel Database: Teradata (Teradata)
• Data Appliances: HP Vertica (HP), IBM BigInsights (IBM), EMC Greenplum (EMC)
• NewSQL: ClustrixDB (Clustrix), MemSQL (MemSQL)
• Hadoop DFS: HDFS (Apache Hadoop)

Big Data Analytics – Technology & Vendor Mappings
Layer: Data Repositories; L1: NoSQL
Technology (Vendor), grouped by L2:
• Relational/NewSQL: MySQL (Open Source), PostgreSQL (Open Source), AWS RDS (Amazon Web Services)
• Columnar DB: Cassandra (Apache Software Foundation), HBase (Apache Hadoop), AWS Redshift (Amazon Web Services)
• In-Memory: Hazelcast (Open Source), Aerospike (Aerospike)
• Metadata Storage: *Need assistance in locating

Big Data Analytics – Technology & Vendor Mappings
Layer: Data Repositories; L1: NoSQL
Technology (Vendor), grouped by L2:
• Key Value: Redis (Open Source), Riak (Basho), AWS DynamoDB (Amazon Web Services)
• Column Store: Cassandra (Apache Software Foundation), HBase (Apache Hadoop), AWS Redshift (Amazon Web Services)
• Graph Database: Neo4j (Neo Technology), OrientDB (Orient Technologies), ArangoDB (Open Source)
• Document Database: MongoDB (MongoDB, Inc), Elastic (Elastic), Couchbase (Couchbase)

Big Data Analytics – Technology & Vendor Mappings
Layer: Presentation/Data Visualization
Categories: Reporting & Dashboards; Visualization tools/Interactive Visual Analytics; Real-time Alerts; Website Front-end; API
Technology (Vendor):
• MicroStrategy (*Need assistance in locating)
• Datameer (*Need assistance in locating)
• Qlik Sense (Qlik)
• Tableau (Tableau)
• Real-time Alerts: *Need assistance in locating
• D3 (Open Source)
• AngularJS (Google)
• Flask (Open Source)
• Highcharts (Highcharts)
• Django (Django Software Foundation)
• API: *Need assistance in locating

... (large, unstructured, fast, and uncertain data) and “Big Data Analytics” ...
Big Data + Big Data Analytics
• Big Data: Refers to the DATA only
• Big Data Analytics: Methods of using Big Data to generate insight
Machine Learning/Deep... use Big Data often commitment to using analytics in decision-making; a decisive mentality capable of Big Data Analytics is about Big Data Analytics requires firm is also about data quality, data