Fast Data Processing Systems with SMACK Stack

Document information

Fast Data Processing Systems with SMACK Stack

Combine the incredible powers of Spark, Mesos, Akka, Cassandra, and Kafka to build data processing platforms that can take on even the hardest of your data troubles!

Raúl Estrada

BIRMINGHAM - MUMBAI

Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Production reference: 1151216

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK

ISBN 978-1-78646-720-1

www.packtpub.com

Credits

  • Author: Raúl Estrada
  • Reviewers: Anton Kirillov, Sumit Pal
  • Commissioning Editor: Veena Pagare
  • Acquisition Editor: Divya Poojari
  • Content Development Editor: Amrita Noronha
  • Technical Editor: Sneha Hanchate
  • Copy Editor: Safis Editing
  • Project Coordinator: Shweta H Birwatkar
  • Proofreader: Safis Editing
  • Indexer: Mariammal Chettiyar
  • Graphics: Disha Haria
  • Production Coordinator: Nilesh Mohite

About the Author

Raúl Estrada has been a programmer since 1996 and a Java developer since 2001. He loves functional languages such as Scala, Elixir, Clojure, and Haskell. He also loves all the topics related to Computer Science. With more than 12 years of experience in High Availability and Enterprise Software, he has designed and implemented architectures since 2003. He specializes in systems integration and has participated in projects mainly related to the financial sector. He has been an enterprise architect for BEA Systems and Oracle Inc., but he also enjoys Mobile Programming and Game Development. He considers himself a programmer before an architect, engineer, or developer.

He is also a CrossFitter in the San Francisco Bay Area, now focused on open source projects related to data pipelining, such as Apache Flink, Apache Kafka, and Apache Beam. Raúl is a supporter of free software, and enjoys experimenting with new technologies, frameworks, languages, and methods.

I want to thank my family, especially my mom, for her patience and dedication. I would like to thank Master Gerardo Borbolla and his family for the support and feedback they provided during the writing of this book. I want to say thanks to the acquisition editor, Divya Poojari, who believed in this project since the beginning. I also thank my editors Deepti Thore and Amrita Noronha. Without their effort and patience, it would not have been possible to write this book. And finally, I want to thank all the heroes who contribute (often anonymously and without profit) to the open source projects, specifically Spark, Mesos, Akka, Cassandra, and Kafka, with an honorable mention for those who build the connectors for these technologies.

About the Reviewers

Anton Kirillov started his career as a Java developer in 2007, working on his PhD thesis in the Semantic Search domain at the same time. After finishing and defending his thesis, he switched to the Scala ecosystem and distributed systems development. He worked for and consulted startups focused on Big Data analytics in various domains (real-time bidding, telecom, B2B advertising, and social networks), in which his main responsibilities were designing data platform architectures and validating their performance and stability. Besides helping startups, he has worked in the banking industry building Hadoop/Spark data analytics solutions, and in a mobile games company where he designed and implemented several reporting systems and a backend for a massive parallel online game.

The main technologies that Anton has been using in recent years include Scala, Hadoop, Spark, Mesos, Akka, Cassandra, and Kafka, and there are a number of systems he has built from scratch and successfully released using these technologies. Currently, Anton is working as a Staff Engineer in the Ooyala Data Team, with a focus on fault-tolerant fast analytical solutions for the ad serving/reporting domain.

Sumit Pal has more than 24 years of experience in the software industry, spanning companies from startups to enterprises. He is a big data architect and a visualization and data science consultant, and builds end-to-end data-driven analytic systems. Sumit has worked for Microsoft (SQLServer), Oracle (OLAP), and Verizon (Big Data Analytics). Currently, he works for multiple clients building their data architectures and big data solutions, and works with Spark, Scala, Java, and Python. He has extensive experience in building scalable systems, from the middle tier and data tier to visualization for analytics applications, using BigData and NoSQL DB. Sumit has expertise in database internals, data warehouses, and dimensional modeling.

As an Associate Director for Big Data at Verizon, Sumit strategized, managed, architected, and developed analytic platforms for machine learning applications. Sumit was the Chief Architect at ModelN/LeapfrogRX (2006-2013), where he architected the core Analytics Platform. Sumit recently authored a book with Apress called "SQL On Big Data - Technology, Architecture and Roadmap", and regularly speaks on that topic at Big Data conferences across the USA. Sumit hiked to Mt Everest Base Camp at 18.2K feet in October 2016. He is also an avid badminton player and won a bronze medal in 2015 at the Connecticut Open in the USA, in the men's singles category.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

  • Fully searchable across every book published by Packt
  • Copy and paste, print, and bookmark content
  • On demand and accessible via a web browser

Table of Contents

Preface

Chapter 1: An Introduction to SMACK

  • Modern data-processing challenges
  • The data-processing pipeline architecture
  • The NoETL manifesto
  • Lambda architecture
  • Hadoop
  • SMACK technologies
  • Apache Spark
  • Akka
  • Apache Cassandra
  • Apache Kafka
  • Apache Mesos
  • Changing the data center operations
  • From scale-up to scale-out
  • The open-source predominance
  • Data store diversification
  • Data gravity and data locality
  • DevOps rules
  • Data expert profiles
  • Data architects
  • Data engineers
  • Data analysts
  • Data scientists
  • Is SMACK for me?
  • Summary

Chapter 2: The Model - Scala and Akka

  • The language – Scala
  • Kata 1 – The collections hierarchy (Sequence, Map, Set)
  • Kata 2 – Choosing the right collection (Sequence, Map, Set)
  • Kata 3 – Iterating with foreach
  • Kata 4 – Iterating with for
  • Kata 5 – Iterators
  • Kata 6 – Transforming with map
  • Kata 7 – Flattening
  • Kata 8 – Filtering
  • Kata 9 – Subsequences
  • Kata 10 – Splitting
  • Kata 11 – Extracting unique elements
  • Kata 12 – Merging
  • Kata 13 – Lazy views
  • Kata 14 – Sorting
  • Kata 15 – Streams
  • Kata 16 – Arrays
  • Kata 17 – ArrayBuffer
  • Kata 18 – Queues
  • Kata 19 – Stacks
  • Kata 20 – Ranges
  • The model – Akka
  • The Actor Model in a nutshell
  • Kata 21 – Actors (The actor system, Actor reference)
  • Kata 22 – Actor communication
  • Kata 23 – Actor life cycle
  • Kata 24 – Starting actors
  • Kata 25 – Stopping actors
  • Kata 26 – Killing actors
  • Kata 27 – Shutting down the actor system
  • Kata 28 – Actor monitoring
  • Kata 29 – Looking up actors
  • Summary

Chapter 3: The Engine - Apache Spark

  • Spark in single mode
  • Downloading Apache Spark
  • Testing Apache Spark
  • Spark core concepts

Study Case - Mesos and Docker

Posix disk

This isolator can be used on OS X and Linux and provides basic disk isolation. It is normally used to enforce disk quotas. To enable this isolator, add posix/disk to the --isolation flag. Disk quota enforcement is disabled by default; to enable it, use --enforce_container_disk_quota when starting the slave.

To retrieve disk statistics, query the /monitor/statistics.json endpoint. Disk utilization is reported by running the du command periodically. To configure the time interval between runs, use the agent flag container_disk_watch_interval. The default value is 15 seconds.

Docker containerizers

We use this containerizer to run Mesos tasks inside a Docker container. The Docker image can also be launched as an executor or a task. We enable the Docker containerizer with the agent flag --containerizers=docker.

We use this containerizer type when:

  • The running tasks use the Docker package
  • One Mesos slave runs within a Docker container

For full and updated information visit: http://mesos.apache.org/documentation/latest/docker-containerizer/

Docker containerizer setup

To enable the Docker containerizer on a slave, the slave must be launched with docker as one of the containerizer options:

    mesos-slave --containerizers=docker

The Docker Command Line Interface (CLI) client must be installed on every slave where the Docker containerizer is specified.

If iptables is enabled on the slave, it must permit all traffic from the bridge interface. This can be done with the following command:

    iptables -A INPUT -s 172.17.0.0/16 -i docker0 -p tcp -j ACCEPT

Launching the Docker containerizers

The launching process has the following steps:

  • The Docker containerizer is attempted only if ContainerInfo.type is set to DOCKER
  • The image is pulled from the specified repository
  • The pre-launch hook is called
  • The executor is launched in one of the two scenarios described next

Scenario 1: the Mesos agent runs in a Docker container. This is indicated by the presence of the docker_mesos_image flag, whose value is taken to be the Docker image used for launching the Mesos agent. In this scenario:

  • The task may use an executor different from the default command executor
  • If the task uses TaskInfo, the default mesos-docker-executor is launched in a Docker container to execute commands through the Docker CLI

Scenario 2: the Mesos agent does not run in a Docker container. In this scenario:

  • The task may use a custom executor to run
  • If the task uses TaskInfo, a sub-process is forked to run the default mesos-docker-executor; this executor spawns shells to run Docker commands through the Docker CLI

The Docker containerizer states are:

  • FETCHING: The executor is fetched
  • PULLING: The Docker image is pulled
  • RUNNING: Waiting for instructions
  • DESTROYING: The launcher destroys the container

Composing containerizers

This containerizer type allows different container technologies to be combined so that they work together. We enable it with the containerizers agent flag, specifying a comma-separated list of containerizer names:

    --containerizers=mesos,docker

The first containerizer in the list is used for task launching and provides support for the container configuration of the task.

We use this containerizer type when we need to test tasks with different resource isolation types. A framework can leverage it to test a task in the controlled environment provided by the Mesos containerizer and, at the same time, ensure that the task works in Docker containers.

Summary

In this study case, we've covered the Mesos API for framework development. We've also studied how to use the Mesos and Docker containerizers.

Version 1.0.0 of the Mesos framework was released on 07/29/2016, so it is a very new technology. We could look at the creation of a complete Mesos framework, but that is beyond the scope of this book. The first release of Docker was on 03/13/2013, and version 1.12.0 was released on 07/14/2016. Both technologies are still new and promise good things.
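The agent flags named in the Mesos and Docker study case (isolation, enforce_container_disk_quota, container_disk_watch_interval, and containerizers) can be gathered into a single command line. The following is a minimal sketch under stated assumptions, not an official Mesos tool: the helper name build_agent_args, its parameters, and the "15secs" duration spelling are illustrative choices, while the flag names are the ones given in the text. The exact isolator and binary names may differ between Mesos versions. The function only assembles the argument list and never starts an agent.

```python
from typing import List, Optional


def build_agent_args(
    containerizers: List[str],
    disk_isolation: bool = False,
    enforce_disk_quota: bool = False,
    disk_watch_interval: Optional[str] = None,
    binary: str = "mesos-slave",
) -> List[str]:
    """Assemble (but do not run) an agent command line.

    Hypothetical helper: the function and parameter names are
    illustrative; only the flag names come from the text.
    """
    if not containerizers:
        raise ValueError("at least one containerizer name is required")

    # Comma-separated list; per the text, the first entry is the one
    # used for task launching.
    args = [binary, "--containerizers=" + ",".join(containerizers)]

    if disk_isolation:
        # posix/disk is the isolator name used in the text.
        args.append("--isolation=posix/disk")
    if enforce_disk_quota:
        # Quota enforcement is disabled unless this flag is passed.
        args.append("--enforce_container_disk_quota")
    if disk_watch_interval is not None:
        # Interval between du runs; the value format is an assumption.
        args.append("--container_disk_watch_interval=" + disk_watch_interval)
    return args


if __name__ == "__main__":
    command = build_agent_args(
        ["mesos", "docker"],
        disk_isolation=True,
        enforce_disk_quota=True,
        disk_watch_interval="15secs",
    )
    print(" ".join(command))
```

Returning a list instead of a single string keeps each flag as a separate argument, which is the form process-launching APIs usually expect.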
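The launch steps and states of the Docker containerizer described in the Mesos and Docker study case can also be written down as a small decision function. This is only an illustrative model of the prose, assuming that the presence of a docker_mesos_image value means the agent itself runs in Docker. The names DockerContainerizerState and plan_docker_launch, and the step strings, are hypothetical and are not part of any Mesos API.

```python
from enum import Enum
from typing import List, Optional


class DockerContainerizerState(Enum):
    """The four states listed in the text, in the order given there."""
    FETCHING = "the executor is fetched"
    PULLING = "the Docker image is pulled"
    RUNNING = "waiting for instructions"
    DESTROYING = "the launcher destroys the container"


def plan_docker_launch(
    container_type: str,
    docker_mesos_image: Optional[str],
    custom_executor: bool,
) -> List[str]:
    """Return the launch steps as readable strings (hypothetical helper)."""
    if container_type != "DOCKER":
        # The Docker containerizer is attempted only for DOCKER tasks.
        return []

    steps = [
        "pull the image from the specified repository",
        "call the pre-launch hook",
    ]

    # Assumption of this sketch: a set docker_mesos_image flag means
    # the Mesos agent itself runs inside a Docker container.
    agent_in_docker = docker_mesos_image is not None

    if custom_executor:
        steps.append("use the task's custom executor")
    elif agent_in_docker:
        steps.append(
            "launch the default mesos-docker-executor in a Docker container"
        )
    else:
        steps.append(
            "fork a sub-process to run the default mesos-docker-executor"
        )
    return steps


if __name__ == "__main__":
    for step in plan_docker_launch("DOCKER", None, custom_executor=False):
        print(step)
    for state in DockerContainerizerState:
        print(state.name, "-", state.value)
```

The sketch keeps the two scenarios in one function so that the only branching inputs are the ones the text mentions: the container type, the docker_mesos_image flag, and whether a custom executor is used.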

Date posted: 21/06/2017, 15:51

Table of contents

  • Cover

  • Copyright

  • Credits

  • About the Author

  • About the Reviewers

  • www.PacktPub.com

  • Table of Contents

  • Preface

  • Chapter 1: An Introduction to SMACK

    • Modern data-processing challenges

    • The data-processing pipeline architecture

      • The NoETL manifesto

      • Lambda architecture

      • Hadoop

      • SMACK technologies

        • Apache Spark

        • Akka

        • Apache Cassandra

        • Apache Kafka

        • Apache Mesos

        • Changing the data center operations

          • From scale-up to scale-out

          • The open-source predominance

          • Data store diversification
