IT training confluent kafka definitive guide complete khotailieu

Compliments of Confluent

Kafka: The Definitive Guide
Real-Time Data and Stream Processing at Scale
Neha Narkhede, Gwen Shapira & Todd Palino

Get Started With Apache Kafka™ Today
CONFLUENT OPEN SOURCE
A 100% open source Apache Kafka distribution for building robust streaming applications
CONNECTORS, CLIENTS, SCHEMA REGISTRY, REST PROXY
• Thoroughly tested and quality assured
• Additional client support, including Python, C/C++ and .NET
• Easy upgrade path to Confluent Enterprise
Start today at confluent.io/download

Beijing Boston Farnham Sebastopol Tokyo

Kafka: The Definitive Guide
by Neha Narkhede, Gwen Shapira, and Todd Palino

Copyright © 2017 Neha Narkhede, Gwen Shapira, Todd Palino. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Shannon Cutt
Production Editor: Shiny Kalapurakkel
Copyeditor: Christina Edwards
Proofreader: Amanda Kersey
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

July 2017: First Edition

Revision History for the First Edition
2017-07-07: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491936160 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Kafka: The Definitive Guide, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-99065-0
[LSI]

Table of Contents

Foreword
Preface

Meet Kafka
  Publish/Subscribe Messaging
  How It Starts
  Individual Queue Systems
  Enter Kafka
  Messages and Batches
  Schemas
  Topics and Partitions
  Producers and Consumers
  Brokers and Clusters
  Multiple Clusters
  Why Kafka?
  Multiple Producers
  Multiple Consumers
  Disk-Based Retention
  Scalable
  High Performance
  The Data Ecosystem
  Use Cases
  Kafka’s Origin
  LinkedIn’s Problem
  The Birth of Kafka
  Open Source
  The Name
  Getting Started with Kafka

Installing Kafka
  First Things First
  Choosing an Operating System
  Installing Java
  Installing Zookeeper
  Installing a Kafka Broker
  Broker Configuration
  General Broker
  Topic Defaults
  Hardware Selection
  Disk Throughput
  Disk Capacity
  Memory
  Networking
  CPU
  Kafka in the Cloud
  Kafka Clusters
  How Many Brokers?
  Broker Configuration
  OS Tuning
  Production Concerns
  Garbage Collector Options
  Datacenter Layout
  Colocating Applications on Zookeeper
  Summary

Kafka Producers: Writing Messages to Kafka
  Producer Overview
  Constructing a Kafka Producer
  Sending a Message to Kafka
  Sending a Message Synchronously
  Sending a Message Asynchronously
  Configuring Producers
  Serializers
  Custom Serializers
  Serializing Using Apache Avro
  Using Avro Records with Kafka
  Partitions
  Old Producer APIs
  Summary

Kafka Consumers: Reading Data from Kafka
  Kafka Consumer Concepts
  Consumers and Consumer Groups
  Consumer Groups and Partition Rebalance
  Creating a Kafka Consumer
  Subscribing to Topics
  The Poll Loop
  Configuring Consumers
  Commits and Offsets
  Automatic Commit
  Commit Current Offset
  Asynchronous Commit
  Combining Synchronous and Asynchronous Commits
  Commit Specified Offset
  Rebalance Listeners
  Consuming Records with Specific Offsets
  But How Do We Exit?
  Deserializers
  Standalone Consumer: Why and How to Use a Consumer Without a Group
  Older Consumer APIs
  Summary

Kafka Internals
  Cluster Membership
  The Controller
  Replication
  Request Processing
  Produce Requests
  Fetch Requests
  Other Requests
  Physical Storage
  Partition Allocation
  File Management
  File Format
  Indexes
  Compaction
  How Compaction Works
  Deleted Events
  When Are Topics Compacted?
  Summary

Reliable Data Delivery
  Reliability Guarantees
  Replication
  Broker Configuration
  Replication Factor
  Unclean Leader Election
  Minimum In-Sync Replicas
  Using Producers in a Reliable System
  Send Acknowledgments
  Configuring Producer Retries
  Additional Error Handling
  Using Consumers in a Reliable System
  Important Consumer Configuration Properties for Reliable Processing
  Explicitly Committing Offsets in Consumers
  Validating System Reliability
  Validating Configuration
  Validating Applications
  Monitoring Reliability in Production
  Summary

Building Data Pipelines
  Considerations When Building Data Pipelines
  Timeliness
  Reliability
  High and Varying Throughput
  Data Formats
  Transformations
  Security
  Failure Handling
  Coupling and Agility
  When to Use Kafka Connect Versus Producer and Consumer
  Kafka Connect
  Running Connect
  Connector Example: File Source and File Sink
  Connector Example: MySQL to Elasticsearch
  A Deeper Look at Connect
  Alternatives to Kafka Connect
  Ingest Frameworks for Other Datastores
  GUI-Based ETL Tools
  Stream-Processing Frameworks
  Summary

Cross-Cluster Data Mirroring
  Use Cases of Cross-Cluster Mirroring
  Multicluster Architectures
  Some Realities of Cross-Datacenter Communication
  Hub-and-Spokes Architecture
  Active-Active Architecture
  Active-Standby Architecture
  Stretch Clusters
  Apache Kafka’s MirrorMaker
  How to Configure
  Deploying MirrorMaker in Production
  Tuning MirrorMaker
  Other Cross-Cluster Mirroring Solutions
  Uber uReplicator
  Confluent’s Replicator
  Summary

Administering Kafka
  Topic Operations
  Creating a New Topic
  Adding Partitions
  Deleting a Topic
  Listing All Topics in a Cluster
  Describing Topic Details
  Consumer Groups
  List and Describe Groups
  Delete Group
  Offset Management
  Dynamic Configuration Changes
  Overriding Topic Configuration Defaults
  Overriding Client Configuration Defaults
  Describing Configuration Overrides
  Removing Configuration Overrides
  Partition Management
  Preferred Replica Election
  Changing a Partition’s Replicas
  Changing Replication Factor
  Dumping Log Segments
  Replica Verification
  Consuming and Producing
  Console Consumer
  Console Producer

...
  Under-Replicated Partitions
  Broker Metrics
  Topic and Partition Metrics
  JVM Monitoring
  OS Monitoring
  Logging
  Client Monitoring
  Producer Metrics
  Consumer Metrics
  Quotas
  Lag Monitoring
  End-to-End Monitoring
...

... multiple partitions, there is no guarantee of message time-ordering across the entire topic, just within a single partition. Figure 1-5 shows a topic with four partitions, with writes being appended ...

... filesystem. Topics are additionally broken down into a number of partitions. Going back to the “commit log” description, a partition is a single log. Messages are written to it in an append-only fashion,

Format
Pages	322
File size	6.23 MB