Table of Contents Apache Cassandra Essentials Credits About the Author About the Reviewers www.PacktPub.com Support files, eBooks, discount offers, and more Why subscribe? Free access for Packt account holders Preface What this book covers What you need for this book Who this book is for Conventions Reader feedback Customer support Downloading the example code Errata Piracy Questions Getting Your Cassandra Cluster Ready Installation Prerequisites Compiling Cassandra from source and installing Installation from a precompiled binary The installation layout The directory layout in tarball installations The directory layout in package-based installation Configuration files cassandra.yaml Running a Cassandra server Running a Cassandra node Setting up the cluster Viewing the cluster status Summary An Architectural Overview Background Cassandra cluster overview The Gossip protocol Failure detection Data distribution Replication SimpleStrategy NetworkTopologyStrategy Snitches Virtual nodes Adding nodes to our cluster Create keyspace and column family Summary Creating Database and Schema A database and schema Keyspace Column families Static rows Wide rows A primary key Partition keys and clustering columns A composite partition key Multiple clustering columns Static columns Modifying a table Data types Counters Collections Sets Lists Map UDTs Secondary indexes Allowing filtering TTL Conditional querying Conditions on a partition key Conditions on a partition key and clustering columns Sorting query results Write operations Lightweight transactions Batch statements Summary Read and Write – Behind the Scenes Write operations CommitLog Anatomy of Memtable SSTable explained SSTable Compaction strategies Size-tiered compaction Leveled compaction DateTiered compaction Read operations Reads from row cache Read operations for row cache miss Key is in KeyCache Key search miss both the key cache and the row cache Delete operations Data consistency Read operation Digest reads Read 
repair Consistency levels Write operation Hinted handoff Consistency levels Tracing Cassandra queries Summary Writing Your Cassandra Client Connecting to a Cassandra cluster Driver Connection policies Load balancing policies Retry policies Reconnection policies Reading and writing to the Cassandra cluster QueryBuilder Reading and writing asynchronously Prepared statements Example REST service using prepared statement Batch statements Mapping API Tracing Cassandra queries using Java driver Summary Monitoring and Tuning a Cassandra Cluster Monitoring a Cassandra cluster Use logging for debugging Monitoring using command-line utilities nodetool cfstats nodetool cfhistograms nodetool netstats nodetool tpstats JConsole Third-party tools Tuning Cassandra nodes Configuring Cassandra caches Tuning Bloom filters Configuring and tuning Java Summary Backup and Restore Taking backup of a Cassandra cluster Manual backup Deleting snapshots Incremental backup Restoring data to Cassandra The Cassandra bulk loader Exporting and importing data using the Cassandra JSON utility Loading external data into Cassandra Removing nodes from Cassandra cluster Adding nodes to a Cassandra cluster Replacing dead nodes in a cluster Summary Index

Apache Cassandra Essentials

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing
has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: November 2015
Production reference: 1161115
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK
ISBN 978-1-78398-910-2
www.packtpub.com

Credits

Author: Nitin Padalia
Reviewers: Ranjeet Kumar Jha, Sonal Raj, Chaoran Yu
Commissioning Editor: Akram Hussain
Acquisition Editor: Meeta Rajani
Content Development Editor: Aparna Mitra
Technical Editor: Rohan Uttam Gosavi
Copy Editor: Pranjali Chury
Project Coordinator: Mary Alex
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Disha Haria
Production Coordinator: Nilesh Mohite
Cover Work: Nilesh Mohite

About the Author

Nitin Padalia is the technical leader at Aricent Group, where he is involved in building highly scalable distributed applications in the field of telecommunications. He has worked in telecommunications since the beginning of his career, on protocols such as SMPP, RTP, SIP, and VOIP, and on the development of applications that scale out with the highest possible performance. He has experience of developing applications for bare metal hardware, virtualized environments, and the cloud, using various languages and technologies.

I would like to thank all the reviewers of this book; their comments helped me to present data effectively. I thank Meeta Rajani, for setting things up and providing input during the initial phase of the book; Anish Sukumaran, for helping me through his comments and input till the completion of this book; Chaoran Yu, for good suggestions regarding presenting data and examples in a way that could be more helpful from the readers' perspective; and Ranjit, for his input throughout the book. I also would like to thank my family: my mother, father,
wife, and kids, for letting me take some time out to write this book.

Adding nodes to a Cassandra cluster

To add a node to a Cassandra cluster, review the following configuration options in the cassandra.yaml file of the new node:

auto_bootstrap: Set this option to true so that the newly joining node can collect its share of the data from the other nodes.
listen_address: Set this to the IP address of the new node.
endpoint_snitch: Ensure that the new node uses the same snitch as the other nodes.
seed_provider: This lists the nodes that are in the seed node list of the existing cluster. Because the new node is bootstrapping, it can't be in the seed node list yet.
cluster_name: Ensure that the cluster name is the same as that of the other nodes in the cluster.

In the cassandra-rackdc.properties file, set the correct data center and rack information for the new node. After verifying the configuration, start Cassandra on the new node. Once the new node is up and running, execute the nodetool cleanup command on every node other than the new one to remove the data for partition keys that those nodes no longer handle.

Replacing dead nodes in a cluster

To replace a dead node, first remove that node using the nodetool removenode command as described earlier, and then add the new node as discussed in the previous section.

Note
If the new node's IP address is different from the dead node's IP address, start Cassandra on the new node with the startup parameter replace_address=

To replace a node that is still alive, for example because of a hardware upgrade, first add the new node and then decommission the old node using the nodetool decommission command, as discussed previously.

Summary

Cassandra is a highly available, fault-tolerant, distributed database. However, data can sometimes get corrupted because of client application faults and other reasons. To handle such situations, Cassandra
provides tools to back up data and restore it to the last known state. Using the nodetool snapshot command, we can manually take a snapshot of a node's data. Data restored from a single node's snapshot might not be consistent, but once the snapshots of all nodes are restored, the data eventually becomes consistent. Enabling incremental backups makes Cassandra back up node data automatically. The sstableloader utility loads SSTables in bulk; with it, SSTables can be loaded into a different cluster, even one with different token ranges and a different replication factor. The sstable2json command converts an SSTable to JSON format, which can then be converted back to an SSTable using the json2sstable command and loaded into a Cassandra node. The CQLSSTableWriter class API can be used to create SSTables from external data, such as a CSV file. Cassandra also provides command-line utilities to add, remove, or replace Cassandra nodes.

Index

B backup about / Taking backup of a Cassandra cluster manual backup / Manual backup incremental backup / Incremental backup batch statements about / Batch statements used, for reading/writing Cassandra cluster / Batch statements Bigtable about / Background Bloom filters tuning / Tuning Bloom filters C Cassandra design / Background cassandra.yaml file, configuration options auto_bootstrap / Adding nodes to a Cassandra cluster listen_address / Adding nodes to a Cassandra cluster endpoint_snitch / Adding nodes to a Cassandra cluster seed_provider / Adding nodes to a Cassandra cluster cluster_name / Adding nodes to a Cassandra cluster Cassandra bulk loader about / The Cassandra bulk loader data importing, Cassandra JSON utility used / Exporting and importing data using the Cassandra JSON utility data exporting, Cassandra JSON utility used / Exporting and importing data using the Cassandra JSON utility external data, loading / Loading external data into Cassandra Cassandra caches configuring / Configuring Cassandra caches Key cache / Configuring Cassandra
caches Row cache / Configuring Cassandra caches Cassandra cluster installation / Installation prerequisites / Prerequisites compiling, from source / Compiling Cassandra from source and installing installing / Compiling Cassandra from source and installing installing, from precompiled binary / Installation from a precompiled binary overview / Cassandra cluster overview Gossip protocol / The Gossip protocol failure detection / Failure detection nodes, adding / Adding nodes to our cluster connecting to / Connecting to a Cassandra cluster policies / Driver Connection policies reading/writing / Reading and writing to the Cassandra cluster reading/writing, with QueryBuilder / QueryBuilder reading/writing, asynchronously / Reading and writing asynchronously reading/writing, with prepared statement / Prepared statements, Example REST service using prepared statement reading/writing, with batch statements / Batch statements monitoring / Monitoring a Cassandra cluster logs, used for debugging / Use logging for debugging monitoring, command line utilities used / Monitoring using command-line utilities monitoring, JConsole used / JConsole monitoring, third-party tools used / Third-party tools Cassandra JSON utility used, for importing data / Exporting and importing data using the Cassandra JSON utility used, for exporting data / Exporting and importing data using the Cassandra JSON utility Cassandra logs used, for debugging / Use logging for debugging Cassandra server running / Running a Cassandra server Cassandra node, running / Running a Cassandra node cluster, setting up / Setting up the cluster cluster status, viewing / Viewing the cluster status cfhistograms command about / nodetool cfhistograms cfstats command about / nodetool cfstats using / nodetool cfstats clustering columns about / Partition keys and clustering columns conditional querying / Conditions on a partition key and clustering columns collections about / Collections set / Sets list / Lists map / Map column 
family about / Column families static rows / Static rows wide rows / Wide rows Column Family creating / Create keyspace and column family command line utilities used, for monitoring Cassandra cluster / Monitoring using command-line utilities nodetool utility / Monitoring using command-line utilities CommitLog about / CommitLog compaction about / SSTable Compaction strategies size-tiered compaction / Size-tiered compaction leveled compaction / Leveled compaction DateTiered compaction / DateTiered compaction composite partition key about / A composite partition key multiple clustering columns / Multiple clustering columns static columns / Static columns table, modifying / Modifying a table conditional queries performing / Conditional querying on partition key / Conditions on a partition key on partition key and clustering columns / Conditions on a partition key and clustering columns query results, sorting / Sorting query results configuration files about / Configuration files cassandra.yaml / cassandra.yaml cluster configurations / cassandra.yaml data partitioning / cassandra.yaml storage configurations / cassandra.yaml client configurations / cassandra.yaml security configurations / cassandra.yaml consistency about / Data consistency for read operation / Read operation for write operation / Write operation coordinator node about / Cassandra cluster overview counters about / Counters limitation / Counters D database about / A database and schema keyspace / A database and schema, Keyspace column family / Column families primary key / A primary key composite partition key / A composite partition key data distribution about / Data distribution Datastax about / Installation Datastax driver about / Load balancing policies RoundRobinPolicy / Load balancing policies DCAwareRoundRobinPolicy / Load balancing policies LatencyAwarePolicy / Load balancing policies TokenAwarePolicy / Load balancing policies WhiteListPolicy / Load balancing policies data type mapping reference 
link / Loading external data into Cassandra data types about / Data types native types / Data types collection types / Data types tuples types / Data types user-defined types (UDT) / Data types custom types, using Java class / Data types counters / Counters collections / Collections User Defined Types (UDTs) / UDTs DateTiered compaction about / DateTiered compaction delete operations about / Delete operations digest reads about / Digest reads DynamoDB about / Background F failure detection about / Failure detection filters allowing / Allowing filtering G Gossip protocol about / The Gossip protocol H hinted handoff about / Hinted handoff consistency levels, for write operation / Consistency levels I incremental backup about / Incremental backup Installation layout about / The installation layout tarball installations / The directory layout in tarball installations package-based installation / The directory layout in package-based installation J Java tuning / Configuring and tuning Java configuring / Configuring and tuning Java JConsole used, for monitoring Cassandra cluster / JConsole Overview tab / JConsole Memory tab / JConsole Threads tab / JConsole Classes tab / JConsole VM Summary tab / JConsole MBeans tab / JConsole K keyspace creating / Create keyspace and column family about / A database and schema, Keyspace L leveled compaction about / Leveled compaction Lightweight Transaction (LWT) about / Lightweight transactions, Consistency levels list about / Lists load balancing policies about / Load balancing policies M manual backup about / Manual backup snapshots, deleting / Deleting snapshots map about / Map mapping API about / Mapping API example / Mapping API implementing / Mapping API Memtable about / Anatomy of Memtable multiple clustering columns about / Multiple clustering columns N netstats command about / nodetool netstats NetworkTopologyStrategy about / NetworkTopologyStrategy snitches / Snitches nodes adding, to cluster / Adding nodes to our cluster 
tuning / Tuning Cassandra nodes Cassandra caches, configuring / Configuring Cassandra caches Bloom filters, tuning / Tuning Bloom filters Java, tuning / Configuring and tuning Java Java, configuring / Configuring and tuning Java removing / Removing nodes from Cassandra cluster adding / Adding nodes to a Cassandra cluster replacing / Replacing dead nodes in a cluster nodetool repair about / Keyspace nodetool utility about / Monitoring a Cassandra cluster, Monitoring using command-line utilities cfstats command / nodetool cfstats cfhistograms command / nodetool cfhistograms netstats command / nodetool netstats tpstats command / nodetool tpstats O one day international (ODI) about / Map P package-based installation directory layout / The directory layout in package-based installation partition key about / Column families conditional querying / Conditions on a partition key phi accrual failure detector reference link / Failure detection policies, Cassandra cluster about / Driver Connection policies load balancing policies / Load balancing policies retry policies / Retry policies reconnection policies / Reconnection policies prepared statement used, for reading/writing Cassandra cluster / Prepared statements, Example REST service using prepared statement primary key about / A primary key clustering columns / Partition keys and clustering columns Q QueryBuilder used, for reading/writing Cassandra cluster / QueryBuilder R read operations about / Read operations reads, from row cache / Reads from row cache for row cache miss / Read operations for row cache miss partition key, searching in KeyCache / Key is in KeyCache key search, missing in key cache / Key search miss both the key cache and the row cache key search, missing in row cache / Key search miss both the key cache and the row cache consistency / Read operation digest reads / Digest reads read repair / Read repair read repair about / Read repair consistency level, for read operation / Consistency levels 
reconnection policies about / Reconnection policies ConstantReconnectionPolicy / Reconnection policies ExponentialReconnectionPolicy / Reconnection policies replication about / Replication restoring about / Restoring data to Cassandra Cassandra bulk loader / The Cassandra bulk loader retry policies about / Retry policies DefaultRetryPolicy / Retry policies DowngradingConsistencyRetryPolicy / Retry policies FallthroughRetryPolicy / Retry policies LoggingRetryPolicy / Retry policies row cache about / Reads from row cache S schema about / A database and schema secondary indexes about / Secondary indexes set about / Sets SimpleStrategy about / SimpleStrategy size-tiered compaction about / Size-tiered compaction snapshot about / Taking backup of a Cassandra cluster snapshots deleting / Deleting snapshots snitches about / Snitches SSTable about / SSTable explained static columns about / Static columns static rows about / Static rows T table options / Column families modifying / Modifying a table tarball installations directory layout / The directory layout in tarball installations third-party tools used, for monitoring Cassandra cluster / Third-party tools reference link / Third-party tools time to live (TTL) about / TTL tpstats command about / nodetool tpstats tracing about / Tracing Cassandra queries enabling / Tracing Cassandra queries using Java driver U User Defined Types (UDTs) about / UDTs V virtual nodes about / Virtual nodes configuring / Virtual nodes W wide rows about / Wide rows write operations about / Write operations, Write operations CommitLog / CommitLog Memtable / Anatomy of Memtable SSTable / SSTable explained compaction / SSTable Compaction strategies consistency / Write operation hinted handoff / Hinted handoff
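The configuration checklist in "Adding nodes to a Cassandra cluster" can be sketched as a small pre-flight check. This is only an illustration under stated assumptions, not part of Cassandra or of the book's example code: the NodeConfig class and the check_new_node function are hypothetical names, the settings are passed in as plain values rather than parsed from cassandra.yaml, and the final seed-overlap test is an added assumption based on the description of seed_provider.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeConfig:
    """Hypothetical holder for the cassandra.yaml settings named in the checklist."""
    cluster_name: str
    listen_address: str
    endpoint_snitch: str
    seeds: List[str]            # addresses listed under seed_provider
    auto_bootstrap: bool = True


def check_new_node(new: NodeConfig, existing: NodeConfig) -> List[str]:
    """Compare a joining node's settings with those of a node already in the
    cluster. Returns a list of problems; an empty list means the checklist passes."""
    problems = []
    if not new.auto_bootstrap:
        problems.append("auto_bootstrap should be true so the node collects data from its peers")
    if new.cluster_name != existing.cluster_name:
        problems.append("cluster_name differs from the existing cluster")
    if new.endpoint_snitch != existing.endpoint_snitch:
        problems.append("endpoint_snitch differs from the one used by the other nodes")
    if new.listen_address in new.seeds:
        problems.append("a bootstrapping node must not be in the seed node list")
    # Assumption: the new node should point at one or more seeds of the existing cluster.
    if not set(new.seeds) & set(existing.seeds):
        problems.append("seed list shares no node with the existing cluster")
    return problems


if __name__ == "__main__":
    # Example addresses and names are made up for illustration.
    current = NodeConfig("Test Cluster", "10.0.0.1", "GossipingPropertyFileSnitch", ["10.0.0.1"])
    joining = NodeConfig("Test Cluster", "10.0.0.4", "GossipingPropertyFileSnitch", ["10.0.0.1"])
    for problem in check_new_node(joining, current):
        print("WARNING:", problem)
```

Running such a check before starting Cassandra on the new node catches mismatches early. It does not replace the later steps described in that section, such as running nodetool cleanup on the other nodes.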