


Scaling Data Services with Pivotal GemFire®
Getting Started with In-Memory Data Grids

Mike Stolz

Pivotal GemFire® In-Memory Data Grid
Powered by Apache® Geode™

Fast: Speed access to data from your applications, especially for data in slower, more expensive databases.
Scalable: Continually meet demand by elastically scaling your application’s data layer.
Available: Improve resilience to potential server and network failures with high availability.
Event-Driven: Provide real-time notifications to applications through a pub-sub mechanism, when data changes.

Learn more at pivotal.io/pivotal-gemfire
Download open source Apache Geode at geode.apache.org
Try GemFire on AWS at aws.amazon.com/marketplace

Scaling Data Services with Pivotal GemFire®
by Mike Stolz

Copyright © 2018 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Susan Conant and Jeff Bleiel
Production Editor: Justin Billing
Copyeditor: Octal Publishing, Inc.
Proofreader: Charles Roumeliotis
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

December 2017: First Edition

Revision History for the First Edition
2017-11-27: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Scaling Data Services with Pivotal GemFire®, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-492-02755-3
[LSI]

Table of Contents

Foreword
Preface
Acknowledgments

1. Introduction to Pivotal GemFire In-Memory Data Grid and Apache Geode
   Memory Is the New Disk
   What Is Pivotal GemFire?
   What Is Apache Geode?
   What Problems Are Solved by an IMDG?
   Real GemFire Use Cases
   IMDG Architectural Issues and How GemFire Addresses Them

2. Cluster Design and Distributed Concepts
   The Distributed System
   Cache
   Regions
   Locator
   CacheServer
   Dealing with Failures: The CAP Theorem
   Availability Zones/Redundancy Zones
   Cluster Sizing
   Virtual Machines and Cloud Instance Types
   Two More Considerations about JVM Size

3. Quickstart Example
   Operating System Prerequisites
   Installing GemFire
   Starting the Cluster
   GemFire Shell
   Something Fun: Time to One Million Puts

4. Spring Data GemFire
   What Is Spring Data?
   Getting Started
   Spring Data GemFire Features

5. Designing Data Objects in GemFire
   The Importance of Keys
   Partitioned Regions
   Colocation
   Replicated Regions
   Designing Optimal Data Types
   Portable Data eXchange Format
   Handling Dates in a Language-Neutral Fashion
   Start Slow: Optimize When and Where Necessary

6. Multisite Topologies Using the WAN Gateway
   Example Use Cases for Multisite
   Design Patterns for Dealing with Eventual Consistency

7. Querying, Events, and Searching
   Object Query Language
   OQL Indexing
   Continuous Queries
   Listeners, Loaders, and Writers
   Lucene Search

8. Authentication and Role-Based Access Control
   Authentication and Authorization
   SSL/TLS

9. Pivotal GemFire Extensions
   GemFire-Greenplum Connector
   Supporting a Fraud Detection Process
   Pivotal Cloud Cache

10. More Than Just a Cache
   Session State Cache
   Compute Grid
   GemFire as System-of-Record

Foreword

In Super Mario Bros., a popular Nintendo video game from the 1980s, you can run faster and jump higher after catching a hidden star. With modern software systems, development teams are finding new kinds of star power: cloud servers, streaming data, and reactive architectures are just a few examples.

Could GemFire be the powerful star for your mission-critical, real-time, data-centric apps? Absolutely, yes! This book reveals how to upgrade your performance game without the head-bumping headaches.

More cloud, cloud, cloud, and more data, data, data. Sound familiar?
Modern applications change how we combine cloud infrastructure with multiple data sources. We’re heading toward real-time, data-rich, and event-driven architectures. For these apps, GemFire fills an important place between relational and single-node key-value databases. Its mature production history is attractive to organizations that need mature production solutions.

At Southwest Airlines, GemFire integrates schedule information from more than a dozen systems, such as passenger, airport, crew, flight, gate, cargo, and maintenance systems. As these messages flow into GemFire, we update real-time web UIs (at more than 100 locations) and empower an innovative set of decision optimization tools. Every day, our ability to make better flight schedule decisions benefits more than 500,000 Southwest Airlines customers. With our event-driven software patterns, data integration concepts, and distributed systems foundation (no eggs in a single basket), we’re well positioned for many years of growth.

Is GemFire the best fit for all types of application problems? Nope. If your use case doesn’t have real-time, high-performance requirements, or a reasonably constrained data window, there are probably better choices. One size does not fit all. Just like trying to store everything in an enterprise data warehouse isn’t the best idea, the same applies for GemFire, too.

Here’s an important safety tip. GemFire by itself is lonely. It needs the right software patterns around it. Without changing how you write your software, GemFire is far less powerful and probably even painful. Well-meaning development teams might gravitate back toward their familiar relational worldview. If you see teams attempting to join regions just like a relational database, remind them to watch the Wizard of Oz. With GemFire, you aren’t in Kansas anymore!
From my experience, when teams say, “GemFire hurts,” it’s usually related to an application software issue. It’s easy to miss a nonindexed query in development, but at production scale it’s a different story.

Event-driven or reactive software patterns are a perfect fit with GemFire. To learn more, the Spring Framework website is an excellent resource. It contains helpful documentation about NoSQL data, cloud-native, reactive, and streaming technologies.

It’s an exciting time for the Apache Geode community. I’ve enjoyed meeting new “friends-of-data” both within and outside of Southwest. I hope you’ll build your Geode and distributed software friend network. Learning new skills is a two-way street. It won’t be long before you’re helping others solve new kinds of challenging problems.

When you combine GemFire with the right software patterns, right problems to solve, and an empowered software team, it’s fun to deliver innovative results!

Brian Dunlap
Solution Architect, Operational Data
Southwest Airlines

Lucene Search

The embedded Lucene search feature allows users to perform various kinds of searches on region data using Apache Lucene. Apache Lucene is an open source search engine. It supports simple, wildcard, and fuzzy searches. Searching for region entries by a person’s name, where users typically do not know the exact first name or last name, is a good use case for Lucene.

Lucene searches can match text fields in region entries based on a single-character wildcard using “?”. The query N??A will match the words “NYLA” as well as “NINA.” Use “*” to perform multiple-character wildcard searches. The query *EEN will match any word that ends with “EEN”.

Lucene supports fuzzy searches using “~”. The fuzzy query Josiah~ will match words with similar spelling.

Lucene also supports range searches and Boolean operations. See the Pivotal GemFire product documentation and the Apache Lucene documentation for the full range of its capabilities.

Apache
Lucene indexes are the core component that facilitates searches. Lucene examines all of the words in region data based on the index definition. The index structure stores words along with pointers to where the words can be found. The indexes are updated asynchronously whenever region entries are saved; indexing asynchronously keeps region put operations fast.

The embedded Lucene search feature provides search support through a Java API or the gfsh command-line utility:

    gfsh>search lucene --region=/Users --keys-only=true --name=userIndex
         --queryStrings="+firstName:Greg +lastName:Green" --defaultField="lastName"

This gfsh Lucene search query searches for region records based on the firstName and lastName fields using the userIndex index. It looks for region entries where the firstName field is Greg and the lastName field is Green. The + means that both the firstName and lastName field matches are required.

CHAPTER 8
Authentication and Role-Based Access Control

Swapnil Bawaskar

In this chapter, we look at using role-based access control (RBAC) in GemFire and securing the communication between various components by using Secure Sockets Layer (SSL).

Authentication and Authorization

Before we dig into how authentication and authorization work, let’s look at the operations that you can perform in GemFire that would need to be authenticated and authorized.

Background

In GemFire you can start and stop locators and servers, and you can alter their runtime to change the log level as well as perform other administrative actions. You create regions to store your data, define indexes, and define disk stores to persist your data; you then actually insert data, and access and query it.

We can broadly classify these actions into two categories based on the type of resource being worked on. Starting servers, altering runtime, and defining disk stores are operations that involve working on your CLUSTER, whereas put(), get(), and queries work on DATA as the resource. The
security framework classifies all operations in these two major categories.

Within each resource classification, we can further classify all commands as either accessing the resource (READ), writing to the resource (WRITE), or making changes to the resource (MANAGE). For example, list members just accesses the CLUSTER resource, whereas stop server manages it. Table 8-1 shows classifications of some of the commands. You can find a comprehensive list in the product documentation.

Table 8-1. Classifications of operations

          Read                 Write               Manage
Cluster   show metrics         change log level    alter runtime
          export logs                              start server
          list members                             shutdown
Data      query                region.put          create region
          region.get/getAll    region.replace      destroy region

This Resource:Operation tuple forms the basic unit for authorization.

Implementation

GemFire’s security framework is pluggable, enabling you to integrate with your existing infrastructure. The only interface you need to implement is SecurityManager from the org.apache.geode.security package. An implementation of SecurityManager only needs to implement the following method:

    public Object authenticate(Properties)

This makes authorization optional. When this method is invoked, the Properties object that is passed in will have two entries, security-username and security-password, that you can use to authenticate the user with your existing infrastructure.

The authorize method, if you choose to implement it, has two parameters: the principal (the object that you returned from the authenticate method) and the ResourcePermission tuple for the requested operation:

    boolean authorize(Object principal, ResourcePermission permission)

After you implement them, you need to tell GemFire about your SecurityManager by adding the following line to the gemfire.properties file:

    security-manager=com.mycompany.MySecManager

Your client application will have to supply its credentials to the server in
order to perform any operation. The way to do that is to implement the AuthInitialize interface in the org.apache.geode.security package and then make sure that the getCredentials() method returns properties with at least two entries: security-username and security-password. Again, your client needs to tell the system about your AuthInitialize implementation by using the security-client-auth-init gemfire property.

Fine-Grained Authorization

To provide fine-grained control over the CLUSTER resource, it has been broken down further into DISK, GATEWAY, QUERY, LUCENE, and DEPLOY, allowing you to give only some users the ability to create WAN gateways, for example. All of these subresources also get finer control over READ/WRITE/MANAGE operations. Examples of permissions for some commands are as follows:

    CLUSTER:MANAGE:DISK: create disk-store, alter disk-store
    CLUSTER:WRITE:DISK: write to the disk store
    CLUSTER:MANAGE:QUERY: create index, destroy index
    CLUSTER:READ:QUERY: list indexes

For a full list, look at the product documentation.

In our discussion earlier, we talked only about operations working on one resource; however, it is possible to have operations working on multiple resources. Consider, for example, the creation of a persistent region. This operation acts on two resources, DATA and DISK; therefore, you will need DATA:MANAGE as well as CLUSTER:WRITE:DISK permissions in order to create a persistent region.

SSL/TLS

To encrypt communication over the wire, let’s look at the various channels of communication in the system:

    1. Locator-to-locator
    2. Locator-to-server
    3. Server-to-server
    4. Client-to-locator
    5. Client-to-server
    6. Between WAN gateways
    7. REST API and Pulse
    8. JMX communication

In many cases, you would want to secure only a subset of these communication channels. For example, your cluster might be protected by a firewall, so securing server-to-server communication might not be required, but your clients might connect to the
servers from outside, in which case you would want to secure the client-to-server communication. You can pick and choose which communication channels should be secured.

You can specify the components that should be secured using the ssl-enabled-components GemFire property. Following are the values that this property accepts:

    locator: for channels 1, 2, and 4 in the previous list (locator-to-locator, locator-to-server, client-to-locator)
    server: for channel 5 (client-to-server)
    cluster: for channel 3 (server-to-server)
    gateway: for channel 6 (between WAN gateways)
    web: for channel 7 (REST API and Pulse)
    jmx: for channel 8 (JMX communication)
    all: for securing everything

CHAPTER 9
Pivotal GemFire Extensions

John Knapp and Jagdish Mirani

GemFire-Greenplum Connector

GemFire is built for rapid response time. Pivotal also produces an analytic relational database known as Greenplum, which was designed to provide analytic insights into large amounts of data, not real-time response. Yet many real-world problems require a system that does both. At Pivotal, we use GemFire for real-time requirements and the GemFire-Greenplum Connector to integrate the two.

The GemFire-Greenplum Connector is a bidirectional, parallel data transfer mechanism. It is based on the Pivotal Greenplum Database’s ability to view data residing outside the normal Greenplum storage (external tables) and Greenplum’s parallel data transfer channel (gpfdist). Here’s a brief description of how this works.

The GemFire-Greenplum Connector (GGC) is an extension package built on top of GemFire that maps rows in Greenplum tables to plain-old Java objects (POJOs) in GemFire regions. With the GGC, the contents of Greenplum tables can be easily loaded into GemFire, and entire GemFire regions likewise can be easily consumed by Greenplum. The upshot is that data architects no longer need to spend time hacking together and maintaining custom code to connect the two systems.

GGC functions as a bridge for bidirectionally loading data between Greenplum and GemFire, allowing architects to take advantage of the power of two independently scalable
massively parallel processing (MPP) data platforms while greatly simplifying their integration. GGC uses Greenplum’s external table mechanisms to transfer data between all segments in the Greenplum cluster and all of the GemFire servers in parallel, preventing any single-point bottleneck in the process.

Supporting a Fraud Detection Process

Greenplum’s horizontal scalability and rich analytics library (MADlib, PL/R, etc.) help teams quickly iterate on anomaly detection models against massive datasets. Using those models to catch fraud in real time, however, requires using them in an application. Depending on the velocity of data ingested through that application, a “fast data” solution might be required to classify the transaction as fraudulent in a timely manner. This activity involves a small dataset and real-time response. By connecting Greenplum and GemFire together, we can provide this fast-data solution hosted inside GemFire and have it be informed by its connection to the deep analytics performed in Greenplum.

Pivotal Cloud Cache

Pivotal Cloud Cache (PCC) is a caching service for Pivotal Cloud Foundry (PCF) powered by Pivotal GemFire. The PCF platform fosters the adoption of modern approaches to building and deploying software.

Caching is finding new relevance in modern, cloud-native, distributed application architectures. As applications are split up into smaller, separately deployed components, often on different systems in different locations, network latencies can severely limit application performance. Caching data locally can reduce network hops, reducing the impact of network latencies.

The sheer number of components in modern architectures introduces many points of failure. Adding highly available caches at critical points in the topology can dramatically increase the overall availability of the system.

Many of the core caching features in PCC are provided by technology that is already battle tested in GemFire. Many of the PCC features that can be
traced back to GemFire are covered in this book, including high availability across servers and availability zones, horizontal scalability and data partitioning and colocation, continuous query, subscribing to events by registering interest (pub/sub), role-based security, and more.

If you’re interested in caching, and you’re already running PCF, you’re likely to be interested in PCC. PCC’s unique purpose as a service on PCF is the result of significant investment in the platform integration of PCC. Services on PCF are designed to be easy for operators to set up and configure, and easy for developers to install and integrate with their apps. This is partly accomplished by delivering services that are opinionated, with service plans preconfigured by use case. By doing this we have eliminated a lot of setup and configuration steps. This use-case-oriented approach is reflected in how PCC delivers the look-aside caching pattern and HTTP Session State Caching. Other patterns like in-line caching and multisite will be equally opinionated and therefore easy to configure.

PCC Service Instances On-Demand

A key goal of PCC, or any PCF service, is to make it easy for developers to self-serve on-demand service instances. However, uncontrolled utilization of resources can lead to escalating IT infrastructure costs. What’s needed is a managed provisioning environment that provides developers with easy self-service access to caching as a resource, but also allows operators to maintain control through operator-defined plans, upgrade rules, and quotas. Operators can customize service plan definitions to support internal chargeback packages and enforce resource constraints.

On-Demand Service Broker

PCC’s on-demand provisioning is built on the On-Demand Service Broker API, an abstraction that makes it easier to integrate and use services on PCF. The API provides support for provisioning, catalog management, binding, and updating instances.
Services like PCC are made available via the PCF Marketplace, which provides developers with a catalog of add-on services to enhance, secure, and manage applications.

On-demand services enable the flexibility to create instances in a scalable and cost-effective way. When an operator deploys the service, they do not preallocate virtual machine (VM) resources for service instances. Instead, they define an allowable range of VM memory and CPU sizes, set quotas on the number of service instances that can be started, and create a dedicated network on the Infrastructure as a Service to host any required number of service instance VMs.

When a developer requests a service instance, it is provisioned on-demand by BOSH, a tool chain for release engineering, deployment, and lifecycle management of services. BOSH is a vital part of the PCF platform. The developer selects the service plan, and BOSH dynamically creates new dedicated VMs for the instance, configured as per the service plan.

Cloud Agnostic

After a BOSH release is created, it’s compatible with multiple clouds. BOSH abstracts the specifics of each cloud provider with a Cloud Provider Interface (CPI) abstraction. BOSH users can realize the benefits of each provider without in-depth knowledge of each.

High Availability

The BOSH layer monitors service nodes, removes unresponsive instances, and restores capacity by spinning up new instances or adding capacity on demand. PCC supports the notion of multiple availability zones mapped to GemFire redundancy zones, which allows smooth recovery from server failures.

Integration with Logging and Monitoring Services

For gaining visibility into the details of the PCC service operation, PCC streams its logs to PCF’s Loggregator Firehose. Standard monitoring and logging tools that integrate with the Firehose via a nozzle can be used for displaying this operational data in dashboards.

CHAPTER 10
More Than Just a Cache

Mike Stolz

Session State Cache

Web and mobile apps maintain the illusion of a user session even though they are connected over a sessionless communication protocol. This is achieved by caching the session state in a data layer separate from the app servers so that the load balancer is free to move load to any app server; that server will still have access to the session state at all times.

Compute Grid

There are many use cases for which moving the compute to the data instead of moving the data to the compute can yield tremendous performance improvements. An example of such a use case is a financial risk-management system. In one benchmark of a position-keeping system calculating positions on 20 books of 10,000 European options, starting with five million trades per book, with 20,000 market data updates per second and 2,000 new trades per second, the mean time to “price the book” went down from 2.67 seconds using a separate data grid and compute grid to 0.035 seconds when the compute was done in situ with the data. That is an improvement in performance of 76 times, with half as much hardware, just by moving the compute to the data.

The key to this kind of performance gain is the use of GemFire’s server-side data-aware function execution service. There are two ways that this service works. In both cases you program your custom function to act only on data that is local to the member that it is running on, and you deploy it to the servers in the cluster. Then, you invoke it from a server or client.

In the “data-aware” function mechanism, when you invoke the function, you are invoking it on all of the members that host a specified Region, and you pass in a filter, which is a list of keys that are used to cause the function to run only on the members that are primaries for those keys.

There is also a notion of executing a function on specific members of the cluster, or a group of members who are part of a member group, or all members in the cluster. In any of these cases, each of
those members should be programmed to operate only on the data that is local to the member on which the function is running.

GemFire as System-of-Record

GemFire offers robust features to enable usage as the system-of-record for applications. A system-of-record is the authoritative source of truth for a given dataset. Being a system-of-record implies a set of features required to make the data safe. These features are high availability, disk durability, business continuity, snapshots, backup, incremental backup, and restore. All of those features are present in GemFire. Let’s take each of the features one by one and examine the implementation details.

High Availability

GemFire high availability is achieved via synchronous replication between the primary and backup members. For Partitioned Regions, where the data is sharded across the members of the cluster, there is a notion of a primary for every object, and some number of backups. Every member is primary for some of the objects, and every member is backup for other objects. Because the replication between primaries and backups is synchronous and is completed before returning to the caller of put(key, value), there is no chance of silent data loss in the event of primary failure.

Disk Durability

In keeping with the shared-nothing architecture of GemFire, both primaries and backups are saved to separate locally attached storage on separate hosts in separate redundancy zones, which are usually mapped to separate Infrastructure as a Service availability zones, so there is no chance of a single point of failure causing data loss. For system-of-record use cases, it is recommended that you configure three redundancy zones and three copies of the mission-critical data.

Business Continuity

In addition to the availability zones, you also can configure GemFire to provide further business continuity in the event of a major disaster. This disaster recovery functionality is provided by the
GemFire WAN Gateway, which is designed for replication of data between distant sites. Although the WAN Gateway is asynchronous, when it is used in conjunction with multiple availability zones the likelihood of data loss due to buffering in the WAN Gateway is still quite low.

It is also important to note that the WAN Gateway can be bidirectional. This is a critical feature for both active/active and active/passive use cases. The active/active need for bidirectionality is obvious: both sites are active and backing each other up. The active/passive use case also benefits from the bidirectionality of the WAN Gateway. When the primary site fails, and everything cuts over to the backup site, you want to queue up changes made on the backup until the primary comes back to life. This way, when the primary comes back, it automatically gets brought back up to date on anything it missed while it was away. So GemFire’s WAN Gateway makes fail-back just as easy as failover. Anybody can fail over, but fail-back is difficult without a feature like this.

Snapshots

GemFire has the ability to take a point-in-time snapshot of data in any Region, and the ability to import that data into the same or another GemFire system. When gfsh takes a snapshot, it also takes a copy of the PDX registry so that you will be able to apply the snapshot correctly in another system. In addition, there is a public API that allows you to read the contents of the snapshot file and do with it anything you may want, including modifying the records before inserting them into another system.

Backup, Incremental Backup, and Restore

GemFire has built-in full backup, incremental backup, and restore mechanisms. For each member with persistent data, a full backup includes a full image of all of the saved state on all members in the cluster. An incremental backup saves the difference between the last backup and the current data. An incremental backup copies only
op-logs that are not already present in the baseline directories for each member.

All these enterprise features make Apache Geode and its commercial counterpart GemFire suitable for even the most mission-critical use cases in many of the world’s largest enterprises.

About the Authors

Mike Stolz is the product lead for Pivotal GemFire and is based on Long Island, New York. As product lead, he defines a roadmap that steers the product toward value, engages with a range of stakeholders both inside and outside the company to both listen and inform, shapes expectations of what the product is and is not good for, and engages deeply with the product managers internally to ensure Pivotal is delivering on the vision.

He was the first user to deploy GemFire in production back in 2003 when he was Director and Chief Architect for Fixed Income, Currencies, Commodities, Liquidity, and Risk Technology at Merrill Lynch. In fact, as of this writing, the global market data system that he and his team deployed way back then is still in continuous operation and is still running on the original version of GemFire that it was initially deployed on. It is the longest running GemFire-based app in the world.

Mike took an early retirement package from Merrill Lynch and joined GemStone Systems, the creators of GemFire, as VP of Architecture in 2007 and has served in many roles on the GemFire team ever since.
