Data Warehousing with Greenplum
Open Source Massively Parallel Data Analytics
SECOND EDITION

Marshall Presser

Beijing Boston Farnham Sebastopol Tokyo

Copyright © 2019 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisition Editor: Michelle Smith
Development Editor: Corbin Collins
Production Editor: Deborah Baker
Copyeditor: Bob Russell, Octal Publishing, LLC
Proofreader: Charles Roumeliotis
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

July 2019: Second Edition

Revision History for the Second Edition
2019-06-07: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Warehousing with Greenplum, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with
such licenses and/or rights.

This work is part of a collaboration between O’Reilly and Pivotal. See our statement of editorial independence.

978-1-492-05810-6
[LSI]

Table of Contents

Foreword to the Second Edition
Foreword to the First Edition
Preface

1. Introducing the Greenplum Database
   Problems with the Traditional Data Warehouse
   Responses to the Challenge
   A Brief Greenplum History
   What Is Massively Parallel Processing?
   The Greenplum Database Architecture
   Additional Resources

2. What’s New in Greenplum?
   What’s New in Greenplum 5?
   What’s New in Greenplum 6?
   Additional Resources

3. Deploying Greenplum
   Custom(er)-Built Clusters
   Greenplum Building Blocks
   Public Cloud
   Private Cloud
   Greenplum for Kubernetes
   Choosing a Greenplum Deployment
   Additional Resources

4. Organizing Data in Greenplum
   Distributing Data
   Polymorphic Storage
   Partitioning Data
   Orientation
   Compression
   Append-Optimized Tables
   External Tables
   Indexing
   Additional Resources

5. Loading Data
   INSERT Statements
   \COPY Command
   The gpfdist Process
   The gpload Tool
   Additional Resources

6. Gaining Analytic Insight
   Data Science on Greenplum with Apache MADlib
   Apache MADlib
   Text Analytics
   Brief Overview of GPText Architecture
   Additional Resources

7. Monitoring and Managing Greenplum
   Greenplum Command Center
   Workload Management
   Greenplum Management Tools
   Additional Resources

8. Accessing External Data
   dblink
   Foreign Data Wrappers
   Platform Extension Framework
   Greenplum Stream Server
   Greenplum-Kafka Integration
   Greenplum-Informatica Connector
   GemFire-Greenplum Connector
   Greenplum-Spark Connector
   Amazon S3
   External Web Tables
   Additional Resources

9. Optimizing Query Response
   Fast Query Response Explained
   GPORCA Recent Accomplishments
   Additional Resources

Foreword to the Second Edition

My journey with Pivotal began in 2014 at
Morgan Stanley, where I am the global head of database engineering. We wanted to address two challenges:

• The ever-increasing volume and velocity of data that needed to be acquired, processed, and stored for long periods of time (more than seven years, in some cases)
• The need to satisfy the growing ad hoc query requirements of our business users

Nearly all the data in this problem space was structured, and our user base and business intelligence tool suite used the universal language of SQL. Upon analysis, we realized that we needed a new data store to resolve these issues.

A team of experienced technology professionals spanning multiple organizational levels evaluated the pain points of our current data store product suite in order to select the next-generation platform. The team’s charter was to identify the contenders, define a set of evaluation criteria, and perform an impartial evaluation. Some of the key requirements for this new data store were that the product could easily scale, provide dramatic query response time improvements, be ACID and ANSI compliant, leverage deep data compression, and support a software-only implementation model. We also needed a vendor that had real-world enterprise experience, understood the problem space, and could meet our current and future needs.

We conducted a paper exercise on 12 products followed by two comprehensive proofs-of-concept (PoCs) with our key application stakeholders. We tested each product’s utility suite (load, unload, backup, restore), its scalability capability along with linear query performance, and the product’s ability to recover seamlessly from server crashes (high availability) without causing an application outage. This extensive level of testing allowed us to gain an intimate knowledge of the products, how to manage them, and even some insight into how their service organizations dealt with software defects. We chose Greenplum due to its superior query performance using a columnar
architecture, ease of migration and server management, parallel in-database analytics, the product’s vision and roadmap, and strong management commitment and financial backing.

Supporting our Greenplum decision was Pivotal’s commitment to our success. Our users had strict timelines for their migration to the Greenplum platform. During the PoC and our initial stress tests, we discovered some areas that required improvement. Our deployment schedule was aggressive, and software fixes and updates were needed at a faster cadence than Greenplum’s previous software-release cycle. Scott Yara, one of Greenplum’s founders, was actively engaged with our account, and he responded to our needs by assigning Ivan Novick, Greenplum’s current product manager, to work with us and adapt their processes to meet our need for faster software defect repair and enhancement delivery. This demonstrated Pivotal’s strong customer focus and commitment to Morgan Stanley.

To improve the working relationship even further and align our engineering teams, Pivotal established a Pivotal Tracker (issue tracker, similar to Jira) account, which shortened the feedback loop and improved Morgan Stanley’s communication with the Pivotal engineering teams. We had direct access to key engineers and visibility into their sprints. This close engagement allowed us to do more with Greenplum at a faster pace.

Our initial Greenplum projects were highly successful and our plant doubled annually. The partnership with Pivotal evolved and Pivotal agreed to support our introduction of Postgres into our environment, even though Postgres was not a Pivotal offering at the time. As we became customer zero on Pivotal Postgres, we aligned our online transaction processing (OLTP) and big data analytic offerings on a Postgres foundation. Eventually, Pivotal would go all in with Postgres by open sourcing Greenplum and offering Pivotal Postgres as a generally available product. Making Greenplum the first open source massively parallel
processing (MPP) database built on Postgres gave customers direct access to the code base and allowed Pivotal to tap into the extremely vibrant and eager community that wanted to promote Postgres and the open source paradigm. This showed Pivotal’s commitment to open source and allowed them to leverage open source code for core Postgres features and direct their focus on key distinguishing features of Greenplum such as an MPP optimizer, replicated tables, workload manager (WLM), range partitioning, and graphical user interface (GUI) command center.

Greenplum continues to integrate its product with key open source compute paradigms. For example, with Pivotal’s Platform Extension Framework (PXF), Greenplum can read and write to Hadoop Distributed File System (HDFS) and its various popular formats such as Parquet. Greenplum also has read/write connectors to Spark and Kafka. In addition, Greenplum has not neglected the cloud: it can write to an Amazon Web Services (AWS) Amazon Simple Storage Service (Amazon S3) object store, and it has hybrid cloud solutions that run on any of the major cloud vendors. The cloud management model is appealing to Morgan Stanley because managing large big data platforms on-premises is challenging. The cloud offers near-instant provisioning, flexible and reliable hardware options, near-unlimited scalability, and snapshot backups. Pivotal’s strategic direction of leveraging open source Postgres and investing in the cloud aligns with Morgan Stanley’s strategic vision.

The Morgan Stanley Greenplum plant is in the top five of the Greenplum customer footprints due to the contributions of many teams within Morgan Stanley. As our analytic compute requirements grow and evolve, Morgan Stanley will continue to leverage technology to solve complex business problems and drive innovation.

Howard Goldberg
Executive Director
Global Head of Database Engineering
Morgan Stanley

The gpkafka utility incorporates the gpsscli and gpss commands of the GPSS and makes use of a YAML configuration file. A full example is beyond the scope of this book but is well described in the Greenplum-Kafka Integration documentation.

Greenplum-Informatica Connector

New in Greenplum Version

Implementing the Greenplum-Informatica Connector requires Greenplum and Informatica’s PowerCenter. Informatica is a widely used commercial ETL product. The Greenplum-Informatica Connector uses Informatica’s Power Connector to facilitate high-speed, parallel transfer of data into Greenplum in both batch and streaming mode after Informatica has collected data from multiple sources and transformed it into a structure useful for analytic processing.

The Greenplum-Informatica Connector, like the Greenplum-Kafka Integration, uses GPSS utilities. The architecture with Informatica as the producer is much the same as with a Kafka producer, as shown in Figure 8-2.

Figure 8-2. The Greenplum-Informatica Connector

GemFire-Greenplum Connector

Greenplum is designed to provide analytic insights into large amounts of data, not real-time response. Yet many real-world problems require a system that does both. Pivotal GemFire is the Pivotal-supported version of Apache Geode, an in-memory data grid (IMDG). GemFire provides real-time response, and the GemFire-Greenplum Connector (GGC) integrates the two.

Detecting and stopping fraudulent transactions related to identity theft is a top priority for many banks, credit card companies, insurers, and tax authorities, as well as digital businesses across a variety of industries. Building these systems typically relies on a multistep process, including the difficult steps of moving data in multiple formats between the analytical systems used to build and run predictive models and transactional systems, in which the incoming transactions are scored for the likelihood of fraud.

Analytical systems and transactional systems serve different purposes and, not surprisingly, often store data in different formats fit for purpose. This makes sharing data between systems a challenge for data architects and engineers. The trade-off is unavoidable, given that trying to use a single system to perform two very different tasks at scale is often a poor design choice.

GGC is an extension package built on top of GemFire that maps rows in Greenplum tables to plain-old Java objects (POJOs) in GemFire regions. With the GGC, the contents of Greenplum tables can be easily loaded into GemFire, and entire GemFire regions likewise can be easily consumed by Greenplum. The upshot is that data architects no longer need to spend time hacking together and maintaining custom code to connect the two systems.

GGC functions as a bridge for bidirectionally loading data between Greenplum and GemFire, allowing architects to take advantage of the power of two independently scalable MPP data platforms while greatly simplifying their integration. GGC uses Greenplum’s external table mechanisms (described in Chapter 4) to transfer data between all segments in the Greenplum cluster and all of the GemFire servers in parallel, preventing any single-point bottleneck in the process.

Greenplum-Spark Connector

Apache Spark is an in-memory cluster computing platform. It is primarily used for its fast computation. Although it grew out of the Hadoop MapReduce model, it extends that paradigm to include other types of computation, including interactive queries and stream processing. A full description is available on the Apache Spark website.

Spark was not designed with its own native data platform and originally used HDFS as its data store. However, Greenplum provides a connector that allows Spark to load a Spark DataFrame from a Greenplum table. Users run computations in Spark and then write a Spark DataFrame
into a Greenplum table. Even though Spark does provide a SQL interface, it’s not as powerful as that in Greenplum. However, Spark does iteration on DataFrames quite nicely, something that is not a SQL strong point.

Figure 8-3 shows the flow of the Greenplum-Spark Connector. The Spark driver initiates a Java Database Connectivity (JDBC) connection to the master in a Greenplum cluster to get metadata about the table to be loaded. Table columns are assigned to Spark partitions, and executors on the Spark nodes speak to the Greenplum segments to obtain data.

Figure 8-3. The Greenplum-Spark Connector

The details of installation and configuration are located in the Greenplum-Spark Connector documentation. The documentation provides usage examples in both Scala and PySpark, two of the most common languages Spark programmers use.

Amazon S3

Greenplum can use files in the Amazon S3 storage tier as readable and writable external tables. One useful application is hybrid queries in which some of the data is natively stored in Greenplum, with other data living in S3. For example, S3 could be an archival location for older, less frequently used data. Should the need arise to access the older data, you can access it transparently in place without moving it into Greenplum. You also can use this process to have some table partitions archived to Amazon S3 storage. For tables partitioned by date, this is a very useful feature.

In the following example, raw data stored in Amazon S3 is to be loaded into an existing Greenplum table. The data is in a file, the top lines of which are as follows:

    dog_id, dog_name, dog_dob
    123,Fido,09/09/2010
    456,Rover,01/21/2014
    789,Bonzo,04/15/2016

The corresponding table would be as follows:

    CREATE TABLE dogs
      (dog_id int, dog_name text, dog_dob date)
    DISTRIBUTED RANDOMLY;

The external table definition uses the Amazon S3 protocol:

    CREATE READABLE EXTERNAL TABLE dogs_ext (LIKE dogs)
    LOCATION ('s3://s3-us-west-2.amazonaws.com/s3test.foo.com/normal/')
    FORMAT 'csv' (header)
    LOG ERRORS SEGMENT REJECT LIMIT 50 ROWS;

Assuming that the Amazon S3 URI is valid and the file is accessible, a query against dogs_ext, such as SELECT * FROM dogs_ext, will show the three rows from the CSV file on Amazon S3 as though it were an internal Greenplum table. Of course, performance will not be as good as if the data were stored internally.

The gphdfs protocol allows for reading and writing data to HDFS. For Greenplum used in conjunction with Hadoop as a landing area for data, this provides easy access to data ingested into Hadoop and possibly cleaned there with native Hadoop tools.

External Web Tables

Greenplum also encompasses the concept of an external web table, both readable and writable. There are two kinds of web tables: those accessed by a pure HTTP call, and those accessed via an OS command. In general, these web tables will access data that is changing on a regular basis and can be used to ingest data from an external source.

The United States Geological Survey produces a set of CSV files that describe global earthquake activity. You can access this data on a regular basis, as shown in the following example:

    CREATE TABLE public.wwearthquakes_lastwk (
      time TEXT,
      latitude numeric,
      longitude numeric,
      depth numeric,
      mag numeric,
      mag_type varchar (10),
      magSource text
    )
    DISTRIBUTED BY (time);

    DROP EXTERNAL TABLE IF EXISTS public.ext_wwearthquakes_lastwk;

    CREATE EXTERNAL WEB TABLE public.ext_wwearthquakes_lastwk
      (LIKE wwearthquakes_lastwk)
    -- defining an OS command to execute
    EXECUTE 'wget -qO - http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_week.csv'
    ON MASTER
    FORMAT 'CSV' (HEADER)
    SEGMENT REJECT LIMIT 300;

    GRANT SELECT ON public.ext_wwearthquakes_lastwk TO gpuser;

The following example illustrates an OS command that runs on each segment; the command’s output supplies the rows of the table (more details are available in the Pivotal documentation):

    CREATE EXTERNAL WEB TABLE error_check (
      edate date,
      euser text,
      etype text,
      emsg text)
    EXECUTE 'ssh /usr/local/scripts/error_script.sh'
    FORMAT 'CSV';

To create an external table, users must explicitly be granted this privilege by gpadmin, the Greenplum superuser. The following command grants gpuser the ability to create readable external tables using the HTTP protocol:

    ALTER ROLE gpuser CREATEEXTTABLE (type='readable', protocol='http');

Additional Resources

The gRPC website provides a wealth of documentation as well as tutorials. That said, knowledge of gRPC is not required to run GPSS.

There is more detailed information about the GemFire-Greenplum Connector in the Pivotal GemFire-Greenplum Connector documentation.

GemFire is very different from Greenplum. This brief tutorial from Pivotal is a good place to begin learning about it.

For more on GPSS, visit this Pivotal documentation page.

Greenplum PXF is described in detail on this Pivotal documentation page.

For more details on dblinks, see the Pivotal dblink Functions page.

CHAPTER 9
Optimizing Query Response

When Greenplum forked from PostgreSQL, it inherited the query planner. However, because the planner had no understanding of the Greenplum architecture, it became unwieldy to use in an MPP environment. Pivotal made a strategic decision to build a query optimizer tuned for Greenplum.

Fast Query Response Explained

Venkatesh Raghavan

When processing large amounts of data in a distributed environment, a naive query plan might take orders of magnitude more time than the optimal plan. In some cases, the query execution will not complete, even after several hours, as shown in our experimental study.¹ The Pivotal Query Optimizer (PQO) is designed to find the optimal way to execute user queries in distributed environments such as Pivotal’s Greenplum Database and HAWQ. The open source version of PQO is called GPORCA. To generate the fastest plan, GPORCA considers thousands of alternative query execution
plans and makes a cost-based decision.

¹ “New Benchmark Results: Pivotal Query Optimizer Speeds Up Big Data Queries up to 1000x”

As with most commercial and scientific database systems, user queries are submitted to the database engine via SQL. SQL is a declarative language that is used to define, manage, and query the data stored in relational/stream data management systems.

Declarative languages describe the desired result, not the logic required to produce it. The responsibility for generating an optimal execution plan lies solely with the query optimizer employed in the database management system. To understand how query processing works in Greenplum, there is an excellent description in the Pivotal Greenplum documentation.

GPORCA is a top-down query optimizer based on the Cascades optimization framework,² which is not tightly coupled with the host system. This unique feature enables GPORCA to run as a standalone service outside the database system. Therefore, GPORCA supports products with different computing architectures (e.g., MPP and Hadoop) using a single optimizer. It also takes advantage of the extensive legacy of relational optimization in different query processing paradigms like Hadoop.

For a given user query, there can be a very large number of ways to produce the desired result set, some much more efficient than others. While searching for an optimal query execution plan, the optimizer must examine as many plans as possible and follow heuristics and best practices in choosing some plans and ignoring others. Statistics capturing the characteristics of the stored data (such as skew, number of distinct values, histograms, and percentage of null values), as well as the cost model for the various operations, are all crucial ingredients the query optimizer relies on when navigating the space of all viable execution plans. For GPORCA to be most effective, it is crucial for DBAs to maintain up-to-date statistics for the queried data.

Until recently,
Greenplum used what is referred to as the legacy query optimizer (LQO). This is a derivative of the original PostgreSQL planner that was adapted to the Greenplum code base initially. The PostgreSQL planner was originally built for single-node PostgreSQL optimized for online transaction processing (OLTP) queries. In contrast, an MPP engine is built for long-running online analytical processing (OLAP) queries. For this reason, the PostgreSQL planner was not built with an MPP database in mind.

² G. Graefe, “The Cascades Framework for Query Optimization,” IEEE Data Eng. Bull. 18(3), 1995.

Although features like join ordering were carefully thought out, the architecture and design choices make maintenance and adding new features increasingly difficult.

At the end of 2010, Greenplum began an internal effort to produce a modern query optimizer, which made its first appearance in Greenplum version 4.3.5 as GPORCA.

What makes GPORCA particularly useful is its ability to generate more efficient code for some of the complex situations that commonly arise in analytic data warehouses, including the following:

• Smarter partition elimination
• Subquery unnesting
• Common table expressions (CTEs)
• Multilevel partitioning
• Improved join ordering
• Join aggregate reordering
• Sort order optimization
• Skew awareness

Previously, the LQO was set as the default, but as of Greenplum 5.0, GPORCA is the default query optimizer (see Figure 9-1). You can change the default at the database level or the session level by setting the GUC parameter optimizer to on or off; optimizer = on enables GPORCA. When enabling GPORCA, we request that users or DBAs ensure that statistics have been collected on the root partition of a partitioned table. This is because, unlike the legacy planner, GPORCA uses the statistics at the root partitions rather than using the statistics of individual leaf partitions.

Figure 9-1. Query flow when GPORCA is enabled

Let’s look at an
example. Following is the schema for the table part from the TPC-H benchmark:

    CREATE TABLE part (
      p_partkey integer NOT NULL,
      p_name character varying(55) NOT NULL,
      p_mfgr character(25) NOT NULL,
      p_brand character(10) NOT NULL,
      p_type character varying(25) NOT NULL,
      p_size integer NOT NULL,
      p_container character(10) NOT NULL,
      p_retailprice numeric(15,2) NOT NULL,
      p_comment character varying(23) NOT NULL
    ) DISTRIBUTED BY (p_partkey);

Consider the correlated query shown in Figure 9-2, which fetches all parts with size greater than 40 or retail price greater than the average price of all parts that have the same brand.

Figure 9-2. Correlated subquery on the part table

Figure 9-3 presents the query plan produced by GPORCA; the optimizer status shows the version of GPORCA used to generate the plan.

Figure 9-3. GPORCA plan for a correlated subquery on the part table

In comparison, Figure 9-4 shows an LQO plan that employs a correlated execution strategy.

Figure 9-4. Legacy query optimizer plan for a correlated subquery on the part table

The cost models used by the two optimizers are different. For instance, the top node for the GPORCA plan has a cost of 98133504, whereas that of the LQO is 187279528517187. These numbers make sense within a particular optimizer, but they are not comparable between the two different optimizers.

GPORCA excels on partitioned tables. By comparison, the LQO can only eliminate partitions statically. For example, if a table is partitioned by date, a WHERE clause that limits the date range would eliminate any partitions in which the limited date range could not occur. However, the LQO cannot handle dynamic conditions in which the WHERE clause has a subquery that determines the range. Furthermore, many large fact tables in a data warehouse might have a very large number of partitions. The legacy planner could encounter out-of-memory (OOM) errors in cases for which GPORCA would not.

Modern data analytics
and business intelligence (BI) often produce SQL with correlated subqueries, where the inner subquery requires knowledge of the outer query. Consider the preceding example that fetches parts with size > 40 or retail price greater than the average price of all parts that have the same brand. In the plan shown in Figure 9-4 and generated by the LQO, for each tuple in the outer part table p1, the plan executes a subplan that computes the average part price of all parts having the same brand as the tuple from table part p1. This computed intermediate value is used to determine whether that tuple in p1 will be in the query result or not. Because the LQO plan repeatedly executes the subplan for each tuple in the part table p1, the plan is considered a correlated execution. Such a correlated plan is suboptimal because it does extraneous work that could be avoided. In the worst case, if all the parts belong to the same brand, the plan recomputes the same average price once for every tuple, far more often than necessary.

In contrast, GPORCA generates a de-correlated plan in which it first computes the average price for each brand. This is done only once. The intermediate results then are joined with the parts table to generate a list of parts that meets the user’s criteria.

Un-nesting correlated queries is very important in analytic data warehouses due to the way that BI tools are built. Correlated subqueries are also common in handwritten SQL code. This is evidenced by the fact that 20% and 40% of the workloads in the TPC-DS and TPC-H benchmarks, respectively, have correlated subqueries.

With these and other optimizations, SQL optimized by GPORCA can achieve increases in speed of a factor of 10 or more. There are other queries, albeit a small number, for which GPORCA has not yet produced an improvement in performance. As more capabilities are added to GPORCA over time, it will be the rare case for which the LQO provides better performance.

GPORCA Recent Accomplishments

The GPORCA team is constantly
improving the optimizer’s capabilities. Here are some recent additions:

• Incremental analyze via HyperLogLog
• Rapid distinct value aggregation
• Improved optimization time, caching, and early space pruning
• Large table join, join order optimization using a greedy algorithm
• Improved cost tuning to pick index joins when appropriate
• Support for geospatial workloads with GIST indexes
• Improved cardinality estimation: left joins and predicates on text columns
• Complex nested subqueries: optimizing for colocation (without deadlocks)

Additional Resources

To read more about query optimization, visit the following pages:

• “Profiling Query Compilation Time with GPORCA,” an article from the Pivotal Engineering Journal
• The “About GPORCA” Pivotal documentation page

About the Author

Marshall Presser recently retired from Pivotal, where he had been an advisory data engineer as well as field CTO. He is based in Washington, DC. In addition to helping customers solve complex analytic problems with the Greenplum Database, he also worked on development projects.

Prior to coming to Pivotal (formerly Greenplum), Marshall spent 12 years at Oracle, specializing in high availability, business continuity, clustering, parallel database technology, disaster recovery, and large-scale database systems. Marshall has also worked for a number of hardware vendors implementing clusters and other parallel architectures. His background includes parallel computation and operating system and compiler development, as well as private consulting for organizations in health care, financial services, and federal and state governments.

Marshall holds a BA in Mathematics and an MA in Economics and Statistics from the University of Pennsylvania, and an MSc in Computing from Imperial College, London.