"Data virtualization had been held back by complexity for decades until recent advances in cloud technology, data lakes, networking hardware, and machine learning transformed the dream into reality. It''''s becoming increasingly practical to access data through an interface that hides low-level details about where it''''s stored, how it''''s organized, and which systems are needed to manipulate or process it. You can combine and query data from anywhere and leave the complex details behind. In this practical book, authors Dr. Daniel Abadi and Andrew Mott discuss in detail what data virtualization is and the trends in technology that are making data virtualization increasingly useful. With this book, data engineers, data architects, and data scientists will explore the architecture of modern data virtualization systems and learn how these systems differ from one another at technical and practical levels."
Chapter 1 Introduction to Data Virtualization and Data Lakes
As long as humankind has existed, knowledge has been spread out across the world. In ancient times, one had to go to Delphi to see the oracle, to Egypt for the construction knowledge necessary to build the pyramids, and to Babylon for the irrigation science to build the Hanging Gardens. If someone had a simple question, such as, "Were the engineering practices used in building the pyramids and Hanging Gardens similar?" no answer could be obtained in less than a decade. The person would have to spend years learning the languages used in Babylon and Egypt in order to converse with the locals who knew the history of these structures; learn enough about the domains to express their questions in such a way that the answers would reveal whether the engineering practices were similar; and then travel to these locations and figure out who were the correct locals to ask.

In modern times, data is no less spread out than it used to be. Important knowledge—capable of answering many of the world's most important questions—remains dispersed across the entire world. Even within a single organization, data is typically spread across many different locations. Data generally starts off being located where it was generated, and a large amount of energy is needed to overcome the inertia to move it to a new location. The more locations data is generated in, the more locations data is found in.
Although it no longer takes a decade to answer basic questions, it still takes a surprisingly long time. This is true even though we are now able to transfer data across the world at the speed of light: an enormous amount of information can be accessed from halfway across the earth in just a few seconds. Furthermore, computers can now consume and process billions of bytes of data in less than a second. Answering almost any question should be nearly instantaneous, yet in practice it is not. Why does it take so long to answer basic questions—even today?
The answer to this question is that many of the obstacles to answering a question about the pyramids and Hanging Gardens in ancient times still exist today. Language was a barrier then. It is still a barrier today. Learning enough about the semantics of the domain was a barrier then. It is still a barrier today. Figuring out who to ask (i.e., where to find the needed data) was a barrier then. It is still a barrier today. So while travel time is billions of times faster now than it was then, and processing time is billions of times faster now than it was then, this only benefits us to the extent that these parts of analyzing data are no longer the bottleneck. Rather, it is these other barriers that prevent us from efficiently getting answers to questions we have about datasets.
Language is a barrier beyond the fact that one dataset may be in English, another in Chinese, and another in Greek. Even if they are all in English, the computing system that stores the data may require questions to be posed in different languages in order to extract or answer questions about these datasets. One system may have an SQL interface, another GraphQL, and a third system may support only text search. The client who wishes to pose a question to these differing systems needs to learn the language that the system supports as its interface. Further, the client needs to understand enough about the semantics of how data is stored in a source system to pose a question coherently. Is data stored in tables, a graph, or flat files? If tables, what do the rows and columns correspond to? What are the types of each column (integers, strings, text)? Does a column for a particular table refer to the same real-world entity as a column from a different table? Furthermore, where can I even find a dataset related to my question? I know there is a useful dataset in Spain. Is there also one in Belgium? Egypt? Japan? How do I discover what is out there and what I have access to? And how do I request permission to access something I do not currently have access to?
The goal of data virtualization (DV) is to eliminate or alleviate these other barriers. A DV System creates a central interface in which data can be accessed no matter where it is located, no matter how it is stored, and no matter how it is organized. The system does not physically move the data to a central location. Rather, the data exists there virtually. A user of the system is given the impression that all data is in one place, even though in reality it may be spread across the world. Furthermore, the user is presented with information about what datasets exist, how they are organized, and enough of the semantic details of each dataset to be able to formulate queries over them. The user can then issue commands that access any dataset virtualized by the system without needing to know any of the physical details regarding where data is located, which systems are being used to store it, and how the data is compressed or organized in storage.
The convenience of a properly functioning DV System is obvious. We will see in Chapter 2 that there are many challenges in building such systems, and indeed many DV Systems fall significantly short of the promise we have described. Nonetheless, when they work as intended (as we illustrate in Chapter 6), they are extremely powerful and can dramatically broaden the scope of a data analysis task while also accelerating the analysis process in a variety of ways, including by reducing the human effort involved in bringing together datasets for analysis.
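To make the promise concrete, here is a minimal sketch of what a client interaction with a DV System can look like. It assumes a Trino-based deployment and uses the open source trino Python client; the host, catalog, schema, and table names are all hypothetical:

# A hedged sketch: one query spans two remote sources through a single
# interface. Assumes a DV System (here, Trino) already configured with
# catalogs named "us_sales" and "uk_sales" (hypothetical names).
import trino

conn = trino.dbapi.connect(host="dv.example.com", port=8080, user="analyst")
cur = conn.cursor()

# The client never specifies where the data lives or how it is stored;
# the DV Engine resolves catalogs to the underlying source systems.
cur.execute("""
    SELECT 'US' AS region, count(*) AS laptop_sales
    FROM us_sales.public.txns WHERE product_name = 'laptop'
    UNION ALL
    SELECT 'UK', count(*) FROM uk_sales.public.txns
""")
print(cur.fetchall())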
A Quick Overview of Data Virtualization System Architecture
Figure 1-1 shows a high-level architecture of a DV System, the core software that implements data virtualization, which we'll discuss in detail in Chapter 3. The system itself is fairly lightweight—it does not store any data locally except for a cache of recently accessed data and query results (see Chapter 4). Instead, it is configured to access a set of underlying data sources that are virtualized by the system such that a client is given the impression that these data sources are local to the DV System, even though in reality they are separate systems that are potentially geographically dispersed and distant from the DV System. In Figure 1-1, three underlying data sources are shown, two of which are separate data management systems that may contain their own interfaces and query access capabilities for datasets stored within. The third is a data lake, which may store datasets as files in a distributed filesystem.
Figure 1-1. A high-level DV System architecture
The existence of these underlying data source systems is registered in a catalog that is viewable by clients of the DV System. The catalog usually also contains information regarding the semantics of the datasets stored in the underlying data systems—what exactly is the data stored in these datasets, how this data was collected, what real-world entities are referred to in the dataset, and how the datasets are organized—in tables, graphs, or hierarchical structures. Furthermore, the schema of these structures is defined. All of this information allows clients to be aware of what datasets exist, what important metadata these datasets have (including statistical information about the data contained within the datasets), and how to express coherent requests to access them.
The most complex part of a DV System is the data virtualization engine (DV Engine), which receives requests from clients (generated using the client interface) and performs whatever processing is required for these requests. This typically involves communication with the specific underlying data sources that contain data relevant to those requests. The DV Engine thus needs to know how to communicate with a variety of different types of systems that may store data that is being virtualized by the system. Furthermore, it may need to forward parts of client requests to these underlying source systems. Therefore, the engine needs to know how to properly express these requests such that the underlying data source system can perform them in a high-performing way and return the results in a manner that is consumable in a scalable fashion by the DV System. The DV Engine may also need to combine results received from multiple underlying data source systems involved in a client request.
In general, the goal of data virtualization is to allow clients to express requests over datasets without having to worry about the details of how the underlying data source systems store the source data. Yet most underlying data sources have unique interfaces that require expertise in that particular system before data can be accessed. Therefore, the DV Engine typically requires some translation on the fly from a global interface that is used by the client to access any underlying system into the particular interface used by specific underlying data sources. We will discuss in Chapter 2 how this process is complex and error prone, and cover approaches that modern systems use to reduce this complexity.
Figure 1-1 included a data lake as one of the underlying data sources. A data lake is a cluster of servers that combine to form a distributed filesystem inside which raw datasets (often very large in size) can be stored scalably at low cost. In the cloud era, data lakes have become popular alternatives to storing data in expensive data management systems, which implement many features that are unnecessary for use cases that primarily require simply storing and accessing raw data. Data lakes often contain datasets in the early stages of data pipelines, before they get finalized for specific applications. These raw or early-stage datasets may contain significant value for data analysis and data science tasks, and a major use case for DV Systems has emerged in which data in a data lake is virtualized by the DV System, and thus becomes accessible via the client interface that the DV System provides. Furthermore, we will see in future chapters that accessing raw data in a data lake is often much simpler and less error prone for a DV System relative to accessing data in more advanced data management systems. Therefore, many of the most successful deployments of data virtualization systems at the time of writing contain at least one data lake as an underlying data source system. As a result, this book contains a special focus on data lake use cases for data virtualization, in addition to discussing data virtualization more generally. In the next section, we give a more detailed introduction to data lakes.
Data Lakes
Before the emergence of data virtualization, the only way to access a variety of datasets for a particular analysis task was to physically bring them all together prior to the analysis. For decades, data warehouses were the primary solution for bringing data from across an organization into a central, unified location for subsequent analysis. Data was extracted from source data systems via an extract, transform, and load (ETL) process; integrated and stored inside dedicated data warehouse software such as Oracle Exadata, Teradata, Vertica, or Netezza products; and made available for data scientists and data analysts to perform detailed analyses. These data warehouses stored data in highly optimized storage formats in order to analyze it with high performance and throughput, so that analysts could experience fast responses even when analyzing very large datasets. However, due to the large amount of complexity in the software, these data warehouse solutions were expensive and charged by the size of data stored; therefore, they were often prohibitively expensive for storing truly large datasets, especially when it was unclear in advance whether these datasets would provide significant value to analysis tasks. Furthermore, data warehouses typically require up-front data cleaning and schema declaration, which may involve nontrivial human effort—effort that is wasted if the dataset ends up not being used for analysis.
Data lakes therefore emerged as much cheaper alternatives to storing large amounts of data in data warehouses. Typically built via relatively simple free and open source software, a home-built data lake's only costs of storing data were the hardware for the cluster of servers running the data lake software and the labor of the employees overseeing the deployment. Furthermore, data could be dumped into data lakes without up-front cleaning, integration, semantic modeling, and schema generation, thereby making data lakes an attractive place to store datasets whose value for analysis tasks has yet to be determined. Data lakes allowed for a "store first, organize later" approach.
Starting in approximately 2008, a system called Hadoop emerged as one of the first popular implementations of a data lake. Based on several important pieces of infrastructure used by Google to handle many of its big data applications (including its GFS distributed filesystem and its MapReduce data processing engine), Hadoop became almost synonymous with data lakes in the decade after its emergence.
Data lakes are known for several important characteristics that initially existed in the Hadoop ecosystem and have become core requirements of any modern data lake:
Horizontal scalability
Support for structured, semi-structured, and unstructured data
Open file formats
Support for schema on read
We discuss each of these characteristics in the following sections.
Horizontal Scalability
Hadoop incorporated an open source distributed filesystem called HDFS (Hadoop Distributed File System) that was based on the architecture of the Google File System (GFS), which itself was based on decades of research into scalable filesystems. The most prominent feature of HDFS is how well it scales to petabytes of data. Although metadata is stored on a single leader server, the main filesystem data is partitioned (and replicated) across a cluster of inexpensive commodity servers. Over time, as filesystem data continues to expand, HDFS gracefully expands to accommodate the new data by simply adding additional inexpensive commodity servers to the existing HDFS cluster. Although some amount of rebalancing (i.e., repartitioning) of data is necessary when new nodes are added in order to keep filesystem data reasonably balanced across all available servers—including the newly added servers—the overall process of horizontally scaling the cluster by adding new machines whenever the filesystem capacity needs to expand is straightforward for system administrators. Horizontal scalability proved to be a much more cost-effective way to scale to large datasets, compared to upgrading existing servers with additional storage and processing capability, since there is a limited amount of additional resources that can be added to a single server before these additions become prohibitively expensive.

By storing data on a cluster of cheap commodity servers, along with the filesystem software being free and open source, HDFS reduced the cost per byte of storage relative to other existing options by more than a factor of 10. This fundamentally altered the psychology of organizations when it came to keeping around datasets of questionable value. Whereas in previous times organizations were forced to justify the existence of any large dataset to ensure the high cost of keeping it around was worth it, the low-cost scalability of HDFS allowed organizations to keep datasets of questionable value with vague justifications that perhaps they would be useful in the future. This feature of scalable cheap storage is now a core requirement of all data lakes.
Support for Structured, Semi-Structured, and Unstructured Data
Another major paradigm shift introduced by Hadoop was a general reduction of constraints about how data must be structured in storage. Most data management systems at the time were designed for a particular type of data model—such as tables, graphs, or hierarchical data—and functioned with high performance only if someone stored data using the data model expected by that system. For example, many of the most advanced data management systems were designed for data to be stored in the relational model, in which data is organized in rows and columns inside tables. Although such systems were able to handle unstructured data (e.g., raw text, video, audio files) or semi-structured data (e.g., log files, Internet of Things device output) using various techniques, they were not optimized for such data, and performance degraded dramatically if large amounts of unstructured data were stored inside the system.
In contrast, Hadoop was designed for data to be stored as raw bytes in a distributed filesystem. Although it is possible to structure data within HDFS (we will discuss this shortly), the default expectation of the system is that data is unstructured, and its original data processing capabilities (such as MapReduce) were designed accordingly to handle unstructured data as input.1 This is because MapReduce was originally designed to serve Google's large-scale data processing needs, especially for tasks such as generating its indexes for use in its core search business. For example, counting and classifying words in a web page served as a prominent example in the original publication describing the MapReduce system. Since web pages generally consist of unstructured or semi-structured data, HDFS and MapReduce were both optimized to work with these data types.
This embrace of unstructured data again altered the psychology of organizations when it came to keeping around large raw datasets. Before Hadoop became popular, organizations were faced with a decision for every new raw dataset: either (a) spend a large amount of effort to clean and reorganize the data so that it could fit into the preferred data model of the data management system that the organization was using, or (b) discard the dataset. By providing a good solution for storing large unstructured and semi-structured datasets, Hadoop made it possible to keep these datasets around and store them in the data lake.
Open File Formats
So far, we have explained that Hadoop was originally designed to work over raw files in a filesystem rather than the highly optimized compressed structured data formats typically found in relational database systems. Nonetheless, Hadoop's strong ability to cost-effectively store and process data at scale naturally led to it being used for storing highly structured data as well. Over time, projects such as Hive and HadoopDB emerged that enabled the use of Hadoop for processing structured data.2, 3, 4 Once Hadoop started being used for structured data use cases, several open file formats (such as Parquet and ORC) emerged that were designed to store data in optimized formats for these use cases—for example, storing tables column by column rather than row by row. This enabled structured data to be stored in HDFS (a filesystem that is oblivious to whether it is storing structured or unstructured data) while still providing the performance benefits at the storage level that relational database systems use to optimize performance, such as lightweight compression formats that can be operated on directly without decompression, byte-offset attribute storage, index-oriented layout, array storage that is amenable to SIMD parallel operations, and many other well-known optimizations.
Some important features of these file formats include:
Columnar storage format, in which each column is stored separately. This allows for better compression and more efficient data retrieval. This format is particularly well suited to analytical workloads where only a subset of the columns in a table are accessed by any particular query, and time does not have to be wasted scanning data in those columns that are not accessed.

Partitioning of files based on specific attributes or columns. Instead of storing all data in a single large file, the data is divided into smaller files based on the values of one or more attributes of the data. Partitioning data can significantly improve analysis performance by enabling data pruning, in which partitions that don't contain relevant data for a particular task can be skipped. This significantly reduces the amount of data that needs to be scanned during analysis. For example, if the data is partitioned by date, three years of data can be divided into 1,095 partitions (one for each day over those three years). A particular analysis task that only involves analyzing data from a particular week within that three-year period can directly read the seven partitions corresponding to the days in that week instead of scanning through the entire dataset to find the relevant data. Directly scanning the relevant partitions results in having to read less than 1% of the total data volume (see the code sketch following this list). A separate advantage of data partitioning is that it makes it simpler to parallelize data processing tasks, because different partitions can be processed concurrently.

Inclusion of metadata that describes the structure of the data in the file and how it is partitioned. This metadata helps data processing engines understand how data is organized within the files.

Storage of data using open standards defined by a particular file format. By clearly defining how data is stored using that format, data processing systems can be developed that directly support these open formats as inputs. This allows the same dataset to be directly used as input to various big data frameworks and tools, including Trino, Apache Spark, and Apache Hive, without having to be converted on the fly to different internal data structures used by those systems. This means that data only needs to be stored once and can be accessed by many engines.
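The following is a minimal sketch of partitioned, columnar storage in the open Parquet format, using the open source pyarrow library. The table, column names, and values are purely illustrative:

# A hedged sketch of columnar storage plus partition pruning with Parquet.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "sale_date": ["2021-01-04", "2021-01-04", "2021-01-05"],
    "product_name": ["laptop", "monitor", "laptop"],
    "price": [1200.0, 300.0, 1150.0],
})

# Write one directory per sale_date value; within each file, columns are
# stored separately (columnar layout).
pq.write_to_dataset(table, root_path="sales", partition_cols=["sale_date"])

# Read back only what is needed: the filter prunes whole partitions, and
# the column list avoids scanning bytes for columns that are not accessed.
subset = pq.read_table(
    "sales",
    columns=["product_name", "price"],
    filters=[("sale_date", "=", "2021-01-05")],
)
print(subset.to_pydict())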
Support for Schema on Read
Some datasets—especially those generated from a variety of heterogeneous sources—may not contain a consistent schema. For example, in a dataset containing information about customers of a business, some entries may contain the customer name while others do not; some entries may contain an address or phone number while others do not; some entries may contain an age or other demographic information while others do not. In traditional structured systems, a rigorous process is often required to define a single, unified schema for a particular dataset. The data is then cleaned and transformed to abide by this schema. Any information that is missing (name, address, age, etc.) must be explicitly noted for a particular entry (e.g., by including the special null value for missing data). This entire cleaning and transformation process happens prior to loading data into the system, and the process is referred to by the acronym ETL.
Since data lakes allow data to be stored in a raw format, new datasets can be stored immediately, prior to any kind of transformation. This enables unclean data with irregular schemas and missing attributes to be loaded into the system as is, without being forced to conform to a global schema. Furthermore, this type of data can be made available for data processing via a concept called schema on read, in which the schema for each entry is defined separately for that entry and read alongside the entry itself. When an application or data processing tool reads the raw data, it reads the schema information as well and processes the data on the fly according to the defined schema for each individual entry. This is a significant departure from the more traditional schema on write concept, which requires data to be formatted, modeled, and integrated as it is written to storage, before consumption and usage.
A data lake thus enables an ELT paradigm (as opposed to the ETL paradigm previously defined), where data is first extracted and loaded into the data lake storage prior to any cleaning or transformation. At first, whenever it is accessed by data processing tools, the schema on read technique is used to process the data. Over time, data can be cleaned and transformed into a unified schema for higher quality and performance. But this process only needs to happen when necessary: only when the human and computing resources required to perform this transformation are available, and only when the value of the data after transformation increases by enough to justify the cost of the transformation. Until this point, the data remains available via the schema on read paradigm. This reduces the time-to-value for new datasets, since they become immediately available for analysis and only increase in value over time as the data is improved iteratively on demand.
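To illustrate, here is a minimal sketch of schema on read over heterogeneous raw records; the records and field names are invented for the example:

# A hedged sketch of schema on read: each raw record is interpreted at
# read time, and missing attributes are resolved on the fly rather than
# being enforced by a global schema at write time.
import json

raw_lines = [
    '{"name": "Ada Lopez", "age": 34, "city": "Austin"}',
    '{"name": "Ben Kim", "phone": "555-0100"}',
    '{"customer": "Cy Wu", "age": 51}',
]

for line in raw_lines:
    record = json.loads(line)
    # Reconcile irregular schemas as the data is read: fall back across
    # alternative field names and treat anything absent as null (None).
    name = record.get("name") or record.get("customer")
    age = record.get("age")  # None when the entry lacks an age
    print(name, age)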
The Cloud Era
All of the leading cloud vendors offer some sort of data lake solution (often more than one). In some cases, these solutions are based directly on Hadoop, where Hadoop software runs under the covers, and the cloud solution takes over the significant burden of managing and running Hadoop environments. In other cases, these solutions are built using the vendor's own software developed in-house, while adhering to the principles we discussed previously.
One notable addition that cloud vendors have made to the data lake paradigm beyond what we previously discussed is the introduction of object storage. Object storage is designed to accommodate unstructured data by organizing it into discrete entities known as objects. These objects are then housed within a simplified, nonhierarchical data landscape. Each object encapsulates the data, associated metadata, and a unique identifier, facilitating straightforward access and retrieval for applications via an API. For example, Amazon's object storage service, S3 (Simple Storage Service), exposes a REST-based API for accessing data.
Both object storage and the more traditional file-based storage can be foundational components of data lakes; however, while file-based storage is designed and mostly employed for analytics use cases, object storage has a much broader applicability and is used as a location for storing data for both operational and analytical purposes.
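As a concrete illustration, the following minimal sketch retrieves one object through the S3 API using the boto3 library; the bucket and key names are hypothetical:

# A hedged sketch of object storage access via the S3 API (boto3 library).
import boto3

s3 = boto3.client("s3")

# Each object bundles its data, metadata, and a unique key in a flat
# namespace; a single GET retrieves it, wherever it physically resides.
response = s3.get_object(Bucket="acme-data-lake", Key="raw/sales/2021-01-05.json")
payload = response["Body"].read()   # the raw bytes of the object
metadata = response["Metadata"]     # user-defined metadata stored with it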
Data Virtualization Over Data Lakes
In the next chapter we will discuss some major challenges in building functional, high-performance DV Systems, and some solutions that have emerged to overcome these challenges. During the course of that discussion (along with additional discussion in the following chapters), it will become clear why data lakes are an excellent use case for virtualization systems. This does not mean that data virtualization should not be used for data outside data lakes—modern DV Systems are able to process data located anywhere with high performance. Rather, the existence of data in data lakes is pushing forward demand for DV Systems while at the same time eliminating some of the more difficult challenges in deploying such systems.
Chapter 2 Recent Technology Developments Driving the Rebirth of Data Virtualization
For decades, both well-known research projects in academia and small and large software vendors in industry have sold the promise of data virtualization. In almost every case, these projects fell short of the promise, stigmatizing the industry associated with the terms data virtualization and data federation.
It's not that they could not do what they set out to do—every project produced an interface that could query data from various sources. The problem was that each solution came with side effects that undermined the utility of the solution. Some examples of such side effects are:
The performance is too slow
The solution is too hard to use
The solution is too narrow because it only works for specific types of data or underlying data storage solutions
In this chapter, we will explore the fundamentals of data virtualization: what makes it hard, and what are the challenges in building such systems? Why exactly have many systems fallen short of their promise?
Then we will discuss some recent technology developments that offer an optimistic outlook that data virtualization may succeed in areas where it has previously failed. These technology developments are independent of any one software vendor. Rather, every vendor in the data virtualization space can take advantage of these developments.
Definitions
Throughout this book, we will refer to the software that implements a data virtualization solution as a DV System, and to the system's query processing engine that does processing work in response to a request from a client as a DV Engine. A DV System generally contains other components aside from the DV Engine (such as a client interface, connectors to data sources, and caching—see Chapters 3 and 4), but the engine is the largest and most important component and will be the focus of this chapter (along with the next one). Any data storage system that contains data being virtualized by the DV System is called an underlying data source or a data source system.
Five Challenges of Data Virtualization
In the following, we will use a hypothetical example to understand some of the challenges in building a DV System. In this example, an international retail company has separate datasets for business transacted in different locations. The division of the business in the United States sells many different products and contains a dataset describing sales and products in the region. This dataset includes several tables, two of which are shown on the left side of Figure 2-1: one describing each product available for sale, and one describing each sale made.
Figure 2-1. Example of an international retail company and its datasets (the US dataset is called "Data Source 1" and the UK dataset is called "Data Source 2"; the UK division of the business sells only laptops)
Figure 2-1 shows only the data needed for our discussion. In practice, these tables will usually contain many other columns that are not shown in the figure, such as quantity of items sold, location of sale, etc. Furthermore, in practice there will usually be other tables, such as information on customers, orders, sales venues, parts, warranties, etc. Nonetheless, the small amount of data shown in the figure is enough for our discussion of the complexity of data virtualization.
Assume that a data analyst wants to analyze laptop sales across all units within the business. The traditional way to do this is to bring all data across all units together inside a single system (such as a data warehouse)—either in advance of the query or at query time—and perform the analysis there. Unfortunately, there is typically an enormous amount of work involved in bringing all the data together, ranging from the bureaucratic to the technological. For example, most data warehousing solutions increase in cost as the data stored in the warehouse increases. Therefore, several levels of approval are usually needed before additional data can be added to it. There may be legal questions around moving data out of the current location, and privacy issues when bringing data together. Furthermore, access permissions, data life cycle management, and other data governance issues must be worked out before data can be moved. All of these contribute to the bureaucratic work of data movement. On the technical side, network bottlenecks, data warehouse scalability limitations, and schema/catalog management problems may all be encountered when moving data. The promise of data virtualization is to make most of this work unnecessary—you can leave data where it is and query what is needed from a unified interface.
Unfortunately, even for data as simple as that shown in Figure 2-1, there are nontrivial aspects to querying this data as is from a unified interface. First, there are some semantic differences across the datasets. The US dataset contains sales information for a variety of products, whereas the UK dataset contains information only on laptop sales. A human would be able to figure out how to combine the data from the different datasets, but this would be hard for a computer, since combining the datasets requires understanding that the data in one table (the Laptop value in the Product Name column) matches up with the metadata in the other dataset. The only way to know that the UK dataset deals with laptops is to look at the table name—this is metadata rather than data. Furthermore, a subject matter expert would still need to confirm that a laptop in the US refers to the same semantic entity as a laptop in the UK. Additionally, the same Price column name exists in both datasets, but the two columns are not immediately combinable in an analysis, since one is in US dollars and the other is in British pounds.
A separate challenge involved in querying this dataset via data virtualization occurs if the datasets are spread out across the world. In our example, the US dataset is likely located somewhere in the US, and the UK dataset may be located somewhere in the UK. Performing an analysis across both datasets involves communication over a network across large distances. This network communication significantly slows down the performance of the analysis. If the data is brought together in one place instead of using data virtualization, these cross-region network communication costs could be avoided.
A third challenge that frequently arises is that different datasets may be stored in different systems. For example, one dataset may be stored in MySQL and the other in Oracle, Hadoop, MongoDB, or one of the hundreds of other data management options available today. Each system typically has a different interface with slightly different APIs for extracting data from that system. The DV System needs to understand how to communicate with all the different underlying source systems that may contain data relevant to an analytical task.
Going back to our example, the US dataset may be stored in an Oracle database management system, while the UK dataset may be stored in PostgreSQL. Even if there were no semantic differences between the datasets whatsoever, and even if both datasets happened to be stored in the same part of the world, the fact that separate software systems manage them adds a surprising amount of complexity to the DV Engine. Although both Oracle and PostgreSQL provide an SQL interface, there are subtle differences in how queries are formulated for each system.
These differences exist even for simple queries that do not require joining tables or any advanced operations. A simple query that extracts a count of laptop sales per day during a recent one-week time period may be expressed using SQL over an Oracle system:
SELECT day, count(*)
FROM txns
WHERE productName = 'laptop'
  AND sale_date >= next_day(trunc(sysdate), 'MONDAY') - 14
  AND sale_date < next_day(trunc(sysdate), 'MONDAY') - 7
GROUP BY day;
Yet the same query over an identical schema in PostgreSQL may look quite different:
SELECT day, count(*)
FROM txns
WHERE productName = 'laptop'
  AND sale_date BETWEEN
    NOW()::DATE - EXTRACT(DOW FROM NOW())::INTEGER - 7 AND
    NOW()::DATE - EXTRACT(DOW FROM NOW())::INTEGER
GROUP BY day;
A DV System must formulate the query using the correct syntax for each underlying source data system that may contain relevant data for a query. The problem is that while most systems have an API that is based on one of the various versions of the SQL standard, they nonetheless typically implement their own system-specific dialect that differs from standard SQL in several nuances. Furthermore, most systems include system-specific extensions that are extremely useful and are designed to draw the end user into a heavy reliance on these extensions, thereby making it more difficult to switch to a different database system at a later point.
Building a data virtualization system that can function over different types of underlying data management systems becomes an exercise in speaking many different languages and dialects, and in developing expertise in the semantic differences that exist across dialects. In practice, it is infeasible to develop a sufficient amount of expertise in every possible underlying system. Therefore, data virtualization systems have historically been forced to pick and choose which underlying systems to support, thereby limiting their applicability.
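One way to mechanize part of this translation burden is illustrated by the open source sqlglot library, which transpiles a query between SQL dialects. This is a rough, hedged analogy only; no particular DV Engine is implemented this way, and real engines ship their own translation layers:

# A hedged sketch of mechanical dialect translation using sqlglot.
import sqlglot

query = "SELECT day, count(*) FROM txns WHERE productName = 'laptop' GROUP BY day"

# Render the same logical query in the dialect each source system expects.
for dialect in ("oracle", "postgres", "hive"):
    print(dialect, "=>", sqlglot.transpile(query, write=dialect)[0])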
A fourth challenge is that query optimization is very hard to perform correctly in data virtualization systems. Chapter 3 will go into detail on how query optimization works and what makes it so hard in the context of data virtualization systems. In short: query optimization needs to estimate the cost of the different options for performing a query. When the same system that is performing the optimization is also performing the query processing, the query optimizer will be able to generate accurate estimates for the costs of different options. However, in data virtualization systems, the processing is often performed by an entirely different system (an underlying data source system). Therefore, the optimizer in the DV Engine typically does not have detailed information about all the different operator algorithms available to the underlying system, nor (typically) all the detailed statistics about the stored data that are also very important for estimating the costs of different options. Furthermore, the underlying system has its own query optimizer, so when the underlying system chooses a different execution plan than the DV Engine expects, the engine's cost estimates can be far off. This is especially common if the underlying system uses unpublished, proprietary techniques that the engine is unable to account for.
A fifth challenge in building the DV Engine is the problem of joining tables that exist in different underlying systems. See Figure 2-2 for three basic approaches to performing such joins.

The first option extracts data into the DV Engine and performs the join there. The second option chooses one of the underlying data stores to perform the join and orchestrates the movement of all relevant input to the join (that is not already there) to that location. A third option is a "semi-join" approach that involves performing parts of the join at both data stores. The semi-join algorithm is described more deeply in an academic paper by Bernstein and Chiu.1
Figure 2-2. Performing joins
The first option requires sending raw data from both tables over the network from the data sources to the DV Engine. In the past, network communication was often a bottleneck in distributed query processing, so sending both tables over the network was not really an option. The other two options are more network efficient, so one of them was usually chosen. However, they both require data from one data source to be loaded into (or at least accessed by) the other data source. Such operations typically work differently for different systems, and thus require more system-specific code.
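To make the first option concrete, here is a minimal sketch of a DV Engine pulling rows from two sources and joining them itself. Two in-memory SQLite databases stand in for the remote systems, and all names and values are illustrative:

# A hedged sketch of join option 1: pull raw rows from both sources and
# join locally inside the engine. SQLite stands in for remote sources.
import sqlite3

us = sqlite3.connect(":memory:")
us.execute("CREATE TABLE txns (product_id INTEGER, day TEXT)")
us.executemany("INSERT INTO txns VALUES (?, ?)",
               [(1, "2021-01-05"), (2, "2021-01-05"), (1, "2021-01-06")])

uk = sqlite3.connect(":memory:")
uk.execute("CREATE TABLE products (product_id INTEGER, name TEXT)")
uk.executemany("INSERT INTO products VALUES (?, ?)",
               [(1, "laptop"), (2, "monitor")])

# Step 1: extract raw data from each source over the "network".
sales = us.execute("SELECT product_id, day FROM txns").fetchall()
names = dict(uk.execute("SELECT product_id, name FROM products").fetchall())

# Step 2: perform the join inside the DV Engine (a simple hash join).
joined = [(day, names[pid]) for pid, day in sales if pid in names]
print(joined)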
The Death and Rebirth of Data Virtualization
In the previous section, we listed five challenges of data virtualization:
Semantic and syntactic differences across datasets

Performance problems introduced by the need to communicate with the systems hosting the source data, wherever they may be

General applicability problems caused by the need for the data virtualization system to speak multiple SQL dialects and to be able to use the API of multiple underlying systems in data federation use cases

Performance problems caused by the additional complexities of query optimization in data virtualization environments

Code complexity issues caused by the need to support joins across data sources
When all data relevant to a task is already in one place (e.g., in a data warehouse), these challenges either disappear or are alleviated:
The process of bringing all data together (e.g., ETL/ELT) into one location usually includes a data cleaning and integration effort to reduce semantic differences across datasets

There is no longer a need to communicate across remote systems at query time

Knowledge of only one SQL dialect is required

There is only one system to cost estimate and optimize

All joins are performed within the same system
All of this is why, for many decades, the industry best practice was to bring all data into one location prior to analysis. Data virtualization never fizzled out, as there were certain use cases for which it was an ideal fit. But it remained a niche technology, known for unfulfilled promise rather than successful use cases.
Technology Trends Driving the Rebirth of Data Virtualization
Recently, data virtualization has been going through a renaissance. This is driven by several technology trends that we will now take a closer look at.
Data lakes
One of the biggest trends driving data virtualization is the proliferation of data lakes. As described in Chapter 1, data lakes are characterized by extremely cheap data storage, open data formats, and direct access to raw data. Data lakes are becoming increasingly popular as organizations realize that all data available to them has potential value—even unclean, ungoverned, unintegrated, or unprocessed data. They therefore want to make all data available for internal analysis to any unit that may benefit from it. Putting all this data into traditional data warehouses or other single-system data management software is too expensive—both from a time perspective and from a cost perspective. Since only a subset of the data will actually end up bringing real value to the organization, it is often preferable to at least initially dump the raw data into cheap storage instead of paying the up-front costs of loading it into more traditional data management systems.
Raw data is a much better fit for data virtualization than data sitting inside complex data management systems. The problems of differing SQL dialects, complex query optimization, and external join processing do not generally exist in this context. This is because raw data formats are designed to be consumed by external systems, with the data passed as is and little to no query processing done by the storage system. Since the storage system does not perform full query processing, there is no need for it to perform a separate query optimization task. Furthermore, instead of an SQL interface, the raw data is often exposed via a simple API through which the data can be extracted. Alternatively, the data is exposed as raw files that can be passed directly to the external system. When the external system is a DV Engine, it simply extracts the raw data and performs data processing within the engine—wherever that engine is running (either on the lake or on a cluster nearby).
One disadvantage of consuming raw data is that unprocessed data is typically larger than processed data. Take, for example, a company that wants to analyze which customers bring in the most revenue. The raw data for this analysis likely includes all transaction history for all customers that do business with the company. Each processing stage aggregates and summarizes this raw data until a final result set of high-revenue customers is produced. Thus, the data gets progressively smaller as it goes through successive processing stages. If the DV Engine needs to work with the raw data, it must pull a larger amount of information from the underlying system than if it works with data produced by later stages of query processing.
When the raw data is in the terabytes or petabytes, a single machine cannot perform the data processing by itself with high performance. Instead, many machines must work together to process the data in parallel. Thus, modern DV Engines often need to support parallel query processing across a cluster of computers in order to handle the increased scale (we will discuss this in more detail in Chapter 5). Though building a parallel query processor is more complex than building a single-machine query processor, this code is typically easier to maintain than the alternative of pushing more work down to the underlying systems, since query processing is a core internal task that is not dependent on external systems. In contrast, pushing down query processing to underlying systems comes with much higher maintenance costs, since the code needs to be updated as the APIs and processing features of the underlying systems change. Therefore, as raw, lightweight data formats are increasingly adopted, the potential for data virtualization systems to provide value increases (since they can process the data at high speeds in parallel), while the cost (in terms of software code creation and maintenance) of providing this value decreases. This has resulted in a multiplicative effect on the applicability of data virtualization systems.
The cloud
We briefly mentioned in the previous section that raw data is typically larger than processed data. Going back to our example from Figure 2-1, let's say that a query is submitted to a DV System that requires calculating the sum of all products sold on January 5 in the US dataset. Assume that the system is running on a different machine than the underlying data source system that stores the raw US dataset. There are two clear choices for how to perform this query. One option is to transfer the raw dataset to the DV System and perform the query there. The other option is to send the query to the underlying data source system and perform the query there (assuming it has processing capabilities), and only send the result of the query to the DV System. Assume that the subset of the raw dataset that has a date of January 5 has a total size of 5 GB. If so, the first option needs to ship 5 GB over the network. Meanwhile, the query itself (the request to calculate the sum) along with the result set (a single integer in our example query) likely consists of less than a kilobyte. If so, option 1 sends millions of times more data over the network than option 2. Although not all queries will have such extreme differences between option 1 and option 2, option 2 almost always sends less data over the network than option 1 since, as we described in the previous section, most data processing operations summarize, aggregate, and/or filter data as it is processed. Therefore, historically, database research has assumed that when given a choice between sending data to a query or sending a query to the data it must process, the latter option is preferable.
When networks were the bottleneck in query processing performance, it was clearly a no-brainer that the query should be sent to the data and not vice versa. Even now, when networks are less of a bottleneck, it would seem hard to justify unnecessarily sending gigabytes or terabytes of data over the network. Nonetheless, with the emergence of the cloud, that which was previously hard to justify has become routine.
In the cloud, one pays for exactly the amount of resources one uses. Processing and storage are treated separately: processing units can be scaled up or down independent of storage units. Although processing units typically come with attached local storage, this local storage is ephemeral and disappears if the processing unit is shut down. The standard practice is to treat processing units as disposable resources—spinning them up as needed and shutting them down when they are idle in order to avoid paying for the idle time. Therefore, long-term data is generally not stored on local ephemeral storage, since the processing unit may get shut down at any instant. Rather, this data is stored across the local network on storage units that are generally not designed to perform query processing. Thus, even when the data being processed is billions of times larger than the query that processes it, data is still sent across the network to the query processing nodes and not vice versa. This is a direct result of the trend to disaggregate storage and compute, made possible by the cloud and fast, high-throughput local networks.
Disaggregating storage and compute means that raw data always gets sent over the network to get to the compute nodes. It makes no difference whether the compute nodes are part of the DV Engine or part of the local underlying data store that controls the data—either way, data needs to be transferred. If so, the primary reason to push queries to data instead of pushing data to queries disappears. There is no need for the underlying data store to perform the query; instead, the DV Engine can perform the query.
When this is the case, most of the challenges of data virtualization that we discussed previously are either eliminated or alleviated. For example, the third challenge was that the DV Engine needs deep knowledge of the API of all source data systems that it works with in order to push queries to those systems. If instead raw data is pulled into the DV Engine, then the engine requires a much shallower understanding of the API of the source system—just enough to know how to pull raw data out of it. The fourth challenge was that query optimization is really hard in DV Systems because the system has to reason about the cost of performing processing operations in external systems. If it performs all processing itself, then the query optimization problem in DV Systems becomes close to equivalent to standard query optimization. The fifth challenge was how to perform joins. Option 1 from Figure 2-2 is clearly the simplest option, but it was rejected because it required pushing data to the query. But if we are pushing data to the query anyway in the cloud, that choice no longer needs to be rejected. Thus, the paradigm of disaggregating storage and compute is a game changer for data virtualization.
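The following minimal sketch shows this pull-based pattern over disaggregated storage, using pyarrow's dataset API against a hypothetical S3 path (it assumes a pyarrow build with S3 support, and all names are invented):

# A hedged sketch of the disaggregated pattern: compute pulls raw columnar
# files from object storage and performs all query processing locally.
import pyarrow.dataset as ds
import pyarrow.compute as pc

# The storage layer just serves bytes; nothing beyond the file and
# row-group skipping enabled by the Parquet format is pushed down to it.
sales = ds.dataset("s3://acme-data-lake/sales/", format="parquet")

# Filtering and aggregation run on the compute side (the DV Engine).
table = sales.to_table(columns=["price"],
                       filter=ds.field("sale_date") == "2021-01-05")
print(pc.sum(table.column("price")).as_py())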
Fast networks
Early distributed systems were often bottlenecked by the network. Accessing local data was faster than accessing data across a network, especially at scale, when data consumption exceeded the maximum throughput of the network. This was a major disadvantage for the leave data where it is approach as opposed to the bring all data together approach. When all data is brought together, a onetime network cost is paid to transfer all the data to one location. After this initial cost is paid, no further (external) network communication is necessary to process queries over that data. Meanwhile, the leave data where it is approach continuously pays a network cost at query time whenever a query needs to access data at more than one location.
Over time, networks have gotten much faster. At 10 Gbps (10 billion bits per second) and beyond, data can be accessed across a network at rates similar to how fast it can be processed. Furthermore, remote direct memory access (RDMA) technology allows one machine to access data in the memory of a different machine across the network at speeds similar to accessing its own memory. As network communication costs decrease, the leave data where it is approach becomes much more viable.
Cheap memory
Cheap memory enabled the trend of disaggregating storage from compute, since the memory can be used to cache, on the compute nodes, large amounts of data that were brought there from a previous compute task. This allows the system to avoid resending the data over the network from the storage nodes to the compute nodes in the various scenarios where processing queries over recently cached data is acceptable. We will discuss data caching in more detail in Chapter 4.
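As a rough illustration, a DV Engine's cache can be as simple as memoizing fetches keyed by source and query; real systems must also bound staleness and invalidate entries when source data changes. The connector function below is a hypothetical stand-in:

# A hedged sketch of result caching: keep recently fetched remote results
# in local memory so repeated requests skip the network round trip.
from functools import lru_cache

def fetch_from_source(source, query):
    # Hypothetical connector: a real implementation would contact `source`
    # over the network and execute `query` there (or pull raw data).
    return []

@lru_cache(maxsize=1024)  # evicts least-recently-used entries when full
def cached_fetch(source, query):
    return tuple(fetch_from_source(source, query))  # tuples are cacheable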
Standardization

When discussing data lakes, we stated that it is easier for a DV System to interact with the raw data formats used in data lakes than with more complex underlying data management systems. The overall complexity is further reduced by the consolidation of widely used raw data formats into a handful of popular formats such as Parquet, ORC, and Avro. This consolidation is driven by the open source and proprietary systems that are developed to consume data in these formats. Since these systems generally only support a handful of formats, there is a push to publish data in formats that are widely supported across these consuming systems.
Machine learning
None of the trends we have discussed so far help with the first challenge of data virtualization that we described: the existence of semantic differences across datasets. The tremendous advancements in machine learning algorithms that have been made in the past few years have major potential to alleviate this challenge. Machine learning can automate some of the discovery of semantic differences across datasets, along with generating the code that can automatically convert data from one set of semantics to another in order to properly integrate it with a different dataset.2, 3, 4 This information is generated by learning from how humans write queries across these same datasets or other similar datasets made available to the machine learning algorithm. Although we remain far away from fully automating the data integration process, the ability of machine learning algorithms to automate a subset of the required tasks significantly decreases the complexity of successfully deploying a data virtualization system.
Data Virtualization and Mainstream Adoption
Not all of the challenges of implementing DV Systems have gone away. Dealing with the differing data semantics of the underlying systems is still nontrivial. Cross-region and cross-cloud communication remains expensive, and DV Systems must be cognizant of these costs when implementing query processing engines.
Nonetheless, it is a whole lot easier to implement data virtualization today than ever before. The emergence of data lakes, the cloud, and fast networks has made it normal to send large amounts of data over the network during query processing—even in nonvirtualized systems. DV Engines are now taking advantage of this to avoid the complex task of pushing down processing to different types of storage systems, instead pulling unprocessed data out of storage for local processing inside the DV Engine. Of course, the data extraction needs to be somewhat intelligent—the underlying system should do some amount of simple data filtering. But nonetheless, the overall cost of implementing a functional DV System is continuously decreasing over time, and end user query performance continues to improve. Thus, what was once a niche technology is now becoming mainstream.
Chapter 3 How Data Virtualization Systems Work
In this chapter, we give a brief tutorial on data virtualization systems: how they are architected, how data flows through the system in response to a request, and generally how they work. We will specifically focus on the query processing engine within the system, which we will call the DV Engine throughout this chapter. We do not expect a reader to be able to build a DV Engine after reading the chapter—such an effort requires years of training in advanced systems engineering, taught at places such as the University of Maryland (the home institution of one of the authors of this book), along with real-world experience working on existing complex systems.
Rather, our goal is to give the reader an overview of how such systems are built, arming users with knowledge so they will be able to avoid issues that come up during the system selection and deployment process. We start with fundamental architectural principles and then continue with more advanced topics in the next chapter.
There are trade-offs involved in designing a DV Engine. Existing engines choose particular points in the trade-off space depending on how they expect the DV System to be used. If one uses the system in a different way than it was designed for, poor performance and other practical problems will often result. Therefore, it is important to be aware of the trade-offs that exist, the assumptions made, and the general design of the engine. This awareness results in better experiences using DV Systems and faster (and more complete) resolution of problems that may come up.
The most important trade-off that we will discuss in this chapter is the high-level difference between push-based DV Engines and pull-based DV Engines. As described in Chapter 2, the first generation of DV Engines (almost) uniformly used a push-based approach, in which most of the query processing work was done at the data source prior to reaching the engine. More recently, pull-based systems, in which raw data is sent from the data source for processing at the engine, have been gaining support. Nonetheless, push-based systems still exist today and, for certain use cases, are the better approach. Moreover, in Chapter 8 we will make the case that the industry has over-rotated in its embrace of the pull-based approach, and that there is room for a middle ground. It is important for a DV System user to be aware of which approach the DV Engine is using, in order to diagnose performance issues that may occur under either approach.
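The contrast between the two approaches can be sketched in a few lines; the source object and its execute and scan methods below are hypothetical stand-ins for a connector to one underlying system:

# A hedged sketch contrasting push-based and pull-based processing for one
# simple aggregate. `source` is a hypothetical connector object.

def push_based_count(source):
    # Translate the request into the source's dialect and let the source do
    # the work; only a tiny result crosses the network.
    return source.execute(
        "SELECT count(*) FROM txns WHERE productName = 'laptop'")

def pull_based_count(source):
    # Fetch raw rows and process them inside the DV Engine; the source only
    # needs to support a simple scan, but more bytes cross the network.
    rows = source.scan("txns", columns=["productName"])
    return sum(1 for row in rows if row["productName"] == "laptop")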
The Basic Architecture of Data Virtualization
In this section, we discuss the high-level architecture of DV Engines. We divide our discussion into three types of systems: push-based engines, pull-based engines, and hybrid engines. Our discussion will focus on system components that are shared by most DV Engines; in Chapter 4 we will explore some advanced system components that may not be available in all systems.
Push-Based DV Engines
We will start by discussing the architecture of push-based engines since the architecture itself is a little simpler (albeit more brittle and complex to implement in practice). Figure 3-1 gives a high-level overview of the architecture of push-based DV Engines.
Figure 3-1 Push-based DV Engine architecture
Figure 3-1 includes two data sources. These are data systems or platforms inside which data that is accessible from the DV System is stored. For example, Data Source 1 may be a traditional database management system (DBMS), such as MySQL or Oracle; a multiserver database system, such as Teradata or Vertica; or a cloud database system, such as the cloud versions of all the previously mentioned systems, or Snowflake or Redshift. Alternatively, these data sources may be simple filesystems that store data in files (optionally using open data formats such as Parquet or Avro and open table formats such as Iceberg) on local or distributed filesystems.1
The DV Engine communicates with these data sources using their APIs. Many data sources include interfaces based on SQL for both reading and writing data at the source. Others provide APIs in which data is transferred in bulk via filesystem operations. Still others have more advanced APIs depending on the data, including NoSQL, graph, hierarchical, and other types of data access languages. In general, the DV Engine must be able to communicate with each possible underlying data source using the preferred language of that underlying system. Although Figure 3-1 only shows two data sources, in practice there could be dozens (or even hundreds) of sources that feed data into a particular DV System.
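To make the connector idea concrete, the following sketch (in Python, with hypothetical class and method names; real DV Engines each define their own connector interface) shows one way a DV Engine might hide per-source API differences behind a uniform extraction interface:

from abc import ABC, abstractmethod
from typing import Iterator

class Connector(ABC):
    # One connector per data source type; the engine talks to every
    # source through this uniform interface.
    @abstractmethod
    def extract(self, dataset: str, columns: list[str]) -> Iterator[tuple]:
        """Stream rows of the named dataset back to the engine."""

class SQLConnector(Connector):
    # Wraps a SQL-speaking source (e.g., MySQL, Oracle, Snowflake).
    def __init__(self, dsn: str):
        self.dsn = dsn
    def extract(self, dataset, columns):
        sql = f"SELECT {', '.join(columns)} FROM {dataset}"
        # ...submit sql over the source's SQL API and yield result rows...
        yield from ()

class ParquetConnector(Connector):
    # Wraps raw files in a data lake; no query capability at the source.
    def __init__(self, base_path: str):
        self.base_path = base_path
    def extract(self, dataset, columns):
        # ...read Parquet files under base_path/dataset, projecting only
        # the requested columns...
        yield from ()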
The information about what data sources exist (i.e., which data sources are configured to be connected to the DV System) is stored within the Catalog of the DV Engine. The Catalog usually includes a description of how data is modeled within each data source (e.g., the schema of the data: the names of the tables and the names and types of the attributes of those tables); high-level statistical information about the data within the data sources; and authorization information indicating who is allowed to access which data sources (and which data within those data sources).
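As a rough illustration, a Catalog entry for a single data source might hold metadata along the following lines (a minimal Python sketch; the field names are ours, not those of any particular product):

from dataclasses import dataclass, field

@dataclass
class ColumnMeta:
    name: str
    dtype: str                      # e.g., "INTEGER", "VARCHAR"

@dataclass
class TableMeta:
    name: str
    columns: list[ColumnMeta]
    row_count: int | None = None    # high-level statistic; may be stale

@dataclass
class DataSourceEntry:
    source_id: str                            # e.g., "mysql-eu-1"
    api: str                                  # e.g., "sql", "filesystem"
    tables: dict[str, TableMeta] = field(default_factory=dict)
    authorized_users: set[str] = field(default_factory=set)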
The DV Engine makes a subset of the Catalog viewable to external users of the DV System. This allows the users to be aware of what data is available to be accessed and how to formulate requests that access the data.
For example, Figures 3-2 and 3-3 show the view of the Catalog that is visible to external users in an example data virtualization system. Figure 3-2 shows a list of the different regions that contain data being virtualized. Each region can be expanded to show the list of datasets available at that region. Figure 3-3 shows that these datasets can be further expanded to show the known information about the schema of each dataset, such as a list of tables and the names and types of those tables' attributes.
Figure 3-2 A view of the catalog in an example data virtualization system in which regions and datasets are displayed
Figure 3-3 A view of the same catalog as Figure 3-2, but in which schema information about a particular dataset is displayed
Once a client submits a request to the DV System, the Query Parser component of the DV Engine is the first component to start working on the request. The Query Parser examines the request and ensures that it is properly formatted. For example, it checks to make sure that the data sources referenced by the client's request exist and are currently available, and that the referenced tables, files, or other data within those data sources also exist and are allowed to be accessed by that client (and from that client's location). In order to perform these checks, the parser uses information from the Catalog. If there are any problems with the request, it is rejected and an appropriate error message is returned to the client. Otherwise, the Query Parser converts the request into an internal representation that is easier for the system to manipulate than the raw input language used to submit the request to the DV System.
After converting the request to the internal representation, the Query Parser hands off the request to the Query Optimizer. The optimizer considers all of the possible ways that the request can be processed. For example, let's say that the request includes a join of table S from data source A (DSA) with table T from data source B (DSB). There are many different options for performing this join. Table S can be extracted out of DSA and sent to DSB; once DSB receives S, it can perform the join of S and T locally. Alternatively, table T can be extracted out of DSB and sent to DSA; once DSA receives T, it can perform the join of S and T locally. A third alternative: both S and T can be extracted out of their underlying data sources and sent to the DV Engine, and the join performed there. A fourth alternative is the semi-join approach overviewed in Figure 2-2. If there is more than one join in the request, the optimizer must decide what order the joins must be performed in for any set of joins that are not performed entirely in the same underlying system. Furthermore, the DV Engine must decide on the particular join algorithm used for any joins that it performs itself, such as a nested-loops join, a hash join, a sort-merge join, or any of the other possible join algorithms that it knows how to do.2
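To see why the optimizer has real choices to make, here is a toy Python sketch that enumerates the four shipping options just described for the S-T join and picks the one that moves the fewest bytes (the sizes and the key-size fraction for the semi-join are illustrative assumptions, not real statistics):

def join_placement_options(size_s, size_t, est_matching_t):
    # Each option is (description, bytes shipped over the network).
    return [
        ("ship S to DSB, join at DSB", size_s),
        ("ship T to DSA, join at DSA", size_t),
        ("pull S and T, join at DV Engine", size_s + size_t),
        # Semi-join: ship only S's join keys, get back matching T rows;
        # assume keys are roughly 10% of S's size.
        ("semi-join via S's keys", size_s // 10 + est_matching_t),
    ]

options = join_placement_options(10**9, 5 * 10**9, 10**7)
best = min(options, key=lambda opt: opt[1])
print(best)   # a cost-based optimizer would choose the cheapest plan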
Most query optimizers use a cost-based technique to decide between the different options. They estimate the cost in terms of resources used and expected processing time, and then choose the lowest-cost option. Cost estimation is notoriously challenging in all data processing systems because it is highly dependent on the sizes of the input data for the join and the amount of data that successfully joins across tables. To accurately estimate the latter, a relatively deep understanding of the semantics of the data is required, which can be hard for a nonhuman to obtain. Meanwhile, the former requires an understanding of the semantics of all the query processing operations that occurred prior to the join, an even harder task. This is because the size of the input to a join is the same as the output of the previous operation in the query processing pipeline, which in turn is dependent on the actions of all the operations that happened previously.
All Query Optimizers experience the challenges discussed thus far, even those in non-DV Systems. Unfortunately, the cost estimation process in push-based DV Engines adds on additional complexities. There are two reasons for this. First, the DV Engine must decide whether or not to perform a particular operation at a particular data source, in addition to all of the other decisions a standard Query Optimizer must make. To make this decision, the DV Engine needs to estimate the cost of performing the operation at the data source so it can compare this cost to performing the operation locally. However, to estimate that cost, the DV Engine needs to understand what algorithm will be used by the underlying system to perform that operation and the details of how that algorithm is implemented. Since every underlying system works differently, it is very hard for the DV Engine to accurately estimate the cost for every possible underlying system. Nonetheless, this is an area where recent research has shown that machine learning can significantly help improve cost estimates for external systems in which operator implementation details are not fully known or understood.3,4,5,6
The second reason for additional complexities is that the Query Optimizer relies on statistics about the raw data being processed in order to estimate the cost of different potential query plans. These statistics are found in the Catalog. However, these statistics may become stale, since they are not updated immediately as the data in the underlying data source systems is updated. Therefore, the Query Optimizer in a DV System must be careful to not be overly reliant on the accuracy of these statistics.
Once the Query Optimizer chooses a plan according to which the query will be executed, an Admission Control mechanism checks to see if the DV Engine currently has enough resources available to process the request, according to the chosen query plan. If not, the request is queued until enough resources are available. Otherwise, the plan is handed over to the Plan Executor for processing. Admission control is also harder in DV Systems than in other systems. This is because even if the DV Engine itself has enough resources available to process the request, the underlying systems may not. Since the underlying systems are typically running on different physical hardware than the DV Engine, some communication across systems is necessary in order to perform global admission control. Many underlying systems do not provide an interface in which data relevant to admission control can be extracted easily. When this is the case, the Admission Control mechanism needs to use indirect data, such as response times to recently issued requests, to determine if there are enough resources available at the underlying systems to process a request.
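The sketch below shows one plausible form of such indirect monitoring (our own simplification, not a production algorithm): track recent response times per data source and refuse to admit new work when the median latency suggests the source is overloaded.

from collections import deque

class SourceHealth:
    # Tracks recent response times of one underlying data source and
    # uses them as an indirect admission-control signal.
    def __init__(self, window=20, slow_threshold_s=5.0):
        self.latencies = deque(maxlen=window)
        self.slow_threshold_s = slow_threshold_s

    def record(self, latency_s):
        self.latencies.append(latency_s)

    def can_admit(self):
        if not self.latencies:
            return True          # no evidence of overload yet
        ordered = sorted(self.latencies)
        median = ordered[len(ordered) // 2]
        return median < self.slow_threshold_s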
The plan produced by the Query Optimizer includes a series of subqueries (alternatively referred to as subrequests). Each subquery contains a description of all the work assigned to a particular underlying data source system. As a metaphor, a wedding planner may take a complicated task ("plan a wedding") and divide it into subtasks based on the various specialists available: the planner asks the caterer to help with the food, the bartender with the drinks, the florist with the flowers, etc. Once each specialist completes a task, the planner puts the results of the tasks together. Similarly, each data source is given a subquery/subrequest that contains a request to do part of the work of the overall plan. Some data sources contain advanced query processing abilities and are able to complete complex data processing tasks on their local data. Others are simple files on a filesystem, and the subquery sent to them is no more than a simple data extraction operation. Meanwhile, the Plan Executor manages the entire process: it generates, schedules, and sends subqueries to the underlying systems, receives the results, and then performs any further combination, aggregation, or joining of results. The results are then returned to the client, either at the end of query processing or during query processing as results are generated in real time.
Pull-Based DV Engines
Figure 3-4 overviews the high-level architecture of pull-based DV Engines. It shares some similarities with Figure 3-1, but differs in some important aspects.
Figure 3-4 Pull-based DV Engine architecture
The most obvious difference between Figures 3-1 and 3-4 is that the DV Engine in the push-based figure consists of a single machine, whereas the DV Engine in the pull-based figure consists of seven machines: a leader node and six workers.7 This particular difference is not fundamental. In general, it is possible for a push-based DV Engine to consist of more than one machine and a pull-based system to consist of a single machine. Nonetheless, these figures illustrate the common practice. Since push-based systems push most of the query processing work down to the underlying systems, one machine is often sufficient for the remaining work performed by the DV Engine. On the other hand, since pull-based systems pull raw data out of the underlying systems, they need to perform more processing locally, and require more processing power to do so.
The initial stages of request processing (the request parsing, optimization, and admission control) are typically performed by a single machine. This single machine may be designated in advance to perform these tasks for all requests. If it is, the designated machine is called a leader node, and other machines are prepared to take over if the leader node fails for any reason. Alternatively, some DV Systems use a leaderless approach in which different machines perform these initial steps of request processing per request. Either way, the basic functionality of each component of initial request processing (the parsing, optimization, and admission control) is essentially the same for pull-based systems as for push-based systems.
The key difference is the next stage of request processing after these initial stages complete: the Plan Executor. In push-based systems, the Plan Executor is a relatively lightweight component that performs the final combination, aggregation, or joining of results from the subqueries sent to the underlying systems. In contrast, in pull-based systems, the worker machines perform the bulk of the query processing under the direct control of the scheduler within the Plan Executor.
The Plan Executor uses information in the Catalog to divide the source data relevant to the query into chunks (alternatively referred to as splits). These splits are more granular than what is used in push-based systems. Going back to our metaphor, a push-based wedding planner may tell the caterer: "Take care of the food." Meanwhile, a pull-based wedding planner will want to know the names of all the chefs and waitstaff and give particular tasks to each chef and staff member separately. The scheduler then assigns these splits to available worker machines within the DV Engine. Upon receiving a split definition, a worker machine extracts the corresponding raw data from the underlying data source and performs the set of request processing operations that can be immediately performed. Most worker machines are implemented such that they can subdivide the data (and operator processing over this data) across the available CPUs of that worker machine. Some processing operations cannot be immediately performed on the input data, but rather require data to be combined from multiple worker machines before they can occur. Therefore, communication may be required across worker machines before processing completes.
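A minimal sketch of this division of labor might look as follows (hypothetical names; real engines compute splits from byte ranges, row groups, or source partitions recorded in the Catalog):

def make_splits(partitions, chunks_per_partition=4):
    # Subdivide each source partition into finer-grained splits.
    return [f"{p}#chunk{i}"
            for p in partitions
            for i in range(chunks_per_partition)]

def schedule(splits, workers):
    # Round-robin assignment of splits to worker machines.
    assignment = {w: [] for w in workers}
    for i, split in enumerate(splits):
        assignment[workers[i % len(workers)]].append(split)
    return assignment

plan = schedule(make_splits(["orders_p1", "orders_p2"]),
                ["worker1", "worker2", "worker3"])
# Each worker then extracts the raw data for its splits and processes it.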
Hybrid Approaches
We have described the architecture of push-based DV Engines and pull-based DV Engines by accentuating their extremes: we have assumed that push-based systems push as much processing as possible down to the underlying systems, and that pull-based systems always pull raw data out of the underlying system. In practice, most DV Systems do not quite go to these extremes. Push-based DV Engines will sometimes pull raw or minimally processed data, while pull-based DV Engines will sometimes push processing.
For example, a push-based system may determine that it has superior query processing resources or algorithms relative to an underlying data source. When this is the case, it may choose to pull raw or semiprocessed data from the underlying system and do the processing locally. Similarly, when running over a data source that has no processing capabilities (e.g., a simple file in a filesystem), or for which it does not have full knowledge regarding how to express and submit advanced requests using the available API, the push-based DV Engine has no choice except to pull raw data and perform query processing locally.
It is also very common for pull-based DV Engines to push at least some processing down to the underlying system. The most common case for this is predicate evaluation. For example, a dataset may include a column that stores a user's or customer's zip code. If a query enters the system that restricts the zip code (for example, the query is only interested in data with a zip code of 01776), then it is extremely wasteful to ship the entire dataset over the network to the worker nodes in the DV Engine, only to immediately filter out the data from all zip codes except 01776. Instead, the pull-based system will send a request to the underlying data source to only ship data from the dataset that meets that particular criterion. Other simple operations that restrict the data that needs to be shipped over the network (such as basic aggregations) are also often pushed down. Pull-based systems generally only send simple requests such as predicate evaluation and aggregation down to the underlying system, because such requests are easy to express over any data processing API and thus do not require an inordinate amount of additional code (along with the associated code maintenance costs) per data source type to function correctly.
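A sketch of what this looks like in practice: the engine rewrites its extraction subrequest so the filter travels to the source (an illustrative helper of our own, assuming a SQL-speaking source):

def extraction_query(table, columns, pushed_predicate=None):
    # Build the subrequest sent to the source; simple predicates are
    # pushed down so only matching rows cross the network.
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if pushed_predicate:
        sql += f" WHERE {pushed_predicate}"
    return sql

# Without pushdown the whole table ships; with it, only one zip code does:
print(extraction_query("customers", ["id", "zip_code"], "zip_code = '01776'"))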
Another case where pull-based systems will push down processing to a data source is where a request enters the DV System that uses an API identical to that of an underlying system (e.g., standard SQL) and only accesses data from that same underlying system. In such a scenario, the DV Engine can forward the request to the underlying system to be processed in its entirety there. Nonetheless, the DV System may still choose to parse and optimize the request in order to determine whether the request will complete faster if it processes the request directly instead of passing it through to the underlying data source.
Despite the existence of these hybrid approaches, it is still helpful to think about DV Engines using the extremes. Even though push-based systems will sometimes pull data, they are still fundamentally architected to be push-based and are designed and resourced with an expectation that most requests will be pushed down to the underlying data sources. Similarly, even though most pull-based systems will sometimes push requests down to the underlying data source, they are still fundamentally architected to be pull-based and are designed and resourced with an expectation that large amounts of data will be pulled into the DV Engine at request processing runtime.
Common Pitfalls
Before continuing our discussion of the architecture of DV Engines with some more advanced architectural considerations in the next chapter, we conclude this chapter with a discussion of some common pitfalls that come up relative to the components we have discussed so far, and how to resolve them.
Not enough query processing resources
Both pull-based and push-based systems may experience slow response times due to insufficient processing resources available to the DV Engine. For example, if a DV Engine runs on a single machine and needs to scan and process 100 TB of data, there is no way this query can complete quickly. Even if the engine can somehow process 2 GB of data per second, it will still take around 14 hours to finish. The only way to get fast response times for query processing jobs of that scale is to have many machines working together in parallel to complete the scan.
As mentioned several times in this chapter, many push-based DV Engines run on a single machine, since most of the time they do not need to perform much processing: the bulk of the work is pushed down to the underlying data source. However, if the data source consists of simple files (e.g., Parquet files) sitting on a distributed filesystem in a data lake that either does not support query processing or whose processing capability is incompatible with the DV Engine, the push-based DV Engine will end up pulling the data and locally processing it. In such a scenario, scalability limitations are often encountered, especially when the raw input data is large (see Chapter 5 for some examples of this).
Even multimachine pull-based DV Engines frequently need to pull from a data source that is too large for the number of worker nodes available. Going back to our previous example, if the pull-based DV Engine has 10 worker nodes available to process 100 TB of data, instead of taking 14 hours to process (which would be the case with only a single worker node), it takes 1.4 hours. This is still too long for most use cases. Adding more worker nodes is the most effective way to improve query performance in this scenario.
It should be noted that sometimes push-based systems run into scalability issues when pull-based systems do not; other times, pull-based systems run into scalability issues when push-based systems do not. The former case is what happened in our example, where a single node ends up having to process a large dataset. The latter case happens when processing could have been pushed down to a large parallel data source, such as a massively parallel processing (MPP) database management system, and instead gets pulled into a DV Engine with fewer resources than the underlying MPP DBMS.
Requests failing to complete
As described previously in this chapter, admission control is particularly difficult in DV Systems. When it malfunctions, too much work is sent to the DV Engine or to the underlying data source systems. This may cause requests to interfere with each other and the DV Engine to grind to a halt. In such a scenario, reducing the reliance on the Admission Control component within the DV Engine, and instead slowing request input to the engine, may be the only option.
Data extraction bottlenecks
We will see in Chapter 5 that an extremely common performance problem in multimachine DV Engines is that data extraction from the underlying data stores is too slow: the data is unable to arrive at the worker nodes fast enough for them to be fully utilized. This frequently happens when the Plan Executor is unable to divide the source data into multiple splits. Instead, it is forced to send a single data extraction request to the underlying data source, which is processed by a single execution thread there. The speed of the extraction is thus the speed at which that single thread can run, and the fact that there are multiple independent workers downstream processing the data is irrelevant: they are simply not receiving data fast enough to fully utilize all of their processing ability. In many cases, this can be solved by explicitly partitioning the data in the underlying data source. For example, if the underlying data source is a relational database system, adding a PARTITION BY clause to the table definitions in the schema of that source allows an explicit declaration of multiple different partitions at the source. This allows the DV Engine to split the data extraction requests by these partitions and read data in parallel from the data source (though we will see in Chapter 5 that not all DV Engines are capable of doing this automatically). An alternative solution to this bottleneck is to cache source data at the DV Engine, an approach we discuss in detail in Chapter 4. Although the DV Engine controls its own cache, many DV platforms allow users to adjust the size of the cache along with other caching parameters. Adjusting these parameters can have a significant impact on query performance when data extraction is a latency bottleneck.
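Under the assumption that the source exposes its partitions (say, by a region column), a DV Engine can issue one extraction request per partition, each on its own thread, as in this simplified Python sketch (source.run is a hypothetical method that submits SQL to the source):

from concurrent.futures import ThreadPoolExecutor

def read_partition(source, table, partition_value):
    # One extraction request per declared partition.
    return source.run(
        f"SELECT * FROM {table} WHERE region = '{partition_value}'")

def parallel_extract(source, table, partition_values):
    # Each request is handled by its own thread at the source, so the
    # reads proceed in parallel instead of through a single thread.
    with ThreadPoolExecutor(max_workers=len(partition_values)) as pool:
        futures = [pool.submit(read_partition, source, table, p)
                   for p in partition_values]
        return [f.result() for f in futures]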
Chapter 4 Advanced Architectural Components
In the previous chapter, we discussed the basic architectural components that are found in most DV Systems. In this chapter, we will discuss some advanced components that may not be available in all systems. These advanced components either improve performance or add functionality to the DV System. It is important for DV System users to understand these components and how they work since, in many cases, these components can be parameterized to adjust their functionality based on user preferences.
Caching
A core tenet of data virtualization is that it does not require moving data to new locations. Instead, data is left at its original source, and the DV System extracts it on the fly as needed. One disadvantage of this approach is that a dataset may need to be sent over the network multiple times from the underlying data sources to the DV Engine if it is repeatedly accessed by a series of queries. Although we explained in Chapter 2 that faster networks can often prevent this data transfer from becoming a bottleneck, too many unnecessary data transfers can nonetheless overload a network. Furthermore, as we will see in Chapter 5, when underlying data sources make their data available via slow interfaces such as Java Database Connectivity (JDBC), data transfer bottlenecks still very much exist today. Therefore, repeatedly transferring the same datasets is not only inefficient in terms of cost, CPU, and power consumption, but can also degrade the performance of the DV System.
This disadvantage is alleviated via the principle of caching, in which recently accessed data remains temporarily stored in memory or on disk at the DV Engine and is available for subsequent requests.
There are many different ways to cache data, and some DV Systems use multiple different caching approaches. In this section we will discuss the following five approaches:
Query cache
Block/partition cache
Database table cache
Automated pre-computation based cache
Materialized view caching
Most DV Systems will not implement all five approaches, but rather will devote local storage to the most appropriate subset of these approaches given the architecture of that system.
In each case, the cache functions as a (temporary) local copy of some subset of data that would otherwise require communication with remote data sources. Over time, data may be updated at the remote data sources, thereby rendering the data in the cache stale. Most analysis tasks can tolerate a bounded degree of staleness in the cache; therefore, many DV Systems include parameters that allow users to specify what degree of staleness is acceptable. Nonetheless, an effort must be made to keep the staleness of the cache within the specified bounds. One of four approaches is typically used to limit staleness:
The DV System automatically removes data from the cache that has been present there beyond the degree of acceptable staleness.
The DV System sends queries to the underlying data sources on a regular basis to check the status of data and, if anything has changed, to refresh the cache. The worst-case staleness of the cache is thus bounded by the frequency of these checks and refreshes.
The DV System processes the log records of the underlying data source to detect source data changes that may affect the current data in the DV System cache.
The underlying data source may notify the DV System directly if the underlying data has changed.
The first option is the easiest to implement, but it may remove data from the cache unnecessarily (i.e., even if it has not been updated at the underlying system). The second option is also fairly straightforward to implement, but it consumes the most processing resources, due to the repeated communication between the DV System and the underlying data source, along with the cost of performing the repeated checks at the underlying data source. The third and fourth options are more efficient, but they require a deeper integration between the DV System and the underlying data source. The third requires the DV System to know how to read the log records of the underlying data source (each data source may use a different format for its log records), and the fourth requires change notification functionality in the underlying data source, a feature not supported by all data source systems.
When caches run out of space, they need to remove data in the cache to make room for new data. Most caches use some form of the least recently used (LRU) algorithm to decide which data to remove. Data that has been in the cache for a long time without being used is therefore removed before actively used data.
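Combining the two policies just described (refresh option 1's staleness bound plus LRU eviction) yields a cache along these lines; this is a minimal sketch, not any vendor's implementation:

import time
from collections import OrderedDict

class BoundedCache:
    # LRU cache with a staleness bound: entries older than max_age_s are
    # treated as expired, and the least recently used entry is evicted
    # when the cache is full.
    def __init__(self, capacity, max_age_s):
        self.capacity, self.max_age_s = capacity, max_age_s
        self.entries = OrderedDict()   # key -> (inserted_at, value)

    def get(self, key):
        if key not in self.entries:
            return None
        inserted_at, value = self.entries[key]
        if time.time() - inserted_at > self.max_age_s:
            del self.entries[key]      # too stale to serve
            return None
        self.entries.move_to_end(key)  # mark as recently used
        return value

    def put(self, key, value):
        self.entries[key] = (time.time(), value)
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the LRU entry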
Query Cache
Many DV Systems implement some form of a query cache. Once a query has been executed, the query results are stored either in memory or on some form of fast disk. If, in the future, the same query is submitted to the DV System, the system can reuse the previously stored results rather than wasting resources to process the exact same query again.1
Some queries cannot be cached. For example, if the query contains nondeterministic SQL functions such as now() or rand(), the same query may return different results every time it is run. Therefore, there is no benefit to caching the query results of previous runs.2
Query caches must enforce security rules to ensure that a user without access to run the original query cannot see the results of the cached query either. In practice, this is done by sending the query to the Query Parser described in Chapter 3, since this is the component that typically enforces the security rules. The parser is also helpful in checking to see if two queries are identical even if they are expressed slightly differently.
Of the cache refresh options we discussed previously, the second option is typically prohibitively expensive for query caches. This is because in order to check if the cache needs to be refreshed, the query needs to be run again. If the frequency of the checks is higher than the frequency of repeated queries submitted to the DV System, the cost of maintaining the query cache may be higher than the processing reduction benefits it provides. Therefore, query caches are typically implemented via one of the other three refresh options discussed.
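Putting these pieces together, a query cache lookup might be keyed roughly as follows (a sketch under our own simplifying assumptions: we key on the parser's normalized query text plus the requesting user, and refuse to cache nondeterministic queries):

import hashlib

NONDETERMINISTIC = ("now()", "rand()")

def query_cache_key(normalized_sql, user):
    # Skip queries whose results may differ on every run.
    if any(fn in normalized_sql.lower() for fn in NONDETERMINISTIC):
        return None
    digest = hashlib.sha256(normalized_sql.encode()).hexdigest()
    # Including the user in the key is one blunt way to keep a user from
    # seeing cached results they are not authorized to produce themselves;
    # real systems route the check through the parser's security rules.
    return f"{user}:{digest}"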
Block/Partition Cache
Many data source systems partition the data they store. For example, if the data source system runs over a cluster of disks, the data will be partitioned across those disks. Even if the system runs on a single machine that has a single disk, the system may choose to partition data across the storage of that machine to accelerate queries over the stored data, such as dividing data by zip code so that queries that access a particular zip code can quickly find the relevant data. File formats such as Parquet and ORC, table formats such as Iceberg and Delta, and underlying database systems such as PostgreSQL, Teradata, or Oracle all support data partitioning.
A partition cache stores entire source partitions at the DV System. For example, the partition associated with the zip code 01776 may be cached at the DV System, while other zip code partitions are not.
For DV Systems that incorporate a local DV processing engine, a partition cache can significantly accelerate query processing. However, this is only the case when the DV platform receives mostly raw data from the underlying data storage (i.e., entire partitions of raw data) instead of pushing down the query processing to the storage system. Thus, push-based DV Systems (see Chapter 3) generally do not include a partition cache.
A partition cache does not necessarily need to use the same notion of a partition as the underlying data source system. As discussed in Chapter 3, many DV Systems subdivide source partitions into smaller chunks (often called splits), and the partition cache may choose to operate at the granularity of a split instead of a complete source partition.
With regard to the cache refresh options we discussed previously, the second option is far more feasible for partition caches than for query caches. This is because in many cases, it is easy to immediately determine whether a partition has been updated or removed at the data source without issuing a separate query. For example, if the source data is a raw data file in a data lake, the DV System can perform this check directly using file metadata.
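For instance, when the cached partition came from a file in a data lake, a freshness check can be as simple as comparing modification times (a sketch; object stores would compare ETags or snapshot IDs instead):

import os

def cached_partition_fresh(path, cached_mtime):
    # The cached copy is fresh only if the source file has not been
    # modified (or deleted) since the copy was taken.
    try:
        return os.path.getmtime(path) == cached_mtime
    except FileNotFoundError:
        return False   # partition removed at the source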
Access to the data in the partition cache needs to use the same authorization mechanisms as the underlying data.
In general, the trade-off between a partition cache and a query cache is as follows: data in a partition cache has a higher probability of being utilized in future query processing than data in a query cache, because this data is raw and is usable by many different types of future queries. On the other hand, the partition cache saves less work for the DV platform relative to a query cache, since the DV platform still has to process the rest of the query once the raw data is received. In contrast, data in a query cache already contains the final results of query processing.
Database Table Cache
In theory, a database table cache is equivalent to a partition cache in which all partitions from a table are cached or none of them are. However, in practice, partition caches and table caches are typically used in different scenarios. As mentioned in the previous section, partition caches cannot be used for underlying data source systems that do not partition data. Even if they do partition data, if these partitions are not exposed to external systems in a manner where keeping track of and extracting individual partitions is easy, the utility of partition caches is reduced significantly. In such scenarios, table caches are more feasible, since it is always possible to keep track of what tables exist (and extract data from any given table) for any underlying data source system that incorporates the concept of tables in its local model of data storage. Thus, table caching becomes the primary mechanism to cache raw data when partition caching is not possible. The main downside of table caching relative to partition caching is that it is less memory-efficient: even if only part of a table is repeatedly accessed, the entire table must be kept in cache.
In practice, partition caching is commonly used in DV Systems when working with data stored using open data formats in a data lake. Table caching is more common when working over complex data management systems such as SQL Server, Db2, or MySQL.
Another difference (in practice) between table caching and partition caching is that while the partition cache is typically stored in the DV System directly, the table cache may be stored on an external storage system or a data lake. This is because the table cache typically requires more space, since entire tables must be cached. One might ask: "If the table cache is stored on an external storage system, how is the cache helpful at all? The DV System should simply read the data from the original data source directly, since the original data source is also an external system." The answer to this question is that in many cases, the external storage system used for the table cache is physically closer to the DV System than the data source system, or is otherwise faster to access and read data from than the data source system. This is especially true if the data source system has an ongoing operational workload that may slow down read requests by the DV System. Furthermore, offloading these read requests to a different external system allows the operational system to devote more resources to its normal operational workload.
Similar to the other caching approaches we have discussed, the security rules used for the original tables must also be applied to the replica of the data that is cached.
Automated Pre-Computation Based Cache
The automated pre-computation based cache is a generalization of the query cache. As previously described, the query cache only caches the results of a previous query submitted to the DV System. In contrast, the automated pre-computation based cache attempts to cache a larger result set than that which was returned by any particular query, so even queries that have never been submitted to the system can still be answered from the data in the cache. The DV platform analyzes the log of previous queries that have been executed by the system, and then attempts to extrapolate based on the pattern of these previous queries, preparing data for future queries by widening the scope of the previous ones. For example, a query such as:
SELECT sum(sales) FROM sock_table
WHERE date = '01/01/1970'
would result in a single value: the sales of socks on January 1, 1970. This query can be widened to include more aggregations (aside from only sum(sales)) and more dates (aside from only 01/01/1970) as follows:
SELECT sum(sales), min(sales), max(sales), avg(sales), date
FROM sock_table
GROUP BY date
Materialized View Caching
Both the query cache and the automated pre-computation based cache are fully automated. The user of the DV System does not control what is stored in the cache, when data is removed, or which queries are redirected to these caches. In contrast, a materialized view cache is fully controlled by the DV System user. If the system user is aware of a type of query that will be repeatedly sent to the system, a materialized view can be created to accelerate the performance of this query. The user controls which materialized views are created and when they can be removed. Meanwhile, the DV System redirects queries to these materialized views if they contain the relevant data for any given input query.
These materialized views may be stored in the DV System, the data source system, or even an external data lake. Similar to the other caching approaches we have discussed, the equivalent security rules used for the raw data must also be used for the materialized view cache.
DV Engine–Initiated Writes to Underlying Data Sources
In general, the primary use case for DV Systems is querying, analyzing, and processing data from the original data sources that are being virtualized. In other words, data generally flows one way through the system: from the data source through the DV Engine to the client. Nonetheless, some DV Systems support bidirectional data flow, where clients can also write or update data in the underlying data sources.
There are several challenges that come into play when supporting bidirectional data flow.
First, in some cases, the underlying data source does not give the DV System access to all data stored locally, but instead provides the system with access to a view of the data. For example, an underlying data source in Europe may contain private or personally identifiable information (PII) that is regulated by the General Data Protection Regulation (GDPR), which prevents it from being transferred to locations with lower levels of privacy protection. If the DV Engine is located in one of those locations, the underlying data source will only give it a view that does not include the private information. If the DV Engine wants to insert new entries into this view, it will not be able to include personal information, since the view itself does not include personal information. Such writes will be rejected when they are forwarded to the original data source if some of the attributes of this personal data are required fields in the database.
In general, writing back to views is a well-known problem in the field of database systems and is in fact impossible in most nontrivial use cases. So when the DV System is working over a view, bidirectional data flow is often impossible.
Second, as we discussed previously and in Chapter 3, one of the primary advantages of pull-based DV Systems is that they require a much shallower knowledge of the API of each system they need to work with. As long as the DV System knows how to extract raw data from that system, it can perform its needed functionality. Supporting bidirectional flow gets rid of much of this advantage, since the DV System will need to acquire a deeper knowledge of the API in order to express writes back to the underlying data source in a way that upholds the isolation, atomicity, and consistency guarantees of the underlying system.
Third, in many cases, writes may cause consistency problems between the cache in the DV System and the data in the underlying data source.
In general, most modern DV Systems do not support bidirectional data flow. However, push-based systems are more likely to support bidirectional data flow than pull-based systems, since the second challenge is not relevant to push-based systems.
Multiregion (and/or Multicloud) DV Systems
We have discussed at length that DV Systems are designed to run over many different types of data sources. Until now, however, we have not discussed the physical location of those data sources. In practice, it is rare for all of the data sources for a nontrivial DV System deployment to be physically located in the same data center. Instead, they are often located in different data centers and often in different geographical regions (and countries).
There are many reasons why an organization may have data sources in different regions. Examples include:
The organization has recently migrated some of its applications to the cloud, but the rest of the applications are still running on-premises.
An organization acquires or merges with another organization that does business in a different part of the world and whose datasets are kept close to where the application does business.
An organization that does business globally chooses to keep data close to where it is most often accessed, for performance reasons.
Regulatory requirements, such as data sovereignty rules, force the organization to keep certain datasets in certain regions.
One cloud is more cost-efficient for the access patterns of a particular application, but a different cloud is cheaper for the access patterns of a different application within the same organization.
When data sources exist in different regions, additional complexities are added to the architecture of DV Systems, especially pull-based systems. This is because pulling data across different geographical regions can be more expensive in terms of latency and money (because of egress charges) than pulling data from sources within the same data center. Furthermore, there may be regulatory issues around the movement of data from data sources with data sovereignty requirements to the DV System for data processing.
Multiregion DV Architecture
The fundamental systems architecture principle used to circumvent the issues discussed in the previous section is that the DV System cannot run only in a single region. Rather, there need to be components of the DV System that run within the same region as each potential data source.
When the DV Engine needs to pull data out of a data source, it pulls it into the component of the DV System running near that source. If data needs to be extracted out of a data source in Germany, it is sent to DV Engine code running in Germany; if out of a data source in Japan, it is sent to DV Engine code running in Japan. The DV Engine code running at each region therefore processes data sent to it from that region. Figure 4-1 outlines this architecture.
Figure 4-1 Multiple DV Engines processing data local to the source
This figure shows a DV platform with data sources in two regions and a data consumer issuing a data processing request at one of those regions. There are three potential scenarios for how the query is processed, depending on what data is accessed by that request (a code sketch of the routing logic follows the list of scenarios):
If the request only accesses data within the same region as where the request is initiated, the request is simply processed locally using the standard DV Engine architecture: the query is parsed, optimized, and planned locally, and then processed using the pull-based or push-based DV Engine architecture (depending on the design of the DV System being used). All data transfers are local.
If the request only accesses data from a remote region (e.g., it is submitted to region 1 but only accesses data from region 2), the request is forwarded in its entirety to region 2. Some parsing was already performed in region 1 (in order to determine which data sources were being accessed), but the rest of the parsing, optimizing, and planning will be done in region 2, along with the execution of the request. If the results of the request are allowed to be sent to the consumer in region 1 (e.g., no data sovereignty rules will be broken, and the result set is not too large), then they can be sent there. Otherwise, the consumer will have to access the results directly from region 2 (e.g., via a VPN).
If the request accesses data from multiple regions, then instances of the DV Engine in both regions must work together to process the request. In this case, the region to which the request was submitted performs the initial query parsing, optimization, and planning. During this planning process, data within a remote region is treated as a giant unified data source. This process is very natural for push-based systems: just as a normal input request is divided into subrequests, one for each underlying data source, so too a subrequest will be generated that processes data at the combined data source. All data processing that needs to happen on any data within that region is combined into that subrequest and sent to that region. The DV Engine at that region receives that subrequest and further divides it into smaller subrequests, one for each data source that needs to be accessed at that region.
However, the process is a little less natural for pull-based systems, since they initially must function like a push-based system, dividing a request into subrequests for each region and sending the subrequest to each region. It is only after the subrequest arrives at that region that the system can revert back to its normal pull-based processing.
The region to which the processing request was submitted is typically also the one that combines the results of the subrequests and performs any further necessary processing after combining them. However, in some cases a different region is chosen (e.g., if the intermediate result set from that region is much larger than the intermediate results from any other region). Care must be taken to ensure that data sovereignty rules are not violated when sending data across regions at this later stage of query processing (see more later in this section).
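The routing decision across these three scenarios can be summarized in a few lines of Python (a sketch with hypothetical names, ignoring sovereignty checks and result-size considerations):

def route_request(submit_region, regions_accessed):
    # Scenario 1: all data is local to the submitting region.
    if regions_accessed == {submit_region}:
        return "process locally with the standard DV Engine architecture"
    # Scenario 2: all data lives in one remote region.
    if len(regions_accessed) == 1:
        remote = next(iter(regions_accessed))
        return f"forward the entire request to {remote}"
    # Scenario 3: data spans regions; one push-style subrequest per
    # region, with results combined at the submitting region.
    return f"split into per-region subrequests; combine at {submit_region}"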
We described in Chapters 2 and 3 that query optimization in DV Systems is complex, and much more so for push-based systems, since the optimizer needs to reason about cost equations and intermediate result set sizes of remote processing. Since the multiregion DV architecture must always use push-based processing when sending work to remote regions, some of the complexity of query optimization in the context of push-based systems exists in this context. However, since the code running at remote regions is part of the same DV System, the optimizer within the DV Engine has detailed knowledge about how the different query processing options will be run there and can estimate their anticipated costs. This makes query optimization easier relative to optimizers for standard push-based DV Engines running over foreign data sources that act as a black box relative to the DV Engine.