Designing Cloud Data Platforms


"About the Book In Designing Cloud Data Platforms, Danil Zburivsky and Lynda Partner reveal a six-layer approach that increases flexibility and reduces costs. Discover patterns for ingesting data from a variety of sources, then learn to harness pre-built services provided by cloud vendors. What''''s Inside Best practices for structured and unstructured data sets Cloud-ready machine learning tools Metadata and real-time analytics Defensive architecture, access, and security"


about this book

about the authors

about the cover illustration

1 Introducing the data platform

1.1 The trends behind the change from data warehouses to data platforms

1.2 Data warehouses struggle with data variety, volume, and velocity

Variety
Volume
Velocity
All the V’s at once

1.3 Data lakes to the rescue?

1.4 Along came the cloud

1.5 Cloud, data lakes, and data warehouses: The emergence of cloud data platforms

1.6 Building blocks of a cloud data platform

Ingestion layer
Storage layer
Processing layer
Serving layer

1.7 How the cloud data platform deals with the three V’s

Variety
Volume
Velocity
Two more V’s

1.8 Common use cases

2 Why a data platform and not just a data warehouse

2.1 Cloud data platforms and cloud data warehouses: The practical aspects

A closer look at the data sources
An example cloud data warehouse–only architecture
An example cloud data platform architecture

2.2 Ingesting data

Ingesting data directly into Azure Synapse
Ingesting data into an Azure data platform
Managing changes in upstream data sources

2.3 Processing data

Processing data in the warehouse
Processing data in the data platform

2.4 Accessing data

2.5 Cloud cost considerations

2.6 Exercise answers

3 Getting bigger and leveraging the Big 3: Amazon, Microsoft Azure, and Google

3.1 Cloud data platform layered architecture

Data ingestion layer
Fast and slow storage
Processing layer
Technical metadata layer
The serving layer and data consumers
Orchestration and ETL overlay layers

3.2 The importance of layers in a data platform architecture

3.3 Mapping cloud data platform layers to specific tools

AWS
Google Cloud
Azure

3.4 Open source and commercial alternatives

Batch data ingestion
Streaming data ingestion and real-time analytics
Orchestration layer

3.5 Exercise answers

4 Getting data into the platform

4.1 Databases, files, APIs, and streams

Relational databases
Incremental table ingestion
Change data capture (CDC)
CDC vendors overview
Data type conversion
Ingesting data from NoSQL databases
Capturing important metadata for RDBMS or NoSQL ingestion pipelines

4.3 Ingesting data from files

Tracking ingested files
Capturing file ingestion metadata

4.4 Ingesting data from streams

Differences between batch and streaming ingestion
Capturing streaming pipeline metadata

4.5 Ingesting data from SaaS applications

No standard approach to API design
No standard way to deal with full vs incremental data exports
Resulting data is typically highly nested JSON

4.6 Network and security considerations for data ingestion into the cloud

Connecting other networks to your cloud data platform

4.7 Exercise answers

5 Organizing and processing data

5.1 Processing as a separate layer in the data platform

5.2 Data processing stages

5.3 Organizing your cloud storage

Cloud storage containers and folders

5.4 Common data processing steps

File format conversion
Data deduplication
Data quality checks

5.5 Configurable pipelines

5.6 Exercise answers

6 Real-time data processing and analytics

6.1 Real-time ingestion vs real-time processing

6.2 Use cases for real-time data processing

Retail use case: Real-time ingestion
Online gaming use case: Real-time ingestion and real-time processing
Summary of real-time ingestion vs real-time processing

6.3 When should you use real-time ingestion and/or real-time processing?

6.4 Organizing data for real-time use

The anatomy of fast storage
How does fast storage scale?
Organizing data in the real-time storage

6.5 Common data transformations in real time

Causes of duplicates in real-time systems
Deduplicating data in real-time systems
Converting message formats in real-time pipelines
Real-time data quality checks
Combining batch and real-time data

6.6 Cloud services for real-time data processing

AWS real-time processing services
Google Cloud real-time processing services
Azure real-time processing services

6.7 Exercise answers

7 Metadata layer architecture

7.1 What we mean by metadata

Business metadata
Data platform internal metadata or “pipeline metadata”

7.2 Taking advantage of pipeline metadata

7.3 Metadata model

Metadata domains

7.4 Metadata layer implementation options

Metadata layer as a collection of configuration files
Metadata database

8 Schema management

Schema changes in a traditional data warehouse architecture
Schema-on-read approach

8.2 Schema-management approaches

Schema as a contract
Schema management in the data platform
Monitoring schema changes

8.3 Schema Registry implementation

Apache Avro schemas
Existing Schema Registry implementations
Schema Registry as part of a metadata layer

8.4 Schema evolution scenarios

Schema compatibility rules
Schema evolution and data transformation pipelines

8.5 Schema evolution and data warehouses

Schema-management features of cloud data warehouses

8.6 Exercise answers

9 Data access and security

9.1 Different types of data consumers

9.2 Cloud data warehouses

AWS Redshift
Azure Synapse
Google BigQuery
Choosing the right data warehouse

9.3 Application data access

Cloud relational databases
Cloud key/value data stores
Full-text search services
In-memory cache

9.4 Machine learning on the data platform

Machine learning model lifecycle on a cloud data platform
ML cloud collaboration tools

9.5 Business intelligence and reporting tools

Traditional BI tools and cloud data platform integration
Using Excel as a BI tool
BI tools that are external to the cloud provider

9.6 Data security

Users, groups, and roles
Credentials and configuration management
Data encryption
Network boundaries

9.7 Exercise answers

10 Fueling business value with data platforms

10.1 Why you need a data strategy

10.2 The analytics maturity journey

SEE: Getting insights from data
PREDICT: Using data to predict what to do
DO: Making your analytics actionable
CREATE: Going beyond analytics into products

10.3 The data platform: The engine that powers analytics maturity

10.4 Platform project stoppers

Time does indeed kill
User adoption
User trust and the need for data governance
Operating in a platform silo
The dollar dance

1 Introducing the data platform

This chapter covers

Driving change in the world of analytics data

Understanding the growth of data volume, variety, and velocity, and why the traditional data warehouse can’t keep up

Learning why data lakes alone aren’t the answer

Discussing the emergence of the cloud data platform

Studying the core building blocks of the cloud data platform

Viewing sample use cases for cloud data platforms

Every business, whether it realizes it or not, requires analytics. It’s a fact. There has always been a need to measure important business metrics and make decisions based on these measurements. Questions such as “How many items did we sell last month?” and “What’s the fastest way to ship a package from A to B?” have evolved to “How many new website customers purchased a premium subscription?” and “What does my IoT data tell me about customer behavior?”

Before computers became ubiquitous, we relied on ledgers, inventory lists, a healthy dose of intuition, and other limited, manual means of tracking and analyzing business metrics. The late 1980s ushered in the concept of a data warehouse—a centralized repository of structured data combined from multiple sources—which was typically used to produce static reports. Armed with this data warehouse, businesses were increasingly able to shift from intuition-based decision making to an approach based on data. However, as technology and our needs evolved, we’ve gradually shifted toward a new data management construct: the data platform that increasingly resides in the cloud.

Simply put, a cloud data platform is “a cloud-native platform capable of cost-effectively ingesting, integrating, transforming, and managing an almost unlimited amount of data of any type in order to facilitate analytics outcomes.” Cloud data platforms solve or significantly improve many of the fundamental problems and shortcomings that plague traditional data warehouses and even modern data lakes—problems that center around data variety, volume, and velocity, or the three V’s.

In this book, we’ll set the stage by taking a brief look at some of the core constructs of the data warehouse and how they lead to the shortcomings outlined in the three V’s. Then we’ll consider how data warehouses and data lakes can work together to function as a data platform. We’ll discuss the key components of an efficient, robust, and flexible data platform design and compare the various cloud tools and services that can be used in each layer of your design. We’ll demonstrate the steps involved in ingesting, organizing, and processing data in the data platform for both batch and real-time/streaming data. After ingesting and processing data in the platform, we will move on to data management, with a focus on the creation and use of technical metadata and schema management. We’ll discuss the various data consumers and the ways data in the platform can be consumed, and then end with a discussion of how the data platform supports the business and a list of common nontechnical items that should be taken into consideration to ensure use of the data platform is maximized.

By the time you’ve finished reading, you’ll be able to

• Design your own data platform using a modular design

• Design for the long term to ensure it is manageable, versatile, and scalable

• Explain and justify your design decisions to others

• Pick the right cloud tools for each part of your design

• Avoid common pitfalls and mistakes

• Adapt your design to a changing cloud ecosystem

1.1 The trends behind the change from data warehouses to data platforms

Data warehouses have, for the most part, stood the test of time and are still used in almost all enterprises. But several recent trends have made their shortcomings painfully obvious.

The explosion in popularity of software as a service (SaaS) has resulted in a big increase in the variety and number of sources of data being collected. SaaS and other systems produce a variety of data types beyond the structured data found in traditional data warehouses, including semistructured and unstructured data. These last two data types are notoriously data warehouse unfriendly, and they are also prime contributors to both the increasing velocity (the rate at which data arrives into your organization), as real-time streaming starts to supplant daily batch updates, and the growing volume (the total amount) of data.

Another and arguably more significant trend, however, is the change of application architecture from monolithic to microservices. Since in the microservices world there is no central operational database from which to pull data, collecting messages from these microservices becomes one of the most important analytics tasks. To keep up with these changes, a traditional data warehouse requires rapid, expensive, and ongoing investments in hardware and software upgrades. With today’s pricing models, that eventually becomes extremely cost prohibitive.

There’s also growing pressure from business users and data scientists who use modern analytics tools that can require access to raw data not typically stored in data warehouses. This growing demand for self-service access to data also puts stress on the rigid data models associated with traditional data warehouses.

1.2 Data warehouses struggle with data variety, volume, and velocity

This section explains why a data warehouse alone just won’t deliver on the growth in data variety, volume, and velocity being experienced today, and how combining a data lake with a data warehouse to create a data platform can address the challenges associated with today’s data: variety, volume, and velocity.

The following diagram (figure 1.1) illustrates how a relational warehouse typically has an ETL tool or process that delivers data into tables in the data warehouse on a schedule. It also has storage, compute (i.e., processing), and SQL services all running on a single physical machine.

Figure 1.1 Traditional data warehouse design

This single-machine architecture significantly limits flexibility. For example, you may not be able to add more processing capacity to your warehouse without affecting storage.


1.2.1 Variety

Variety is indeed the spice of life when it comes to analytics. But traditional data warehouses are designed to work exclusively with structured data (see figure 1.2). This worked well when most ingested data came from other relational data systems, but with the explosion of SaaS, social media, and IoT (Internet of Things), the types of data being demanded by modern analytics are much more varied and now include unstructured data such as text, audio, and video.

SaaS vendors, under pressure to make data available to their customers, started building application APIs using the JSON format, a popular way to exchange data between systems. While this format provides a lot of flexibility, it comes with a tendency to change schemas often and without warning—making it only semistructured. In addition to JSON, there are other formats, such as Avro or Protocol Buffers, that produce semistructured data for developers of upstream applications to choose from. Finally, there are binary, image, video, and audio data—truly unstructured data that’s in high demand by data science teams. Data warehouses weren’t designed to deal with anything but structured data, and even then, they aren’t flexible enough to adapt to the frequent schema changes in structured data that the popularity of SaaS systems has made commonplace.

Figure 1.2 Handling of a range of data varieties and processing options are limited in a traditional data warehouse.

Inside a data warehouse, you’re also limited to processing data either in the data warehouse’s built-in SQL engine or a warehouse-specific stored procedure language. This limits your ability to extend the warehouse to support new data formats or processing scenarios. SQL is a great query language, but it’s not a great programming language because it lacks many of the tools today’s software developers take for granted: testing, abstractions, packaging, libraries for common logic, and so on. ETL (extract, transform, load) tools often use SQL as a processing language and push all processing into the warehouse. This, of course, limits the types of data formats you can deal with efficiently.

1.2.2 Volume

Data volume is everyone’s problem. In today’s internet-enabled world, even a small organization may need to process and analyze terabytes of data. IT departments are regularly being asked to corral more and more data. Clickstreams of user activity from websites, social media data, third-party data sets, and machine-generated data from IoT sensors all produce high-volume data sets that businesses often need to access.

Figure 1.3 In traditional data warehouses, storage and processing are coupled.

In a traditional data warehouse (figure 1.3), storage and processing are coupled together, significantly limiting scalability and flexibility. To accommodate a surge in data volume in traditional relational data warehouses, bigger servers with more disk, RAM, and CPU to process the data must be purchased and installed. This approach is slow and very expensive, because you can’t get storage without compute, and buying more servers to increase storage means that you are likely paying for compute that you might not need, or vice versa. Storage appliances evolved as a solution to this problem but did not eliminate the challenges of easily scaling compute and storage at a cost-effective ratio. The bottom line is that in a traditional data warehouse design, processing large volumes of data is available only to organizations with significant IT budgets.

1.2.3 Velocity

Data velocity, the speed at which data arrives into your data system and is processed, might not be a problem for you today, but with analytics going real-time, it’s just a question of when, not if. With the increasing proliferation of sensors, streaming data is becoming commonplace. In addition to the growing need to ingest and process streaming data, there’s increasing demand to produce analytics in as close to real-time as possible.


Traditional data warehouses are batch-oriented: take nightly data, load it into a staging area, apply business logic, and load your fact and dimension tables. This means that your data and analytics are delayed until these processes are completed for all new data in a batch. Streaming data is available more quickly but forces you to deal with each data point separately as it comes in. This doesn’t work in a data warehouse and requires a whole new infrastructure to deliver data over the network, buffer it in memory, provide reliability of computation, etc.

1.2.4 All the V’s at once

The emergence of artificial intelligence and its popular subset, machine learning, creates a trifecta of V’s. When data scientists become users of your data systems, volume and variety challenges come into play all at once. Machine learning models love data—lots and lots of it (i.e., volume). Models developed by data scientists usually require access not just to the organized, curated data in the data warehouse, but also to the raw source-file data of all types that’s typically not brought into the data warehouse (i.e., variety). Their models are compute intensive, and when run against data in a data warehouse, they put enormous performance pressure on the system, especially when they run against data arriving in near-real time (velocity). With current data warehouse architectures, these models often take hours or even days to run. They also impact warehouse performance for all other users while they’re running. Finding a way to give data scientists access to high-volume, high-variety data will allow you to capitalize on the promise of advanced analytics while reducing its impact on other users and, if done correctly, it can keep costs lower.

1.3 Data lakes to the rescue?

A data lake, as defined by TechTarget’s WhatIs.com, is “a storage repository that holds a vast amount of raw data in its native format until it is needed.” Gartner Research adds a bit more context in its definition: “a collection of storage instances of various data assets additional to the originating data sources. These assets are stored in a near-exact (or even exact) copy of the source format. As a result, the data lake is an unintegrated, non-subject-oriented collection of data.”

The concept of a data lake evolved from these megatrends, as organizations desperately needed a way to deal with increasing numbers of data formats and growing volumes and velocities of data that traditional data warehouses couldn’t handle. The data lake was to be the place where you could bring any data you want, from different sources—structured, unstructured, semistructured, or binary. It was the place where you could store and process all your data in a scalable manner.

After the introduction of Apache Hadoop in 2006, data lakes became synonymous with the ecosystem of open source software utilities, known simply as “Hadoop,” that provided a software framework for distributed storage and processing of big data, using a network of many computers to solve problems involving massive amounts of data and computation. While most would argue that Hadoop is more than a data lake, it did address some of the variety, velocity, and volume challenges discussed earlier in this chapter:


• Variety—Hadoop’s ability to do schema on read (versus the data warehouse’s schema on write) meant that any file in any format could be immediately stored on the system, and processing could take place later. Unlike data warehouses, where processing could only be done on the structured data in the data warehouse, processing in Hadoop could be done on any data type.

• Volume—Unlike the expensive, specialized hardware often required for warehouses, Hadoop systems took advantage of distributed processing and storage across less expensive commodity hardware that could be added in smaller increments as needed. This made storage less expensive, and the distributed nature of processing made it easier and faster to do processing because the workload could be split among many servers.

• Velocity—When it came to streaming and real-time processing, ingesting and storing streaming data was easy and inexpensive on Hadoop. It was also possible, with the help of some custom code, to do real-time processing on Hadoop using products such as Hive or MapReduce or, more recently, Spark.

Hadoop’s ability to cost-effectively store and process huge amounts of data in its native format was a step in the right direction towards handling the variety, volume, and velocity of today’s data estate, and for almost a decade, it was the de facto standard for data lakes in the data center. But Hadoop did have shortcomings:

• It is a complex system with many integrated components that run on hardware in a data center. This makes it difficult to maintain and requires a team of highly skilled support engineers to keep the system secure and operational.

• It isn’t easy for users who want to access the data. Its unstructured approach to storage, while more flexible than the very structured and curated data warehouse, is often too difficult for business users to make sense of.

• From a developer perspective, its use of an “open” toolset makes it very flexible, but its lack of cohesiveness makes it challenging to use. For example, you can install any language, library, or utility onto a Hadoop framework to process data, but you would have to know all those languages and libraries instead of using a generic interface such as SQL.

• Storage and compute are not separate, meaning that while the same hardware can be used for both storage and compute, it can only be deployed effectively in a static ratio. This limits its flexibility and cost-effectiveness.

• Adding hardware to scale the system often takes months, resulting in a cluster that is either chronically over- or underutilized.


Inevitably a better answer came along—one that had the benefits of Hadoop, eliminated its shortcomings, and brought even more flexibility to designers of data systems. Along came the cloud.

1.4 Along came the cloud

The advent of the public cloud, with its on-demand storage, compute resource provisioning, and pay-per-usage pricing model, allowed data lake design to move beyond the limitations of Hadoop. The public cloud allowed the data lake to include more flexibility in design and scalability and be more cost effective while drastically reducing the amount of support required.

Data warehouses and data lakes have moved to the cloud and are increasingly offered as a platform as a service (PaaS), defined by Wikipedia as “a category of cloud computing services that provides a platform allowing customers to develop, run, and manage applications without the complexity of building and maintaining the infrastructure typically associated with developing and launching an app.” Using PaaS allows organizations to take advantage of additional flexibility and cost-effective scalability. There’s also a new generation of data processing frameworks available only in the cloud that combine scalability with support for modern programming languages and integrate well into the overall cloud paradigm.

The advent of the public cloud changed everything when it came to analytics data systems. It allowed data lake design to move beyond the limitations of Hadoop and allowed for the creation of a combined data lake and data warehouse solution that went far beyond what was available on premises.

The cloud brought so many things, but topping the list were the following:

• Elastic resources—Whether you’re talking storage or compute, you can get either from your favorite cloud vendor: the amount of that resource is allocated to you exactly as you need it, and it grows and shrinks as your needs change—automatically or by request.

• Modularity—Storage and compute are separate in a cloud world. No longer do you have to buy both when you need only one, which optimizes your investment.

• Pay per use—Nothing is more irksome than paying for something you aren’t using. In a cloud world, you only pay for what you use, so you no longer have to invest in overprovisioned systems in anticipation of future demand.

• Cloud turns capital investment, capital budgets, and capital amortization into operational expense—This is tied to pay per use. Compute and storage resources are now utilities rather than owned infrastructure.

• Managed services are the norm—In an on-premises world, human resources are needed for the operation, support, and updating of a data system. In a cloud world, many of these functions are done by the cloud provider and are included in the use of the services.

• Instant availability—Ordering and deploying a new server can take months. Ordering and deploying a cloud service takes minutes.

• A new generation of cloud-only processing frameworks—There’s a new generation of data processing frameworks available only in the cloud that combine scalability with support for modern programming languages and integrate well into the overall cloud paradigm.

• Faster feature introduction—Data warehouses have moved to the cloud and are increasingly offered as PaaS, allowing organizations to take instant advantage of new features.

Let’s look at an example: Amazon Web Services (AWS) EMR.

AWS EMR is a cloud data platform for processing data using open source tools. It is offered as a managed service from AWS and allows you to run Hadoop and Spark jobs on AWS. All you need to do to create a new cluster is to specify how many virtual machines you need and what type of machines you want. You also need to provide a list of software you want to install on the cluster, and AWS will do the rest for you. In several minutes you have a fully functional cluster up and running. Compare that to months of planning, procuring, deploying, and configuring an on-premises Hadoop cluster! Additionally, AWS EMR allows you to store data on AWS S3 and process the data on an AWS EMR cluster without permanently storing any data on AWS EMR machines. This unlocks a lot of flexibility in the number of clusters you can run and their configuration and allows you to create ephemeral clusters that can be disposed of once their job is done.
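To make this concrete, here is a minimal sketch of provisioning such an ephemeral cluster with boto3, the AWS SDK for Python. The cluster name, instance types and counts, log bucket, and IAM role names below are illustrative placeholders, not values from the book.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # Ask AWS for a small Spark cluster; OS setup, Hadoop/Spark installation,
    # and networking are all handled by the managed service.
    response = emr.run_job_flow(
        Name="ephemeral-analytics-cluster",        # placeholder name
        ReleaseLabel="emr-6.15.0",                 # EMR software bundle version
        Applications=[{"Name": "Spark"}],          # software to install on the cluster
        Instances={
            "MasterInstanceType": "m5.xlarge",     # what type of machines we want
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,                    # how many virtual machines we need
            "KeepJobFlowAliveWhenNoSteps": False,  # dispose of the cluster when done
        },
        LogUri="s3://my-bucket/emr-logs/",         # placeholder bucket
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])

Because the cluster tears itself down when its steps finish, you pay only for the minutes of compute the job actually used.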

1.5 Cloud, data lakes, and data warehouses: The emergence of cloud data platforms

The argument for a data lake is tied to the dramatic increases in variety, volume, and velocity of today’s analytic data, along with the limitations of traditional data warehouses to accommodate these increases. We’ve described how a data warehouse alone struggles to cost-effectively accommodate the variety of data that IT must make available. It’s also more expensive and complicated to store and process these growing volumes and velocities of data in a data warehouse, instead of in a combination of a data lake and a data warehouse.

A data lake easily and cost-effectively handles an almost unlimited variety, volume, and velocity of data. The caveat is that it’s not usually organized in a way that’s useful to most users—business users in particular. Much of the data in a data lake is also ungoverned, which presents other challenges. It may be that in the future a modern data lake will completely replace the data warehouse, but for now, based on what we see in all our customer environments, a data lake is almost always coupled with a data warehouse. The data warehouse serves as the primary governed data consumption point for business users, while direct user access to the largely ungoverned data in a data lake is typically reserved for data exploration either by advanced users, such as data scientists, or other systems.


Until recently, the data warehouse and/or associated ETL tools were where the majority of data processing took place. But today that processing can occur in the data lake itself, moving performance-impacting processing from the more expensive data warehouse to the less expensive data lake. This also provides for new forms of processing, such as streaming, as well as the more traditional batch processing supported by data warehouses.

While the distinction between a data lake and data warehouse continues to blur, they each have distinct roles to play in the design of a modern analytics platform. There are many good reasons to consider a data lake in addition to a cloud data warehouse instead of simply choosing one or the other. A data lake can help balance your users’ desire for immediate access to all the data against the organization’s need to ensure data is properly governed in the warehouse.

The bottom line is that the combination of new processing technologies available in the cloud, a cloud data warehouse, and a cloud data lake enables you to take better advantage of the modularity, flexibility, and elasticity offered in the cloud to meet the needs of the broadest number of use cases. The resulting solution is a modern data platform: cost effective, flexible, and capable of ingesting, integrating, transforming, and managing all the V’s to facilitate analytics outcomes.

The resulting analytics data platform can be far more capable than anything the data center can possibly provide. Designing a cloud data platform to take advantage of new technologies and cloud services to address the needs of the new data consumers is the subject of this book.

1.6 Building blocks of a cloud data platform

The purpose of a data platform is to ingest, store, process, and make data available for analysis no matter which type of data comes in—and in the most cost-efficient manner possible. To achieve this, well-designed data platforms use a loosely coupled architecture where each layer is responsible for a specific function and interacts with other layers via well-defined APIs. The foundational building blocks of a data platform are the ingestion, storage, processing, and serving layers, as illustrated in figure 1.4.


Figure 1.4 Well-designed data platforms use a loosely coupled architecture where each layer is responsible for a specific function.

1.6.1 Ingestion layer

The ingestion layer is all about getting data into the data platform. It’s responsible for reaching out to various data sources such as relational or NoSQL databases, file storage, or internal or third-party APIs, and extracting data from them. With the proliferation of different data sources that organizations want to feed their analytics, this layer must be very flexible. To this end, the ingestion layer is often implemented using a variety of open source or commercial tools, each specialized to a specific data type.

One of the most important characteristics of a data platform’s ingestion layer is that this layer should not modify or transform incoming data in any way. This is to make sure that the raw, unprocessed data is always available in the lake for data lineage tracking and reprocessing.

1.6.2 Storage layer

Once we’ve acquired the data from the source, it must be stored. This is where data lake storage comes into play. An important characteristic of a data lake storage system is that it must be scalable and inexpensive, so as to accommodate the vast amounts and velocity of data being produced today. The scalability requirement is also driven by the need to store all incoming data in its raw format, as well as the results of the different data transformations or experiments that data lake users apply to the data.

A standard way to obtain scalable storage in a data center is to use a large disk array or Network-Attached Storage. These enterprise-level solutions provide access to large volumes of storage, but have two key drawbacks: they’re usually expensive, and they typically come with a predefined capacity. This means you must buy more devices to get more storage.

Given these factors, it’s not surprising that flexible storage was one of the first services offered by cloud vendors. Cloud storage doesn’t impose any restrictions on the types of files you can upload—you’ve got free rein to bring in text files like CSV or JSON and binary files like Avro, Parquet, images, or video. Just about anything can be stored in the data lake. This ability to store any file format is an important foundation of a data lake because it allows you to store raw, unprocessed data and delay its processing until later.
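As a small illustration of this format-agnostic storage (the bucket and file names below are hypothetical), an object store accepts any file as an opaque blob:

    import boto3

    s3 = boto3.client("s3")

    # The lake doesn't care about format: CSV, JSON, and binary video all land
    # as-is, with any processing deferred until later.
    for local_file in ["events.csv", "profiles.json", "promo_video.mp4"]:
        s3.upload_file(local_file, "my-data-lake-raw", f"landing/{local_file}")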

For users who have worked with Network-Attached Storage or the Hadoop Distributed File System (HDFS), cloud storage may look and feel very similar to one of those systems. But there are some important differences:

• Cloud storage is fully managed by a cloud provider. This means you don’t need to worry about maintenance, software or hardware upgrades, etc.

• Cloud storage is elastic. Cloud vendors will only allocate the amount of storage you need, growing or shrinking the volume as requirements dictate. You no longer need to overprovision storage system capacity in anticipation of future demand.

• You only pay for the capacity you use.

• There are no compute resources directly associated with cloud storage. From an end-user perspective, there are no virtual machines attached to cloud storage—this means large volumes of data can be stored without having to take on idle compute capacity. When the time comes to process the data, you can easily provision the required compute resources on demand.

Today, every major cloud provider offers a cloud storage service—and for good reason. As data flows through the data lake, cloud storage becomes a central component: raw data is stored in cloud storage and awaits processing, the processing layer saves the results back to cloud storage, and users access either raw or processed data in an ad hoc fashion.

1.6.3 Processing layer

After data has been saved to cloud storage in its original form, it can now be processed to make it more useful. The processing of data is arguably the most interesting part of building a data lake. While the data lake’s design makes it possible to perform analysis directly on the raw data, this may not be the most productive and efficient method. Usually, data is transformed to some degree to make it more user-friendly for analysts, data scientists, and others.

There are several technologies and frameworks available for implementing a processing layer in the cloud data lake, unlike traditional data warehouses, which typically limited you to a SQL engine provided by your database vendor. However, while SQL is a great query language, it is not a particularly robust programming language. For example, it’s difficult to extract common data-cleaning steps into a separate, reusable library in pure SQL, simply because it lacks many of the abstraction and modularity features of modern programming languages such as Java, Scala, or Python. SQL also doesn’t support unit or integration testing, and it’s very difficult to iterate on data transformation or data-cleaning code without good test coverage. Despite these limitations, SQL is still widely used in data lakes for analyzing data, and in fact many of the data service components provide a SQL interface.
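To illustrate the contrast, here is a sketch of the kind of reusable, unit-testable cleaning step that a general-purpose language makes easy. The function and column names are invented for this example, and PySpark is assumed as the processing framework:

    from pyspark.sql import DataFrame, SparkSession
    from pyspark.sql import functions as F

    def standardize_emails(df: DataFrame, column: str = "email") -> DataFrame:
        # A cleaning step packaged once as a library function and reused across
        # pipelines -- hard to do in pure SQL.
        return df.withColumn(column, F.lower(F.trim(F.col(column))))

    def test_standardize_emails():
        # A unit test for the transformation -- something SQL has no native
        # support for.
        spark = SparkSession.builder.master("local[1]").getOrCreate()
        df = spark.createDataFrame([("  User@Example.COM ",)], ["email"])
        assert standardize_emails(df).first()["email"] == "user@example.com"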

Another limitation of SQL—in this case, not the language itself, but its implementation in RDBMSs—is that all data processing must happen inside the database engine. This limits the amount of computational resources available for data processing tasks to however many CPUs, RAM, and disks are available in a single database server. Even if you’re not processing extremely large data volumes, you may need to process the same data multiple times to satisfy different data transformation or data governance requirements. Having a data processing framework that can scale to handle any amount of data, along with cloud compute resources you can tap into anytime, makes solving this problem possible.

Several data processing frameworks have been developed that combine scalability with support for modern programming languages and integrate well into the overall cloud paradigm. Most notable among these are

• Apache Spark
• Apache Beam
• Apache Flink

There are other, more specialized frameworks out there, but this book will focus on these three. At a high level, each one allows you to write data transformation, validation, or cleaning tasks using one of the modern programming languages (usually Java, Scala, or Python). These frameworks then read the data from scalable cloud storage, split it into smaller chunks (if the data volume requires it), and finally process these chunks using flexible cloud compute resources.
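A minimal PySpark sketch of this read-process-write pattern follows; the bucket paths and column names are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("clickstream-cleanup").getOrCreate()

    # Spark reads directly from scalable cloud storage and splits the work
    # across however many executors the cluster provides.
    raw = spark.read.json("s3://my-data-lake-raw/clickstream/2024/*.json")

    # A simple validation and deduplication pass; results go back to cloud storage.
    cleaned = raw.filter(F.col("url").isNotNull()).dropDuplicates(["event_id"])
    cleaned.write.mode("overwrite").parquet("s3://my-data-lake-staging/clickstream/")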

It’s also important, when thinking about data processing in the data lake, to keep in mind the distinction between batch and stream processing. Figure 1.5 shows that the ingestion layer saves data to cloud storage, with the processing layer reading data from this storage and saving results back to it.


Figure 1.5 Processing differs between batch and streaming data.

This approach works very well for batch processing because while cloud storage is inexpensive and scalable, it’s not particularly fast; reading and writing data can take minutes even for moderate volumes of data. More and more use cases now require significantly lower processing times (seconds or less) and are generally solved with stream-based data processing. In this case, also shown in the preceding diagram, the ingestion layer must bypass cloud storage and send data directly to the processing layer. Cloud storage is then used as an archive where data is periodically dumped but isn’t used when processing all that streaming data.

Processing data in the data platform typically includes several distinct steps, including schema management, data validation, data cleaning, and the production of data products. We’ll cover these steps in greater detail in chapter 5.

1.6.4 Serving layer

The goal of the serving layer is to prepare data for consumption by end users, be they people or other systems. The increasing demand from a variety of users in most organizations who need faster access to more data is a huge IT challenge, in that these users often have different (or even no) technology backgrounds. They also typically have different preferences as to which tools they want to use to access and analyze data.

Business users often want access to reports and dashboards with rich self-service capabilities. The popularity of this use case is such that when we talk about data platforms, we almost always design them to include a data warehouse.

Power users and analysts want to run ad hoc SQL queries and get responses in seconds. Data scientists and developers want to use the programming languages they’re most comfortable with to prototype new data transformations or build machine learning models and share the results with other team members. Ultimately, you’ll typically have to use different, specialized technologies for different access tasks. But the good news is that the cloud makes it easy for them to coexist in a single architecture. For example, for fast SQL access, you can load data from the lake into a cloud data warehouse.
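As one hedged example of that lake-to-warehouse hop, Google’s BigQuery Python client can load Parquet files straight from cloud storage; the bucket, dataset, and table names here are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Copy processed files from the lake into a warehouse table,
    # where business users get fast SQL access.
    job = client.load_table_from_uri(
        "gs://my-data-lake-staging/clickstream/*.parquet",
        "analytics.clickstream",  # dataset.table (placeholder)
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.PARQUET
        ),
    )
    job.result()  # wait for the load job to finish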

To provide data lake access to other applications, you can load data from the lake into a fast key/value or document store and point the application to that. And for data science and engineering teams, a cloud data lake provides an environment where they can work with the data directly in cloud storage by using a processing framework such as Spark, Beam, or Flink. Some cloud vendors also support managed notebook environments such as Jupyter Notebook or Apache Zeppelin. Teams can use these notebooks to build a collaborative environment where they can share the results of their experiments along with performing code reviews and other activities.

The main benefit of the cloud, in this case, is that several of these technologies are offered as platform as a service (PaaS), which shifts the operations and support of these functions to the cloud provider. Many of these services are also offered through a pay-as-you-go pricing model, making them more accessible for organizations of any size.

1.7 How the cloud data platform deals with the three V’s

The following sections explain how variety, volume, and velocity work with cloud platforms.

1.7.1 Variety

A cloud data platform is well positioned to adapt to all this data variety because of its layered design. The data platform’s ingestion layer can be implemented as a collection of tools, each dealing with a specific source system or data type. Or it can be implemented as a single ingestion application with a plug-and-play design that allows you to add and remove support for different source systems as required; Kafka Connect and Apache NiFi are examples of plug-and-play ingestion layers that adapt to different data types. At the storage layer, cloud storage can accept data in any format because it’s a generic file system—meaning you can store JSON, CSV, video, audio, or any other data type. There are no data type limits associated with cloud storage, which means you can introduce new types of data easily.
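For instance, with a Kafka Connect–based ingestion layer, adding a new source is mostly configuration rather than code. The sketch below registers a hypothetical JDBC source through Connect’s REST API; the host, database URL, and connector options are assumptions for illustration:

    import requests

    # Registering a new source is a configuration change: Connect loads the
    # plug-in named in connector.class and starts pulling data into Kafka topics.
    connector = {
        "name": "marketing-mysql-source",  # placeholder connector name
        "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
            "connection.url": "jdbc:mysql://mysql:3306/marketing",
            "mode": "incrementing",
            "incrementing.column.name": "id",
            "topic.prefix": "marketing.",
        },
    }

    resp = requests.post(
        "http://connect:8083/connectors",  # Connect REST endpoint (assumed host)
        json=connector,
    )
    resp.raise_for_status()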

Finally, using a modern data processing framework such as Apache Spark or Beam means you’re no longer confined by the limitations of the SQL programming language. Unlike SQL, in Spark you can easily use existing libraries for parsing and processing popular file formats or implement a parser yourself if there’s no support for it today.

1.7.2 Volume

The cloud provides tools that can store, process, and analyze lots of data without a large, upfront investment in hardware, software, and support. The separation of storage and compute and the pay-as-you-use pricing of the cloud data platform make handling large data volumes in the cloud easier and less expensive. Cloud storage is elastic—the amount of storage grows and shrinks as you need it—and the many tiers of pricing for different types of storage (both hot and cold) mean you pay only for what you need in terms of both capacity and accessibility.

On the compute side, processing large volumes of data is also best done in the cloud and outside the data warehouse. You’ll likely need a lot of compute capacity to clean and validate all this data, and it’s unlikely you’ll be running these jobs continuously, so you can take advantage of the elasticity of the cloud to provision a required cluster on demand and destroy it after processing is complete. By running these jobs in the data platform but outside the data warehouse, you also won’t negatively impact the performance of the data warehouse for users, and you might also save a substantial amount of money because the processing will use data from less expensive storage.

While cloud storage is almost always the least expensive way to store raw data, processed data in a data warehouse is the de facto standard for business users, and the same elasticity applies to cloud data warehouses offered by Google, AWS, and Microsoft. Cloud data warehouse services such as Google BigQuery, AWS Redshift, and Azure Synapse either provide an easy way to scale warehouse capacity up and down on demand or, like Google BigQuery, introduce the concept of paying only for the resources a particular query has consumed. With cloud data lakes, processing large volumes of data is available to budgets of almost any size. These cloud data warehouses couple on-demand scaling with an almost endless array of pricing options that can fit any budget.

1.7.3 Velocity

Think about running a predictive model to recommend a next best offer (NBO) to a user on your website. A cloud data lake allows the incorporation of streaming data ingestion and analytics alongside more traditional business intelligence needs such as dashboards and reporting. Most modern data processing frameworks have robust support for real-time processing, allowing you to bypass the relatively slow cloud storage layer and have your ingestion layer send streaming data directly to the processing layer.

With elastic cloud compute resources, there’s no longer any need to share real-time workloads with your batch workloads—you can have dedicated processing clusters for each use case, or even for different jobs, if needed. The processing layer can then send data to different destinations: to a fast key/value store to be consumed by an application, to cloud storage for archiving purposes, or to a cloud warehouse for reporting and ad hoc analytics.
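A short Spark Structured Streaming sketch of such a dedicated streaming job follows; the broker address, topic, and storage paths are assumptions, and this particular query uses the archive-to-cloud-storage destination:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("realtime-events").getOrCreate()

    # The ingestion layer (Kafka here) feeds the processing layer directly,
    # bypassing the slower cloud storage path.
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "web-events")
        .load()
    )

    # One destination per need: this sink archives raw events to cloud storage;
    # sibling queries could feed a key/value store or a warehouse instead.
    query = (
        events.writeStream.format("parquet")
        .option("path", "s3://my-data-lake-raw/web-events/")
        .option("checkpointLocation", "s3://my-data-lake-raw/checkpoints/web-events/")
        .start()
    )
    query.awaitTermination()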


When data scientists become users of your data systems, volume and variety challenges come into play all at once. Machine learning models love data—lots and lots of it (i.e., volume). Models developed by data scientists usually require access not just to the organized, curated data in the data warehouse, but also to the raw, source-file data of all types that’s typically not brought into the data warehouse (i.e., variety). Their models are compute intensive, and when run against data in a data warehouse, they put enormous performance pressure on the system. With current data warehouse architectures, these models often take hours or even days to run. They also impact warehouse performance for all other users while they’re running. Giving data scientists access to the high-volume, high-variety data in the data lake makes everyone happier and will likely keep costs lower.

1.7.4 Two more V’s

Veracity and value are two other V’s that should factor into your choice of a data platform over just a data warehouse. Turning data into value only happens when your data users, be they people, models, or other systems, get timely access to data and use it effectively.

The beauty of a data lake is that you can now give people access to more data. The downside, though, is that you’re also providing access to data that’s not necessarily as clean, organized, and well governed as it tends to be in a data warehouse. The veracity, or correctness, of the data is a major consideration of any big data project, and while data governance is a topic big enough for a book of its own, many big data projects balance the need for data governance (ensuring veracity) with the need for access to more data to drive value. This can be accomplished by using the data platform not just as a source of raw data to produce governed data sets for the data warehouse, but as an ungoverned or lightly governed data repository where users can explore the entirety of their data, knowing that it hasn’t yet been blessed for corporate reporting. We increasingly see data governance as an iterative, more Agile process when data lakes are involved—where once the exploratory phase is complete and models appear to produce a good output, the data moves into the data warehouse, where it becomes part of a governed dataset.

1.8 Common use cases

Understanding the various use cases for data platforms is important when you design and plan your own data platform. Without this context, you risk winding up with a data swamp that doesn’t actually deliver real business value.

One of the most common data platform use cases is driven by the need for a 360-degree view of an organization’s customers. Customers engage with or talk about organizations in many ways, using many different systems, from social media to e-commerce to online chats to call center conversations and more. The data from these engagements is both structured and unstructured, originates from many different sources, and is of varying degrees of quality, volume, and velocity. Integrating all these touchpoints into a single view of the customer opens the doors to a plethora of improved business outcomes—an improved customer experience as customers interact with different parts of the business, better personalization in marketing, dynamic pricing, reduced churn, improved cross-selling, and much more.


A second common use case for a data lake is IoT, where data from machines and sensors can be combined to create operational insights and efficiencies, ranging from proactively predicting equipment failures on a factory floor or in the field to monitoring skier performance and location via RFID tags. The data from sensors tends to be very high volume and comes with a high degree of uncertainty, which makes it well suited for a data lake. Traditional data warehouses don’t just struggle with this data; the sheer amount of data produced in an IoT use case makes a traditional data warehouse–based project extremely expensive and shifts the balance away from a good return on that investment for all but a small number of cases.

The emergence of advanced analytics using machine learning and AI has also driven the adoption of data lakes, because these techniques require the processing of large data sets—often much larger than can be cost-effectively stored or processed in a data warehouse. In this regard, the data lake, with its ability to cost-effectively store almost unlimited amounts of raw data, is a data scientist’s dream come true. And the ability to process that data without impacting performance for other analytic consumers is another big benefit.

Summary

• Pressure from the business to get more accurate insights faster, cheaper, and with an increasing level of confidence is growing along with the volume, velocity, and variety of data needed to produce these insights. All of this is putting immense pressure on traditional data warehouses and paving the way for the emergence of a new solution.

• Traditional data warehouses or data lakes alone can’t meet today’s rapidly changing data requirements, but when combined with new cloud services and processing frameworks available only in the cloud, they create a powerful and flexible analytics data platform that addresses a wide range of use cases and data consumers.

• Data platform design revolves around concepts of flexibility and cost effectiveness. Public cloud, with its on-demand storage, compute resource provisioning, and pay-per-usage pricing model, fits the data platform design perfectly.

• Well-designed data platforms use a loosely coupled architecture where each layer is responsible for a specific function. The foundational building blocks of a data platform are individual layers designed for ingestion, storage, processing, and serving.

• Key use cases for data platforms include a 360-degree view of an organization’s customers, IoT, and machine learning.

2 Why a data platform and not just a data warehouse

This chapter covers


Answering “Why a data platform?” and “Why build in the cloud?”

Comparing data platform to data warehouse–only solutions

Processing differences in structured and semi-structured data

Comparing cloud costs for data warehouse and data platform

We’ve covered what a data platform is (as a reminder, it is “a cloud-native platform capable of cost-effectively ingesting, integrating, transforming, and managing an almost unlimited amount of data of any type in order to facilitate analytics outcomes”), what drives the need for a data platform, and how changes in data will shape your data platform. Now we will explore in more detail why a cloud data platform provides more capabilities than a data warehouse–only architecture. In this chapter, we will equip you with the knowledge necessary to make a solid argument for a data platform and will walk you through several examples demonstrating the difference between the two approaches (data warehouse–only and data platform).

While designing the best data platform is what we want you to be able to do when you’ve finished this book, we also know from experience that knowing the “why” behind your data platform project will not only help you make better decisions along the way, but will also prepare you to justify why it makes sense from a business perspective to embark on a cloud data platform project. This chapter will equip you with solid business and technical reasons for choosing a data platform so you’ll be ready for that moment when someone asks, “Why are you doing this?”

We’ll use a simple but common analytics use case to demonstrate how a data warehouse solution compares to a data platform example, introducing some of the key differences between the two options.

We will start by describing two potential architectures: one that is centered around a cloud data warehouse only and another that uses broader design principles to define a data platform. Then we will walk through examples showing how to load and work with the data in both solutions. We will specifically focus on what happens to the data platform pipelines when there are changes to the source data structure and look at how data platform architecture can help you analyze semistructured data at scale. Because similar outcomes can be achieved by directly ingesting data into the cloud data warehouse, we’ll also walk through loading and working with the same data in a data warehouse alone.

We’ll also explore the main differences between delivering and analyzing data in the traditional warehouse versus a data platform environment. You will see how each solution deals with changes to the source schema and how they work with large volumes of semistructured data, such as JSON. We will also compare the cost and performance characteristics of each.

By the end of this chapter, you’ll understand how a data platform compares to a data warehouse in meeting the same business goals.


2.1 Cloud data platforms and cloud data warehouses: The practical aspects

In this section, we’ll use an example cloud analytics challenge to illustrate the differences between data platform and data warehouse architectures. We’ll also introduce Azure as the cloud platform for this example and describe an Azure cloud data warehouse and an Azure data platform architecture.

Imagine that we’ve been tasked with building a small reporting solution for our organization. The Marketing department in our organization has data from their email marketing campaigns that is stored in a relational database—let’s assume it’s a MySQL RDBMS for this scenario. They also have clickstream data that captures all website user activity, which is stored in a CSV file available to us via an internal SFTP server.

Our main analytical goal is to combine campaign data with the clickstream data to identify users who landed on our website using links from specific email marketing campaigns and the parts of the site they visited. In our example, Marketing wants this analysis done repeatedly, so our data pipelines must bring data into our cloud environment on a regular basis and must be resilient to changes in the upstream source. We would also like our solution to be performant and cost efficient. As usual, the deadline for this is “yesterday.”

To illustrate the differences between a data platform approach to dealing with the first three V’s of data (volume, variety, and velocity) versus a more traditional warehouse approach, let’s consider two simplified implementations: (1) a data platform with a data warehouse and (2) a traditional data warehouse. We will use Azure as our cloud platform of choice for these examples. We could equally have used similar examples on AWS or Google Cloud Platform, but Azure allows us to easily emulate a traditional warehouse with Azure Synapse. Azure Synapse is a fully managed and scalable warehousing solution from Microsoft, based on the very popular MS SQL Server database engine. This way, one of our example architectures will be very close to what you might see in an on-premises data warehouse setup.

2.1.1 A closer look at the data sources

Our simplified email marketing campaign data consists of a single table, as shown in figure 2.1. The table includes the unique identifier of the campaign (campaign_id), a list of target email addresses the campaign was sent to (email), a unique code included in a link to our website for each specific user (unique_code), and the date when the campaign was sent (send_date). Real marketing automation systems are, of course, more complex and include many different tables, but this is good enough for our purpose.


Figure 2.1 Example marketing campaign table
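To make the table’s shape concrete, here is a quick sketch using Python’s built-in sqlite3 module. The column names follow the description above, while the table name and types are illustrative stand-ins for the real MySQL schema:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE email_campaigns (
            campaign_id INTEGER,  -- unique identifier of the campaign
            email       TEXT,     -- target email address the campaign was sent to
            unique_code TEXT,     -- per-user code embedded in links to our website
            send_date   DATE      -- date the campaign was sent
        )
        """
    )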

Clickstream data, with its lack of a fixed schema, is semistructured data derived from web application logs and includes details about the pages visited, information about the visitor’s browser and operating system, session identifiers, etc.

NOTE In general terms, semistructured data is any data that doesn’t fit nicely into a relational model. This means that semistructured data cannot be represented as a flat table with columns and rows, where each cell contains a value of a certain type: integers, dates, strings, etc. JSON documents are a common example of semistructured data.

These logs will look different depending on the application that was used to generate them. For our example, we will use a simplified representation of the clickstream log, shown in figure 2.2, that only includes the details we need for our use case.

Figure 2.2 Example clickstream data

For our scenario, we will assume that the clickstream log is a large (hundreds of GBs) text file in CSV format that includes three columns: a UNIX timestamp of when the event happened (timestamp); a content column containing the details about the page URL, unique visitor identifier, and browser info (content_json); and other details (other_details). One thing that you will notice about this example is that the content_json column in our hypothetical CSV file is a JSON document with many nested fields. This is a common layout for this type of data and will require extra steps to process.

Our task, illustrated in figure 2.3, is to design a cloud data platform capable of integrating these two data sources in a performant and cost-efficient manner and to make this integrated data available to the Marketing team for analysis.


Figure 2.3 Cloud analytics platform problem statement

Our goal here is not only to describe two different cloud architectures, but to highlight the important differences between the two, focusing on what happens to the data pipelines when there are changes to the source data structure and how the two architectures analyze semistructured data at scale. In the next section, we will walk you through the solution to our analytics problem using a cloud data warehouse architecture and then a cloud data platform architecture.

2.1.2 An example cloud data warehouse–only architecture

A cloud data warehouse architecture is quite similar to a traditional enterprise data warehouse solution. Figure 2.4 shows how the center of this architecture is a relational data warehouse that is responsible for storing, processing, and serving data to the end users. There is also an extract, transform, load (ETL) process that loads the data from the sources (clickstream data via CSV files and email campaign data from a MySQL database) into the warehouse.


Figure 2.4 Example cloud data warehouse-only architecture on Azure.

Our example cloud data warehouse–only architecture consists of two PaaS services running on Azure: Azure Data Factory and Azure SQL Data Warehouse (Azure Synapse).

NOTE In all of our examples we will use platform-as-a-service (PaaS) offerings where possible. PaaS enables one of the most powerful promises of the cloud—getting platforms up and running in minutes, not days.

Azure Data Factory is a fully managed PaaS ETL service that allows you to create pipelines by connecting to various data sources, ingesting data, performing basic transformations such as uncompressing files or changing file formats, and loading data into a target system for processing and serving. In our example cloud data warehouse–only architecture, we will use Data Factory to read email campaign data from a MySQL table as well as to fetch files containing clickstream data from an SFTP server. We will also use Data Factory to load data into Azure Synapse.

Our example warehouse, Azure Synapse, is a fully managed warehouse service based on MS SQL Server technology. Fully managed in this case means that you don’t need to install, configure, and manage the database server yourself. Instead, you need only choose how much computational and storage capacity you require, and Azure will take care of the rest. While there are certain limitations to fully managed PaaS offerings such as Azure Synapse and Azure Data Factory, they make it very easy for people who are not MS SQL Server experts to implement architectures such as our cloud data warehouse–only architecture and to program relatively complex pipelines quickly.


In the next section, we will describe an alternative architecture—a cloud data platform design that provides more flexibility than a data warehouse design.

2.1.3 An example cloud data platform architecture

A cloud data platform architecture is inspired by the concept of a data lake and is in fact a combination of a data lake and a data warehouse created for the age of the cloud. A cloud data platform consists of several layers, each responsible for a particular aspect of the data pipeline: ingestion, storage, processing, and serving. Let’s look at our example cloud data platform architecture, which can be seen in figure 2.5.

Figure 2.5 Example cloud data platform architecture

Our cloud data platform architecture consists of these Azure PaaS services:

- Azure Data Factory
- Azure Blob Storage
- Azure Synapse
- Azure Databricks

While the two architectures may look similar (both use Azure Data Factory for ingestion and both use Azure Synapse for serving), there are several key differences between a cloud data warehouse–only architecture and a cloud data platform architecture. In the cloud data platform, we still use Azure Data Factory to connect to the source systems and extract data, but instead of loading it directly into the warehouse, we save the source data into a landing area on Azure Blob Storage (often known as “the lake”). This allows us to preserve the original data format, which helps with data variety challenges and provides other benefits.

Once the data has landed in Azure Blob Storage, we’ll use Apache Spark running on the Azure Databricks managed service (PaaS) to process it. As with all PaaS services, we get simple setup and ongoing management, allowing us to create new Spark clusters without needing to manually install and configure any software. It also provides an easy-to-use notebook environment where you can execute Spark commands against the data in the lake and see the results right away, without having to compile and submit Spark programs to the cluster.
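For example, a notebook cell could run a Spark SQL query directly against the raw files in the lake, without declaring any tables or schemas first. A minimal sketch, assuming the clickstream files were landed under a hypothetical mount point:

-- Count the clickstream events landed so far (path and layout are made up)
SELECT COUNT(*) AS events
FROM csv.`/mnt/landing/clickstream/2021/06/`;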

While Spark and other distributed data processing frameworks can help you process various data formats and almost infinitely large data volumes, these tools aren’t well suited for serving what is sometimes called interactive queries. Interactive in this case means that the query response is usually expected within seconds, not minutes. For these use cases, a well-designed relational warehouse can typically provide faster query performance than Spark. Also, there are many off-the-shelf reporting and BI tools that integrate much better with an RDBMS database than with a distributed system such as Spark, and these tools are easier to use for less technical users.

Exercise 2.1

In our examples in this section, which of the following is the main difference between a cloud data warehouse architecture and a data platform architecture?

1. A data platform uses only serverless technologies.

2. A data platform uses Azure Functions for data ingestion.

3. Data warehouses can connect to the data sources directly to perform the ingestion.

4. A data platform adds a “data lake” layer to offload data processing from the data warehouse.

2.2 Ingesting data

This section covers how to load data into Azure Synapse and an Azure data platform using Azure Data Factory. We’ll also look at what happens to ingestion pipelines when the source schema changes.

Using a managed service like Azure Data Factory makes creating a pipeline to get data into the data platform or data warehouse a relatively easy task. Data Factory comes with a set of built-in connectors to various sources, allows for basic data transformations, and supports saving data to the most popular destinations.


There are, however, fundamental differences between how a data ingestion pipeline works in a cloud data platform versus a cloud data warehouse–only implementation. In this section, we will highlight these differences.

2.2.1 Ingesting data directly into Azure Synapse

An Azure Data Factory pipeline consists of several key components: (1) linked services, (2) input and output data sets, and (3) activities. Figure 2.6 shows how these components work together to load data from a MySQL table into Azure Synapse.

Figure 2.6 Azure Data Factory ingestion pipeline for Azure Synapse

A linked service describes a connection to a specific data source (in this case, MySQL) or data sink (in this case, Azure Synapse). These services include the location of the data source, the credentials used to connect to it, and so on. A data set is a specific object in the connected service: it could be a table in the MySQL database that we want to read or a destination table in Azure Synapse. An important property of a data set is its schema. For Data Factory to be able to load data from MySQL into Azure Synapse, it needs to know the schema of both the source and destination tables. This information is required up front, meaning the schema must be available to the pipeline before the pipeline can be executed. To be more specific, the schema for the input data source can be inferred by Data Factory automatically, but the output schema, and especially the mapping of the input to output schemas, must be provided. The Data Factory UI provides a quick way to fetch schemas from data sources, but if you are building pipeline automation using the Data Factory API to construct pipelines, then you need to provide the schema yourself. In the example in figure 2.6, the MySQL source schema and the Azure Synapse schema are similar, but in other cases this may not be true because of data type mismatches and other differences.

2.2.2 Ingesting data into an Azure data platform

Our data platform architecture takes a different approach to data ingestion (figure 2.7). In a data platform architecture, the primary destination for the data is Azure Blob Storage, which, for this use case, can be thought of as an infinitely scalable filesystem.

Figure 2.7 Azure Data Factory ingestion pipeline for a cloud data platform

The main difference between this ingestion pipeline and the previous one is that with Azure Blob Storage as the destination, Data Factory doesn’t require a schema to be specified up front. In our use case, each ingestion from MySQL is saved as a text file in Azure Blob Storage without concern for the source columns and data types. Our cloud data platform design has an extra data processing layer, implemented using Apache Spark running on the Azure Databricks platform, to convert source text files into more efficient binary formats. This way we can combine the flexibility of text files on Azure Blob Storage with the efficiency of binary formats.
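A minimal sketch of this conversion step in Spark SQL, as it might run on Databricks; the mount points, folder layout, and column names are all assumptions:

-- Expose the landed text files to Spark, declaring the three known columns
CREATE OR REPLACE TEMPORARY VIEW clicks_raw (
    event_timestamp BIGINT,
    content_json    STRING,
    other_details   STRING)
USING CSV
OPTIONS (path '/mnt/landing/clickstream/', header 'false');

-- Rewrite the same data in an efficient binary format for downstream steps
CREATE TABLE clicks_staged
USING PARQUET
LOCATION '/mnt/staging/clickstream/'
AS SELECT * FROM clicks_raw;

The same conversion could equally be written with the Spark DataFrame API; the point is that the schema lives in the processing layer, not in the ingestion pipeline.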

In the data platform design, the fact that you are no longer required to manually provide the output schema and its mapping to the source schema is important for two reasons: (1) a typical relational database can contain hundreds of tables, which translates into a lot of manual effort and an increased chance of errors, and (2) the pipeline is highly resilient to change, the subject of the next section.


2.2.3 Managing changes in upstream data sources

Source datasets are never static. Developers who support our marketing campaign management software will be constantly adding new features. This can result in new columns being added to the source table and/or columns being renamed or deleted. Building data ingestion and processing pipelines that deal with these types of changes is one of the most important tasks for a data architect or a data engineer.

Figure 2.8 Upstream schema changes break our data warehouse ingestion pipeline.

Let’s imagine that a new version of our email marketing software introduced a new column called “country” in the source data set. What will happen to our ingestion pipeline in our data warehouse–only architecture? Figure 2.8 explains.
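On the MySQL side, such an upstream change could be as simple as the following hypothetical statement:

-- Upstream developers add a new column to the source table
ALTER TABLE campaigns ADD COLUMN country VARCHAR(100);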

Our data warehouse–only Data Factory pipeline requires both input and output schemas, so, without intervention, a change in the source schema means our two schemas will be out of sync. Data Factory maps input to output columns by position because it supports column renaming in the output. This means the next time the ingestion pipeline runs, inserts into Azure Synapse will fail, because it will expect an integer unique_code column where a varchar country column will arrive from the source. An operator will need to go and adjust the output data set schema manually and restart the pipeline. We will discuss schema management in great detail in chapter 8.

NOTE Data Factory, like any generic ETL overlay tool, will allow you to copy data from a variety of sources into a variety of destinations. Some of these destinations, such as Azure Blob Storage, don’t care about the input schema, but others, such as databases and Azure Synapse, require a strict schema to be defined up front. While you can alter the destination schema and add new columns, the behavior of this operation will depend on the type of destination. Some databases will lock the full table for the duration of the schema change, making it completely unavailable to end users. In other cases, depending on the data size in the table, a schema adjustment operation can take hours to run. There is no single way to deal with schema changes in the multitude of data destinations in a unified way, so Data Factory and other ETL tools delegate this responsibility to the platform operators.

Figure 2.9 Data platform ingestion pipeline is resilient to upstream schema changes.

Resilience to upstream schema changes is one of the benefits of the data platform architecture over a data warehouse–only approach. As shown in figure 2.9, in a data platform implementation, if the source schema changes, the ingestion pipeline will simply create a new file with the new schema and continue working, because the output destination is a file on Azure Blob Storage and no output schema is required.

While a data platform ingestion pipeline will continue to work as expected if the upstream schema changes, other problems come with source schema changes. At some point along the way, a consumer of the data, either an analytics user or a scheduled job, will need to work with the changed data, and to work with it they will need to know that a new column was added. We will look into approaches to dealing with schema changes in the data platform in chapter 8.

Exercise 2.2


Our example data platform architecture is more resilient to the changes in the upstream data sources because (choose one)

1. It uses Apache Spark for data processing, which utilizes Resilient Distributed Datasets (RDDs).

2. It saves data into Azure Blob Storage first, which doesn’t require a strict schema definition.

3. It uses Azure Blob Storage as a primary data store, which provides extra redundancy.

4. It uses Azure Functions for data ingestion, which provides extra flexibility.

2.3 Processing data

Now that we’ve covered some of the key ideas involved with ingestion, let’s consider processing—specifically, how processing data to answer our analytics problem differs when using SQL running on a data warehouse versus using Apache Spark on a cloud data platform. We’ll discuss the pros and cons of both approaches.

We saw in the previous section that ingesting data into the data warehouse and the data platform requires different approaches. The differences don’t stop there. In this section, we will explore how processing data to answer analytical questions is different in the two systems.

Let’s recall the two data sources we are using as an example use case. First, we have the marketing campaign table in figure 2.10, with the following columns.

Figure 2.10 Example marketing campaigns table

We also have clickstream data from our website that comes to us in the following semistructured format (figure 2.11).

Figure 2.11 Example clickstream data

While we have data split into individual columns, we also have a content_json column that contains a complex JSON document with multiple attributes and nested values.

To demonstrate how the approach to working with such data in the data warehouse and data platform environments differs, let’s consider the following request for information from our Marketing team: When users land on our website from campaign X, what other pages do they visit? Our example data set may be artificial, but this type of request is a common one.

2.3.1 Processing data in the warehouse

Since we are using Azure Synapse as the destination for our data sources in the warehouse-only design, we will need to work with the two relational tables that we loaded using our ingestion process—one for clickstream data and one for email marketing campaign data. Let’s assume the tables in our Azure Synapse are called campaigns and clicks, for campaign information and clickstream data, respectively. The campaigns table is a straightforward mapping of our source table to the destination table, since both are relational in nature; Azure Synapse will contain the same columns. What about clickstream data and its nested JSON documents? There is currently no dedicated data type in Azure Synapse to represent JSON (note that some cloud data warehouses, such as Google BigQuery, have native support for JSON data structures), but you can store JSON documents in standard text columns and use the built-in JSON_VALUE function to access specific attributes inside the document.
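For example, a query that pulls a single nested attribute out of the stored JSON document might look like this; the JSON path is an assumption about the document layout:

-- Extract the page URL from the JSON document stored in a text column
SELECT JSON_VALUE(content_json, '$.page.url') AS page_url
FROM clicks;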

The following listing is an example of an Azure Synapse query that can answer our Marketing request.

Listing 2.1 Azure Synapse query

SELECT
    DISTINCT SUBSTRING(

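A complete query in this spirit, joining the two tables on the unique code, might look as follows. The JSON paths, the SUBSTRING offsets used to pull the unique code out of the referral URL, and the campaign filter are illustrative assumptions, not the book’s exact listing:

SELECT DISTINCT
    JSON_VALUE(cl.content_json, '$.page.url') AS visited_page
FROM clicks AS cl
JOIN campaigns AS ca
    ON ca.unique_code = SUBSTRING(
        JSON_VALUE(cl.content_json, '$.page.referral_url'), 38, 6)
WHERE ca.campaign_id = 123;  -- hypothetical "campaign X"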