The Azure Data Lakehouse Toolkit
Building and Scaling Data Lakehouses on Azure with Delta Lake, Apache Spark, Databricks, Synapse Analytics, and Snowflake
Ron L'Esteve
Copyright © 2022 by Ron L’Esteve
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image, we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Managing Director, Apress Media LLC: Welmoed Spahr
Acquisitions Editor: Jonathan Gennick
Development Editor: Laura Berendson
Coordinating Editor: Jill Balzano
Cover designed by eStudioCalamar
Cover image designed by Freepik (www.freepik.com)
Distributed to the book trade worldwide by Springer Science+Business Media New York, 1 New York Plaza, Suite 4600, New York, NY 10004-1562, USA. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-sbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.
For information on translations, please e-mail booktranslations@springernature.com; for reprint, paperback, or audio rights, please e-mail bookpermissions@springernature.com.
Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Print and eBook Bulk Sales web page at http://www.apress.com/bulk-sales.
Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub via the book's product page, located at www.apress.com/. For more detailed information, please visit http://www.apress.com/source-code.
Printed on acid-free paper
For Cayden and Christina
Contents

Storing and Serving
Part II: Data Platforms
Security and Governance
Delta Tables
Schema Evolution Using Delta Format
Insert Data into Tables
Create and Run a Pipeline
Create Fact Table
Read the Parquet Files to a Data Frame
Part VI: Advanced Capabilities
Advanced Schema Evolution
Load Data to Azure Data Lake Storage Gen2
Verify Python Version in Visual Studio Code Terminal
Set Up Wheel Directory Folders and Files
Install Wheel to Databricks Library
Create Databricks Notebook
Files in Databricks Repos
Implement Other Access and Visibility Controls
Table Access Control
About the Author

Ron L'Esteve is a professional author, trusted technology leader, and digital innovation strategist residing in Chicago, IL, USA. He is well known for his impactful books and award-winning article publications about Azure Data and AI architecture and engineering. He possesses deep technical skills and experience in designing, implementing, and delivering modern Azure Data and AI projects for numerous clients around the world.

Having several Azure Data, AI, and Lakehouse certifications under his belt, Ron has been a trusted and go-to technical advisor for some of the largest and most impactful Azure innovation and transformation initiatives.

Ron is a gifted presenter and trainer, known for his innate ability to clearly articulate and explain complex topics to audiences of all skill levels. He applies a practical and business-oriented approach by taking transformational ideas from concept to scale. He is a true enabler of positive and impactful change by championing a growth mindset.
About the Technical Reviewer

Diego Poggioli is an analytics architect with over ten years of experience managing big data and machine learning projects. At Avanade, he is currently helping clients to identify, document, and translate business strategy and requirements into solutions and services that help them achieve their business outcomes using analytics.

With a passion for software development, cloud, and serverless innovations, he has worked on various product development initiatives spanning IaaS, PaaS, and SaaS. He is also interested in the emerging topics of artificial intelligence, machine learning, and large-scale data processing and analytics. He holds several Microsoft Azure and AWS certifications.

His hobbies are handcrafting with driftwood, playing with his two kids, and learning new technologies. He currently lives in Bologna, Italy.
This book will give you an understanding of these various technologies (Delta Lake, Apache Spark, Databricks, Synapse Analytics, and Snowflake) and how they fit into the modern data and analytics Lakehouse paradigm by supporting the needs of ingestion, processing, storage, and serving, with each playing its own role in the Lakehouse.
The Data Lakehouse paradigm on Azure, which leverages Apache Spark and Delta Lake heavily, has become a popular choice for big data engineering, ELT (extraction, loading, and transformation), AI/ML, real-time data processing, reporting, and querying use cases. In some scenarios of the Lakehouse paradigm, Spark coupled with MPP is great for big data reporting and BI analytics platforms that require heavy querying and performance capabilities, since MPP is based on the traditional RDBMS and brings with it the best features of SQL Server, such as automated query tuning, data shuffling, ease of analytics platform management, even data distribution based on a primary key, and much more. As the Lakehouse matures, specifically with Delta Lake, it begins to demonstrate its capabilities of supporting many critical features, such as ACID (atomicity, consistency, isolation, and durability)-compliant transactions for batch and streaming jobs, data quality enforcement, and highly optimized performance. You will learn about performance improvement in the Lakehouse by using partitioning, indexing, and other tuning options, along with the broader capabilities of Delta Lake.
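To make the partitioning and tuning ideas concrete, here is a minimal PySpark sketch, assuming a Spark session with Delta Lake available (as on Databricks or Synapse Spark pools); the path and column names are illustrative rather than taken from the book's listings. It writes a Delta table partitioned by a date column and then compacts it with OPTIMIZE, which requires a runtime that supports the command (Databricks or newer open source Delta Lake releases).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("delta-partitioning-demo").getOrCreate()

# Small illustrative dataset with a date column to partition on.
df = spark.range(0, 100000).withColumn(
    "order_date", F.expr("date_add(date'2022-01-01', cast(id % 30 as int))")
)

# Writing in Delta format provides ACID guarantees; partitioning by a
# frequently filtered column enables file pruning at query time.
(df.write.format("delta")
   .mode("overwrite")
   .partitionBy("order_date")
   .save("/mnt/lake/sales_delta"))

# Compact many small files into fewer large ones (supported on Databricks
# and in newer open source Delta Lake versions).
spark.sql("OPTIMIZE delta.`/mnt/lake/sales_delta`")
```

Queries that filter on order_date will then read only the matching partitions, which is the essence of the partition-based tuning described above.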
Later chapters also cover advanced topics such as building and installing custom Python libraries and implementing access and visibility controls.
CHAPTER 1
The Data Lakehouse Paradigm

Prior to the introduction of massively parallel processing (MPP) architectures in the early 1990s, the analytics database market was dominated by symmetrical multiprocessing (SMP) architectures.
The modern Data Lakehouse supports varied data processing (text, images, video, and more). Additionally, it offers the capability for large-scale advanced analytics (AI, ML, text/sentiment analysis, and more). Seeing the many benefits that Spark and the modern Data Lakehouse platform have to offer, customers are interested in understanding and getting started with the Data Lakehouse paradigm.
Given the multitude of cloud offerings, which include Amazon Web Services, Microsoft Azure, and Google Cloud Platform, this book focuses on building Lakehouses within the Azure ecosystem.
The Data Lakehouse paradigm on Azure leverages Apache Spark and Delta Lake heavily. Apache Spark is an open source unified analytics engine for large-scale data processing that provides an interface for programming clusters, including data parallelism and fault tolerance. Similarly, Delta Lake is also open source and provides a reliable Data Lake storage layer that runs on top of an existing Data Lake. It provides ACID-compliant transactions, scalable metadata handling, and unified streaming and batch data processing. Coupling Spark with MPP additionally brings traditional RDBMS strengths such as ease of analytics platform management, even data distribution based on a primary key, and much more; this may be a more common scenario while the Lakehouse paradigm is still maturing toward fully optimized performance tuning.
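As a brief sketch of the "unified streaming and batch" point (paths are illustrative; a Spark session with Delta Lake available is assumed), the same Delta table can be consumed either as a static snapshot or as an incremental stream:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-unified-demo").getOrCreate()

path = "/mnt/lake/events_delta"  # illustrative Delta table location

# Batch read: an ACID-consistent snapshot of the table at a point in time.
batch_df = spark.read.format("delta").load(path)
print(batch_df.count())

# Streaming read of the very same table: new commits to the Delta
# transaction log are consumed incrementally as micro-batches.
stream_df = spark.readStream.format("delta").load(path)
query = (stream_df.writeStream
         .format("console")  # demonstration sink; any supported sink works
         .option("checkpointLocation", "/mnt/lake/_checkpoints/events")
         .start())
```

Because both readers go through the same transaction log, neither the batch snapshot nor the stream ever observes partially written data.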
While Spark has traditionally been designed for large-scale data processing, its continued advancement is a hot industry topic because it helps align the Lakehouse paradigm with the best features of the traditional RDBMS.
These services all come together and contribute to realizing the modern Data Lakehouse architecture. Azure's modern resource-based consumption model for PaaS and SaaS services empowers developers, engineers, and end users to use the platforms, tools, and technologies that best serve their needs. That being said, there are many Azure resources that serve various purposes in the Lakehouse architecture. The capabilities of these tools will be covered in greater detail in subsequent sections. From a high level, they serve the purpose of ingesting, storing, processing, serving, and consuming data.
From a compute perspective, Apache Spark is the gold standard for all things Lakehouse. It is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters, and it is prevalent across the Azure platform; for most Lakehouse compute scenarios, it would certainly be a sound option.
With storage being cheap within the Data Lake comes the idea of landing raw data first and transforming it afterward, often through GUI-designed ETL pipelines.
From a data serving and consumption perspective, there are a variety of tools on the Azure platform, including Synapse Analytics Serverless and Dedicated Pools for storage and ad hoc querying, along with Power BI (PBI) for robust reporting and visualization. The Lakehouse, shown in Figure 1-1, also supports deep advanced analytics and AI use cases with Cognitive Services, Azure ML, Databricks ML, and Power BI AI/ML capabilities. Finally, it is critical to build in DevOps best practices within any data platform, and the Lakehouse supports multi-environment automated continuous integration and delivery (CI/CD) best practices using Azure DevOps. All these modern Lakehouse components appear together in Figure 1-1.
Figure 1-1. Lakehouse architectural paradigm
Ingestion and Processing
This initial portion of any data architecture is the ingestion process. Data sources range from on-premises to a variety of cloud sources. There are a few Azure resources that are typically used for the data ingestion process; these include Data Factory, Databricks, and custom functions and connectors. This section will further explore some of these components and their capabilities.
Data Factory (ADF) is, in many respects, the cloud-based successor to the SQL Server Integration Services (SSIS) on-premises toolset. Data Factory supports tight integration with on-premises data sources using the self-hosted Integration Runtime (IR). The IR is the compute infrastructure used by Data Factory and Synapse Pipelines to provide data integration capabilities across different network environments, including on-premises. This capability has positioned this service as a tool of choice for integrating a combination of on-premises and cloud sources using reusable and parameterized pipelines.
Trang 67to provide data integration capabilities, such as data flows and data
movement ADF has
the following three IR types:
• Azure Integration Runtime: All patching, scaling, and maintenance of the underlying infrastructure are managed by Microsoft, and the IR can only access data stores and services in public networks.
• Self-hosted Integration Runtime: The infrastructure and hardware are managed by you, and you will need to address all the patching, scaling, and maintenance. The IR can access resources in both public and private networks.
• Azure-SSIS Integration Runtime: VMs running the SSIS engine allow you to natively execute SSIS packages. All the patching, scaling, and maintenance are managed by Microsoft. The IR can access resources in both public and private networks.
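While IRs are usually created through the ADF portal UI, they can also be provisioned programmatically. The following is a hedged sketch using the azure-mgmt-datafactory Python SDK; the subscription, resource group, and factory names are placeholders, and exact model names can vary across SDK versions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    SelfHostedIntegrationRuntime,
)

# Placeholder subscription and resource names for illustration only.
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

ir = IntegrationRuntimeResource(
    properties=SelfHostedIntegrationRuntime(
        description="Self-hosted IR for on-premises sources"
    )
)

# Registers the IR definition with the factory; the on-premises node
# still has to be installed and registered with an authentication key.
client.integration_runtimes.create_or_update(
    "my-resource-group", "my-data-factory", "OnPremIR", ir
)
```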
Data Factory also supports complex transformations with its Mapping Data Flow service, which can be used to build transformations, Slowly Changing Dimensions, and more.
Mapping Data Flows are visually designed data transformations in Azure Data Factory. Data flows allow data engineers to develop data transformation logic without writing code. The resulting data flows are executed as activities within Azure Data Factory pipelines that use scaled-out Apache Spark clusters. We will explore some of the capabilities of Mapping Data Flows in Chapters 11 through 15 for data warehouse ETL using Slowly Changing Dimensions (SCD) Type I, big Data Lake aggregations, incremental upserts, and Delta Lake. There are many other use cases and capabilities of Mapping Data Flows.
Additionally, data can be transformed through Stored Procedure activities and within the regular Copy Data activity of ADF.
There are three different cluster types available in Mapping Data Flows: general purpose, memory optimized, and compute optimized. The following is a description of each (a configuration sketch follows the list):
• General purpose: Use the default general-purpose cluster when you intend to balance performance and cost. This cluster will be ideal for most data flow workloads.
• Memory optimized: Use the more costly per-core memory-optimized clusters if your data flow has many joins and lookups, since they can store more data in memory and will minimize any out-of-memory errors you may get. If you experience out-of-memory errors when executing data flows, switch to a memory-optimized Azure IR configuration.
• Compute optimized: Use the cheaper per-core-priced compute-optimized clusters for non-memory-intensive data transformations such as filtering data or adding derived columns.
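These cluster types surface as data flow compute properties on the Azure IR. The following hedged sketch, again using the azure-mgmt-datafactory package with placeholder names, creates an Azure (managed) IR configured for memory-optimized data flows:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    ManagedIntegrationRuntime,
    IntegrationRuntimeComputeProperties,
    IntegrationRuntimeDataFlowProperties,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Memory-optimized compute suits join- and lookup-heavy flows; core_count
# sizes the Spark cluster and time_to_live keeps it warm between runs.
ir = IntegrationRuntimeResource(
    properties=ManagedIntegrationRuntime(
        compute_properties=IntegrationRuntimeComputeProperties(
            data_flow_properties=IntegrationRuntimeDataFlowProperties(
                compute_type="MemoryOptimized",
                core_count=16,
                time_to_live=10,  # minutes
            )
        )
    )
)

client.integration_runtimes.create_or_update(
    "my-resource-group", "my-data-factory", "DataFlowIR", ir
)
```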
ADF also has a number of built-in and custom activities that integrate with other Azure services, ranging from Databricks, Functions, Logic Apps, Synapse, and more. Data Factory also has connectors for other cloud sources, including Oracle Cloud, Snowflake, and more. Figure 1-2 shows a list of the various activity categories that are supported by ADF; when expanded, each contains a long list of customizable activities for ELT. While Data Factory is typically designed for batch ELT, its robust event-driven scheduling triggers can also support event-driven, real-time ingestion frameworks. Data Factory also securely integrates with Key Vault for secret and credential management. Synapse Pipelines within the Synapse Analytics workspace has a very similar UI to Databricks as it continues to evolve within the Azure analytics ecosystem.
Figure 1-2. ADF activities
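As a hedged sketch of the Key Vault integration mentioned above (all names are placeholders, and an Azure Key Vault linked service named AKVLinkedService is assumed to already exist in the factory), a linked service can resolve its secret from Key Vault instead of storing credentials inline:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource,
    LinkedServiceReference,
    AzureKeyVaultSecretReference,
    AzureSqlDatabaseLinkedService,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# The password is fetched from Key Vault at runtime, so no credential is
# persisted in the Data Factory definition itself.
sql_ls = LinkedServiceResource(
    properties=AzureSqlDatabaseLinkedService(
        connection_string=(
            "Server=myserver.database.windows.net;"
            "Database=sales;User ID=etl_user;"
        ),
        password=AzureKeyVaultSecretReference(
            store=LinkedServiceReference(
                type="LinkedServiceReference",
                reference_name="AKVLinkedService",
            ),
            secret_name="sql-etl-password",
        ),
    )
)

client.linked_services.create_or_update(
    "my-resource-group", "my-data-factory", "AzureSqlLS", sql_ls
)
```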
Data Factory supports over 90 sources and sinks as part of its ingestion and load process. In the subsequent chapters, you will learn how to create pipelines, datasets, linked services, and activities in ADF. The following list shows and defines the various components of a standard ADF pipeline (a pipeline-definition sketch follows the list):
• Pipelines are logical groupings of activities that together perform a task.
• Activities define actions to perform on your data (e.g., Copy Data activity, ForEach loop activity, etc.).
• Datasets are named views of data that simply point to or reference the data you want to use in your activities as inputs and outputs.
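Tying these components together, the following hedged sketch (placeholder names; the two blob datasets and their linked services are assumed to already exist in the factory) defines a pipeline containing a single Copy activity via the Python SDK:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource,
    CopyActivity,
    DatasetReference,
    BlobSource,
    BlobSink,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# A pipeline is a logical grouping of activities; this one holds a single
# Copy activity that moves data between two predefined blob datasets.
copy = CopyActivity(
    name="CopyRawToStaging",
    inputs=[DatasetReference(type="DatasetReference",
                             reference_name="RawBlobDataset")],
    outputs=[DatasetReference(type="DatasetReference",
                              reference_name="StagingBlobDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

pipeline = PipelineResource(activities=[copy])

client.pipelines.create_or_update(
    "my-resource-group", "my-data-factory", "IngestRawData", pipeline
)
```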