The Azure Data Lakehouse Toolkit
Building and Scaling Data Lakehouses on Azure with Delta Lake, Apache Spark, Databricks, Synapse Analytics, and Snowflake
Ron L'Esteve
Copyright © 2022 by Ron L’Esteve
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image, we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Managing Director, Apress Media LLC: Welmoed Spahr
Acquisitions Editor: Jonathan Gennick
Development Editor: Laura Berendson
Coordinating Editor: Jill Balzano
Cover designed by eStudioCalamar
Cover image designed by Freepik (www.freepik.com)
Distributed to the book trade worldwide by Springer Science+Business Media New York, 1 New York Plaza, Suite 4600, New York, NY 10004-1562, USA. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-sbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.
For information on translations, please e-mail booktranslations@springernature.com; for reprint, paperback, or audio rights, please e-mail bookpermissions@springernature.com.
Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Print and eBook Bulk Sales web page at http://www.apress.com/bulk-sales.
Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub via the book's product page, located at www.apress.com/. For more detailed information, please visit http://www.apress.com/source-code.
Printed on acid-free paper
For Cayden and Christina
Contents

Storing and Serving
Part II: Data Platforms
Security and Governance
Delta Tables
Schema Evolution Using Delta Format
Insert Data into Tables
Create and Run a Pipeline
Create Fact Table
Read the Parquet Files to a Data Frame
Part VI: Advanced Capabilities
Advanced Schema Evolution
Load Data to Azure Data Lake Storage Gen2
Verify Python Version in Visual Studio Code Terminal
Set Up Wheel Directory Folders and Files
Install Wheel to Databricks Library
Create Databricks Notebook
Files in Databricks Repos
Implement Other Access and Visibility Controls
Table Access Control
About the Author

Ron L'Esteve is a professional author, trusted technology leader, and digital innovation strategist residing in Chicago, IL, USA. He is well known for his impactful books and award-winning article publications about Azure Data and AI architecture and engineering. He possesses deep technical skills and experience in designing, implementing, and delivering modern Azure Data and AI projects for numerous clients around the world.

Having several Azure Data, AI, and Lakehouse certifications under his belt, Ron has been a trusted and go-to technical advisor for some of the largest and most impactful Azure innovation and transformation initiatives.

Ron is a gifted presenter and trainer, known for his innate ability to clearly articulate and explain complex topics to audiences of all skill levels. He applies a practical and business-oriented approach by taking transformational ideas from concept to scale. He is a true enabler of positive and impactful change by championing a growth mindset.
About the Technical Reviewer

Diego Poggioli is an analytics architect with over ten years of experience managing big data and machine learning projects. At Avanade, he is currently helping clients to identify, document, and translate business strategy and requirements into solutions and services that help them achieve their business outcomes using analytics.

With a passion for software development, cloud, and serverless innovations, he has worked on various product development initiatives spanning IaaS, PaaS, and SaaS. He is also interested in the emerging topics of artificial intelligence, machine learning, and large-scale data processing and analytics. He holds several Microsoft Azure and AWS certifications.

His hobbies are handcrafting with driftwood, playing with his two kids, and learning new technologies. He currently lives in Bologna, Italy.
This book will give you an understanding of these various technologies (Delta Lake, Apache Spark, Databricks, Synapse Analytics, and Snowflake) and how they fit into the modern data and analytics Lakehouse paradigm by supporting the needs of ingestion, processing, storage, and serving, with each playing its own role in the Lakehouse.
The Data Lakehouse paradigm on Azure, which leverages Apache Spark and Delta Lake heavily, has become a popular choice for big data engineering, ELT (extraction, loading, and transformation), AI/ML, real-time data processing, reporting, and querying use cases. In some scenarios of the Lakehouse paradigm, Spark coupled with MPP is great for big data reporting and BI analytics platforms that require heavy querying and performance capabilities, since MPP is based on the traditional RDBMS and brings with it the best features of SQL Server, such as automated query tuning, data shuffling, ease of analytics platform management, even data distribution based on a primary key, and much more. As the Lakehouse matures, specifically with Delta Lake, it begins to demonstrate its capabilities of supporting many critical features, such as ACID (atomicity, consistency, isolation, and durability)-compliant transactions for batch and streaming jobs, data quality enforcement, and highly optimized performance. You will learn about performance improvement in the Lakehouse by using partitioning, indexing, and other tuning options, along with the broader capabilities of Delta Lake.
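To make the partitioning and tuning ideas concrete, here is a minimal PySpark sketch, assuming a Spark session with Delta Lake available (as on Databricks or Synapse Spark pools); the path and column names are illustrative rather than taken from the book's listings. It writes a Delta table partitioned by a date column and then compacts it with OPTIMIZE, which requires a runtime that supports the command (Databricks or newer open source Delta Lake releases).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("delta-partitioning-demo").getOrCreate()

# Small illustrative dataset with a date column to partition on.
df = spark.range(0, 100000).withColumn(
    "order_date", F.expr("date_add(date'2022-01-01', cast(id % 30 as int))")
)

# Writing in Delta format provides ACID guarantees; partitioning by a
# frequently filtered column enables file pruning at query time.
(df.write.format("delta")
   .mode("overwrite")
   .partitionBy("order_date")
   .save("/mnt/lake/sales_delta"))

# Compact many small files into fewer large ones (supported on Databricks
# and in newer open source Delta Lake versions).
spark.sql("OPTIMIZE delta.`/mnt/lake/sales_delta`")
```

Queries that filter on order_date will then read only the matching partitions, which is the essence of the partition-based tuning described above.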
Later chapters also cover advanced topics such as building and installing custom Python libraries and implementing access and visibility controls.
CHAPTER 1
The Data Lakehouse Paradigm

Prior to the introduction of massively parallel processing (MPP) architectures in the early 1990s, the analytics database market was dominated by symmetrical multiprocessing (SMP) architectures.
The modern Data Lakehouse supports varied data processing (text, images, video, and more). Additionally, it offers the capability for large-scale advanced analytics (AI, ML, text/sentiment analysis, and more). Seeing the many benefits that Spark and the modern Data Lakehouse platform have to offer, customers are interested in understanding and getting started with the Data Lakehouse paradigm.
Given the multitude of cloud offerings, which include Amazon Web Services, Microsoft Azure, and Google Cloud Platform, this book focuses on building Lakehouses within the Azure ecosystem.
The Data Lakehouse paradigm on Azure leverages Apache Spark and Delta Lake heavily. Apache Spark is an open source unified analytics engine for large-scale data processing that provides an interface for programming clusters, including data parallelism and fault tolerance. Similarly, Delta Lake is also open source and provides a reliable Data Lake storage layer that runs on top of an existing Data Lake. It provides ACID-compliant transactions, scalable metadata handling, and unified streaming and batch data processing. Coupling Spark with MPP additionally brings traditional RDBMS strengths such as ease of analytics platform management, even data distribution based on a primary key, and much more; this may be a more common scenario while the Lakehouse paradigm is still maturing toward fully optimized performance tuning.
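As a brief sketch of the "unified streaming and batch" point (paths are illustrative; a Spark session with Delta Lake available is assumed), the same Delta table can be consumed either as a static snapshot or as an incremental stream:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-unified-demo").getOrCreate()

path = "/mnt/lake/events_delta"  # illustrative Delta table location

# Batch read: an ACID-consistent snapshot of the table at a point in time.
batch_df = spark.read.format("delta").load(path)
print(batch_df.count())

# Streaming read of the very same table: new commits to the Delta
# transaction log are consumed incrementally as micro-batches.
stream_df = spark.readStream.format("delta").load(path)
query = (stream_df.writeStream
         .format("console")  # demonstration sink; any supported sink works
         .option("checkpointLocation", "/mnt/lake/_checkpoints/events")
         .start())
```

Because both readers go through the same transaction log, neither the batch snapshot nor the stream ever observes partially written data.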
While Spark has traditionally been designed for large-scale data processing, its continued advancement is a hot industry topic because it helps align the Lakehouse paradigm with the best features of the traditional RDBMS.
These services all come together and contribute to realizing the modern Data Lakehouse architecture. Azure's modern resource-based consumption model for PaaS and SaaS services empowers developers, engineers, and end users to use the platforms, tools, and technologies that best serve their needs. That being said, there are many Azure resources that serve various purposes in the Lakehouse architecture. The capabilities of these tools will be covered in greater detail in subsequent sections. From a high level, they serve the purpose of ingesting, storing, processing, serving, and consuming data.
From a compute perspective, Apache Spark is the gold standard for all things Lakehouse. It is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters, and it is prevalent across the Azure platform; for most Lakehouse compute scenarios, it would certainly be a sound option.
With storage being cheap within the Data Lake comes the idea of landing raw data first and transforming it afterward, often through GUI-designed ETL pipelines.
From a data serving and consumption perspective, there are a variety of tools on the Azure platform, including Synapse Analytics Serverless and Dedicated Pools for storage and ad hoc querying, along with Power BI (PBI) for robust reporting and visualization. The Lakehouse, shown in Figure 1-1, also supports deep advanced analytics and AI use cases with Cognitive Services, Azure ML, Databricks ML, and Power BI AI/ML capabilities. Finally, it is critical to build in DevOps best practices within any data platform, and the Lakehouse supports multi-environment automated continuous integration and delivery (CI/CD) best practices using Azure DevOps. All these modern Lakehouse components appear together in Figure 1-1.
Figure 1-1. Lakehouse architectural paradigm
Ingestion and Processing
This initial portion of any data architecture is the ingestion process. Data sources range from on-premises to a variety of cloud sources. There are a few Azure resources that are typically used for the data ingestion process; these include Data Factory, Databricks, and custom functions and connectors. This section will further explore some of these components and their capabilities.
Data Factory (ADF) is, in many respects, the cloud-based successor to the SQL Server Integration Services (SSIS) on-premises toolset. Data Factory supports tight integration with on-premises data sources using the self-hosted Integration Runtime (IR). The IR is the compute infrastructure used by Data Factory and Synapse Pipelines to provide data integration capabilities across different network environments, including on-premises. This capability has positioned this service as a tool of choice for integrating a combination of on-premises and cloud sources using reusable and parameterized pipelines.
Trang 67to provide data integration capabilities, such as data flows and data
movement ADF has
the following three IR types:
• Azure Integration Runtime: All patching, scaling, and maintenance of the underlying infrastructure are managed by Microsoft, and the IR can only access data stores and services in public networks.
• Self-hosted Integration Runtime: The infrastructure and hardware are managed by you, and you will need to address all the patching, scaling, and maintenance. The IR can access resources in both public and private networks.
• Azure-SSIS Integration Runtime: VMs running the SSIS engine allow you to natively execute SSIS packages. All the patching, scaling, and maintenance are managed by Microsoft. The IR can access resources in both public and private networks.
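While IRs are usually created through the ADF portal UI, they can also be provisioned programmatically. The following is a hedged sketch using the azure-mgmt-datafactory Python SDK; the subscription, resource group, and factory names are placeholders, and exact model names can vary across SDK versions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    SelfHostedIntegrationRuntime,
)

# Placeholder subscription and resource names for illustration only.
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

ir = IntegrationRuntimeResource(
    properties=SelfHostedIntegrationRuntime(
        description="Self-hosted IR for on-premises sources"
    )
)

# Registers the IR definition with the factory; the on-premises node
# still has to be installed and registered with an authentication key.
client.integration_runtimes.create_or_update(
    "my-resource-group", "my-data-factory", "OnPremIR", ir
)
```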
Data Factory also supports complex transformations with its Mapping Data Flow service, which can be used to build transformations, Slowly Changing Dimensions, and more.
Mapping Data Flows are visually designed data transformations in Azure Data Factory. Data flows allow data engineers to develop data transformation logic without writing code. The resulting data flows are executed as activities within Azure Data Factory pipelines that use scaled-out Apache Spark clusters. We will explore some of the capabilities of Mapping Data Flows in Chapters 11 through 15 for data warehouse ETL using Slowly Changing Dimensions (SCD) Type I, big Data Lake aggregations, incremental upserts, and Delta Lake. There are many other use cases and capabilities of Mapping Data Flows.
Additionally, data can be transformed through Stored Procedure activities and within the regular Copy Data activity of ADF.
There are three different cluster types available in Mapping Data Flows: general purpose, memory optimized, and compute optimized. The following is a description of each (a configuration sketch follows the list):
• General purpose: Use the default general-purpose cluster when you intend to balance performance and cost. This cluster will be ideal for most data flow workloads.
• Memory optimized: Use the more costly per-core memory-optimized clusters if your data flow has many joins and lookups, since they can store more data in memory and will minimize any out-of-memory errors you may get. If you experience out-of-memory errors when executing data flows, switch to a memory-optimized Azure IR configuration.
• Compute optimized: Use the cheaper per-core-priced compute-optimized clusters for non-memory-intensive data transformations such as filtering data or adding derived columns.
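These cluster types surface as data flow compute properties on the Azure IR. The following hedged sketch, again using the azure-mgmt-datafactory package with placeholder names, creates an Azure (managed) IR configured for memory-optimized data flows:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    ManagedIntegrationRuntime,
    IntegrationRuntimeComputeProperties,
    IntegrationRuntimeDataFlowProperties,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Memory-optimized compute suits join- and lookup-heavy flows; core_count
# sizes the Spark cluster and time_to_live keeps it warm between runs.
ir = IntegrationRuntimeResource(
    properties=ManagedIntegrationRuntime(
        compute_properties=IntegrationRuntimeComputeProperties(
            data_flow_properties=IntegrationRuntimeDataFlowProperties(
                compute_type="MemoryOptimized",
                core_count=16,
                time_to_live=10,  # minutes
            )
        )
    )
)

client.integration_runtimes.create_or_update(
    "my-resource-group", "my-data-factory", "DataFlowIR", ir
)
```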
ADF also has a number of built-in and custom activities that integrate with other Azure services, ranging from Databricks, Functions, Logic Apps, Synapse, and more. Data Factory also has connectors for other cloud sources, including Oracle Cloud, Snowflake, and more. Figure 1-2 shows a list of the various activity categories that are supported by ADF; when expanded, each contains a long list of customizable activities for ELT. While Data Factory is typically designed for batch ELT, its robust event-driven scheduling triggers can also support event-driven, real-time ingestion frameworks. Data Factory also securely integrates with Key Vault for secret and credential management. Synapse Pipelines within the Synapse Analytics workspace has a very similar UI to Databricks as it continues to evolve within the Azure analytics ecosystem.
Figure 1-2. ADF activities
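As a hedged sketch of the Key Vault integration mentioned above (all names are placeholders, and an Azure Key Vault linked service named AKVLinkedService is assumed to already exist in the factory), a linked service can resolve its secret from Key Vault instead of storing credentials inline:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource,
    LinkedServiceReference,
    AzureKeyVaultSecretReference,
    AzureSqlDatabaseLinkedService,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# The password is fetched from Key Vault at runtime, so no credential is
# persisted in the Data Factory definition itself.
sql_ls = LinkedServiceResource(
    properties=AzureSqlDatabaseLinkedService(
        connection_string=(
            "Server=myserver.database.windows.net;"
            "Database=sales;User ID=etl_user;"
        ),
        password=AzureKeyVaultSecretReference(
            store=LinkedServiceReference(
                type="LinkedServiceReference",
                reference_name="AKVLinkedService",
            ),
            secret_name="sql-etl-password",
        ),
    )
)

client.linked_services.create_or_update(
    "my-resource-group", "my-data-factory", "AzureSqlLS", sql_ls
)
```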
Data Factory supports over 90 sources and sinks as part of its ingestion and load process. In the subsequent chapters, you will learn how to create pipelines, datasets, linked services, and activities in ADF. The following list shows and defines the various components of a standard ADF pipeline (a pipeline-definition sketch follows the list):
• Pipelines are logical groupings of activities that together perform a task.
• Activities define actions to perform on your data (e.g., Copy Data activity, ForEach loop activity, etc.).
• Datasets are named views of data that simply point to or reference the data you want to use in your activities as inputs and outputs.
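Tying these components together, the following hedged sketch (placeholder names; the two blob datasets and their linked services are assumed to already exist in the factory) defines a pipeline containing a single Copy activity via the Python SDK:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource,
    CopyActivity,
    DatasetReference,
    BlobSource,
    BlobSink,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# A pipeline is a logical grouping of activities; this one holds a single
# Copy activity that moves data between two predefined blob datasets.
copy = CopyActivity(
    name="CopyRawToStaging",
    inputs=[DatasetReference(type="DatasetReference",
                             reference_name="RawBlobDataset")],
    outputs=[DatasetReference(type="DatasetReference",
                              reference_name="StagingBlobDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

pipeline = PipelineResource(activities=[copy])

client.pipelines.create_or_update(
    "my-resource-group", "my-data-factory", "IngestRawData", pipeline
)
```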