Contents
Chapter 1. A Gentle Introduction to Spark
    What is Apache Spark?
    Spark's Basic Architecture
    Spark Applications
    Using Spark from Scala, Java, SQL, Python, or R
    Key Concepts
    Starting Spark
    SparkSession
    DataFrames
    Partitions
    Transformations
    Lazy Evaluation
    Actions
    Spark UI
    A Basic Transformation Data Flow
    DataFrames and SQL

Chapter 2. Structured API Overview
    Spark's Structured APIs
    DataFrames and Datasets
    Schemas
    Overview of Structured Spark Types
    Columns
    Rows
    Spark Value Types
    Encoders
    Overview of Spark Execution
    Logical Planning
    Physical Planning
    Execution

Chapter 3. Basic Structured Operations
    Chapter Overview
    Schemas
    Columns and Expressions
    Columns
    Expressions
    Records and Rows
    Creating Rows
    DataFrame Transformations
    Creating DataFrames
    Select & SelectExpr
    Converting to Spark Types (Literals)
    Adding Columns
    Renaming Columns
    Reserved Characters and Keywords in Column Names
    Removing Columns
    Changing a Column's Type (cast)
    Filtering Rows
    Getting Unique Rows
    Random Samples
    Random Splits
    Concatenating and Appending Rows to a DataFrame
    Sorting Rows
    Limit
    Repartition and Coalesce
    Collecting Rows to the Driver

Chapter 4. Working with Different Types of Data
    Chapter Overview
    Where to Look for APIs
    Working with Booleans
    Working with Numbers
    Working with Strings
    Regular Expressions
    Working with Dates and Timestamps
    Working with Nulls in Data
    Drop
    Fill
    Replace
    Working with Complex Types
    Structs
    Arrays
    split
    Array Contains
    Explode
    Maps
    Working with JSON
    User-Defined Functions

Chapter 5. Aggregations
    What are aggregations?
    Aggregation Functions
    count
    Count Distinct
    Approximate Count Distinct
    First and Last
    Min and Max
    Sum
    sumDistinct
    Average
    Variance and Standard Deviation
    Skewness and Kurtosis
    Covariance and Correlation
    Aggregating to Complex Types
    Grouping
    Grouping with Expressions
    Grouping with Maps
    Window Functions
    Rollups
    Cube
    Pivot
    User-Defined Aggregation Functions

Chapter 6. Joins
    What is a join?
    Join Expressions
    Join Types
    Inner Joins
    Outer Joins
    Left Outer Joins
    Right Outer Joins
    Left Semi Joins
    Left Anti Joins
    Cross (Cartesian) Joins
    Challenges with Joins
    Joins on Complex Types
    Handling Duplicate Column Names
    How Spark Performs Joins
    Node-to-Node Communication Strategies

Chapter 7. Data Sources
    The Data Source APIs
    Basics of Reading Data
    Basics of Writing Data
    Options
    CSV Files
    CSV Options
    Reading CSV Files
    Writing CSV Files
    JSON Files
    JSON Options
    Reading JSON Files
    Writing JSON Files
    Parquet Files
    Reading Parquet Files
    Writing Parquet Files
    ORC Files
    Reading ORC Files
    Writing ORC Files
    SQL Databases
    Reading from SQL Databases
    Query Pushdown
    Writing to SQL Databases
    Text Files
    Reading Text Files
    Writing Out Text Files
    Advanced IO Concepts
    Reading Data in Parallel
    Writing Data in Parallel
    Writing Complex Types

Chapter 8. Spark SQL
    Spark SQL Concepts
    What is SQL?
    Big Data and SQL: Hive
    Big Data and SQL: Spark SQL
    How to Run Spark SQL Queries
    Spark SQL Thrift JDBC/ODBC Server
    Spark SQL CLI
    Spark's Programmatic SQL Interface
    Tables
    Creating Tables
    Inserting into Tables
    Describing Table Metadata
    Refreshing Table Metadata
    Dropping Tables
    Views
    Creating Views
    Dropping Views
    Databases
    Creating Databases
    Setting the Database
    Dropping Databases
    Select Statements
    Case...When...Then Statements
    Advanced Topics
    Complex Types
    Functions
    Spark Managed Tables
    Subqueries
    Correlated Predicate Subqueries
    Conclusion

Chapter 9. Datasets
    What are Datasets?
    Encoders
    Creating Datasets
    Case Classes
    Actions
    Transformations
    Filtering
    Mapping
    Joins
    Grouping and Aggregations
    When to use Datasets

Chapter 10. Low Level API Overview
    The Low Level APIs
    …

… those that you are likely to succeed with in practice today, and we discuss some of the libraries that make this possible. There are tradeoffs associated with these implementations, but for the most part Spark is not structured for model parallelization, because of its synchronous communication overhead and its immutability. This does not mean that Spark is not used for deep learning workloads; the volume of available libraries proves otherwise. Below is an incomplete list of ways that Spark can be used in conjunction with deep learning. In the following examples we use the terms "small data" and "big data" to distinguish data that can fit on a single node from data that must be distributed. To be clear, "small" here does not mean a few hundred rows; it means many gigabytes of data that can still fit on one machine.

- Distributed training of many deep learning models ("small learning, small data"). Spark can parallelize work efficiently when little communication is required between the nodes. This makes it an excellent tool for performing distributed training of one deep learning model per worker node, where each model might have a different architecture or initialization. Many libraries take advantage of Spark in this way; a minimal sketch of the pattern appears after this list.

- Distributed usage of deep learning models ("small model, big data"). As mentioned in the previous item, Spark makes it extremely easy to parallelize tasks across a large number of machines. One wonderful thing about machine learning research is that many models are available to the public as pretrained deep learning models that you can use without having to perform any training yourself. These can do things like identify humans in an image or translate a Chinese character into an English word or phrase. Spark makes it easy to get immediate value out of these networks by applying them, at scale, to your own data (see the scoring sketch after this list). If you are looking to get started with Spark and deep learning, this is the place to start.

- Large-scale ETL and preprocessing leading to training a deep learning model on a single node ("small learning, big data"). This is often referred to as "learn small with big data." Rather than trying to collect all of your data onto one node right away, you can use Spark to iterate over your entire (distributed) dataset on the driver itself with the toLocalIterator method, sketched after this list. You can, of course, use Spark for feature generation and simply collect the dataset to a single large node as well, but this limits the total data size you can train on.

- Distributed training of a large deep learning model ("big learning, big data"). This use case stretches Spark more than any other. As you saw throughout the book, Spark has its own notions of how to schedule transformations and communication across a cluster. Spark's ability to perform large-scale data manipulation with little overhead at times conflicts with the type of system that can efficiently train a single, massive deep learning model. This is a fruitful area of research, and a number of ongoing projects attempt to bring this functionality to Spark.
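The first pattern might look like the following minimal sketch. It substitutes scikit-learn's MLPClassifier for a real deep learning framework so the example stays self-contained; the digits dataset, the hyperparameter grid, and the app name are illustrative assumptions, not from the book.

```python
from pyspark.sql import SparkSession
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

spark = SparkSession.builder.appName("many-small-models").getOrCreate()
sc = spark.sparkContext

# The full training set fits on one machine, so broadcast it once
# rather than shipping it with every task.
X, y = load_digits(return_X_y=True)
data = sc.broadcast((X, y))

# One candidate architecture/initialization per Spark task.
configs = [{"hidden_layer_sizes": h, "random_state": s}
           for h in [(32,), (64,), (64, 32)]
           for s in range(3)]

def evaluate(config):
    X, y = data.value
    model = MLPClassifier(max_iter=300, **config)
    # Score this configuration with 3-fold cross-validated accuracy.
    return config, cross_val_score(model, X, y, cv=3).mean()

# Each configuration trains independently on a worker; no inter-node
# communication happens while the models train.
results = sc.parallelize(configs, numSlices=len(configs)).map(evaluate).collect()
best_config, best_score = max(results, key=lambda r: r[1])
```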
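The second pattern, scoring a large distributed dataset with a pretrained model, could be sketched as below (Spark 3-style pandas UDF). The parquet paths, the array-typed "features" column, and the joblib artifact are all hypothetical; any picklable object with a predict method would slot in the same way.

```python
import joblib
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pretrained-scoring").getOrCreate()

# Hypothetical large input with an array<double> "features" column,
# and a hypothetical pretrained model saved with joblib.
events = spark.read.parquet("/data/events")
model = joblib.load("/models/pretrained.pkl")
bc_model = spark.sparkContext.broadcast(model)

@pandas_udf("double")
def predict(features: pd.Series) -> pd.Series:
    # Runs on the executors, one batch of rows at a time; the model comes
    # from the broadcast rather than being reloaded from disk per row.
    m = bc_model.value
    preds = m.predict(np.stack(features.to_list()))
    return pd.Series(preds).astype("float64")

scored = events.withColumn("prediction", predict("features"))
scored.write.mode("overwrite").parquet("/data/events_scored")
```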
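For the third pattern, the text names DataFrame's toLocalIterator method, which streams one partition at a time to the driver so only a single partition must fit in driver memory. A rough sketch, assuming hypothetical input paths and an incrementally trainable scikit-learn model standing in for a single-node deep learning framework:

```python
import numpy as np
from pyspark.sql import SparkSession
from sklearn.linear_model import SGDClassifier

spark = SparkSession.builder.appName("learn-small-with-big-data").getOrCreate()

# Distributed ETL/feature generation happens in Spark; only the final
# feature rows ever reach the driver. Paths and columns are hypothetical.
features = spark.read.parquet("/data/raw").selectExpr("features", "label")

model = SGDClassifier(loss="log_loss")  # logistic regression, supports partial_fit
classes = np.array([0, 1])              # assumed binary labels

batch_X, batch_y = [], []
for row in features.toLocalIterator():  # streams partitions lazily to the driver
    batch_X.append(row["features"])
    batch_y.append(row["label"])
    if len(batch_X) == 10_000:           # train in mini-batches
        model.partial_fit(np.array(batch_X), np.array(batch_y), classes=classes)
        batch_X, batch_y = [], []

if batch_X:                              # flush the final partial batch
    model.partial_fit(np.array(batch_X), np.array(batch_y), classes=classes)
```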