This course provides a deep dive into Spark as a framework: its design, how to make the best use of that design, and how to develop effective machine learning applications with Spark on HDInsight.
# About the Course

Spark has become the most popular and perhaps most important distributed data processing framework for Hadoop. It is particularly well suited to machine learning and interactive data workloads, and can provide an order of magnitude greater performance than traditional Hadoop data processing tools. In this course, we will provide a deep dive into Spark as a framework, understand its design and how to use that design optimally, and learn how to develop effective machine learning applications with Spark on HDInsight.

The course covers the fundamentals of Spark, its core APIs and design, relational data processing with Spark SQL, and the fundamentals of Spark job execution, performance tuning, tracking, and debugging. Users will get hands-on experience with processing streaming data with Spark Streaming and training machine learning algorithms with Spark ML and R Server on Spark. The course also covers HDInsight configuration and platform-specific considerations such as remote development and access with Livy and IntelliJ, secure Spark, multi-user notebooks with Zeppelin, and virtual networking with other HDInsight clusters.
# Skills Taught

At the end of the course you will have acquired the following skills:

- Understand Spark's fundamental design and mechanics
- Optimize Spark SQL queries
- Build data pipelines with Spark SQL and DataFrames
- Analyze Spark jobs using Spark UIs and logs
- Create Streaming and Machine Learning jobs

# Prerequisites

There are a few things you will need in order to properly follow the course materials:

- Hadoop - Administration, Configuration, and Security

# Agenda

## Day One

- Spark on HDInsight Overview
    - Spark Clusters on HDInsight
    - Developer Tools and Remote Debugging with IntelliJ IDEA
    - Submitting Spark Jobs Remotely Using Livy
- Spark Fundamentals
    - Functional Programming, Scala and the Collections API
    - Cluster Architecture
    - RDDs - Parallel, Distributed Memory Data Structures
- Spark SQL/DataFrames - Relational Data Processing with Spark
    - Sharing Metastore and Storage Accounts with Hadoop/Hive Clusters and Spark Clusters
    - DataFrames API - Collection of Rows with a Consistent Schema
    - Integrated APIs for Mixing Relational, Graph, and ML Jobs
    - Exploring Relational Data with Spark SQL
    - Catalyst Query Optimization
    - Optimizing Joins in Spark SQL
        - Broadcast Joins versus Merge Joins
    - Creating Custom UDFs for Spark SQL
    - Caching Spark DataFrames, Saving to Parquet

## Day Two

- Spark Job Execution, Performance Tuning, Tracking and Debugging
    - Jobs, Stages, and Tasks
    - Spark Contexts, Applications, the Driver Program and Spark Executors
    - Partitions and Shuffles
    - Understanding Data Locality
    - Monitoring Spark Jobs with the Spark WebUI
    - Managing Spark Thrift Servers and Changing YARN Resource Allocations
    - Managing Interactive Livy Sessions and their Resources
    - Monitoring Spark Jobs with Spark UI
    - Viewing Spark Job Graphs, and Understanding Spark Stages
- Spark Streaming
    - Creating Spark Streaming Applications Using Spark DStreams APIs
    - DStreams, Stateful, and Stateless Streams
    - Comparison of DStreams and RDDs
    - Transformers for DStreams
    - Persisting Long Term Data in HBase, Hive or SQL
    - Creating Spark Structured Streams
    - Using DataFrames and DataSets API to Create Streaming DataFrames and DataSets
    - Window Transformations for Stateful and Stateless Operations

## Day Three

- Spark Machine Learning and Graph Analytics
    - MLlib and Spark ML - Understanding API Patterns
    - Featurizing DataFrames using Transformers
    - Developing Machine Learning Pipelines with Spark ML
    - Cross-Validation and Hyperparameter Tuning
    - Training ML Models on Text Data: Tokenization, TF/IDF, and Topic Modeling with LDA
    - Using Evaluators to Evaluate Machine Learning Models
    - Unsupervised Learning and Clustering
    - Managing Models with ModelDB
    - Understanding Graph Analytics and Graph Operators
    - Vertex and Edge Classes
    - Mapping Operations
    - Measuring Connectedness
    - Training Graph Algorithms with GraphX
    - Performance and Monitoring
    - Reducing Memory Allocation with Serialization
    - Checkpointing
    - Visualizing Networks with SparkR, d3 and Jupyter

# Technologies Covered

- HDInsight (Hadoop & Spark)
- Machine Learning
- Spark
- HDInsight
- R Server

# Materials

- https://github.com/Azure/Spark-HDI-Course