Course Outline
PySpark & Machine Learning
Module 1: Foundations of Big Data & Spark
- A survey of the Big Data ecosystem and Spark's pivotal role in modern data platforms
- Understanding Spark architecture: drivers, executors, cluster managers, lazy evaluation, DAGs, and execution planning
- Distinguishing between RDD and DataFrame APIs and determining the appropriate use cases for each
- Creating and configuring a SparkSession, and understanding the basics of application configuration
Module 2: PySpark DataFrames
- Reading and writing data from enterprise sources in various formats (CSV, JSON, Parquet, Delta)
- Working with PySpark DataFrames: transformations, actions, column expressions, filtering, joins, and aggregations
- Applying advanced operations: window functions, timestamp handling, and nested data structures
- Implementing data quality checks and writing reusable, maintainable PySpark code
Module 3: Efficient Processing of Large Datasets
- Understanding performance fundamentals: partitioning strategies, shuffle mechanics, caching, and persistence
- Utilizing optimization techniques such as broadcast joins and execution plan analysis
- Processing large datasets effectively and adhering to best practices for scalable data workflows
- Understanding schema evolution and modern storage formats prevalent in enterprise environments
Module 4: Feature Engineering at Scale
- Conducting feature engineering with Spark MLlib: managing missing values, encoding categorical variables, and scaling features
- Designing reusable preprocessing steps and preparing datasets for integration into Machine Learning pipelines
- An overview of feature selection techniques and strategies for handling imbalanced datasets
Module 5: Machine Learning with Spark MLlib
- Training regression and classification models at scale, including Linear Regression, Logistic Regression, Decision Trees, and Random Forests
- Comparing models and interpreting outcomes within distributed Machine Learning workflows
Module 6: End-to-End ML Pipelines
- Constructing comprehensive Machine Learning pipelines that combine preprocessing, feature engineering, and modeling
- Applying train/validation/test split strategies
- Conducting cross-validation and hyperparameter tuning using grid search and random search methods
- Structuring reproducible Machine Learning experiments
Module 7: Model Evaluation & Practical ML Decision Making
- Selecting appropriate evaluation metrics for regression and classification problems
- Detecting overfitting and underfitting, and making informed decisions regarding model selection
- Interpreting feature importance and gaining insight into model behavior
Module 8: Production & Enterprise Practices
- Saving and loading models within Spark
- Implementing batch inference workflows on large datasets
- Understanding the Machine Learning lifecycle within enterprise contexts
- An introduction to versioning, experiment tracking concepts, and fundamental testing strategies
Practical Outcome
Requirements
Participants are expected to have the following background knowledge:
- Fundamental Python programming skills, including familiarity with functions, data structures, and libraries
- A solid grasp of data analysis principles, such as datasets, transformations, and aggregations
- Basic proficiency in SQL and relational database concepts
- Introductory knowledge of Machine Learning principles, including training datasets, features, and evaluation metrics
- Familiarity with command-line interfaces and basic software development practices is advantageous
- Prior experience with Pandas, NumPy, or comparable data processing libraries is beneficial but not mandatory
Testimonials
I liked that it was practical. Loved to apply the theoretical knowledge with practical examples.