Course Outline
Introduction
Understanding Hadoop's Architecture and Key Concepts
Understanding the Hadoop Distributed File System (HDFS)
- Overview of HDFS and its Architectural Design
- Interacting with HDFS
- Performing Basic File Operations on HDFS
- Overview of HDFS Command Reference
- Overview of Snakebite
- Installing Snakebite
- Using the Snakebite Client Library
- Using the CLI Client
Learning the MapReduce Programming Model with Python
- Overview of the MapReduce Programming Model
- Understanding Data Flow in the MapReduce Framework
- Map
- Shuffle and Sort
- Reduce
- Using the Hadoop Streaming Utility
- Understanding How the Hadoop Streaming Utility Works
- Demo: Implementing the WordCount Application in Python
- Using the mrjob Library
- Overview of mrjob
- Installing mrjob
- Demo: Implementing the WordCount Algorithm Using mrjob
- Understanding How a MapReduce Job Written with the mrjob Library Works
- Executing a MapReduce Application with mrjob
- Hands-on: Computing Top Salaries Using mrjob
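As a rough illustration of the Map, Shuffle and Sort, and Reduce stages listed above, the following is a minimal single-process sketch of WordCount in plain Python. It is not course material and does not use Hadoop Streaming or mrjob; the function names (map_phase, shuffle_and_sort, reduce_phase, word_count) and the sample input are hypothetical, chosen only to mirror the three stages.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_and_sort(pairs):
    """Shuffle and sort: group the values by key, then order the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped}


def word_count(lines):
    """Chain the three stages in the order the framework would run them."""
    return reduce_phase(shuffle_and_sort(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    sample = ["the quick brown fox", "the lazy dog", "the fox"]
    for word, count in word_count(sample).items():
        print(word, count)
```

In a real cluster the map and reduce steps run in parallel on different nodes and the framework performs the shuffle and sort between them; here all three run in one process so the data flow is easy to follow.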
Learning Pig with Python
- Overview of Pig
- Demo: Implementing the WordCount Algorithm in Pig
- Configuring and Running Pig Scripts and Pig Statements
- Using the Pig Execution Modes
- Using the Pig Interactive Mode
- Using the Pig Batch Mode
- Understanding the Basic Concepts of the Pig Latin Language
- Using Statements
- Loading Data
- Transforming Data
- Storing Data
- Extending Pig's Functionality with Python UDFs
- Registering a Python UDF File
- Demo: A Simple Python UDF
- Demo: String Manipulation Using Python UDF
- Hands-on: Calculating the 10 Most Recent Movies Using Python UDF
Using Spark and PySpark
- Overview of Spark
- Demo: Implementing the WordCount Algorithm in PySpark
- Overview of PySpark
- Using an Interactive Shell
- Implementing Self-Contained Applications
- Working with Resilient Distributed Datasets (RDDs)
- Creating RDDs from a Python Collection
- Creating RDDs from Files
- Implementing RDD Transformations
- Implementing RDD Actions
- Hands-on: Implementing a Text Search Program for Movie Titles with PySpark
Managing Workflow with Python
- Overview of Apache Oozie and Luigi
- Installing Luigi
- Understanding Luigi Workflow Concepts
- Tasks
- Targets
- Parameters
- Demo: Examining a Workflow that Implements the WordCount Algorithm
- Working with Hadoop Workflows that Control MapReduce and Pig Jobs
- Using Luigi's Configuration Files
- Working with MapReduce in Luigi
- Working with Pig in Luigi
Summary and Conclusion
Requirements
- Experience with Python programming
- Basic familiarity with Hadoop
Testimonials (5)
Trainer's preparation and organization, and the quality of the materials provided on GitHub.
Mateusz Rek - MicroStrategy Poland Sp. z o.o.
Course - Impala for Business Intelligence
The practical exercises; the theory was also presented well by Ajay.
Dominik Mazur - Capgemini Polska Sp. z o.o.
Course - Hadoop Administration on MapR
I liked the VM very much. The teacher was very knowledgeable about the topic as well as other topics, and he was very nice and friendly. I liked the facility in Dubai.
Safar Alqahtani - Elm Information Security
Course - Big Data Analytics in Health
Liked very much the interactive way of learning.
Luigi Loiacono
Course - Data Analysis with Hive/HiveQL
I mostly liked the trainer giving real-life examples.