Course Outline

Introduction:

  • Apache Spark within the Hadoop Ecosystem
  • Brief introduction to Python and Scala

Fundamentals (Theory):

  • Architecture Overview
  • RDD Concepts
  • Transformations and Actions (see the sketch after this list)
  • Stages, Tasks, and Dependencies
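
The difference between lazy transformations and eager actions can be illustrated with a minimal PySpark sketch (the SparkSession setup and sample data are illustrative assumptions, not course material):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
    sc = spark.sparkContext

    numbers = sc.parallelize(range(1, 11))          # build an RDD from local data
    squares = numbers.map(lambda x: x * x)          # transformation: lazy, nothing runs yet
    evens = squares.filter(lambda x: x % 2 == 0)    # another lazy transformation
    print(evens.collect())                          # action: schedules a job, returns [4, 16, 36, 64, 100]
    print(evens.count())                            # second action: triggers a second job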

Practical Workshop: Mastering Basics in the Databricks Environment:

  • Exercises using the RDD API
  • Essential action and transformation functions
  • PairRDDs
  • Join operations
  • Optimizing with caching strategies
  • Exercises using the DataFrame API
  • Spark SQL
  • DataFrame operations: select, filter, group, sort (see the sketch after this list)
  • User Defined Functions (UDFs)
  • Introduction to the Dataset API
  • Streaming capabilities
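
As a taste of the DataFrame exercises, the sketch below shows the select/filter/group/sort pattern, a simple UDF, and the same aggregation in Spark SQL (column names and sample rows are illustrative assumptions, not course material):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("dataframe-basics").getOrCreate()

    # Hypothetical sample data, used only for illustration
    df = spark.createDataFrame(
        [("Alice", "PL", 3500), ("Bob", "DE", 4200), ("Carol", "PL", 3900)],
        ["name", "country", "salary"],
    )

    # select / filter / group / sort
    summary = (df.select("country", "salary")
                 .filter(F.col("salary") > 3600)
                 .groupBy("country")
                 .agg(F.avg("salary").alias("avg_salary"))
                 .orderBy(F.col("avg_salary").desc()))
    summary.show()

    # A simple User Defined Function (UDF)
    to_upper = F.udf(lambda s: s.upper(), StringType())
    df.withColumn("country_code", to_upper(F.col("country"))).show()

    # The same aggregation expressed in Spark SQL
    df.createOrReplaceTempView("salaries")
    spark.sql("SELECT country, AVG(salary) AS avg_salary FROM salaries GROUP BY country").show()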

Practical Workshop: Deployment in the AWS Environment:

  • Fundamentals of AWS Glue
  • Comparing AWS EMR and AWS Glue
  • Example jobs across both environments (see the skeleton after this list)
  • Analysis of pros and cons
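
For orientation, a typical AWS Glue job script follows the skeleton sketched below (the catalog database and table names are placeholders, not part of the course material); on EMR the same logic would run as an ordinary spark-submit application:

    import sys
    from pyspark.context import SparkContext
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job

    # Standard Glue boilerplate: wrap the SparkContext in a GlueContext
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    spark = glue_context.spark_session
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read a table registered in the Glue Data Catalog (placeholder names)
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="example_db", table_name="example_table"
    )

    # Drop down to a plain Spark DataFrame for familiar transformations
    df = dyf.toDF()
    df.groupBy("country").count().show()

    job.commit()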

Additional Topics:

  • Introduction to Apache Airflow for orchestration
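
A minimal Apache Airflow DAG that submits a Spark job on a daily schedule might look like the sketch below (the DAG id, schedule, and spark-submit path are illustrative assumptions):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # One task per day: run spark-submit for an example job script
    with DAG(
        dag_id="daily_spark_job",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        run_spark = BashOperator(
            task_id="run_spark",
            bash_command="spark-submit /opt/jobs/example_job.py",
        )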

Requirements

Programming skills (preferably in Python or Scala)

Basic knowledge of SQL

Duration

21 Hours
