Course Outline

  • Introduction
    • History and core concepts of Hadoop
    • The Hadoop ecosystem
    • Different Hadoop distributions
    • High-level architecture
    • Debunking Hadoop myths
    • Common challenges in Hadoop (hardware and software)
    • Labs: Discussing students’ Big Data projects and associated problems
  • Planning and installation
    • Selecting software and Hadoop distributions
    • Cluster sizing and planning for future growth
    • Selecting hardware and network infrastructure
    • Understanding rack topology
    • Installation procedures
    • Implementing multi-tenancy
    • Directory structure and log management
    • Performance benchmarking
    • Labs: Installing the cluster and running performance benchmarks
  • HDFS operations
    • Core concepts: horizontal scaling, replication, data locality, and rack awareness
    • Nodes and daemons: NameNode, Secondary NameNode, HA Standby NameNode, DataNode
    • Health monitoring strategies
    • Command-line and browser-based administration tools
    • Expanding storage and replacing defective drives
    • Labs: Getting familiar with HDFS command-line tools
  • Data ingestion
    • Using Flume for ingesting logs and other data into HDFS
    • Using Sqoop for importing data from SQL databases to HDFS and exporting back to SQL
    • Implementing Hadoop data warehousing with Hive
    • Copying data between clusters using distcp
    • Leveraging S3 as a complementary storage solution to HDFS
    • Best practices and architectures for data ingestion
    • Labs: Setting up and utilizing Flume and Sqoop
  • MapReduce operations and administration
    • Parallel computing prior to MapReduce: comparing HPC and Hadoop administration
    • Understanding MapReduce cluster loads
    • Nodes and daemons: JobTracker and TaskTracker
    • Walkthrough of the MapReduce User Interface
    • Configuring MapReduce
    • Job configuration details
    • Strategies for optimizing MapReduce
    • Robust MapReduce implementation: guidance for programmers
    • Labs: Running MapReduce examples
  • YARN: New architecture and capabilities
    • YARN design goals and implementation architecture
    • New components: ResourceManager, NodeManager, ApplicationMaster
    • Installing YARN
    • Job scheduling mechanisms within YARN
    • Labs: Investigating job scheduling behaviors
  • Advanced topics
    • Hardware monitoring techniques
    • Cluster-wide monitoring
    • Adding and removing servers, upgrading Hadoop versions
    • Backup, recovery, and business continuity planning
    • Oozie job workflows
    • Hadoop High Availability (HA)
    • Hadoop Federation
    • Securing your cluster with Kerberos
    • Labs: Setting up monitoring systems
  • Optional tracks
    • Cloudera Manager: installing and using Cloudera Manager for cluster administration, monitoring, and routine tasks. All exercises and labs in this track are performed in the Cloudera Distribution Including Apache Hadoop (CDH5) environment.
    • Ambari: installing and using Ambari for cluster administration, monitoring, and routine tasks. All exercises and labs in this track are performed with the Ambari cluster manager on Hortonworks Data Platform (HDP 2.0).

Requirements

  • Comfort with basic Linux system administration
  • Basic scripting skills

Prior knowledge of Hadoop and distributed computing is not required; both are introduced and explained during the course.

Lab environment

Zero Install: There is no need to install Hadoop software on students’ personal machines. A functional Hadoop cluster will be provided for student use.

Students will need the following items:

  • An SSH client (Linux and Mac systems come with SSH clients built in; for Windows, PuTTY is recommended)
  • A web browser to access the cluster. We recommend using Firefox with the FoxyProxy extension installed.

Duration

21 hours
