Course Outline

Introduction to AIOps

  • Defining AIOps and its significance
  • Contrasting traditional monitoring with AIOps-driven observability
  • Overview of AIOps architecture and essential components

Collecting and Normalizing Operational Data

  • Types of observability data: metrics, logs, and traces
  • Ingesting data from diverse sources (servers, containers, cloud environments)
  • Leveraging agents and exporters (Prometheus, Beats, Fluentd)

Data Correlation and Anomaly Detection

  • Time series correlation and statistical techniques
  • Applying ML models for anomaly detection
  • Identifying incidents across distributed systems
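
As a rough illustration of the statistical techniques named in this module, the sketch below flags points in a metric series whose z-score against a trailing window exceeds a threshold. The function name, the window size, the threshold, and the sample CPU values are assumptions made up for this example. They are not taken from the course material.

```python
from collections import deque
from statistics import mean, pstdev


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices of points that deviate strongly from the trailing window.

    Each point is compared against the mean and population standard
    deviation of the `window` points that precede it.
    """
    history = deque(maxlen=window)  # trailing window of previous points
    anomalies = []
    for i, value in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # A flat window (sigma == 0) is skipped to avoid division by zero.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
        history.append(value)
    return anomalies


# Hypothetical CPU utilisation samples (percent) with one obvious spike.
cpu = [50, 51, 49, 50, 52, 51, 95, 50, 49]
print(rolling_zscore_anomalies(cpu))  # [6]
```

A trailing window keeps the detector simple, but note that a spike stays in the window for the next few points and inflates the standard deviation, which can mask follow-up anomalies.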

Alerting and Noise Reduction

  • Designing intelligent alert rules and thresholds
  • Techniques for suppression, deduplication, and alert grouping
  • Integrating with Alertmanager, Slack, PagerDuty, or Opsgenie
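
The deduplication and grouping techniques listed above can be sketched in a few lines. This example drops alerts with an identical label set and then groups the remainder by a chosen subset of labels, loosely in the spirit of the `group_by` setting in Alertmanager routes. It is not how Alertmanager is implemented, and the alert structure, label names, and function name are assumptions for illustration only.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("service", "alertname")):
    """Deduplicate alerts by full label set, then group by selected labels."""
    seen = set()
    groups = defaultdict(list)
    for alert in alerts:
        # The sorted label pairs act as a fingerprint for exact duplicates.
        fingerprint = tuple(sorted(alert["labels"].items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        # Alerts sharing the same values for the group_by labels land together.
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical incoming alerts: one exact duplicate, two services.
incoming = [
    {"labels": {"alertname": "HighLatency", "service": "checkout", "instance": "a"}},
    {"labels": {"alertname": "HighLatency", "service": "checkout", "instance": "a"}},
    {"labels": {"alertname": "HighLatency", "service": "checkout", "instance": "b"}},
    {"labels": {"alertname": "DiskFull", "service": "db", "instance": "c"}},
]

for key, members in group_alerts(incoming).items():
    print(key, len(members))
```

Four raw alerts collapse into two notifications here: one for the checkout latency problem covering two instances, and one for the database disk.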

Root Cause Analysis and Visualization

  • Using dashboards to visualize metrics and identify trends
  • Exploring events and timelines to conduct RCA
  • Tracing issues across layers using distributed tracing tools

Automation and Remediation

  • Triggering automated scripts or workflows in response to incidents
  • Integrating with ITSM systems (ServiceNow, Jira)
  • Use cases: self-healing, scaling, and traffic rerouting

Open Source and Commercial AIOps Platforms

  • Overview of tools: Prometheus, Grafana, ELK, Moogsoft, Dynatrace
  • Evaluation criteria for selecting an AIOps platform
  • Demo and hands-on practice with a selected stack

Summary and Next Steps

Requirements

  • A foundational understanding of IT operations and system monitoring concepts
  • Hands-on experience with monitoring tools or dashboards
  • Familiarity with standard log and metric formats

Audience

  • Operations teams managing infrastructure and applications
  • Site Reliability Engineers (SREs)
  • Teams focused on IT monitoring and observability

Duration

  • 14 hours
