🚀 Ready to deploy? Skip to the Deployment section to start deploying the infrastructure with Terraform.
⚠️ Disclaimer: This repository includes intentional fault injection and stress test scenarios designed to demonstrate the AWS DevOps Agent's investigation capabilities. These scripts deliberately introduce issues such as memory leaks, network partitions, database stress, and service latency. Do not run these scripts in production environments. They are intended for learning and demonstration purposes only.
📦 Source Code: The source code for the Retail Store Sample Application can be found at: https://github.com/aws-containers/retail-store-sample-app
- Getting Started
- Lab Introduction & Goals
- Architecture Overview
- Observability Stack
- 🚀 Deployment
- Application Access
- AWS DevOps Agent Integration
- Fault Injection Scenarios
- Cleanup
If you don't have Git installed, install it first:
# Linux (Debian/Ubuntu)
sudo apt-get update && sudo apt-get install git
# Linux (RHEL/CentOS/Amazon Linux)
sudo yum install git
# macOS (using Homebrew)
brew install git
# Verify installation
git --version

# Clone the repository
git clone https://github.com/aws-samples/AmazonEKS-Devops-agent-sample.git
# Navigate to the project directory
cd AmazonEKS-Devops-agent-sample

🔧 Troubleshooting Git Clone Issues? If you're encountering issues with git clone, you can download the repository as a ZIP file instead:
- Navigate to the repository in your browser: https://github.com/aws-samples/AmazonEKS-Devops-agent-sample
- Click the Code button → Download ZIP
- Extract the ZIP file to your desired location:
unzip AmazonEKS-Devops-agent-sample-main.zip
cd AmazonEKS-Devops-agent-sample-main
This hands-on lab demonstrates how to deploy, operate, and troubleshoot a production-grade microservices application on Amazon EKS using the AWS DevOps Agent. You'll gain practical experience with real-world scenarios including fault injection, observability, and automated incident investigation.
- Deploy the EKS Cluster with Retail Sample App - Deploy a complete microservices architecture to Amazon EKS using Terraform, including all backend dependencies and observability tooling.
- Understand the Microservices Architecture - Explore how the five core microservices (UI, Catalog, Carts, Orders, Checkout) interact with each other and their backend dependencies.
- Work with AWS Managed Backend Services - Configure and operate production-grade AWS services that power the application.
- Experience Observability in Action - Use CloudWatch Container Insights, Application Signals, Amazon Managed Prometheus, and Amazon Managed Grafana to monitor application health and performance.
- Leverage the AWS DevOps Agent - See how the DevOps Agent automatically detects, investigates, and helps resolve infrastructure and application issues.
The Retail Store Sample App is a deliberately over-engineered e-commerce application designed to demonstrate microservices patterns and AWS service integrations:
The following diagram shows how the microservices communicate with each other and their backend data stores:
The comprehensive observability stack provides full visibility into application and infrastructure health:
Note: An editable Draw.io version of the architecture diagram is available at docs/retail-store-architecture.drawio
| Component | Language | Container Image | Helm Chart | Description |
|---|---|---|---|---|
| UI | Java | Link | Link | Store user interface |
| Catalog | Go | Link | Link | Product catalog API |
| Cart | Java | Link | Link | User shopping carts API |
| Orders | Java | Link | Link | User orders API |
| Checkout | Node | Link | Link | API to orchestrate the checkout process |
The services communicate using synchronous HTTP REST calls within the Kubernetes cluster:
| Source | Target | Protocol | Endpoint | Purpose |
|---|---|---|---|---|
| UI | Catalog | HTTP | http://catalog.catalog.svc:80 | Fetch product listings and details |
| UI | Carts | HTTP | http://carts.carts.svc:80 | Manage shopping cart operations |
| UI | Orders | HTTP | http://orders.orders.svc:80 | View order history and status |
| UI | Checkout | HTTP | http://checkout.checkout.svc:80 | Initiate checkout process |
| Checkout | Orders | HTTP | http://orders.orders.svc:80 | Create new orders |
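A quick way to confirm this east-west connectivity from inside the cluster is to call one of these endpoints from the UI pod. A minimal sketch, assuming the UI container image ships with curl and that the Catalog service exposes a health endpoint at /health (both are assumptions that may differ in your deployment):

```bash
# Call the Catalog service from inside a UI pod and print only the HTTP status code
kubectl exec -n ui deploy/ui -- \
  curl -s -o /dev/null -w "%{http_code}\n" http://catalog.catalog.svc:80/health
```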
The Terraform modules in this repository provision the following AWS resources:
Compute & Orchestration:
- Amazon EKS (v1.34) - Kubernetes cluster with EKS Auto Mode enabled
- General-purpose and system node pools
- Network Policy Controller enabled
- All control plane logging (API, audit, authenticator, controller manager, scheduler)
Networking:
- Amazon VPC - Custom VPC with public/private subnets across 3 AZs
- NAT Gateway for private subnet internet access
- VPC Flow Logs with 30-day retention
- Kubernetes-tagged subnets for ELB integration
Databases:
- Amazon Aurora MySQL (v8.0) - Catalog service database
- db.t3.medium instance class
- Storage encryption enabled
- Amazon Aurora PostgreSQL (v15.10) - Orders service database
- db.t3.medium instance class
- Storage encryption enabled
- Amazon DynamoDB - Carts service NoSQL database
- Global secondary index on customerId
- On-demand capacity mode
Messaging & Caching:
- Amazon MQ (RabbitMQ) (v3.13) - Message broker for Orders service
- mq.t3.micro instance type
- Single-instance deployment
- Amazon ElastiCache (Redis) - Session/cache store for Checkout service
- cache.t3.micro instance type
Observability Stack:
- Amazon CloudWatch Container Insights - Enhanced container monitoring with Application Signals
- Amazon Managed Service for Prometheus (AMP) - Metrics collection and storage
- EKS Managed Prometheus Scraper
- Scrapes: API server, kubelet, cAdvisor, kube-state-metrics, node-exporter, application pods
- Amazon Managed Grafana - Visualization and dashboards
- Prometheus, CloudWatch, and X-Ray data sources
- AWS X-Ray - Distributed tracing
- Network Flow Monitoring Agent - Container network observability
EKS Add-ons:
- metrics-server
- kube-state-metrics
- prometheus-node-exporter
- aws-efs-csi-driver
- aws-secrets-store-csi-driver-provider
- amazon-cloudwatch-observability (with Application Signals)
- aws-network-flow-monitoring-agent
- cert-manager
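Once the cluster is up, you can confirm which of these managed add-ons were actually installed with the AWS CLI; a quick check assuming the default cluster name retail-store and region us-east-1:

```bash
# List the EKS managed add-ons installed on the cluster
aws eks list-addons --cluster-name retail-store --region us-east-1

# Inspect a single add-on, e.g. the CloudWatch observability add-on
aws eks describe-addon \
  --cluster-name retail-store \
  --region us-east-1 \
  --addon-name amazon-cloudwatch-observability
```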
The Retail Store Sample App includes a comprehensive observability stack that provides full visibility into application and infrastructure health. This section details the instrumentation, metrics collection, and visualization capabilities.
Each microservice is instrumented for observability:
| Service | Language | Prometheus Metrics | OpenTelemetry Tracing | Application Signals |
|---|---|---|---|---|
| UI | Java | ✅ /actuator/prometheus | ✅ OTLP | ✅ Auto-instrumented |
| Catalog | Go | ✅ /metrics | ✅ OTLP | ❌ (Go not supported) |
| Carts | Java | ✅ /actuator/prometheus | ✅ OTLP | ✅ Auto-instrumented |
| Orders | Java | ✅ /actuator/prometheus | ✅ OTLP | ✅ Auto-instrumented |
| Checkout | Node.js | ✅ /metrics | ✅ OTLP | ✅ Auto-instrumented |
Application Signals Auto-Instrumentation: Java and Node.js services are automatically instrumented via pod annotations:
# Java services (UI, Carts, Orders)
instrumentation.opentelemetry.io/inject-java: "true"
# Node.js services (Checkout)
instrumentation.opentelemetry.io/inject-nodejs: "true"

Note: The Catalog service (Go) does not support Application Signals auto-instrumentation. It uses manual OpenTelemetry SDK instrumentation.
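To confirm the injection actually happened on a running pod, you can inspect its annotations and init containers; a minimal sketch (the app.kubernetes.io/name=ui label selector is an assumption and may differ from the Helm chart's labels):

```bash
# Look for the OpenTelemetry injection annotations on the UI pods
kubectl describe pods -n ui -l app.kubernetes.io/name=ui | grep -i "instrumentation.opentelemetry.io"

# Auto-instrumentation adds an init container that copies the agent into the pod
kubectl get pods -n ui -l app.kubernetes.io/name=ui \
  -o jsonpath='{.items[0].spec.initContainers[*].name}'
```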
Container Insights provides enhanced observability for EKS clusters with the following capabilities:
Metrics Collected:
- Container CPU/memory utilization and limits
- Pod network I/O (bytes received/transmitted)
- Container restart counts
- Cluster, node, and pod-level aggregations
Application Signals Features:
- Automatic service map generation
- Request latency percentiles (p50, p95, p99)
- Error rates and HTTP status code distribution
- Service dependency visualization
- SLO monitoring and alerting
Application Signals provides Application Performance Monitoring (APM) capabilities for your microservices. Four of the five services are auto-instrumented:
| Service | Language | Auto-Instrumented | APM Features |
|---|---|---|---|
| UI | Java | ✅ Yes | Traces, metrics, service map |
| Carts | Java | ✅ Yes | Traces, metrics, service map |
| Orders | Java | ✅ Yes | Traces, metrics, service map |
| Checkout | Node.js | ✅ Yes | Traces, metrics, service map |
| Catalog | Go | ❌ No | Manual OTEL instrumentation only |
Accessing Application Signals Console:
- Open the CloudWatch Console
- In the left navigation, click Application Signals → Services
- You will see the 4 instrumented services listed:
  - ui (Java)
  - carts (Java)
  - orders (Java)
  - checkout (Node.js)
Key APM Features in Application Signals:
- Service Map: Visual representation of service dependencies and traffic flow
  - Navigate to Application Signals → Service Map
  - See real-time connections between UI → Catalog, UI → Carts, Checkout → Orders, etc.
- Service Details: Click on any service to view:
  - Request rate (requests/second)
  - Latency percentiles (p50, p95, p99)
  - Error rate and fault rate
  - Top operations and endpoints
- Traces: Distributed tracing across services
  - Navigate to Application Signals → Traces
  - Filter by service, operation, or latency
  - View end-to-end request flow across microservices
- SLO Monitoring: Set Service Level Objectives
  - Define availability and latency targets
  - Get alerts when SLOs are breached
Note: The Catalog service (Go) does not appear in Application Signals because Go auto-instrumentation is not supported. However, it still sends traces via manual OpenTelemetry SDK instrumentation visible in X-Ray.
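If you want to confirm that traces (including the Catalog service's manually instrumented spans) are arriving, you can pull recent trace summaries with the AWS CLI; a quick sketch where the region and the 10-minute window are assumptions:

```bash
# List trace summaries from the last 10 minutes (Linux date syntax)
aws xray get-trace-summaries \
  --region us-east-1 \
  --start-time $(date -d '10 minutes ago' +%s) \
  --end-time $(date +%s) \
  --query 'TraceSummaries[].{Id:Id,Duration:Duration,Url:Http.HttpURL}' \
  --output table
```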
Container Logs Collection:
Container logs from all pods are automatically collected by Fluent Bit and sent to CloudWatch Logs. The logs are organized into the following log groups:
| Log Group | Description |
|---|---|
| /aws/containerinsights/retail-store/application | Application container logs (stdout/stderr) from all pods |
| /aws/containerinsights/retail-store/dataplane | Kubernetes dataplane component logs |
| /aws/containerinsights/retail-store/host | Node-level host logs |
| /aws/containerinsights/retail-store/performance | Performance metrics in log format |
Viewing Container Logs:
# View recent logs for a specific service using CloudWatch Logs Insights
aws logs start-query \
--log-group-name "/aws/containerinsights/retail-store/application" \
--start-time $(date -d '1 hour ago' +%s) \
--end-time $(date +%s) \
--query-string 'fields @timestamp, @message | filter kubernetes.namespace_name = "catalog" | sort @timestamp desc | limit 50'
# Or use kubectl for real-time logs
kubectl logs -n catalog -l app.kubernetes.io/name=catalog --tail=100 -f

Log Structure: Each log entry includes Kubernetes metadata for easy filtering:
- kubernetes.pod_name - Pod name
- kubernetes.namespace_name - Namespace
- kubernetes.container_name - Container name
- kubernetes.host - Node instance ID
- log_processed - Parsed JSON log content (if applicable)
Access Container Insights:
- Open CloudWatch Console
- Navigate to Container Insights → Performance monitoring
- Select your EKS cluster from the dropdown
- Explore metrics by: Cluster, Namespace, Service, Pod, or Container
- For logs, navigate to Logs → Log groups → /aws/containerinsights/retail-store/application
AMP provides a fully managed Prometheus-compatible monitoring service.
Metrics Scrape Configuration:
The EKS Managed Prometheus Scraper collects metrics from multiple sources:
Key Metrics Available:
| Source | Metrics | Use Case |
|---|---|---|
| kube-state-metrics | kube_pod_status_phase, kube_deployment_status_replicas | Kubernetes object states |
| node-exporter | node_cpu_seconds_total, node_memory_MemAvailable_bytes | Node hardware/OS metrics |
| cAdvisor | container_cpu_usage_seconds_total, container_memory_usage_bytes | Container resource usage |
| API Server | apiserver_request_total, apiserver_request_duration_seconds | Control plane performance |
| Application Pods | Custom application metrics | Business and application KPIs |
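You can also query AMP directly, without Grafana, by signing a PromQL request with SigV4. A minimal sketch using the open-source awscurl tool (assumes awscurl is installed, e.g. pip install awscurl, and uses the prometheus_workspace_endpoint Terraform output referenced later in this guide):

```bash
# Query the AMP workspace for the 'up' metric across all scrape targets
AMP_ENDPOINT=$(terraform output -raw prometheus_workspace_endpoint)

# Append /api/v1/query to the workspace endpoint; drop the extra slash if your endpoint already ends with one
awscurl --service aps --region us-east-1 "${AMP_ENDPOINT}/api/v1/query?query=up"
```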
📌 Optional: Amazon Managed Grafana is optional for this lab. The primary focus is on the AWS DevOps Agent, which automatically analyzes metrics from CloudWatch and Prometheus. Configure Grafana only if you want to manually review and visualize metrics through custom dashboards.
Grafana provides visualization and dashboarding for all collected metrics.
Pre-configured Data Sources:
- Prometheus - AMP workspace for Kubernetes and application metrics
- CloudWatch - AWS service metrics (RDS, DynamoDB, ElastiCache, etc.)
- X-Ray - Distributed traces and service maps
Accessing Grafana:
- Get the Grafana workspace URL from Terraform output:
terraform output grafana_workspace_endpoint
- Sign in using AWS IAM Identity Center (SSO)
- Navigate to Dashboards to view pre-built visualizations
Configuring the Prometheus Data Source:
The Prometheus data source must be manually configured in Grafana to query metrics from Amazon Managed Prometheus (AMP).
- Get your AMP workspace endpoint:

  terraform output prometheus_workspace_endpoint

- In Grafana, navigate to Connections → Data sources → Add data source → Prometheus
- Configure the data source with these settings:
  - Name: Amazon Managed Prometheus (or your preferred name)
  - URL: Your AMP workspace endpoint (e.g., https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)

  Note: The Prometheus endpoint URL is unique to your deployment. Get it from the Terraform output above.

- Under Authentication, enable SigV4 auth:
  - Toggle SigV4 auth to ON
  - Default Region: us-east-1 (or your deployment region)
  - Leave Assume Role ARN empty (Grafana uses its workspace IAM role automatically)
- Under HTTP Method, select POST
- Click Save & test to verify the connection

Troubleshooting: If you receive a 403 Forbidden error, ensure SigV4 auth is enabled. Amazon Managed Grafana automatically uses its workspace IAM role for authentication - no manual credentials are needed.
Recommended Dashboards to Import:
How to Import a Dashboard:
- In Grafana, click Dashboards in the left sidebar
- Click New → Import
- Enter the Grafana ID from the table below in the "Import via grafana.com" field
- Click Load
- Select your Prometheus data source (the one you configured above)
- Click Import
The dashboard will be added to your Grafana instance and start displaying metrics immediately.
| Dashboard | Grafana ID | Description |
|---|---|---|
| Control Plane | ||
| Kubernetes API Server | 15761 | API server request rates, latencies, and error rates |
| etcd | 3070 | etcd cluster health, leader elections, and disk I/O |
| Kubernetes Controller Manager | 12122 | Controller work queue depths and reconciliation metrics |
| Kubernetes Scheduler | 12123 | Scheduler latency, pending pods, and preemption metrics |
| Kube State Metrics | ||
| Kubernetes Cluster (via kube-state-metrics) | 13332 | Comprehensive cluster state overview |
| Kubernetes Deployment Statefulset Daemonset | 8588 | Workload replica status and rollout progress |
| Kubernetes Resource Requests vs Limits | 13770 | Resource allocation vs actual usage |
| Kubernetes Pod Status | 15759 | Pod phase distribution and container states |
| Node Exporter | ||
| Node Exporter Full | 1860 | Comprehensive node hardware and OS metrics |
| Node Exporter for Prometheus | 11074 | Simplified node metrics overview |
| Node Problem Detector | 15549 | Node conditions and kernel issues |
| Network & Conntrack | ||
| Kubernetes Networking | 12125 | Pod and service network traffic |
| Node Network and Conntrack | 14996 | Connection tracking table usage and network stats |
| CoreDNS | 14981 | DNS query rates, latencies, and cache hit ratios |
| General Kubernetes | ||
| Kubernetes Cluster Monitoring | 315 | Cluster-wide resource utilization |
| Kubernetes Pods | 6336 | Pod-level metrics and logs |
| Kubernetes Namespace Resources | 14678 | Per-namespace resource consumption |
| AWS RDS | 707 | RDS database performance |
| AWS DynamoDB | 12637 | DynamoDB table metrics |
Node Exporter exposes hardware and OS-level metrics from each Kubernetes node.
Key Metrics:
- node_cpu_seconds_total - CPU time spent in each mode
- node_memory_MemTotal_bytes - Total memory
- node_memory_MemAvailable_bytes - Available memory
- node_filesystem_size_bytes - Filesystem size
- node_network_receive_bytes_total - Network bytes received
- node_load1, node_load5, node_load15 - System load averages
Useful PromQL Queries:
# CPU utilization percentage
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory utilization percentage
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# Disk utilization percentage
100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
Kube State Metrics generates metrics about the state of Kubernetes objects.
Key Metrics:
- kube_pod_status_phase - Pod phase (Pending, Running, Succeeded, Failed, Unknown)
- kube_pod_container_status_restarts_total - Container restart count
- kube_deployment_status_replicas_available - Available replicas
- kube_node_status_condition - Node conditions (Ready, MemoryPressure, DiskPressure)
- kube_horizontalpodautoscaler_status_current_replicas - HPA current replicas
Useful PromQL Queries:
# Pods not in Running state
kube_pod_status_phase{phase!="Running",phase!="Succeeded"} == 1
# Deployments with unavailable replicas
kube_deployment_status_replicas_unavailable > 0
# Container restarts in last hour
increase(kube_pod_container_status_restarts_total[1h]) > 0
The Network Flow Monitoring Agent provides container network observability.
Capabilities:
- Service-to-service traffic flow visualization
- Network latency between pods
- Packet loss detection
- TCP connection metrics
- Network policy effectiveness monitoring
Access Network Flow Insights:
- Open CloudWatch Console
- Navigate to Network Monitoring → Network Flow Monitor
- View traffic flows between services in the retail store application
OpenTelemetry provides distributed tracing across all microservices.
Configuration:
# OTEL Instrumentation settings (from Terraform)
OTEL_SDK_DISABLED: "false"
OTEL_EXPORTER_OTLP_PROTOCOL: "http/protobuf"
OTEL_RESOURCE_PROVIDERS_AWS_ENABLED: "true"
OTEL_METRICS_EXPORTER: "none" # Metrics via Prometheus
OTEL_JAVA_GLOBAL_AUTOCONFIGURE_ENABLED: "true"

Trace Propagation:
- W3C Trace Context (tracecontext)
- W3C Baggage (baggage)
Sampling: Always-on sampling for complete trace visibility
CloudWatch Container Insights:
# Get cluster name
CLUSTER_NAME=$(terraform output -raw cluster_name)
# View in AWS Console
echo "https://console.aws.amazon.com/cloudwatch/home#container-insights:infrastructure"Amazon Managed Grafana:
# Get Grafana endpoint
terraform output grafana_workspace_endpointPrometheus Queries (via Grafana):
# Get AMP workspace endpoint
terraform output prometheus_workspace_endpointThe AWS DevOps Agent leverages the comprehensive observability stack to automatically investigate and diagnose issues:
- Resource Discovery - All resources are tagged with devopsagent = "true", enabling automatic discovery of related infrastructure components.
- Metrics Correlation - The agent queries Amazon Managed Prometheus and CloudWatch to identify anomalies in:
  - Pod CPU/memory utilization
  - Request latency (p50, p95, p99)
  - Error rates and HTTP status codes
  - Database connection pools and query performance
- Log Analysis - CloudWatch Logs from EKS control plane and application pods are analyzed for:
  - Error patterns and stack traces
  - Connection timeouts and failures
  - Resource exhaustion warnings
- Trace Investigation - X-Ray traces help identify:
  - Slow service dependencies
  - Failed downstream calls
  - Latency bottlenecks in the request path
- Network Insights - Network Flow Monitoring reveals:
  - Traffic patterns between services
  - Network policy violations
  - Connectivity issues
When you inject faults using the provided scripts, the DevOps Agent can automatically detect symptoms, correlate signals across the observability stack, and provide root cause analysis with remediation recommendations.
Before deploying and running fault injection scenarios, install the following tools:
# Linux (x86_64)
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# macOS
curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
sudo installer -pkg AWSCLIV2.pkg -target /
# Verify installation
aws --version

Configure AWS credentials:

aws configure
# Enter your AWS Access Key ID, Secret Access Key, and default region (us-east-1)

# Linux/macOS using tfenv (recommended)
git clone https://github.com/tfutils/tfenv.git ~/.tfenv
echo 'export PATH="$HOME/.tfenv/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
tfenv install 1.5.0
tfenv use 1.5.0
# Or direct installation (Linux)
wget https://releases.hashicorp.com/terraform/1.5.0/terraform_1.5.0_linux_amd64.zip
unzip terraform_1.5.0_linux_amd64.zip
sudo mv terraform /usr/local/bin/
# macOS using Homebrew
brew tap hashicorp/tap
brew install hashicorp/tap/terraform
# Verify installation
terraform --version

# Linux
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/
# macOS using Homebrew
brew install kubectl
# Verify installation
kubectl version --client

# Linux/macOS
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
# macOS using Homebrew
brew install helm
# Verify installation
helm version

📌 Optional: Amazon Managed Grafana is disabled by default in this deployment. Grafana requires AWS IAM Identity Center (SSO) to be configured, and if SSO is not set up, the Terraform deployment will fail. The AWS DevOps Agent does not require Grafana - it directly queries CloudWatch and Prometheus for automated analysis.
To enable Grafana, you must:
- First configure AWS IAM Identity Center in your account
- Set enable_grafana = true in your Terraform variables
Setup Guide: Enable IAM Identity Center for Amazon Managed Grafana
Quick Steps to Enable Grafana:
- Open the IAM Identity Center console
- Click Enable if not already enabled
- Create users or groups that will access Grafana
- Deploy with Grafana enabled:

  terraform apply -var="enable_grafana=true"

- After deployment, assign yourself as Grafana admin:
  - Go to Amazon Managed Grafana console
  - Choose All workspaces from the left navigation
  - Select the retail-store-grafana workspace
  - Choose the Authentication tab
  - Choose Configure users and user groups
  - Select the checkbox next to your SSO user and choose Assign user
  - Select your user and choose Make admin
For detailed instructions, see Manage user and group access to Amazon Managed Grafana workspaces
Navigate to the EKS deployment directory:
cd terraform/eks/default

When you run terraform apply, the following resources will be provisioned:
EKS Cluster & Compute:
- Amazon EKS cluster (v1.34) with EKS Auto Mode enabled
- IAM roles for cluster and node management
- EKS managed add-ons (metrics-server, kube-state-metrics, prometheus-node-exporter, etc.)
Networking:
- New VPC with public and private subnets across 3 Availability Zones
- NAT Gateway for private subnet internet access
- VPC Flow Logs for network traffic analysis
- Security groups for all components
Application Dependencies:
- Amazon DynamoDB - Table for Carts service with GSI on customerId
- Amazon Aurora MySQL - Database for Catalog service
- Amazon Aurora PostgreSQL - Database for Orders service
- Amazon MQ (RabbitMQ) - Message broker for Orders service
- Amazon ElastiCache (Redis) - Cache for Checkout service
- Application Load Balancer - Managed by EKS Auto Mode for ingress
Observability Stack:
- Amazon CloudWatch Container Insights with Application Signals
- Amazon Managed Service for Prometheus (AMP) with EKS scraper
- Amazon Managed Grafana workspace (optional, requires enable_grafana = true and AWS SSO)
- Network Flow Monitoring Agent
Retail Store Application:
- All five microservices (UI, Catalog, Carts, Orders, Checkout) deployed to dedicated namespaces
# 1. Navigate to the full EKS deployment directory
cd terraform/eks/default
# 2. Initialize Terraform (downloads providers and modules)
terraform init
# 3. Review the execution plan
# This shows all resources that will be created
terraform plan
# 4. Apply the configuration
# Type 'yes' when prompted to confirm
# This takes approximately 20-30 minutes
terraform apply
# 5. Note the outputs - you'll need these for kubectl configuration
# Look for: cluster_name, region, and any endpoint URLs
terraform output

Optional: Customize Cluster Name and Region
By default, the cluster is named retail-store and deployed to us-east-1. You can customize these values:
# Deploy with custom cluster name and region
terraform apply -var="cluster_name=my-retail-cluster" -var="region=us-west-2"
# Or create a terraform.tfvars file for persistent configuration
cat > terraform.tfvars <<EOF
cluster_name = "my-retail-cluster"
region = "us-west-2"
EOF
terraform apply

| Variable | Default | Description |
|---|---|---|
| cluster_name | retail-store | Name of the EKS cluster |
| region | us-east-1 | AWS region for deployment |
| enable_grafana | false | Enable Amazon Managed Grafana (requires AWS SSO) |
Optional: Enable Amazon Managed Grafana
⚠️ Important: Grafana requires AWS IAM Identity Center (SSO) to be configured in your account. If SSO is not set up, Terraform will fail when enable_grafana=true. See Prerequisites - AWS IAM Identity Center for setup instructions.
# To deploy with Grafana enabled (requires AWS SSO):
terraform apply -var="enable_grafana=true"

Important: After the EKS cluster is created, you must manually add your IAM role to the cluster's access entries. Terraform does not configure this automatically.
Steps to add your IAM role:
- Open the Amazon EKS Console
- Select your cluster (default name: retail-store, or your custom cluster_name)
- Click Create access entry
- Configure the access entry:
- IAM principal ARN: Enter your IAM role ARN (any IAM user or role with required permissions, or an admin user)
- Type: Standard
- Click Next
- Add access policy:
  - Policy name: AmazonEKSClusterAdminPolicy
  - Access scope: Cluster
- Click Create
Alternative: Using AWS CLI
# Get your current IAM identity
aws sts get-caller-identity
# Get cluster name from Terraform output (or use your custom name)
CLUSTER_NAME=$(terraform output -raw cluster_name)
# Create access entry (replace YOUR_ROLE_ARN with your actual role ARN)
aws eks create-access-entry \
--cluster-name $CLUSTER_NAME \
--principal-arn YOUR_ROLE_ARN \
--type STANDARD
# Associate the admin policy
aws eks associate-access-policy \
--cluster-name $CLUSTER_NAME \
--principal-arn YOUR_ROLE_ARN \
--policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy \
  --access-scope type=cluster

All AWS resources created by this Terraform deployment are tagged with:
devopsagent = "true"
This tag enables the AWS DevOps Agent to automatically discover and monitor resources associated with this retail store application. The agent uses this tag to:
- Identify resources for automated investigation during incidents
- Correlate related resources across EKS, RDS, DynamoDB, and other AWS services
- Scope troubleshooting and root cause analysis to the correct infrastructure
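A quick way to see everything that is discoverable through this tag is the Resource Groups Tagging API; a minimal sketch, assuming region us-east-1:

```bash
# List the ARNs of all resources tagged devopsagent=true
aws resourcegroupstaggingapi get-resources \
  --region us-east-1 \
  --tag-filters Key=devopsagent,Values=true \
  --query 'ResourceTagMappingList[].ResourceARN' \
  --output table
```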
After the EKS cluster is deployed, configure kubectl to access it:
# Update kubeconfig using Terraform outputs
aws eks update-kubeconfig \
--name $(terraform output -raw cluster_name) \
--region $(terraform output -raw region)
# Or manually specify your cluster name and region
aws eks update-kubeconfig --name retail-store --region us-east-1
# Verify cluster access
kubectl get nodes
# Verify all pods are running
kubectl get pods -A

# Check all retail store services are running
kubectl get pods -A | grep -E "carts|catalog|orders|checkout|ui"
# Get the UI Ingress URL (ALB)
kubectl get ingress -n ui

The Retail Sample App UI is exposed via an AWS Application Load Balancer (ALB) created automatically by the AWS Load Balancer Controller.
After deployment, get the ALB URL from Terraform output:
# Get the application URL
terraform output retail_app_url
# Or get it directly from the Ingress resource
kubectl get ingress -n ui ui

The ALB URL will look like: http://k8s-ui-ui-xxxxxxxxxx-xxxxxxxxxx.us-east-1.elb.amazonaws.com
Note: It may take 2-3 minutes for the ALB to be provisioned and become healthy after deployment.
# Check all retail store services are running
kubectl get pods -A | grep -E "carts|catalog|orders|checkout|ui"
# Check the UI Ingress status
kubectl get ingress -n ui
# Verify the ALB target group is healthy
kubectl describe ingress ui -n ui

| Issue | Cause | Solution |
|---|---|---|
| ALB URL returns 503 | Target group unhealthy | Check pod health: kubectl get pods -n ui |
| ALB not provisioned | AWS LB Controller issue | Check controller logs: kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller |
| Ingress has no address | ALB still provisioning | Wait 2-3 minutes and check again |
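As a final smoke test once the targets are healthy, hit the storefront URL from your workstation; a quick sketch using the retail_app_url Terraform output shown above:

```bash
# Expect "HTTP 200" once the ALB and UI pods are healthy
curl -s -o /dev/null -w "HTTP %{http_code}\n" "$(terraform output -raw retail_app_url)"
```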
AWS DevOps Agent is a frontier AI agent that helps accelerate incident response and improve system reliability. It automatically correlates data across your operational toolchain, identifies probable root causes, and recommends targeted mitigations. This section provides step-by-step guidance for integrating the DevOps Agent with your EKS-based Retail Store deployment.
Note: AWS DevOps Agent is currently in public preview and available in the US East (N. Virginia) Region (us-east-1). While the agent runs in us-east-1, it can monitor applications deployed in any AWS Region.
An Agent Space defines the tools and infrastructure that AWS DevOps Agent has access to.
For more details, see the AWS DevOps Agent documentation.
- Sign in to the AWS Management Console
- Ensure you're in the US East (N. Virginia) region (us-east-1)
- Navigate to the AWS DevOps Agent console
- Click Create Agent Space +
- In the Agent Space details section, provide:
  - Name: retail-store-eks-workshop
  - Description (Optional): Add details about the Agent Space's purpose
In the Give this Agent Space AWS resource access section:
- Select Auto-create a new AWS DevOps Agent role
- (Optional) Update the Agent Space role name
Note: You must have IAM permissions to create new roles to use this option.
By default, all CloudFormation stacks and their resources will be discovered. Since this sample uses Terraform (not CloudFormation), you need to add a tag during Agent Space creation so the agent can discover your resources.
In the Include AWS tags section:
- Click Add tag
- Enter:
| Tag Key | Tag Value |
|---|---|
| eksdevopsagent | true |
Important: All resources in this sample are tagged with eksdevopsagent=true. This ensures the DevOps Agent discovers the EKS cluster, services, Aurora databases, DynamoDB tables, and all related infrastructure.
The Web App is where you interact with AWS DevOps Agent for incident investigations.
- Select Auto-create a new AWS DevOps Agent role
- Review the permissions that will be granted to the role
Click Submit and wait for the Agent Space to be created (typically 1-2 minutes).
Once configured, the Configure Web App button should become Admin access. Clicking it should open the Web App and authenticate successfully.
AWS DevOps Agent needs access to the Kubernetes API to describe your Kubernetes cluster objects, retrieve pod logs and cluster events.
⚠️ Important: This step is required for the DevOps Agent to investigate Kubernetes-level issues like pod failures, resource constraints, and deployment problems.
For detailed instructions, refer to the official AWS documentation: AWS EKS access setup
- Open the DevOps Agent Console
- Select your Agent Space:
retail-store-eks-workshop - Navigate to Capabilities → Cloud → Primary Source → Edit
- Follow the setup instructions provided in the console
With EKS access configured, the DevOps Agent can:
| Capability | Description |
|---|---|
| Pod Status | Check running, pending, failed pods |
| Events | View Kubernetes events for troubleshooting |
| Deployments | Examine deployment configurations |
| Services | Check service endpoints and selectors |
| HPA | Monitor autoscaling status |
| Logs | Access pod logs via CloudWatch |
| Resource Usage | View CPU/memory from metrics server |
Error: Access entry already exists
- The entry may have been created during infrastructure deployment
- Verify it has the correct policy attached
Error: Invalid principal ARN
- Ensure you copied the complete ARN including the role name
- Verify the role exists in IAM
Agent still getting 401 errors
- Wait 1-2 minutes for the access entry to propagate
- Verify the policy AmazonEKSClusterAdminPolicy is associated
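To see what is currently configured, you can list the cluster's access entries and the policies attached to the agent's principal; a sketch using the AWS CLI (replace YOUR_ROLE_ARN with the DevOps Agent role ARN, and adjust the cluster name if you customized it):

```bash
# List all access entries on the cluster
aws eks list-access-entries --cluster-name retail-store --region us-east-1

# Show which access policies are associated with a given principal
aws eks list-associated-access-policies \
  --cluster-name retail-store \
  --region us-east-1 \
  --principal-arn YOUR_ROLE_ARN
```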
The Topology view provides a visual map of your system components and their relationships. AWS DevOps Agent automatically builds this topology by analyzing your infrastructure.
- Open your Agent Space in the AWS Console
- Click the Topology tab
- View the automatically discovered resources and relationships
The DevOps Agent automatically detects:
| Relationship Type | Example | How Detected |
|---|---|---|
| Service Dependencies | UI → Catalog | Network traffic analysis, service mesh |
| Database Connections | Orders → Aurora PostgreSQL | Security group rules, connection strings |
| Message Queue Links | Orders → RabbitMQ | Environment variables, connection configs |
| Cache Dependencies | Checkout → Redis | Pod configurations, endpoint references |
Operator access allows your on-call engineers and DevOps team to interact with the AWS DevOps Agent through a dedicated web application.
- From the Agent Space Console
  - Navigate to your Agent Space
  - Click Operator access in the left navigation
  - Click Enable operator access if not already enabled
- Access Methods

  Option A: Direct Console Access
  - Click the Operator access link in your Agent Space
  - This opens the DevOps Agent web app directly
  - Requires AWS Console authentication

  Option B: AWS IAM Identity Center (Recommended for Teams)
  - Configure IAM Identity Center for your organization
  - Create a permission set for DevOps Agent access
  - Assign users/groups to the permission set
  - Users can access via the Identity Center portal
The DevOps Agent interacts with your EKS cluster through:
- Read-Only Kubernetes API Access
  - Lists pods, deployments, services, events
  - Reads pod logs for error analysis
  - Checks resource utilization metrics
- CloudWatch Container Insights
  - Queries container metrics (CPU, memory, network)
  - Analyzes Application Signals data
  - Reviews performance anomalies
- AWS API Calls
  - Describes EKS cluster configuration
  - Checks node group status
  - Reviews security group rules
AWS DevOps Agent includes several safety mechanisms:
| Mechanism | Description |
|---|---|
| Read-Only by Default | The agent only reads data; it does not modify resources |
| Scoped Access | Access is limited to resources within the Agent Space |
| Audit Logging | All agent actions are logged to CloudTrail |
| Investigation Boundaries | Investigations are scoped to specific incidents |
| Human-in-the-Loop | Mitigation recommendations require human approval |
When the DevOps Agent identifies a mitigation:
- Recommendation Generated - Agent proposes a fix (e.g., "Scale up deployment")
- Human Review - Operator reviews the recommendation in the web app
- Approval Required - Operator must explicitly approve any changes
- Implementation Guidance - Agent provides detailed specs for implementation
Important: The DevOps Agent does not automatically make changes to your infrastructure. All mitigations are recommendations that require human approval and manual implementation.
From the Operator Web App:
- Click Start Investigation
- Choose a starting point:
  - Latest alarm - Investigate the most recent CloudWatch alarm
  - High CPU usage - Analyze CPU utilization across resources
  - Error rate spike - Investigate application error increases
  - Custom - Describe the issue in your own words
- Provide investigation details:
  - Investigation details - Describe what you're investigating
  - Date and time - When the incident occurred
  - AWS Account ID - The account containing the affected resources
- Click Start and watch the investigation unfold in real-time
After injecting a fault using the scripts in the Fault Injection Scenarios section, use these prompts to start a DevOps Agent investigation:
| Scenario | Investigation Details |
|---|---|
| Catalog Latency | "Product pages are loading slowly." |
| Network Partition | "Website is unreachable." |
| RDS Security Group Block | "Catalog pod is crashing." |
| Cart Memory Leak | "Intermittent cart failures, pods restarting." |
| DynamoDB Stress Test | "Slow performance and occasional failures." |
Investigation Flow:
Tip: For detailed investigation prompts with specific metrics and starting points, see the "DevOps Agent Investigation Prompts" section under each Fault Injection Scenario.
You can interact with the agent during investigations:
- Ask clarifying questions: "Which logs did you analyze?"
- Provide context: "Focus on the orders namespace"
- Steer the investigation: "Check the RDS connection pool metrics"
- Request AWS Support: Create a support case with one click
For the most up-to-date information about AWS DevOps Agent, refer to the official documentation:
| Resource | URL | Description |
|---|---|---|
| Product Page | https://aws.amazon.com/devops-agent | Overview and sign-up |
| AWS News Blog | Launch Announcement | Detailed walkthrough |
| IAM Reference | Service Authorization Reference | IAM actions and permissions |
- Agent Spaces - Logical boundaries for resource grouping
- Topology - Visual map of infrastructure relationships
- Investigations - Automated root cause analysis sessions
- Mitigations - Recommended fixes with implementation guidance
- Integrations - Connections to observability and CI/CD tools
AWS DevOps Agent integrates with:
Observability Tools:
- Amazon CloudWatch (native)
- Datadog
- Dynatrace
- New Relic
- Splunk
- Grafana/Prometheus (via MCP)
CI/CD & Source Control:
- GitHub Actions
- GitLab CI/CD
Incident Management:
- ServiceNow (native)
- PagerDuty (via webhooks)
- Slack (for notifications)
Custom Tools:
- Bring Your Own MCP Server for custom integrations
- Tag All Resources - Ensure
devopsagent = "true"tag is applied - Enable Container Insights - Already configured in Terraform
- Configure Alarms - Set up CloudWatch alarms for key metrics (see the example sketch after this list)
- Use Fault Injection - Test the agent's investigation capabilities
- Review Recommendations - Learn from the agent's analysis
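The Configure Alarms item above can start with a single CloudWatch alarm on a Container Insights metric; a minimal sketch where the metric choice, threshold, and alarm name are illustrative assumptions and no alarm action is attached:

```bash
# Alarm when average pod CPU utilization in the cluster stays above 80% for 10 minutes
aws cloudwatch put-metric-alarm \
  --alarm-name retail-store-pod-cpu-high \
  --namespace ContainerInsights \
  --metric-name pod_cpu_utilization \
  --dimensions Name=ClusterName,Value=retail-store \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --region us-east-1
```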
This repository includes fault injection scripts for simulating production-like issues during demos and training sessions. These scenarios help demonstrate how DevOps agents and monitoring tools can detect and diagnose real-world problems.
- EKS cluster deployed with the retail store application
kubectlconfigured to access the cluster- AWS CLI configured with appropriate permissions
Navigate to the fault injection directory:
cd fault-injection

Make all fault injection scripts executable:
chmod +x *.sh| Scenario | Inject Script | Rollback Script | Symptom |
|---|---|---|---|
| Catalog Latency | inject-catalog-latency.sh | rollback-catalog.sh | Product pages are loading slowly |
| Network Partition | inject-network-partition.sh | rollback-network-partition.sh | Website is unreachable |
| RDS Security Group Block | inject-rds-sg-block.sh | rollback-rds-sg-block.sh | Catalog pod is crashing |
| Cart Memory Leak | inject-cart-memory-leak.sh | rollback-cart-memory-leak.sh | Intermittent cart failures, pods restarting |
| DynamoDB Stress Test | inject-dynamodb-stress.sh | rollback-dynamodb-stress.sh | Slow performance and occasional failures |
Simulates high latency and CPU stress in the Catalog microservice.
Run the scenario:
# Inject the fault
./fault-injection/inject-catalog-latency.sh
# Rollback
./fault-injection/rollback-catalog.sh

Symptom: Product pages are loading slowly.
Sample Prompt:
"Product pages are loading slowly. The catalog service seems to be responding with high latency. Can you investigate what's causing the performance degradation?"
Blocks ingress traffic to the UI service using Kubernetes NetworkPolicy.
Run the scenario:
# Inject the fault
./fault-injection/inject-network-partition.sh
# Rollback
./fault-injection/rollback-network-partition.sh

Symptom: Website is unreachable.
Sample Prompt:
"The retail store website is completely unreachable. Users are reporting connection timeouts when trying to access the site. Can you investigate the network connectivity issue?"
Simulates an accidental security group change that blocks EKS nodes from connecting to RDS instances.
Run the scenario:
# Inject the fault
./fault-injection/inject-rds-sg-block.sh
# Rollback
./fault-injection/rollback-rds-sg-block.sh

Symptom: Catalog pod is crashing.
Sample Prompt:
"The catalog pod keeps crashing and restarting. It was working fine earlier today but now it can't seem to stay healthy. Can you investigate why the catalog service is failing?"
Simulates a memory leak in the Cart service causing OOMKill and pod restarts.
Run the scenario:
# Inject the fault
./fault-injection/inject-cart-memory-leak.sh
# Rollback
./fault-injection/rollback-cart-memory-leak.sh

Symptom: Intermittent cart failures, pods restarting.
Sample Prompt:
"Users are experiencing intermittent cart failures. Sometimes adding items to cart works, sometimes it doesn't. I've noticed the cart pods are restarting frequently. Can you investigate what's causing the instability?"
Deploys a stress pod that hammers DynamoDB with read requests, causing throttling.
Run the scenario:
# Inject the fault
./fault-injection/inject-dynamodb-stress.sh
# Rollback (instant - no data cleanup needed)
./fault-injection/rollback-dynamodb-stress.sh

Symptom: Slow performance and occasional failures.
Sample Prompt:
"The cart service is experiencing slow performance and occasional failures. Users are complaining about delays when viewing or updating their shopping carts. Can you investigate the DynamoDB-related issues?"
For a training session, follow this workflow:
- Verify baseline - Ensure all services are healthy before injection:

  kubectl get pods -A | grep -E "carts|catalog|orders|checkout|ui"

- Choose a scenario - Select one fault injection scenario from the table above
- Inject the fault - Run the inject script and wait for symptoms to appear
- Observe symptoms - Use monitoring tools (CloudWatch, Prometheus, pod logs) to observe the impact (see the commands after this list)
- Let DevOps Agent investigate - Allow automated investigation to detect root cause
- Rollback - Run the rollback script to restore normal operation
- Verify recovery - Confirm all services return to healthy state
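For the Observe symptoms step, a few commands make the impact visible from a terminal; a sketch where the carts namespace is just an example and depends on the scenario you chose:

```bash
# Watch pod restarts and status changes in the affected namespace
kubectl get pods -n carts -w

# Check live CPU/memory usage (metrics-server is installed as an EKS add-on)
kubectl top pods -n carts

# Show the most recent Kubernetes events across the cluster
kubectl get events -A --sort-by=.lastTimestamp | tail -n 20
```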
When you're finished with the lab, it's important to clean up all AWS resources to avoid ongoing charges. This section provides detailed instructions for completely removing the environment.
Running terraform destroy alone may not remove all resources because:
- AWS GuardDuty automatically creates VPC endpoints and security groups for runtime monitoring - these block subnet/VPC deletion
- CloudWatch Container Insights creates log groups dynamically when the agent starts
- Kubernetes resources (Helm releases, namespaces) can cause provider errors during destroy
The cleanup script handles all of these edge cases by cleaning up AWS auto-provisioned resources BEFORE running terraform destroy.
Use the provided destroy script for a complete cleanup:
# Make the script executable
chmod +x scripts/destroy-environment.sh
# Run the cleanup script (uses defaults: CLUSTER_NAME=retail-store, AWS_REGION=us-east-1)
./scripts/destroy-environment.sh
# Or override defaults
CLUSTER_NAME=my-cluster AWS_REGION=us-west-2 ./scripts/destroy-environment.sh

The script will:
- Get VPC ID for the cluster
- Delete VPC endpoints and GuardDuty security groups (prevents subnet deletion failures)
- Remove Kubernetes resources from Terraform state (prevents provider errors)
- Run terraform destroy to remove all Terraform-managed resources
- Clean up orphaned CloudWatch log groups
If you prefer to clean up manually or need to troubleshoot:
Step 1: Get VPC ID
VPC_ID=$(aws ec2 describe-vpcs --filters "Name=tag:environment-name,Values=retail-store" --query "Vpcs[0].VpcId" --output text --region us-east-1)
echo "VPC ID: $VPC_ID"Step 2: Clean Up GuardDuty Resources FIRST
This must be done BEFORE terraform destroy to prevent subnet deletion failures:
# Delete VPC endpoints created by GuardDuty
ENDPOINTS=$(aws ec2 describe-vpc-endpoints --filters "Name=vpc-id,Values=$VPC_ID" --query "VpcEndpoints[*].VpcEndpointId" --output text --region us-east-1)
for ep in $ENDPOINTS; do
echo "Deleting VPC endpoint: $ep"
aws ec2 delete-vpc-endpoints --vpc-endpoint-ids $ep --region us-east-1
done
# Wait for endpoints to be deleted
sleep 30
# Delete GuardDuty security groups
SG_IDS=$(aws ec2 describe-security-groups --filters "Name=vpc-id,Values=$VPC_ID" "Name=group-name,Values=GuardDuty*" --query "SecurityGroups[*].GroupId" --output text --region us-east-1)
for sg in $SG_IDS; do
echo "Deleting GuardDuty security group: $sg"
aws ec2 delete-security-group --group-id $sg --region us-east-1
done

Step 3: Remove Kubernetes Resources from Terraform State
cd terraform/eks/default
# Remove aws_auth ConfigMap
terraform state rm 'kubernetes_config_map_v1_data.aws_auth' 2>/dev/null || true
# Remove Helm releases (prevents Kubernetes provider errors)
terraform state rm 'helm_release.ui' 'helm_release.catalog' 'helm_release.carts' 'helm_release.orders' 'helm_release.checkout' 2>/dev/null || true
# Remove Kubernetes namespaces
terraform state rm 'kubernetes_namespace.ui' 'kubernetes_namespace.catalog' 'kubernetes_namespace.carts' 'kubernetes_namespace.orders' 'kubernetes_namespace.checkout' 'kubernetes_namespace.rabbitmq' 2>/dev/null || true

Step 4: Run Terraform Destroy
terraform destroy -auto-approve

Step 5: Final VPC Cleanup (if needed)
# Try to delete VPC if it still exists
aws ec2 delete-vpc --vpc-id $VPC_ID --region us-east-1 2>/dev/null || true

Step 6: Clean Up CloudWatch Log Groups
# Delete Container Insights log groups
for lg in $(aws logs describe-log-groups --log-group-name-prefix /aws/containerinsights/retail-store --query "logGroups[*].logGroupName" --output text --region us-east-1); do
echo "Deleting log group: $lg"
aws logs delete-log-group --log-group-name "$lg" --region us-east-1
done
# Delete EKS cluster log groups
for lg in $(aws logs describe-log-groups --log-group-name-prefix /aws/eks/retail-store --query "logGroups[*].logGroupName" --output text --region us-east-1); do
echo "Deleting log group: $lg"
aws logs delete-log-group --log-group-name "$lg" --region us-east-1
done

After running the cleanup, verify all resources are removed:
# Check for remaining EKS clusters
aws eks list-clusters --region us-east-1
# Check for remaining VPCs with retail-store tag
aws ec2 describe-vpcs --filters "Name=tag:environment-name,Values=retail-store" --region us-east-1
# Check for remaining CloudWatch log groups
aws logs describe-log-groups --log-group-name-prefix /aws/containerinsights/retail-store --region us-east-1
aws logs describe-log-groups --log-group-name-prefix /aws/eks/retail-store --region us-east-1
# Check Terraform state is empty
cd terraform/eks/default
terraform state list

Issue: VPC deletion hangs or fails
- Cause: GuardDuty or other AWS services created resources in the VPC
- Solution: Use the cleanup script or manually delete VPC endpoints and security groups first
Issue: Terraform provider errors during destroy
- Cause: Kubernetes provider can't connect to deleted cluster
- Solution: Remove Kubernetes resources from state before destroying (Step 3 above)
Issue: Log groups still exist after destroy
- Cause: Container Insights creates log groups outside of Terraform
- Solution: Manually delete using AWS CLI (Step 6 above)
Issue: "resource not found" errors
- Cause: Resource was already deleted manually or by another process
- Solution: These errors are safe to ignore; the resource is already gone
See CONTRIBUTING for more information.
This project is licensed under the MIT-0 License.
This package depends on and may incorporate or retrieve a number of third-party software packages (such as open source packages) at install-time or build-time or run-time ("External Dependencies"). The External Dependencies are subject to license terms that you must accept in order to use this package. If you do not accept all of the applicable license terms, you should not use this package. We recommend that you consult your company's open source approval policy before proceeding.
Provided below is a list of External Dependencies and the applicable license identification as indicated by the documentation associated with the External Dependencies as of Amazon's most recent review.
THIS INFORMATION IS PROVIDED FOR CONVENIENCE ONLY. AMAZON DOES NOT PROMISE THAT THE LIST OR THE APPLICABLE TERMS AND CONDITIONS ARE COMPLETE, ACCURATE, OR UP-TO-DATE, AND AMAZON WILL HAVE NO LIABILITY FOR ANY INACCURACIES. YOU SHOULD CONSULT THE DOWNLOAD SITES FOR THE EXTERNAL DEPENDENCIES FOR THE MOST COMPLETE AND UP-TO-DATE LICENSING INFORMATION.
YOUR USE OF THE EXTERNAL DEPENDENCIES IS AT YOUR SOLE RISK. IN NO EVENT WILL AMAZON BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, CONSEQUENTIAL, SPECIAL, INCIDENTAL, OR PUNITIVE DAMAGES (INCLUDING FOR ANY LOSS OF GOODWILL, BUSINESS INTERRUPTION, LOST PROFITS OR DATA, OR COMPUTER FAILURE OR MALFUNCTION) ARISING FROM OR RELATING TO THE EXTERNAL DEPENDENCIES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, EVEN IF AMAZON HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. THESE LIMITATIONS AND DISCLAIMERS APPLY EXCEPT TO THE EXTENT PROHIBITED BY APPLICABLE LAW.
| Dependency | License |
|---|---|
| MariaDB Community Edition | LICENSE |
| MySQL Community Edition | LICENSE |





