🚀 Ready to deploy? Skip to the Deployment section to start deploying the infrastructure with Terraform.
⚠️ Disclaimer: This repository includes intentional fault injection and stress test scenarios designed to demonstrate the AWS DevOps Agent's investigation capabilities. These scripts deliberately introduce issues such as memory leaks, network partitions, database stress, and service latency. Do not run these scripts in production environments. They are intended for learning and demonstration purposes only.
📦 Source Code: The source code for the Retail Store Sample Application can be found at: https://github.com/aws-containers/retail-store-sample-app
- Getting Started
- Lab Introduction & Goals
- Architecture Overview
- Observability Stack
- 🚀 Deployment
- Application Access
- AWS DevOps Agent Integration
- Fault Injection Scenarios
- Cleanup
If you don't have Git installed, install it first:
# Linux (Debian/Ubuntu)
sudo apt-get update && sudo apt-get install git
# Linux (RHEL/CentOS/Amazon Linux)
sudo yum install git
# macOS (using Homebrew)
brew install git
# Verify installation
git --version

# Clone the repository
git clone https://github.com/aws-samples/AmazonEKS-Devops-agent-sample.git
# Navigate to the project directory
cd AmazonEKS-Devops-agent-sample

🔧 Troubleshooting Git Clone Issues? If you're encountering issues with git clone, you can download the repository as a ZIP file instead:
- Navigate to the repository in your browser: https://github.com/aws-samples/AmazonEKS-Devops-agent-sample
- Click the Code button → Download ZIP
- Extract the ZIP file to your desired location:
unzip AmazonEKS-Devops-agent-sample-main.zip
cd AmazonEKS-Devops-agent-sample-main
This hands-on lab demonstrates how to deploy, operate, and troubleshoot a production-grade microservices application on Amazon EKS using the AWS DevOps Agent. You'll gain practical experience with real-world scenarios including fault injection, observability, and automated incident investigation.
- Deploy the EKS Cluster with Retail Sample App - Deploy a complete microservices architecture to Amazon EKS using Terraform, including all backend dependencies and observability tooling.
- Understand the Microservices Architecture - Explore how the five core microservices (UI, Catalog, Carts, Orders, Checkout) interact with each other and their backend dependencies.
- Work with AWS Managed Backend Services - Configure and operate production-grade AWS services that power the application.
- Experience Observability in Action - Use CloudWatch Container Insights, Application Signals, Amazon Managed Prometheus, and Amazon Managed Grafana to monitor application health and performance.
- Leverage the AWS DevOps Agent - See how the DevOps Agent automatically detects, investigates, and helps resolve infrastructure and application issues.
The Retail Store Sample App is a deliberately over-engineered e-commerce application designed to demonstrate microservices patterns and AWS service integrations:
The following diagram shows how the microservices communicate with each other and their backend data stores:
The comprehensive observability stack provides full visibility into application and infrastructure health:
Note: An editable Draw.io version of the architecture diagram is available at docs/retail-store-architecture.drawio
| Component | Language | Container Image | Helm Chart | Description |
|---|---|---|---|---|
| UI | Java | Link | Link | Store user interface |
| Catalog | Go | Link | Link | Product catalog API |
| Cart | Java | Link | Link | User shopping carts API |
| Orders | Java | Link | Link | User orders API |
| Checkout | Node | Link | Link | API to orchestrate the checkout process |
The services communicate using synchronous HTTP REST calls within the Kubernetes cluster:
| Source | Target | Protocol | Endpoint | Purpose |
|---|---|---|---|---|
| UI | Catalog | HTTP | http://catalog.catalog.svc:80 | Fetch product listings and details |
| UI | Carts | HTTP | http://carts.carts.svc:80 | Manage shopping cart operations |
| UI | Orders | HTTP | http://orders.orders.svc:80 | View order history and status |
| UI | Checkout | HTTP | http://checkout.checkout.svc:80 | Initiate checkout process |
| Checkout | Orders | HTTP | http://orders.orders.svc:80 | Create new orders |
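A quick way to confirm this east-west connectivity from inside the cluster is to call one of these endpoints from the UI pod. A minimal sketch, assuming the UI container image ships with curl and that the Catalog service exposes a health endpoint at /health (both are assumptions that may differ in your deployment):

```bash
# Call the Catalog service from inside a UI pod and print only the HTTP status code
kubectl exec -n ui deploy/ui -- \
  curl -s -o /dev/null -w "%{http_code}\n" http://catalog.catalog.svc:80/health
```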
The Terraform modules in this repository provision the following AWS resources:
Compute & Orchestration:
- Amazon EKS (v1.34) - Kubernetes cluster with EKS Auto Mode enabled
- General-purpose and system node pools
- Network Policy Controller enabled
- All control plane logging (API, audit, authenticator, controller manager, scheduler)
Networking:
- Amazon VPC - Custom VPC with public/private subnets across 3 AZs
- NAT Gateway for private subnet internet access
- VPC Flow Logs with 30-day retention
- Kubernetes-tagged subnets for ELB integration
Databases:
- Amazon Aurora MySQL (v8.0) - Catalog service database
- db.t3.medium instance class
- Storage encryption enabled
- Amazon Aurora PostgreSQL (v15.10) - Orders service database
- db.t3.medium instance class
- Storage encryption enabled
- Amazon DynamoDB - Carts service NoSQL database
- Global secondary index on customerId
- On-demand capacity mode
Messaging & Caching:
- Amazon MQ (RabbitMQ) (v3.13) - Message broker for Orders service
- mq.t3.micro instance type
- Single-instance deployment
- Amazon ElastiCache (Redis) - Session/cache store for Checkout service
- cache.t3.micro instance type
Observability Stack:
- Amazon CloudWatch Container Insights - Enhanced container monitoring with Application Signals
- Amazon Managed Service for Prometheus (AMP) - Metrics collection and storage
- EKS Managed Prometheus Scraper
- Scrapes: API server, kubelet, cAdvisor, kube-state-metrics, node-exporter, application pods
- Amazon Managed Grafana - Visualization and dashboards
- Prometheus, CloudWatch, and X-Ray data sources
- AWS X-Ray - Distributed tracing
- Network Flow Monitoring Agent - Container network observability
EKS Add-ons:
- metrics-server
- kube-state-metrics
- prometheus-node-exporter
- aws-efs-csi-driver
- aws-secrets-store-csi-driver-provider
- amazon-cloudwatch-observability (with Application Signals)
- aws-network-flow-monitoring-agent
- cert-manager
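Once the cluster is up, you can confirm which of these managed add-ons were actually installed with the AWS CLI; a quick check assuming the default cluster name retail-store and region us-east-1:

```bash
# List the EKS managed add-ons installed on the cluster
aws eks list-addons --cluster-name retail-store --region us-east-1

# Inspect a single add-on, e.g. the CloudWatch observability add-on
aws eks describe-addon \
  --cluster-name retail-store \
  --region us-east-1 \
  --addon-name amazon-cloudwatch-observability
```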
The Retail Store Sample App includes a comprehensive observability stack that provides full visibility into application and infrastructure health. This section details the instrumentation, metrics collection, and visualization capabilities.
Each microservice is instrumented for observability:
| Service | Language | Prometheus Metrics | OpenTelemetry Tracing | Application Signals |
|---|---|---|---|---|
| UI | Java | ✅ /actuator/prometheus | ✅ OTLP | ✅ Auto-instrumented |
| Catalog | Go | ✅ /metrics | ✅ OTLP | ❌ (Go not supported) |
| Carts | Java | ✅ /actuator/prometheus | ✅ OTLP | ✅ Auto-instrumented |
| Orders | Java | ✅ /actuator/prometheus | ✅ OTLP | ✅ Auto-instrumented |
| Checkout | Node.js | ✅ /metrics | ✅ OTLP | ✅ Auto-instrumented |
Application Signals Auto-Instrumentation: Java and Node.js services are automatically instrumented via pod annotations:
# Java services (UI, Carts, Orders)
instrumentation.opentelemetry.io/inject-java: "true"
# Node.js services (Checkout)
instrumentation.opentelemetry.io/inject-nodejs: "true"

Note: The Catalog service (Go) does not support Application Signals auto-instrumentation. It uses manual OpenTelemetry SDK instrumentation.
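To confirm the injection actually happened on a running pod, you can inspect its annotations and init containers; a minimal sketch (the app.kubernetes.io/name=ui label selector is an assumption and may differ from the Helm chart's labels):

```bash
# Look for the OpenTelemetry injection annotations on the UI pods
kubectl describe pods -n ui -l app.kubernetes.io/name=ui | grep -i "instrumentation.opentelemetry.io"

# Auto-instrumentation adds an init container that copies the agent into the pod
kubectl get pods -n ui -l app.kubernetes.io/name=ui \
  -o jsonpath='{.items[0].spec.initContainers[*].name}'
```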
Container Insights provides enhanced observability for EKS clusters with the following capabilities:
Metrics Collected:
- Container CPU/memory utilization and limits
- Pod network I/O (bytes received/transmitted)
- Container restart counts
- Cluster, node, and pod-level aggregations
Application Signals Features:
- Automatic service map generation
- Request latency percentiles (p50, p95, p99)
- Error rates and HTTP status code distribution
- Service dependency visualization
- SLO monitoring and alerting
Application Signals provides Application Performance Monitoring (APM) capabilities for your microservices. Four of the five services are auto-instrumented:
| Service | Language | Auto-Instrumented | APM Features |
|---|---|---|---|
| UI | Java | ✅ Yes | Traces, metrics, service map |
| Carts | Java | ✅ Yes | Traces, metrics, service map |
| Orders | Java | ✅ Yes | Traces, metrics, service map |
| Checkout | Node.js | ✅ Yes | Traces, metrics, service map |
| Catalog | Go | ❌ No | Manual OTEL instrumentation only |
Accessing Application Signals Console:
- Open the CloudWatch Console
- In the left navigation, click Application Signals → Services
- You will see the 4 instrumented services listed:
  - ui (Java)
  - carts (Java)
  - orders (Java)
  - checkout (Node.js)
Key APM Features in Application Signals:
- Service Map: Visual representation of service dependencies and traffic flow
  - Navigate to Application Signals → Service Map
  - See real-time connections between UI → Catalog, UI → Carts, Checkout → Orders, etc.
- Service Details: Click on any service to view:
  - Request rate (requests/second)
  - Latency percentiles (p50, p95, p99)
  - Error rate and fault rate
  - Top operations and endpoints
- Traces: Distributed tracing across services
  - Navigate to Application Signals → Traces
  - Filter by service, operation, or latency
  - View end-to-end request flow across microservices
- SLO Monitoring: Set Service Level Objectives
  - Define availability and latency targets
  - Get alerts when SLOs are breached
Note: The Catalog service (Go) does not appear in Application Signals because Go auto-instrumentation is not supported. However, it still sends traces via manual OpenTelemetry SDK instrumentation visible in X-Ray.
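If you want to confirm that traces (including the Catalog service's manually instrumented spans) are arriving, you can pull recent trace summaries with the AWS CLI; a quick sketch where the region and the 10-minute window are assumptions:

```bash
# List trace summaries from the last 10 minutes (Linux date syntax)
aws xray get-trace-summaries \
  --region us-east-1 \
  --start-time $(date -d '10 minutes ago' +%s) \
  --end-time $(date +%s) \
  --query 'TraceSummaries[].{Id:Id,Duration:Duration,Url:Http.HttpURL}' \
  --output table
```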
Container Logs Collection:
Container logs from all pods are automatically collected by Fluent Bit and sent to CloudWatch Logs. The logs are organized into the following log groups:
| Log Group | Description |
|---|---|
| /aws/containerinsights/retail-store/application | Application container logs (stdout/stderr) from all pods |
| /aws/containerinsights/retail-store/dataplane | Kubernetes dataplane component logs |
| /aws/containerinsights/retail-store/host | Node-level host logs |
| /aws/containerinsights/retail-store/performance | Performance metrics in log format |
Viewing Container Logs:
# View recent logs for a specific service using CloudWatch Logs Insights
aws logs start-query \
--log-group-name "/aws/containerinsights/retail-store/application" \
--start-time $(date -d '1 hour ago' +%s) \
--end-time $(date +%s) \
--query-string 'fields @timestamp, @message | filter kubernetes.namespace_name = "catalog" | sort @timestamp desc | limit 50'
# Or use kubectl for real-time logs
kubectl logs -n catalog -l app.kubernetes.io/name=catalog --tail=100 -f

Log Structure: Each log entry includes Kubernetes metadata for easy filtering:
- kubernetes.pod_name - Pod name
- kubernetes.namespace_name - Namespace
- kubernetes.container_name - Container name
- kubernetes.host - Node instance ID
- log_processed - Parsed JSON log content (if applicable)
Access Container Insights:
- Open CloudWatch Console
- Navigate to Container Insights → Performance monitoring
- Select your EKS cluster from the dropdown
- Explore metrics by: Cluster, Namespace, Service, Pod, or Container
- For logs, navigate to Logs → Log groups → /aws/containerinsights/retail-store/application
AMP provides a fully managed Prometheus-compatible monitoring service.
Metrics Scrape Configuration:
The EKS Managed Prometheus Scraper collects metrics from multiple sources:
Key Metrics Available:
| Source | Metrics | Use Case |
|---|---|---|
| kube-state-metrics | kube_pod_status_phase, kube_deployment_status_replicas | Kubernetes object states |
| node-exporter | node_cpu_seconds_total, node_memory_MemAvailable_bytes | Node hardware/OS metrics |
| cAdvisor | container_cpu_usage_seconds_total, container_memory_usage_bytes | Container resource usage |
| API Server | apiserver_request_total, apiserver_request_duration_seconds | Control plane performance |
| Application Pods | Custom application metrics | Business and application KPIs |
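You can also query AMP directly, without Grafana, by signing a PromQL request with SigV4. A minimal sketch using the open-source awscurl tool (assumes awscurl is installed, e.g. pip install awscurl, and uses the prometheus_workspace_endpoint Terraform output referenced later in this guide):

```bash
# Query the AMP workspace for the 'up' metric across all scrape targets
AMP_ENDPOINT=$(terraform output -raw prometheus_workspace_endpoint)

# Append /api/v1/query to the workspace endpoint; drop the extra slash if your endpoint already ends with one
awscurl --service aps --region us-east-1 "${AMP_ENDPOINT}/api/v1/query?query=up"
```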
📌 Optional: Amazon Managed Grafana is optional for this lab. The primary focus is on the AWS DevOps Agent, which automatically analyzes metrics from CloudWatch and Prometheus. Configure Grafana only if you want to manually review and visualize metrics through custom dashboards.
Grafana provides visualization and dashboarding for all collected metrics.
Pre-configured Data Sources:
- Prometheus - AMP workspace for Kubernetes and application metrics
- CloudWatch - AWS service metrics (RDS, DynamoDB, ElastiCache, etc.)
- X-Ray - Distributed traces and service maps
Accessing Grafana:
- Get the Grafana workspace URL from Terraform output:
terraform output grafana_workspace_endpoint
- Sign in using AWS IAM Identity Center (SSO)
- Navigate to Dashboards to view pre-built visualizations
Configuring the Prometheus Data Source:
The Prometheus data source must be manually configured in Grafana to query metrics from Amazon Managed Prometheus (AMP).
- Get your AMP workspace endpoint:

  terraform output prometheus_workspace_endpoint

- In Grafana, navigate to Connections → Data sources → Add data source → Prometheus
- Configure the data source with these settings:
  - Name: Amazon Managed Prometheus (or your preferred name)
  - URL: Your AMP workspace endpoint (e.g., https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)

  Note: The Prometheus endpoint URL is unique to your deployment. Get it from the Terraform output above.

- Under Authentication, enable SigV4 auth:
  - Toggle SigV4 auth to ON
  - Default Region: us-east-1 (or your deployment region)
  - Leave Assume Role ARN empty (Grafana uses its workspace IAM role automatically)
- Under HTTP Method, select POST
- Click Save & test to verify the connection

Troubleshooting: If you receive a 403 Forbidden error, ensure SigV4 auth is enabled. Amazon Managed Grafana automatically uses its workspace IAM role for authentication - no manual credentials are needed.
Recommended Dashboards to Import:
How to Import a Dashboard:
- In Grafana, click Dashboards in the left sidebar
- Click New → Import
- Enter the Grafana ID from the table below in the "Import via grafana.com" field
- Click Load
- Select your Prometheus data source (the one you configured above)
- Click Import
The dashboard will be added to your Grafana instance and start displaying metrics immediately.
| Dashboard | Grafana ID | Description |
|---|---|---|
| Control Plane | ||
| Kubernetes API Server | 15761 | API server request rates, latencies, and error rates |
| etcd | 3070 | etcd cluster health, leader elections, and disk I/O |
| Kubernetes Controller Manager | 12122 | Controller work queue depths and reconciliation metrics |
| Kubernetes Scheduler | 12123 | Scheduler latency, pending pods, and preemption metrics |
| Kube State Metrics | ||
| Kubernetes Cluster (via kube-state-metrics) | 13332 | Comprehensive cluster state overview |
| Kubernetes Deployment Statefulset Daemonset | 8588 | Workload replica status and rollout progress |
| Kubernetes Resource Requests vs Limits | 13770 | Resource allocation vs actual usage |
| Kubernetes Pod Status | 15759 | Pod phase distribution and container states |
| Node Exporter | ||
| Node Exporter Full | 1860 | Comprehensive node hardware and OS metrics |
| Node Exporter for Prometheus | 11074 | Simplified node metrics overview |
| Node Problem Detector | 15549 | Node conditions and kernel issues |
| Network & Conntrack | ||
| Kubernetes Networking | 12125 | Pod and service network traffic |
| Node Network and Conntrack | 14996 | Connection tracking table usage and network stats |
| CoreDNS | 14981 | DNS query rates, latencies, and cache hit ratios |
| General Kubernetes | ||
| Kubernetes Cluster Monitoring | 315 | Cluster-wide resource utilization |
| Kubernetes Pods | 6336 | Pod-level metrics and logs |
| Kubernetes Namespace Resources | 14678 | Per-namespace resource consumption |
| AWS RDS | 707 | RDS database performance |
| AWS DynamoDB | 12637 | DynamoDB table metrics |
Node Exporter exposes hardware and OS-level metrics from each Kubernetes node.
Key Metrics:
- node_cpu_seconds_total - CPU time spent in each mode
- node_memory_MemTotal_bytes - Total memory
- node_memory_MemAvailable_bytes - Available memory
- node_filesystem_size_bytes - Filesystem size
- node_network_receive_bytes_total - Network bytes received
- node_load1, node_load5, node_load15 - System load averages
Useful PromQL Queries:
# CPU utilization percentage
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory utilization percentage
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# Disk utilization percentage
100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
Kube State Metrics generates metrics about the state of Kubernetes objects.
Key Metrics:
- kube_pod_status_phase - Pod phase (Pending, Running, Succeeded, Failed, Unknown)
- kube_pod_container_status_restarts_total - Container restart count
- kube_deployment_status_replicas_available - Available replicas
- kube_node_status_condition - Node conditions (Ready, MemoryPressure, DiskPressure)
- kube_horizontalpodautoscaler_status_current_replicas - HPA current replicas
Useful PromQL Queries:
# Pods not in Running state
kube_pod_status_phase{phase!="Running",phase!="Succeeded"} == 1
# Deployments with unavailable replicas
kube_deployment_status_replicas_unavailable > 0
# Container restarts in last hour
increase(kube_pod_container_status_restarts_total[1h]) > 0
The Network Flow Monitoring Agent provides container network observability.
Capabilities:
- Service-to-service traffic flow visualization
- Network latency between pods
- Packet loss detection
- TCP connection metrics
- Network policy effectiveness monitoring
Access Network Flow Insights:
- Open CloudWatch Console
- Navigate to Network Monitoring → Network Flow Monitor
- View traffic flows between services in the retail store application
OpenTelemetry provides distributed tracing across all microservices.
Configuration:
# OTEL Instrumentation settings (from Terraform)
OTEL_SDK_DISABLED: "false"
OTEL_EXPORTER_OTLP_PROTOCOL: "http/protobuf"
OTEL_RESOURCE_PROVIDERS_AWS_ENABLED: "true"
OTEL_METRICS_EXPORTER: "none" # Metrics via Prometheus
OTEL_JAVA_GLOBAL_AUTOCONFIGURE_ENABLED: "true"

Trace Propagation:
- W3C Trace Context (tracecontext)
- W3C Baggage (baggage)
Sampling: Always-on sampling for complete trace visibility
CloudWatch Container Insights:
# Get cluster name
CLUSTER_NAME=$(terraform output -raw cluster_name)
# View in AWS Console
echo "https://console.aws.amazon.com/cloudwatch/home#container-insights:infrastructure"Amazon Managed Grafana:
# Get Grafana endpoint
terraform output grafana_workspace_endpointPrometheus Queries (via Grafana):
# Get AMP workspace endpoint
terraform output prometheus_workspace_endpointThe AWS DevOps Agent leverages the comprehensive observability stack to automatically investigate and diagnose issues:
- Resource Discovery - All resources are tagged with devopsagent = "true", enabling automatic discovery of related infrastructure components.
- Metrics Correlation - The agent queries Amazon Managed Prometheus and CloudWatch to identify anomalies in:
  - Pod CPU/memory utilization
  - Request latency (p50, p95, p99)
  - Error rates and HTTP status codes
  - Database connection pools and query performance
- Log Analysis - CloudWatch Logs from EKS control plane and application pods are analyzed for:
  - Error patterns and stack traces
  - Connection timeouts and failures
  - Resource exhaustion warnings
- Trace Investigation - X-Ray traces help identify:
  - Slow service dependencies
  - Failed downstream calls
  - Latency bottlenecks in the request path
- Network Insights - Network Flow Monitoring reveals:
  - Traffic patterns between services
  - Network policy violations
  - Connectivity issues
When you inject faults using the provided scripts, the DevOps Agent can automatically detect symptoms, correlate signals across the observability stack, and provide root cause analysis with remediation recommendations.
Before deploying and running fault injection scenarios, install the following tools:
# Linux (x86_64)
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# macOS
curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
sudo installer -pkg AWSCLIV2.pkg -target /
# Verify installation
aws --version

Configure AWS credentials:

aws configure
# Enter your AWS Access Key ID, Secret Access Key, and default region (us-east-1)

# Linux/macOS using tfenv (recommended)
git clone https://github.com/tfutils/tfenv.git ~/.tfenv
echo 'export PATH="$HOME/.tfenv/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
tfenv install 1.5.0
tfenv use 1.5.0
# Or direct installation (Linux)
wget https://releases.hashicorp.com/terraform/1.5.0/terraform_1.5.0_linux_amd64.zip
unzip terraform_1.5.0_linux_amd64.zip
sudo mv terraform /usr/local/bin/
# macOS using Homebrew
brew tap hashicorp/tap
brew install hashicorp/tap/terraform
# Verify installation
terraform --version

# Linux
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/
# macOS using Homebrew
brew install kubectl
# Verify installation
kubectl version --client

# Linux/macOS
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
# macOS using Homebrew
brew install helm
# Verify installation
helm version

📌 Optional: Amazon Managed Grafana is disabled by default in this deployment. Grafana requires AWS IAM Identity Center (SSO) to be configured, and if SSO is not set up, the Terraform deployment will fail. The AWS DevOps Agent does not require Grafana - it directly queries CloudWatch and Prometheus for automated analysis.
To enable Grafana, you must:
- First configure AWS IAM Identity Center in your account
- Set enable_grafana = true in your Terraform variables
Setup Guide: Enable IAM Identity Center for Amazon Managed Grafana
Quick Steps to Enable Grafana:
- Open the IAM Identity Center console
- Click Enable if not already enabled
- Create users or groups that will access Grafana
- Deploy with Grafana enabled:

  terraform apply -var="enable_grafana=true"

- After deployment, assign yourself as Grafana admin:
  - Go to Amazon Managed Grafana console
  - Choose All workspaces from the left navigation
  - Select the retail-store-grafana workspace
  - Choose the Authentication tab
  - Choose Configure users and user groups
  - Select the checkbox next to your SSO user and choose Assign user
  - Select your user and choose Make admin
For detailed instructions, see Manage user and group access to Amazon Managed Grafana workspaces
Navigate to the EKS deployment directory:
cd terraform/eks/default

When you run terraform apply, the following resources will be provisioned:
EKS Cluster & Compute:
- Amazon EKS cluster (v1.34) with EKS Auto Mode enabled
- IAM roles for cluster and node management
- EKS managed add-ons (metrics-server, kube-state-metrics, prometheus-node-exporter, etc.)
Networking:
- New VPC with public and private subnets across 3 Availability Zones
- NAT Gateway for private subnet internet access
- VPC Flow Logs for network traffic analysis
- Security groups for all components
Application Dependencies:
- Amazon DynamoDB - Table for Carts service with GSI on customerId
- Amazon Aurora MySQL - Database for Catalog service
- Amazon Aurora PostgreSQL - Database for Orders service
- Amazon MQ (RabbitMQ) - Message broker for Orders service
- Amazon ElastiCache (Redis) - Cache for Checkout service
- Application Load Balancer - Managed by EKS Auto Mode for ingress
Observability Stack:
- Amazon CloudWatch Container Insights with Application Signals
- Amazon Managed Service for Prometheus (AMP) with EKS scraper
- Amazon Managed Grafana workspace (optional, requires enable_grafana = true and AWS SSO)
- Network Flow Monitoring Agent
Retail Store Application:
- All five microservices (UI, Catalog, Carts, Orders, Checkout) deployed to dedicated namespaces
# 1. Navigate to the full EKS deployment directory
cd terraform/eks/default
# 2. Initialize Terraform (downloads providers and modules)
terraform init
# 3. Review the execution plan
# This shows all resources that will be created
terraform plan
# 4. Apply the configuration
# Type 'yes' when prompted to confirm
# This takes approximately 20-30 minutes
terraform apply
# 5. Note the outputs - you'll need these for kubectl configuration
# Look for: cluster_name, region, and any endpoint URLs
terraform output

Optional: Customize Cluster Name and Region
By default, the cluster is named retail-store and deployed to us-east-1. You can customize these values:
# Deploy with custom cluster name and region
terraform apply -var="cluster_name=my-retail-cluster" -var="region=us-west-2"
# Or create a terraform.tfvars file for persistent configuration
cat > terraform.tfvars <<EOF
cluster_name = "my-retail-cluster"
region = "us-west-2"
EOF
terraform apply

| Variable | Default | Description |
|---|---|---|
| cluster_name | retail-store | Name of the EKS cluster |
| region | us-east-1 | AWS region for deployment |
| enable_grafana | false | Enable Amazon Managed Grafana (requires AWS SSO) |
Optional: Enable Amazon Managed Grafana
⚠️ Important: Grafana requires AWS IAM Identity Center (SSO) to be configured in your account. If SSO is not set up, Terraform will fail when enable_grafana=true. See Prerequisites - AWS IAM Identity Center for setup instructions.
# To deploy with Grafana enabled (requires AWS SSO):
terraform apply -var="enable_grafana=true"

Important: After the EKS cluster is created, you must manually add your IAM role to the cluster's access entries. Terraform does not configure this automatically.
Steps to add your IAM role:
- Open the Amazon EKS Console
- Select your cluster (default name: retail-store, or your custom cluster_name)
- Click Create access entry
- Configure the access entry:
- IAM principal ARN: Enter your IAM role ARN (any IAM user or role with required permissions, or an admin user)
- Type: Standard
- Click Next
- Add access policy:
  - Policy name: AmazonEKSClusterAdminPolicy
  - Access scope: Cluster
- Click Create
Alternative: Using AWS CLI
# Get your current IAM identity
aws sts get-caller-identity
# Get cluster name from Terraform output (or use your custom name)
CLUSTER_NAME=$(terraform output -raw cluster_name)
# Create access entry (replace YOUR_ROLE_ARN with your actual role ARN)
aws eks create-access-entry \
--cluster-name $CLUSTER_NAME \
--principal-arn YOUR_ROLE_ARN \
--type STANDARD
# Associate the admin policy
aws eks associate-access-policy \
--cluster-name $CLUSTER_NAME \
--principal-arn YOUR_ROLE_ARN \
--policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy \
  --access-scope type=cluster

All AWS resources created by this Terraform deployment are tagged with:
devopsagent = "true"
This tag enables the AWS DevOps Agent to automatically discover and monitor resources associated with this retail store application. The agent uses this tag to:
- Identify resources for automated investigation during incidents
- Correlate related resources across EKS, RDS, DynamoDB, and other AWS services
- Scope troubleshooting and root cause analysis to the correct infrastructure
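A quick way to see everything that is discoverable through this tag is the Resource Groups Tagging API; a minimal sketch, assuming region us-east-1:

```bash
# List the ARNs of all resources tagged devopsagent=true
aws resourcegroupstaggingapi get-resources \
  --region us-east-1 \
  --tag-filters Key=devopsagent,Values=true \
  --query 'ResourceTagMappingList[].ResourceARN' \
  --output table
```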
After the EKS cluster is deployed, configure kubectl to access it:
# Update kubeconfig using Terraform outputs
aws eks update-kubeconfig \
--name $(terraform output -raw cluster_name) \
--region $(terraform output -raw region)
# Or manually specify your cluster name and region
aws eks update-kubeconfig --name retail-store --region us-east-1
# Verify cluster access
kubectl get nodes
# Verify all pods are running
kubectl get pods -A

# Check all retail store services are running
kubectl get pods -A | grep -E "carts|catalog|orders|checkout|ui"
# Get the UI Ingress URL (ALB)
kubectl get ingress -n ui

The Retail Sample App UI is exposed via an AWS Application Load Balancer (ALB) created automatically by the AWS Load Balancer Controller.
After deployment, get the ALB URL from Terraform output:
# Get the application URL
terraform output retail_app_url
# Or get it directly from the Ingress resource
kubectl get ingress -n ui ui

The ALB URL will look like: http://k8s-ui-ui-xxxxxxxxxx-xxxxxxxxxx.us-east-1.elb.amazonaws.com
Note: It may take 2-3 minutes for the ALB to be provisioned and become healthy after deployment.
# Check all retail store services are running
kubectl get pods -A | grep -E "carts|catalog|orders|checkout|ui"
# Check the UI Ingress status
kubectl get ingress -n ui
# Verify the ALB target group is healthy
kubectl describe ingress ui -n ui

| Issue | Cause | Solution |
|---|---|---|
| ALB URL returns 503 | Target group unhealthy | Check pod health: kubectl get pods -n ui |
| ALB not provisioned | AWS LB Controller issue | Check controller logs: kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller |
| Ingress has no address | ALB still provisioning | Wait 2-3 minutes and check again |
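As a final smoke test once the targets are healthy, hit the storefront URL from your workstation; a quick sketch using the retail_app_url Terraform output shown above:

```bash
# Expect "HTTP 200" once the ALB and UI pods are healthy
curl -s -o /dev/null -w "HTTP %{http_code}\n" "$(terraform output -raw retail_app_url)"
```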
AWS DevOps Agent is a frontier AI agent that helps accelerate incident response and improve system reliability. It automatically correlates data across your operational toolchain, identifies probable root causes, and recommends targeted mitigations. This section provides step-by-step guidance for integrating the DevOps Agent with your EKS-based Retail Store deployment.
Note: AWS DevOps Agent is currently in public preview and available in the US East (N. Virginia) Region (us-east-1). While the agent runs in us-east-1, it can monitor applications deployed in any AWS Region.
An Agent Space defines the tools and infrastructure that AWS DevOps Agent has access to.
For more details, see the AWS DevOps Agent documentation.
- Sign in to the AWS Management Console
- Ensure you're in the US East (N. Virginia) region (us-east-1)
- Navigate to the AWS DevOps Agent console
- Click Create Agent Space +
- In the Agent Space details section, provide:
  - Name: retail-store-eks-workshop
  - Description (Optional): Add details about the Agent Space's purpose
In the Give this Agent Space AWS resource access section:
- Select Auto-create a new AWS DevOps Agent role
- (Optional) Update the Agent Space role name
Note: You must have IAM permissions to create new roles to use this option.
By default, all CloudFormation stacks and their resources will be discovered. Since this sample uses Terraform (not CloudFormation), you need to add a tag during Agent Space creation so the agent can discover your resources.
In the Include AWS tags section:
- Click Add tag
- Enter:
| Tag Key | Tag Value |
|---|---|
| eksdevopsagent | true |
Important: All resources in this sample are tagged with eksdevopsagent=true. This ensures the DevOps Agent discovers the EKS cluster, services, Aurora databases, DynamoDB tables, and all related infrastructure.
The Web App is where you interact with AWS DevOps Agent for incident investigations.
- Select Auto-create a new AWS DevOps Agent role
- Review the permissions that will be granted to the role
Click Submit and wait for the Agent Space to be created (typically 1-2 minutes).
Once configured, the Configure Web App button should become Admin access. Clicking it should open the Web App and authenticate successfully.
AWS DevOps Agent needs access to the Kubernetes API to describe your Kubernetes cluster objects, retrieve pod logs and cluster events.
⚠️ Important: This step is required for the DevOps Agent to investigate Kubernetes-level issues like pod failures, resource constraints, and deployment problems.
For detailed instructions, refer to the official AWS documentation: AWS EKS access setup
- Open the DevOps Agent Console
- Select your Agent Space:
retail-store-eks-workshop - Navigate to Capabilities → Cloud → Primary Source → Edit
- Follow the setup instructions provided in the console
With EKS access configured, the DevOps Agent can:
| Capability | Description |
|---|---|
| Pod Status | Check running, pending, failed pods |
| Events | View Kubernetes events for troubleshooting |
| Deployments | Examine deployment configurations |
| Services | Check service endpoints and selectors |
| HPA | Monitor autoscaling status |
| Logs | Access pod logs via CloudWatch |
| Resource Usage | View CPU/memory from metrics server |
Error: Access entry already exists
- The entry may have been created during infrastructure deployment
- Verify it has the correct policy attached
Error: Invalid principal ARN
- Ensure you copied the complete ARN including the role name
- Verify the role exists in IAM
Agent still getting 401 errors
- Wait 1-2 minutes for the access entry to propagate
- Verify the policy AmazonEKSClusterAdminPolicy is associated
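To see what is currently configured, you can list the cluster's access entries and the policies attached to the agent's principal; a sketch using the AWS CLI (replace YOUR_ROLE_ARN with the DevOps Agent role ARN, and adjust the cluster name if you customized it):

```bash
# List all access entries on the cluster
aws eks list-access-entries --cluster-name retail-store --region us-east-1

# Show which access policies are associated with a given principal
aws eks list-associated-access-policies \
  --cluster-name retail-store \
  --region us-east-1 \
  --principal-arn YOUR_ROLE_ARN
```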
The Topology view provides a visual map of your system components and their relationships. AWS DevOps Agent automatically builds this topology by analyzing your infrastructure.
- Open your Agent Space in the AWS Console
- Click the Topology tab
- View the automatically discovered resources and relationships
The DevOps Agent automatically detects:
| Relationship Type | Example | How Detected |
|---|---|---|
| Service Dependencies | UI → Catalog | Network traffic analysis, service mesh |
| Database Connections | Orders → Aurora PostgreSQL | Security group rules, connection strings |
| Message Queue Links | Orders → RabbitMQ | Environment variables, connection configs |
| Cache Dependencies | Checkout → Redis | Pod configurations, endpoint references |
Operator access allows your on-call engineers and DevOps team to interact with the AWS DevOps Agent through a dedicated web application.
- From the Agent Space Console
  - Navigate to your Agent Space
  - Click Operator access in the left navigation
  - Click Enable operator access if not already enabled
- Access Methods

  Option A: Direct Console Access
  - Click the Operator access link in your Agent Space
  - This opens the DevOps Agent web app directly
  - Requires AWS Console authentication

  Option B: AWS IAM Identity Center (Recommended for Teams)
  - Configure IAM Identity Center for your organization
  - Create a permission set for DevOps Agent access
  - Assign users/groups to the permission set
  - Users can access via the Identity Center portal
The DevOps Agent interacts with your EKS cluster through:
- Read-Only Kubernetes API Access
  - Lists pods, deployments, services, events
  - Reads pod logs for error analysis
  - Checks resource utilization metrics
- CloudWatch Container Insights
  - Queries container metrics (CPU, memory, network)
  - Analyzes Application Signals data
  - Reviews performance anomalies
- AWS API Calls
  - Describes EKS cluster configuration
  - Checks node group status
  - Reviews security group rules
AWS DevOps Agent includes several safety mechanisms:
| Mechanism | Description |
|---|---|
| Read-Only by Default | The agent only reads data; it does not modify resources |
| Scoped Access | Access is limited to resources within the Agent Space |
| Audit Logging | All agent actions are logged to CloudTrail |
| Investigation Boundaries | Investigations are scoped to specific incidents |
| Human-in-the-Loop | Mitigation recommendations require human approval |
When the DevOps Agent identifies a mitigation:
- Recommendation Generated - Agent proposes a fix (e.g., "Scale up deployment")
- Human Review - Operator reviews the recommendation in the web app
- Approval Required - Operator must explicitly approve any changes
- Implementation Guidance - Agent provides detailed specs for implementation
Important: The DevOps Agent does not automatically make changes to your infrastructure. All mitigations are recommendations that require human approval and manual implementation.
From the Operator Web App:
- Click Start Investigation
- Choose a starting point:
  - Latest alarm - Investigate the most recent CloudWatch alarm
  - High CPU usage - Analyze CPU utilization across resources
  - Error rate spike - Investigate application error increases
  - Custom - Describe the issue in your own words
- Provide investigation details:
  - Investigation details - Describe what you're investigating
  - Date and time - When the incident occurred
  - AWS Account ID - The account containing the affected resources
- Click Start and watch the investigation unfold in real-time
After injecting a fault using the scripts in the Fault Injection Scenarios section, use these prompts to start a DevOps Agent investigation:
| Scenario | Investigation Details |
|---|---|
| Catalog Latency | "Product pages are loading slowly." |
| Network Partition | "Website is unreachable." |
| RDS Security Group Block | "Catalog pod is crashing." |
| Cart Memory Leak | "Intermittent cart failures, pods restarting." |
| DynamoDB Stress Test | "Slow performance and occasional failures." |
Investigation Flow:
Tip: For detailed investigation prompts with specific metrics and starting points, see the "DevOps Agent Investigation Prompts" section under each Fault Injection Scenario.
You can interact with the agent during investigations:
- Ask clarifying questions: "Which logs did you analyze?"
- Provide context: "Focus on the orders namespace"
- Steer the investigation: "Check the RDS connection pool metrics"
- Request AWS Support: Create a support case with one click
For the most up-to-date information about AWS DevOps Agent, refer to the official documentation:
| Resource | URL | Description |
|---|---|---|
| Product Page | https://aws.amazon.com/devops-agent | Overview and sign-up |
| AWS News Blog | Launch Announcement | Detailed walkthrough |
| IAM Reference | Service Authorization Reference | IAM actions and permissions |
- Agent Spaces - Logical boundaries for resource grouping
- Topology - Visual map of infrastructure relationships
- Investigations - Automated root cause analysis sessions
- Mitigations - Recommended fixes with implementation guidance
- Integrations - Connections to observability and CI/CD tools
AWS DevOps Agent integrates with:
Observability Tools:
- Amazon CloudWatch (native)
- Datadog
- Dynatrace
- New Relic
- Splunk
- Grafana/Prometheus (via MCP)
CI/CD & Source Control:
- GitHub Actions
- GitLab CI/CD
Incident Management:
- ServiceNow (native)
- PagerDuty (via webhooks)
- Slack (for notifications)
Custom Tools:
- Bring Your Own MCP Server for custom integrations
- Tag All Resources - Ensure
devopsagent = "true"tag is applied - Enable Container Insights - Already configured in Terraform
- Configure Alarms - Set up CloudWatch alarms for key metrics (see the example sketch after this list)
- Use Fault Injection - Test the agent's investigation capabilities
- Review Recommendations - Learn from the agent's analysis
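The Configure Alarms item above can start with a single CloudWatch alarm on a Container Insights metric; a minimal sketch where the metric choice, threshold, and alarm name are illustrative assumptions and no alarm action is attached:

```bash
# Alarm when average pod CPU utilization in the cluster stays above 80% for 10 minutes
aws cloudwatch put-metric-alarm \
  --alarm-name retail-store-pod-cpu-high \
  --namespace ContainerInsights \
  --metric-name pod_cpu_utilization \
  --dimensions Name=ClusterName,Value=retail-store \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --region us-east-1
```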
This repository includes fault injection scripts for simulating production-like issues during demos and training sessions. These scenarios help demonstrate how DevOps agents and monitoring tools can detect and diagnose real-world problems.
- EKS cluster deployed with the retail store application
kubectlconfigured to access the cluster- AWS CLI configured with appropriate permissions
Navigate to the fault injection directory:
cd fault-injection

Make all fault injection scripts executable:
chmod +x *.sh| Scenario | Inject Script | Rollback Script | Symptom |
|---|---|---|---|
| Catalog Latency | inject-catalog-latency.sh | rollback-catalog.sh | Product pages are loading slowly |
| Network Partition | inject-network-partition.sh | rollback-network-partition.sh | Website is unreachable |
| RDS Security Group Block | inject-rds-sg-block.sh | rollback-rds-sg-block.sh | Catalog pod is crashing |
| Cart Memory Leak | inject-cart-memory-leak.sh | rollback-cart-memory-leak.sh | Intermittent cart failures, pods restarting |
| DynamoDB Stress Test | inject-dynamodb-stress.sh | rollback-dynamodb-stress.sh | Slow performance and occasional failures |
Simulates high latency and CPU stress in the Catalog microservice.
Run the scenario:
# Inject the fault
./fault-injection/inject-catalog-latency.sh
# Rollback
./fault-injection/rollback-catalog.sh

Symptom: Product pages are loading slowly.
Sample Prompt:
"Product pages are loading slowly. The catalog service seems to be responding with high latency. Can you investigate what's causing the performance degradation?"
Blocks ingress traffic to the UI service using Kubernetes NetworkPolicy.
Run the scenario:
# Inject the fault
./fault-injection/inject-network-partition.sh
# Rollback
./fault-injection/rollback-network-partition.sh

Symptom: Website is unreachable.
Sample Prompt:
"The retail store website is completely unreachable. Users are reporting connection timeouts when trying to access the site. Can you investigate the network connectivity issue?"
Simulates an accidental security group change that blocks EKS nodes from connecting to RDS instances.
Run the scenario:
# Inject the fault
./fault-injection/inject-rds-sg-block.sh
# Rollback
./fault-injection/rollback-rds-sg-block.sh

Symptom: Catalog pod is crashing.
Sample Prompt:
"The catalog pod keeps crashing and restarting. It was working fine earlier today but now it can't seem to stay healthy. Can you investigate why the catalog service is failing?"
Simulates a memory leak in the Cart service causing OOMKill and pod restarts.
Run the scenario:
# Inject the fault
./fault-injection/inject-cart-memory-leak.sh
# Rollback
./fault-injection/rollback-cart-memory-leak.sh

Symptom: Intermittent cart failures, pods restarting.
Sample Prompt:
"Users are experiencing intermittent cart failures. Sometimes adding items to cart works, sometimes it doesn't. I've noticed the cart pods are restarting frequently. Can you investigate what's causing the instability?"
Deploys a stress pod that hammers DynamoDB with read requests, causing throttling.
Run the scenario:
# Inject the fault
./fault-injection/inject-dynamodb-stress.sh
# Rollback (instant - no data cleanup needed)
./fault-injection/rollback-dynamodb-stress.sh

Symptom: Slow performance and occasional failures.
Sample Prompt:
"The cart service is experiencing slow performance and occasional failures. Users are complaining about delays when viewing or updating their shopping carts. Can you investigate the DynamoDB-related issues?"
For a training session, follow this workflow:
- Verify baseline - Ensure all services are healthy before injection:

  kubectl get pods -A | grep -E "carts|catalog|orders|checkout|ui"

- Choose a scenario - Select one fault injection scenario from the table above
- Inject the fault - Run the inject script and wait for symptoms to appear
- Observe symptoms - Use monitoring tools (CloudWatch, Prometheus, pod logs) to observe the impact (see the commands after this list)
- Let DevOps Agent investigate - Allow automated investigation to detect root cause
- Rollback - Run the rollback script to restore normal operation
- Verify recovery - Confirm all services return to healthy state
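For the Observe symptoms step, a few commands make the impact visible from a terminal; a sketch where the carts namespace is just an example and depends on the scenario you chose:

```bash
# Watch pod restarts and status changes in the affected namespace
kubectl get pods -n carts -w

# Check live CPU/memory usage (metrics-server is installed as an EKS add-on)
kubectl top pods -n carts

# Show the most recent Kubernetes events across the cluster
kubectl get events -A --sort-by=.lastTimestamp | tail -n 20
```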
When you're finished with the lab, it's important to clean up all AWS resources to avoid ongoing charges. This section provides detailed instructions for completely removing the environment.
Running terraform destroy alone may not remove all resources because:
- AWS GuardDuty automatically creates VPC endpoints and security groups for runtime monitoring - these block subnet/VPC deletion
- CloudWatch Container Insights creates log groups dynamically when the agent starts
- Kubernetes resources (Helm releases, namespaces) can cause provider errors during destroy
The cleanup script handles all of these edge cases by cleaning up AWS auto-provisioned resources BEFORE running terraform destroy.
Use the provided destroy script for a complete cleanup:
# Make the script executable
chmod +x scripts/destroy-environment.sh
# Run the cleanup script (uses defaults: CLUSTER_NAME=retail-store, AWS_REGION=us-east-1)
./scripts/destroy-environment.sh
# Or override defaults
CLUSTER_NAME=my-cluster AWS_REGION=us-west-2 ./scripts/destroy-environment.sh

The script will:
- Get VPC ID for the cluster
- Delete VPC endpoints and GuardDuty security groups (prevents subnet deletion failures)
- Remove Kubernetes resources from Terraform state (prevents provider errors)
- Run terraform destroy to remove all Terraform-managed resources
- Clean up orphaned CloudWatch log groups
If you prefer to clean up manually or need to troubleshoot:
Step 1: Get VPC ID
VPC_ID=$(aws ec2 describe-vpcs --filters "Name=tag:environment-name,Values=retail-store" --query "Vpcs[0].VpcId" --output text --region us-east-1)
echo "VPC ID: $VPC_ID"Step 2: Clean Up GuardDuty Resources FIRST
This must be done BEFORE terraform destroy to prevent subnet deletion failures:
# Delete VPC endpoints created by GuardDuty
ENDPOINTS=$(aws ec2 describe-vpc-endpoints --filters "Name=vpc-id,Values=$VPC_ID" --query "VpcEndpoints[*].VpcEndpointId" --output text --region us-east-1)
for ep in $ENDPOINTS; do
echo "Deleting VPC endpoint: $ep"
aws ec2 delete-vpc-endpoints --vpc-endpoint-ids $ep --region us-east-1
done
# Wait for endpoints to be deleted
sleep 30
# Delete GuardDuty security groups
SG_IDS=$(aws ec2 describe-security-groups --filters "Name=vpc-id,Values=$VPC_ID" "Name=group-name,Values=GuardDuty*" --query "SecurityGroups[*].GroupId" --output text --region us-east-1)
for sg in $SG_IDS; do
echo "Deleting GuardDuty security group: $sg"
aws ec2 delete-security-group --group-id $sg --region us-east-1
done

Step 3: Remove Kubernetes Resources from Terraform State
cd terraform/eks/default
# Remove aws_auth ConfigMap
terraform state rm 'kubernetes_config_map_v1_data.aws_auth' 2>/dev/null || true
# Remove Helm releases (prevents Kubernetes provider errors)
terraform state rm 'helm_release.ui' 'helm_release.catalog' 'helm_release.carts' 'helm_release.orders' 'helm_release.checkout' 2>/dev/null || true
# Remove Kubernetes namespaces
terraform state rm 'kubernetes_namespace.ui' 'kubernetes_namespace.catalog' 'kubernetes_namespace.carts' 'kubernetes_namespace.orders' 'kubernetes_namespace.checkout' 'kubernetes_namespace.rabbitmq' 2>/dev/null || true

Step 4: Run Terraform Destroy
terraform destroy -auto-approve

Step 5: Final VPC Cleanup (if needed)
# Try to delete VPC if it still exists
aws ec2 delete-vpc --vpc-id $VPC_ID --region us-east-1 2>/dev/null || true

Step 6: Clean Up CloudWatch Log Groups
# Delete Container Insights log groups
for lg in $(aws logs describe-log-groups --log-group-name-prefix /aws/containerinsights/retail-store --query "logGroups[*].logGroupName" --output text --region us-east-1); do
echo "Deleting log group: $lg"
aws logs delete-log-group --log-group-name "$lg" --region us-east-1
done
# Delete EKS cluster log groups
for lg in $(aws logs describe-log-groups --log-group-name-prefix /aws/eks/retail-store --query "logGroups[*].logGroupName" --output text --region us-east-1); do
echo "Deleting log group: $lg"
aws logs delete-log-group --log-group-name "$lg" --region us-east-1
done

After running the cleanup, verify all resources are removed:
# Check for remaining EKS clusters
aws eks list-clusters --region us-east-1
# Check for remaining VPCs with retail-store tag
aws ec2 describe-vpcs --filters "Name=tag:environment-name,Values=retail-store" --region us-east-1
# Check for remaining CloudWatch log groups
aws logs describe-log-groups --log-group-name-prefix /aws/containerinsights/retail-store --region us-east-1
aws logs describe-log-groups --log-group-name-prefix /aws/eks/retail-store --region us-east-1
# Check Terraform state is empty
cd terraform/eks/default
terraform state list

Issue: VPC deletion hangs or fails
- Cause: GuardDuty or other AWS services created resources in the VPC
- Solution: Use the cleanup script or manually delete VPC endpoints and security groups first
Issue: Terraform provider errors during destroy
- Cause: Kubernetes provider can't connect to deleted cluster
- Solution: Remove Kubernetes resources from state before destroying (Step 3 above)
Issue: Log groups still exist after destroy
- Cause: Container Insights creates log groups outside of Terraform
- Solution: Manually delete using AWS CLI (Step 6 above)
Issue: "resource not found" errors
- Cause: Resource was already deleted manually or by another process
- Solution: These errors are safe to ignore; the resource is already gone
See CONTRIBUTING for more information.
This project is licensed under the MIT-0 License.
This package depends on and may incorporate or retrieve a number of third-party software packages (such as open source packages) at install-time or build-time or run-time ("External Dependencies"). The External Dependencies are subject to license terms that you must accept in order to use this package. If you do not accept all of the applicable license terms, you should not use this package. We recommend that you consult your company's open source approval policy before proceeding.
Provided below is a list of External Dependencies and the applicable license identification as indicated by the documentation associated with the External Dependencies as of Amazon's most recent review.
THIS INFORMATION IS PROVIDED FOR CONVENIENCE ONLY. AMAZON DOES NOT PROMISE THAT THE LIST OR THE APPLICABLE TERMS AND CONDITIONS ARE COMPLETE, ACCURATE, OR UP-TO-DATE, AND AMAZON WILL HAVE NO LIABILITY FOR ANY INACCURACIES. YOU SHOULD CONSULT THE DOWNLOAD SITES FOR THE EXTERNAL DEPENDENCIES FOR THE MOST COMPLETE AND UP-TO-DATE LICENSING INFORMATION.
YOUR USE OF THE EXTERNAL DEPENDENCIES IS AT YOUR SOLE RISK. IN NO EVENT WILL AMAZON BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, CONSEQUENTIAL, SPECIAL, INCIDENTAL, OR PUNITIVE DAMAGES (INCLUDING FOR ANY LOSS OF GOODWILL, BUSINESS INTERRUPTION, LOST PROFITS OR DATA, OR COMPUTER FAILURE OR MALFUNCTION) ARISING FROM OR RELATING TO THE EXTERNAL DEPENDENCIES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, EVEN IF AMAZON HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. THESE LIMITATIONS AND DISCLAIMERS APPLY EXCEPT TO THE EXTENT PROHIBITED BY APPLICABLE LAW.
| Dependency | License |
|---|---|
| MariaDB Community Edition | LICENSE |
| MySQL Community Edition | LICENSE |





