Operations Guide

Garot Conklin edited this page Apr 29, 2025 · 1 revision

Monitoring CloudOpsAI

Key Metrics

  • Lambda execution times
  • AI decision accuracy
  • Remediation success rates
  • Cost per incident
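Two of these metrics can be pulled or derived from the command line. The following is a minimal sketch, assuming the Lambda function is named cloudopsai-agent (inferred from the log group name in this guide) and runs in us-east-1 (taken from the dashboard URL). The helper names lambda_duration_stats and success_rate are hypothetical, and the counts passed to success_rate have to come from your own incident records.

```shell
# Average and maximum Lambda duration (ms) for a time window.
# Assumption: the function is named cloudopsai-agent and runs in us-east-1.
lambda_duration_stats() {
  start="$1"; end="$2"; fn="${3:-cloudopsai-agent}"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/Lambda \
    --metric-name Duration \
    --dimensions "Name=FunctionName,Value=${fn}" \
    --start-time "$start" \
    --end-time "$end" \
    --period 300 \
    --statistics Average Maximum \
    --region us-east-1
}

# Remediation success rate as a percentage with one decimal place.
# Prints "n/a" when there were no remediation attempts.
success_rate() {
  succeeded="$1"; total="$2"
  if [ "$total" -eq 0 ]; then
    echo "n/a"
    return 0
  fi
  awk -v s="$succeeded" -v t="$total" 'BEGIN { printf "%.1f\n", (s / t) * 100 }'
}

# Usage (ISO 8601 timestamps, UTC):
#   lambda_duration_stats 2025-04-29T00:00:00Z 2025-04-29T01:00:00Z
#   success_rate 45 50
```

The same ratio helper can be reused for cost per incident by dividing total spend by the incident count instead of multiplying by 100.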

Dashboards

Access the operational dashboard: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=CloudOpsAI

Troubleshooting

Common Issues

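  1. High Lambda Latency

    • Check the function's VPC configuration (subnets, security groups, routing)
    • Verify that the function can reach the Bedrock endpoint
    • Review the function's memory allocation

  2. Failed Remediation Actions

    • Verify the IAM permissions of the role that runs the remediation
    • Check the status of the SSM Automation executions
    • Review the error logs (see Log Analysis)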
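Several of the checks above can be started from the AWS CLI. This is a rough sketch, assuming the function is named cloudopsai-agent (inferred from the log group name in this guide); the helper names show_lambda_config and list_failed_automations are hypothetical.

```shell
# Show memory, timeout, and VPC settings for the function.
# Assumption: the function is named cloudopsai-agent.
show_lambda_config() {
  fn="${1:-cloudopsai-agent}"
  aws lambda get-function-configuration \
    --function-name "$fn" \
    --query '{Memory: MemorySize, Timeout: Timeout, Subnets: VpcConfig.SubnetIds, SecurityGroups: VpcConfig.SecurityGroupIds}' \
    --output json
}

# List recent failed SSM Automation executions with their failure messages.
list_failed_automations() {
  max="${1:-10}"
  aws ssm describe-automation-executions \
    --filters Key=ExecutionStatus,Values=Failed \
    --max-items "$max" \
    --query 'AutomationExecutionMetadataList[].{Id: AutomationExecutionId, Document: DocumentName, Failure: FailureMessage}' \
    --output table
}

# Usage:
#   show_lambda_config
#   list_failed_automations 20
```

An empty Subnets list means the function is not attached to a VPC, in which case the VPC checks do not apply.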

Log Analysis

# Lambda log stream names begin with the UTC date, so match streams by prefix
aws logs filter-log-events \
  --log-group-name /aws/lambda/cloudopsai-agent \
  --log-stream-name-prefix "$(date -u +%Y/%m/%d)"
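When looking for a specific failure, it can help to filter by pattern and time window rather than reading a whole day of events. A minimal sketch, assuming the same log group; the helper name search_recent_logs and its defaults (ERROR, 60 minutes) are hypothetical.

```shell
# Search the last N minutes of agent logs for a pattern (default: ERROR).
search_recent_logs() {
  pattern="${1:-ERROR}"; minutes="${2:-60}"
  # CloudWatch Logs expects the start time in milliseconds since the epoch.
  start_ms=$(( ($(date +%s) - minutes * 60) * 1000 ))
  aws logs filter-log-events \
    --log-group-name /aws/lambda/cloudopsai-agent \
    --filter-pattern "$pattern" \
    --start-time "$start_ms" \
    --query 'events[].[timestamp, message]' \
    --output text
}

# Usage:
#   search_recent_logs            # ERROR lines from the last hour
#   search_recent_logs Timeout 15 # "Timeout" lines from the last 15 minutes
```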

Maintenance

Regular Tasks

  • Review and update YAML rules
  • Analyze AI decision accuracy
  • Clean up old incident records
  • Update AWS resource tags
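Before cleaning up incident records, it is worth counting how many would be affected. The sketch below only counts and never deletes. It assumes each item in cloudopsai-incidents carries a numeric created_at attribute in epoch seconds; that attribute name is hypothetical and must be replaced with whatever the table actually uses. The helper names cutoff_epoch and count_old_incidents are also hypothetical.

```shell
# Epoch-seconds cutoff for records older than N days.
# The optional second argument overrides "now" (useful for testing).
cutoff_epoch() {
  days="$1"; now="${2:-$(date +%s)}"
  echo $(( now - days * 86400 ))
}

# Count (not delete) incident records older than N days.
# Assumption: items have a numeric created_at attribute in epoch seconds.
count_old_incidents() {
  days="${1:-90}"
  cutoff=$(cutoff_epoch "$days")
  aws dynamodb scan \
    --table-name cloudopsai-incidents \
    --filter-expression '#ts < :cutoff' \
    --expression-attribute-names '{"#ts": "created_at"}' \
    --expression-attribute-values "{\":cutoff\": {\"N\": \"${cutoff}\"}}" \
    --select COUNT
}

# Usage:
#   count_old_incidents 90
```

A scan reads the whole table, so on large tables this is best run off-peak.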

Backup and Recovery

  1. Configuration backup:
aws s3 sync s3://cloudopsai-config/ backup/
  2. DynamoDB backup:
aws dynamodb create-backup \
  --table-name cloudopsai-incidents \
  --backup-name "manual-backup-$(date +%Y%m%d)"
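The commands above cover the backup side. For recovery, a possible sketch follows. The helper names and the default target table name cloudopsai-incidents-restored are hypothetical, and the backup ARN must be taken from the list-backups output.

```shell
# List existing backups of the incidents table.
list_incident_backups() {
  aws dynamodb list-backups --table-name cloudopsai-incidents
}

# Restore a backup into a new table (the target table must not already exist).
restore_incidents() {
  backup_arn="$1"; target="${2:-cloudopsai-incidents-restored}"
  aws dynamodb restore-table-from-backup \
    --target-table-name "$target" \
    --backup-arn "$backup_arn"
}

# Preview restoring the configuration from the local copy.
# Remove --dryrun to actually upload the files.
preview_config_restore() {
  aws s3 sync backup/ s3://cloudopsai-config/ --dryrun
}

# Usage:
#   list_incident_backups
#   restore_incidents "<backup-arn-from-list-backups>"
#   preview_config_restore
```

Restoring into a separate table lets you inspect the data before pointing anything at it.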
