hashicorp · im2nguyen · Jul 22, 2025 · Jul 17, 2025 · Jul 17, 2025 · Jul 22, 2025
@@ -150,6 +150,57 @@
             ]
           }
         ]
+      },
+      {
+        "title": "Operations and monitoring",
+        "routes": [
+          {
+            "title": "Overview",
+            "path": "automate-and-define-processes/ops-monitoring"
+          },
+          {
+            "title": "Setup monitoring agents",
+            "routes": [
+              {
+                "title": "Overview",
+                "path": "automate-and-define-processes/ops-monitoring/setup-monitoring-agents"
+              },
+              {
+                "title": "Secure agents secrets",
+                "path": "automate-and-define-processes/ops-monitoring/setup-monitoring-agents/manage-secrets"
+              },
+              {
+                "title": "Configure agent on VMs",
+                "path": "automate-and-define-processes/ops-monitoring/setup-monitoring-agents/vm"
+              },
+              {
+                "title": "Configure agent on containers",
+                "path": "automate-and-define-processes/ops-monitoring/setup-monitoring-agents/containers"
+              },
+              {
+                "title": "Configure agent on service mesh",
+                "path": "automate-and-define-processes/ops-monitoring/setup-monitoring-agents/service-mesh"
+              }
+            ]
+          },
+          {
+            "title": "Configure dashboards and alerts",
+            "routes": [
+              {
+                "title": "Overview",
+                "path": "automate-and-define-processes/ops-monitoring/dashboards-alerts"
+              },
+              {
+                "title": "Vendor monitoring tools",
+                "path": "automate-and-define-processes/ops-monitoring/dashboards-alerts/manage-vendor"
+              },
+              {
+                "title": "Cloud-native monitoring tools",
+                "path": "automate-and-define-processes/ops-monitoring/dashboards-alerts/manage-cloud-native"
+              }
+            ]
+          }
+        ]
       }
     ]
   },

@@ -0,0 +1,35 @@
+---
+page_title: Configure dashboards and alerts
+description: Learn how to configure dashboards and alerts on your infrastructure and services.
+---
+
+# Configure dashboards and alerts
+
+As the number of services you manage and maintain grows, manually managing monitoring components like dashboards and alerts becomes unsustainable. This can lead to inconsistencies across environments, observability gaps where issues go undetected, and potential security risks. To solve these challenges as your organization scales, we recommend adopting monitoring-as-code (MaC) to manage these configurations.
+
+With MaC, you can apply many of the same best practices you use for infrastructure-as-code (IaC), such as:
+
+  - **Consistent configuration:** MaC lets you consistently deploy standardized monitoring setups across teams and environments. Terraform lets you create modules that include standard monitoring configurations with built-in reasonable defaults. A range of monitoring tools also offer official Terraform providers and modules your organization can use.
+
+    For example, the team responsible for ensuring data integrity can make changes to the Terraform modules and propagate those changes throughout the organization.
+  - **Automated provisioning:** As your organization scales, you can configure infrastructure and service deployments to automatically provision the matching monitoring dashboards and alerts.
+  - **Auditable changes:** All changes to monitoring components are traceable through version control.
+  - **Policy-compliant resources:** You can use Sentinel and Open Policy Agent (OPA) to ensure your monitoring resources are secure and compliant with your organization's policies.
+
+While the codified approach offers significant benefits, designing complex monitoring dashboards and alert rules directly in code can be challenging initially. 
+
+To balance this, we recommend an iterative workflow. First, leverage your monitoring tool's UI to visually design and build dashboards, layouts, and alert rules. This allows you to fully utilize the robust querying capabilities and intuitive interfaces provided by monitoring solutions. Once you have functional prototypes, import those configurations into Terraform code and standardize them as a Terraform module. From there, your organization can consume and modify the Terraform modules to create consistent monitoring dashboards and alerts across your infrastructure and services. 
+
+This approach combines the flexibility of visually designing your dashboards first with the consistency and maintainability of managing them as code.
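+
+As a rough sketch of the import step, the following configuration adopts a dashboard that already exists in the monitoring tool. It assumes Terraform v1.5 or later for the `import` block and uses the AWS provider's `aws_cloudwatch_dashboard` resource purely as an illustration. The dashboard name `service-overview`, the file path, and the resource label are hypothetical placeholders.
+
+```hcl
+# Hypothetical example: adopt a dashboard that was prototyped in the UI.
+# For aws_cloudwatch_dashboard, the import ID is the dashboard name.
+import {
+  to = aws_cloudwatch_dashboard.service_overview
+  id = "service-overview"
+}
+
+resource "aws_cloudwatch_dashboard" "service_overview" {
+  dashboard_name = "service-overview"
+
+  # Dashboard JSON exported from the UI and committed next to this file.
+  dashboard_body = file("${path.module}/dashboards/service-overview.json")
+}
+```
+
+Once `terraform plan` reports no unexpected changes, you can move the resource into a shared module and replace the hardcoded values with input variables.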
+
+External resources:
+  - New Relic's [article](https://newrelic.com/blog/best-practices/the-importance-of-monitoring-as-code-for-modern-enterprises) provides additional insights into why organizations should adopt monitoring-as-code.
+
+## Next steps
+
+In this overview, you learned the benefits of monitoring-as-code and how to use Terraform to define and manage dashboards and alerts on your infrastructure and services. Configuring dashboards and alerts is part of the [Automate and define processes pillar](/well-architected-framework/automate-and-define-processes/introduction).
+
+Refer to the following documents to learn how to deploy monitoring tools with Terraform.
+
+- [Manage vendor monitoring tools](/well-architected-framework/automate-and-define-processes/ops-monitoring/dashboards-alerts/manage-vendor)
+- [Manage cloud-native monitoring tools](/well-architected-framework/automate-and-define-processes/ops-monitoring/dashboards-alerts/manage-cloud-native)
@@ -0,0 +1,23 @@
+---
+page_title: Manage cloud-native monitoring tools
+description: Learn how to manage cloud-native monitoring tools with Terraform.
+---
+
+# Manage cloud-native monitoring tools
+
+Many cloud providers, such as AWS, Azure, and Google Cloud, offer their own monitoring services, which can effectively monitor infrastructure metrics and application logs. With Terraform, you can use cloud provider resources and specific monitoring modules to deploy and manage your cloud-native monitoring infrastructure without installing additional monitoring agents.
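+
+As a minimal sketch of this approach, the following configuration uses the AWS provider to define one dashboard and one alert. It assumes you have already configured the AWS provider. The `instance_id` variable, the resource names, the `us-east-1` region, and the thresholds are illustrative assumptions, not recommended values.
+
+```hcl
+# Hypothetical input: ID of the EC2 instance to monitor.
+variable "instance_id" {
+  type = string
+}
+
+# Dashboard with a single widget that graphs average CPU utilization.
+resource "aws_cloudwatch_dashboard" "example" {
+  dashboard_name = "example-service"
+
+  dashboard_body = jsonencode({
+    widgets = [
+      {
+        type   = "metric"
+        x      = 0
+        y      = 0
+        width  = 12
+        height = 6
+        properties = {
+          metrics = [["AWS/EC2", "CPUUtilization", "InstanceId", var.instance_id]]
+          period  = 300
+          stat    = "Average"
+          region  = "us-east-1" # Assumed region for this example.
+          title   = "EC2 CPU utilization"
+        }
+      }
+    ]
+  })
+}
+
+# Alarm when average CPU stays above 80% for two consecutive 5-minute periods.
+resource "aws_cloudwatch_metric_alarm" "high_cpu" {
+  alarm_name          = "example-high-cpu"
+  alarm_description   = "Average CPU utilization above 80% for 10 minutes"
+  namespace           = "AWS/EC2"
+  metric_name         = "CPUUtilization"
+  statistic           = "Average"
+  period              = 300
+  evaluation_periods  = 2
+  comparison_operator = "GreaterThanThreshold"
+  threshold           = 80
+
+  dimensions = {
+    InstanceId = var.instance_id
+  }
+}
+```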
+
+HashiCorp resources:
+
+  - AWS maintains the [AWS Integration and Automation (IA) Terraform modules](https://registry.terraform.io/namespaces/aws-ia). For example, the [`cloudwatch-log-group`](https://registry.terraform.io/modules/aws-ia/cloudwatch-log-group/aws/latest) module deploys and manages an AWS CloudWatch log group along with the corresponding IAM permissions. The [Terraform AWS provider](https://registry.terraform.io/providers/hashicorp/aws/latest/docs) contains CloudWatch resources that Terraform can create and manage, such as the [`aws_cloudwatch_dashboard`](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_dashboard) resource.
+  - Azure maintains [Azure Verified Modules](https://azure.github.io/Azure-Verified-Modules/indexes/terraform/tf-resource-modules/#available-modules). For example, the [`avm-res-operationalinsights-workspace`](https://registry.terraform.io/modules/Azure/avm-res-operationalinsights-workspace/azurerm/latest) module deploys and manages a Log Analytics Workspace with reasonable defaults. The [Azure Terraform provider](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs) contains the resources you need to deploy monitoring for your application in Azure, such as [`azurerm_portal_dashboard`](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/portal_dashboard) and [`azurerm_monitor_metric_alert`](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/monitor_metric_alert).
+  - Google maintains a [cloud operations module](https://registry.terraform.io/modules/terraform-google-modules/cloud-operations/google/latest) that manages Google Cloud's operations suite (Cloud Logging and Monitoring). The [Terraform Google Cloud provider](https://registry.terraform.io/providers/hashicorp/google/latest/docs) page provides Google Cloud Monitoring resources that Terraform can create and manage such as the [`google_monitoring_dashboard`](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_dashboard) resource. 
+
+External resources:
+
+  - Andrei Maksimov's [tutorial with a video](https://hands-on.cloud/terraform-cloudwatch-examples/) guides you through how to automate alarms, dashboards, and logs in the AWS CloudWatch service.
+  - Azure's [Multi-cloud monitoring](https://learn.microsoft.com/en-us/azure/azure-monitor/best-practices-multicloud) article guides you through setting up Azure Monitor to monitor your services and infrastructure across different clouds, and through ingesting cloud-native metrics and telemetry information into your existing monitoring solution.
+
+## Next steps
+
+In this section of [Configure dashboards and alerts](/well-architected-framework/automate-and-define-processes/ops-monitoring/dashboards-alerts), you found resources to help you use Terraform to define and manage dashboards and alerts with your preferred cloud provider. Configuring dashboards and alerts is part of the [Automate and define processes pillar](/well-architected-framework/automate-and-define-processes/introduction).
@@ -0,0 +1,33 @@
+---
+page_title: Manage vendor monitoring tools with Terraform
+description: Learn how to manage vendor monitoring tools with Terraform.
+---
+
+# Manage vendor monitoring tools with Terraform
+
+Terraform makes it easy to deploy and manage various vendor monitoring tools through official providers and modules. The Terraform Registry hosts over 200 partner and community providers for logging and monitoring, including popular ones such as Datadog, New Relic, Grafana, and Splunk. With these providers, you can automatically provision the monitoring tool and its resources, like dashboards and alerts.
+
+Many monitoring vendors also contribute to OpenTelemetry (OTel). OpenTelemetry provides a standardized way to generate and export telemetry data like metrics, traces, and logs from your applications. The OpenTelemetry agent collects this data from your applications. Instead of deploying separate agents for each vendor monitoring tool, you can configure the OTel agent to export data to multiple backends simultaneously. For example, you can send metrics to Datadog, traces to Honeycomb, and logs to Splunk from the same OTel agent.
+
+With Terraform, you have a unified approach to manage and operate your different monitoring systems through a single, automated workflow defined in code. Terraform lets you define the exact configuration you want for everything, from the OpenTelemetry agents that collect data (refer to the GCE example in the external resources below) to backend monitoring tools like Datadog, New Relic, and Grafana.
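+
+For illustration, the following sketch defines a single alert in a vendor tool with the Datadog provider. It assumes the provider reads credentials from the `DD_API_KEY` and `DD_APP_KEY` environment variables. The monitor name, query, `service:web` tag, thresholds, and `@example-oncall` notification handle are hypothetical values you would replace with your own.
+
+```hcl
+terraform {
+  required_providers {
+    datadog = {
+      source = "DataDog/datadog"
+    }
+  }
+}
+
+# Credentials come from the DD_API_KEY and DD_APP_KEY environment variables.
+provider "datadog" {}
+
+# Hypothetical monitor: alert when CPU on hosts tagged service:web exceeds 90%.
+resource "datadog_monitor" "high_cpu" {
+  name    = "Example: high CPU on web hosts"
+  type    = "metric alert"
+  message = "CPU is above 90% on {{host.name}}. Notify: @example-oncall"
+  query   = "avg(last_5m):avg:system.cpu.user{service:web} by {host} > 90"
+
+  # The critical threshold must match the value used in the query.
+  monitor_thresholds {
+    warning  = 80
+    critical = 90
+  }
+
+  tags = ["managed-by:terraform"]
+}
+```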
+
+HashiCorp resources:
+
+  - The [Terraform Datadog provider](/terraform/tutorials/applications/datadog-provider) tutorial guides you through how to use Terraform to deploy an application in EKS and install the Datadog agent across the Kubernetes cluster.
+  - Terraform Registry hosts over 200 [Logging and Monitoring](https://registry.terraform.io/browse/providers?category=logging-monitoring) partner and community providers. You should be able to find a provider to manage your monitoring tool of choice. This includes popular providers such as the [New Relic Terraform provider](https://registry.terraform.io/providers/newrelic/newrelic/latest/docs) and [Datadog Terraform provider](https://registry.terraform.io/providers/DataDog/datadog/latest/docs).
+
+External resources:
+
+  - The [Deploying OpenTelemetry (OTel) agent to your GCE instances](https://liveramp.com/blog/deploying-opentelemetry-agent-to-your-gce-instances/) article provides insights from LiveRamp as they automatically deploy OTel agents using Terraform on their Google Cloud instances.
+  - Datadog provides a [quick start guide](https://www.datadoghq.com/blog/managing-datadog-with-terraform/) that walks you through creating dashboards, deploying monitors and alerts, and integrating with AWS. This guide uses the resources in the Datadog provider.
+
+  - New Relic provides resources on implementing monitoring-as-code (MaC) with Terraform:
+    - The [Automate your configuration with observability as code](https://newrelic.com/blog/how-to-relic/examples-observability-as-code-part-one) tutorial covers the importance of codifying monitoring using HCL.
+    - A three-part series guides you through using Terraform with JSON to create and dynamically generate New Relic dashboards:
+      - [Creating dashboards with Terraform and JSON templates](https://newrelic.com/blog/how-to-relic/create-nr-dashboards-with-terraform-part-1) guides you through quickly updating New Relic dashboards with Terraform by using JSON templates.
+      - [Dynamically creating New Relic dashboards with Terraform](https://newrelic.com/blog/how-to-relic/create-nr-dashboards-with-terraform-part-2) guides you through using JSON templates to create dynamic dashboards.
+      - [Using Terraform to generate New Relic dashboards from NRQL queries](https://newrelic.com/blog/how-to-relic/create-nr-dashboards-with-terraform-part-3) guides you through using Terraform and NRQL queries to generate dashboards with dynamic data.
+
+## Next steps
+
+In this section of [Configure dashboards and alerts](/well-architected-framework/automate-and-define-processes/ops-monitoring/dashboards-alerts), you found resources to help you configure OTel on your services and manage vendor monitoring tools with Terraform. Configuring dashboards and alerts is part of the [Automate and define processes pillar](/well-architected-framework/automate-and-define-processes/introduction).
@@ -0,0 +1,17 @@
+---
+page_title: Operations and monitoring
+description: Learn best practices to monitor your infrastructure and services.
+---
+
+# Operations and monitoring
+
+This section describes the best practices for monitoring your infrastructure and services. Monitoring your infrastructure and services lets you maintain their reliability, performance, and security. You can use what you learn here to continuously observe key metrics, such as resource usage, response times, and error rates, so that you can proactively identify and fix problems before they impact users.
+
+Following these recommendations helps you standardize and automate your entire monitoring process across your organization.
+
+## Next steps
+
+Refer to the following documents to learn how to monitor your infrastructure and services.
+
+- [Set up monitoring agents](/well-architected-framework/automate-and-define-processes/ops-monitoring/setup-monitoring-agents) to collect metrics and logs.
+- [Configure dashboards and alerts](/well-architected-framework/automate-and-define-processes/ops-monitoring/dashboards-alerts) to view the metrics and notify you of issues.
@@ -0,0 +1,27 @@
+---
+page_title: Configure monitoring agent on container orchestrators
+description: Learn how to configure monitoring agents on container orchestrators like Nomad, OpenShift, and Kubernetes.
+---
+
+# Configure monitoring agent on container orchestrators
+
+Monitoring container orchestrators, like Kubernetes and Nomad, is important for keeping your clusters and services healthy, and lets you sustain high performance and reliability. The built-in telemetry data from these tools doesn't provide much value alone; you need a monitoring tool to collect, parse, and alert on raw telemetry data. By setting up monitoring agents, you can get valuable insights into how your clusters and services are functioning.
+
+Track metrics about Kubernetes cluster nodes, like CPU and memory usage, to understand if the nodes are healthy and have enough resources. Monitor application-level metrics, like request latency and error rates, to ensure the services are running smoothly. Tools like Prometheus and Grafana let you collect and visualize these metrics.
+
+Track Nomad cluster node metrics, like resource usage, to optimize resources, keep the cluster stable, and identify performance bottlenecks. Nomad’s integration with Prometheus collects and analyzes cluster metrics, providing insights into cluster health and performance. Monitor Nomad job metrics so you know whether jobs execute smoothly. Using monitoring tools like Prometheus and Grafana with Nomad lets you comprehensively monitor the entire system, including the cluster itself and all running jobs.
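+
+As a minimal sketch, the following `telemetry` block in the Nomad agent configuration exposes node and allocation metrics in Prometheus format. The values shown are illustrative rather than tuned recommendations; adjust the collection interval for your environment.
+
+```hcl
+# Nomad agent configuration file (the file name is up to you, for example nomad.hcl).
+telemetry {
+  collection_interval        = "1s"
+  disable_hostname           = true
+  prometheus_metrics         = true # Serve metrics in Prometheus format.
+  publish_allocation_metrics = true # Include per-allocation (job) metrics.
+  publish_node_metrics       = true # Include node-level resource metrics.
+}
+```
+
+With this block in place, Prometheus can scrape each agent at the `/v1/metrics?format=prometheus` endpoint.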
+
+HashiCorp resources:
+  - The [Terraform Datadog provider](/terraform/tutorials/applications/datadog-provider) tutorial shows you how to use Terraform to deploy an application in EKS and install the Datadog agent across the Kubernetes cluster.
+  - For node-level Nomad metrics, refer to the following resources:
+    - The [Nomad Prometheus](/nomad/tutorials/manage-clusters/prometheus-metrics) tutorial guides you through configuring Prometheus to integrate with a Nomad cluster. This tutorial covers how to gather node-level metrics.
+    - The [Monitoring Nomad](/nomad/docs/operations/monitoring-nomad), [Metrics reference](/nomad/docs/operations/metrics-reference), [Nomad autoscaler documentation](/nomad/tools/autoscaling), and [Nomad telemetry block documentation](/nomad/docs/configuration/telemetry) provide a deep dive into the telemetry and metrics that Nomad has to offer.
+  - The [Collect resource utilization metrics](/nomad/tutorials/manage-jobs/jobs-utilization) tutorial shows you how to view native Nomad job resource usage for simple service-level metrics.
+
+External resources:
+  - Kubernetes provides resources to learn more about tools that help you monitor [Kubernetes resources](https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-usage-monitoring/) and [node health](https://kubernetes.io/docs/tasks/debug/debug-cluster/monitor-node-health/).
+  - The [Nomad integration for Grafana](https://grafana.com/docs/grafana-cloud/monitor-infrastructure/integrations/integration-reference/integration-nomad/) includes two pre-built dashboards to help monitor and visualize Nomad metrics.
+
+## Next steps
+
+In this section of [Set up monitoring agents](/well-architected-framework/automate-and-define-processes/ops-monitoring/setup-monitoring-agents), you learned how to configure and deploy monitoring agents for containers. Setting up monitoring agents is part of the [Automate and define processes pillar](/well-architected-framework/automate-and-define-processes/introduction).