add os grafana stack #526

Open
wants to merge 6 commits into
base: main
@@ -0,0 +1,72 @@
# SageMaker HyperPod Monitoring with OS Grafana <!-- omit from toc -->

Amazon Managed Grafana requires authentication through AWS IAM Identity Center or SAML. Setting up those mechanisms can be difficult in some environments, for example in an AWS account provided through a partner. For such cases, this page shows how to deploy an EC2 instance running an open-source (OS) Grafana container together with an Amazon Managed Service for Prometheus workspace from a single CloudFormation template. Once the environment is set up, the guide covers how to 1/ access the EC2 instance securely with SSH over SSM and 2/ set the Prometheus workspace as a data source so that you can view the metrics.

To get started, provision an AWS CloudFormation stack in your AWS account. The complete stack template is in [cluster-observability-os-grafana.yaml](./cluster-observability-os-grafana.yaml). The stack deploys the following resources for cluster monitoring:

* [Amazon Managed Service for Prometheus workspace](https://aws.amazon.com/prometheus/)
* An EC2 instance (`t2.medium`, Amazon Linux 2) running an open-source Grafana container
* Associated IAM roles and permissions

### Prerequisites

* Refer to the [original readme](./README.md) for exporter setup and other prerequisites.
* Install the [SSM Session Manager plugin](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html) in your local environment to access the EC2 instance, following [these instructions](https://catalog.workshops.aws/sagemaker-hyperpod/en-US/01-cluster/05-ssh).

### Deploy the CloudFormation Stack

[<kbd> <br> 1-Click Deploy 🚀 <br> </kbd>](https://console.aws.amazon.com/cloudformation/home?#/stacks/quickcreate?templateURL=https://awsome-distributed-training.s3.amazonaws.com/templates/cluster-observability-os-grafana.yaml&stackName=Cluster-Observability-OS-Grafana)

>[!IMPORTANT]
> We strongly recommend deploying this stack into the same region and the same account as your SageMaker HyperPod cluster. This ensures that the lifecycle scripts run successfully, specifically `install_prometheus.sh`, which relies on AWS CLI commands that assume the same account and region.

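If the console quick-create link is not an option, the same template can be deployed from the AWS CLI. The sketch below assumes the template file has been downloaded to the current directory and that `REGION` is set in your shell; the function names and the `STACK_NAME` default are illustrative, not part of the repository. `CAPABILITY_IAM` is passed because the template creates an IAM role.

```shell
# Stack name is an assumption; it matches the one used by the 1-Click Deploy link.
STACK_NAME="${STACK_NAME:-Cluster-Observability-OS-Grafana}"

# Deploy (or update) the stack from a local copy of the template.
deploy_os_grafana_stack() {
  aws cloudformation deploy \
    --template-file cluster-observability-os-grafana.yaml \
    --stack-name "${STACK_NAME}" \
    --capabilities CAPABILITY_IAM \
    --region "${REGION}"
}

# Print a single stack output by key, for example InstanceId.
get_stack_output() {
  aws cloudformation describe-stacks \
    --stack-name "${STACK_NAME}" \
    --region "${REGION}" \
    --query "Stacks[0].Outputs[?OutputKey=='$1'].OutputValue" \
    --output text
}

# Usage:
#   deploy_os_grafana_stack
#   INSTANCE_ID="$(get_stack_output InstanceId)"
```

The `INSTANCE_ID` value retrieved this way is the one used by the SSM commands in the next section.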
### Connect to the EC2 instance running OS Grafana

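Connect to the EC2 instance using SSM, where `INSTANCE_ID` is the `InstanceId` output of the CloudFormation stack and `REGION` is the region you deployed it to:

```bash
aws ssm start-session --target ${INSTANCE_ID} --region ${REGION}
```

Then switch to `ec2-user`: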

```bash
sudo su - ec2-user
```

Add your SSH public key to `~/.ssh/authorized_keys` on the instance.
Then configure SSH access over SSM by adding the following to your local `~/.ssh/config`, replacing `${REGION}` and `${INSTANCE_ID}` with your values:

```bash
Host os-grafana
    User ec2-user
    ProxyCommand sh -c "aws ssm start-session --region ${REGION} --target ${INSTANCE_ID} --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"
```

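Before connecting, the public key from the earlier step has to be present on the instance. A minimal sketch to run as `ec2-user` inside the SSM session follows; the function name is hypothetical, and the key you pass is the contents of your own local public key file.

```shell
# Append a public key to ~/.ssh/authorized_keys with the permissions sshd expects.
# $1: the full public key line copied from your local machine.
add_authorized_key() {
  local public_key="$1"
  local ssh_dir="${HOME}/.ssh"
  mkdir -p "${ssh_dir}"
  chmod 700 "${ssh_dir}"
  touch "${ssh_dir}/authorized_keys"
  # Append only if this exact key line is not already present.
  grep -qxF -- "${public_key}" "${ssh_dir}/authorized_keys" ||
    printf '%s\n' "${public_key}" >> "${ssh_dir}/authorized_keys"
  chmod 600 "${ssh_dir}/authorized_keys"
}

# Usage (paste your own key in place of the placeholder):
#   add_authorized_key "<contents of your local public key file>"
```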
Connect to the instance:

```bash
ssh os-grafana
```

Forward port 3000 to access the Grafana dashboard from a web browser on your local machine. For VS Code, refer to [this guide](https://code.visualstudio.com/docs/editor/port-forwarding). The default Grafana login credentials are `admin/admin`; change the password after the first login.

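Outside VS Code, one way to forward the port is a plain SSH local forward through the `os-grafana` host alias defined in the SSH configuration above. This is a sketch; the function name and the optional local port argument are illustrative.

```shell
# Forward a local port to Grafana (port 3000 on the instance) over SSH-over-SSM.
# -N: do not run a remote command; -L: local port forward.
# $1: local port to listen on (defaults to 3000).
forward_grafana_port() {
  local local_port="${1:-3000}"
  ssh -N -L "${local_port}:localhost:3000" os-grafana
}

# Usage:
#   forward_grafana_port        # then open http://localhost:3000
#   forward_grafana_port 8080   # then open http://localhost:8080
```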
### Set Prometheus Workspace as datasource for the Grafana dashboard

Finally, connect the Prometheus workspace to the Grafana dashboard by adding the workspace as a data source.

Navigate to "Data sources" in Grafana and select "Prometheus".

![](./assets/os-grafana-set-datasource1.png)

Set "Prometheus server URL" to the workspace endpoint value retrieved from the AWS console.

![](./assets/retrieve-amp-endpoint.png)
![](./assets/os-grafana-set-datasource2.png)

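The workspace endpoint can also be read from the AWS CLI instead of the console. A small sketch, assuming you already know the workspace ID as shown in the Amazon Managed Service for Prometheus console; the function name is hypothetical.

```shell
# Print the endpoint of an Amazon Managed Service for Prometheus workspace.
# $1: workspace ID, $2: region of the workspace.
get_amp_endpoint() {
  aws amp describe-workspace \
    --workspace-id "$1" \
    --region "$2" \
    --query 'workspace.prometheusEndpoint' \
    --output text
}

# Usage:
#   get_amp_endpoint "<your workspace ID>" "${REGION}"
```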
For authentication:
* Choose "SigV4 auth"
* Set "Authentication Provider" to "AWS SDK Default"
* Set "Default Region" to the region where you deployed the CloudFormation stack.

![](./assets/os-grafana-set-datasource3.png)

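The same data source settings can be scripted against Grafana's HTTP API once port 3000 is forwarded. The sketch below is an assumption-laden alternative to the UI steps: the function names, the data source name, and the `GRAFANA_PASSWORD` variable are illustrative, and the `jsonData` field names (`sigV4Auth`, `sigV4AuthType`, `sigV4Region`) should be checked against the Grafana version you run.

```shell
# Build the JSON body for a Prometheus data source that signs requests with SigV4.
# "default" is assumed to correspond to the "AWS SDK Default" provider in the UI.
# $1: workspace endpoint URL, $2: region of the workspace.
amp_datasource_json() {
  cat <<EOF
{
  "name": "Amazon Managed Prometheus",
  "type": "prometheus",
  "access": "proxy",
  "url": "$1",
  "jsonData": {
    "httpMethod": "POST",
    "sigV4Auth": true,
    "sigV4AuthType": "default",
    "sigV4Region": "$2"
  }
}
EOF
}

# Create the data source through the forwarded port.
# GRAFANA_PASSWORD is the admin password you set after the first login.
create_amp_datasource() {
  amp_datasource_json "$1" "$2" |
    curl -sS -u "admin:${GRAFANA_PASSWORD}" \
      -H "Content-Type: application/json" \
      -X POST --data-binary @- \
      http://localhost:3000/api/datasources
}

# Usage:
#   create_amp_datasource "<workspace endpoint URL>" "${REGION}"
```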
5 changes: 5 additions & 0 deletions 4.validation_and_observability/4.prometheus-grafana/README.md
@@ -9,6 +9,7 @@ To get started, you will initiate the provisioning of an Amazon CloudFormation S
* [Amazon Managed Grafana Workspace](https://aws.amazon.com/grafana/)
* Associated IAM roles and permissions

If your environment does not allow the use of IAM Identity Center or SAML, consider the [alternative OS Grafana option](./README-OS-grafana.md).

![observability_architecture](./assets/observability_architecture.png)

@@ -35,6 +36,10 @@ If you have already created your HyperPod cluster, you can follow [these instruc

[<kbd> <br> 1-Click Deploy 🚀 <br> </kbd>](https://console.aws.amazon.com/cloudformation/home?#/stacks/quickcreate?templateURL=https://awsome-distributed-training.s3.amazonaws.com/templates/cluster-observability.yaml&stackName=Cluster-Observability)

Alternatively, you can deploy the OS Grafana stack.

[<kbd> <br> 1-Click Deploy 🚀 <br> </kbd>](https://console.aws.amazon.com/cloudformation/home?#/stacks/quickcreate?templateURL=https://awsome-distributed-training.s3.amazonaws.com/templates/cluster-observability-os-grafana.yaml&stackName=Cluster-Observability-OS-Grafana)

>[!IMPORTANT]
> It is strongly recommended you deploy this stack into the same region and same account as your SageMaker HyperPod Cluster. This will ensure successful execution of the Lifecycle Scripts, specifically `install_prometheus.sh`, which relies on AWS CLI commands that assume same account and same region.

@@ -0,0 +1,116 @@
AWSTemplateFormatVersion: "2010-09-09"
Description: CloudFormation template to monitor SageMaker HyperPod - launches a t2.medium instance with 30GB of storage, IAM role for Prometheus access, Grafana setup, and a Prometheus workspace.

Parameters:
  LatestAmiId:
    Type: 'AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>'
    Default: '/aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2'
    Description: "The latest Amazon Linux 2 AMI ID."

Resources:

  GrafanaEC2Role:
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ec2.amazonaws.com
            Action: "sts:AssumeRole"
      Policies:
        - PolicyName: "PrometheusAccessPolicy"
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - aps:ListWorkspaces
                  - aps:DescribeWorkspace
                  - aps:QueryMetrics
                  - aps:GetLabels
                  - aps:GetSeries
                  - aps:GetMetricMetadata
                Resource: "*"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

  MyInstanceProfile:
    Type: "AWS::IAM::InstanceProfile"
    Properties:
      Roles:
        - !Ref GrafanaEC2Role

  APSWorkspace:
    Type: "AWS::APS::Workspace"
    Properties:
      Alias: !Sub "${AWS::StackName}-Hyperpod-WorkSpace"
      Tags:
        - Key: "Name"
          Value: "SageMaker Hyperpod PrometheusMetrics"

  MyInstance:
    Type: "AWS::EC2::Instance"
    Properties:
      InstanceType: "t2.medium"
      ImageId: !Ref LatestAmiId
      IamInstanceProfile: !Ref MyInstanceProfile
      BlockDeviceMappings:
        - DeviceName: "/dev/xvda"
          Ebs:
            VolumeSize: 30
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash

          # Update system packages
          sudo yum update -y

          # Install Docker
          echo "Installing Docker..."
          sudo amazon-linux-extras install docker -y

          # Start Docker service
          echo "Starting Docker service..."
          sudo systemctl start docker

          # Enable Docker to start on boot
          sudo systemctl enable docker

          # Add ec2-user to the Docker group to run Docker commands without sudo
          echo "Adding ec2-user to Docker group..."
          sudo usermod -aG docker ec2-user

          # Pull the pinned Grafana image
          echo "Pulling the Grafana Docker image..."
          docker pull grafana/grafana:10.4.14-ubuntu

          # Run Grafana container with automatic restart and SigV4 authentication enabled
          echo "Starting Grafana container with restart policy..."
          docker run --env GF_AUTH_SIGV4_AUTH_ENABLED=true --env AWS_SDK_LOAD_CONFIG=true -d -p 3000:3000 --name=grafana --restart always grafana/grafana:10.4.14-ubuntu

          # Print Grafana access info
          echo "Docker and Grafana setup complete."
          echo "Grafana is running at http://$(curl -s http://169.254.169.254/latest/meta-data/public-ipv4):3000"
          echo "Default Grafana login credentials are admin/admin. Please change the password after the first login."

          # Note: Log out and log back in for Docker permissions to take effect
          echo "Please log out and back in for Docker group permissions to apply."
      Tags:
        - Key: "Name"
          Value: "OS-Grafana"

Outputs:
  InstanceId:
    Description: "Instance ID of the EC2 instance"
    Value: !Ref MyInstance
  PrometheusWorkspaceId:
    Description: "ID of the Amazon Managed Prometheus Workspace"
    Value: !GetAtt APSWorkspace.WorkspaceId
  AMPRemoteWriteURL:
    Description: "Remote write URL of the Amazon Managed Prometheus Workspace"
    Value: !Join ["" , [ !GetAtt APSWorkspace.PrometheusEndpoint , "api/v1/remote_write" ]]
  GrafanaInstanceAddress:
    Description: "Grafana address with port 3000 for the EC2 instance"
    Value: !Sub "http://${MyInstance.PublicIp}:3000"