Merge pull request #208 from nerc-project/openstack_gpu_request
added all files
Milstein authored Jul 23, 2024
2 parents edd274c + cc3471f commit 153c37d
Showing 18 changed files with 148 additions and 38 deletions.
81 changes: 79 additions & 2 deletions docs/get-started/allocation/allocation-change-request.md
@@ -28,11 +28,19 @@
After submitting the change request, the NERC admin will be notified. Please
wait until the NERC admin approves/denies the change request to see the change
on your resource allocation for the selected project.

!!! info "Information"
!!! tip "Important Information"
PIs or project managers can enter new values for **ONLY** the quota attributes
they want to change; the others can be left **blank** so those quotas will not
be changed!

To use GPU resources on your VM, you need to specify the number of GPUs in the
"OpenStack GPU Quota" attribute. Additionally, ensure that your other quota
attributes, namely "OpenStack Compute vCPU Quota" and "OpenStack Compute RAM
Quota (MiB)", have sufficient resources to meet the **vCPU** and **RAM** requirements
for one of the GPU tier-based flavors. Refer to the [GPU Tier documentation](../../openstack/create-and-connect-to-the-VM/flavors.md#3-gpu-tier)
for specific requirements and further details on the flavors available for GPU
usage.
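
Note that the flavor documentation lists **RAM** in GiB, while the quota
attribute is expressed in MiB, so convert before filling in the textbox. A
quick sketch of the conversion (240 GiB is used as an example requirement):

```shell
# RAM quota is requested in MiB; flavor requirements are listed in GiB.
# Example: a flavor requiring 240 GiB of RAM.
RAM_GIB=240
RAM_MIB=$((RAM_GIB * 1024))
echo "Request at least ${RAM_MIB} MiB"
```

Here the value to enter in the MiB-based quota field would be `245760`.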

### Allocation Change Requests for OpenStack Project

Once the request is processed by the NERC admin, any user can view that request
@@ -46,6 +54,58 @@
This will show more details about the change request as shown below:

![Allocation Change Request Details for OpenStack Project](images/coldfront-openstack-change-requested-details.png)

### How to Use GPU Resources in your OpenStack Project

!!! tip "Comparison Between CPU and GPU"
To learn more about the key differences between CPUs and GPUs, please [read this](../../openstack/create-and-connect-to-the-VM/flavors.md#comparison-between-cpu-and-gpu).

A GPU instance is launched in the [same way](../../openstack/create-and-connect-to-the-VM/launch-a-VM.md)
as any other compute instance, with a few considerations to keep in mind:

- When launching a GPU-based instance, be sure to select one of the
  [GPU Tier](../../openstack/create-and-connect-to-the-VM/flavors.md#3-gpu-tier)-based
  flavors.

- You need to have sufficient resource quota to launch the desired flavor. Always
ensure you know which GPU-based flavor you want to use, then submit an
[allocation change request](#request-change-resource-allocation-attributes-for-openstack-project)
to adjust your current allocation to fit the flavor's resource requirements.

!!! tip "Resource Requirements for Launching a VM with the 'NVIDIA A100 SXM4 40GB' Flavor"
Based on the [GPU Tier documentation](../../openstack/create-and-connect-to-the-VM/flavors.md#i-nvidia-a100-sxm4-40gb),
NERC provides two variations of NVIDIA A100 SXM4 40GB flavors:

1. **`gpu-su-a100sxm4.1`**: Includes 1 NVIDIA A100 GPU
2. **`gpu-su-a100sxm4.2`**: Includes 2 NVIDIA A100 GPUs

To use a GPU-based VM flavor, select the one that best fits your resource
needs and make sure your OpenStack quotas meet the required specifications:

- For the **`gpu-su-a100sxm4.1`** flavor:
- **vCPU**: 32
- **RAM (GiB)**: 240

- For the **`gpu-su-a100sxm4.2`** flavor:
- **vCPU**: 64
- **RAM (GiB)**: 480

Ensure that your OpenStack resource quotas are configured as follows:

- **OpenStack GPU Quota**: Meets or exceeds the number of GPUs required by the
chosen flavor.
- **OpenStack Compute vCPU Quota**: Meets or exceeds the vCPU requirement.
- **OpenStack Compute RAM Quota (MiB)**: Meets or exceeds the RAM requirement.

Properly configure these quotas to successfully launch a VM with the selected
"gpu-su-a100sxm4" flavor.

- We recommend using [ubuntu-22.04-x86_64](../../openstack/create-and-connect-to-the-VM/images.md#nerc-images-list)
as the image for your GPU-based instance because we have tested the NVIDIA driver
with this image and obtained good results. That said, it is possible to run a
variety of other images as well.
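
Putting the points above together, a small sanity check against a flavor's
requirements might look like the sketch below (the `QUOTA_*` values are
illustrative; substitute the quotas shown in your ColdFront allocation):

```shell
# Requirements for the gpu-su-a100sxm4.2 flavor (2 GPUs, 64 vCPUs, 480 GiB RAM)
REQ_GPU=2
REQ_VCPU=64
REQ_RAM_MIB=$((480 * 1024))

# Illustrative quota values -- replace with your project's actual quotas
QUOTA_GPU=2
QUOTA_VCPU=64
QUOTA_RAM_MIB=491520

if [ "$QUOTA_GPU" -ge "$REQ_GPU" ] &&
   [ "$QUOTA_VCPU" -ge "$REQ_VCPU" ] &&
   [ "$QUOTA_RAM_MIB" -ge "$REQ_RAM_MIB" ]; then
    echo "Quotas are sufficient for gpu-su-a100sxm4.2"
else
    echo "Submit an allocation change request first"
fi
```

If any check fails, submit an allocation change request as described above
before attempting to launch the instance.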

## Request Change Resource Allocation Attributes for OpenShift Project

![Request Change Resource Allocation Attributes for OpenShift Project](images/coldfront-openshift-allocation-attributes.png)
@@ -64,11 +124,14 @@
After submitting the change request, the NERC admin will be notified. Please
wait until the NERC admin approves/denies the change request to see the change
on your resource allocation for the selected project.

!!! info "Information"
!!! tip "Important Information"
PIs or project managers can enter new values for **ONLY** the quota attributes
they want to change; the others can be left **blank** so those quotas will not
be changed!

To use GPU resources on your pod, you must specify the number of GPUs you want
to use in the "OpenShift Request on GPU Quota" attribute.
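
Pods then consume this quota by requesting the `nvidia.com/gpu` resource in
their spec. A minimal sketch, in which the pod name, container name, and image
are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example              # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: cuda-workload        # placeholder name
      image: <your-cuda-enabled-image>
      resources:
        limits:
          nvidia.com/gpu: 1      # counted against the GPU quota above
```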

### Allocation Change Requests for OpenShift Project

Once the request is processed by the NERC admin, any user can view that request
@@ -82,4 +145,18 @@
This will show more details about the change request as shown below:

![Allocation Change Request Details for OpenShift Project](images/coldfront-openshift-change-requested-details.png)

### How to Use GPU Resources in your OpenShift Project

!!! tip "Comparison Between CPU and GPU"
To learn more about the key differences between CPUs and GPUs, please [read this](../../openstack/create-and-connect-to-the-VM/flavors.md#comparison-between-cpu-and-gpu).

For OpenShift pods, you can specify different types of GPUs. Since OpenShift is
not based on flavors, resources can be customized as needed at the pod level
while still utilizing GPU resources.

You can read about how to specify a pod to use a GPU [here](../../openshift/applications/scaling-and-performance-guide.md#how-to-specify-pod-to-use-gpu).

Also, you will be able to select a different GPU device for your workload, as
explained [here](../../openshift/applications/scaling-and-performance-guide.md#how-to-select-a-different-gpu-device).
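
As a sketch, selecting a particular device is typically done with a node
selector on the GPU product label exposed by the NVIDIA GPU Operator. The label
value below is illustrative; refer to the guide linked above for the values
available on NERC:

```yaml
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB   # illustrative label value
  containers:
    - name: cuda-workload        # placeholder name
      image: <your-cuda-enabled-image>
      resources:
        limits:
          nvidia.com/gpu: 1
```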

---
2 changes: 1 addition & 1 deletion docs/get-started/allocation/allocation-details.md
@@ -1,6 +1,6 @@
# Allocation details

Access to ColdFront's allocations details is based on [user roles](#user-roles).
Access to ColdFront's allocations details is based on [user roles](manage-users-to-a-project.md#user-roles).
PIs and managers see the same allocation details as users, and can also add
project users to the allocation, if they're not already on it, and remove users
from an allocation.
Expand Down
4 changes: 2 additions & 2 deletions docs/get-started/allocation/coldfront.md
@@ -25,8 +25,8 @@
is granted, the PI will receive an email confirming the request approval and
how to connect to NERC's ColdFront.

PI or project managers can use NERC's ColdFront as a self-service web-portal that
can see an administrative view of it as [described here](#pi-and-manager-view) and
can do the following tasks:
can see an administrative view of it as [described here](coldfront.md#pi-and-manager-view)
and can do the following tasks:

- **Only PI** can add a new project and archive any existing project(s)

10 changes: 10 additions & 0 deletions docs/get-started/allocation/requesting-an-allocation.md
@@ -10,6 +10,16 @@
or *OpenShift Resource Allocation* by specifying either **NERC (OpenStack)** or
**NERC-OCP (OpenShift)** in the **Resource** dropdown option. **Note:** The
first option i.e. **NERC (OpenStack)**, is selected by default.

!!! info "Default GPU Resource Quota for Initial Allocation Requests"
By default, the GPU resource quota is set to 0 for the initial resource
allocation request for both OpenStack and OpenShift Resource Types. However,
you will be able to [change request](allocation-change-request.md) and adjust
the corresponding GPU quotas for both after they are approved for the first
time. For NERC's OpenStack, please follow [this guide](allocation-change-request.md#how-to-use-gpu-resources-in-your-openstack-project)
on how to utilize GPU resources in your OpenStack project. For NERC's OpenShift,
refer to [this reference](allocation-change-request.md#how-to-use-gpu-resources-in-your-openshift-project)
to learn how to use GPU resources at the pod level.

## Request A New OpenStack Resource Allocation for an OpenStack Project

![Request A New OpenStack Resource Allocation](images/coldfront-request-new-openstack-allocation.png)
6 changes: 3 additions & 3 deletions docs/get-started/create-a-user-portal-account.md
@@ -133,8 +133,8 @@
as shown in the image below:
!!! info "Information"
Once your PI user request is reviewed and approved by the NERC's admin, you
will receive an email confirmation from NERC's support system, i.e.,
**[email protected]**. Then, you can access [NERC's ColdFront resource
allocation management portal](https://coldfront.mss.mghpcc.org/) using the
PI user role, as [described here](allocation/coldfront.md).
[[email protected]](mailto:[email protected]?subject=NERC%20MOU%20Question).
Then, you can access [NERC's ColdFront resource allocation management portal](https://coldfront.mss.mghpcc.org/)
using the PI user role, as [described here](allocation/coldfront.md#how-to-get-access-to-nercs-coldfront).

---
8 changes: 4 additions & 4 deletions docs/migration-moc-to-nerc/Step2.md
@@ -97,10 +97,10 @@
samples below your lists might look like this:

| MOC Volume Name | MOC Disk | MOC Attached To | Bootable | MOC UUID | NERC Volume Name |
| --------------- | -------- | --------------- | -------- | -------- | ---------------- |
| Fedora | 10GiB | Fedora_test | Yes | ea45c20b-434a-4c41-8bc6-f48256fc76a8 | |
| 9c73295d-fdfa-4544-b8b8-a876cc0a1e86 | 10GiB | Ubuntu_Test | Yes | 9c73295d-fdfa-4544-b8b8-a876cc0a1e86 | |
| Snapshot of Fed_Test | 10GiB | Fedora_test | No | ea45c20b-434a-4c41-8bc6-f48256fc76a8 | |
| total | 30GiB | | | |
| Fedora | 10GiB | Fedora_test | Yes | ea45c20b-434a-4c41-8bc6-f48256fc76a8 | |
| 9c73295d-fdfa-4544-b8b8-a876cc0a1e86 | 10GiB | Ubuntu_Test | Yes | 9c73295d-fdfa-4544-b8b8-a876cc0a1e86 | |
| Snapshot of Fed_Test | 10GiB | Fedora_test | No | ea45c20b-434a-4c41-8bc6-f48256fc76a8 | |
| total | 30GiB | | | | |

#### MOC Security Group Information Table

@@ -246,7 +246,7 @@
Wait until the requested resource allocation gets approved by the NERC's admin.
After approval, kindly review and verify that the quotas are accurately
reflected in your [resource allocation](https://coldfront.mss.mghpcc.org/allocation/)
and [OpenShift project](https://console.apps.shift.nerc.mghpcc.org). Please ensure
that the approved quota values are accurately displayed as [explained here](#review-your-projects-resource-quota-from-openshift-web-dashboard).
that the approved quota values are accurately displayed as [explained here](decommission-openshift-resources.md#review-your-projects-resource-quota-from-openshift-web-dashboard).

### Review your Project Usage

5 changes: 3 additions & 2 deletions docs/openstack/access-and-security/create-a-key-pair.md
@@ -233,15 +233,16 @@
PuTTY requires SSH keys to be in its own `ppk` format. To convert between
OpenSSH keys used by OpenStack and PuTTY's format, you need a utility called PuTTYgen.

If it was not installed when you originally installed PuTTY, you can get it
here: [Download PuTTY](#http://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html).
here: [Download PuTTY](http://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html).

You have 2 options for generating keys that will work with PuTTY:

1. Generate an OpenSSH key with ssh-keygen or from the Horizon GUI using the
instructions above, then use PuTTYgen to convert the private key to .ppk

2. Generate a .ppk key with PuTTYgen, and import the provided OpenSSH public
key to OpenStack using the 'Import a Key Pair' instructions [above](#import-a-key-pair).
key to OpenStack using the 'Import the generated Key Pair' instructions
[above](create-a-key-pair.md#import-the-generated-key-pair).
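
Option 1 can be sketched on the command line as follows (this assumes the
`puttygen` CLI, which on Debian/Ubuntu ships in the `putty-tools` package; on
Windows, use the PuTTYgen GUI instead):

```shell
# Generate an OpenSSH key pair (or reuse one created via the Horizon GUI)
ssh-keygen -t rsa -b 4096 -f cloud.key -N ""

# Convert the private key to PuTTY's .ppk format (run where puttygen exists):
# puttygen cloud.key -O private -o cloud.ppk

ls cloud.key cloud.key.pub
```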

There is a detailed walkthrough of how to use PuTTYgen here: [Use SSH Keys with
PuTTY on Windows](https://devops.profitbricks.com/tutorials/use-ssh-keys-with-putty-on-windows/).
30 changes: 24 additions & 6 deletions docs/openstack/create-and-connect-to-the-VM/flavors.md
@@ -17,6 +17,24 @@
The important fields are
| Ephemeral | Size of a second disk. 0 means no second disk is defined and mounted. |
| VCPUs | Number of virtual cores |

## Comparison Between CPU and GPU

Here are the key differences between CPUs and GPUs:

| CPUs | GPUs |
| --------------------------------------------- | ---------------------------- |
| Work mostly in sequence. While several cores and excellent task switching give the impression of parallelism, a CPU is fundamentally designed to run one task at a time. | Are designed to work in parallel. A vast number of cores and threading managed in hardware enable GPUs to perform many simple calculations simultaneously. |
| Are designed for task parallelism. | Are designed for data parallelism. |
| Have a small number of cores that can complete single complex tasks at very high speeds. | Have a large number of cores that work in tandem to compute many simple tasks. |
| Have access to a large amount of relatively slow RAM with low latency, optimizing them for latency (operation). | Have access to a relatively small amount of very fast RAM with higher latency, optimizing them for throughput. |
| Have a very versatile instruction set, allowing the execution of complex tasks in fewer cycles but creating overhead in others. | Have a limited (but highly optimized) instruction set, allowing them to execute their designed tasks very efficiently. |
| Task switching (as a result of running the OS) creates overhead. | Task switching is not used; instead, numerous serial data streams are processed in parallel from point A to point B. |
| Will always work for any given use case but may not provide adequate performance for some tasks. | Would only be a valid choice for some use cases but would provide excellent performance in those cases. |

In summary, for applications such as Machine Learning (ML), Artificial
Intelligence (AI), or image processing, a GPU can provide a performance increase
of 50x to 200x compared to a typical CPU performing the same tasks.

## Currently, our setup supports and offers the following flavors

NERC offers the following flavors based on our Infrastructure-as-a-Service
@@ -32,7 +50,7 @@
The standard compute flavor **"cpu-su"** is provided from Lenovo SD530 (2x Intel
8268 2.9 GHz, 48 cores, 384 GB memory) server. The base unit is 1 vCPU, 4 GB
memory with default of 20 GB root disk at a rate of $0.013 / hr of wall time.

| Flavor | SUs | GPU | vCPU | RAM(GB) | Storage(GB) | Cost / hr |
| Flavor | SUs | GPU | vCPU | RAM(GiB) | Storage(GiB) | Cost / hr |
|---------------|-----|-----|-------|---------|-------------|-----------|
|cpu-su.1 |1 |0 |1 |4 |20 |$0.013 |
|cpu-su.2 |2 |0 |2 |8 |20 |$0.026 |
@@ -46,7 +64,7 @@
The memory optimized flavor **"mem-su"** is provided from the same servers at
**"cpu-su"** but with 8 GB of memory per core. The base unit is 1 vCPU, 8 GB
memory with default of 20 GB root disk at a rate of $0.026 / hr of wall time.

| Flavor | SUs | GPU | vCPU | RAM(GB) | Storage(GB) | Cost / hr |
| Flavor | SUs | GPU | vCPU | RAM(GiB) | Storage(GiB) | Cost / hr |
|---------------|-----|-----|-------|---------|-------------|-----------|
|mem-su.1 |1 |0 |1 |8 |20 |$0.026 |
|mem-su.2 |2 |0 |2 |16 |20 |$0.052 |
@@ -99,7 +117,7 @@
The higher number of tensor cores available can significantly enhance the speed
of machine learning applications. The base unit is 32 vCPU, 240 GB memory with
default of 20 GB root disk at a rate of $2.078 / hr of wall time.

| Flavor | SUs | GPU | vCPU | RAM(GB) | Storage(GB) | Cost / hr |
| Flavor | SUs | GPU | vCPU | RAM(GiB) | Storage(GiB) | Cost / hr |
|-------------------|-----|-----|-------|---------|-------------|-----------|
|gpu-su-a100sxm4.1 |1 |1 |32 |240 |20 |$2.078 |
|gpu-su-a100sxm4.2 |2 |2 |64 |480 |20 |$4.156 |
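
As a rough illustration of the rates above, the cost of one
`gpu-su-a100sxm4.1` instance running nonstop for a 30-day month can be
estimated from the hourly rate:

```shell
# $2.078/hr x 24 hr x 30 days (hourly rate taken from the table above)
awk 'BEGIN { printf "$%.2f per 30-day month\n", 2.078 * 24 * 30 }'
```

which works out to $1496.16.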
@@ -131,7 +149,7 @@
industry-leading high throughput and low latency networking. The base unit is 24
vCPU, 74 GB memory with default of 20 GB root disk at a rate of $1.803 / hr of
wall time.

| Flavor | SUs | GPU | vCPU | RAM(GB) | Storage(GB) | Cost / hr |
| Flavor | SUs | GPU | vCPU | RAM(GiB) | Storage(GiB) | Cost / hr |
|---------------|-----|-----|-------|---------|-------------|-----------|
|gpu-su-a100.1 |1 |1 |24 |74 |20 |$1.803 |
|gpu-su-a100.2 |2 |2 |48 |148 |20 |$3.606 |
@@ -161,7 +179,7 @@
The **"gpu-su-v100"** flavor is provided from Dell R740xd (2x Intel Xeon Gold 61
40 cores, 768GB memory, 1x NVIDIA V100 32GB) servers. The base unit is 48 vCPU,
192 GB memory with default of 20 GB root disk at a rate of $1.214 / hr of wall time.

| Flavor | SUs | GPU | vCPU | RAM(GB) | Storage(GB) | Cost / hr |
| Flavor | SUs | GPU | vCPU | RAM(GiB) | Storage(GiB) | Cost / hr |
|---------------|-----|-----|-------|---------|-------------|-----------|
|gpu-su-v100.1 |1 |1 |48 |192 |20 |$1.214 |

@@ -191,7 +209,7 @@
E5-2620 2.40GHz, 24 cores, 128GB memory, 4x NVIDIA K80 12GB) servers. The base unit
is 6 vCPU, 28.5 GB memory with default of 20 GB root disk at a rate of $0.463 /
hr of wall time.

| Flavor | SUs | GPU | vCPU | RAM(GB) | Storage(GB) | Cost / hr |
| Flavor | SUs | GPU | vCPU | RAM(GiB) | Storage(GiB) | Cost / hr |
|--------------|-----|-----|-------|---------|-------------|-----------|
|gpu-su-k80.1 |1 |1 |6 |28.5 |20 |$0.463 |
|gpu-su-k80.2 |2 |2 |12 |57 |20 |$0.926 |
1 change: 1 addition & 0 deletions docs/openstack/create-and-connect-to-the-VM/images.md
@@ -22,6 +22,7 @@
an instance:
| Name |
|---------------------------------------|
| centos-7-x86_64 |
| centos-8-x86_64 |
| debian-10-x86_64 |
| fedora-36-x86_64 |
| rocky-8-x86_64 |
@@ -326,7 +326,7 @@
Press **Yes** if you receive the identity verification popup:
![RDP Windows Popup](images/rdp_popup_for_xrdp.png)

Then, enter your VM's username (ubuntu) and the password you created
for user ubuntu following [this steps](#setting-a-password.md).
for user ubuntu following [these steps](ssh-to-the-VM.md#setting-a-password).

Press **Ok**.

@@ -150,7 +150,7 @@
Wait until the requested resource allocation gets approved by the NERC's admin.
After approval, kindly review and verify that the quotas are accurately
reflected in your [resource allocation](https://coldfront.mss.mghpcc.org/allocation/)
and [OpenStack project](https://stack.nerc.mghpcc.org/). Please ensure that the
approved quota values are accurately displayed as [explained here](#review-your-openstack-dashboard).
approved quota values are accurately displayed as [explained here](decommission-openstack-resources.md#review-your-openstack-dashboard).

### Review your Block Storage(Volume/Cinder) Quota

2 changes: 1 addition & 1 deletion docs/openstack/persistent-storage/detach-a-volume.md
@@ -59,7 +59,7 @@
the volume created before and attached to the VM and can be shown in
Check that the volume is in state 'available' again.

If that's the case, the volume is now ready to either be attached to another
virtual machine or, if it is not needed any longer, to be [completely deleted](#delete-volumes)
virtual machine or, if it is not needed any longer, to be [completely deleted](./delete-volumes.md)
(please note that this step cannot be reverted!).

## Attach the detached volume to an instance
@@ -1266,7 +1266,8 @@
Here,
You can run either `juicefs config redis://default:<your_redis_password>@127.0.0.1:6379/1`
or `juicefs status redis://default:<your_redis_password>@127.0.0.1:6379/1` to get
detailed information about mounted file system i.e. **"myjfs"** that is setup by
following [this step](##formatting-file-system). The output looks like shown here:
following [this step](mount-the-object-storage.md#formatting-file-system). The
output looks like this:

{
...