Merge pull request #208 from nerc-project/openstack_gpu_request
added all files
Milstein authored Jul 23, 2024
2 parents edd274c + cc3471f commit 153c37d
Showing 18 changed files with 148 additions and 38 deletions.
81 changes: 79 additions & 2 deletions docs/get-started/allocation/allocation-change-request.md
@@ -28,11 +28,19 @@
After submitting the change request, the NERC admin will be notified. Please
wait until the NERC admin approves/denies the change request to see the change
on your resource allocation for the selected project.

!!! info "Information"
!!! tip "Important Information"
PIs or project managers can enter new values for **ONLY** the quota attributes
they want to change; the others can be left **blank** so those quotas will not
be changed!

To use GPU resources on your VM, you need to specify the number of GPUs in the
"OpenStack GPU Quota" attribute. Additionally, ensure that your other quota
attributes, namely "OpenStack Compute vCPU Quota" and "OpenStack Compute RAM
Quota (MiB)", have sufficient resources to meet the **vCPU** and **RAM** requirements
for one of the GPU tier-based flavors. Refer to the [GPU Tier documentation](../../openstack/create-and-connect-to-the-VM/flavors.md#3-gpu-tier)
for specific requirements and further details on the flavors available for GPU
usage.
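
Note that the flavor documentation lists **RAM** in GiB, while the quota
attribute is expressed in MiB, so convert before filling in the textbox. A
quick sketch of the conversion (240 GiB is used as an example requirement):

```shell
# RAM quota is requested in MiB; flavor requirements are listed in GiB.
# Example: a flavor requiring 240 GiB of RAM.
RAM_GIB=240
RAM_MIB=$((RAM_GIB * 1024))
echo "Request at least ${RAM_MIB} MiB"
```

Here the value to enter in the MiB-based quota field would be `245760`.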

### Allocation Change Requests for OpenStack Project

Once the request is processed by the NERC admin, any user can view that request
@@ -46,6 +54,58 @@
This will show more details about the change request as shown below:

![Allocation Change Request Details for OpenStack Project](images/coldfront-openstack-change-requested-details.png)

### How to Use GPU Resources in your OpenStack Project

!!! tip "Comparison Between CPU and GPU"
To learn more about the key differences between CPUs and GPUs, please [read this](../../openstack/create-and-connect-to-the-VM/flavors.md#comparison-between-cpu-and-gpu).

A GPU instance is launched in the [same way](../../openstack/create-and-connect-to-the-VM/launch-a-VM.md)
as any other compute instance, with a few considerations to keep in mind:

- When launching a GPU-based instance, be sure to select one of the
  [GPU Tier](../../openstack/create-and-connect-to-the-VM/flavors.md#3-gpu-tier)-based
  flavors.

- You need to have sufficient resource quota to launch the desired flavor. Always
ensure you know which GPU-based flavor you want to use, then submit an
[allocation change request](#request-change-resource-allocation-attributes-for-openstack-project)
to adjust your current allocation to fit the flavor's resource requirements.

!!! tip "Resource Requirements for Launching a VM with the 'NVIDIA A100 SXM4 40GB' Flavor"
Based on the [GPU Tier documentation](../../openstack/create-and-connect-to-the-VM/flavors.md#i-nvidia-a100-sxm4-40gb),
NERC provides two variations of NVIDIA A100 SXM4 40GB flavors:

1. **`gpu-su-a100sxm4.1`**: Includes 1 NVIDIA A100 GPU
2. **`gpu-su-a100sxm4.2`**: Includes 2 NVIDIA A100 GPUs

To use a GPU-based VM flavor, select the one that best fits your resource
needs and make sure your OpenStack quotas meet the required specifications:

- For the **`gpu-su-a100sxm4.1`** flavor:
- **vCPU**: 32
- **RAM (GiB)**: 240

- For the **`gpu-su-a100sxm4.2`** flavor:
- **vCPU**: 64
- **RAM (GiB)**: 480

Ensure that your OpenStack resource quotas are configured as follows:

- **OpenStack GPU Quota**: Meets or exceeds the number of GPUs required by the
chosen flavor.
- **OpenStack Compute vCPU Quota**: Meets or exceeds the vCPU requirement.
- **OpenStack Compute RAM Quota (MiB)**: Meets or exceeds the RAM requirement.

Properly configure these quotas to successfully launch a VM with the selected
"gpu-su-a100sxm4" flavor.

- We recommend using [ubuntu-22.04-x86_64](../../openstack/create-and-connect-to-the-VM/images.md#nerc-images-list)
as the image for your GPU-based instance because we have tested the NVIDIA driver
with this image and obtained good results. That said, it is possible to run a
variety of other images as well.
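
Putting the points above together, a small sanity check against a flavor's
requirements might look like the sketch below (the `QUOTA_*` values are
illustrative; substitute the quotas shown in your ColdFront allocation):

```shell
# Requirements for the gpu-su-a100sxm4.2 flavor (2 GPUs, 64 vCPUs, 480 GiB RAM)
REQ_GPU=2
REQ_VCPU=64
REQ_RAM_MIB=$((480 * 1024))

# Illustrative quota values -- replace with your project's actual quotas
QUOTA_GPU=2
QUOTA_VCPU=64
QUOTA_RAM_MIB=491520

if [ "$QUOTA_GPU" -ge "$REQ_GPU" ] &&
   [ "$QUOTA_VCPU" -ge "$REQ_VCPU" ] &&
   [ "$QUOTA_RAM_MIB" -ge "$REQ_RAM_MIB" ]; then
    echo "Quotas are sufficient for gpu-su-a100sxm4.2"
else
    echo "Submit an allocation change request first"
fi
```

If any check fails, submit an allocation change request as described above
before attempting to launch the instance.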

## Request Change Resource Allocation Attributes for OpenShift Project

![Request Change Resource Allocation Attributes for OpenShift Project](images/coldfront-openshift-allocation-attributes.png)
@@ -64,11 +124,14 @@
After submitting the change request, the NERC admin will be notified. Please
wait until the NERC admin approves/denies the change request to see the change
on your resource allocation for the selected project.

!!! info "Information"
!!! tip "Important Information"
PIs or project managers can enter new values for **ONLY** the quota attributes
they want to change; the others can be left **blank** so those quotas will not
be changed!

To use GPU resources on your pod, you must specify the number of GPUs you want
to use in the "OpenShift Request on GPU Quota" attribute.
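
Pods then consume this quota by requesting the `nvidia.com/gpu` resource in
their spec. A minimal sketch, in which the pod name, container name, and image
are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example              # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: cuda-workload        # placeholder name
      image: <your-cuda-enabled-image>
      resources:
        limits:
          nvidia.com/gpu: 1      # counted against the GPU quota above
```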

### Allocation Change Requests for OpenShift Project

Once the request is processed by the NERC admin, any user can view that request
@@ -82,4 +145,18 @@
This will show more details about the change request as shown below:

![Allocation Change Request Details for OpenShift Project](images/coldfront-openshift-change-requested-details.png)

### How to Use GPU Resources in your OpenShift Project

!!! tip "Comparison Between CPU and GPU"
To learn more about the key differences between CPUs and GPUs, please [read this](../../openstack/create-and-connect-to-the-VM/flavors.md#comparison-between-cpu-and-gpu).

For OpenShift pods, you can specify different types of GPUs. Since OpenShift is
not based on flavors, resources can be customized as needed at the pod level
while still utilizing GPU resources.

You can read about how to specify a pod to use a GPU [here](../../openshift/applications/scaling-and-performance-guide.md#how-to-specify-pod-to-use-gpu).

Also, you will be able to select a different GPU device for your workload, as
explained [here](../../openshift/applications/scaling-and-performance-guide.md#how-to-select-a-different-gpu-device).
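
As a sketch, selecting a particular device is typically done with a node
selector on the GPU product label exposed by the NVIDIA GPU Operator. The label
value below is illustrative; refer to the guide linked above for the values
available on NERC:

```yaml
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB   # illustrative label value
  containers:
    - name: cuda-workload        # placeholder name
      image: <your-cuda-enabled-image>
      resources:
        limits:
          nvidia.com/gpu: 1
```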

---
2 changes: 1 addition & 1 deletion docs/get-started/allocation/allocation-details.md
@@ -1,6 +1,6 @@
# Allocation details

Access to ColdFront's allocations details is based on [user roles](#user-roles).
Access to ColdFront's allocations details is based on [user roles](manage-users-to-a-project.md#user-roles).
PIs and managers see the same allocation details as users, and can also add
project users to the allocation, if they're not already on it, and remove users
from an allocation.
Expand Down
4 changes: 2 additions & 2 deletions docs/get-started/allocation/coldfront.md
@@ -25,8 +25,8 @@
is granted, the PI will receive an email confirming the request approval and
how to connect to NERC's ColdFront.

PI or project managers can use NERC's ColdFront as a self-service web-portal that
can see an administrative view of it as [described here](#pi-and-manager-view) and
can do the following tasks:
can see an administrative view of it as [described here](coldfront.md#pi-and-manager-view)
and can do the following tasks:

- **Only PI** can add a new project and archive any existing project(s)

10 changes: 10 additions & 0 deletions docs/get-started/allocation/requesting-an-allocation.md
@@ -10,6 +10,16 @@
or *OpenShift Resource Allocation* by specifying either **NERC (OpenStack)** or
**NERC-OCP (OpenShift)** in the **Resource** dropdown option. **Note:** The
first option i.e. **NERC (OpenStack)**, is selected by default.

!!! info "Default GPU Resource Quota for Initial Allocation Requests"
By default, the GPU resource quota is set to 0 for the initial resource
allocation request for both OpenStack and OpenShift Resource Types. However,
you will be able to [change request](allocation-change-request.md) and adjust
the corresponding GPU quotas for both after they are approved for the first
time. For NERC's OpenStack, please follow [this guide](allocation-change-request.md#how-to-use-gpu-resources-in-your-openstack-project)
on how to utilize GPU resources in your OpenStack project. For NERC's OpenShift,
refer to [this reference](allocation-change-request.md#how-to-use-gpu-resources-in-your-openshift-project)
to learn how to use GPU resources at the pod level.

## Request A New OpenStack Resource Allocation for an OpenStack Project

![Request A New OpenStack Resource Allocation](images/coldfront-request-new-openstack-allocation.png)
6 changes: 3 additions & 3 deletions docs/get-started/create-a-user-portal-account.md
@@ -133,8 +133,8 @@
as shown in the image below:
!!! info "Information"
Once your PI user request is reviewed and approved by the NERC's admin, you
will receive an email confirmation from NERC's support system, i.e.,
**[email protected]**. Then, you can access [NERC's ColdFront resource
allocation management portal](https://coldfront.mss.mghpcc.org/) using the
PI user role, as [described here](allocation/coldfront.md).
[[email protected]](mailto:[email protected]?subject=NERC%20MOU%20Question).
Then, you can access [NERC's ColdFront resource allocation management portal](https://coldfront.mss.mghpcc.org/)
using the PI user role, as [described here](allocation/coldfront.md#how-to-get-access-to-nercs-coldfront).

---
8 changes: 4 additions & 4 deletions docs/migration-moc-to-nerc/Step2.md
@@ -97,10 +97,10 @@
samples below your lists might look like this:

| MOC Volume Name | MOC Disk | MOC Attached To | Bootable | MOC UUID | NERC Volume Name |
| --------------- | -------- | --------------- | -------- | -------- | ---------------- |
| Fedora | 10GiB | Fedora_test | Yes | ea45c20b-434a-4c41-8bc6-f48256fc76a8 | |
| 9c73295d-fdfa-4544-b8b8-a876cc0a1e86 | 10GiB | Ubuntu_Test | Yes | 9c73295d-fdfa-4544-b8b8-a876cc0a1e86 | |
| Snapshot of Fed_Test | 10GiB | Fedora_test | No | ea45c20b-434a-4c41-8bc6-f48256fc76a8 | |
| total | 30GiB | | | |
| Fedora | 10GiB | Fedora_test | Yes | ea45c20b-434a-4c41-8bc6-f48256fc76a8 | |
| 9c73295d-fdfa-4544-b8b8-a876cc0a1e86 | 10GiB | Ubuntu_Test | Yes | 9c73295d-fdfa-4544-b8b8-a876cc0a1e86 | |
| Snapshot of Fed_Test | 10GiB | Fedora_test | No | ea45c20b-434a-4c41-8bc6-f48256fc76a8 | |
| total | 30GiB | | | | |

#### MOC Security Group Information Table

@@ -246,7 +246,7 @@
Wait until the requested resource allocation gets approved by the NERC's admin.
After approval, kindly review and verify that the quotas are accurately
reflected in your [resource allocation](https://coldfront.mss.mghpcc.org/allocation/)
and [OpenShift project](https://console.apps.shift.nerc.mghpcc.org). Please ensure
that the approved quota values are accurately displayed as [explained here](#review-your-projects-resource-quota-from-openshift-web-dashboard).
that the approved quota values are accurately displayed as [explained here](decommission-openshift-resources.md#review-your-projects-resource-quota-from-openshift-web-dashboard).

### Review your Project Usage

5 changes: 3 additions & 2 deletions docs/openstack/access-and-security/create-a-key-pair.md
@@ -233,15 +233,16 @@
PuTTY requires SSH keys to be in its own `ppk` format. To convert between
OpenSSH keys used by OpenStack and PuTTY's format, you need a utility called PuTTYgen.

If it was not installed when you originally installed PuTTY, you can get it
here: [Download PuTTY](#http://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html).
here: [Download PuTTY](http://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html).

You have 2 options for generating keys that will work with PuTTY:

1. Generate an OpenSSH key with ssh-keygen or from the Horizon GUI using the
instructions above, then use PuTTYgen to convert the private key to .ppk

2. Generate a .ppk key with PuTTYgen, and import the provided OpenSSH public
key to OpenStack using the 'Import a Key Pair' instructions [above](#import-a-key-pair).
key to OpenStack using the 'Import the generated Key Pair' instructions
[above](create-a-key-pair.md#import-the-generated-key-pair).
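
Option 1 can be sketched on the command line as follows (this assumes the
`puttygen` CLI, which on Debian/Ubuntu ships in the `putty-tools` package; on
Windows, use the PuTTYgen GUI instead):

```shell
# Generate an OpenSSH key pair (or reuse one created via the Horizon GUI)
ssh-keygen -t rsa -b 4096 -f cloud.key -N ""

# Convert the private key to PuTTY's .ppk format (run where puttygen exists):
# puttygen cloud.key -O private -o cloud.ppk

ls cloud.key cloud.key.pub
```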

There is a detailed walkthrough of how to use PuTTYgen here: [Use SSH Keys with
PuTTY on Windows](https://devops.profitbricks.com/tutorials/use-ssh-keys-with-putty-on-windows/).
30 changes: 24 additions & 6 deletions docs/openstack/create-and-connect-to-the-VM/flavors.md
@@ -17,6 +17,24 @@
The important fields are
| Ephemeral | Size of a second disk. 0 means no second disk is defined and mounted. |
| VCPUs | Number of virtual cores |

## Comparison Between CPU and GPU

Here are the key differences between CPUs and GPUs:

| CPUs | GPUs |
| --------------------------------------------- | ---------------------------- |
| Work mostly in sequence. While several cores and excellent task switching give the impression of parallelism, a CPU is fundamentally designed to run one task at a time. | Are designed to work in parallel. A vast number of cores and threading managed in hardware enable GPUs to perform many simple calculations simultaneously. |
| Are designed for task parallelism. | Are designed for data parallelism. |
| Have a small number of cores that can complete single complex tasks at very high speeds. | Have a large number of cores that work in tandem to compute many simple tasks. |
| Have access to a large amount of relatively slow RAM with low latency, optimizing them for latency (operation). | Have access to a relatively small amount of very fast RAM with higher latency, optimizing them for throughput. |
| Have a very versatile instruction set, allowing the execution of complex tasks in fewer cycles but creating overhead in others. | Have a limited (but highly optimized) instruction set, allowing them to execute their designed tasks very efficiently. |
| Task switching (as a result of running the OS) creates overhead. | Task switching is not used; instead, numerous serial data streams are processed in parallel from point A to point B. |
| Will always work for any given use case but may not provide adequate performance for some tasks. | Would only be a valid choice for some use cases but would provide excellent performance in those cases. |

In summary, for applications such as Machine Learning (ML), Artificial
Intelligence (AI), or image processing, a GPU can provide a performance increase
of 50x to 200x compared to a typical CPU performing the same tasks.

## Currently, our setup supports and offers the following flavors

NERC offers the following flavors based on our Infrastructure-as-a-Service
@@ -32,7 +50,7 @@
The standard compute flavor **"cpu-su"** is provided from Lenovo SD530 (2x Intel
8268 2.9 GHz, 48 cores, 384 GB memory) server. The base unit is 1 vCPU, 4 GB
memory with default of 20 GB root disk at a rate of $0.013 / hr of wall time.

| Flavor | SUs | GPU | vCPU | RAM(GB) | Storage(GB) | Cost / hr |
| Flavor | SUs | GPU | vCPU | RAM(GiB) | Storage(GiB) | Cost / hr |
|---------------|-----|-----|-------|---------|-------------|-----------|
|cpu-su.1 |1 |0 |1 |4 |20 |$0.013 |
|cpu-su.2 |2 |0 |2 |8 |20 |$0.026 |
@@ -46,7 +64,7 @@
The memory optimized flavor **"mem-su"** is provided from the same servers at
**"cpu-su"** but with 8 GB of memory per core. The base unit is 1 vCPU, 8 GB
memory with default of 20 GB root disk at a rate of $0.026 / hr of wall time.

| Flavor | SUs | GPU | vCPU | RAM(GB) | Storage(GB) | Cost / hr |
| Flavor | SUs | GPU | vCPU | RAM(GiB) | Storage(GiB) | Cost / hr |
|---------------|-----|-----|-------|---------|-------------|-----------|
|mem-su.1 |1 |0 |1 |8 |20 |$0.026 |
|mem-su.2 |2 |0 |2 |16 |20 |$0.052 |
@@ -99,7 +117,7 @@
The higher number of tensor cores available can significantly enhance the speed
of machine learning applications. The base unit is 32 vCPU, 240 GB memory with
default of 20 GB root disk at a rate of $2.078 / hr of wall time.

| Flavor | SUs | GPU | vCPU | RAM(GB) | Storage(GB) | Cost / hr |
| Flavor | SUs | GPU | vCPU | RAM(GiB) | Storage(GiB) | Cost / hr |
|-------------------|-----|-----|-------|---------|-------------|-----------|
|gpu-su-a100sxm4.1 |1 |1 |32 |240 |20 |$2.078 |
|gpu-su-a100sxm4.2 |2 |2 |64 |480 |20 |$4.156 |
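
As a rough illustration of the rates above, the cost of one
`gpu-su-a100sxm4.1` instance running nonstop for a 30-day month can be
estimated from the hourly rate:

```shell
# $2.078/hr x 24 hr x 30 days (hourly rate taken from the table above)
awk 'BEGIN { printf "$%.2f per 30-day month\n", 2.078 * 24 * 30 }'
```

which works out to $1496.16.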
@@ -131,7 +149,7 @@
industry-leading high throughput and low latency networking. The base unit is 24
vCPU, 74 GB memory with default of 20 GB root disk at a rate of $1.803 / hr of
wall time.

| Flavor | SUs | GPU | vCPU | RAM(GB) | Storage(GB) | Cost / hr |
| Flavor | SUs | GPU | vCPU | RAM(GiB) | Storage(GiB) | Cost / hr |
|---------------|-----|-----|-------|---------|-------------|-----------|
|gpu-su-a100.1 |1 |1 |24 |74 |20 |$1.803 |
|gpu-su-a100.2 |2 |2 |48 |148 |20 |$3.606 |
@@ -161,7 +179,7 @@
The **"gpu-su-v100"** flavor is provided from Dell R740xd (2x Intel Xeon Gold 61
40 cores, 768GB memory, 1x NVIDIA V100 32GB) servers. The base unit is 48 vCPU,
192 GB memory with default of 20 GB root disk at a rate of $1.214 / hr of wall time.

| Flavor | SUs | GPU | vCPU | RAM(GB) | Storage(GB) | Cost / hr |
| Flavor | SUs | GPU | vCPU | RAM(GiB) | Storage(GiB) | Cost / hr |
|---------------|-----|-----|-------|---------|-------------|-----------|
|gpu-su-v100.1 |1 |1 |48 |192 |20 |$1.214 |

@@ -191,7 +209,7 @@
E5-2620 2.40GHz, 24 cores, 128GB memory, 4x NVIDIA K80 12GB) servers. The base unit
is 6 vCPU, 28.5 GB memory with default of 20 GB root disk at a rate of $0.463 /
hr of wall time.

| Flavor | SUs | GPU | vCPU | RAM(GB) | Storage(GB) | Cost / hr |
| Flavor | SUs | GPU | vCPU | RAM(GiB) | Storage(GiB) | Cost / hr |
|--------------|-----|-----|-------|---------|-------------|-----------|
|gpu-su-k80.1 |1 |1 |6 |28.5 |20 |$0.463 |
|gpu-su-k80.2 |2 |2 |12 |57 |20 |$0.926 |
1 change: 1 addition & 0 deletions docs/openstack/create-and-connect-to-the-VM/images.md
@@ -22,6 +22,7 @@
an instance:
| Name |
|---------------------------------------|
| centos-7-x86_64 |
| centos-8-x86_64 |
| debian-10-x86_64 |
| fedora-36-x86_64 |
| rocky-8-x86_64 |
@@ -326,7 +326,7 @@
Press **Yes** if you receive the identity verification popup:
![RDP Windows Popup](images/rdp_popup_for_xrdp.png)

Then, enter your VM's username (ubuntu) and the password you created
for user ubuntu following [this steps](#setting-a-password.md).
for user ubuntu following [these steps](ssh-to-the-VM.md#setting-a-password).

Press **Ok**.

@@ -150,7 +150,7 @@
Wait until the requested resource allocation gets approved by the NERC's admin.
After approval, kindly review and verify that the quotas are accurately
reflected in your [resource allocation](https://coldfront.mss.mghpcc.org/allocation/)
and [OpenStack project](https://stack.nerc.mghpcc.org/). Please ensure that the
approved quota values are accurately displayed as [explained here](#review-your-openstack-dashboard).
approved quota values are accurately displayed as [explained here](decommission-openstack-resources.md#review-your-openstack-dashboard).

### Review your Block Storage(Volume/Cinder) Quota

2 changes: 1 addition & 1 deletion docs/openstack/persistent-storage/detach-a-volume.md
@@ -59,7 +59,7 @@
the volume created before and attached to the VM and can be shown in
Check that the volume is in state 'available' again.

If that's the case, the volume is now ready to either be attached to another
virtual machine or, if it is not needed any longer, to be [completely deleted](#delete-volumes)
virtual machine or, if it is not needed any longer, to be [completely deleted](./delete-volumes.md)
(please note that this step cannot be reverted!).

## Attach the detached volume to an instance
@@ -1266,7 +1266,8 @@
Here,
You can run either `juicefs config redis://default:<your_redis_password>@127.0.0.1:6379/1`
or `juicefs status redis://default:<your_redis_password>@127.0.0.1:6379/1` to get
detailed information about mounted file system i.e. **"myjfs"** that is setup by
following [this step](##formatting-file-system). The output looks like shown here:
following [this step](mount-the-object-storage.md#formatting-file-system). The
output looks like this:

{
...