Commit 58e862e
Add gif
1 parent f231c6b commit 58e862e
File tree

4 files changed: +23 -10 lines changed

README.md

Lines changed: 21 additions & 7 deletions
@@ -1,15 +1,26 @@
-# LLaMA
+# LLaMA
 
 This repository is intended as a minimal, hackable and readable example to load [LLaMA](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/) ([arXiv](https://arxiv.org/abs/2302.13971v1)) models and run inference.
 In order to download the checkpoints and tokenizer, fill this [google form](https://forms.gle/jk851eBVbX1m5TAv5)
 
+## Inference with mpirun
+
+This fork supports launching a LLaMA inference job with multiple instances (one or more GPUs on each instance) using `mpirun`. You can find more details [here](deployment/README.md).
+
+Example: launching an interactive 65B LLaMA inference job across eight 1xA10 Lambda Cloud instances
+
+![Launching 65B LLaMA inference across eight A10 Cloud instances](deployment/pics/newton-einstein-8xA10.gif)
+
 ## Setup
 
 In a conda env with pytorch / cuda available, run
+
 ```
 pip install -r requirements.txt
 ```
+
 Then in this repository
+
 ```
 pip install -e .
 ```
@@ -22,18 +33,19 @@ Edit the `download.sh` script with the signed url provided in the email to download
 ## Inference
 
 The provided `example.py` can be run on a single or multi-gpu node with `torchrun` and will output completions for two pre-defined prompts. Using `TARGET_FOLDER` as defined in `download.sh`:
+
 ```
 torchrun --nproc_per_node MP example.py --ckpt_dir $TARGET_FOLDER/model_size --tokenizer_path $TARGET_FOLDER/tokenizer.model
 ```
 
 Different models require different MP values:
 
-| Model | MP |
-|--------|----|
-| 7B | 1 |
-| 13B | 2 |
-| 33B | 4 |
-| 65B | 8 |
+| Model | MP  |
+| ----- | --- |
+| 7B    | 1   |
+| 13B   | 2   |
+| 33B   | 4   |
+| 65B   | 8   |
 
 ## FAQ
 
@@ -56,7 +68,9 @@ LLaMA: Open and Efficient Foundation Language Models -- https://arxiv.org/abs/2302.13971
 ```
 
 ## Model Card
+
 See [MODEL_CARD.md](MODEL_CARD.md)
 
 ## License
+
 See the [LICENSE](LICENSE) file.
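The MP column in the table above maps directly to `--nproc_per_node`. As a minimal sketch (the `mp_for_model` helper is hypothetical, not part of the repo), that mapping can be encoded in shell so the launch command picks the right value automatically:

```shell
#!/bin/sh
# Hypothetical helper: map a LLaMA model size to its MP (model parallel)
# value, mirroring the table in the README.
mp_for_model() {
    case "$1" in
        7B)  echo 1 ;;
        13B) echo 2 ;;
        33B) echo 4 ;;
        65B) echo 8 ;;
        *)   echo "unknown model size: $1" >&2; return 1 ;;
    esac
}

MODEL_SIZE=65B
MP=$(mp_for_model "$MODEL_SIZE")
# Print the resulting launch command (TARGET_FOLDER comes from download.sh).
echo "torchrun --nproc_per_node $MP example.py" \
     "--ckpt_dir \$TARGET_FOLDER/$MODEL_SIZE --tokenizer_path \$TARGET_FOLDER/tokenizer.model"
```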

deployment/README.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ Despite being more memory efficient than previous language foundation models, LLaMA
 
 Don't worry, this tutorial explains how to use `mpirun` to launch a LLaMA inference job across multiple cloud instances (one or more GPUs on each instance). Here are some key updates in addition to the [original llama repo](https://github.com/facebookresearch/llama) and [shawwn's fork](https://github.com/shawwn/llama):
 
-- A script to easily set up a "cluster" of cloud instances that is ready to run LLaMA inference (all models from 7B to 65B).
+- A [script](./setup_nodes.sh) to easily set up a "cluster" of cloud instances that is ready to run LLaMA inference (all models from 7B to 65B).
 - `mpirun` compatible, so you can launch the job directly from the head node without needing to type the `torchrun` command on the worker nodes.
 - Interactive inference mode across multiple nodes.
 - `eos_w`: controls how "lengthy" the results are likely to be by scaling the probability of `eos_token`.
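For the `mpirun` launch itself, the general Open MPI pattern is one rank per GPU, with a hostfile listing the instances. The sketch below only builds the hostfile and prints the command it would run; the IPs and the entrypoint arguments are placeholders, so check deployment/README.md for the fork's actual invocation:

```shell
#!/bin/sh
# Placeholder IPs for eight 1xA10 instances; substitute your own.
INSTANCE_IPS="10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4 10.0.0.5 10.0.0.6 10.0.0.7 10.0.0.8"

# Build an Open MPI hostfile: one slot (one GPU) per instance.
: > hostfile
for ip in $INSTANCE_IPS; do
    echo "$ip slots=1" >> hostfile
done

# Total rank count must equal the model's MP value (8 for 65B).
# Printed rather than executed here, since it needs a live cluster.
echo "mpirun -np 8 --hostfile hostfile" \
     "python example.py --ckpt_dir \$TARGET_FOLDER/65B --tokenizer_path \$TARGET_FOLDER/tokenizer.model"
```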
deployment/pics/newton-einstein-8xA10.gif (7.98 MB binary file, not shown)

deployment/setup_nodes.sh

Lines changed: 1 addition & 2 deletions
@@ -41,12 +41,11 @@ for IP in ${WORKER_IP[*]}; do
4141
ssh -i $LAMBDA_CLOUD_KEY ubuntu@$HEAD_IP "echo '/home/ubuntu/shared ${IP}(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports"
4242
done
4343
ssh -i $LAMBDA_CLOUD_KEY ubuntu@$HEAD_IP "sudo systemctl restart nfs-kernel-server"
44-
# echo "NFS set up on the head node"
44+
echo "NFS set up on the head node"
4545

4646
for IP in ${WORKER_IP[*]}; do
4747
ssh -i $LAMBDA_CLOUD_KEY ubuntu@$IP "sudo mount ${HEAD_IP}:/home/ubuntu/shared /home/ubuntu/shared"
4848
done
49-
5049
echo "NFS set up on the worker nodes"
5150

5251
echo "Clone repos into NFS ------------------------------"
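The loop in the hunk above appends one NFS export entry per worker IP. A local sketch of the lines it writes to `/etc/exports` (the IPs are placeholders, and the real script pipes each line through `ssh` and `sudo tee` on the head node):

```shell
#!/bin/sh
# Placeholder worker IPs; the real script reads these from WORKER_IP.
WORKER_IPS="10.0.0.2 10.0.0.3 10.0.0.4"

# Build the export entries locally instead of appending them to /etc/exports.
EXPORTS=""
for ip in $WORKER_IPS; do
    EXPORTS="${EXPORTS}/home/ubuntu/shared ${ip}(rw,sync,no_subtree_check)
"
done
printf '%s' "$EXPORTS"
```

After the `nfs-kernel-server` restart, running `showmount -e <head-ip>` from a worker is a quick way to confirm the exports are visible before mounting.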
