Update deepsparse-readme.md (#846)
* Update deepsparse-readme.md

* Update deepsparse-readme.md

* Add files via upload

* Update deepsparse-readme.md

* Update deepsparse-readme.md

* Add files via upload

* Update deepsparse-readme.md

* Update deepsparse-readme.md

* Add files via upload

* Add files via upload

* Update deepsparse-readme.md

* Update deepsparse-readme.md

* Add files via upload

* Update deepsparse-readme.md

* Delete batch-1.png

* Delete batch-64.png

* Update deepsparse-readme.md

Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Mark Kurtz <[email protected]>
3 people authored Jan 5, 2023
1 parent 6e98f33 commit a3900c0
Showing 3 changed files with 75 additions and 78 deletions.
Binary file added cluster-bar.png
153 changes: 75 additions & 78 deletions examples/ultralytics-yolo/ultralytics-readmes/deepsparse-readme.md
<div style="display: flex; flex-direction: column; align-items: center;">
  <h1>
    DeepSparse &nbsp;&nbsp;<img alt="tool icon" src="https://raw.githubusercontent.com/neuralmagic/deepsparse/main/docs/source/icon-deepsparse.png" />
  </h1>
</div>

Welcome to software-delivered AI.

This guide explains how to deploy YOLOv5 with Neural Magic's DeepSparse.

DeepSparse is an inference runtime with exceptional performance on CPUs. For instance, compared to ONNX Runtime's baseline, DeepSparse offers a 3.7x speed-up at batch size 1 and a 5.8x speed-up at batch size 64 for YOLOv5s!

<p align="center">
  <img width="60%" src="cluster-bar.png">
</p>

For the first time, your deep learning workloads can meet the performance demands of production without the complexity and costs of hardware accelerators.

Put simply, DeepSparse gives you the performance of GPUs and the simplicity of software:
- **Flexible Deployments**: Run consistently across cloud, data center, and edge with any hardware provider from Intel to AMD to ARM
- **Infinite Scalability**: Scale vertically to 100s of cores, out with standard Kubernetes, or fully-abstracted with Serverless
- **Easy Integration**: Clean APIs for integrating your model into an application and monitoring it in production

**[Start your 90 day Free Trial](https://neuralmagic.com/deepsparse-free-trial/?utm_campaign=free_trial&utm_source=ultralytics_github).**

## How DeepSparse Achieves GPU-Class Performance

DeepSparse takes advantage of model sparsity to gain its performance speedup.

Sparsification through pruning and quantization is a broadly studied technique, allowing order-of-magnitude reductions in the size and compute needed to execute a network while maintaining high accuracy. DeepSparse is sparsity-aware, meaning it skips the zeroed-out parameters, shrinking the amount of compute in a forward pass. Since the sparse computation is now memory bound, DeepSparse executes the network depth-wise, breaking the problem into Tensor Columns, vertical stripes of computation that fit in cache.

<p align="center">
<img width="60%" src="tensor-columns.png">
</p>

Sparse networks with compressed computation, executed depth-wise in cache, allow DeepSparse to deliver GPU-class performance on CPUs!
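
To build intuition for why skipping zeroed-out parameters helps, here is a toy sketch of the arithmetic — it only counts multiply-adds and is not how DeepSparse's kernels are actually implemented:

```python
import numpy as np

rng = np.random.default_rng(0)

# a toy "layer": a 512x512 weight matrix pruned to ~90% sparsity
weights = rng.standard_normal((512, 512))
weights[rng.random(weights.shape) < 0.90] = 0.0

dense_macs = weights.size                # multiply-adds a dense kernel performs per input vector
sparse_macs = np.count_nonzero(weights)  # multiply-adds remaining if zeros are skipped

print(f"dense MACs:  {dense_macs}")
print(f"sparse MACs: {sparse_macs}  (~{dense_macs / sparse_macs:.0f}x fewer)")
```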

### How Do I Create A Sparse Version of YOLOv5 Trained on My Data?

Neural Magic's open-source model repository, SparseZoo, contains pre-sparsified checkpoints of each YOLOv5 model. Using SparseML, which is integrated with Ultralytics, you can fine-tune a sparse checkpoint onto your data with a single CLI command.

[Checkout Neural Magic's YOLOv5 documentation for more details](https://docs.neuralmagic.com/use-cases/object-detection/sparsifying).
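
For reference, the fine-tuning command takes roughly the following shape. The flag values, SparseZoo stub, and dataset file below are illustrative assumptions — consult the linked documentation for the exact invocation:

```bash
# sparse transfer learning: start from a pre-sparsified checkpoint and
# fine-tune onto a custom dataset while preserving the sparsity structure
sparseml.yolov5.train \
  --weights zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none \
  --recipe zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none \
  --data your-dataset.yaml
```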

We will walk through an example benchmarking and deploying a sparse version of YOLOv5s with DeepSparse.

### Install DeepSparse

Run the following to install DeepSparse. We recommend you use a virtual environment with Python.

```bash
pip install deepsparse[server,yolo,onnxruntime]
```
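
For example, a fresh environment can be set up like this before installing (the directory name `deepsparse-env` is arbitrary):

```bash
python3 -m venv deepsparse-env       # create an isolated environment
source deepsparse-env/bin/activate   # activate it (Linux/macOS)
pip install "deepsparse[server,yolo,onnxruntime]"
```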

### Collect an ONNX File

DeepSparse accepts a model in the ONNX format, passed either as:
- A SparseZoo stub which identifies an ONNX file in the SparseZoo
- A local path to an ONNX model in a filesystem

The examples below will use the standard dense YOLOv5s and pruned-quantized YOLOv5s checkpoints, identified by the following SparseZoo stubs:
```bash
zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned35_quant-none-vnni  # < pruned for VNNI machines
```
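
If you already have an ONNX export of your model on disk, the tools shown below accept a local path in place of a stub — for example (the file path here is illustrative):

```bash
deepsparse.benchmark ./yolov5s-custom.onnx -s sync -b 1
```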

### Benchmark Performance

We will compare DeepSparse's throughput to ONNX Runtime's throughput on YOLOv5s, using DeepSparse's benchmarking script.

The benchmarks were run on an AWS `c6i.8xlarge` instance (16 cores).
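
The `c6i` instances are Intel Ice Lake machines with AVX-512 VNNI support, which matters for the 4-block pruned-quantized results below. On Linux you can check whether your own CPU exposes VNNI with something like:

```bash
lscpu | grep -c avx512_vnni   # prints a non-zero count if VNNI is available
```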

#### Batch 1 Performance Comparison

ONNX Runtime achieves 49 images/sec with dense YOLOv5s:

```bash
deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none -s sync -b 1 -nstreams 1 -e onnxruntime

> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
> Batch Size: 1
> Scenario: sync
> Throughput (items/sec): 48.8549
> Latency Mean (ms/batch): 20.4613
> Latency Median (ms/batch): 20.4192
```
DeepSparse achieves 135 images/sec with a pruned-quantized YOLOv5s, **a 2.8x performance gain over ONNX Runtime!**

```bash
deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none -s sync -b 1 -nstreams 1

> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
> Batch Size: 1
> Scenario: sync
> Throughput (items/sec): 135.0647
> Latency Mean (ms/batch): 7.3895
> Latency Median (ms/batch): 7.2398
```

Since `c6i.8xlarge` instances have VNNI instructions, DeepSparse's throughput can be pushed further if weights are pruned in blocks of 4. DeepSparse achieves 180 images/sec with a 4-block pruned-quantized YOLOv5s, a **3.7x performance gain over ONNX Runtime!**
```bash
deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned35_quant-none-vnni -s sync -b 1 -nstreams 1
> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned35_quant-none-vnni
> Batch Size: 1
> Scenario: sync
> Throughput (items/sec): 179.6016
> Latency Mean (ms/batch): 5.5615
> Latency Median (ms/batch): 5.5458
```

#### Batch 64 Performance Comparison
In latency-insensitive scenarios with large batch sizes, DeepSparse's performance relative to ONNX Runtime is even stronger.

ONNX Runtime achieves 42 images/sec with dense YOLOv5s:
```bash
deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none -s sync -b 64 -nstreams 1 -e onnxruntime
> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none
> Batch Size: 64
> Scenario: sync
> Throughput (items/sec): 41.5560
> Latency Mean (ms/batch): 1538.6640
> Latency Median (ms/batch): 1538.0362
```

DeepSparse achieves 239 images/sec with pruned-quantized YOLOv5s, a **5.8x performance improvement over ONNX Runtime**!

```bash
deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none -s sync -b 64 -nstreams 1
> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
> Batch Size: 64
> Scenario: sync
> Throughput (items/sec): 239.0854
> Latency Mean (ms/batch): 267.6703
> Latency Median (ms/batch): 267.3194
```
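
As a sanity check, the headline speedups follow directly from the reported throughputs (a quick sketch using the numbers above):

```python
# reported throughputs (items/sec) from the benchmark runs above
ort_b1, ds_b1_pruned65, ds_b1_vnni = 48.8549, 135.0647, 179.6016
ort_b64, ds_b64_pruned65 = 41.5560, 239.0854

print(f"batch 1,  pruned65_quant:       {ds_b1_pruned65 / ort_b1:.1f}x")   # ~2.8x
print(f"batch 1,  pruned35_quant-vnni:  {ds_b1_vnni / ort_b1:.1f}x")       # ~3.7x
print(f"batch 64, pruned65_quant:       {ds_b64_pruned65 / ort_b64:.1f}x") # ~5.8x
```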

### Deploy a Model
To try the deployment examples below, pull down a sample image for the example and save it as `basilica.jpg` with the following command:
```bash
wget -O basilica.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg
```

#### Annotate CLI
You can also use the `annotate` command to have the engine save an annotated photo on disk. Try `--source 0` to annotate your live webcam feed!
```bash
deepsparse.object_detection.annotate --model_filepath zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned_quant-aggressive_94 --source basilica.jpg
```
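
As noted above, the source does not have to be an image file; a webcam feed can be annotated instead (assuming device `0` is your webcam):

```bash
deepsparse.object_detection.annotate --model_filepath zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned_quant-aggressive_94 --source 0
```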

Running the above command will create an `annotation-results` folder and save the annotated image inside.

<p align = "center">
<img src="https://github.com/neuralmagic/deepsparse/blob/d31f02596ebff2ec62761d0bc9ca14c4663e8858/src/deepsparse/yolo/sample_images/basilica-annotated.jpg" alt="annotated" width="60%"/>
</p>

#### Python API

`Pipelines` wrap pre-processing and output post-processing around the runtime, providing a clean interface for adding DeepSparse to an application.
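
The Python snippet is collapsed in this diff view past `Pipeline.create(`; the following is a minimal, complete sketch built from the visible fragment — the inference call and its `iou_thres`/`conf_thres` arguments are illustrative assumptions rather than documented defaults:

```python
from deepsparse import Pipeline

# list of images in the local filesystem
images = ["basilica.jpg"]

# create the Pipeline from a SparseZoo stub (downloads the ONNX file)
model_stub = "zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none"
yolo_pipeline = Pipeline.create(
    task="yolo",
    model_path=model_stub,
)

# run inference on the images; returns bounding boxes, scores, and labels
pipeline_outputs = yolo_pipeline(images=images, iou_thres=0.6, conf_thres=0.001)
print(pipeline_outputs)
```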

#### HTTP Server

Spin up the Server with the pruned-quantized YOLOv5s:
```bash
deepsparse.server \
--task yolo \
--model_path zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned65_quant-none
```

An example request, using Python's `requests` package:
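
The request snippet is collapsed in this view; below is a minimal sketch reconstructed around the visible `labels = annotations["labels"]` line. The port (5543) and the `/predict/from_files` endpoint are assumed to be the Server defaults — check the Server's startup output if your configuration differs.

```python
import json

import requests

# list of images for inference (local files on the client side)
path = ["basilica.jpg"]
files = [("request", open(img, "rb")) for img in path]

# send the request over HTTP to the /predict/from_files endpoint
url = "http://0.0.0.0:5543/predict/from_files"
resp = requests.post(url=url, files=files)

# the response is returned as JSON
annotations = json.loads(resp.text)  # dictionary of annotation results
bounding_boxes = annotations["boxes"]
labels = annotations["labels"]
```
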
## Get Started With DeepSparse

**Research or Testing?** DeepSparse Community is free for research and testing. Get started with our [Documentation](https://docs.neuralmagic.com/).

**Want to Try DeepSparse Enterprise?** [Start your 90 day free trial](https://neuralmagic.com/deepsparse-free-trial/?utm_campaign=free_trial&utm_source=ultralytics_github).
