Commit 1e64267

Update 1.7 docs to reflect latest changes from nightly
1 parent 2efa8c6 commit 1e64267

17 files changed (+634, -31 lines)


versioned_docs/version-1.7.0/get-started/index.mdx

+1 -1

@@ -58,6 +58,6 @@ Transfer a pre-sparsified, foundational LLM to your use case without heavy retra

🔗 <b>Let's Get Started!</b> Choose a task above to begin your Neural Magic journey.

-📚 <b>Need Help?</b> Join our [Slack community](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-2gbar46r6-2Tu~SS5iQdHgczAKlQ2jJA) for support.
+📚 <b>Need Help?</b> Join our [Slack community](https://neuralmagic.com/community/) for support.

🌟 <b>Shape the Future:</b> Contribute to our GitHub [GitHub repositories](https://github.com/neuralmagic).

versioned_docs/version-1.7.0/get-started/install/deepsparse.mdx

+11 -1

@@ -16,7 +16,7 @@ sidebar_position: 1
# Installing DeepSparse

DeepSparse is Neural Magic's inference engine, empowering you to run deep learning models on CPUs with exceptional performance and efficiency.
-This guide covers various installation methods, including PyPI, Docker, and installation from the GitHub source code for advanced use cases.
+This guide covers various installation methods, including PyPI and installation from the GitHub source code for advanced use cases.

## Prerequisites

@@ -77,6 +77,16 @@ Or from a locally cloned repository:
```
</VersionInjector>

+### Product Usage Analytics
+DeepSparse Community Edition gathers basic usage telemetry including, but not limited to, Invocations, Package, Version, and IP Address for Product Usage Analytics purposes. Review Neural Magic's [Products Privacy Policy](https://neuralmagic.com/legal/) for further details on how we process this data.
+
+To disable Product Usage Analytics, run the command:
+```bash
+export NM_DISABLE_ANALYTICS=True
+```
+
+Confirm that telemetry is shut off through info logs streamed with engine invocation by looking for the phrase "Skipping Neural Magic's latest package version check." For additional assistance, reach out through the [DeepSparse GitHub Issue queue](https://github.com/neuralmagic/deepsparse/issues).
+
## Enterprise Installation

### PyPI

versioned_docs/version-1.7.0/guides/deepsparse-engine/index.mdx

+3 -3

@@ -12,12 +12,12 @@ keywords:
- scheduler
description: DeepSparse Feature Overview
sidebar_label: DeepSparse Features
-sidebar_position: 1
+sidebar_position: 2
---

# DeepSparse Features

Learn more about DeepSparse through the following overviews:

-<DocCardList>
-</DocCardList>
+<DocCardList />
+

versioned_docs/version-1.7.0/guides/deploying-deepsparse/deepsparse-server.mdx

+1 -1

@@ -149,4 +149,4 @@ All you need is to add `/docs` at the end of your host URL:

localhost:5543/docs

-<img src="https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/server/img/swagger_ui.png" alt="Swagger UI For Viewing Model Pipeline" width="1200" height="524" />
+<img src="https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/server/img/endpoints.png" alt="Swagger UI For Viewing Model Pipeline" width="1200" height="524" />

versioned_docs/version-1.7.0/guides/deploying-deepsparse/google-cloud-run.mdx

+2 -2

@@ -1,5 +1,5 @@
---
-description: Deploy DeepSparse in a Serverless framework with Google Cloud Run.
+description: Deploy DeepSparse in a serverless framework with Google Cloud Run.
sidebar_label: Google Cloud Run
sidebar_position: 4
---

@@ -33,7 +33,7 @@ cd deepsparse/examples/google-cloud-run
```

## Model Configuration
-The current server configuration is running `token classification`. To alter the model, task, or other parameters (e.g., number of cores, workers, routes, or batch size), edit the `config.yaml` file.
+The current server configuration is running `token classification`. To alter the model, task or other parameters (e.g., number of cores, workers, routes, or batch size), edit the `config.yaml` file.

## Create Endpoint
Run the following command to build the Cloud Run endpoint.

versioned_docs/version-1.7.0/guides/deploying-deepsparse/index.mdx

+3 -3

@@ -10,13 +10,13 @@ keywords:
- GCP
description: DeepSparse Deployment Options
sidebar_label: Deployment Options
-sidebar_position: 1
+sidebar_position: 3
---

# DeepSparse Deployment Options

Select the deployment option that's best for your project. Benefit from a faster and cost-effective solution.

-<DocCardList>
-</DocCardList>
+<DocCardList />
+

versioned_docs/version-1.7.0/guides/index.mdx

+1

@@ -14,3 +14,4 @@ Explore these foundational guides that will provide essential information and in
<DocCardList>
</DocCardList>

+

New file:

@@ -0,0 +1,57 @@
---
tags:
- onnx
- model export
- cpu inference
keywords:
- onnx model format
- neural network interchange
- cross-platform compatibility
- deepsparse
description: Overview of ONNX, its role in DeepSparse for CPU inference, and guidance on exporting models to the ONNX format.
sidebar_label: ONNX
sidebar_position: 4
---

# ONNX: Model Definitions for DeepSparse

ONNX (Open Neural Network Exchange) is an open-source format for representing machine learning models, including deep neural networks.
It provides a standardized way to exchange models between different frameworks and tools, promoting cross-platform compatibility.

## Why ONNX Matters for DeepSparse

DeepSparse leverages ONNX for optimized CPU inference pathways (a minimal sketch of loading an ONNX file with DeepSparse follows the list below). Here's why it's important:

- <b>Framework Flexibility:</b> Exporting models to ONNX allows you to deploy sparsified and optimized neural networks created in various training frameworks (e.g., PyTorch, TensorFlow) to DeepSparse's CPU inference engine.
- <b>Hardware Portability:</b> ONNX, combined with DeepSparse, can run on various CPU architectures (x86, ARM, etc.), ensuring your optimized models work across diverse hardware environments.
- <b>Performance Optimization:</b> DeepSparse's ONNX runtime is specifically tuned to deliver efficient inference performance on CPUs, taking advantage of platform-specific optimizations.

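To make the pathway concrete, here is a minimal sketch of compiling and running an exported ONNX file with DeepSparse's `compile_model` API; the file path and input shape below are placeholder assumptions for illustration:

```python
import numpy as np
from deepsparse import compile_model

# Hypothetical path to an ONNX file produced by an earlier export step
onnx_path = "./onnx-export/model.onnx"
engine = compile_model(onnx_path, batch_size=1)

# DeepSparse expects a list of NumPy arrays, one per model input
inputs = [np.random.rand(1, 3, 224, 224).astype(np.float32)]
outputs = engine.run(inputs)
print([out.shape for out in outputs])
```
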
## Exporting Models to ONNX

Converting your trained model to the ONNX format depends on the original training framework.
SparseML is the default pathway for exporting models to ONNX within the Neural Magic ecosystem.
It supports exporting standard PyTorch models to ONNX, including unoptimized and sparsified models.

Here's a basic example of exporting a PyTorch GPT-2 model from Hugging Face to ONNX using SparseML:

```python
from sparseml import export

# Load your PyTorch model
from sparseml.transformers import SparseAutoModelForCausalLM, SparseAutoTokenizer
model = SparseAutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = SparseAutoTokenizer.from_pretrained("gpt2")

# Export the model to ONNX
export(
    model=model,
    tokenizer=tokenizer,
    target_path="./onnx-export",
)
```

For unsupported frameworks, or if the above doesn't work for your custom models, you can export models to ONNX using the native APIs of the training framework or supported third-party pathways (a minimal native PyTorch sketch follows the list below):
- [PyTorch](https://pytorch.org/docs/stable/onnx.html)
- [TensorFlow](https://github.com/onnx/tensorflow-onnx)
- [Keras](https://github.com/onnx/keras-onnx)
- [JAX](https://github.com/google/jaxonnxruntime)
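
As a reference point, here is a minimal sketch of exporting a small, self-contained PyTorch module with the native `torch.onnx` API; the toy model, shapes, and file name are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Toy model used purely to illustrate the export call
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

dummy_input = torch.randn(1, 128)  # example input that traces the graph
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},  # variable batch size
    opset_version=14,
)
```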

New file:

@@ -0,0 +1,134 @@
---
tags:
- sparsification
- model optimization
- model compression
keywords:
- model compression
- model acceleration
- neural network optimization
- efficiency
- pruning
- quantization
- distillation
description: A comprehensive overview of sparsification techniques used to create smaller, faster, and more energy-efficient neural networks while maintaining accuracy.
sidebar_label: Sparsification
sidebar_position: 1
---

# Sparsification: Compressing Neural Networks

Sparsification encompasses a range of powerful techniques used to compress and optimize neural networks.
By strategically removing or reducing the significance of less important connections and information within a model, sparsification preserves accuracy while resulting in:
- <b>Smaller Model Sizes:</b> Reduced storage requirements and memory footprint, simplifying deployment.
- <b>Faster Inference:</b> Significant boosts in computational speed, especially on resource-constrained hardware, promoting real-time applications.
- <b>Reduced Energy Consumption:</b> Efficient execution for servers, edge environments, and mobile devices, lowering costs and broadening usage.

This guide delves into the core concepts of sparsification. In it, you'll learn:
- <b>The Purpose of Sparsification:</b> Discover the benefits and motivations behind optimizing neural networks.
- <b>Essential Techniques:</b> Explore the key methods used to achieve sparsification.
- <b>Application Strategies:</b> Understand how to implement sparsification at different stages of the model's lifecycle.
- <b>Practical Recipes:</b> Get guidance on applying sparsification techniques to everyday use cases.

## Techniques

Sparsification techniques can be broadly categorized into several key methods, each with its unique approach to compressing and optimizing neural networks:

### Quantization

Quantization reduces the precision of weights and activations in a neural network, for example, from 32-bit floating-point numbers to 8-bit integers.
Quantization can be applied to weights, activations, or both and can be done statically (before deployment) or dynamically (at runtime).
It decreases model size and memory usage, often leading to faster inference, particularly with specialized hardware support for low-precision arithmetic.

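For illustration, here is a minimal sketch of post-training dynamic quantization using plain PyTorch; the toy model and layer types are assumptions chosen only to show the mechanics:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Convert Linear weights to int8; activations are quantized dynamically at runtime
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface, smaller weights
```
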
### Pruning

Pruning eliminates redundant or less important connections within a model.
Pruning can be done in either a structured or unstructured manner, where structured pruning changes the model's shape, and unstructured pruning keeps the shape intact while introducing zeros in the weights (sparsity).
This results in a smaller model and faster inference due to reduced compute, provided the engine/hardware supports sparse computation.

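As a minimal sketch, unstructured magnitude pruning can be applied to a single layer with PyTorch's pruning utilities; the layer size and 80% sparsity target are illustrative assumptions:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 512)

# Zero out the 80% of weights with the smallest absolute values
prune.l1_unstructured(layer, name="weight", amount=0.8)
prune.remove(layer, "weight")  # bake the zeros into the weight tensor

sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.0%}")
```
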
### Knowledge Distillation

Distillation generally trains a smaller or more compressed "student" model to mimic the behavior of a larger, unoptimized "teacher" model.
It enables the creation of more compressed models that are easier to deploy and execute while leveraging the knowledge and performance of the larger model to maintain accuracy.
Distillation is further broken down into granularity levels, such as model-level, layer-level, and instance-level distillation.

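A minimal sketch of a distillation objective on softened logits is shown below; the temperature, loss weighting, and random tensors are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's tempered output distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits, teacher_logits = torch.randn(4, 10), torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```
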
### Low-Rank Approximation

Low-rank approximation, also known as matrix factorization, matrix decomposition, or tensor decomposition, reduces the rank of the weight matrices in a neural network, effectively compressing the model.
This technique is based on the observation that the weight matrices of neural networks are often low-rank, meaning they can be approximated by a product of two smaller matrices.
It can be particularly effective for compressing a model's large, fully connected layers.
Related low-rank methods such as LoRA are also used in conjunction with other compression techniques, such as quantization (QLoRA), to enable faster fine-tuning.

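As a minimal sketch, a fully connected layer can be replaced by a rank-limited factorization computed with a truncated SVD; the layer size and rank are illustrative assumptions:

```python
import torch
import torch.nn as nn

layer = nn.Linear(1024, 1024, bias=False)
rank = 64

U, S, Vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
# W ≈ (U_r * S_r) @ Vh_r, implemented as two smaller Linear layers
low_rank = nn.Sequential(nn.Linear(1024, rank, bias=False), nn.Linear(rank, 1024, bias=False))
low_rank[0].weight.data = Vh[:rank, :]            # shape (rank, 1024)
low_rank[1].weight.data = U[:, :rank] * S[:rank]  # shape (1024, rank)

x = torch.randn(2, 1024)
print((layer(x) - low_rank(x)).abs().mean())  # approximation error (large for random weights)
```
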
### Conditional Computation

Conditional computation selectively activates only parts of a model based on the input data, leading to dynamic sparsity.
This can be achieved through techniques such as gating, where a gating network decides which parts of the model to execute, or through adaptive computation, where the model learns to skip or reduce computation based on the input, such as Mixture of Experts (MoE) techniques.
Conditional computation can significantly speed up inference time, especially for models with large, redundant, or unnecessary computations.

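A minimal sketch of input-dependent routing with a tiny top-1 gated mixture of experts follows; the dimensions and expert count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, dim=256, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))

    def forward(self, x):  # x: (batch, dim)
        choice = self.gate(x).argmax(dim=-1)  # pick one expert per example
        out = torch.empty_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = expert(x[mask])  # only the chosen expert runs for these rows
        return out

print(Top1MoE()(torch.randn(8, 256)).shape)
```
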
### Regularization

Regularization methods such as L1 and L2 can be used to encourage sparsity in a neural network's weights.
Adding a regularization term to the loss function incentivizes the model to reduce overfitting and learn simpler representations, which can lead to sparser models.
Regularization can be used with other techniques, such as pruning, to further enhance the sparsity of a model.

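A minimal sketch of adding an L1 penalty to a training loss is shown below; the model, random batch, and penalty strength are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 10)
criterion = nn.CrossEntropyLoss()
l1_strength = 1e-4

x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))
task_loss = criterion(model(x), y)
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = task_loss + l1_strength * l1_penalty  # pushes many weights toward zero
loss.backward()
```
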
### Weight Sharing

Weight sharing involves sharing the weights of a neural network across different parts of the model, effectively reducing the number of unique weights and thereby reducing the model size.
This can be done by clustering similar weights and sharing the same weight value across multiple connections.
Weight sharing can be particularly effective for reducing a model's memory footprint, especially when combined with other compression techniques.

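As a minimal sketch, a layer's weights can be clustered into a small shared codebook; the cluster count and the use of scikit-learn's KMeans are assumptions for illustration:

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

layer = nn.Linear(256, 256)
weights = layer.weight.data.reshape(-1, 1).numpy()

# Cluster all weights into 16 shared values and map each weight to its centroid
kmeans = KMeans(n_clusters=16, n_init=10).fit(weights)
shared = kmeans.cluster_centers_[kmeans.labels_].reshape(layer.weight.shape)
layer.weight.data = torch.tensor(shared, dtype=layer.weight.dtype)

print(torch.unique(layer.weight).numel(), "unique weight values remain")
```
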
### Architecture Search

Techniques such as neural architecture search (NAS) can automatically discover more efficient and compact neural network architectures.
By searching over a large space of possible architectures, NAS can identify smaller, faster, and more accurate models than hand-designed architectures.
NAS can be used to optimize existing models or discover entirely new architectures tailored to specific tasks or constraints.

### Compound Sparsification

Compound sparsification combines multiple techniques to achieve even more significant compression and optimization.
By leveraging the strengths of different methods, compound sparsification can create smaller, faster, and more energy-efficient models than those produced by individual techniques.
For example, pruning can be combined with quantization and distillation to create highly compressed models that retain high accuracy.

## Application

Sparsification techniques can be applied at different stages of a model's lifecycle with varying degrees of complexity and effectiveness:

### Post-Training / One-Shot

Sparsification can be applied post-training, where a pre-trained model is compressed using pruning, quantization, or distillation techniques.
Post-training is often the most straightforward approach to sparsification, as it does not require changes to the training process or hyperparameters.
However, post-training sparsification may not achieve the same level of compression or performance as techniques applied during training.
It is particularly practical for quantization but less effective for pruning.

### Training Aware

Sparsification can also be applied during training, where the model is trained with sparsification techniques such as pruning, quantization, and distillation.
This approach can lead to more effective compression and optimization as the model adapts to the sparsity constraints during training.
Training-aware sparsification can be more complex and computationally intensive than post-training sparsification, but it can often achieve better results in terms of model size, speed, and accuracy.

### Transfer Learning

Sparsification can be combined with transfer learning, where a sparsified, pre-trained model is fine-tuned on a new task or dataset.
This approach can leverage the knowledge and compression of the pre-trained model without the complexity of sparsification hyperparameters or training from scratch.
Transfer learning with sparsification can be particularly effective for quickly adapting compressed models to new tasks or domains with fewer resources and complexity while closely matching the performance of training-aware techniques.

## Recipes

Sparsification recipes provide a structured and reusable way to define the steps and parameters for optimizing neural networks.
They encapsulate the specific sparsification techniques, hyperparameters, and necessary training adjustments into a single configuration file.
Sparsification recipes can be shared, reused, and adapted across different models, tasks, and domains, making experimenting with and deploying compressed models easier.

Recipes are core to the sparsification process through SparseML, a comprehensive framework for sparsification and model optimization.
Additionally, models available in SparseZoo or our Hugging Face model hub generally include the recipes used to train them, making it easy to reproduce and adapt the training process.

Throughout the sparsification guides, you'll find example recipes for different techniques and applications, providing a hands-on approach to implementing and experimenting with sparsification.

A general workflow for sparsification using SparseML is as follows:
1. Define a sparsification recipe for the desired technique and application.
2. Integrate SparseML into your experimentation pipelines or utilize the pre-built pipelines in SparseML.
3. Apply the sparsification recipe to your model through one-shot, training-aware, or transfer learning methods.
4. Evaluate the compressed model on your desired metrics and tasks.

---

Dive into the guides in this section to learn more about the core sparsification techniques, applications, and recipes for compressing and optimizing neural networks.

versioned_docs/version-1.7.0/llms/guides/hf-llm-to-deepsparse.mdx

+2 -2

@@ -12,7 +12,7 @@ This guide is for people interested in exporting their Hugging Face-compatible L

SparseML provides tools for optimizing machine learning models for deployment. To install it along with the necessary support for Hugging Face Transformers, open your terminal and run:
```bash
-pip install sparseml[transformers]==1.7
+pip install sparseml[transformers]
```

> #### Note on system requirements

@@ -52,7 +52,7 @@ After exporting your model, you can run inference using DeepSparse.
1. **Install DeepSparse LLM**:
Install the DeepSparse library, which is specifically designed for running inference on large language models (LLMs) efficiently.
```bash
-pip install deepsparse[llm]==1.7
+pip install deepsparse[llm]
```

2. **Load Your Model and Run Inference**:

versioned_docs/version-1.7.0/llms/guides/one-shot-llms-with-sparseml.mdx

+6 -15

@@ -40,11 +40,7 @@ options:
Example command:
```bash
wget https://huggingface.co/nm-testing/TinyLlama-1.1B-Chat-v0.4-pruned50-quant/raw/main/recipe.yaml # download recipe
-sparseml.transformers.text_generation.oneshot \
-  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
-  --dataset open_platypus --recipe recipe.yaml \
-  --output_dir ./obcq_deployment \
-  --precision float16
+sparseml.transformers.text_generation.oneshot --model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dataset_name open_platypus --recipe recipe.yaml --output_dir ./obcq_deployment --precision float16
```
## How to Evaluate the One-shot Model
Next, evaluate the model's performance using the [lm-evaluation-harness framework](https://github.com/neuralmagic/lm-evaluation-harness).

@@ -60,16 +56,16 @@ Evaluate on the `hellaswag` task:
start=`date +%s`
python main.py \
  --model hf-causal-experimental \
-  --model_args pretrained=../obcq_deployment,trust_remote_code=True \
+  --model_args pretrained=obcq_deployment,trust_remote_code=True \
  --tasks hellaswag \
  --batch_size 64 \
  --no_cache \
  --write_out \
-  --output_path "../obcq_deployment/hellaswag.json" \
+  --output_path "obcq_deployment/hellaswag.json" \
  --device "cuda:0" \
  --num_fewshot 0
-end=`date +%s`
-echo Execution time was `expr $end - $start` seconds.
+end=`date +%s`
+echo Execution time was `expr $end - $start` seconds.
```
The results obtained in this case are:
```

@@ -266,12 +262,7 @@ Save the recipe to a file named `recipe.yaml`.

Run one-shot quantization on any Mistral-based model, for example, `zephyr-7b-beta`:
```bash
-sparseml.transformers.text_generation.oneshot \
-  --model HuggingFaceH4/zephyr-7b-beta \
-  --dataset open_platypus \
-  --recipe recipe.yaml \
-  --output_dir ./output_oneshot \
-  --precision float16
+sparseml.transformers.text_generation.oneshot --model_name HuggingFaceH4/zephyr-7b-beta --dataset_name open_platypus --recipe recipe.yaml --output_dir ./output_oneshot --precision float16
```
We set `precision` to `float16` because quantization is not supported for the `bfloat16` data type as of this writing.