Commit 1e64267

Update 1.7 docs to reflect latest changes from nightly
1 parent 2efa8c6 commit 1e64267

17 files changed (+634, -31 lines)


versioned_docs/version-1.7.0/get-started/index.mdx

+1 -1

@@ -58,6 +58,6 @@ Transfer a pre-sparsified, foundational LLM to your use case without heavy retra

🔗 <b>Let's Get Started!</b> Choose a task above to begin your Neural Magic journey.

-📚 <b>Need Help?</b> Join our [Slack community](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-2gbar46r6-2Tu~SS5iQdHgczAKlQ2jJA) for support.
+📚 <b>Need Help?</b> Join our [Slack community](https://neuralmagic.com/community/) for support.

🌟 <b>Shape the Future:</b> Contribute to our GitHub [GitHub repositories](https://github.com/neuralmagic).

versioned_docs/version-1.7.0/get-started/install/deepsparse.mdx

+11 -1

@@ -16,7 +16,7 @@ sidebar_position: 1
# Installing DeepSparse

DeepSparse is Neural Magic's inference engine, empowering you to run deep learning models on CPUs with exceptional performance and efficiency.
-This guide covers various installation methods, including PyPI, Docker, and installation from the GitHub source code for advanced use cases.
+This guide covers various installation methods, including PyPI and installation from the GitHub source code for advanced use cases.

## Prerequisites

@@ -77,6 +77,16 @@ Or from a locally cloned repository:
```
</VersionInjector>

+### Product Usage Analytics
+DeepSparse Community Edition gathers basic usage telemetry including, but not limited to, Invocations, Package, Version, and IP Address for Product Usage Analytics purposes. Review Neural Magic's [Products Privacy Policy](https://neuralmagic.com/legal/) for further details on how we process this data.
+
+To disable Product Usage Analytics, run the command:
+```bash
+export NM_DISABLE_ANALYTICS=True
+```
+
+Confirm that telemetry is shut off through info logs streamed with engine invocation by looking for the phrase "Skipping Neural Magic's latest package version check." For additional assistance, reach out through the [DeepSparse GitHub Issue queue](https://github.com/neuralmagic/deepsparse/issues).
+
## Enterprise Installation

### PyPI

versioned_docs/version-1.7.0/guides/deepsparse-engine/index.mdx

+3 -3

@@ -12,12 +12,12 @@ keywords:
- scheduler
description: DeepSparse Feature Overview
sidebar_label: DeepSparse Features
-sidebar_position: 1
+sidebar_position: 2
---

# DeepSparse Features

Learn more about DeepSparse through the following overviews:

-<DocCardList>
-</DocCardList>
+<DocCardList />
+

versioned_docs/version-1.7.0/guides/deploying-deepsparse/deepsparse-server.mdx

+1 -1

@@ -149,4 +149,4 @@ All you need is to add `/docs` at the end of your host URL:

localhost:5543/docs

-<img src="https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/server/img/swagger_ui.png" alt="Swagger UI For Viewing Model Pipeline" width="1200" height="524" />
+<img src="https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/server/img/endpoints.png" alt="Swagger UI For Viewing Model Pipeline" width="1200" height="524" />

versioned_docs/version-1.7.0/guides/deploying-deepsparse/google-cloud-run.mdx

+2 -2

@@ -1,5 +1,5 @@
---
-description: Deploy DeepSparse in a Serverless framework with Google Cloud Run.
+description: Deploy DeepSparse in a serverless framework with Google Cloud Run.
sidebar_label: Google Cloud Run
sidebar_position: 4
---

@@ -33,7 +33,7 @@ cd deepsparse/examples/google-cloud-run
```

## Model Configuration
-The current server configuration is running `token classification`. To alter the model, task, or other parameters (e.g., number of cores, workers, routes, or batch size), edit the `config.yaml` file.
+The current server configuration is running `token classification`. To alter the model, task or other parameters (e.g., number of cores, workers, routes, or batch size), edit the `config.yaml` file.

## Create Endpoint
Run the following command to build the Cloud Run endpoint.

versioned_docs/version-1.7.0/guides/deploying-deepsparse/index.mdx

+3 -3

@@ -10,13 +10,13 @@ keywords:
- GCP
description: DeepSparse Deployment Options
sidebar_label: Deployment Options
-sidebar_position: 1
+sidebar_position: 3
---

# DeepSparse Deployment Options

Select the deployment option that's best for your project. Benefit from a faster and cost-effective solution.

-<DocCardList>
-</DocCardList>
+<DocCardList />
+

versioned_docs/version-1.7.0/guides/index.mdx

+1

@@ -14,3 +14,4 @@ Explore these foundational guides that will provide essential information and in
<DocCardList>
</DocCardList>

+

New file:

@@ -0,0 +1,57 @@
---
tags:
- onnx
- model export
- cpu inference
keywords:
- onnx model format
- neural network interchange
- cross-platform compatibility
- deepsparse
description: Overview of ONNX, its role in DeepSparse for CPU inference, and guidance on exporting models to the ONNX format.
sidebar_label: ONNX
sidebar_position: 4
---

# ONNX: Model Definitions for DeepSparse

ONNX (Open Neural Network Exchange) is an open-source format for representing machine learning models, including deep neural networks.
It provides a standardized way to exchange models between different frameworks and tools, promoting cross-platform compatibility.

## Why ONNX Matters for DeepSparse

DeepSparse leverages ONNX for optimized CPU inference pathways (a minimal sketch of loading an ONNX file with DeepSparse follows the list below). Here's why it's important:

- <b>Framework Flexibility:</b> Exporting models to ONNX allows you to deploy sparsified and optimized neural networks created in various training frameworks (e.g., PyTorch, TensorFlow) to DeepSparse's CPU inference engine.
- <b>Hardware Portability:</b> ONNX, combined with DeepSparse, can run on various CPU architectures (x86, ARM, etc.), ensuring your optimized models work across diverse hardware environments.
- <b>Performance Optimization:</b> DeepSparse's ONNX runtime is specifically tuned to deliver efficient inference performance on CPUs, taking advantage of platform-specific optimizations.

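To make the pathway concrete, here is a minimal sketch of compiling and running an exported ONNX file with DeepSparse's `compile_model` API; the file path and input shape below are placeholder assumptions for illustration:

```python
import numpy as np
from deepsparse import compile_model

# Hypothetical path to an ONNX file produced by an earlier export step
onnx_path = "./onnx-export/model.onnx"
engine = compile_model(onnx_path, batch_size=1)

# DeepSparse expects a list of NumPy arrays, one per model input
inputs = [np.random.rand(1, 3, 224, 224).astype(np.float32)]
outputs = engine.run(inputs)
print([out.shape for out in outputs])
```
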
## Exporting Models to ONNX

Converting your trained model to the ONNX format depends on the original training framework.
SparseML is the default pathway for exporting models to ONNX within the Neural Magic ecosystem.
It supports exporting standard PyTorch models to ONNX, including unoptimized and sparsified models.

Here's a basic example of exporting a PyTorch GPT-2 model from Hugging Face to ONNX using SparseML:

```python
from sparseml import export

# Load your PyTorch model
from sparseml.transformers import SparseAutoModelForCausalLM, SparseAutoTokenizer
model = SparseAutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = SparseAutoTokenizer.from_pretrained("gpt2")

# Export the model to ONNX
export(
    model=model,
    tokenizer=tokenizer,
    target_path="./onnx-export",
)
```

For unsupported frameworks, or if the above doesn't work for your custom models, you can export models to ONNX using the native APIs of the training framework or supported third-party pathways (a minimal native PyTorch sketch follows the list below):
- [PyTorch](https://pytorch.org/docs/stable/onnx.html)
- [TensorFlow](https://github.com/onnx/tensorflow-onnx)
- [Keras](https://github.com/onnx/keras-onnx)
- [JAX](https://github.com/google/jaxonnxruntime)
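
As a reference point, here is a minimal sketch of exporting a small, self-contained PyTorch module with the native `torch.onnx` API; the toy model, shapes, and file name are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Toy model used purely to illustrate the export call
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

dummy_input = torch.randn(1, 128)  # example input that traces the graph
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},  # variable batch size
    opset_version=14,
)
```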

New file:

@@ -0,0 +1,134 @@
---
tags:
- sparsification
- model optimization
- model compression
keywords:
- model compression
- model acceleration
- neural network optimization
- efficiency
- pruning
- quantization
- distillation
description: A comprehensive overview of sparsification techniques used to create smaller, faster, and more energy-efficient neural networks while maintaining accuracy.
sidebar_label: Sparsification
sidebar_position: 1
---

# Sparsification: Compressing Neural Networks

Sparsification encompasses a range of powerful techniques used to compress and optimize neural networks.
By strategically removing or reducing the significance of less important connections and information within a model, sparsification preserves accuracy while resulting in:
- <b>Smaller Model Sizes:</b> Reduced storage requirements and memory footprint, simplifying deployment.
- <b>Faster Inference:</b> Significant boosts in computational speed, especially on resource-constrained hardware, promoting real-time applications.
- <b>Reduced Energy Consumption:</b> Efficient execution for servers, edge environments, and mobile devices, lowering costs and broadening usage.

This guide delves into the core concepts of sparsification. In it, you'll learn:
- <b>The Purpose of Sparsification:</b> Discover the benefits and motivations behind optimizing neural networks.
- <b>Essential Techniques:</b> Explore the key methods used to achieve sparsification.
- <b>Application Strategies:</b> Understand how to implement sparsification at different stages of the model's lifecycle.
- <b>Practical Recipes:</b> Get guidance on applying sparsification techniques to everyday use cases.

## Techniques

Sparsification techniques can be broadly categorized into several key methods, each with its unique approach to compressing and optimizing neural networks:

### Quantization

Quantization reduces the precision of weights and activations in a neural network, for example, from 32-bit floating-point numbers to 8-bit integers.
Quantization can be applied to weights, activations, or both and can be done statically (before deployment) or dynamically (at runtime).
It decreases model size and memory usage, often leading to faster inference, particularly with specialized hardware support for low-precision arithmetic.

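For illustration, here is a minimal sketch of post-training dynamic quantization using plain PyTorch; the toy model and layer types are assumptions chosen only to show the mechanics:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Convert Linear weights to int8; activations are quantized dynamically at runtime
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface, smaller weights
```
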
### Pruning

Pruning eliminates redundant or less important connections within a model.
Pruning can be done in either a structured or unstructured manner, where structured pruning changes the model's shape, and unstructured pruning keeps the shape intact while introducing zeros in the weights (sparsity).
This results in a smaller model and faster inference due to reduced compute, provided the engine/hardware supports sparse computation.

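As a minimal sketch, unstructured magnitude pruning can be applied to a single layer with PyTorch's pruning utilities; the layer size and 80% sparsity target are illustrative assumptions:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 512)

# Zero out the 80% of weights with the smallest absolute values
prune.l1_unstructured(layer, name="weight", amount=0.8)
prune.remove(layer, "weight")  # bake the zeros into the weight tensor

sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.0%}")
```
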
### Knowledge Distillation

Distillation generally trains a smaller or more compressed "student" model to mimic the behavior of a larger, unoptimized "teacher" model.
It enables the creation of more compressed models that are easier to deploy and execute while leveraging the knowledge and performance of the larger model to maintain accuracy.
Distillation is further broken down into granularity levels, such as model-level, layer-level, and instance-level distillation.

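A minimal sketch of a distillation objective on softened logits is shown below; the temperature, loss weighting, and random tensors are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's tempered output distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits, teacher_logits = torch.randn(4, 10), torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```
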
### Low-Rank Approximation

Low-rank approximation, also known as matrix factorization, matrix decomposition, or tensor decomposition, reduces the rank of the weight matrices in a neural network, effectively compressing the model.
This technique is based on the observation that the weight matrices of neural networks are often low-rank, meaning they can be approximated by a product of two smaller matrices.
It can be particularly effective for compressing a model's large, fully connected layers.
Related low-rank methods such as LoRA are also used in conjunction with other compression techniques, such as quantization (QLoRA), to enable faster fine-tuning.

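As a minimal sketch, a fully connected layer can be replaced by a rank-limited factorization computed with a truncated SVD; the layer size and rank are illustrative assumptions:

```python
import torch
import torch.nn as nn

layer = nn.Linear(1024, 1024, bias=False)
rank = 64

U, S, Vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
# W ≈ (U_r * S_r) @ Vh_r, implemented as two smaller Linear layers
low_rank = nn.Sequential(nn.Linear(1024, rank, bias=False), nn.Linear(rank, 1024, bias=False))
low_rank[0].weight.data = Vh[:rank, :]            # shape (rank, 1024)
low_rank[1].weight.data = U[:, :rank] * S[:rank]  # shape (1024, rank)

x = torch.randn(2, 1024)
print((layer(x) - low_rank(x)).abs().mean())  # approximation error (large for random weights)
```
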
### Conditional Computation

Conditional computation selectively activates only parts of a model based on the input data, leading to dynamic sparsity.
This can be achieved through techniques such as gating, where a gating network decides which parts of the model to execute, or through adaptive computation, where the model learns to skip or reduce computation based on the input, such as Mixture of Experts (MoE) techniques.
Conditional computation can significantly speed up inference time, especially for models with large, redundant, or unnecessary computations.

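A minimal sketch of input-dependent routing with a tiny top-1 gated mixture of experts follows; the dimensions and expert count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, dim=256, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))

    def forward(self, x):  # x: (batch, dim)
        choice = self.gate(x).argmax(dim=-1)  # pick one expert per example
        out = torch.empty_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = expert(x[mask])  # only the chosen expert runs for these rows
        return out

print(Top1MoE()(torch.randn(8, 256)).shape)
```
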
### Regularization

Regularization methods such as L1 and L2 can be used to encourage sparsity in a neural network's weights.
Adding a regularization term to the loss function incentivizes the model to reduce overfitting and learn simpler representations, which can lead to sparser models.
Regularization can be used with other techniques, such as pruning, to further enhance the sparsity of a model.

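A minimal sketch of adding an L1 penalty to a training loss is shown below; the model, random batch, and penalty strength are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 10)
criterion = nn.CrossEntropyLoss()
l1_strength = 1e-4

x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))
task_loss = criterion(model(x), y)
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = task_loss + l1_strength * l1_penalty  # pushes many weights toward zero
loss.backward()
```
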
### Weight Sharing

Weight sharing involves sharing the weights of a neural network across different parts of the model, effectively reducing the number of unique weights and thereby reducing the model size.
This can be done by clustering similar weights and sharing the same weight value across multiple connections.
Weight sharing can be particularly effective for reducing a model's memory footprint, especially when combined with other compression techniques.

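As a minimal sketch, a layer's weights can be clustered into a small shared codebook; the cluster count and the use of scikit-learn's KMeans are assumptions for illustration:

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

layer = nn.Linear(256, 256)
weights = layer.weight.data.reshape(-1, 1).numpy()

# Cluster all weights into 16 shared values and map each weight to its centroid
kmeans = KMeans(n_clusters=16, n_init=10).fit(weights)
shared = kmeans.cluster_centers_[kmeans.labels_].reshape(layer.weight.shape)
layer.weight.data = torch.tensor(shared, dtype=layer.weight.dtype)

print(torch.unique(layer.weight).numel(), "unique weight values remain")
```
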
### Architecture Search

Techniques such as neural architecture search (NAS) can automatically discover more efficient and compact neural network architectures.
By searching over a large space of possible architectures, NAS can identify smaller, faster, and more accurate models than hand-designed architectures.
NAS can be used to optimize existing models or discover entirely new architectures tailored to specific tasks or constraints.

### Compound Sparsification

Compound sparsification combines multiple techniques to achieve even more significant compression and optimization.
By leveraging the strengths of different methods, compound sparsification can create smaller, faster, and more energy-efficient models than those produced by individual techniques.
For example, pruning can be combined with quantization and distillation to create highly compressed models that retain high accuracy.

## Application

Sparsification techniques can be applied at different stages of a model's lifecycle with varying degrees of complexity and effectiveness:

### Post-Training / One-Shot

Sparsification can be applied post-training, where a pre-trained model is compressed using pruning, quantization, or distillation techniques.
Post-training is often the most straightforward approach to sparsification, as it does not require changes to the training process or hyperparameters.
However, post-training sparsification may not achieve the same level of compression or performance as techniques applied during training.
It is particularly practical for quantization but less effective for pruning.

### Training Aware

Sparsification can also be applied during training, where the model is trained with sparsification techniques such as pruning, quantization, and distillation.
This approach can lead to more effective compression and optimization as the model adapts to the sparsity constraints during training.
Training-aware sparsification can be more complex and computationally intensive than post-training sparsification, but it can often achieve better results in terms of model size, speed, and accuracy.

### Transfer Learning

Sparsification can be combined with transfer learning, where a sparsified, pre-trained model is fine-tuned on a new task or dataset.
This approach can leverage the knowledge and compression of the pre-trained model without the complexity of sparsification hyperparameters or training from scratch.
Transfer learning with sparsification can be particularly effective for quickly adapting compressed models to new tasks or domains with fewer resources and complexity while closely matching the performance of training-aware techniques.

## Recipes

Sparsification recipes provide a structured and reusable way to define the steps and parameters for optimizing neural networks.
They encapsulate the specific sparsification techniques, hyperparameters, and necessary training adjustments into a single configuration file.
Sparsification recipes can be shared, reused, and adapted across different models, tasks, and domains, making experimenting with and deploying compressed models easier.

Recipes are core to the sparsification process through SparseML, a comprehensive framework for sparsification and model optimization.
Additionally, models available in SparseZoo or our Hugging Face model hub generally include the recipes used to train them, making it easy to reproduce and adapt the training process.

Throughout the sparsification guides, you'll find example recipes for different techniques and applications, providing a hands-on approach to implementing and experimenting with sparsification.

A general workflow for sparsification using SparseML is as follows:
1. Define a sparsification recipe for the desired technique and application.
2. Integrate SparseML into your experimentation pipelines or utilize the pre-built pipelines in SparseML.
3. Apply the sparsification recipe to your model through one-shot, training-aware, or transfer learning methods.
4. Evaluate the compressed model on your desired metrics and tasks.

---

Dive into the guides in this section to learn more about the core sparsification techniques, applications, and recipes for compressing and optimizing neural networks.

versioned_docs/version-1.7.0/llms/guides/hf-llm-to-deepsparse.mdx

+2 -2

@@ -12,7 +12,7 @@ This guide is for people interested in exporting their Hugging Face-compatible L

SparseML provides tools for optimizing machine learning models for deployment. To install it along with the necessary support for Hugging Face Transformers, open your terminal and run:
```bash
-pip install sparseml[transformers]==1.7
+pip install sparseml[transformers]
```

> #### Note on system requirements

@@ -52,7 +52,7 @@ After exporting your model, you can run inference using DeepSparse.
1. **Install DeepSparse LLM**:
Install the DeepSparse library, which is specifically designed for running inference on large language models (LLMs) efficiently.
```bash
-pip install deepsparse[llm]==1.7
+pip install deepsparse[llm]
```

2. **Load Your Model and Run Inference**:

versioned_docs/version-1.7.0/llms/guides/one-shot-llms-with-sparseml.mdx

+6 -15

@@ -40,11 +40,7 @@ options:
Example command:
```bash
wget https://huggingface.co/nm-testing/TinyLlama-1.1B-Chat-v0.4-pruned50-quant/raw/main/recipe.yaml # download recipe
-sparseml.transformers.text_generation.oneshot \
-  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
-  --dataset open_platypus --recipe recipe.yaml \
-  --output_dir ./obcq_deployment \
-  --precision float16
+sparseml.transformers.text_generation.oneshot --model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dataset_name open_platypus --recipe recipe.yaml --output_dir ./obcq_deployment --precision float16
```
## How to Evaluate the One-shot Model
Next, evaluate the model's performance using the [lm-evaluation-harness framework](https://github.com/neuralmagic/lm-evaluation-harness).

@@ -60,16 +56,16 @@ Evaluate on the `hellaswag` task:
start=`date +%s`
python main.py \
  --model hf-causal-experimental \
-  --model_args pretrained=../obcq_deployment,trust_remote_code=True \
+  --model_args pretrained=obcq_deployment,trust_remote_code=True \
  --tasks hellaswag \
  --batch_size 64 \
  --no_cache \
  --write_out \
-  --output_path "../obcq_deployment/hellaswag.json" \
+  --output_path "obcq_deployment/hellaswag.json" \
  --device "cuda:0" \
  --num_fewshot 0
-end=`date +%s`
-echo Execution time was `expr $end - $start` seconds.
+end=`date +%s`
+echo Execution time was `expr $end - $start` seconds.
```
The results obtained in this case are:
```

@@ -266,12 +262,7 @@ Save the recipe to a file named `recipe.yaml`.

Run one-shot quantization on any Mistral-based model, for example, `zephyr-7b-beta`:
```bash
-sparseml.transformers.text_generation.oneshot \
-  --model HuggingFaceH4/zephyr-7b-beta \
-  --dataset open_platypus \
-  --recipe recipe.yaml \
-  --output_dir ./output_oneshot \
-  --precision float16
+sparseml.transformers.text_generation.oneshot --model_name HuggingFaceH4/zephyr-7b-beta --dataset_name open_platypus --recipe recipe.yaml --output_dir ./output_oneshot --precision float16
```
We set `precision` to `float16` because quantization is not supported for the `bfloat16` data type as of this writing.