minor documentation changes post-2.21 #1082

Open · wants to merge 1 commit into base: master
2 changes: 1 addition & 1 deletion containers/docker-example/inference/Dockerfile-inference
@@ -32,7 +32,7 @@ ENV PATH="/opt/bin/:/opt/aws/neuron/bin:${PATH}"
# Include framework tensorflow-neuron or torch-neuronx and compiler (compiler not needed for inference)
RUN pip3 install \
    torch-neuronx \
-    --extra-index-url=https://pip.repos.neuron.amazonaws.com
+    --index-url=https://pip.repos.neuron.amazonaws.com

# Include your APP dependencies here.
# RUN ...
@@ -105,8 +105,8 @@ RUN pip install --no-cache-dir -U \
"awscli<2" \
boto3

-RUN pip install neuron-cc[tensorflow] --extra-index-url https://pip.repos.neuron.amazonaws.com \
-    && pip install "torch-neuron>=1.10.2,<1.10.3" --extra-index-url https://pip.repos.neuron.amazonaws.com \
+RUN pip install neuron-cc[tensorflow] --index-url https://pip.repos.neuron.amazonaws.com \
+    && pip install "torch-neuron>=1.10.2,<1.10.3" --index-url https://pip.repos.neuron.amazonaws.com \
&& pip install torchserve==$TS_VERSION \
&& pip install --no-deps --no-cache-dir -U torchvision==0.11.3 \
# Install TF 1.15.5 to override neuron-cc[tensorflow]'s installation of tensorflow==1.15.0
2 changes: 1 addition & 1 deletion containers/docker-example/inference/Dockerfile-libmode
@@ -32,7 +32,7 @@ ENV PATH="/opt/bin/:/opt/aws/neuron/bin:${PATH}"
# Include framework tensorflow-neuron or torch-neuron and compiler (compiler not needed for inference)
RUN pip3 install \
    torch-neuron \
-    --extra-index-url=https://pip.repos.neuron.amazonaws.com
+    --index-url=https://pip.repos.neuron.amazonaws.com

# Include your APP dependencies here.
# RUN ...
@@ -15,7 +15,7 @@ RUN cd /tmp \

RUN update-alternatives --install /usr/bin/python python /usr/bin/python3 1
RUN update-alternatives --install /usr/local/bin/pip pip /usr/local/bin/pip3 1
-RUN pip install mxnet-neuron --extra-index-url=https://pip.repos.neuron.amazonaws.com
+RUN pip install mxnet-neuron --index-url=https://pip.repos.neuron.amazonaws.com
RUN pip install multi-model-server


@@ -26,7 +26,7 @@ RUN wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEU
# Include framework tensorflow-neuron or torch-neuron and compiler (compiler not needed for inference)
RUN pip3 install \
    torch-neuron \
-    --extra-index-url=https://pip.repos.neuron.amazonaws.com
+    --index-url=https://pip.repos.neuron.amazonaws.com

# Include your APP dependencies here.
# RUN/ENTRYPOINT/CMD ...
@@ -36,7 +36,7 @@ ENV PATH="/opt/bin/:/opt/aws/neuron/bin:${PATH}"
# Include framework tensorflow-neuron or torch-neuron and compiler (compiler not needed for inference)
RUN pip3 install \
    torch-neuron \
-    --extra-index-url=https://pip.repos.neuron.amazonaws.com
+    --index-url=https://pip.repos.neuron.amazonaws.com

# Include your APP dependencies here.
# RUN ...
5 changes: 3 additions & 2 deletions frameworks/jax/setup/jax-setup.rst
@@ -54,7 +54,7 @@ pip repository.

.. code:: bash

-   python3 -m pip install jax-neuronx[stable] --extra-index-url=https://pip.repos.neuron.amazonaws.com
+   python3 -m pip install jax-neuronx[stable] --index-url=https://pip.repos.neuron.amazonaws.com

The second is to install packages ``jax``, ``jaxlib``, ``libneuronxla``,
and ``neuronx-cc`` separately, with ``jax-neuronx`` being an optional addition.
@@ -65,7 +65,8 @@ pip repository.

.. code:: bash

-   python3 -m pip install jax==0.4.31 jaxlib==0.4.31 jax-neuronx libneuronxla neuronx-cc==2.* --extra-index-url=https://pip.repos.neuron.amazonaws.com
+   python3 -m pip install jax==0.4.31 jaxlib==0.4.31
+   python3 -m pip install jax-neuronx libneuronxla neuronx-cc==2.* --index-url=https://pip.repos.neuron.amazonaws.com

We can now run some simple JAX programs on the Trainium or Inferentia
accelerators.
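
For instance, a minimal sanity check (an illustrative sketch, not part of this change; it assumes the packages above installed cleanly and a Neuron device is attached):

.. code:: python

   import jax
   import jax.numpy as jnp

   # With jax-neuronx installed, NeuronCores are exposed as JAX devices.
   print(jax.devices())

   @jax.jit
   def matmul_sum(x):
       # Compiled by neuronx-cc and executed on a NeuronCore.
       return jnp.dot(x, x.T).sum()

   x = jnp.ones((1024, 1024), dtype=jnp.bfloat16)
   print(matmul_sum(x))
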
2 changes: 1 addition & 1 deletion frameworks/mxnet-neuron/tutorials/bert_mxnet/index.rst
@@ -108,7 +108,7 @@ Modify Pip repository configurations to point to the Neuron repository:

tee $VIRTUAL_ENV/pip.conf > /dev/null <<EOF
[global]
-extra-index-url = https://pip.repos.neuron.amazonaws.com
+index-url = https://pip.repos.neuron.amazonaws.com
EOF

Install neuron packages:
@@ -54,8 +54,8 @@ following :
git clone https://github.com/aws/aws-neuron-sdk
cd ~/aws-neuron-sdk/src/examples/tensorflow/bert_demo/
export BERT_LARGE_SAVED_MODEL="/path/to/user/bert-large/savedmodel"
-pip install tensorflow_neuron==1.15.5.2.8.9.0 --extra-index-url=https://pip.repos.neuron.amazonaws.com/
-pip install neuron_cc==1.13.5.0 --extra-index-url=https://pip.repos.neuron.amazonaws.com
+pip install tensorflow_neuron==1.15.5.2.8.9.0 --index-url=https://pip.repos.neuron.amazonaws.com/
+pip install neuron_cc==1.13.5.0 --index-url=https://pip.repos.neuron.amazonaws.com
python bert_model.py --input_saved_model $BERT_LARGE_SAVED_MODEL --output_saved_model ./bert-saved-model-neuron --batch_size=6 --aggressive_optimizations

This compiles BERT-Large pointed to by $BERT_LARGE_SAVED_MODEL for an
@@ -89,7 +89,7 @@ BERT-Large demo server :
.. code:: bash

cd ~/aws-neuron-sdk/src/examples/tensorflow/bert_demo/
-pip install tensorflow_neuron==1.15.5.2.8.9.0 --extra-index-url=https://pip.repos.neuron.amazonaws.com/
+pip install tensorflow_neuron==1.15.5.2.8.9.0 --index-url=https://pip.repos.neuron.amazonaws.com/
python bert_server.py --dir bert-saved-model-neuron --batch 6 --parallel 4

This loads 4 BERT-Large models, one into each of the 4 NeuronCores found
@@ -25,7 +25,7 @@ def main():
if tfn_version < LooseVersion('1.15.0.1.0.1333.0'):
raise RuntimeError(
'tensorflow-neuron version {} is too low for this demo. Please upgrade '
'by "pip install -U tensorflow-neuron --extra-index-url=https://pip.repos.neuron.amazonaws.com"'.format(tfn_version))
'by "pip install -U tensorflow-neuron --index-url=https://pip.repos.neuron.amazonaws.com"'.format(tfn_version))

with open(args.image, 'rb') as f:
img_jpg_bytes = f.read()
@@ -39,7 +39,7 @@ def main():
if tfn_version < LooseVersion('1.15.0.1.0.1333.0'):
raise RuntimeError(
'tensorflow-neuron version {} is too low for this demo. Please upgrade '
'by "pip install -U tensorflow-neuron --extra-index-url=https://pip.repos.neuron.amazonaws.com"'.format(tfn_version))
'by "pip install -U tensorflow-neuron --index-url=https://pip.repos.neuron.amazonaws.com"'.format(tfn_version))
predictor_list = [tf.contrib.predictor.from_saved_model(args.saved_model) for _ in range(args.num_sessions)]

val_dataset = get_val_dataset(args.instances_val2017_json, args.val2017)
@@ -296,12 +296,12 @@ def main():
if neuroncc_version < LooseVersion('1.0.18000'):
raise RuntimeError(
'neuron-cc version {} is too low for this demo. Please upgrade '
'by "pip install -U neuron-cc --extra-index-url=https://pip.repos.neuron.amazonaws.com"'.format(neuroncc_version))
'by "pip install -U neuron-cc --index-url=https://pip.repos.neuron.amazonaws.com"'.format(neuroncc_version))
tfn_version = LooseVersion(pkg_resources.get_distribution('tensorflow-neuron').version)
if tfn_version < LooseVersion('1.15.3.1.0.1900.0'):
raise RuntimeError(
'tensorflow-neuron version {} is too low for this demo. Please upgrade '
'by "pip install -U tensorflow-neuron --extra-index-url=https://pip.repos.neuron.amazonaws.com"'.format(tfn_version))
'by "pip install -U tensorflow-neuron --index-url=https://pip.repos.neuron.amazonaws.com"'.format(tfn_version))

sys.path.append(os.getcwd())
from DeepLearningExamples.PyTorch.Detection.SSD.src import model as torch_ssd300_model
@@ -336,7 +336,7 @@ def main():
if not op.get_attr('executable'):
raise AttributeError(
'Neuron executable (neff) is empty. Please check neuron-cc is installed and working properly '
'("pip install neuron-cc --force --extra-index-url=https://pip.repos.neuron.amazonaws.com" '
'("pip install neuron-cc --force --index-url=https://pip.repos.neuron.amazonaws.com" '
'to force reinstall neuron-cc).')
model_config = op.node_def.attr['model_config'].list
if model_config.i:
10 changes: 5 additions & 5 deletions frameworks/torch/torch-neuron/setup/pytorch-install-cxx11.rst
@@ -51,17 +51,17 @@ index.

::

-   pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com/cxx11 torch-neuron --no-deps
+   pip install --index-url=https://pip.repos.neuron.amazonaws.com/cxx11 torch-neuron --no-deps


Specific versions of ``torch-neuron`` with cxx11 ABI support can be installed
just like standard versions of ``torch-neuron``.

::

-   pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com/cxx11 "torch-neuron>=1.8" --no-deps
-   pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com/cxx11 "torch-neuron==1.9.1" --no-deps
-   pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com/cxx11 "torch-neuron<1.10" --no-deps
+   pip install --index-url=https://pip.repos.neuron.amazonaws.com/cxx11 "torch-neuron>=1.8" --no-deps
+   pip install --index-url=https://pip.repos.neuron.amazonaws.com/cxx11 "torch-neuron==1.9.1" --no-deps
+   pip install --index-url=https://pip.repos.neuron.amazonaws.com/cxx11 "torch-neuron<1.10" --no-deps

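One quick way to check which C++ ABI the ``torch`` build in your environment uses (an illustrative check, not from the official instructions; the assumption is that the ``/cxx11`` index is only needed when the surrounding toolchain uses the C++11 ABI):

.. code:: python

   import torch

   # True when libtorch was built with _GLIBCXX_USE_CXX11_ABI=1.
   print(torch.compiled_with_cxx11_abi())
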
.. important::

@@ -117,7 +117,7 @@ is to download the wheel and unpack the contents:

.. code:: bash

-   pip download --extra-index-url=https://pip.repos.neuron.amazonaws.com/cxx11 torch-neuron --no-deps
+   pip download --index-url=https://pip.repos.neuron.amazonaws.com/cxx11 torch-neuron --no-deps
wheel unpack torch_neuron-*.whl

If the exact version of the ``torch-neuron`` package is known and no
2 changes: 1 addition & 1 deletion frameworks/torch/torch-neuron/troubleshooting-guide.rst
@@ -33,7 +33,7 @@ If you encounter an error like below, it is because the model size is larger tha
To compile such large models, use the :ref:`separate_weights=True <torch_neuron_trace_api>` flag. Note:
ensure that you have the latest version of the compiler installed to support this flag.
You can upgrade neuron-cc using
-:code:`python3 -m pip install neuron-cc[tensorflow] -U --force --extra-index-url=https://pip.repos.neuron.amazonaws.com`
+:code:`python3 -m pip install neuron-cc[tensorflow] -U --force --index-url=https://pip.repos.neuron.amazonaws.com`

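A hedged sketch of a trace call using this flag (``MyLargeModel`` and the example input are hypothetical placeholders):

.. code:: python

   import torch
   import torch_neuron

   model = MyLargeModel().eval()         # hypothetical model exceeding the size limit
   example = torch.rand(1, 3, 224, 224)  # hypothetical example input

   # separate_weights=True keeps the weights out of the graph protobuf,
   # allowing models above the protobuf size limit to compile.
   model_neuron = torch.neuron.trace(model, [example], separate_weights=True)
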
::

@@ -218,83 +218,82 @@ compiled and executed if there are extra mark-steps or functions with
implicit mark-steps. Additionally, more graphs can be generated if there
are different execution paths taken due to control-flows.

-Automatic casting of float tensors to BFloat16
-----------------------------------------------

-With PyTorch Neuron, the default behavior is for torch.float (FP32) and torch.double (FP64) tensors
-to be mapped to torch.float in hardware. To reduce memory footprint and improve performance,
-torch.float and torch.double tensors can automatically be converted to BFloat16 by setting
-the environment variable ``XLA_USE_BF16=1``. Alternatively, torch.float can automatically be converted
-to BFloat16 and torch.double converted to FP32 by setting the environment variable ``XLA_DOWNCAST_BF16=1``.

-Automatic Mixed-Precision
--------------------------

-BF16 mixed-precision using PyTorch Autocast
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-By default, the compiler automatically cast internal FP32 operations to
-BF16. You can disable this and allow PyTorch's BF16 mixed-precision to
-do the casting. PyTorch's BF16 mixed-precision is achieved by casting
-certain operations to operate BF16. We currently use CUDA's list of
-operations that can operate in BF16:

-.. code:: bash

-   _convolution
-   _convolution
-   _convolution_nogroup
-   conv1d
-   conv2d
-   conv3d
-   conv_tbc
-   conv_transpose1d
-   conv_transpose2d
-   conv_transpose3d
-   convolution
-   cudnn_convolution
-   cudnn_convolution_transpose
-   cudnn_convolution
-   cudnn_convolution_transpose
-   cudnn_convolution
-   cudnn_convolution_transpose
-   prelu
-   addmm
-   addmv
-   addr
-   matmul
-   mm
-   mv
-   linear
-   addbmm
-   baddbmm
-   bmm
-   chain_matmul
-   linalg_multi_dot
+Full BF16 with stochastic rounding enabled
+------------------------------------------

-To enable PyTorch's BF16 mixed-precision, first turn off the Neuron
-compiler auto-cast:
+Previously, on torch-neuronx 2.1 and earlier, the environment variables ``XLA_USE_BF16`` or ``XLA_DOWNCAST_BF16`` provided full casting to BF16 with stochastic rounding enabled by default. These environment variables are deprecated in torch-neuronx 2.5, although they remain functional with warnings. To replace ``XLA_USE_BF16`` or ``XLA_DOWNCAST_BF16`` with stochastic rounding on Neuron, set ``NEURON_RT_STOCHASTIC_ROUNDING_EN=1`` and use the ``torch.nn.Module.to`` method to cast model floating-point parameters and buffers to data-type BF16, as follows:

.. code:: python

os.environ["NEURON_CC_FLAGS"] = "--auto-cast=none"
os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "1"

# model is created
model.to(torch.bfloat16)

+Stochastic rounding is needed to enable faster convergence for a full BF16 model.

+If the loss is to be kept in FP32, initialize it with ``dtype=torch.float`` as follows:

+.. code:: python

+   running_loss = torch.zeros(1, dtype=torch.float).to(device)

-Next, overwrite torch.cuda.is_bf16_supported to return True:
+Similarly, if the optimizer states are to be kept in FP32, convert the gradients to FP32 before optimizer computations:

.. code:: python

-   torch.cuda.is_bf16_supported = lambda: True
+   grad = p.grad.data.float()

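As an illustrative sketch of where that conversion sits (a hypothetical optimizer loop; only the ``p.grad.data.float()`` call comes from the guidance above):

.. code:: python

   for group in optimizer.param_groups:
       for p in group["params"]:
           if p.grad is None:
               continue
           grad = p.grad.data.float()  # upcast the BF16 gradient to FP32
           # ... perform the optimizer state updates in FP32 using `grad` ...
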
-Next, per recommendation from official PyTorch documentation, place only
-the forward-pass of the training step in the torch.autocast scope:
+For a full example, please see the :ref:`PyTorch Neuron BERT Pretraining Tutorial (Data-Parallel) <hf-bert-pretraining-tutorial>`, which has been updated to use ``torch.nn.Module.to`` instead of ``XLA_DOWNCAST_BF16``.

+BF16 in GPU-compatible mode without stochastic rounding enabled
+---------------------------------------------------------------

+Full BF16 training in GPU-compatible mode enables faster convergence without the need for stochastic rounding, but requires an FP32 copy of the weights/parameters to be saved and used in the optimizer. To enable BF16 in GPU-compatible mode without stochastic rounding, use the ``torch.nn.Module.to`` method to cast the model's floating-point parameters and buffers to BF16, without setting ``NEURON_RT_STOCHASTIC_ROUNDING_EN=1``:

.. code:: python

-   with torch.autocast(dtype=torch.bfloat16, device_type='cuda'):
+   # model is created
+   model.to(torch.bfloat16)

+In the initializer of the optimizer, for example AdamW, you can add code like the following snippet to keep an FP32 copy of the weights:

+.. code:: python

+   # keep a copy of the weights in high precision (FP32)
+   self.param_groups_highprec = []
+   for group in self.param_groups:
+       params = group['params']
+       param_groups_highprec = [p.data.float() for p in params]
+       self.param_groups_highprec.append({'params': param_groups_highprec})

+In the :ref:`PyTorch Neuron BERT Pretraining Tutorial (Data-Parallel) <hf-bert-pretraining-tutorial>`, this mode can be enabled by passing the ``--optimizer=AdamW_FP32ParamsCopy`` option to ``dp_bert_large_hf_pretrain_hdf5.py`` and setting ``NEURON_RT_STOCHASTIC_ROUNDING_EN=0`` (or leaving it unset).

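A hedged sketch of how such an optimizer might consume the FP32 copy in its ``step()`` (names mirror the snippet above; the actual AdamW math is elided):

.. code:: python

   for group, group_hp in zip(self.param_groups, self.param_groups_highprec):
       for p, p_hp in zip(group['params'], group_hp['params']):
           if p.grad is None:
               continue
           grad = p.grad.data.float()  # BF16 gradient upcast to FP32
           # ... apply the AdamW update to the FP32 master copy p_hp ...
           p.data.copy_(p_hp)          # write back, downcasting to BF16
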
+.. _automatic_mixed_precision_autocast:

+BF16 automatic mixed precision using PyTorch Autocast
+-----------------------------------------------------

+By default, the compiler automatically casts internal FP32 operations to
+BF16. You can disable this and instead allow PyTorch's BF16 automatic mixed
+precision function (``torch.autocast``) to cast selected operations to BF16.

+To enable PyTorch's BF16 mixed-precision, first turn off the Neuron
+compiler auto-cast:

+.. code:: python

+   os.environ["NEURON_CC_FLAGS"] = "--auto-cast=none"

+Next, per the recommendation in the official PyTorch `torch.autocast documentation <https://pytorch.org/docs/stable/amp.html#autocasting>`__, place only
+the forward pass of the training step in the ``torch.autocast`` scope with the ``xla`` device type:

+.. code:: python

+   with torch.autocast(dtype=torch.bfloat16, device_type='xla'):
+       # forward pass

-The device type is CUDA because we are using CUDA's list of BF16
-compatible operations as mentioned above.
+The device type is XLA because we are using PyTorch-XLA's autocast backend. The PyTorch-XLA `autocast mode source code <https://github.com/pytorch/xla/blob/master/torch_xla/csrc/autocast_mode.cpp>`_ lists which operations are cast to lower-precision BF16 (the "lower precision fp cast policy" section), which are kept in FP32 (the "fp32 cast policy" section), and which are promoted to the widest input type (the "promote" section).

Example showing the original training code snippet:

@@ -319,7 +318,7 @@ The following shows the training loop modified to use BF16 autocast:
   def train_loop_fn(train_loader):
       for i, data in enumerate(train_loader):
-          torch.cuda.is_bf16_supported = lambda: True
-          with torch.autocast(dtype=torch.bfloat16, device_type='cuda'):
+          with torch.autocast(dtype=torch.bfloat16, device_type='xla'):
               inputs = data[0]
               labels = data[3]
               outputs = model(inputs, labels=labels)
@@ -328,7 +327,7 @@ The following shows the training loop modified to use BF16 autocast:
           optimizer.step()
           xm.mark_step()

-For a full example of BF16 mixed-precision, see :ref:`PyTorch Neuron BERT Pretraining Tutorial <hf-bert-pretraining-tutorial>`.
+For a full example of BF16 mixed-precision, see the :ref:`PyTorch Neuron BERT Pretraining Tutorial (Data-Parallel) <hf-bert-pretraining-tutorial>`.

See official PyTorch documentation for more details about
`torch.autocast <https://pytorch.org/docs/stable/amp.html#autocasting>`__
@@ -370,6 +369,12 @@ intermediate results such as loss values. In such case, the printing of
lazy tensors should be wrapped using ``xm.add_step_closure()`` to avoid
unnecessary compilation-and-executions.

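An illustrative sketch of that pattern (``loss`` is a hypothetical lazy tensor from the training step):

.. code:: python

   import torch_xla.core.xla_model as xm

   def _log(loss):
       # Runs on the host once the step's results have materialized,
       # instead of forcing an extra early graph execution.
       print(f"loss: {loss.item():.4f}")

   xm.add_step_closure(_log, (loss,))
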
+Aggregate the data transfers between host CPUs and devices
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+For best performance, you may try to aggregate the data transfers between host CPUs and devices.
+For example, increasing the value of the ``batches_per_execution`` argument when instantiating ``MpDeviceLoader`` can improve performance for models with frequent host-device traffic, such as ViT, as described in `this blog post <https://towardsdatascience.com/ai-model-optimization-on-aws-inferentia-and-trainium-cfd48e85d5ac>`_. Note that increasing the ``batches_per_execution`` value delays the mark-step across that many batches, which increases graph size and can lead to device out-of-memory (OOM) errors.
Ensure common initial weights across workers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -396,13 +401,6 @@ be loaded using ``serialization.load`` api. More information on this here: `Savi

FAQ
---

-What is the difference between Trainium and Inferentia?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Trainium is an accelerator designed to speed up training, whereas
-Inferentia is an accelerator designed to speed up inference.

Debugging and troubleshooting
-----------------------------

@@ -72,7 +72,7 @@
pip install awscli

# Install packages from repos
-python -m pip config set global.extra-index-url "https://pip.repos.neuron.amazonaws.com"
+python -m pip config set global.index-url "https://pip.repos.neuron.amazonaws.com"

# Install Python packages - Transformers package is needed for BERT
python -m pip install torch-neuronx=="1.11.0.1.*" "neuronx-cc==2.*"
@@ -144,7 +144,7 @@
pip install awscli

# Install packages from repos
-python -m pip config set global.extra-index-url "https://pip.repos.neuron.amazonaws.com"
+python -m pip config set global.index-url "https://pip.repos.neuron.amazonaws.com"

# Install Python packages - Transformers package is needed for BERT
python -m pip install torch-neuronx=="1.11.0.1.*" "neuronx-cc==2.*"