minor documentation changes post-2.21 #1082

Open · wants to merge 1 commit into base: master
2 changes: 1 addition & 1 deletion containers/docker-example/inference/Dockerfile-inference
@@ -32,7 +32,7 @@ ENV PATH="/opt/bin/:/opt/aws/neuron/bin:${PATH}"
# Include framework tensorflow-neuron or torch-neuronx and compiler (compiler not needed for inference)
RUN pip3 install \
    torch-neuronx \
-    --extra-index-url=https://pip.repos.neuron.amazonaws.com
+    --index-url=https://pip.repos.neuron.amazonaws.com

# Include your APP dependencies here.
# RUN ...
@@ -105,8 +105,8 @@ RUN pip install --no-cache-dir -U \
"awscli<2" \
boto3

-RUN pip install neuron-cc[tensorflow] --extra-index-url https://pip.repos.neuron.amazonaws.com \
-    && pip install "torch-neuron>=1.10.2,<1.10.3" --extra-index-url https://pip.repos.neuron.amazonaws.com \
+RUN pip install neuron-cc[tensorflow] --index-url https://pip.repos.neuron.amazonaws.com \
+    && pip install "torch-neuron>=1.10.2,<1.10.3" --index-url https://pip.repos.neuron.amazonaws.com \
&& pip install torchserve==$TS_VERSION \
&& pip install --no-deps --no-cache-dir -U torchvision==0.11.3 \
# Install TF 1.15.5 to override neuron-cc[tensorflow]'s installation of tensorflow==1.15.0
2 changes: 1 addition & 1 deletion containers/docker-example/inference/Dockerfile-libmode
@@ -32,7 +32,7 @@ ENV PATH="/opt/bin/:/opt/aws/neuron/bin:${PATH}"
# Include framework tensorflow-neuron or torch-neuron and compiler (compiler not needed for inference)
RUN pip3 install \
    torch-neuron \
-    --extra-index-url=https://pip.repos.neuron.amazonaws.com
+    --index-url=https://pip.repos.neuron.amazonaws.com

# Include your APP dependencies here.
# RUN ...
@@ -15,7 +15,7 @@ RUN cd /tmp \

RUN update-alternatives --install /usr/bin/python python /usr/bin/python3 1
RUN update-alternatives --install /usr/local/bin/pip pip /usr/local/bin/pip3 1
-RUN pip install mxnet-neuron --extra-index-url=https://pip.repos.neuron.amazonaws.com
+RUN pip install mxnet-neuron --index-url=https://pip.repos.neuron.amazonaws.com
RUN pip install multi-model-server


@@ -26,7 +26,7 @@ RUN wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEU
# Include framework tensorflow-neuron or torch-neuron and compiler (compiler not needed for inference)
RUN pip3 install \
    torch-neuron \
-    --extra-index-url=https://pip.repos.neuron.amazonaws.com
+    --index-url=https://pip.repos.neuron.amazonaws.com

# Include your APP dependencies here.
# RUN/ENTRYPOINT/CMD ...
@@ -36,7 +36,7 @@ ENV PATH="/opt/bin/:/opt/aws/neuron/bin:${PATH}"
# Include framework tensorflow-neuron or torch-neuron and compiler (compiler not needed for inference)
RUN pip3 install \
    torch-neuron \
-    --extra-index-url=https://pip.repos.neuron.amazonaws.com
+    --index-url=https://pip.repos.neuron.amazonaws.com

# Include your APP dependencies here.
# RUN ...
5 changes: 3 additions & 2 deletions frameworks/jax/setup/jax-setup.rst
@@ -54,7 +54,7 @@ pip repository.

.. code:: bash

-   python3 -m pip install jax-neuronx[stable] --extra-index-url=https://pip.repos.neuron.amazonaws.com
+   python3 -m pip install jax-neuronx[stable] --index-url=https://pip.repos.neuron.amazonaws.com

The second is to install packages ``jax``, ``jaxlib``, ``libneuronxla``,
and ``neuronx-cc`` separately, with ``jax-neuronx`` being an optional addition.
@@ -65,7 +65,8 @@ pip repository.

.. code:: bash

-   python3 -m pip install jax==0.4.31 jaxlib==0.4.31 jax-neuronx libneuronxla neuronx-cc==2.* --extra-index-url=https://pip.repos.neuron.amazonaws.com
+   python3 -m pip install jax==0.4.31 jaxlib==0.4.31
+   python3 -m pip install jax-neuronx libneuronxla neuronx-cc==2.* --index-url=https://pip.repos.neuron.amazonaws.com

We can now run some simple JAX programs on the Trainium or Inferentia
accelerators.
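
For instance, a minimal sanity check (an illustrative sketch, not part of this change; it assumes the packages above installed cleanly and a Neuron device is attached):

.. code:: python

   import jax
   import jax.numpy as jnp

   # With jax-neuronx installed, NeuronCores are exposed as JAX devices.
   print(jax.devices())

   @jax.jit
   def matmul_sum(x):
       # Compiled by neuronx-cc and executed on a NeuronCore.
       return jnp.dot(x, x.T).sum()

   x = jnp.ones((1024, 1024), dtype=jnp.bfloat16)
   print(matmul_sum(x))
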
2 changes: 1 addition & 1 deletion frameworks/mxnet-neuron/tutorials/bert_mxnet/index.rst
@@ -108,7 +108,7 @@ Modify Pip repository configurations to point to the Neuron repository:

tee $VIRTUAL_ENV/pip.conf > /dev/null <<EOF
[global]
-extra-index-url = https://pip.repos.neuron.amazonaws.com
+index-url = https://pip.repos.neuron.amazonaws.com
EOF

Install neuron packages:
@@ -54,8 +54,8 @@ following :
git clone https://github.com/aws/aws-neuron-sdk
cd ~/aws-neuron-sdk/src/examples/tensorflow/bert_demo/
export BERT_LARGE_SAVED_MODEL="/path/to/user/bert-large/savedmodel"
-pip install tensorflow_neuron==1.15.5.2.8.9.0 --extra-index-url=https://pip.repos.neuron.amazonaws.com/
-pip install neuron_cc==1.13.5.0 --extra-index-url=https://pip.repos.neuron.amazonaws.com
+pip install tensorflow_neuron==1.15.5.2.8.9.0 --index-url=https://pip.repos.neuron.amazonaws.com/
+pip install neuron_cc==1.13.5.0 --index-url=https://pip.repos.neuron.amazonaws.com
python bert_model.py --input_saved_model $BERT_LARGE_SAVED_MODEL --output_saved_model ./bert-saved-model-neuron --batch_size=6 --aggressive_optimizations

This compiles BERT-Large pointed to by $BERT_LARGE_SAVED_MODEL for an
@@ -89,7 +89,7 @@ BERT-Large demo server :
.. code:: bash

cd ~/aws-neuron-sdk/src/examples/tensorflow/bert_demo/
-pip install tensorflow_neuron==1.15.5.2.8.9.0 --extra-index-url=https://pip.repos.neuron.amazonaws.com/
+pip install tensorflow_neuron==1.15.5.2.8.9.0 --index-url=https://pip.repos.neuron.amazonaws.com/
python bert_server.py --dir bert-saved-model-neuron --batch 6 --parallel 4

This loads 4 BERT-Large models, one into each of the 4 NeuronCores found
@@ -25,7 +25,7 @@ def main():
if tfn_version < LooseVersion('1.15.0.1.0.1333.0'):
raise RuntimeError(
'tensorflow-neuron version {} is too low for this demo. Please upgrade '
'by "pip install -U tensorflow-neuron --extra-index-url=https://pip.repos.neuron.amazonaws.com"'.format(tfn_version))
'by "pip install -U tensorflow-neuron --index-url=https://pip.repos.neuron.amazonaws.com"'.format(tfn_version))

with open(args.image, 'rb') as f:
img_jpg_bytes = f.read()
@@ -39,7 +39,7 @@ def main():
if tfn_version < LooseVersion('1.15.0.1.0.1333.0'):
raise RuntimeError(
'tensorflow-neuron version {} is too low for this demo. Please upgrade '
'by "pip install -U tensorflow-neuron --extra-index-url=https://pip.repos.neuron.amazonaws.com"'.format(tfn_version))
'by "pip install -U tensorflow-neuron --index-url=https://pip.repos.neuron.amazonaws.com"'.format(tfn_version))
predictor_list = [tf.contrib.predictor.from_saved_model(args.saved_model) for _ in range(args.num_sessions)]

val_dataset = get_val_dataset(args.instances_val2017_json, args.val2017)
@@ -296,12 +296,12 @@ def main():
if neuroncc_version < LooseVersion('1.0.18000'):
raise RuntimeError(
'neuron-cc version {} is too low for this demo. Please upgrade '
'by "pip install -U neuron-cc --extra-index-url=https://pip.repos.neuron.amazonaws.com"'.format(neuroncc_version))
'by "pip install -U neuron-cc --index-url=https://pip.repos.neuron.amazonaws.com"'.format(neuroncc_version))
tfn_version = LooseVersion(pkg_resources.get_distribution('tensorflow-neuron').version)
if tfn_version < LooseVersion('1.15.3.1.0.1900.0'):
raise RuntimeError(
'tensorflow-neuron version {} is too low for this demo. Please upgrade '
'by "pip install -U tensorflow-neuron --extra-index-url=https://pip.repos.neuron.amazonaws.com"'.format(tfn_version))
'by "pip install -U tensorflow-neuron --index-url=https://pip.repos.neuron.amazonaws.com"'.format(tfn_version))

sys.path.append(os.getcwd())
from DeepLearningExamples.PyTorch.Detection.SSD.src import model as torch_ssd300_model
@@ -336,7 +336,7 @@ def main():
if not op.get_attr('executable'):
raise AttributeError(
'Neuron executable (neff) is empty. Please check neuron-cc is installed and working properly '
'("pip install neuron-cc --force --extra-index-url=https://pip.repos.neuron.amazonaws.com" '
'("pip install neuron-cc --force --index-url=https://pip.repos.neuron.amazonaws.com" '
'to force reinstall neuron-cc).')
model_config = op.node_def.attr['model_config'].list
if model_config.i:
10 changes: 5 additions & 5 deletions frameworks/torch/torch-neuron/setup/pytorch-install-cxx11.rst
@@ -51,17 +51,17 @@ index.

::

-   pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com/cxx11 torch-neuron --no-deps
+   pip install --index-url=https://pip.repos.neuron.amazonaws.com/cxx11 torch-neuron --no-deps


Specific versions of ``torch-neuron`` with cxx11 ABI support can be installed
just like standard versions of ``torch-neuron``.

::

-   pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com/cxx11 "torch-neuron>=1.8" --no-deps
-   pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com/cxx11 "torch-neuron==1.9.1" --no-deps
-   pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com/cxx11 "torch-neuron<1.10" --no-deps
+   pip install --index-url=https://pip.repos.neuron.amazonaws.com/cxx11 "torch-neuron>=1.8" --no-deps
+   pip install --index-url=https://pip.repos.neuron.amazonaws.com/cxx11 "torch-neuron==1.9.1" --no-deps
+   pip install --index-url=https://pip.repos.neuron.amazonaws.com/cxx11 "torch-neuron<1.10" --no-deps

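One quick way to check which C++ ABI the ``torch`` build in your environment uses (an illustrative check, not from the official instructions; the assumption is that the ``/cxx11`` index is only needed when the surrounding toolchain uses the C++11 ABI):

.. code:: python

   import torch

   # True when libtorch was built with _GLIBCXX_USE_CXX11_ABI=1.
   print(torch.compiled_with_cxx11_abi())
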
.. important::

@@ -117,7 +117,7 @@ is to download the wheel and unpack the contents:

.. code:: bash

-   pip download --extra-index-url=https://pip.repos.neuron.amazonaws.com/cxx11 torch-neuron --no-deps
+   pip download --index-url=https://pip.repos.neuron.amazonaws.com/cxx11 torch-neuron --no-deps
wheel unpack torch_neuron-*.whl

If the exact version of the ``torch-neuron`` package is known and no
2 changes: 1 addition & 1 deletion frameworks/torch/torch-neuron/troubleshooting-guide.rst
@@ -33,7 +33,7 @@ If you encounter an error like below, it is because the model size is larger tha
To compile such large models, use the :ref:`separate_weights=True <torch_neuron_trace_api>` flag. Note:
ensure that you have the latest version of the compiler installed to support this flag.
You can upgrade neuron-cc using
-:code:`python3 -m pip install neuron-cc[tensorflow] -U --force --extra-index-url=https://pip.repos.neuron.amazonaws.com`
+:code:`python3 -m pip install neuron-cc[tensorflow] -U --force --index-url=https://pip.repos.neuron.amazonaws.com`

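A hedged sketch of a trace call using this flag (``MyLargeModel`` and the example input are hypothetical placeholders):

.. code:: python

   import torch
   import torch_neuron

   model = MyLargeModel().eval()         # hypothetical model exceeding the size limit
   example = torch.rand(1, 3, 224, 224)  # hypothetical example input

   # separate_weights=True keeps the weights out of the graph protobuf,
   # allowing models above the protobuf size limit to compile.
   model_neuron = torch.neuron.trace(model, [example], separate_weights=True)
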
::

@@ -218,83 +218,82 @@ compiled and executed if there are extra mark-steps or functions with
implicit mark-steps. Additionally, more graphs can be generated if there
are different execution paths taken due to control-flows.

-Automatic casting of float tensors to BFloat16
-----------------------------------------------

-With PyTorch Neuron, the default behavior is for torch.float (FP32) and torch.double (FP64) tensors
-to be mapped to torch.float in hardware. To reduce memory footprint and improve performance,
-torch.float and torch.double tensors can automatically be converted to BFloat16 by setting
-the environment variable ``XLA_USE_BF16=1``. Alternatively, torch.float can automatically be converted
-to BFloat16 and torch.double converted to FP32 by setting the environment variable ``XLA_DOWNCAST_BF16=1``.

-Automatic Mixed-Precision
--------------------------

-BF16 mixed-precision using PyTorch Autocast
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-By default, the compiler automatically cast internal FP32 operations to
-BF16. You can disable this and allow PyTorch's BF16 mixed-precision to
-do the casting. PyTorch's BF16 mixed-precision is achieved by casting
-certain operations to operate BF16. We currently use CUDA's list of
-operations that can operate in BF16:

-.. code:: bash

-   _convolution
-   _convolution
-   _convolution_nogroup
-   conv1d
-   conv2d
-   conv3d
-   conv_tbc
-   conv_transpose1d
-   conv_transpose2d
-   conv_transpose3d
-   convolution
-   cudnn_convolution
-   cudnn_convolution_transpose
-   cudnn_convolution
-   cudnn_convolution_transpose
-   cudnn_convolution
-   cudnn_convolution_transpose
-   prelu
-   addmm
-   addmv
-   addr
-   matmul
-   mm
-   mv
-   linear
-   addbmm
-   baddbmm
-   bmm
-   chain_matmul
-   linalg_multi_dot
+Full BF16 with stochastic rounding enabled
+------------------------------------------

-To enable PyTorch's BF16 mixed-precision, first turn off the Neuron
-compiler auto-cast:
+Previously, on torch-neuronx 2.1 and earlier, the environment variables ``XLA_USE_BF16`` or ``XLA_DOWNCAST_BF16`` provided full casting to BF16 with stochastic rounding enabled by default. These environment variables are deprecated in torch-neuronx 2.5, although they remain functional with warnings. To replace ``XLA_USE_BF16`` or ``XLA_DOWNCAST_BF16`` with stochastic rounding on Neuron, set ``NEURON_RT_STOCHASTIC_ROUNDING_EN=1`` and use the ``torch.nn.Module.to`` method to cast model floating-point parameters and buffers to data-type BF16, as follows:

.. code:: python

os.environ["NEURON_CC_FLAGS"] = "--auto-cast=none"
os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "1"

# model is created
model.to(torch.bfloat16)

+Stochastic rounding is needed to enable faster convergence for a full BF16 model.

+If the loss is to be kept in FP32, initialize it with ``dtype=torch.float`` as follows:

+.. code:: python

+   running_loss = torch.zeros(1, dtype=torch.float).to(device)

-Next, overwrite torch.cuda.is_bf16_supported to return True:
+Similarly, if the optimizer states are to be kept in FP32, convert the gradients to FP32 before optimizer computations:

.. code:: python

-   torch.cuda.is_bf16_supported = lambda: True
+   grad = p.grad.data.float()

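As an illustrative sketch of where that conversion sits (a hypothetical optimizer loop; only the ``p.grad.data.float()`` call comes from the guidance above):

.. code:: python

   for group in optimizer.param_groups:
       for p in group["params"]:
           if p.grad is None:
               continue
           grad = p.grad.data.float()  # upcast the BF16 gradient to FP32
           # ... perform the optimizer state updates in FP32 using `grad` ...
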
-Next, per recommendation from official PyTorch documentation, place only
-the forward-pass of the training step in the torch.autocast scope:
+For a full example, please see the :ref:`PyTorch Neuron BERT Pretraining Tutorial (Data-Parallel) <hf-bert-pretraining-tutorial>`, which has been updated to use ``torch.nn.Module.to`` instead of ``XLA_DOWNCAST_BF16``.

+BF16 in GPU-compatible mode without stochastic rounding enabled
+---------------------------------------------------------------

+Full BF16 training in GPU-compatible mode enables faster convergence without the need for stochastic rounding, but requires an FP32 copy of the weights/parameters to be saved and used in the optimizer. To enable BF16 in GPU-compatible mode without stochastic rounding, use the ``torch.nn.Module.to`` method to cast the model's floating-point parameters and buffers to BF16, without setting ``NEURON_RT_STOCHASTIC_ROUNDING_EN=1``:

.. code:: python

-   with torch.autocast(dtype=torch.bfloat16, device_type='cuda'):
+   # model is created
+   model.to(torch.bfloat16)

+In the initializer of the optimizer, for example AdamW, you can add code like the following snippet to keep an FP32 copy of the weights:

+.. code:: python

+   # keep a copy of the weights in high precision (FP32)
+   self.param_groups_highprec = []
+   for group in self.param_groups:
+       params = group['params']
+       param_groups_highprec = [p.data.float() for p in params]
+       self.param_groups_highprec.append({'params': param_groups_highprec})

+In the :ref:`PyTorch Neuron BERT Pretraining Tutorial (Data-Parallel) <hf-bert-pretraining-tutorial>`, this mode can be enabled by passing the ``--optimizer=AdamW_FP32ParamsCopy`` option to ``dp_bert_large_hf_pretrain_hdf5.py`` and setting ``NEURON_RT_STOCHASTIC_ROUNDING_EN=0`` (or leaving it unset).

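A hedged sketch of how such an optimizer might consume the FP32 copy in its ``step()`` (names mirror the snippet above; the actual AdamW math is elided):

.. code:: python

   for group, group_hp in zip(self.param_groups, self.param_groups_highprec):
       for p, p_hp in zip(group['params'], group_hp['params']):
           if p.grad is None:
               continue
           grad = p.grad.data.float()  # BF16 gradient upcast to FP32
           # ... apply the AdamW update to the FP32 master copy p_hp ...
           p.data.copy_(p_hp)          # write back, downcasting to BF16
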
+.. _automatic_mixed_precision_autocast:

+BF16 automatic mixed precision using PyTorch Autocast
+-----------------------------------------------------

+By default, the compiler automatically casts internal FP32 operations to
+BF16. You can disable this and instead allow PyTorch's BF16 automatic mixed
+precision function (``torch.autocast``) to cast selected operations to BF16.

+To enable PyTorch's BF16 mixed-precision, first turn off the Neuron
+compiler auto-cast:

+.. code:: python

+   os.environ["NEURON_CC_FLAGS"] = "--auto-cast=none"

+Next, per the recommendation in the official PyTorch `torch.autocast documentation <https://pytorch.org/docs/stable/amp.html#autocasting>`__, place only
+the forward pass of the training step in the ``torch.autocast`` scope with the ``xla`` device type:

+.. code:: python

+   with torch.autocast(dtype=torch.bfloat16, device_type='xla'):
+       # forward pass

-The device type is CUDA because we are using CUDA's list of BF16
-compatible operations as mentioned above.
+The device type is XLA because we are using PyTorch-XLA's autocast backend. The PyTorch-XLA `autocast mode source code <https://github.com/pytorch/xla/blob/master/torch_xla/csrc/autocast_mode.cpp>`_ lists which operations are cast to lower-precision BF16 (the "lower precision fp cast policy" section), which are kept in FP32 (the "fp32 cast policy" section), and which are promoted to the widest input type (the "promote" section).

Example showing the original training code snippet:

@@ -319,7 +318,7 @@ The following shows the training loop modified to use BF16 autocast:
   def train_loop_fn(train_loader):
       for i, data in enumerate(train_loader):
-          torch.cuda.is_bf16_supported = lambda: True
-          with torch.autocast(dtype=torch.bfloat16, device_type='cuda'):
+          with torch.autocast(dtype=torch.bfloat16, device_type='xla'):
               inputs = data[0]
               labels = data[3]
               outputs = model(inputs, labels=labels)
@@ -328,7 +327,7 @@ The following shows the training loop modified to use BF16 autocast:
           optimizer.step()
           xm.mark_step()

-For a full example of BF16 mixed-precision, see :ref:`PyTorch Neuron BERT Pretraining Tutorial <hf-bert-pretraining-tutorial>`.
+For a full example of BF16 mixed-precision, see the :ref:`PyTorch Neuron BERT Pretraining Tutorial (Data-Parallel) <hf-bert-pretraining-tutorial>`.

See official PyTorch documentation for more details about
`torch.autocast <https://pytorch.org/docs/stable/amp.html#autocasting>`__
@@ -370,6 +369,12 @@ intermediate results such as loss values. In such case, the printing of
lazy tensors should be wrapped using ``xm.add_step_closure()`` to avoid
unnecessary compilation-and-executions.

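An illustrative sketch of that pattern (``loss`` is a hypothetical lazy tensor from the training step):

.. code:: python

   import torch_xla.core.xla_model as xm

   def _log(loss):
       # Runs on the host once the step's results have materialized,
       # instead of forcing an extra early graph execution.
       print(f"loss: {loss.item():.4f}")

   xm.add_step_closure(_log, (loss,))
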
+Aggregate the data transfers between host CPUs and devices
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+For best performance, you may try to aggregate the data transfers between host CPUs and devices.
+For example, increasing the value of the ``batches_per_execution`` argument when instantiating ``MpDeviceLoader`` can improve performance for models with frequent host-device traffic, such as ViT, as described in `this blog post <https://towardsdatascience.com/ai-model-optimization-on-aws-inferentia-and-trainium-cfd48e85d5ac>`_. Note that increasing the ``batches_per_execution`` value delays the mark-step across that many batches, which increases graph size and can lead to device out-of-memory (OOM) errors.
Ensure common initial weights across workers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -396,13 +401,6 @@ be loaded using ``serialization.load`` api. More information on this here: `Savi

FAQ
---

-What is the difference between Trainium and Inferentia?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Trainium is an accelerator designed to speed up training, whereas
-Inferentia is an accelerator designed to speed up inference.

Debugging and troubleshooting
-----------------------------

@@ -72,7 +72,7 @@
pip install awscli

# Install packages from repos
-python -m pip config set global.extra-index-url "https://pip.repos.neuron.amazonaws.com"
+python -m pip config set global.index-url "https://pip.repos.neuron.amazonaws.com"

# Install Python packages - Transformers package is needed for BERT
python -m pip install torch-neuronx=="1.11.0.1.*" "neuronx-cc==2.*"
@@ -144,7 +144,7 @@
pip install awscli

# Install packages from repos
-python -m pip config set global.extra-index-url "https://pip.repos.neuron.amazonaws.com"
+python -m pip config set global.index-url "https://pip.repos.neuron.amazonaws.com"

# Install Python packages - Transformers package is needed for BERT
python -m pip install torch-neuronx=="1.11.0.1.*" "neuronx-cc==2.*"