v4.0.2411 changelog
abouteiller committed Nov 14, 2024
1 parent cb32a62 commit 2666112
Showing 2 changed files with 36 additions and 43 deletions.
70 changes: 30 additions & 40 deletions CHANGELOG.md
@@ -5,155 +5,145 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).


Unreleased (master)
-------------------

v4.0.2411
---------

### Added

-- Add DTD CUDA support including NEW tiles in DTD

-- PaRSEC API 4.0 (still changing)

+- PaRSEC API 4.0
+- Add DTD CUDA support including NEW tiles in DTD.
- Add ROCm/HIP device support.
- Add IrisXE/Level0 device support (experimental).
- Enable users to manage their own data copies without PaRSEC
interfering. Data copies are marked as being owned by PaRSEC or
not and managed by PaRSEC or not. A data copy owned by PaRSEC can
be reclaimed by PaRSEC when its reference count reaches 0, a data
copy managed by PaRSEC can be copied / moved onto a different
device, while a data copy not managed by PaRSEC will never be
moved by the runtime.

- Add an info system, and introduce two info hooks. See parsec/class/info.h
for details. The info system allows the user to register info objects
with different levels of structures and dynamic objects in the PaRSEC
runtime.

- PTG supports user-defined routines to move data between GPU and
CPU, and user-defined sizes for buffers allocated on the GPU.

- PTG supports reshaping data propagated between local tasks and
the specification of two types on accesses to data collections.
- PINS logs SCHEDULE_BEGIN and SCHEDULE_END events to better track the task lifecycle.
- Detect and report oversubscribed binding of core resources.
- Thread binding can be disabled (new MCA parameter).
- Load balancing between GPUs can be tuned (device_load_balance_skew MCA parameter).
- Load balancing exclusivity between CPU/GPUs can be disabled (device_load_balance_allow_cpu MCA parameter).
- Data sent in messages can be of variable size.
- New API parsec_context_query can be used to obtain information on the system, like the number of devices, etc.
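
The owned / managed distinction for data copies described in the list above can be pictured with a small sketch. The flag names, the struct, and both helper functions are hypothetical and exist only for this illustration; they are not the PaRSEC identifiers.

```c
#include <stdbool.h>

/* Hypothetical flags for this sketch; NOT the actual PaRSEC identifiers. */
enum {
    SKETCH_COPY_OWNED   = 1 << 0,  /* the runtime may reclaim the copy */
    SKETCH_COPY_MANAGED = 1 << 1   /* the runtime may copy / move the copy */
};

/* Hypothetical, minimal stand-in for a data copy. */
typedef struct {
    unsigned flags;     /* combination of SKETCH_COPY_* */
    int      refcount;  /* number of live references */
} sketch_copy_t;

/* An owned copy can be reclaimed once its reference count reaches 0. */
bool sketch_can_reclaim(const sketch_copy_t *c)
{
    return (c->flags & SKETCH_COPY_OWNED) != 0 && c->refcount == 0;
}

/* Only a managed copy may be copied or moved onto a different device. */
bool sketch_can_move(const sketch_copy_t *c)
{
    return (c->flags & SKETCH_COPY_MANAGED) != 0;
}
```

Under these assumptions, a copy created by the user with neither flag set is never reclaimed and never moved, which corresponds to the "without PaRSEC interfering" case.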

### Changed

- Single letter command line options have been replaced with --mca parameters.
--help is now --parsec-help.

- Renamed symbols related to data distribution to properly prefix them with
the `parsec_` prefix. The old symbols have been deprecated.

- DTD interface change: the global array parsec_dtd_arena_datatypes
is replaced with functions to create, destroy, and get arena
datatypes for DTD, and these objects now live inside the
parsec context.

- PARSEC_SUCCESS changed to 0 (from -1), all values for PARSEC_ERR_XYZ changed.

-- PaRSEC now requires CMake 3.18.

+- PaRSEC now requires CMake 3.21.
- PaRSEC profiling tools now require Python 3.x
- PaRSEC profiling system no longer requires local dictionaries to
be identical between ranks.
- time_estimate functions can be used to control task load balancing.
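
The change of PARSEC_SUCCESS from -1 to 0 matters mostly for code that compared return values against a literal. A minimal sketch of the difference follows; the `SKETCH_*` macros and both functions are stand-ins invented for this illustration, and real code would compare directly against `PARSEC_SUCCESS` from the PaRSEC headers.

```c
#include <stdbool.h>

/* Stand-in values for this sketch: success was -1 before and is 0 now. */
#define SKETCH_SUCCESS_OLD (-1)
#define SKETCH_SUCCESS_NEW 0

/* Fragile: hard-codes the previous numeric value of success. */
bool sketch_ok_hardcoded(int rc)
{
    return rc == -1;
}

/* Robust: compares against the symbolic constant supplied by the headers. */
bool sketch_ok_symbolic(int rc, int success_value)
{
    return rc == success_value;
}
```

With the new value, `sketch_ok_hardcoded` reports failure for a successful call, while the symbolic comparison keeps working across both versions.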

### Deprecated

- PaRSEC API 3.0

- data distribution w/o the `parsec_` prefix. Further documentation (including a
sed script) can be found in contrib/renaming.

### Removed

- PaRSEC API 3.0
- RECURSIVE Device support (this is temporary and will be restored in a future version).
- Removed obsolete dbp2paje tool; h5totrace is the replacement tool
to use. This removes the optional dependency on GTG.

- Removed all command line options not prefixed by --mca, except for --parsec-help
and --parsec-version.
- Using more than PARSEC_GPU_MAX_WORKSPACE workspaces per device will now cause an error (instead of computing incorrect values).
- PTG property 'weight' (replaced by 'time_estimate').

### Fixed

- DTD Termination detection would occasionally assert.
- Multiple bugs with GPU data ownership causing crashes and incorrect results when executing with more than 1 GPU.
- Device-to-device memory copies would not work in some scenarios.
- Suboptimal ordering of members in broadcast tree could cause performance reduction.
- Cray MPI and MPICH would crash during MPI_Cancel and when using NULL datatypes.
- Do not report incorrect flops/s capabilities (device_show_capabilities MCA parameter).
- On some systems PaRSEC would allocate more GPU memory than is available on the device.
- Performance with a large number of GPU tasks with the same priority would be poor due to the overhead of sorting by priority.

### Security


v3.0.2012
---------

- PaRSEC API 3.0

- PaRSEC now requires CMake 3.16.

- New configure system to ease the installation of PaRSEC. See
INSTALL for details. This system automates installation on most DOE
leadership systems.

- Split DPLASMA and PaRSEC into separate repositories. PaRSEC moves from
cmake-2.0 to cmake-3.12, using targets. Targets are exported for
third-party integration

- Add visualization tools to extract user-defined properties from the
application (see: PR 229 visualization-tools)

- Automate expression of required data transfers from host-to-device and
device-to-host to satisfy dependencies (and anti-dependencies). PaRSEC tracks
multiple versions of the same data as data copies with a coherency algorithm
that initiates data transfers as needed. The heuristic for the eviction policy
on out-of-memory events on the GPU has been optimized to allow for efficient
operation on problems larger than GPU memory.

- Add support for MPI out-of-order matching capabilities. Add the capability
for compute threads to send direct control messages to indicate completion
of tasks to remote nodes (without delegation to the communication thread).

- Remove communication mode EAGER from the runtime. It had a
hard-to-correct bug that would occasionally deadlock, and the performance
benefit was small.

- Add a Map operator on the Block Cyclic matrix data collection that
performs in-place data transformation on the collection with a user provided
operator.

- Add support in the runtime for user-defined properties evaluated at
runtime and easy to export through a shared memory region (see: PR
229 visualization-tools)

- Add a PAPI-SDE interface to the parsec library, to expose internal
counters via the PAPI-Software Defined Events interface.

- Add backend support for OTF2 in the profiling mechanism. OTF2 is
used automatically if an OTF2 installation is found.

- Add an MCA parameter to control the number of ejected blocks from GPU
memory (device_cuda_max_number_of_ejected_data). Add an MCA parameter
to control whether or not the GPU engine will take some time to sort
the first N tasks of the pending queue (device_cuda_sort_pending_list).

- Reshape the user's view of PaRSEC: for most usages, users only have to
include a single header (parsec.h) and link with a single library
(-lparsec).

- Update the PaRSEC DSL handling of initial tasks. We now rely on 2
pieces of information: the number of DSL tasks, and the number of
tasks imposed by the system (all types of data transfer).

- Add a purely local scheduler (ll), that uses a single LIFO per
thread. Each schedule operation does 1 atomic (push in local queue),
each select operation does up to t atomics (pop in local queue, then
try any other thread's queue until they are all tested empty).

- Add a --ignore-properties=... option to parsec_ptgpp

- Change API of hash tables: allow keys of arbitrary size. The API
specifies how to build a key from a task, how to hash a key into
1 <= N <= 64 bits, and how to compare two keys (plus a printing
function for debugging).

- Change behavior of DEBUG_HISTORY: log all information inside
a buffer of fixed size (MCA parameter) per thread, do not allocate
memory during logging, and use timestamp to re-order output
when the user calls dump()

- DTD interface is updated (new flag to send pointer as parameter,
unpacking of parameters is simpler, etc.).

- DTD provides an MCA parameter (dtd_debug_verbose) to print information
about traversal of the DAG in a separate output stream from the default.
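
The purely local scheduler (ll) entry above describes one LIFO per thread, a schedule operation that pushes to the local queue, and a select operation that pops locally before trying every other thread's queue. The following is a sequential sketch of that policy only. All names, the fixed capacities, and the probe order are assumptions made for this illustration, and the atomic operations used by the real scheduler are deliberately left out.

```c
#include <stddef.h>

#define SKETCH_NTHREADS 4
#define SKETCH_CAPACITY 64

/* One LIFO (stack) of non-negative task ids per thread. */
typedef struct {
    int    tasks[SKETCH_CAPACITY];
    size_t top;  /* number of tasks currently stored */
} sketch_lifo_t;

typedef struct {
    sketch_lifo_t queues[SKETCH_NTHREADS];
} sketch_sched_t;

/* Schedule: push onto the calling thread's own LIFO.
 * Returns 0 on success, -1 if that LIFO is full. */
int sketch_schedule(sketch_sched_t *s, int thread, int task)
{
    sketch_lifo_t *q = &s->queues[thread];
    if (q->top == SKETCH_CAPACITY) {
        return -1;
    }
    q->tasks[q->top++] = task;
    return 0;
}

/* Select: pop from the local LIFO first, then probe the other threads'
 * LIFOs in round-robin order. Returns -1 once all have been tested empty. */
int sketch_select(sketch_sched_t *s, int thread)
{
    for (int i = 0; i < SKETCH_NTHREADS; i++) {
        sketch_lifo_t *q = &s->queues[(thread + i) % SKETCH_NTHREADS];
        if (q->top > 0) {
            return q->tasks[--q->top];
        }
    }
    return -1;
}
```

In this sketch a select touches at most `SKETCH_NTHREADS` queues, which mirrors the "up to t atomics" bound quoted in the entry; a concurrent version would replace the plain push and pop with atomic operations.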

9 changes: 6 additions & 3 deletions INSTALL.rst
@@ -24,13 +24,14 @@ below. From 1 to 2 (included) they are mandatory. Everything else is
optional, they provide nice features not critical to the normal usage
of this software package.

-1. cmake version 3.18 or above. cmake can be found in the debian
+1. cmake version 3.21 or above. cmake can be found in the debian
package cmake, or as sources at the CMake_ download page
2. Any MPI library: Open MPI, MPICH2, MVAPICH, or any vendor-blessed
implementation.
3. hwloc_ for processor and memory locality features
-4. For using PINS (instrumentation based on PAPI) PAPI_ is required
-5. For the profiling tools you need several libraries.
+4. AMD and NVIDIA device support require HIP_>=5 and CUDA_>=4 respectively
+5. For using PINS (instrumentation based on PAPI) PAPI_ is required
+6. For the profiling tools you need several libraries.

- Vite_ a visualization environment (only required for visualization)
- GD_ usually available on most Linux distributions via GraphViz
@@ -39,6 +40,8 @@ of this software package.
.. _CMake: http://www.cmake.org/
.. _hwloc: http://www.open-mpi.org/projects/hwloc/
.. _PAPI: http://icl.cs.utk.edu/papi/
+.. _HIP: https://rocm.docs.amd.com/projects/HIP/en/latest/index.html
+.. _CUDA: https://developer.nvidia.com/cuda-toolkit
.. _Vite: https://gforge.inria.fr/projects/vite/
.. _GD: http://www.graphviz.org/
