
Releases: IntelPython/dpctl

v0.19.0

28 Feb 19:25
1336b31

This release features official, out-of-the-box support for compiling dpctl for specified AMD GPU architectures, the addition of the new function tensor.top_k, a radix-sort-based implementation of sorting functions, and improvements to interoperability with DLPack through tensor.dldevice_to_sycl_device and tensor.sycl_device_to_dldevice.

A number of adjustments were also made to improve the performance of dpctl reductions (e.g., sum, min, max), accumulators (e.g., cumulative_sum, cumulative_logsumexp), and copy-and-cast operations.

Added

  • Support for compiling dpctl for specified AMD GPU architecture with use of CodePlay oneAPI plug-in gh-1731
  • Added tensor.top_k per Python Array API specification gh-1921
  • Added functions tensor.dldevice_to_sycl_device and tensor.sycl_device_to_dldevice for converting between DLPack and sycl devices, and a method get_device_id to dpctl.SyclDevice to improve interoperability with DLPack protocol gh-1953
  • Added DPCTL_OFFLOAD_COMPRESS cmake option (set to OFF by default) to toggle --offload-compress linker option when building dpctl gh-1961
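
To make the semantics of tensor.top_k concrete, here is a minimal pure-Python sketch of a top-k selection on a one-dimensional input: it returns the k largest (or smallest) values together with their indices. The helper name top_k_reference is hypothetical and exists only for illustration. It is not dpctl's implementation, and the ordering of tied elements shown here is an assumption.

```python
def top_k_reference(values, k, mode="largest"):
    """Return (top_values, top_indices) for a 1-D sequence.

    Hypothetical reference helper for illustration; not part of dpctl.
    """
    if mode not in ("largest", "smallest"):
        raise ValueError("mode must be 'largest' or 'smallest'")
    if not 0 <= k <= len(values):
        raise ValueError("k must be between 0 and len(values)")
    # Sort indices by value; sorted() is stable, so ties keep input order.
    order = sorted(
        range(len(values)),
        key=lambda i: values[i],
        reverse=(mode == "largest"),
    )
    top_indices = order[:k]
    return [values[i] for i in top_indices], top_indices


vals, idx = top_k_reference([5, 1, 9, 3, 7], 2)
print(vals, idx)  # [9, 7] [2, 4]
```

With dpctl installed, the equivalent call would presumably be dpctl.tensor.top_k(x, 2) on a usm_ndarray x; check the dpctl documentation for the exact keywords.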

Changed

  • Improved performance of copy-and-cast operations from numpy.ndarray to tensor.usm_ndarray for contiguous inputs gh-1829
  • py_sort and py_argsort now throw py::value_error if inputs are not C-contiguous gh-1838
  • Improved performance of copying operation to C-/F-contig array, with optimization for batch of square matrices gh-1850
  • Improved performance of tensor.argsort function for all types gh-1859
  • Improved performance of tensor.sort and tensor.argsort for short arrays in the range [16, 64] elements gh-1866
  • Implemented radix sort algorithm to be used in dpt.sort and dpt.argsort gh-1867, gh-1883
  • Extended dpctl.SyclTimer with device_timer keyword, implementing different methods of collecting device times gh-1872
  • dpctl changed to see GPU devices out of the box in virtual environment on Windows gh-1922
  • Improved performance of tensor.cumulative_sum, tensor.cumulative_prod, tensor.cumulative_logsumexp as well as performance of boolean indexing gh-1923, gh-1942
  • Improved performance of tensor.min, tensor.max, tensor.logsumexp, tensor.reduce_hypot for floating point type arrays by at least 2x gh-1932, gh-1937
  • Updated Cython examples to use scikit-build gh-1935
  • Reduced binary size of _tensor_accumulation_impl by 13 MB gh-1957
  • Extended tensor.asarray to support objects that implement __usm_ndarray__ property to be interpreted as usm_ndarray objects gh-1959
  • tensor.usm_ndarray object disallows implicit conversions to NumPy array gh-1964
  • stream arguments in tensor.usm_ndarray methods now raise an error if stream is not a tensor.SyclQueue gh-1969
  • dpctl initialization sets subprocess to use SPAWN method on Linux to enable gdb-oneapi to debug kernels submitted from Python applications gh-1971
  • Reduced binary size of _tensor_elementwise_impl gh-1976
  • Allow dpctl.SyclQueue.memcpy to and from multi-dimensional buffers gh-1985
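
For readers unfamiliar with the radix sort mentioned above (gh-1867, gh-1883), the following is a minimal sequential sketch of a least-significant-digit radix argsort for non-negative integers. It only illustrates the general technique; dpctl's actual implementation is a parallel SYCL kernel whose details are not described in these notes. The function name radix_argsort and the bits_per_pass parameter are assumptions made for this example.

```python
def radix_argsort(keys, bits_per_pass=8):
    """Stable LSD radix argsort for non-negative integers (illustrative)."""
    if any(k < 0 for k in keys):
        raise ValueError("this sketch handles non-negative integers only")
    order = list(range(len(keys)))
    if not keys:
        return order
    radix = 1 << bits_per_pass
    mask = radix - 1
    max_key = max(keys)
    shift = 0
    while (max_key >> shift) > 0:
        # Distribute indices into buckets by the current digit, preserving
        # the order from earlier passes; this keeps the sort stable.
        buckets = [[] for _ in range(radix)]
        for i in order:
            buckets[(keys[i] >> shift) & mask].append(i)
        order = [i for bucket in buckets for i in bucket]
        shift += bits_per_pass
    return order


keys = [170, 45, 75, 90, 802, 24, 2, 66]
perm = radix_argsort(keys)
print([keys[i] for i in perm])  # [2, 24, 45, 66, 75, 90, 170, 802]
```

Because each pass is stable, the returned permutation behaves like a stable argsort, and applying it to the keys yields the sorted array.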

Fixed

  • Fixed a bug in tensor.roll for very large values of shift gh-1869
  • Fix for tensor.result_type when all inputs are Python built-in scalars gh-1877
  • Improved error in constructors tensor.full and tensor.full_like when provided a non-numeric fill value gh-1878
  • Added a check for pointer alignment when copying to C-contiguous memory gh-1890, gh-1891
  • Fixed dpctl installed into virtual environment not finding DPC++ runtime libraries by adding DPCTL_WITH_REDIST cmake option (set to OFF by default) gh-1893
  • Fixed incorrect result (issue gh-1901) in tensor.cumulative_sum and in advanced indexing gh-1902
  • Fixed __setitem__() for tensor.usm_ndarray when passed an empty boolean mask gh-1915
  • tensor.from_dlpack docstring now shows that return type can be NumPy array and stipulates when this will be the case gh-1919
  • Fixed docstring in helper class in DLPack tests gh-1920
  • Fixed a bug in tensor.astype where copy=False would not be respected for 1d arrays when order keyword is specified gh-1928
  • Replaced deprecated CL/sycl.hpp with recommended sycl/sycl.hpp in examples gh-1933
  • Fixed tensor.take_along_axis and tensor.put_along_axis raising an error for tensor.uint64 indices when given an array of dimension greater than 1 gh-1934
  • Fixed unexpected results of tensor.sum with a requested output type of bool gh-1958
  • Use std::move to avoid unnecessary copying of temporary in triul_ctor.cpp gh-1960
  • Make stream a keyword-only argument in tensor.usm_ndarray.to_device per requirement by array API specification gh-1966
  • Improve efficiency of copy implementation and avoid an unnecessary kernel invocation in tensor.argsort for 1d input gh-1967
  • Corrected uses of NumPy constructors with tensor.usm_ndarray inputs in test suite gh-1968
  • Fixed array API namespace inspection utilities showing complex128 as a valid dtype on devices without double precision and device keywords not working with dpctl.SyclQueue or filter strings gh-1979
  • Fixed a bug in test_sycl_device_interface.cpp which would cause compilation to fail with Clang version 20.0 gh-1989
  • Fixed memory leaks in smart-pointer-managed USM temporaries in synchronizing kernel calls gh-2002
  • UsmNDArray_MakeSimpleFromPtr and UsmNDArray_MakeFromPtr now raise an error when provided an invalid typenum before attempting to create the array gh-2003
  • Fixed typos in tensor.from_numpy and tensor.astype gh-2006
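
As an illustration of the tensor.roll fix (gh-1869), the sketch below shows the behaviour one would expect for very large shifts on a one-dimensional sequence: the shift is reduced modulo the length before any elements move. The notes do not describe the root cause of the bug, so this only models the expected result; roll_reference is a hypothetical helper, not dpctl code.

```python
def roll_reference(seq, shift):
    """Cyclically shift a 1-D sequence to the right by `shift` positions."""
    n = len(seq)
    if n == 0:
        return []
    # Reduce the shift modulo the length first, so arbitrarily large
    # (or negative) shifts map to an offset in [0, n).
    s = shift % n
    return list(seq[n - s:]) + list(seq[:n - s])


x = [0, 1, 2, 3, 4]
print(roll_reference(x, 2))           # [3, 4, 0, 1, 2]
print(roll_reference(x, 10**30 + 2))  # [3, 4, 0, 1, 2]
```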

Maintenance

  • Revert pinning of cmake to 3.26 on Windows gh-1823
  • Update black version used in Python code style workflow gh-1828
  • Fixed CI/CD workflow for building conda packages on Windows gh-1831
  • Revert work-around in test_sycl_kernel_submit.py for problem in MKL 2024.2.0 gh-1836
  • Do not use the Mambaforge variant of miniforge, as it is deprecated gh-1844
  • Use pybind11=2.13.6 gh-1845
  • Remove unnecessary include in C++ header file gh-1846
  • Build the translation unit "simplify_iteration_space.cpp", which was compiled multiple times, as a static library gh-1847
  • Add instructions for installing dpctl from Intel PyPi channel gh-1860
  • Fix warnings when generating docs gh-1855, gh-1861
  • Align conda recipe with conda-forge's {{ stdlib("c") }} migration gh-1868
  • Add missing include of SYCL header to "math_utils.hpp" gh-1899
  • Add support of CV-qualifiers in is_complex<T> helper gh-1900
  • Tuning work for elementwise functions with modest performance gains (under 10%) gh-1889
  • Reduce binary ...

v0.18.3

07 Dec 18:21
69be39d

This is a bug fix release which supports use of dpctl in virtual environment on Windows, resolving gh-1745.

v0.18.2

03 Dec 20:58
7bac769

This is a bug-fix release, see https://github.com/IntelPython/dpctl/milestone/15.

It backports fixes for

  • tensor.result_type behavior for scalars (see gh-1874) and
  • errors when using dpctl in virtual environment on Linux (gh-1892).

Changes from PR gh-1899 were also backported.

v0.18.1

14 Oct 11:56
5e5513f

This is an incremental release in which only the installation instructions in the README were updated, to reflect the change in location of the index with Python packages built by Intel(R) relative to the 0.18.0 release.

v0.18.0

30 Sep 10:42
786365e

This release reaches an important milestone of making offloading fully asynchronous.

Calls to dpctl.tensor submit tasks for execution to DPC++ runtime and return without waiting for execution of these tasks to finish.
The sequential semantics that a user expects from executing a Python script are nevertheless preserved.
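
The idea of asynchronous submission with preserved ordering can be illustrated with a loose standard-library analogy. In the sketch below, a single-worker thread pool stands in for an in-order execution queue: each submission returns immediately, tasks run in the order they were submitted, and the host waits only when it needs the result. This is an analogy under those assumptions, not a description of how dpctl or the DPC++ runtime schedules work.

```python
from concurrent.futures import ThreadPoolExecutor

# A single worker thread executes tasks in submission order, loosely
# mimicking an in-order queue. This is an analogy, not dpctl code.
executor = ThreadPoolExecutor(max_workers=1)
data = [1, 2, 3]


def scale(buf, factor):
    for i in range(len(buf)):
        buf[i] *= factor


def shift(buf, offset):
    for i in range(len(buf)):
        buf[i] += offset


# Both calls return a future right away; the work runs in the background.
f1 = executor.submit(scale, data, 2)
f2 = executor.submit(shift, data, 1)

# Synchronize only when the result is needed on the host.
f2.result()
executor.shutdown()
print(data)  # [3, 5, 7]
```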

The full list of changes that went into this release is:

Added

  • Implement tensor.take_along_axis per Python Array API specification gh-1778
  • Implement tensor.put_along_axis to complement tensor.take_along_axis gh-1798
  • Support for 'device=tensor.kDLCPU' in tensor.from_dlpack function and tensor.usm_ndarray.__dlpack__ method gh-1781
  • Support DLPack on Windows gh-1746
  • Implement tensor.nextafter function per Python Array API specification gh-1730
  • Implement tensor.count_nonzero and tensor.diff functions from Python array API specification gh-1732, gh-1780
  • Add support for order="K" to *_like array creation functions, and change default order keyword value from 'C' to 'K' gh-1808
  • Support for 'max dimensions' in Array API capabilities info data gh-1774
  • Add support for device aspect 'emulated' gh-1691
  • dpctl::tensor::usm_memory class defined in dpctl4pybind11.hpp adds constructor to create Python USM memory objects viewing into existing USM allocations, which can be made by an external library gh-1782
  • Add support for COVERAGE build type in project's CMake script gh-1692

Changed

  • Change ownership of USM allocation by dpctl.memory objects, make executions of dpctl.tensor operations asynchronous gh-1705
  • Add support for Python scalars in tensor.where function gh-1719
  • Optimize division by Python scalar in statistical functions tensor.mean, tensor.std, tensor.var gh-1820
  • Use transcendental functions from sycl namespace instead of std namespace gh-1707
  • Changes for compatibility with recent NumPy in runtime environment gh-1735, gh-1772, gh-1804
  • Array creation function tensor.zeros to use asynchronous memset operation gh-1806
  • The setter of tensor.usm_ndarray.shape property now supports Python scalar value gh-1786
  • Use 'pyproject.toml' instead of 'setup.py' aligning with current packaging best practices gh-1660
  • No longer set SOVERSION property in DPCTLSyclInterface library on Linux gh-1773
  • Update version of 'pybind11' used gh-1758, gh-1812
  • Handle possible exceptions by usm_host_allocator used with std::vector gh-1791
  • Use dpctl::tensor::offset_utils::sycl_free_noexcept instead of sycl::free in host_task tasks associated with life-time management of temporary USM allocations gh-1797
  • Add "same_kind"-style casting for in-place mathematical operators of tensor.usm_ndarray gh-1827, gh-1830

Fixed

  • Fix setting of release variable Sphinx config file gh-1685
  • Handle possible NULL return value from device aspect queries DPCTLDevice_GetMaxWorkGroupSize1d and DPCTLDevice_GetMaxWorkGroupSize2d gh-1690
  • Add license header to conda script files gh-1695
  • Fix tensor.round behavior on CUDA devices gh-1700
  • Add missing #include <sstream> gh-1701
  • Fix for issue 1724 gh-1728
  • Correct USM type for return array of tensor.extract function gh-1727
  • Fix for tensor.unique_all and tensor.unique_inverse to always return index arrays with default indexing data type gh-1741
  • Propagate read-only flag from __sycl_usm_array_interface__ in tensor.asarray function gh-1756
  • tensor.clip to handle Python scalars which are out of bound for the data type of integral array gh-1759
  • Avoid dead-locking by releasing GIL around blocking operations in libtensor gh-1753
  • Element-wise tensor.divide and comparison operations allow greater range of Python integer and integer array combinations gh-1771
  • Fix for unexpected behavior when using floating point types for array indexing gh-1792
  • Enable pytest --pyargs dpctl.tests gh-1833

Maintenance

  • Improve performance of test_sort_complex_fp_nan gh-1704
  • Improve exception wording raised by tensor.broadcast_arrays() gh-1720
  • Remove template keyword in method call of sycl::kernel_bundle gh-1726
  • Backport changelog edits from maintenance/0.17.x gh-1736
  • Replace uses of 'intel' channels in docs and readme file gh-1737
  • Update references to deprecated environment variable SYCL_DEVICE_FILTER gh-1740
  • Correction for installation instruction steps gh-1754
  • Fix for crash during testing with open source SYCL bundle by updating CPU RT library used gh-1762
  • Add missing include to fix build break with newer LLVM gh-1776
  • Add #include <utility> for definition of std::move used gh-1787
  • Change to CMake script to accommodate DPC++ transition from PI to UR architecture gh-1788
  • Document tensor._flags.Flags class gh-1794
  • Fix for unreferenced unreleased bug in copy-and-cast code logic gh-1799
  • Explicitly include headers used in C++ translation units implementing reduction operations gh-1802
  • Clean-up uses of Strided1DIndexer class gh-1805
  • Tweak to readability of C++ code implementing matrix-matrix multiplication gh-1810
  • Do not add sycl::event associated with compute task to vector of events representing execution of host_task gh-1807
  • Remove 'level-zero' conda package from run-time dependencies of 'dpctl' since Intel GPU driver stack now explicitly depends on libze1 package which provides Level-Zero loader library gh-1801, gh-1840
  • Use dedicated type-support matrices for in-place element-wise binary operations gh-1816
  • Remove recommendation to install wheels from Anaconda PyPI index gh-1819
  • Removed use of post-link and pre-unlink conda scripts in dpctl gh-1821
  • Pin compiler used to build 0.18.0 version to 2025.0.0 gh-1822
  • A variety of changes to continuous integration/delivery (CI/CD) supporting scripts to keep CI running smoothly:
    gh-1686, gh-1688, gh-1697, gh-1698, gh-1703, gh-1702, gh-1709, gh-1712, gh-1713, gh-1722, gh-1725, gh-1729, gh-1733, [gh-1721](https...

0.17.0

14 Jul 13:51

This release features updated documentation web-page https://intelpython.github.io/dpctl/latest/index.html, adds cumulative reductions,
and complies with revision 2023.12 of Python Array API specification.

Added

  • Added pybind11 caster for sycl::half to map to/from Python float to "dpctl4pybind11.hpp" header: gh-1655
  • Added support for DLPack data interchange per Python Array API 2023.12 specification: gh-1667
  • Implemented tensor.cumulative_sum, tensor.cumulative_prod and tensor.cumulative_logsumexp: gh-1602

Changed

  • Expanded documentation for dpctl: gh-1619
  • Expanded utils.intel_device_info functionality: gh-1656
  • Improved performance of elementwise operations: gh-1651
  • Efficiency improvement by avoiding unnecessary copying of sycl::queue: gh-1645
  • dpctl uses pybind11 2.12.0: gh-1640
  • Improved performance of tensor.reshape operation with order="F" when copying is needed, or requested: gh-1677

Fixed

  • Fixed initialization of byte type constants in dpctl_capi Python/C API loader class in "dpctl4pybind11.hpp": gh-1665
  • Fixed crash in tensor.sort reported for a CPU device and a CUDA device: gh-1676
  • Fixed race condition in accumulation kernel for custom operations that caused test failures with AMD CPUs: gh-1624
  • Fixed comparison operators for mixed signed and unsigned integral types: gh-1650
  • Support use of index arrays of different integral types in indexing operations: gh-47
  • Fixed source code to compile for NVidia(TM) GPUs with DPC++ 2024.1: gh-1630
  • Corrected tensor.tile for scalar inputs and empty repetitions: gh-1628
  • Fixed support for out keyword in tensor.matmul: gh-1610
  • Fixed bug in basic slicing of empty arrays: gh-1680
  • Fixed bug in tensor.bitwise_invert for boolean input array: gh-1681
  • Fixed bug in tensor.repeat on zero-size input arrays: gh-1682


Full Changelog: https://github.com/IntelPython/dpctl/blob/master/CHANGELOG.md

v0.16.1

11 Apr 01:25
1f13ce8

This release includes bug fixes and provides a change needed by numba_dpex project to support dispatching kernels
consuming instances of sycl::local_accessor template type.

Changed

  • Changed behavior of dpctl.tensor.usm_ndarray.__dlpack_device__ method to return device id of the parent unpartitioned device if array is allocated on a sub-device instead of raising an exception: #1604
  • Array creation functions and the usm_ndarray constructor in dpctl.tensor submodule now use cached default-selected device to improve performance: #1606
  • Changed treatment of axis keyword for dpctl.tensor.tensordot and dpctl.tensor.vecdot to align with Python Array API 2023.12 specification: #1608
  • Changed implementation of DPCTLQueue_SubmitRange and DPCTLQueue_SubmitNDRange in DPCTLSyclInterface library to support sycl::local_accessor arguments needed by numba_dpex; changed the enum DPCTLKernelArgType to correspond to disjoint C++ types: #1609, #1611, #1612

Fixed

  • Fixed a crash on Windows platform during execution of getter of dpctl.SyclPlatform.default_context property: #1604
  • Fixed kernel submission error on NVidia CUDA GPUs during dpctl.tensor.matmul operation: #1605
  • Fixed corruption of context cache table entries: #1607
  • Fixed incorrect result from dpctl.tensor.tensordot reported in issue #1570: #1608
  • Fixed output of python -m dpctl --library to fix specified library name: #1615

v0.16.0

28 Mar 02:59

This release is virtually identical to 0.15.1 as far as features are concerned.

This release is meant to be built with DPC++ 2024.1.0, which no longer supports older integrated Gen9 Intel GPUs, such as those that came with Intel Core 10th generation processors and older.

v0.15.1

10 Feb 21:51

Summary

This release reaches the milestone of 100% compliance of dpctl.tensor functions with the Python Array API 2022.12 standard for the main namespace.

Added

  • Added reduction functions dpctl.tensor.min, dpctl.tensor.max, dpctl.tensor.argmin, dpctl.tensor.argmax, and dpctl.tensor.prod per Python Array API specifications: #1399
  • Added dedicated in-place operations for binary elementwise operations and deployed them in Python operators of dpctl.tensor.usm_ndarray type: #1431, #1447
  • Added new elementwise functions dpctl.tensor.cbrt, dpctl.tensor.rsqrt, dpctl.tensor.exp2, dpctl.tensor.copysign, dpctl.tensor.angle, and dpctl.tensor.reciprocal: #1443, #1474
  • Added statistical functions dpctl.tensor.mean, dpctl.tensor.std, dpctl.tensor.var per Python Array API specifications: #1465
  • Added sorting functions dpctl.tensor.sort and dpctl.tensor.argsort, and set functions dpctl.tensor.unique_values, dpctl.tensor.unique_counts, dpctl.tensor.unique_inverse, dpctl.tensor.unique_all: #1483
  • Added linear algebra functions from the Array API namespace dpctl.tensor.matrix_transpose, dpctl.tensor.matmul, dpctl.tensor.vecdot, and dpctl.tensor.tensordot: #1490, #1525, #1541
  • Added dpctl.tensor.clip function: #1444, #1505
  • Added custom reduction functions dpt.logsumexp (reduction using binary function dpctl.tensor.logaddexp), dpt.reduce_hypot (reduction using binary function dpctl.tensor.hypot): #1446
  • Added inspection API to query capabilities of Python Array API specification implementation: #1469
  • Support for compilation for NVIDIA(R) sycl target with use of CodePlay oneAPI plug-in: #1411, #1124
  • Added dpctl.utils.intel_device_info function to query additional information about Intel(R) GPU devices: gh-1428 and gh-1445
  • Added support for two new device descriptors, dpctl.SyclDevice.max_mem_alloc_size and dpctl.SyclDevice.max_clock_frequency: #1530

Changed

  • Functions dpctl.tensor.result_type and dpctl.tensor.can_cast became device-aware: #1488, #1473
  • Implementation of method dpctl.SyclEvent.wait_for changed to use sycl::event::wait instead of sycl::event::wait_and_throw: gh-1436
  • dpctl.tensor.astype was changed to support device keyword as per Python Array API specification: #1511
  • C++ header files in libtensor/include/kernels containing implementations of SYCL kernels no longer depends on "pybind11.h": #1516

Fixed

v0.15.0

29 Sep 16:06
5bd924e

Summary

The 0.15.0 release represents a milestone: the dpctl.tensor.usm_ndarray object now implements all special Python operators except __matmul__ and __rmatmul__.

dpctl.tensor increases its array API conformance test suite pass rate to 81.8% (passed: 916, failed: 84, skipped: 119).

Details

Added

  • Added dpctl.tensor.floor, dpctl.tensor.ceil, dpctl.tensor.trunc elementwise functions.
  • Added dpctl.tensor.hypot, dpctl.tensor.logaddexp elementwise functions.
  • Added trigonometric (dpctl.tensor.sin, dpctl.tensor.cos, dpctl.tensor.tan) and hyperbolic (dpctl.tensor.sinh, dpctl.tensor.cosh, dpctl.tensor.tanh) elementwise functions and their inverses (dpctl.tensor.asin, dpctl.tensor.asinh, dpctl.tensor.acos, dpctl.tensor.acosh, dpctl.tensor.atan, dpctl.tensor.atanh).
  • Added dpctl.tensor.round function.
  • Added dpctl.tensor.sign and dpctl.tensor.remainder elementwise functions.
  • Added bitwise elementwise functions dpctl.tensor.bitwise_and, dpctl.tensor.bitwise_xor, dpctl.tensor.bitwise_or, dpctl.tensor.bitwise_invert
  • Added bitwise shift functions dpctl.tensor.bitwise_left_shift and dpctl.tensor.bitwise_right_shift.
  • Added dpctl.tensor.atan2 and dpctl.tensor.signbit elementwise functions.
  • Added dpctl.tensor.minimum and dpctl.tensor.maximum binary elementwise functions.
  • Supported equality checking and hashing for dpctl.SyclPlatform.
  • Implemented types property for all unary and binary elementwise functions #1361
  • Added dpctl.tensor.repeat and dpctl.tensor.tile functions.
  • Added dpctl.tensor.matrix_transpose function.

Changed

  • Enabled support for Python arithmetic, in-place arithmetic, reflexive arithmetic, comparison, and bitwise operators for dpctl.tensor.usm_ndarray type #1324.
  • Removed dpctl.tensor.numpy_usm_shared obsolete class and associated tests which were being skipped #1310
  • Transitioned dpctl codebase to Cython 3.
  • Improved performance of boolean reduction functions dpctl.tensor.all and dpctl.tensor.any.
  • Improved performance of summation function dpctl.tensor.sum.
  • Improved in-place arithmetic operations for addition, subtraction and multiplication.
  • Updated codebase per SYCL-2020 intel/llvm compiler deprecation warnings.
  • Improved performance of advanced boolean indexing for arrays whose size fits in 32-bit signed integer type.
  • Removed deprecated DPCTLDevice_GetMaxWorkItemSizes function from the SyclInterface library.
  • Improved performance of dpctl.tensor.reshape in the case when a copy is being made.
  • Improved performance of dpctl.tensor.roll function.

Fixed