Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update blis to 1.0 #6

Open
wants to merge 749 commits into
base: master
Choose a base branch
from
Open

Update blis to 1.0 #6

wants to merge 749 commits into from

Conversation

honnibal
Copy link
Member

No description provided.

fgvanzee and others added 30 commits September 21, 2021 14:54
Details:
- Modified .travis.yml so that only commits to 'master', 'dev', and
  'amd' branches get built by Travis CI. Thanks to Devin Matthews for
  helping to track down the syntax for this change.
Details:
- Updated FAQ.md to include two new questions, reordered an existing
  question, and also removed an outdated and redundant question about
  BLIS vs. AMD BLIS.
- Updated Sandboxes.md to use 'gemmlike' as its main example, along with
  other smaller details.
- Added ARM as a funder to README.md.
Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs.
- Fixed for 6x8 only, 4x4 & 4x8 pending;
- Installed to config firestorm as benchmark seems to show better perf:
   Old:
blis_dgemm_ukr_c                     6     8   320    36.87   2.43e-17   PASS
blis_dgemm_ukr_c                     6     8   352    40.55   1.04e-17   PASS
blis_dgemm_ukr_c                     6     8   384    44.24   5.68e-17   PASS
blis_dgemm_ukr_c                     6     8   416    41.67   3.51e-17   PASS
blis_dgemm_ukr_c                     6     8   448    34.41   2.94e-17   PASS
blis_dgemm_ukr_c                     6     8   480    42.53   2.35e-17   PASS

   New:
blis_dgemm_ukr_r                     6     8   352    50.69   1.59e-17   PASS
blis_dgemm_ukr_r                     6     8   384    49.15   5.55e-17   PASS
blis_dgemm_ukr_r                     6     8   416    50.44   2.86e-17   PASS
blis_dgemm_ukr_r                     6     8   448    46.92   3.12e-17   PASS
blis_dgemm_ukr_r                     6     8   480    48.08   4.08e-17   PASS
…es for testsuite inaries.

- RPATH entries (and DYLD_LIBRARY_PATH) do nothing on macOS unless the install_name of the library starts with @rpath/. While the install_name can be set to the absolute install path, this makes the installation non-relocatable. When using @path in the install_name, install paths within the normal DYLD_LIBRARY_PATH work with no changes on the user side, but for install paths off the beaten track, users must specify an RPATH entry when linking (or modify DYLD_LIBRARY_PATH at runtime). Perhaps this could be made into a configure-time option.
- Having relocable testsuite binaries is not necessarily a priority but it is easy to do with @executable_path (macOS) or $ORIGIN (linux/BSD).
Details:
- Reworked support for ARM hardware detection in bli_cpuid.c to parse 
  the result of a CPUID-like instruction.
- Added a64fx support to bli_gks.c.
- #include arm64 and arm32 family headers from bli_arch_config.h.
- Fix the ordering of the "armsve" and "a64fx" strings in the 
  config_name string array in bli_arch.c. The ordering did not match
  the ordering of the corresponding arch_t values in bli_type_defs.h,
  as it should have all along.
- Added clang support to make_defs.mk in arm64, cortexa53, cortexa57 
  subconfigs.
- Updated arm64 and arm32 families in config_registry.
- Updated docs/HardwareSupport.md to reflect added ARM support.
- Thanks to Dave Love, RuQing Xu, and Devin Matthews for their
  contributions in this PR (flame#344).
Adds `--enable-rpath/--disable--rpath` (default disabled) to use an install_name starting with @rpath/. Otherwise, set the install_name to the absolute path of the install library, which was the previous behavior.
Move unused ARM SVE kernels to "old" directory.
bli_error: more cleanup on the error strings array
Add an option to use an @rpath-dependent install_name on macOS
Previously, this was a global variable. Setting the value was synchronized via a mutex but reading the value was not. Of course, these accesses are almost certainly atomic, but there is still the possibility of one thread attempting to set the value and then reading the value set by another thread. For correct operation under user threading (e.g. pthreads), this should probably be thread-local with no mutex.
Fix a bug in bli_gemmsup_rd_armv8a_asm_d6x8m.c.
For safety upon similar strategies in the future,
 change all [mn]_[iter/left] into signed ints.
Commenting out <sys/sysctl.h> due to possibly a Xcode bug.
This test will run on Linux, but all the kernels should run just fine. This does not test autodetection but then none of the other ARM tests do either.
ct-clmsn and others added 30 commits July 27, 2023 15:23
Details:
- Minor changes to test_libblis.c to support hpx.
Details:
- Added build/detect/android/bionic.h header to test whether the 
  __BIONIC__ cpp macro is defined.
- In common.mk, only add -lrt to LDFLAGS when Bionic is not present.
- CREDITS file update.
Details:
- Fixed various typos in API documentation in docs/BLIS*API.md and 
  comments in the source code examples within examples/?api/*.c.
Details:
- Thanks to Igor Zhuravlov for PR flame#753 (commit 915daaa).
Details:
- Added reminders that #include "blis.h" must be added to source files
  in order to access BLIS API function prototypes. Thanks to Barry Smith
  for suggesting this improvement.
- Fixed pre-existing typos.
- CREDITS file update.
Details:
- Previously, the timpl_t id being used when a thrcomm_t is being
  initialized was set within the bli_thrcomm_init() dispatch function
  after the timpl_t-specific bli_thrcomm_init_*() function returned. But
  it just occurred to me that each bli_thrcomm_init_*() function already
  intrinsically knows its own timpl_t value. This commit shifts the
  setting of the thrcomm_t.ti field into the corresponding
  bli_thrcomm_init_*() function for each timpl_t type (e.g. single,
  openmp, pthreads, hpx).
- Removed long-deprecated code dating back nearly 10 years.
- Whitespace changes
- Comment updates.
Details:
- Commit 2db31e0 (flame#755) inserted logic into common.mk that attempts to
  preprocess build/detect/android/bionic.h to determine whether the
  __BIONIC__ macro is defined (in which case -lrt should not be included
  in LDFLAGS). However, the path to bionic.h was encoded without regard
  to DIST_PATH, and so utilizing common.mk anywhere that isn't the top-
  level directory (such as in the testsuite directory) resulted in a
  compiler error:

    gcc: error: build/detect/android/bionic.h: No such file or directory
    gcc: fatal error: no input files
    compilation terminated.

  This commit adds a $(DIST_PATH) prefix to the path to bionic.h so that
  it can be located from other applications' Makefiles that use BLIS's
  makefile fragments.
Details:
- Revamped bli_init_apis() and bli_finalize_apis() to use separate
  bli_pthread_switch_t objects for each of the five sub-API init
  functions, with the objects for the 'ind' and 'rntm' sub-APIs being
  declared with BLIS_THREAD_LOCAL. This allows some APIs to be treated
  as thread-local and the rest as thread-shared. Thanks to Edward Smyth
  for requesting application thread-specific rntm_t structs, which
  inspired these change.
- Combined bli_thread_init_from_env() and bli_pack_init_from_env() into
  a new function, bli_rntm_init_rntm_from_env(), and placed the combined
  code in bli_rntm.c inside of a new bli_rntm_init() function. Then
  removed the (now empty) bli_pack_init() and _finalize() function defs.
- Deprecated bli_rntm_init() for the purposes of initializing a rntm_t
  (temporarily preserving it as bli_rntm_clear() in a cpp-undefined code
  block) so that the function name could be used for the aforementioned
  bli_rntm_init() function.
- Updated libblis_test_pobj_create() in test_libblis.c to use a static
  rntm_t initializer instead of the deprecated bli_rntm_init()
  function-based option.
- Minor updates to docs/Multithreading.md, including removal of
  bli_rntm_init() in the example of how to initialize rntm_t structs.
- Changed the return value of bli_gks_init(), bli_ind_init(),
  bli_memsys_init(), bli_thread_init(), and bli_rntm_init() (and their
  finalize() counterparts) from 'void' to 'int' so that those functions
  match the function type expected by bli_pthread_switch_on()/_off().
  Those init/finalize functions now return 0 to indicate success, which
  is needed so that the switch actually changes state from off to on
  and vice versa.
- Defined bli_thread_reset(), which copies the contents of the
  global_rntm_at_init() struct into the global_rntm struct (for the
  current application thread).
- Guard calls to bli_pthread_mutex_lock()/_unlock() in
  - bli_pack_set_pack_a() and _pack_b()
  - bli_rntm_init_from_global()
  - bli_thread_set_ways()
  - bli_thread_set_num_threads()
  - bli_thread_set_thread_impl()
  - bli_thread_reset()
  - bli_l3_ind_oper_set_enable()
  with #ifdef BLIS_DISABLE_TLS (since TLS precludes the possibility of
  race conditions).
- In frame/base/bli_rntm.c, declare global_rntm, global_rntm_at_init,
  and global_rntm_mutex as BLIS_THREAD_LOCAL so that separate
  application threads can change the number of ways of BLIS parallelism
  independently from one another.
- Access global_rntm only via a new private (not exported) function,
  bli_global_rntm(). Defined a similar function for a rntm_t new to
  this commit, global_rntm_at_init, which preserves the state of the
  global rntm at initialization-time.
- In frame/3/bli_l3_ind.c, added a guard to the declaration of the
  static variable oper_st_mutex with #ifdef BLIS_DISABLE_TLS so that the
  mutex is omitted altogether when TLS is enabled (which prevents the
  compiler from warning about an unused variable).
- Removed redundant code from bli_thread.c:
    #ifdef BLIS_ENABLE_HPX
    #include "bli_thread_hpx.h"
    #endif
  since this code is already present in bli_thread.h.
- Thanks to Minh Quan Ho for his review of and feedback on this commit.
- Comment updates.
Details:
- Replaced 404'd link in docs/Multithreading.md with an archive from
   The Wayback Machine.
- CREDITS file update.
Details:
- Fixed hpx::for_each invocation and replace with hpx::for_loop. The HPX 
  runtime was initialized using hpx::start, but the hpx::for_each 
  function was being called on a non-hpx runtime (i.e standard BLIS 
  runtime - single main thread). To run hpx::for_each on HPX runtime 
  correctly, the code now uses hpx::run_as_hpx_thread(func, args...). 
- Replaced hpx::for_each with hpx::for_loop, which eliminates use of 
  hpx::util::counting_iterator.
- Employ hpx::execution::chunk_size(1) to make sure that a thread 
  resides on a particular core.
- Replaced hpx::apply() with updated version hpx::post().
- Initialize tdata->id = 0 in libblis.c to 0, as it is the main thread 
  and is needed for writing results to output file.
- By default, if not specified, the HPX runtime uses all N threads/cores 
  available in the system. But, if we want to only specify n_threads out 
  N threads, we use hpx::execution::experimental::num_cores(n_threads).
Details:
- Forward-ported 'altra' and 'altramax' subconfigurations from the
  older 'stable' branch lineage [1]. These subconfigs primarily target
  the Ampere Altra and AltraMax (ARM) processors. They also contain
  "QuickStart" directories with information and scripts to help
  use BLIS on these microarchitectures. Thanks to Jeff Diamond and
  Leick Robinson for developing these subconfigs and resources.
- Updated kernels/armv8a/3/bli_gemm_armv8a_asm_d6x8.c according to
  changes in the 'stable' lineage, mostly related to re-enabling of
  assembly code branches that target general stride IO.

[1] Note that the 'stable' branch is being used to make sure that more
    recent commits do not introduce unreasonable performance
    regressions. As such, the name should be interpreted as shorthand
    for "performance stable," not "API stable."
Details:
- Expanded existing BLAS compatibility APIs to provide interfaces to
  [cz]symv_(), [cz]syr_(). This was easy since those operations were
  already implemented natively in BLIS; the APIs were previously
  omitted only because they were not formally part of the BLAS.
- Implemented [cz]rot_() by feeding code from LAPACK 3.11 through
  f2c.
- Thanks to James Foster for pointing out that LAPACK contains these
  additional symbols, which prompted these additions, as well as for
  testing the [cz]rot_() functions from Julia's test infrastructure.
- CREDITS file update.
Details:
- Previously, disabling the sba via --disable-sba-pools resulted in a
  segfault due to a sanity-check-triggering abort(). The problem was
  that the sba, as currently used in the l3 thread decorators, did not
  yet (fully) support pools being disabled. The solution entailed
  creating wrapper function, bli_sba_array_elem(), which either calls
  bli_apool_array_elem() (when sba pools are enabled at configure time)
  or returns a NULL sba_pool pointer (when sba pools are disabled), and
  calling bli_sba_array_elem() in place of bli_apool_array_elem(). Note
  that the NULL pointer returned by bli_sba_array_elem() when the sba
  pools are disabled does no harm since in that situation the pointer
  goes unreferenced when acquiring and releasing small blocks. Thanks to
  John Mather for reporting this bug.
- Guarded the bodies of bli_sba_init() and bli_sba_finalize() with
  #ifdef BLIS_ENABLE_SBA_POOLS. I don't think this was actually necessary
  to fix the aforementioned bug, but it seems like good practice.
- Moved the code in bli_l3_thrinfo_create() that checked that the array*
  pointer is non-NULL before calling bli_sba_array_elem() (previously
  bli_apool_array_elem()) into the definition of bli_sba_array_elem().
- Renamed various instances of 'pool' variables and function parameters
  to 'sba_pool' to emphasize what kind of pool it represents.
- Whitespace changes.
Details:
- Parse $(CC_VENDOR) values of "nvc" in 'zen3' make_defs.mk file.
- Minor refactor to accommodate above edit.
- CREDITS file update.
Details:
- Fixed a bug that resulted in BLIS non-deterministically calling the
  gemmsup handler, irrespective of the thresholds that are registered
  via bli_cntx_set_blkszs().
- Deep dive: In bli_cntx_init_ref.c, the default values for the gemmsup
  thresholds (BLIS_[MNK]T blocksizes) wre being set to zero so that no
  operation ever matched the criteria for gemmsup (unless specific sup
  thresholds are registered). HOWEVER, these thresholds are set via
  bli_cntx_set_blkszs() which calls bli_blksz_copy_if_pos(), which was
  only coping the thresholds into the gks' cntx_t if the values were
  strictly positive. Thus, the zero values passed into
  bli_cntx_set_blkszs() were being ignored and those threshold slots
  within the gks were left uninitialized. The upshot of this is that the
  reference gemmsup handler was being called for gemm problems
  essentially at random (and as it turns out, very rarely the reference
  gemmsup implementation would encounter a divide-by-zero error).
- The problem was fixed by changing bli_blksz_copy_if_pos() so that it
  copies values that are non-negative (values >= 0 instead of > 0). The
  function was also renamed to bli_blksz_copy_if_nonneg()
- Also needed to standardize use of -1 as the sole value to embed into
  blksz_t structs as a signal to bli_cntx_set_blkszs() to *not* register
  a value for that slot (and instead let whatever existing values
  remain). This required updates to the bli_cntx_init_*() functions for
  bgq, cortexa9, knc, penryn, power7, and template subconfigs, as some
  of these codes were using 0 instead of -1.
- Fixes flame#781. Thanks to Devin Matthews for identifying, diagnosing, and
  proposing a fix for this issue.
Details:
- Fixed hpx barrier synchronization. HPX was hanging on larger cores
  because blis was using non-hpx synchronization primitives. But when
  using hpx-runtime only hpx-synchronization primitives should be used.
  Hence, a C style wrapper hpx_barrier_t is introduced to perform hpx
  barrier operations.
- Replaced hpx::for_loop with hpx::futures. Using hpx::for_loop with
  hpx::barrier on n_threads greater than actual hardware thread count
  causes synchronization issues making hpx hanging. This can be avoided
  by using hpx::futures, which are relatively very lightweight, robust
  and scalable.
Details:
- Added a new 'sifive_x280' subconfiguration for SiFive's x280 RISC-V 
  instruction set architecture. The subconfig registers kernels from a 
  correspondingly new kernel set, also named 'sifive_x280'.
- Added the aforementioned kernel set, which includes intrinsics- and
  assembly-based implementations of most level-1v kernels along with
  level-1f kernels axpy2v dotaxpyv, packm kernels, and level-3 gemm,
  gemmtrsm_l, and gemmtrsm_u microkernels (plus supporting files).
- Registered the 'sifive_x280' subconfig as belonging to a singleton 
  family by the same name.
- Added an entry to '.travis.yml' to test the new subconfig via qemu.
- Updates to 'travis/do_riscv.sh' script to support the 'sifive_x280' 
  subconfig and to reflect updated tarball names.
- Special thanks to Lee Killough, Devin Matthews, and Angelika Schwarz
  for their engagement on this commit.
Details:
- Install one-line headers to INCDIR whose entire purpose is to
  #include the actual headers within the local 'blis' header directory
  so that applications can #include "blis.h" instead of #include
  <blis/blis.h> (and/or "cblas.h" instead of <blis/cblas.h> if CBLAS is
  enabled) when headers are installed to global paths. (Note that
  INCDIR is the installation prefix for headers as specified by
  '--includedir=INCDIR', which defaults to 'PREFIX/include' if not
  specified.) Not sure how this problem went unreported for so long,
  since presumably any user trying to #include "blis.h" from a global
  installation would have encountered a compiler error.
- The one-line blis.h and cblas.h headers now reside in the 'build'
  directory, ready to install as is.
- Thanks to to Jed Brown for reporting this via Issue flame#786, and for
  Devin Matthews and Mo Zhou for their engagement.
- Harmonized the rule in the top-level Makefile for installing blis.pc
  into SHAREDIR/pkgconfig with conventions for others vis-a-vis
  verbosity/non-verbosity.
- (cherry picked from commit 141a6c9)
Details:
- Fixed a segfault in the non-gemm test drivers in test/3 that was the
  result of sometimes leaving either .n_str or .k_str fields of the
  params_t struct uninitialized, depending on the operation in question.
  For example, in test_hemm.c, init_def_params() would only initialize
  the .m_str and .n_str fields, but not the .k_str field. Even though
  hemm doesn't use a 'k' dimension, the proc_params() function (called
  via parse_cl_params()) universally attempts to convert all three into
  integers via sscanf(), which was understandably failing when one of
  those strings was a NULL pointer. I'm not sure how this code ever
  worked to begin with. Special thanks to Leick Robinson for finding and
  reporting this bug.
- (cherry picked from commit 1236dda)
Details:
- Previously, in cblas.h, bli_config.h was being #included *after*
  bli_system.h, which meant that the BLIS_ENABLE_SYSTEM macro was
  never defined in time for proper OS detection. This bug only
  affected cblas.h -- blis.h had been correctly #including
  bli_config.h before bli_system.h since fb93d24. Thanks to
  Edward Smyth for reporting this bug and suggesting the fix.
- (cherry picked from commit a72e456)
Details:
- Previously, the standalone performance drivers in test/3 were written
  under the assumption that the user would want to explicitly test
  either native execution *or* 1m. But because the accompanying runme.sh
  script defaults to passing "native" in for the -i command line option
  (which explicitly sets the induced method type), running the script
  without modification causes the test drivers to use slow reference
  microkernels on systems where native complex-domain microkernels are
  not registered -- which will yield poor performance for complex-domain
  level-3 operations. Furthermore, even if a user was aware of this, the
  test drivers did not support any single value for the -i option that
  would test BLIS using the library's default behavior -- that is, using
  1m on systems where it is needed and native execution on systems that
  have native microkernels implemented and registered.
- This commit addresses the aforementioned issue by supporting a new
  value for the -i option: "auto". The "auto" value causes the driver
  to avoid explicitly setting the induced method altogether, leaving
  BLIS's default behavior in place. This "auto" option is also now the
  default setting within the runme.sh script. Thanks to Leick Robinson
  for finding and reporting this issue.
- Also added support for "nat" as a shorthand for "native", which
  the help text already (erroneously) claimed was supported.
- (cherry picked from commit fd1a7e3)
Details:
- Request default induced method behavior of BLIS via "-i auto" when
  running the standalone performance drivers in test/3 via the runme.sh
  script present in that directory. (Previously, the runme.sh script
  would use "-i native" by default.) This change was originally intended
  for fd1a7e3.
- (cherry picked from commit cad5149)
Details:
- (cherry picked from commit 06dddf1)

CHANGELOG update (1.0)

Details:
- (cherry picked from commit a876918)

Version file update (1.0)

Details:
- (cherry picked from commit c2af113)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.