forked from flame/blis
-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update blis to 1.0 #6
Open
honnibal
wants to merge
749
commits into
master
Choose a base branch
from
blis-1.0
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Details: - Modified .travis.yml so that only commits to 'master', 'dev', and 'amd' branches get built by Travis CI. Thanks to Devin Matthews for helping to track down the syntax for this change.
Details: - Updated FAQ.md to include two new questions, reordered an existing question, and also removed an outdated and redundant question about BLIS vs. AMD BLIS. - Updated Sandboxes.md to use 'gemmlike' as its main example, along with other smaller details. - Added ARM as a funder to README.md.
Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs.
- Fixed for 6x8 only, 4x4 & 4x8 pending; - Installed to config firestorm as benchmark seems to show better perf: Old: blis_dgemm_ukr_c 6 8 320 36.87 2.43e-17 PASS blis_dgemm_ukr_c 6 8 352 40.55 1.04e-17 PASS blis_dgemm_ukr_c 6 8 384 44.24 5.68e-17 PASS blis_dgemm_ukr_c 6 8 416 41.67 3.51e-17 PASS blis_dgemm_ukr_c 6 8 448 34.41 2.94e-17 PASS blis_dgemm_ukr_c 6 8 480 42.53 2.35e-17 PASS New: blis_dgemm_ukr_r 6 8 352 50.69 1.59e-17 PASS blis_dgemm_ukr_r 6 8 384 49.15 5.55e-17 PASS blis_dgemm_ukr_r 6 8 416 50.44 2.86e-17 PASS blis_dgemm_ukr_r 6 8 448 46.92 3.12e-17 PASS blis_dgemm_ukr_r 6 8 480 48.08 4.08e-17 PASS
…es for testsuite inaries. - RPATH entries (and DYLD_LIBRARY_PATH) do nothing on macOS unless the install_name of the library starts with @rpath/. While the install_name can be set to the absolute install path, this makes the installation non-relocatable. When using @path in the install_name, install paths within the normal DYLD_LIBRARY_PATH work with no changes on the user side, but for install paths off the beaten track, users must specify an RPATH entry when linking (or modify DYLD_LIBRARY_PATH at runtime). Perhaps this could be made into a configure-time option. - Having relocable testsuite binaries is not necessarily a priority but it is easy to do with @executable_path (macOS) or $ORIGIN (linux/BSD).
Details: - Reworked support for ARM hardware detection in bli_cpuid.c to parse the result of a CPUID-like instruction. - Added a64fx support to bli_gks.c. - #include arm64 and arm32 family headers from bli_arch_config.h. - Fix the ordering of the "armsve" and "a64fx" strings in the config_name string array in bli_arch.c. The ordering did not match the ordering of the corresponding arch_t values in bli_type_defs.h, as it should have all along. - Added clang support to make_defs.mk in arm64, cortexa53, cortexa57 subconfigs. - Updated arm64 and arm32 families in config_registry. - Updated docs/HardwareSupport.md to reflect added ARM support. - Thanks to Dave Love, RuQing Xu, and Devin Matthews for their contributions in this PR (flame#344).
Adds `--enable-rpath/--disable--rpath` (default disabled) to use an install_name starting with @rpath/. Otherwise, set the install_name to the absolute path of the install library, which was the previous behavior.
Move unused ARM SVE kernels to "old" directory.
bli_error: more cleanup on the error strings array
Add an option to use an @rpath-dependent install_name on macOS
[ci skip]
Previously, this was a global variable. Setting the value was synchronized via a mutex but reading the value was not. Of course, these accesses are almost certainly atomic, but there is still the possibility of one thread attempting to set the value and then reading the value set by another thread. For correct operation under user threading (e.g. pthreads), this should probably be thread-local with no mutex.
Fix a bug in bli_gemmsup_rd_armv8a_asm_d6x8m.c. For safety upon similar strategies in the future, change all [mn]_[iter/left] into signed ints.
Commenting out <sys/sysctl.h> due to possibly a Xcode bug.
This test will run on Linux, but all the kernels should run just fine. This does not test autodetection but then none of the other ARM tests do either.
Details: - Minor changes to test_libblis.c to support hpx.
Details: - Added build/detect/android/bionic.h header to test whether the __BIONIC__ cpp macro is defined. - In common.mk, only add -lrt to LDFLAGS when Bionic is not present. - CREDITS file update.
Details: - Fixed various typos in API documentation in docs/BLIS*API.md and comments in the source code examples within examples/?api/*.c.
Details: - Added reminders that #include "blis.h" must be added to source files in order to access BLIS API function prototypes. Thanks to Barry Smith for suggesting this improvement. - Fixed pre-existing typos. - CREDITS file update.
Details: - Previously, the timpl_t id being used when a thrcomm_t is being initialized was set within the bli_thrcomm_init() dispatch function after the timpl_t-specific bli_thrcomm_init_*() function returned. But it just occurred to me that each bli_thrcomm_init_*() function already intrinsically knows its own timpl_t value. This commit shifts the setting of the thrcomm_t.ti field into the corresponding bli_thrcomm_init_*() function for each timpl_t type (e.g. single, openmp, pthreads, hpx). - Removed long-deprecated code dating back nearly 10 years. - Whitespace changes - Comment updates.
Details: - Commit 2db31e0 (flame#755) inserted logic into common.mk that attempts to preprocess build/detect/android/bionic.h to determine whether the __BIONIC__ macro is defined (in which case -lrt should not be included in LDFLAGS). However, the path to bionic.h was encoded without regard to DIST_PATH, and so utilizing common.mk anywhere that isn't the top- level directory (such as in the testsuite directory) resulted in a compiler error: gcc: error: build/detect/android/bionic.h: No such file or directory gcc: fatal error: no input files compilation terminated. This commit adds a $(DIST_PATH) prefix to the path to bionic.h so that it can be located from other applications' Makefiles that use BLIS's makefile fragments.
Details: - Revamped bli_init_apis() and bli_finalize_apis() to use separate bli_pthread_switch_t objects for each of the five sub-API init functions, with the objects for the 'ind' and 'rntm' sub-APIs being declared with BLIS_THREAD_LOCAL. This allows some APIs to be treated as thread-local and the rest as thread-shared. Thanks to Edward Smyth for requesting application thread-specific rntm_t structs, which inspired these change. - Combined bli_thread_init_from_env() and bli_pack_init_from_env() into a new function, bli_rntm_init_rntm_from_env(), and placed the combined code in bli_rntm.c inside of a new bli_rntm_init() function. Then removed the (now empty) bli_pack_init() and _finalize() function defs. - Deprecated bli_rntm_init() for the purposes of initializing a rntm_t (temporarily preserving it as bli_rntm_clear() in a cpp-undefined code block) so that the function name could be used for the aforementioned bli_rntm_init() function. - Updated libblis_test_pobj_create() in test_libblis.c to use a static rntm_t initializer instead of the deprecated bli_rntm_init() function-based option. - Minor updates to docs/Multithreading.md, including removal of bli_rntm_init() in the example of how to initialize rntm_t structs. - Changed the return value of bli_gks_init(), bli_ind_init(), bli_memsys_init(), bli_thread_init(), and bli_rntm_init() (and their finalize() counterparts) from 'void' to 'int' so that those functions match the function type expected by bli_pthread_switch_on()/_off(). Those init/finalize functions now return 0 to indicate success, which is needed so that the switch actually changes state from off to on and vice versa. - Defined bli_thread_reset(), which copies the contents of the global_rntm_at_init() struct into the global_rntm struct (for the current application thread). - Guard calls to bli_pthread_mutex_lock()/_unlock() in - bli_pack_set_pack_a() and _pack_b() - bli_rntm_init_from_global() - bli_thread_set_ways() - bli_thread_set_num_threads() - bli_thread_set_thread_impl() - bli_thread_reset() - bli_l3_ind_oper_set_enable() with #ifdef BLIS_DISABLE_TLS (since TLS precludes the possibility of race conditions). - In frame/base/bli_rntm.c, declare global_rntm, global_rntm_at_init, and global_rntm_mutex as BLIS_THREAD_LOCAL so that separate application threads can change the number of ways of BLIS parallelism independently from one another. - Access global_rntm only via a new private (not exported) function, bli_global_rntm(). Defined a similar function for a rntm_t new to this commit, global_rntm_at_init, which preserves the state of the global rntm at initialization-time. - In frame/3/bli_l3_ind.c, added a guard to the declaration of the static variable oper_st_mutex with #ifdef BLIS_DISABLE_TLS so that the mutex is omitted altogether when TLS is enabled (which prevents the compiler from warning about an unused variable). - Removed redundant code from bli_thread.c: #ifdef BLIS_ENABLE_HPX #include "bli_thread_hpx.h" #endif since this code is already present in bli_thread.h. - Thanks to Minh Quan Ho for his review of and feedback on this commit. - Comment updates.
Details: - Replaced 404'd link in docs/Multithreading.md with an archive from The Wayback Machine. - CREDITS file update.
Details: - Fixed hpx::for_each invocation and replace with hpx::for_loop. The HPX runtime was initialized using hpx::start, but the hpx::for_each function was being called on a non-hpx runtime (i.e standard BLIS runtime - single main thread). To run hpx::for_each on HPX runtime correctly, the code now uses hpx::run_as_hpx_thread(func, args...). - Replaced hpx::for_each with hpx::for_loop, which eliminates use of hpx::util::counting_iterator. - Employ hpx::execution::chunk_size(1) to make sure that a thread resides on a particular core. - Replaced hpx::apply() with updated version hpx::post(). - Initialize tdata->id = 0 in libblis.c to 0, as it is the main thread and is needed for writing results to output file. - By default, if not specified, the HPX runtime uses all N threads/cores available in the system. But, if we want to only specify n_threads out N threads, we use hpx::execution::experimental::num_cores(n_threads).
Details: - Forward-ported 'altra' and 'altramax' subconfigurations from the older 'stable' branch lineage [1]. These subconfigs primarily target the Ampere Altra and AltraMax (ARM) processors. They also contain "QuickStart" directories with information and scripts to help use BLIS on these microarchitectures. Thanks to Jeff Diamond and Leick Robinson for developing these subconfigs and resources. - Updated kernels/armv8a/3/bli_gemm_armv8a_asm_d6x8.c according to changes in the 'stable' lineage, mostly related to re-enabling of assembly code branches that target general stride IO. [1] Note that the 'stable' branch is being used to make sure that more recent commits do not introduce unreasonable performance regressions. As such, the name should be interpreted as shorthand for "performance stable," not "API stable."
Details: - Expanded existing BLAS compatibility APIs to provide interfaces to [cz]symv_(), [cz]syr_(). This was easy since those operations were already implemented natively in BLIS; the APIs were previously omitted only because they were not formally part of the BLAS. - Implemented [cz]rot_() by feeding code from LAPACK 3.11 through f2c. - Thanks to James Foster for pointing out that LAPACK contains these additional symbols, which prompted these additions, as well as for testing the [cz]rot_() functions from Julia's test infrastructure. - CREDITS file update.
Details: - Previously, disabling the sba via --disable-sba-pools resulted in a segfault due to a sanity-check-triggering abort(). The problem was that the sba, as currently used in the l3 thread decorators, did not yet (fully) support pools being disabled. The solution entailed creating wrapper function, bli_sba_array_elem(), which either calls bli_apool_array_elem() (when sba pools are enabled at configure time) or returns a NULL sba_pool pointer (when sba pools are disabled), and calling bli_sba_array_elem() in place of bli_apool_array_elem(). Note that the NULL pointer returned by bli_sba_array_elem() when the sba pools are disabled does no harm since in that situation the pointer goes unreferenced when acquiring and releasing small blocks. Thanks to John Mather for reporting this bug. - Guarded the bodies of bli_sba_init() and bli_sba_finalize() with #ifdef BLIS_ENABLE_SBA_POOLS. I don't think this was actually necessary to fix the aforementioned bug, but it seems like good practice. - Moved the code in bli_l3_thrinfo_create() that checked that the array* pointer is non-NULL before calling bli_sba_array_elem() (previously bli_apool_array_elem()) into the definition of bli_sba_array_elem(). - Renamed various instances of 'pool' variables and function parameters to 'sba_pool' to emphasize what kind of pool it represents. - Whitespace changes.
Details: - Parse $(CC_VENDOR) values of "nvc" in 'zen3' make_defs.mk file. - Minor refactor to accommodate above edit. - CREDITS file update.
Details: - Fixed a bug that resulted in BLIS non-deterministically calling the gemmsup handler, irrespective of the thresholds that are registered via bli_cntx_set_blkszs(). - Deep dive: In bli_cntx_init_ref.c, the default values for the gemmsup thresholds (BLIS_[MNK]T blocksizes) wre being set to zero so that no operation ever matched the criteria for gemmsup (unless specific sup thresholds are registered). HOWEVER, these thresholds are set via bli_cntx_set_blkszs() which calls bli_blksz_copy_if_pos(), which was only coping the thresholds into the gks' cntx_t if the values were strictly positive. Thus, the zero values passed into bli_cntx_set_blkszs() were being ignored and those threshold slots within the gks were left uninitialized. The upshot of this is that the reference gemmsup handler was being called for gemm problems essentially at random (and as it turns out, very rarely the reference gemmsup implementation would encounter a divide-by-zero error). - The problem was fixed by changing bli_blksz_copy_if_pos() so that it copies values that are non-negative (values >= 0 instead of > 0). The function was also renamed to bli_blksz_copy_if_nonneg() - Also needed to standardize use of -1 as the sole value to embed into blksz_t structs as a signal to bli_cntx_set_blkszs() to *not* register a value for that slot (and instead let whatever existing values remain). This required updates to the bli_cntx_init_*() functions for bgq, cortexa9, knc, penryn, power7, and template subconfigs, as some of these codes were using 0 instead of -1. - Fixes flame#781. Thanks to Devin Matthews for identifying, diagnosing, and proposing a fix for this issue.
Details: - Fixed hpx barrier synchronization. HPX was hanging on larger cores because blis was using non-hpx synchronization primitives. But when using hpx-runtime only hpx-synchronization primitives should be used. Hence, a C style wrapper hpx_barrier_t is introduced to perform hpx barrier operations. - Replaced hpx::for_loop with hpx::futures. Using hpx::for_loop with hpx::barrier on n_threads greater than actual hardware thread count causes synchronization issues making hpx hanging. This can be avoided by using hpx::futures, which are relatively very lightweight, robust and scalable.
Details: - Added a new 'sifive_x280' subconfiguration for SiFive's x280 RISC-V instruction set architecture. The subconfig registers kernels from a correspondingly new kernel set, also named 'sifive_x280'. - Added the aforementioned kernel set, which includes intrinsics- and assembly-based implementations of most level-1v kernels along with level-1f kernels axpy2v dotaxpyv, packm kernels, and level-3 gemm, gemmtrsm_l, and gemmtrsm_u microkernels (plus supporting files). - Registered the 'sifive_x280' subconfig as belonging to a singleton family by the same name. - Added an entry to '.travis.yml' to test the new subconfig via qemu. - Updates to 'travis/do_riscv.sh' script to support the 'sifive_x280' subconfig and to reflect updated tarball names. - Special thanks to Lee Killough, Devin Matthews, and Angelika Schwarz for their engagement on this commit.
Details: - Install one-line headers to INCDIR whose entire purpose is to #include the actual headers within the local 'blis' header directory so that applications can #include "blis.h" instead of #include <blis/blis.h> (and/or "cblas.h" instead of <blis/cblas.h> if CBLAS is enabled) when headers are installed to global paths. (Note that INCDIR is the installation prefix for headers as specified by '--includedir=INCDIR', which defaults to 'PREFIX/include' if not specified.) Not sure how this problem went unreported for so long, since presumably any user trying to #include "blis.h" from a global installation would have encountered a compiler error. - The one-line blis.h and cblas.h headers now reside in the 'build' directory, ready to install as is. - Thanks to to Jed Brown for reporting this via Issue flame#786, and for Devin Matthews and Mo Zhou for their engagement. - Harmonized the rule in the top-level Makefile for installing blis.pc into SHAREDIR/pkgconfig with conventions for others vis-a-vis verbosity/non-verbosity. - (cherry picked from commit 141a6c9)
Details: - Fixed a segfault in the non-gemm test drivers in test/3 that was the result of sometimes leaving either .n_str or .k_str fields of the params_t struct uninitialized, depending on the operation in question. For example, in test_hemm.c, init_def_params() would only initialize the .m_str and .n_str fields, but not the .k_str field. Even though hemm doesn't use a 'k' dimension, the proc_params() function (called via parse_cl_params()) universally attempts to convert all three into integers via sscanf(), which was understandably failing when one of those strings was a NULL pointer. I'm not sure how this code ever worked to begin with. Special thanks to Leick Robinson for finding and reporting this bug. - (cherry picked from commit 1236dda)
Details: - Previously, in cblas.h, bli_config.h was being #included *after* bli_system.h, which meant that the BLIS_ENABLE_SYSTEM macro was never defined in time for proper OS detection. This bug only affected cblas.h -- blis.h had been correctly #including bli_config.h before bli_system.h since fb93d24. Thanks to Edward Smyth for reporting this bug and suggesting the fix. - (cherry picked from commit a72e456)
Details: - Previously, the standalone performance drivers in test/3 were written under the assumption that the user would want to explicitly test either native execution *or* 1m. But because the accompanying runme.sh script defaults to passing "native" in for the -i command line option (which explicitly sets the induced method type), running the script without modification causes the test drivers to use slow reference microkernels on systems where native complex-domain microkernels are not registered -- which will yield poor performance for complex-domain level-3 operations. Furthermore, even if a user was aware of this, the test drivers did not support any single value for the -i option that would test BLIS using the library's default behavior -- that is, using 1m on systems where it is needed and native execution on systems that have native microkernels implemented and registered. - This commit addresses the aforementioned issue by supporting a new value for the -i option: "auto". The "auto" value causes the driver to avoid explicitly setting the induced method altogether, leaving BLIS's default behavior in place. This "auto" option is also now the default setting within the runme.sh script. Thanks to Leick Robinson for finding and reporting this issue. - Also added support for "nat" as a shorthand for "native", which the help text already (erroneously) claimed was supported. - (cherry picked from commit fd1a7e3)
Details: - Request default induced method behavior of BLIS via "-i auto" when running the standalone performance drivers in test/3 via the runme.sh script present in that directory. (Previously, the runme.sh script would use "-i native" by default.) This change was originally intended for fd1a7e3. - (cherry picked from commit cad5149)
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
No description provided.