Open
Description
#113 reminded me about this thing... The target instruction set or architecture isn't specified now, so AVX intrinsics don't work in C++ and C:
error: always_inline function '_mm256_add_pd' requires target feature 'avx', but would be inlined into function 'add' that is compiled without support for 'avx'
I think it would be reasonable to use -march=native, which allows using available intrinsics and optimization for the current CPU architecture as well.
kazk commented on Apr 29, 2021
I can add -march=x86-64 -mtune=generic. -march=native will be too specific to the host VM, and I don't want that.
See https://stackoverflow.com/a/54163496 and https://stackoverflow.com/a/10134204
error256 commented on Apr 29, 2021
-march=x86-64 -mtune=generic doesn't really change anything here, I think it's the default.
OK, I've solved my original problem with intrinsics with __attribute__((__target__("avx"))), so it doesn't matter that much now, but anyway...
Solutions are compiled to be executed just once on the same type of system. So what exactly is the issue here? Is it that there can possibly be significantly different CPUs, so that the difference in performance with native will be greater than without it? Technically, JIT compilers produce native code (at least by default, I think), so if this is a problem, I think it's already there with certain other languages, and that's why C# or Java code can sometimes, at least in theory, be faster than C. Or am I missing something else here?
kazk commented on Apr 29, 2021
If -march=x86-64 doesn't do anything, forget it. I thought it still enabled what you wanted.
No, I wasn't worried about performance much. I'll try explaining my thoughts based on my limited knowledge around this area. Please bear with me and let me know if I'm misunderstanding.
Assumptions:
-march=native adds compiler options very specific to the CPU
-march=x86-64 covers what you want
I guess my third assumption was incorrect?
Concern: Having code that depends on the current host VM CPU type. I don't change VM types often, but it's another variable that needs to be kept in mind if it was depended on, so I wanted to avoid it if possible (remember, I thought -march=x86-64 covered what you wanted). I also thought this might make testing the change against published kata more difficult.
Just for an example, on Google Cloud, there are VM machine types that trade cost for less control, for example E2, which doesn't guarantee the processor type. If we used this machine type with -march=native, submissions could be executed on different processor types (any of Intel Skylake, Broadwell, Haswell, and AMD EPYC Rome processors). Can this cause the same code to fail to compile depending on the VM it happened to be compiled on? Don't worry about performance differences. (I'm not planning to do this.)
I guess my point is that -march=native expands to many compiler options that are difficult to keep track of.
-march=native expanded to:
Instead of -march=native, is it possible to add some flags explicitly to get what you want?
kazk commented on Apr 29, 2021
To be clear, if you think -march=native is a safe default for our use case, I'm not against it. I trust you more on this.
error256 commented on Apr 30, 2021
I've never used it for anything serious, I've never even used C++ for anything serious, so don't just trust me.
Yes, code compiled with -march=native is supposed to be executed on CPUs with the same architecture as where it's compiled. I didn't know that it added the cache options, so it looks like it's supposed to be the same system, but I don't see how they can affect anything apart from optimization.
Partially... Now that I've found the __target__ function attribute, it's not too important. For the whole program, instruction sets can be enabled separately, but all questions of compatibility will be mostly the same as with march; if there's a known lower bound of CPUs, instruction sets can be enabled like -mavx2 or -mavx.
Yes, if it uses functions that use instruction sets that may or may not be available, but...
Things that work or don't work depending on the CPU model are already possible: even without the __target__ attribute there's inline asm, it's just much less convenient, so there isn't much that would change here.
uniapi commented on Jul 12, 2021
There should be no worry about adding the flags -mavx -mavx2 if you use a machine no older than 2015 Q3 for AMD and Intel.
Because adding -march=native may not succeed in enabling AVX and AVX2 support, and in this case the compilation stops with errors like these:
Just add: -mavx -mavx2
error256 commented on Jul 12, 2021
@uniapi I don't see how any other -m* can be safer than native. native should always match the current platform, so it should be the safest of all -ms as long as the program is compiled for one-time usage on the same machine.
What circumstances are you talking about? It will only fail when there really is no AVX/AVX2 support.
Do I misunderstand something?
uniapi commented on Jul 12, 2021
@error256
Ok! One of my machines is skylake and gcc is 11.1.0.
Sure you do know skylake has support for AVX!
So when I compile with the flag -march=native I get the following errors:
But everything is ok when I compile with: -mavx -mavx2
Yes, -march=native is completely safe, but it may not succeed in enabling AVX/AVX2, as I've just explained above.
But it seems that Intel Pentium Dual-Core E6500 CPU does support AVX.
error256 commented on Jul 12, 2021
@uniapi First, it looks like you're using gcc. But the behavior should be the same: https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
So I don't know how you got that error. Are you using a VM? Does lscpu report avx in the flags? What march does gcc -v -E - -march=native </dev/null 2>&1 | grep march show?
uniapi commented on Jul 12, 2021
@error256
This is not true!
Here is the option for -march=native on x86-64 skylake:
And this one for -march=skylake on the same platform:
So native is 10 lines shorter because not all instructions are enabled!
Notice that avx2 is not enabled with native but is turned on with skylake.
That is, the quoted statement is not true!
uniapi commented on Jul 12, 2021
The OS should have support for AVX enabled.
And that is the same reason why gdb does not show ymm registers while debugging, though you may successfully run AVX code like me!
error256 commented on Jul 12, 2021
Interesting observation, but it doesn't change the fact that, according to the documentation, native should enable all supported instruction sets, so it's probably a bug. I don't know your OS, but I've found this bug for gcc in macOS.
But that's gcc again. What about clang?
uniapi commented on Jul 12, 2021
It's not a bug!
The same thing with clang: 'native' throws an error and avx2 does not!
And the gdb version is 10.2, but it does not recognize ymm registers to print.
It's because the OS support is not enabled for AVX2.
My OS is SunOS 11.3 or x86_64-pc-solaris2.11.
And here is the output you asked for from gcc -v -E - -march=native </dev/null 2>&1 | grep march:
error256 commented on Jul 12, 2021
It means the behavior of -march=native is actually correct in this case and the actual problem is that "the OS support is not enabled for AVX2", which would need to be dealt with first in any serious scenario.
uniapi commented on Jul 12, 2021
Exactly!
So the first option is to check the OS support and add -march=native if enabled.
And the second option, if the OS support is disabled, is to check the CPU info (probably hardcoded) for AVX2 presence and add -mavx -mavx2 if present.
uniapi commented on Jul 12, 2021
Here was the hack for old ubuntu and i try to find the similar hack for my os.
Urfoex commented on Jul 13, 2021
Interesting topic and interesting find.
But, isn't this for a coding challenge website?
I wouldn't make it all science and hacky to get things running in weird scenarios.
It should work in the general case without a huge support and maintenance cost.
And pointing out what @kazk said at:
#118 (comment)
This thing runs in the cloud without a fixed CPU. If you go too specific, special and hacky, your code will work or not from one run to the next.
Phoronix sometimes does benchmarks on compiler optimizations:
https://www.phoronix.com/scan.php?page=article&item=gcc-10900k-compiler&num=1
https://www.phoronix.com/scan.php?page=article&item=clang-12-opt&num=1
https://www.phoronix.com/scan.php?page=article&item=amd-znver3-gcc11&num=1
https://www.phoronix.com/scan.php?page=article&item=gcc10-gcc11-5950x&num=1
https://www.phoronix.com/scan.php?page=article&item=clang-12-5950x&num=1
C++ is a beast (not just) when it comes to optimizations.
Without changing code, but the right compiler - flag - CPU combination, it can run way faster - or way slower. (Not to mention Profile-Guided-Optimizations.)
It is already annoying when challenge code passes or fails because of time constraint and being on a faster or slower machine.
But having it pass or fail because it does or doesn't support some CPU feature - even more annoying, I'd say.
uniapi commented on Jul 13, 2021
@Urfoex I just shared the link for those who may have similar problems...
But the Codewars solution is much easier, as I've described in the two scenarios.
error256 commented on Jul 13, 2021
My suggestion is about enabling whatever the compiler thinks is available for the current system, not specifically about AVX/AVX2, which was supposed to be enabled automatically by that for any modern CPU. But it turns out it isn't necessarily.
Does it even happen on Linux? (=> Is this observation even important in this context?) What exactly is the OS support for AVX2, and why can it even be turned off? Surely there must be a valid reason...
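As a rough illustration of what "OS support" means in this discussion: on x86 the CPU can advertise AVX through CPUID while the OS has not enabled saving the YMM register state, and that second part shows up in the OSXSAVE bit and the XCR0 register. The sketch below separates the two, assuming GCC or Clang on x86-64. The names avx_status, read_xcr0 and query_avx are made up for the example, and nothing here is taken from the Codewars runner.

```c
#include <cpuid.h>
#include <stdint.h>

/* Hypothetical result type: hardware capability vs. OS enablement. */
typedef struct {
    int cpu_has_avx;    /* CPUID.1:ECX bit 28 */
    int os_enabled_avx; /* OSXSAVE set and XCR0 bits 1 and 2 set */
} avx_status;

/* Read XCR0 with XGETBV. Only valid when the OSXSAVE bit is set. */
static uint64_t read_xcr0(void) {
    uint32_t lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((uint64_t)hi << 32) | lo;
}

avx_status query_avx(void) {
    avx_status s = {0, 0};
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 if the requested leaf is not supported. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return s;

    s.cpu_has_avx = (ecx >> 28) & 1;
    int osxsave = (ecx >> 27) & 1;

    /* XCR0 bit 1 covers SSE state, bit 2 covers the upper YMM halves. */
    if (s.cpu_has_avx && osxsave)
        s.os_enabled_avx = (read_xcr0() & 0x6) == 0x6;

    return s;
}
```

If cpu_has_avx is 1 but os_enabled_avx is 0, that would match the situation described above, where the CPU model has AVX but the compiler's native detection and gdb both behave as if it does not.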
uniapi commented on Jul 13, 2021
Also this scenario is observed when running on Virtual Machines even though you have a cool CPU.
But I have the same opinion that all instructions should be enabled, or at least AVX/AVX2.
So then (according to you) the best scenario is to parse the target architecture and inject it into the compiler -march flag.
If on skylake then -march=skylake
If on skylake-avx512 then -march=skylake-avx512
If on pentium4m then -march=pentium4m
and so on...
And it seems this is the best that we could have from the running CPU!
error256 commented on Jul 13, 2021
OS, VM, whatever, the question still holds. Why doesn't the virtual machine report AVX2? Is there a reason? Is it an option?
uniapi commented on Jul 16, 2021
@error256 I'm not 100% sure about that. I do know that you won't be able to use XSAVE if using Hyper-V, and I also know that it's possible (at least it seems possible) to turn on AVX2 support with VBoxManage.
So I still do not have the answer! Probably the developers of those operating systems should know))
And yet! How did you succeed in using AVX2 intrinsics on Codewars? What constructions or attributes did you use?
error256 commented on Jul 16, 2021
#118 (comment)
https://www.codewars.com/kumite/608aea3fa35c7c003251bb42?sel=608b0afb4865f700290a610f
(AVX, AVX2 - no difference here.)
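For readers who can't open the kumite, here is a minimal sketch of the per-function approach mentioned earlier in the thread (the __target__ attribute), assuming GCC or Clang on x86-64. It is not the code from the link: add4_avx, add4_scalar and add4 are hypothetical names, and the runtime check with __builtin_cpu_supports is an addition for safety rather than something the thread requires.

```c
#include <immintrin.h>

/* AVX is enabled for this one function only, so the file still compiles
   without -mavx or -march=native on the command line. */
__attribute__((__target__("avx")))
static void add4_avx(const double *a, const double *b, double *out) {
    __m256d va = _mm256_loadu_pd(a);
    __m256d vb = _mm256_loadu_pd(b);
    _mm256_storeu_pd(out, _mm256_add_pd(va, vb));
}

/* Plain fallback for CPUs (or VMs) that do not expose AVX. */
static void add4_scalar(const double *a, const double *b, double *out) {
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
}

/* Adds four doubles, picking the AVX path only when it is usable. */
void add4(const double *a, const double *b, double *out) {
    if (__builtin_cpu_supports("avx"))
        add4_avx(a, b, out);
    else
        add4_scalar(a, b, out);
}
```

Calling add4_avx directly without the check also compiles, but then the program depends on the CPU it happens to run on, which is the concern raised above about VMs without a fixed processor type.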
uniapi commented on Sep 13, 2021
@error256 thanks for your link!
And yeah! You are right! It's not appropriate to use -march=avx2 because it's C and it should be portable!
So if running on arm we could #include <arm_neon.h>.
So vote for adding -march=native
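A small sketch of what that portability could look like in C, assuming the compiler predefines __AVX__ or __ARM_NEON when the matching instruction set is enabled (for example through -march=native on a host that has it). The function name add_f32 is hypothetical, and the code falls back to a plain loop when neither macro is defined.

```c
#include <stddef.h>

#if defined(__AVX__)
#include <immintrin.h>
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Element-wise out[i] = a[i] + b[i], vectorized when the target allows it. */
void add_f32(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;
#if defined(__AVX__)
    /* 8 floats per iteration with 256-bit AVX registers. */
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
#elif defined(__ARM_NEON)
    /* 4 floats per iteration with 128-bit NEON registers. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar tail, and the whole loop when no SIMD path was compiled in. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

With this layout the same source builds with or without -march=native; the flag only decides which branch the preprocessor keeps.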