What's the difference between __builtin_popcountll and _mm_popcnt_u64?


I was trying to count how many 1 bits there are in 512 MB of memory, and I found two possible methods: _mm_popcnt_u64() and __builtin_popcountll() from the GCC builtins.
_mm_popcnt_u64() is said to use a CPU instruction introduced with SSE4.2, which seems to be the fastest option, and __builtin_popcountll() is expected to use a table lookup.
So I thought __builtin_popcountll() should be a little slower than _mm_popcnt_u64().
However, when I measured them, the two methods took almost the same time. I strongly suspect that they work the same way.
I also found this in popcntintrin.h:
/* Calculate a number of bits set to 1. */
extern __inline int __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_popcnt_u32 (unsigned int __X)
{
return __builtin_popcount (__X);
}
#ifdef __x86_64__
extern __inline long long __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_popcnt_u64 (unsigned long long __X)
{
return __builtin_popcountll (__X);
}
#endif
So I'm confused: how on earth does __builtin_popcountll() work?

_mm_popcnt_u64 is part of <nmmintrin.h>, a header devised by Intel for utility functions for accessing SSE 4.2 instructions.
__builtin_popcountll is a GCC extension.
_mm_popcnt_u64 is portable to non-GNU compilers, and __builtin_popcountll is portable to non-SSE-4.2 CPUs. But on systems where both are available, both should compile to the exact same code.

If you compile without a -march flag (i.e. for the x86_64 default), the builtin should be slower, because the compiler cannot assume that the popcnt instruction is available. GCC then emits a call to a generic library routine instead, so nothing is inlined and every count costs a function call.

Related

How to tell gcc to not align function parameters on the stack?

I am trying to decompile an executable for the 68000 processor into C code, replacing the original subroutines with C functions one by one.
The problem I faced is that I don't know how to make gcc use the calling convention that matches the one used in the original program. I need the parameters on the stack to be packed, not aligned.
Let's say we have the following function
int fun(char arg1, short arg2, int arg3) {
return arg1 + arg2 + arg3;
}
If we compile it with
gcc -m68000 -Os -fomit-frame-pointer -S source.c
we get the following output
fun:
move.b 7(%sp),%d0
ext.w %d0
move.w 10(%sp),%a0
lea (%a0,%d0.w),%a0
move.l %a0,%d0
add.l 12(%sp),%d0
rts
As we can see, the compiler assumed that parameters have addresses 7(%sp), 10(%sp) and 12(%sp):
but to work with the original program they need to have addresses 4(%sp), 5(%sp) and 7(%sp):
One possible solution is to write the function in the following way (the processor is big-endian):
int fun(int bytes4to7, int bytes8to11) {
char arg1 = bytes4to7>>24;
short arg2 = (bytes4to7>>8)&0xffff;
int arg3 = ((bytes4to7&0xff)<<24) | (bytes8to11>>8);
return arg1 + arg2 + arg3;
}
However, the code looks messy, and I was wondering: is there a way to both keep the code clean and achieve the desired result?
UPD: I made a mistake. The offsets I'm looking for are actually 5(%sp), 6(%sp) and 8(%sp) (the char-s should be aligned with the short-s, but the short-s and the int-s are still packed):
Hopefully, this doesn't change the essence of the question.
UPD 2: It turns out that the 68000 C Compiler by Sierra Systems gives the described offsets (as in UPD, with 2-byte alignment).
However, the question is about tweaking calling conventions in gcc (or perhaps another modern compiler).

Here's a way with a packed struct. I compiled it on an x86 with -m32 and got the desired offsets in the disassembly, so I think it should still work for an mc68000:
typedef struct {
char arg1;
short arg2;
int arg3;
} __attribute__((__packed__)) fun_t;
int
fun(fun_t fun)
{
return fun.arg1 + fun.arg2 + fun.arg3;
}
But, I think there's probably a still cleaner way. It would require knowing more about the other code that generates such a calling sequence. Do you have the source code for it?
Does the other code have to remain in asm? With the source, you could adjust the offsets in the asm code to be compatible with modern C ABI calling conventions.
I've been programming in C since 1981 and spent years doing mc68000 C and assembler code (for apps, kernel, device drivers), so I'm somewhat familiar with the problem space.

This is not a gcc 'fault': the 68k architecture requires the stack to always be aligned on 2 bytes.
So there is simply no way to break 2-byte alignment on the hardware stack.
You wrote: "but to work with the original program they need to have addresses 4(%sp), 5(%sp) and 7(%sp)".
Accessing word or long values at an odd memory address immediately triggers an address error exception on the 68000.

To get integral parameters passed using 2 byte alignment instead of 4 byte alignment, you can change the default int size to be 16 bit by -mshort. You need to replace all int in your code by long (if you want them to be 32 bit wide). The crude way to do that is to also pass -Dint=long to your compiler. Obviously, you will break ABI compatibility to object files compiled with -mno-short (which appears to be the default for gcc).

Initializing array in C - execution time

int a[5] = {0};
VS
typedef struct
{
int a[5];
} ArrStruct;
ArrStruct arrStruct;
sizeA = sizeof(arrStruct.a)/sizeof(int);
for (it = 0 ; it < sizeA ; ++it)
arrStruct.a[it] = 0;
Does initializing with a for loop take more execution time? If so, why?

It depends upon the compiler and the optimization flags.
On recent GCC (e.g. 4.8 or 4.9) with gcc -O3 (or probably even -O1 or -O2) it should not matter, since the same code would be emitted (GCC even has an optimization that transforms your loop into a __builtin_memset call, which is then optimized further).
On some compilers, int a[5] = {0}; might be faster, because the compiler might emit e.g. vector instructions (or on x86 a rep stos) to clear the array.
The best thing is to examine the generated assembler code (and the GIMPLE representation), e.g. with gcc -fdump-tree-gimple -O3 -fverbose-asm -mtune=native -S, and to benchmark. In most cases it does not matter. Be sure to enable optimizations when compiling.
Generally, don't worry about such micro-optimizations; a good optimizing compiler does a better job than you have time to code by hand.

It depends on the scope of the variables. For a static or global variable, the first initialization
int a[5]={0};
may be done at compile time, while the loop is run at, well, run time. Thus there is no "execution" associated with the former.
You may find the discussion of this question (and in particular this answer ) interesting.

Assembly label address incorrect on 32-bit processors

I have some simple code that finds the difference between two assembly labels:
#include <stdio.h>
static void foo(void){
__asm__ __volatile__("_foo_start:");
printf("Hello, world.\n");
__asm__ __volatile__("_foo_end:");
}
int main(void){
extern const char foo_start[], foo_end[];
printf("foo_start: %p, foo_end: %p\n", foo_start, foo_end);
printf("Difference = 0x%tx.\n", foo_end - foo_start);
foo();
return 0;
}
Now, this code works perfectly on 64-bit processors, just like you would expect it to. However, on 32-bit processors, the address of foo_start is the same as foo_end.
I'm sure it has to do with 32 to 64 bit. On i386, it results in 0x0, and x86_64 results in 0x7. On ARMv7 (32 bit), it results in 0x0, while on ARM64, it results in 0xC. (the 64-bit results are correct, I checked them with a disassembler)
I'm using Clang+LLVM to compile.
I'm wondering if it has to do with non-lazy pointers. In the assembly output of both 32-bit processor archs mentioned above, they have something like this at the end:
L_foo_end$non_lazy_ptr:
.indirect_symbol _foo_end
.long 0
L_foo_start$non_lazy_ptr:
.indirect_symbol _foo_start
.long 0
However, this is not present in the assembly output of both x86_64 and ARM64. I messed with removing the non-lazy pointers and addressing the labels directly yesterday, but to no avail. Any ideas on why this happens?
EDIT:
It appears that when compiled for 32 bit processors, foo_start[] and foo_end[] point to main. I....I'm so confused.

I didn't check this on real code, but I suspect you are a victim of instruction reordering. As long as you do not define proper memory barriers, the compiler is free to move your code around within the function as it sees fit, since there is no dependency between the labels and the printf() call.
Try adding ::: "memory" to your asm statements which should nail them where you wrote them.

I finally found the solution (or, alternative, I suppose). Apparently, the && operator can be used to get the address of C labels, removing the need for me to use inline assembly at all. I don't think it's in the C standard, but it looks like Clang supports it, and I've heard GCC does too.
#include <stdio.h>
int main(void){
foo_start:
printf("Hello, world.\n");
foo_end:
printf("Foo has ended.");
void* foo_start_ptr = &&foo_start;
void* foo_end_ptr = &&foo_end;
printf("foo_start: %p, foo_end: %p\n", foo_start_ptr, foo_end_ptr);
printf("Difference: 0x%tx\n", (char *)foo_end_ptr - (char *)foo_start_ptr);
return 0;
}
Now, this only works if the labels are in the same function, but for what I intend to use this for, it's perfect. No more ASM, and it doesn't leave a symbol behind. It appears to work just how I need it to. (Not tested on ARM64)

Detecting Endianness

I'm currently trying to create a C source code which properly handles I/O whatever the endianness of the target system.
I've selected "little endian" as my I/O convention, which means that, for big endian CPU, I need to convert data while writing or reading.
Conversion is not the issue. The problem I face is to detect endianness, preferably at compile time (since CPU do not change endianness in the middle of execution...).
Up to now, I've been using this :
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
...
#else
...
#endif
It's documented as a GCC pre-defined macro, and Visual seems to understand it too.
However, I've received a report that the check fails on some big-endian systems (PowerPC).
So I'm looking for a foolproof solution that ensures endianness is correctly detected, whatever the compiler and the target system. Well, most of them at least...
[Edit] : Most of the solutions proposed rely on "run-time tests". These tests may sometimes be properly evaluated by compilers during compilation, and therefore cost no real runtime performance.
However, branching with some kind of << if (0) { ... } else { ... } >> is not enough. In the current code implementation, variable and functions declaration depend on big_endian detection. These cannot be changed with an if statement.
Well, obviously, there is fall back plan, which is to rewrite the code...
I would prefer to avoid that, but, well, it looks like a diminishing hope...
[Edit 2] : I have tested "run-time tests", by deeply modifying the code. Although they do their job correctly, these tests also impact performance.
I was expecting that, since the tests have a predictable outcome, the compiler could eliminate the dead branches. But unfortunately, it doesn't work all the time. MSVC is a good compiler and is successful in eliminating dead branches, but GCC has mixed results, depending on the version and the kind of test, and with greater impact on 64-bit than on 32-bit targets.
It's strange. And it also means that the run-time tests cannot be ensured to be dealt with by the compiler.
Edit 3 : These days, I'm using a compile-time constant union, expecting the compiler to solve it to a clear yes/no signal.
And it works pretty well :
https://godbolt.org/g/DAafKo

As stated earlier, the only "real" way to detect Big Endian is to use runtime tests.
However, sometimes, a macro might be preferred.
Unfortunately, I've not found a single "test" to detect this situation, rather a collection of them.
For example, GCC recommends: __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__. However, this only works with recent versions; with earlier versions (and other compilers) the test wrongly evaluates to true, since both undefined macros are replaced by 0 in an #if expression and 0 == 0. So you need the more complete version: defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
OK, now this works for newest GCC, but what about other compilers ?
You may try __BIG_ENDIAN__ or __BIG_ENDIAN or _BIG_ENDIAN which are often defined on big endian compilers.
This will improve detection. But if you specifically target PowerPC platforms, you can add a few more tests to improve even more detection. Try _ARCH_PPC or __PPC__ or __PPC or PPC or __powerpc__ or __powerpc or even powerpc. Bind all these defines together, and you have a pretty fair chance to detect big endian systems, and powerpc in particular, whatever the compiler and its version.
So, to summarize, there is no such thing as a "standard pre-defined macros" which guarantees to detect big-endian CPU on all platforms and compilers, but there are many such pre-defined macros which, collectively, give a high probability of correctly detecting big endian under most circumstances.

At compile time in C you can't do much more than trusting preprocessor #defines, and there are no standard solutions because the C standard isn't concerned with endianness.
Still, you could add an assertion that is done at runtime at the start of the program to make sure that the assumption done when compiling was true:
static inline int IsBigEndian(void)
{
    int i = 1;
    /* The first byte is 0 on a big-endian machine, 1 on a little-endian one. */
    return !*((char *)&i);
}
/* ... */
#if defined(COMPILED_FOR_BIG_ENDIAN)
assert(IsBigEndian());
#elif defined(COMPILED_FOR_LITTLE_ENDIAN)
assert(!IsBigEndian());
#else
#error "No endianness macro defined"
#endif
(where COMPILED_FOR_BIG_ENDIAN and COMPILED_FOR_LITTLE_ENDIAN are macros #defined previously according to your preprocessor endianness checks)

Instead of looking for a compile-time check, why not just use big-endian order (which is considered the "network order" by many) and use the htons/htonl/ntohs/ntohl functions provided by most UNIX-systems and Windows. They're already defined to do the job you're trying to do. Why reinvent the wheel?

Try something like:
if(*(char *)(int[]){1}) {
/* little endian code */
} else {
/* big endian code */
}
and see if your compiler resolves it at compile-time. If not, you might have better luck doing the same with a union. Actually I like defining macros using unions that evaluate to 0,1 or 1,0 (respectively) so that I can just do things like accessing buf[HI] and buf[LO].

Notwithstanding compiler-defined macros, I don't think there's a compile-time way to detect this, since determining the endianness of an architecture involves analyzing the manner in which it stores data in memory.
Here's a function which does just that:
bool IsLittleEndian () {
int i=1;
return (int)*((unsigned char *)&i)==1;
}

As others have pointed out, there isn't a portable way to check for endianness at compile time. However, one option would be to use the autoconf tool as part of your build script: its AC_C_BIGENDIAN macro detects whether the system is big-endian or little-endian and defines WORDS_BIGENDIAN on big-endian targets. In a sense, this builds and runs a test program at configure time, and the result can then be used statically by the main source code.
Hope this helps!

This comes from p. 45 of Pointers in C:
#include <stdio.h>
#define BIG_ENDIAN 0
#define LITTLE_ENDIAN 1
int endian()
{
short int word = 0x0001;
char *byte = (char *) &word;
return (byte[0] ? LITTLE_ENDIAN : BIG_ENDIAN);
}
int main(int argc, char* argv[])
{
int value;
value = endian();
if (value == 1)
printf("The machine is Little Endian\n");
else
printf("The machine is Big Endian\n");
return 0;
}

The sockets API's ntohl function can be used for this purpose.
// Soner
#include <stdio.h>
#include <arpa/inet.h>
int main() {
if (ntohl(0x12345678) == 0x12345678) {
printf("big-endian\n");
} else if (ntohl(0x12345678) == 0x78563412) {
printf("little-endian\n");
} else {
printf("(stupid)-middle-endian\n");
}
return 0;
}

My GCC version is 9.3.0, it's configured to support powerpc64 platform, and I've tested it and verified that it supports the following macros logic:
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
......
#endif
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
.....
#endif

As of C++20, no more hacks or compiler extensions are necessary.
https://en.cppreference.com/w/cpp/types/endian
std::endian (Defined in header <bit>)
enum class endian
{
little = /*implementation-defined*/,
big = /*implementation-defined*/,
native = /*implementation-defined*/
};
If all scalar types are little-endian, std::endian::native equals std::endian::little
If all scalar types are big-endian, std::endian::native equals std::endian::big

You can't detect it at compile time to be portable across all compilers. Maybe you can change the code to do it at run-time - this is achievable.

It is not possible to detect endianness portably in C with preprocessor directives.

I took the liberty of reformatting the quoted text
As of 2017-07-18, I use union { unsigned u; unsigned char c[4]; }
If sizeof (unsigned) != 4 your test may fail.
It may be better to use
union { unsigned u; unsigned char c[sizeof (unsigned)]; }

As most have mentioned, compile time is your best bet. Assuming you do not do cross compilations and you use cmake (it will also work with other tools such as a configure script, of course) then you can use a pre-test which is a compiled .c or .cpp file and that gives you the actual verified endianness of the processor you're running on.
With cmake you use the TestBigEndian module and its test_big_endian macro. It sets a variable which you can then pass to your software. Something like this (untested):
include(TestBigEndian)
test_big_endian(IS_BIG_ENDIAN)
...
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -DIS_BIG_ENDIAN=${IS_BIG_ENDIAN}") # C
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DIS_BIG_ENDIAN=${IS_BIG_ENDIAN}") # C++
Then in your C/C++ code you can check that IS_BIG_ENDIAN define:
#if IS_BIG_ENDIAN
...do big endian stuff here...
#else
...do little endian stuff here...
#endif
So the main problem with such a test is cross compiling since you may be on a completely different CPU with a different endianness... but at least it gives you the endianness at time of compiling the rest of your code and will work for most projects.

I provide a general approach in C that uses no preprocessor checks, only a run-time test that computes the endianness of every C integer type.
The output of this on my Linux x86_64 machine is:
fabrizio@toshibaSeb:~/git/pegaso/scripts$ gcc -o sizeof_endianess sizeof_endianess.c
fabrizio@toshibaSeb:~/git/pegaso/scripts$ ./sizeof_endianess
INTEGER TYPE | signed | unsigned | 0x010203... | Endianess
--------------+---------+------------+-------------------------+--------------
int | 4 | 4 | 04 03 02 01 | little
char | 1 | 1 | - | -
short | 2 | 2 | 02 01 | little
long int | 8 | 8 | 08 07 06 05 04 03 02 01 | little
long long int | 8 | 8 | 08 07 06 05 04 03 02 01 | little
--------------+---------+------------+-------------------------+--------------
FLOATING POINT| size |
--------------+---------+
float | 4
double | 8
long double | 16
Get source at: https://github.com/bzimage-it/pegaso/blob/master/scripts/sizeof_endianess.c
This more general approach neither detects endianness at compile time (which is not possible) nor assumes that one endianness excludes another. In fact, it is important to note that endianness is not a property of the architecture/processor as a whole but of each individual type. As argued by
@Christoph at https://stackoverflow.com/a/4712594/3280080, the PDP-11, for example, can have different endiannesses at the same time.
The approach consists of setting an integer to x = 0x010203... for as many bytes as the type has, and then printing it byte by byte, casting to a single-byte type and incrementing the address by one.
Can somebody please test it on a big-endian and/or mixed-endian machine?

I know I'm late to this party, but here is my take.
int is_big_endian() {
return 1 & *(uint16_t*)"01";
}
This is based on the fact that '0' is 48 in decimal and '1' 49, so '1' has the LSB bit set, while '0' not. I could make them '\x00' and '\x01' but I think my version makes it more readable.

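/* Caution: shifts operate on values, not on the byte layout in memory, so 1 >> 1 is 0 on every platform and this macro always evaluates to 0. */
#define BIG_ENDIAN ((1 >> 1 == 0) ? 0 : 1)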

glibc - force function call (no inline expansion)

I have a question regarding glibc function calls. Is there a flag to tell gcc not to inline a certain glibc function, e.g. memcpy?
I've tried -fno-builtin-memcpy and other flags, but they didn't work. The goal is that the actual glibc memcpy function is called rather than inlined code (since the glibc version at compile time differs from the one at runtime). It's for testing purposes only. Normally I wouldn't do that.
Any solutions?
UPDATE:
Just to make it clearer: in the past, memcpy worked even with overlapping areas. This has since changed, and I can see the change when compiling with different glibc versions. So now I want to test whether my old code (using memcpy where memmove should have been used) works correctly on a system with a newer glibc (e.g. 2.14). But to do that, I have to make sure that the new memcpy is called and not inlined code.
Best regards

This may not be exactly what you're looking for, but it seems to force gcc to generate an indirect call to memcpy():
#include <stdio.h>
#include <string.h>
#include <time.h>
// void *memcpy(void *dest, const void *src, size_t n)
unsigned int x = 0xdeadbeef;
unsigned int y;
int main(void) {
void *(*memcpy_ptr)(void *, const void *, size_t) = memcpy;
if (time(NULL) == 1) {
memcpy_ptr = NULL;
}
memcpy_ptr(&y, &x, sizeof y);
printf("y = 0x%x\n", y);
return 0;
}
The generated assembly (gcc, Ubuntu, x86) includes a call *%edx instruction.
Without the if (time(NULL) == 1) test (which should never succeed, but the compiler doesn't know that), gcc -O3 is clever enough to recognize that the indirect call always calls memcpy(), which can then be replaced by a movl instruction.
Note that the compiler could recognize that if memcpy_ptr == NULL then the behavior is undefined, and again replace the indirect call with a direct call, and then with a movl instruction. gcc 4.5.2 with -O3 doesn't appear to be that clever. If a later version of gcc is, you could replace the memcpy_ptr = NULL with an assignment of some actual function that behaves differently than memcpy().

In theory:
gcc -fno-inline -fno-builtin-inline ...
But then you said -fno-builtin-memcpy didn't stop the compiler from inlining it, so there's no obvious reason why this should work any better.

#undef memcpy
#define memcpy your_memcpy_replacement
Put this somewhere at the top, but after the #include lines, obviously.
And mark your_memcpy_replacement with __attribute__((noinline)).
