No performance difference in different variations of the same program - c

I copied glibc's implementation of the binary search algorithm, then modified it slightly to suit my needs. I decided to use it to test other things I have learned about GCC (attributes and built-ins).
The code looks like this:
int main() {
uint_fast16_t a[61] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61 };
uint64_t t1 = Time(0);
for(register uint_fast16_t i = 0; i < 10000000; ++i) {
binary_search(rand() % 62, a, 61);
}
printf("%ld\n", Time(0) - t1);
return 0;
}
Now, this program runs just fine. The problem begins when I add more lines of code, for instance:
uint_fast16_t a[61] __attribute__ ((aligned (64) )) = /* ... */
In this case I would expect faster code, yet performance has not changed after multiple tests (tens of tests).
I also tested the program with alignments of 8 and 1, with no change. I even expected gcc to emit an error or warning for an alignment smaller than the type size (on my 64-bit machine, uint_fast16_t is 8 bytes), but there was none.
Then another change, which was adding caching (introduced in GCC 9). I added the following code before the for loop:
caches(a, uint_fast16_t, uint_fast16_t, 61, 0, 3);
// where "caches" is:
#define caches(x, type, data_type, size, rw, l) ({ \
for(type Q2W0 = 0; Q2W0 < size; Q2W0 += 64 / sizeof(data_type)) { \
__builtin_prefetch(x + Q2W0, rw, l); \
} \
})
No change in performance either. I figured that maybe my CPU caches the array automatically after the first binary_search call, so I removed the for loop and measured a few more times with and without the caching line, but again I noticed no change in performance.
More information:
Using CentOS8 64bit latest kernel
Using GCC 9.2.1 20191120
Compiling with -O3 -Wall -pthread -lanl -Wno-missing-braces -Wmissing-field-initializers, no errors / warnings during compilation
Things are not optimised away (checked asm output)
I am pretty sure I don't know about something / I am doing something wrong.
Full code available here.

register uint_fast16_t is premature optimization; let the compiler decide which variables to place in registers. Regard register as a mostly obsolete keyword.
As noted in comments, uint_fast16_t i = 0; i < 10000000 is either a bug or bad practice. You should perhaps do something like this instead:
const uint_fast16_t MAX = 10000000;
... i < MAX
In that case you should get a compiler diagnostic at the initialization if the value does not fit. Alternatively, check the value with static assertions.
Better yet, use size_t for the loop iterator in this case.
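A minimal sketch of the static-assertion and size_t suggestions, assuming a C11 compiler. ITERATIONS and count_iterations are hypothetical names used only for illustration; they are not part of the original program.

```c
#include <stddef.h>
#include <stdint.h>

#define ITERATIONS 10000000

/* Compilation fails here if the loop bound cannot be represented in the
   counter type. Swap SIZE_MAX for UINT_FAST16_MAX to check the original
   uint_fast16_t counter instead (only guaranteed to hold up to 65535). */
_Static_assert(ITERATIONS <= SIZE_MAX, "ITERATIONS does not fit in size_t");

/* Runs the loop with a size_t counter and returns how many times the
   body executed. */
size_t count_iterations(void)
{
    size_t n = 0;
    for (size_t i = 0; i < ITERATIONS; ++i) {
        ++n;
    }
    return n;
}
```

With the check in place, a bound that is too large for the counter type is caught at compile time rather than turning into an endless loop at run time.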
__attribute__ ((aligned (64) )) "In this case I would expect faster code"
Why? What makes you think the array was misaligned to begin with? The compiler will not misalign variables just for the sake of it. Particularly not when the array members are declared as uint_fastnn - the whole point of using uint_fast16_t is in fact to get correct alignment.
In this case, both gcc and clang for x86-64 emit the array as a series of .quad assembler directives, resulting in perfectly aligned data.
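One way to check the alignment claim directly is to look at the addresses. The sketch below assumes GCC or clang (for __attribute__) and C11 (for _Alignof); plain_array, padded_array and is_aligned are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Same element type as in the question, once with the default alignment
   and once with the explicit 64-byte request. */
uint_fast16_t plain_array[61];
uint_fast16_t padded_array[61] __attribute__((aligned(64)));

/* Returns 1 if p is a multiple of alignment, 0 otherwise. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

Calling is_aligned(plain_array, _Alignof(uint_fast16_t)) should already return 1 without any attribute, which is why the attribute alone is not expected to change the timing of a small array like this one.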
Regarding the cache commands, I know too little of how they work to comment on them. It is however likely that you already have ideal data cache performance in this case - the array should be in data cache.
As for instruction cache, it's unlikely to do much good during binary search, which by its nature comes with a tonne of branches. In some cases a brute force linear search might outperform binary search for this very reason. Benchmark and see. (And make sure to bludgeon your old computer science algorithm teacher with a big O when brute force proves to be much faster than binary search.)
rand() % 62 may or may not be a significant bottleneck. Both the rand function and the modulus could add a lot of overhead, depending on the system.
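To compare the two searches without measuring rand() and the modulus, the keys can be generated before the timed region. This is only a sketch: linear_search, binary_search_idx and time_search are hypothetical helpers, not the glibc-derived binary_search from the question, and clock() is a coarse timer.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Returns the index of key in a[0..n-1], or -1 if it is absent. */
ptrdiff_t linear_search(uint_fast16_t key, const uint_fast16_t *a, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (a[i] == key) {
            return (ptrdiff_t)i;
        }
    }
    return -1;
}

/* Same contract, but requires a to be sorted in ascending order. */
ptrdiff_t binary_search_idx(uint_fast16_t key, const uint_fast16_t *a, size_t n)
{
    size_t lo = 0, hi = n; /* half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key) {
            lo = mid + 1;
        } else if (a[mid] > key) {
            hi = mid;
        } else {
            return (ptrdiff_t)mid;
        }
    }
    return -1;
}

typedef ptrdiff_t (*search_fn)(uint_fast16_t, const uint_fast16_t *, size_t);

/* Looks up count pre-generated keys and returns the elapsed CPU seconds.
   The number of hits is written to *hits so the calls have an observable
   result and are not removed as dead code. */
double time_search(search_fn f, const uint_fast16_t *keys, size_t count,
                   const uint_fast16_t *a, size_t n, size_t *hits)
{
    size_t found = 0;
    clock_t t0 = clock();
    for (size_t i = 0; i < count; ++i) {
        found += (f(keys[i], a, n) >= 0);
    }
    clock_t t1 = clock();
    *hits = found;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Fill a keys array with rand() % 62 once, then pass the same array to time_search for both functions; any difference in the returned times then comes from the search itself.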

Related

Proper way to use const, static const or global variables in TI XDAIS algorithm

I'm trying to make a melpe codec XDAIS-compliant, but I'm pretty confused about the proper way to do it.
Link to the melpe codec: https://github.com/gegel/pairphone/tree/master/melpe
(Thanks to gegel for providing the implementation.)
I have already done this for the G711 codec (which is an extremely simple codec), but for the melpe codec I don't know what to do with the const and static const variables.
For example, in classify.c:
const int16_t enlpf_coef[EN_FILTER_ORDER] = { /* Q14 */
/* the coefs of the filter (NOT h) */
6764, 4336, -274, -2536, -1491,
24, -228, -1370, -1502, -480,
383, 390, 57, -18, 104,
132, 51
};
const int16_t enhpf_coef[EN_FILTER_ORDER] = { /* Q14 */
/* the coefs of the filter (NOT h) */
7783, -5211, 439, 1707, -483,
-978, 564, 630, -861, 214,
205, -86, -82, 43, 26,
-18, 2
};
Reading about XDAIS, I found the following notes:
Algorithms never allocate memory themselves; rather, they request memory from the application.
Following the procedure described at http://processors.wiki.ti.com/index.php/Porting_GPP_code_to_DSP_and_Codec_Engine , I should first compile this codec into a library and then use it in the XDAIS wrapper functions (for example, the process and control methods).
In the FIR filter example provided by TI, there are no const or global variables that need to be stored.
Q1 - If I understand correctly, algorithms should never use global variables in their implementation. Am I correct?
Q2 - What about local variables defined in IAlg functions? What prevents algorithms from defining too many local variables, given that when they request memory from the application they say nothing about how much memory they need for their local variables? And where are these variables stored in memory?
Q3 - If Q1 is correct, how should I treat these global constant variables in XDAIS? Where exactly should I put them?
Q4 - What exactly is the difference between requesting scratch memory from the application and defining the variables we need locally, since scratch memory is uninitialized too?
Thanks in advance

c - Are defined values slower than hard-coded numbers

This question might appear dumb, but I couldn't find an answer to it and I want to be sure that it works the way I think it does.
Recently I came across this code:
void RDP_G_SETBLENDCOLOR(void)
{
Gfx.BlendColor.R = _SHIFTR(w1, 24, 8) * 0.0039215689f;
Gfx.BlendColor.G = _SHIFTR(w1, 16, 8) * 0.0039215689f;
Gfx.BlendColor.B = _SHIFTR(w1, 8, 8) * 0.0039215689f;
Gfx.BlendColor.A = _SHIFTR(w1, 0, 8) * 0.0039215689f;
if(OpenGL.Ext_FragmentProgram && (System.Options & BRDP_COMBINER)) {
glProgramEnvParameter4fARB(GL_FRAGMENT_PROGRAM_ARB, 2, Gfx.BlendColor.R, Gfx.BlendColor.G, Gfx.BlendColor.B, Gfx.BlendColor.A);
}
}
I understand that the 0.0039215689f (which refers to 1/255) is hard-coded for optimization reasons.
Now imagine that I want to define it for readability reasons (the name chosen here is no better; it is just for the example).
#define PIXEL_VALUE 0.0039215689f
void RDP_G_SETBLENDCOLOR(void)
{
Gfx.BlendColor.R = _SHIFTR(w1, 24, 8) * PIXEL_VALUE;
Gfx.BlendColor.G = _SHIFTR(w1, 16, 8) * PIXEL_VALUE;
Gfx.BlendColor.B = _SHIFTR(w1, 8, 8) * PIXEL_VALUE;
Gfx.BlendColor.A = _SHIFTR(w1, 0, 8) * PIXEL_VALUE;
if(OpenGL.Ext_FragmentProgram && (System.Options & BRDP_COMBINER)) {
glProgramEnvParameter4fARB(GL_FRAGMENT_PROGRAM_ARB, 2, Gfx.BlendColor.R, Gfx.BlendColor.G, Gfx.BlendColor.B, Gfx.BlendColor.A);
}
}
Would this define make the code execution slower?
Would this define make the code execution slower?
No. The two code snippets are identical, because macros are expanded before a translation unit is compiled.
Macros do text replacement. The code that gets compiled is exactly the same as if you copied and pasted the replacement text of the macro in your code.
I believe they make no difference at all.
A macro is a pattern of text replacement. So it gets replaced before your code is compiled.
You can try preprocessing both files and see the difference in a terminal:
gcc -E 1.c -o 1.i
gcc -E 2.c -o 2.i
diff -u 1.i 2.i
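The same point can be checked from inside a program. In this small sketch, scale_literal and scale_macro are hypothetical functions that differ only in whether the constant is written out or named by the macro.

```c
#define PIXEL_VALUE 0.0039215689f

/* Constant written directly in the expression. */
float scale_literal(unsigned v)
{
    return v * 0.0039215689f;
}

/* Constant supplied through the macro; after preprocessing this body is
   token-for-token the same as scale_literal's. */
float scale_macro(unsigned v)
{
    return v * PIXEL_VALUE;
}
```

Because the preprocessor substitutes the text before compilation, the compiler sees the same expression in both functions.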

0xFFFF flags in SSE

I would like to create an SSE register with values that I can store in an array of integers, from another SSE register which contains flags 0xFFFF and zeros. For example:
__m128i regComp = _mm_cmpgt_epi16(regA, regB);
For the sake of argument, let's assume that regComp was loaded with { 0, 0xFFFF, 0, 0xFFFF }. I would like to convert this into, say, { 0, 80, 0, 80 }.
What I had in mind was to create an array of integers initialized to 80 and load it into a register regC, then do a _mm_and_si128 between regC and regComp and store the result in regD. However, this does not do the trick, which led me to think that I do not understand the positive flags in SSE registers. Could someone answer the question with a brief explanation of why my solution does not work?
short valA[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
short valB[16] = { 5, 5, 5, 5, 5, 5, 5, 5, 5, 10, 10, 10, 10, 10, 10, 10 };
short ones[16] = { 1 };
short final[16];
__m128i vA, vB, vOnes, vRes, vRes2;
vOnes = _mm_load_si128((__m128i *)&(ones)[0] );
for( i=0 ; i < 16 ;i+=8){
vA = _mm_load_si128((__m128i *)&(valA)[i] );
vB = _mm_load_si128((__m128i *)&(valB)[i] );
vRes = _mm_cmpgt_epi16(vA,vB);
vRes2 = _mm_and_si128(vRes,vOnes);
_mm_storeu_si128((__m128i *)&(final)[i], vRes2);
}
You only set the first element of array ones to 1 (the rest of the array is initialised to 0).
I suggest you get rid of the array ones altogether and then change this line:
vOnes = _mm_load_si128((__m128i *)&(ones)[0] );
to:
vOnes = _mm_set1_epi16(1);
Probably a better solution though, if you just want to convert SIMD TRUE (0xffff) results to 1, would be to use a shift:
for (i = 0; i < 16; i += 8) {
vA = _mm_loadu_si128((__m128i *)&pA[i]);
vB = _mm_loadu_si128((__m128i *)&pB[i]);
vRes = _mm_cmpgt_epi16(vA, vB); // generate 0xffff/0x0000 results
vRes = _mm_srli_epi16(vRes, 15); // convert to 1/0 results
_mm_storeu_si128((__m128i *)&final[i], vRes); // store vRes (vRes2 is never assigned in this version)
}
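For the original goal of turning the 0xFFFF flags into an arbitrary value such as 80, the AND approach works once every lane of the constant register holds that value. A minimal sketch, assuming an x86 target with SSE2 and a length that is a multiple of 8; select_gt is a hypothetical name.

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* out[i] = (a[i] > b[i]) ? value : 0, for n a multiple of 8. */
void select_gt(const short *a, const short *b, short *out, int n, short value)
{
    /* Broadcast value into all eight 16-bit lanes. */
    const __m128i vValue = _mm_set1_epi16(value);

    for (int i = 0; i < n; i += 8) {
        __m128i vA = _mm_loadu_si128((const __m128i *)&a[i]);
        __m128i vB = _mm_loadu_si128((const __m128i *)&b[i]);
        /* Each lane becomes 0xFFFF where a > b, 0x0000 otherwise. */
        __m128i mask = _mm_cmpgt_epi16(vA, vB);
        /* 0xFFFF & value == value, 0x0000 & value == 0. */
        _mm_storeu_si128((__m128i *)&out[i], _mm_and_si128(mask, vValue));
    }
}
```

The unaligned load and store intrinsics are used so the arrays do not need any particular alignment.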
Try this for loading 1:
vOnes = _mm_set1_epi16(1);
This is shorter than creating a constant array.
Be careful: in C (and C++), providing fewer initializers than the array size initializes the remaining elements to zero. This was your error, not the SSE part.
Don't forget the debugger; modern ones display SSE variables properly.
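The initializer rule is easy to confirm in isolation. In this sketch, ones_partial mirrors the declaration from the question, while ones_full and count_nonzero are hypothetical additions for comparison.

```c
#include <stddef.h>

/* Only element 0 is given; elements 1..15 are zero-initialized. */
short ones_partial[16] = { 1 };

/* Every element is given explicitly. */
short ones_full[16] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 };

/* Counts how many of the first n elements are non-zero. */
size_t count_nonzero(const short *a, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        count += (a[i] != 0);
    }
    return count;
}
```

Loading ones_partial into a register therefore gives a vector with a single 1 in the first lane, which is why only the first result of each group of eight could ever be non-zero.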

How to convince avr-gcc, that the memory position of a global byte array is a constant

I am writing a fast "8 bit reverse" routine for an AVR project with an ATmega2560 processor.
I'm using
GNU C (WinAVR 20100110) version 4.3.3 (avr) / compiled by GNU C version 3.4.5 (mingw-vista special r3), GMP version 4.2.3, MPFR version 2.4.1.
First I created a global lookup-table of reversed bytes (size: 0x100):
uint8_t BitReverseTable[]
__attribute__((__progmem__, aligned(0x100))) = {
0x00,0x80,0x40,0xC0,0x20,0xA0,0x60,0xE0,
0x10,0x90,0x50,0xD0,0x30,0xB0,0x70,0xF0,
[...]
0x1F,0x9F,0x5F,0xDF,0x3F,0xBF,0x7F,0xFF
};
This works as expected. This is the macro I intend to use, which should cost me only 5 cycles:
#define BITREVERSE(x) (__extension__({ \
register uint8_t b=(uint8_t)x; \
__asm__ __volatile__ ( \
"ldi r31, hi8(table)" "\n\t" \
"mov r30, ioRegister" "\n\t" \
"lpm ioRegister, z" "\n\t" \
:[ioRegister] "+r" (b) \
:[table] "g" (BitReverseTable) \
:"r30", "r31" \
); \
}))
The code to get it compiled (or not).
int main() /// Test for bitreverse
{
BITREVERSE(25);
return 0;
}
That's the error I get from the compiler:
c:/winavr-20100110/bin/../lib/gcc/avr/4.3.3/../../../../avr/bin/as.exe -mmcu=atmega2560 -o bitreverse.o C:\Users\xxx\AppData\Local\Temp/ccCefE75.s
C:\Users\xxx\AppData\Local\Temp/ccCefE75.s: Assembler messages:
C:\Users\xxx\AppData\Local\Temp/ccCefE75.s:349: Error: constant value required
C:\Users\xxx\AppData\Local\Temp/ccCefE75.s:350: Error: constant value required
I guess the problem is here:
:[table] "g" (BitReverseTable) \
From my point of view, BitReverseTable is the memory position of the array, which is fixed and known at compile time. Therefore it is constant.
Maybe I need to cast BitReverseTable into something (I tried everything I could think of). Maybe I need another constraint ("g" was my last test). I'm sure I tried everything possible and impossible.
I coded an assembler version, which works fine, but instead of being an inline assembly code, this is a proper function which adds another 6 cycles (for call and ret).
Any advice or suggestions are very welcome!
Full source of bitreverse.c on pastebin.
Verbose compiler output also on pastebin
The following does seem to work on avr-gcc (GCC) 4.8.2, but it does have a distinct hacky aftertaste to me.
Edited to fix the issues pointed out by the OP (Thomas) in the comments:
The high byte of Z register is r31 (I had r30 and r31 swapped)
Newer AVRs like the ATmega2560 also support lpm r,Z (older AVRs support only lpm r0,Z)
Thanks for the fixes, Thomas! I do have an ATmega2560 board, but I prefer Teensies (in part because of the native USB), so I only compile-tested the code, didn't run it to verify. I should have mentioned that; apologies.
const unsigned char reverse_bits_table[256] __attribute__((progmem, aligned (256))) = {
0, 128, 64, 192, 32, 160, 96, 224, 16, 144, 80, 208, 48, 176, 112, 240,
8, 136, 72, 200, 40, 168, 104, 232, 24, 152, 88, 216, 56, 184, 120, 248,
4, 132, 68, 196, 36, 164, 100, 228, 20, 148, 84, 212, 52, 180, 116, 244,
12, 140, 76, 204, 44, 172, 108, 236, 28, 156, 92, 220, 60, 188, 124, 252,
2, 130, 66, 194, 34, 162, 98, 226, 18, 146, 82, 210, 50, 178, 114, 242,
10, 138, 74, 202, 42, 170, 106, 234, 26, 154, 90, 218, 58, 186, 122, 250,
6, 134, 70, 198, 38, 166, 102, 230, 22, 150, 86, 214, 54, 182, 118, 246,
14, 142, 78, 206, 46, 174, 110, 238, 30, 158, 94, 222, 62, 190, 126, 254,
1, 129, 65, 193, 33, 161, 97, 225, 17, 145, 81, 209, 49, 177, 113, 241,
9, 137, 73, 201, 41, 169, 105, 233, 25, 153, 89, 217, 57, 185, 121, 249,
5, 133, 69, 197, 37, 165, 101, 229, 21, 149, 85, 213, 53, 181, 117, 245,
13, 141, 77, 205, 45, 173, 109, 237, 29, 157, 93, 221, 61, 189, 125, 253,
3, 131, 67, 195, 35, 163, 99, 227, 19, 147, 83, 211, 51, 179, 115, 243,
11, 139, 75, 203, 43, 171, 107, 235, 27, 155, 91, 219, 59, 187, 123, 251,
7, 135, 71, 199, 39, 167, 103, 231, 23, 151, 87, 215, 55, 183, 119, 247,
15, 143, 79, 207, 47, 175, 111, 239, 31, 159, 95, 223, 63, 191, 127, 255,
};
#define USING_REVERSE_BITS \
register unsigned char r31 asm("r31"); \
asm volatile ( "ldi r31,hi8(reverse_bits_table)\n\t" : [r31] "=d" (r31) )
#define REVERSE_BITS(v) \
({ register unsigned char r30 asm("r30") = v; \
register unsigned char ret; \
asm volatile ( "lpm %[ret],Z\n\t" : [ret] "=r" (ret) : [r30] "d" (r30), [r31] "d" (r31) ); \
ret; })
unsigned char reverse_bits(const unsigned char value)
{
USING_REVERSE_BITS;
return REVERSE_BITS(value);
}
void reverse_bits_in(unsigned char *string, unsigned char length)
{
USING_REVERSE_BITS;
while (length-->0) {
*string = REVERSE_BITS(*string);
string++;
}
}
For older AVRs that only support lpm r0,Z, use
#define REVERSE_BITS(v) \
({ register unsigned char r30 asm("r30") = v; \
register unsigned char ret asm("r0"); \
asm volatile ( "lpm %[ret],Z\n\t" : [ret] "=t" (ret) : [r30] "d" (r30), [r31] "d" (r31) ); \
ret; })
The idea is that we use a local reg var r31, to keep the high byte of the Z register pair. The USING_REVERSE_BITS; macro defines it in the current scope, using inline assembly for two purposes: to avoid an unnecessary load of the low part of the table address into a register, and to make sure GCC knows we have stored a value into it (because it is an output operand) without having any way of knowing what the value should be, thus hopefully retaining it throughout the scope.
The REVERSE_BITS() macro yields the result, telling the compiler it needs the argument in register r30, and the table address high byte set by USING_REVERSE_BITS; in r31.
Sounds a bit complicated, but that's just because I don't know how to explain it better. It really is quite simple.
Compiling the above with avr-gcc-4.8.2 -O2 -fomit-frame-pointer -mmcu=atmega2560 -S yields the assembly source. (I do recommend using -O2 -fomit-frame-pointer.)
Omitting comments and the normal directives:
.text
reverse_bits:
ldi r31,hi8(reverse_bits_table)
mov r30,r24
lpm r24,Z
ret
reverse_bits_in:
mov r26,r24
mov r27,r25
ldi r31,hi8(reverse_bits_table)
ldi r24,lo8(-1)
add r24,r22
tst r22
breq .L2
.L8:
ld r30,X
lpm r30,Z
st X+,r30
subi r24,1
brcc .L8
.L2:
ret
.section .progmem.data,"a",#progbits
.p2align 8
reverse_bits_table:
.byte 0
.byte -128
; Rest of data omitted for brevity
In case you are wondering, on ATmega2560 GCC puts the first 8-bit parameter and the 8-bit function result both in register r24.
The first function is optimal, as far as I can tell. (On older AVRs that only support lpm r0,Z, you get an added move to copy the result from r0 to r24.)
For the second function, the setup part might not be exactly optimal (for one, you could do the tst r22 breq .L2 first thing to speed up the zero-length-array check), but I'm not sure if I could write a faster/shorter one myself; it's certainly acceptable to me.
The loop in the second function looks optimal to me. The way it uses r30 I found strange and scary at first, but then I realized it makes perfect sense -- fewer registers used, and there is no harm in reusing r30 this way (even if it is low part of Z register too), because it will be loaded with a new value from string at the start of the next iteration.
Note that in my previous edit, I mentioned that swapping the order of the function parameters yielded better code, but with Thomas's additions, that is no longer the case. The registers change, that's it.
If you are sure you always supply a larger-than-zero length, using
void reverse_bits_in(unsigned char *string, unsigned char length)
{
USING_REVERSE_BITS;
do {
*string = REVERSE_BITS(*string);
string++;
} while (--length);
}
yields
reverse_bits_in:
mov r26,r24 ; 1 cycle
mov r27,r25 ; 1 cycle
ldi r31,hi8(reverse_bits_table) ; 1 cycle
.L4:
ld r30,X ; 2 cycles
lpm r30,Z ; 3 cycles
st X+,r30 ; 2 cycles
subi r22,lo8(-(-1)) ; 1 cycle
brne .L4 ; 2 cycles
ret ; 5 cycles
which starts to look downright impressive to me: ten cycles per byte, three cycles for setup, and four cycles for cleanup (brne takes just one cycle if no jump, and ret takes five cycles on parts with a 22-bit program counter such as the ATmega2560). The cycle counts I listed off the top of my head, so there are likely small errors in 'em (a cycle here or there). r26:r27 is X, and the first pointer parameter to the function is supplied in r24:r25, with length in r22.
The reverse_bits_table is in the correct section, and correctly aligned. (.p2align 8 does align to 256 bytes; it specifies an alignment where the low 8 bits are zero.)
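The reason the 256-byte alignment matters can be shown in portable C. This is only a host-side sketch of the idea, assuming GCC or clang for the aligned attribute and a platform where converting between pointers and uintptr_t preserves the address; demo_table, init_demo_table, reverse_bits_slow and reverse_bits_lookup are hypothetical names, and none of this is AVR code.

```c
#include <stdint.h>

/* Reverses the bits of one byte with a loop; used only to fill the table. */
uint8_t reverse_bits_slow(uint8_t v)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; ++i) {
        r = (uint8_t)((r << 1) | (v & 1));
        v >>= 1;
    }
    return r;
}

/* 256-byte alignment makes the low 8 bits of the table address zero. */
uint8_t demo_table[256] __attribute__((aligned(256)));

/* Fills demo_table so that demo_table[i] is i with its bits reversed. */
void init_demo_table(void)
{
    for (int i = 0; i < 256; ++i) {
        demo_table[i] = reverse_bits_slow((uint8_t)i);
    }
}

/* Mimics the Z-register lookup: the upper address bits come from the table
   base, and the low byte is simply replaced by the index. No addition is
   needed because the low byte of the base is zero. */
uint8_t reverse_bits_lookup(uint8_t v)
{
    uintptr_t base = (uintptr_t)demo_table;
    const uint8_t *p = (const uint8_t *)(base | v);
    return *p;
}
```

On the AVR the same effect is obtained by loading only hi8(reverse_bits_table) into r31 and moving the byte to reverse into r30.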
Although GCC is notorious for superfluous register moves, I really like the code it generates above. Sure, there is always room for finessing; for the important code sequences I recommend trying different variants, even changing the order of function parameters (or declaring loop variables in local scopes), and so on, then compile using -S to see the generated code. The AVR instruction timings are simple, so it is pretty easy to compare code sequences, to see if one is clearly better. I like to remove the directives and comments first; it makes it easier to read the assembly.
The reason for the hacky aftertaste is that the GCC documentation explicitly says that "Defining such a register variable does not reserve the register; it remains available for other uses in places where flow control determines the variable's value is not live", and I just don't trust that this means the same to the GCC developers as it means to me. Even if it did right now, it might not in the future; there is no standard GCC developers ought to adhere to here, since this is a GCC-specific feature.
On the other hand, I do only rely on documented GCC behaviour, above, and although "hacky", it does generate efficient assembly from straightforward C code.
Personally, I would recommend recompiling the above test code, and looking at the generated assembly (perhaps use sed to strip out the comments and labels, and compare to a known good version?), whenever you update avr-gcc.
Questions?

What is wrong with my program?

I cannot figure out what is wrong. I spent a few hours trying to debug this. I am compiling with gcc -m32 source.c -o source
How else can I approach this when debugging? Right now, I am isolating the code in many different ways and everything works the way I expect, but it works the wrong way when I put it all together.
This program takes an input and then looks for the highest position with the 1 bit.
I removed my code for now.
In bitsearch, you store num in eax and a special value in edx in order to perform the check. check tests whether the highest bit is set (indicating a negative number) and exits if that is the case.
The andl instruction in check stores the result of the operation in its second operand (eax), so the result overwrites num.
Then in zero you use edx to perform your computation. edx still contains the special value from the start of the function, so your result will always be wrong.
At the end of zero you go back to check, but the check is unnecessary here; you should loop back to zero instead.
Does the bit-search need to be implemented in assembly? A simple for loop can accomplish the same task, and is much more readable:
int num = 10;
int maxFound = -1;
for (int numShifts = 0; numShifts < 32 && num != 0; numShifts++) {
if ((num & 1) == 1) {
maxFound = numShifts;
}
num = num >> 1;
}
//the last position that had a 1 will be in maxFound
There's a neat bit-fiddling trick: x & -x isolates the last 1-bit. The following C program uses a lookup table based on de Bruijn sequences to compute the number of trailing (!) zeros of a number in constant (!) time:
unsigned int x; // find the number of trailing zeros in 32-bit x
int r; // result goes here
int table[32] =
{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
r = table[((uint32_t)((x & -x) * 0x077CB531U)) >> 27];
Doing this in assembly language (which I stopped learning by the age of 16) should be no problem. Now all you have to do is to reverse the bits in num and apply the technique described above.
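Wrapped in a function, the lookup above can be tried directly. This is a sketch using the same table and multiplier; trailing_zeros32 is a hypothetical name, and the result for an input of 0 is not meaningful (the function returns 0 in that case).

```c
#include <stdint.h>

/* Number of trailing zero bits in a non-zero 32-bit value. */
int trailing_zeros32(uint32_t x)
{
    static const int table[32] = {
        0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
        31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };
    /* x & (0 - x) keeps only the lowest set bit. Multiplying by the
       de Bruijn constant moves a unique 5-bit pattern into the top bits,
       which the shift by 27 turns into a table index. */
    uint32_t lowest = x & (0u - x);
    return table[(uint32_t)(lowest * 0x077CB531U) >> 27];
}
```

For example, 12 is binary 1100, so trailing_zeros32(12) returns 2.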
I wrote a paper about the trick described above, but unfortunately it's not available on the web. If you're interested, I can send it to you (or anyone else who's interested) by email.
My assembly knowledge is a little rusty, but it seems to me that bitsearch is overly complicated. How about just shifting the number to the right and counting how many times you need to do that until it is zero?
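In C, the shift-and-count idea might look like the sketch below. highest_set_bit is a hypothetical name, and returning -1 for an input of 0 is an assumption about the desired behaviour.

```c
/* Position (0-based) of the highest 1 bit in num, or -1 if num is 0. */
int highest_set_bit(unsigned int num)
{
    int pos = -1;
    while (num != 0) {
        num >>= 1; /* logical shift: zeros come in from the left */
        ++pos;
    }
    return pos;
}
```

The loop needs a shift rather than a rotate, because a rotate feeds the bits back in at the other end and the value never reaches zero.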

Resources