while (n > 1) is 25% faster than while (n)? - c

I have two logically equivalent functions:
long ipow1(int base, int exp) {
// HISTORICAL NOTE:
// This wasn't here in the original question; I edited it in later.
if (exp == 0) return 1;
long result = 1;
while (exp > 1) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
}
return result * base;
}
long ipow2(int base, int exp) {
long result = 1;
while (exp) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
}
return result;
}
NOTICE:
These loops are equivalent because in the former case we return result * base (handling the case where exp is, or has been reduced to, 1), while in the latter case we return just result.
Strangely enough, both with -O3 and -O0, ipow1 consistently outperforms ipow2 by about 25%. How is this possible?
I'm on Windows 7, x64, gcc 4.5.2 and compiling with gcc ipow.c -O0 -std=c99.
And this is my profiling code:
int main(int argc, char *argv[]) {
LARGE_INTEGER ticksPerSecond;
LARGE_INTEGER tick;
LARGE_INTEGER start_ticks, end_ticks, cputime;
double totaltime = 0;
int repetitions = 10000;
int rep = 0;
int nopti = 0;
for (rep = 0; rep < repetitions; rep++) {
if (!QueryPerformanceFrequency(&ticksPerSecond)) printf("\tno go QueryPerformance not present");
if (!QueryPerformanceCounter(&tick)) printf("no go counter not installed");
QueryPerformanceCounter(&start_ticks);
/* start real code */
for (int i = 0; i < 55; i++) {
for (int j = 0; j < 11; j++) {
nopti = ipow1(i, j); // or ipow2
}
}
/* end code */
QueryPerformanceCounter(&end_ticks);
cputime.QuadPart = end_ticks.QuadPart - start_ticks.QuadPart;
totaltime += (double)cputime.QuadPart / (double)ticksPerSecond.QuadPart;
}
printf("\tTotal elapsed CPU time: %.9f sec with %d repetitions - %ld:\n", totaltime, repetitions, nopti);
return 0;
}

No, really, the two ARE NOT equivalent. ipow2 returns correct results when ipow1 doesn't.
http://ideone.com/MqyqU
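For concreteness, a minimal sketch of such a counter-example (my reconstruction, using the original ipow1 without the exp == 0 guard that the historical note above mentions):

    /* ipow1 as originally posted, without the later "if (exp == 0) return 1;" guard */
    #include <stdio.h>

    long ipow1_orig(int base, int exp) {
        long result = 1;
        while (exp > 1) {
            if (exp & 1) result *= base;
            exp >>= 1;
            base *= base;
        }
        return result * base;   /* for exp == 0 this still multiplies by base */
    }

    long ipow2(int base, int exp) {
        long result = 1;
        while (exp) {
            if (exp & 1) result *= base;
            exp >>= 1;
            base *= base;
        }
        return result;
    }

    int main(void) {
        printf("%ld %ld\n", ipow1_orig(5, 0), ipow2(5, 0));   /* prints "5 1" */
        return 0;
    }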
P.S. I don't care how many comments you leave "explaining" why they're the same, it takes only a single counter-example to disprove your claims.
P.P.S. -1 on the question for your insufferable arrogance toward everyone who already tried to point this out to you.

It's because with while (exp > 1) the loop will run from exp down to 2 (it executes with exp = 2, reduces it to 1 and then exits).
With while (exp), the loop will run from exp down to 1 (it executes with exp = 1, reduces it to 0 and then exits).
So with while (exp) you have an extra iteration, which takes the extra time to run.
EDIT: Even though the exp > 1 version does a multiplication after the loop, keep in mind that the multiplication is not the only thing in the loop.

If you don't want to read all of this, skip to the bottom: I come up with a 21% difference just by analyzing the code.
Different systems, compiler versions, and even the same compiler version built by different folks/distros will give different instruction mixes; this is just one example of what you might get.
long ipow1(int base, int exp) {
long result = 1;
while (exp > 1) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
}
return result * base;
}
long ipow2(int base, int exp) {
long result = 1;
while (exp) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
}
return result;
}
0000000000000000 <ipow1>:
0: 83 fe 01 cmp $0x1,%esi
3: ba 01 00 00 00 mov $0x1,%edx
8: 7e 1d jle 27 <ipow1+0x27>
a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
10: 40 f6 c6 01 test $0x1,%sil
14: 74 07 je 1d <ipow1+0x1d>
16: 48 63 c7 movslq %edi,%rax
19: 48 0f af d0 imul %rax,%rdx
1d: d1 fe sar %esi
1f: 0f af ff imul %edi,%edi
22: 83 fe 01 cmp $0x1,%esi
25: 7f e9 jg 10 <ipow1+0x10>
27: 48 63 c7 movslq %edi,%rax
2a: 48 0f af c2 imul %rdx,%rax
2e: c3 retq
2f: 90 nop
0000000000000030 <ipow2>:
30: 85 f6 test %esi,%esi
32: b8 01 00 00 00 mov $0x1,%eax
37: 75 0a jne 43 <ipow2+0x13>
39: eb 19 jmp 54 <ipow2+0x24>
3b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
40: 0f af ff imul %edi,%edi
43: 40 f6 c6 01 test $0x1,%sil
47: 74 07 je 50 <ipow2+0x20>
49: 48 63 d7 movslq %edi,%rdx
4c: 48 0f af c2 imul %rdx,%rax
50: d1 fe sar %esi
52: 75 ec jne 40 <ipow2+0x10>
54: f3 c3 repz retq
Isolating the loops:
while (exp > 1) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
}
//if exp & 1 not true jump to 1d to skip
10: 40 f6 c6 01 test $0x1,%sil
14: 74 07 je 1d <ipow1+0x1d>
//result *= base
16: 48 63 c7 movslq %edi,%rax
19: 48 0f af d0 imul %rax,%rdx
//exp>>=1
1d: d1 fe sar %esi
//base *= base
1f: 0f af ff imul %edi,%edi
//while(exp>1) stayin the loop
22: 83 fe 01 cmp $0x1,%esi
25: 7f e9 jg 10 <ipow1+0x10>
Comparing something to zero normally saves you an instruction, and you can see that here:
while (exp) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
}
//base *= base
40: 0f af ff imul %edi,%edi
//if exp & 1 not true jump to skip
43: 40 f6 c6 01 test $0x1,%sil
47: 74 07 je 50 <ipow2+0x20>
//result *= base
49: 48 63 d7 movslq %edi,%rdx
4c: 48 0f af c2 imul %rdx,%rax
//exp>>=1
50: d1 fe sar %esi
//no need for a compare
52: 75 ec jne 40 <ipow2+0x10>
Your timing method is going to generate a lot of error/chaos. Depending on the beat frequency of the loop and the accuracy of the timer you can create a lot of gain in one and a lot of loss in another. This method normally gives better accuracy:
starttime = ...
for(rep=bignumber;rep;rep--)
{
//code under test
...
}
endtime = ...
total = endtime - starttime;
Of course if you are running this on an operating system timing it is going to have a decent amount of error in it anyway.
Also, you want to use volatile variables for your timing variables; it helps keep the compiler from re-arranging the order of execution (been there, seen that).
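A sketch of that structure using the question's own QueryPerformanceCounter setup (Windows-specific; the single timed region around the whole repetition loop and the volatile sink are the point, the rest of the names are mine):

    #include <stdio.h>
    #include <windows.h>

    long ipow1(int base, int exp);          /* assumed to be linked in from the question's code */

    int main(void)
    {
        LARGE_INTEGER freq, start, end;
        volatile long sink = 0;             /* volatile result keeps the calls from being optimized away */

        QueryPerformanceFrequency(&freq);   /* query the frequency once, outside the timed region */
        QueryPerformanceCounter(&start);
        for (long rep = 10000; rep; rep--) {
            for (int i = 0; i < 55; i++)
                for (int j = 0; j < 11; j++)
                    sink = ipow1(i, j);     /* or ipow2 */
        }
        QueryPerformanceCounter(&end);

        printf("total %.9f sec (%ld)\n",
               (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart, sink);
        return 0;
    }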
If we look at this from the perspective of the base multiplies:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
unsigned int mults;
long ipow1(int base, int exp) {
long result = 1;
while (exp > 1) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
mults++;
}
result *= base;
return result;
}
long ipow2(int base, int exp) {
long result = 1;
while (exp) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
mults++;
}
return result;
}
int main ( void )
{
int i;
int j;
mults = 0;
for (i = 0; i < 55; i++) {
for (j = 0; j < 11; j++) {
ipow1(i, j); // or ipow2
}
}
printf("mults %u\n",mults);
mults=0;
for (i = 0; i < 55; i++) {
for (j = 0; j < 11; j++) {
ipow2(i, j); // or ipow2
}
}
printf("mults %u\n",mults);
}
the counts come out as:
mults 1045
mults 1595
That is about 50% more for ipow2(). Actually it is not just the multiplies; you are going through the loop about 50% more times.
ipow1() gets a little back on the other multiplies:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
unsigned int mults;
long ipow1(int base, int exp) {
long result = 1;
while (exp > 1) {
if (exp & 1) mults++;
exp >>= 1;
base *= base;
}
mults++;
return result;
}
long ipow2(int base, int exp) {
long result = 1;
while (exp) {
if (exp & 1) mults++;
exp >>= 1;
base *= base;
}
return result;
}
int main ( void )
{
int i;
int j;
mults = 0;
for (i = 0; i < 55; i++) {
for (j = 0; j < 11; j++) {
ipow1(i, j); // or ipow2
}
}
printf("mults %u\n",mults);
mults=0;
for (i = 0; i < 55; i++) {
for (j = 0; j < 11; j++) {
ipow2(i, j); // or ipow2
}
}
printf("mults %u\n",mults);
}
ipow1() performs the result *= base multiply a different number of times (more, in fact) than ipow2():
mults 990
mults 935
Being long * int multiplies, these can be more expensive, but not enough to make up for the losses around the loop in ipow2().
Even without disassembling, you can make a rough guess at the operations/instructions you hope the compiler will use. The accounting here is for processors in general, not necessarily x86; some processors will run this code better than others (from the perspective of the number of instructions executed, not counting all the other factors).
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
unsigned int ops;
long ipow1(int base, int exp) {
long result = 1;
ops++; //result = immediate
while (exp > 1) {
ops++; // compare exp - 1
ops++; // conditional jump
//if (exp & 1)
ops++; //exp&1
ops++; //conditional jump
if (exp & 1)
{
result *= base;
ops++;
}
exp >>= 1;
ops++;
//ops+=?; //using a signed number can cost you this on some systems
//always use unsigned unless you have a specific reason to use signed.
//if this had been a short or char variable it might cost you even more
//operations
//if this needs to be signed it is what it is, just be aware of
//the cost
base *= base;
ops++;
}
result *= base;
ops++;
return result;
}
long ipow2(int base, int exp) {
long result = 1;
ops++;
while (exp) {
//ops++; //cmp exp-0, often optimizes out;
ops++; //conditional jump
//if (exp & 1)
ops++;
ops++;
if (exp & 1)
{
result *= base;
ops++;
}
exp >>= 1;
ops++;
//ops+=?; //right shifting a signed number
base *= base;
ops++;
}
return result;
}
int main ( void )
{
int i;
int j;
ops = 0;
for (i = 0; i < 55; i++) {
for (j = 0; j < 11; j++) {
ipow1(i, j); // or ipow2
}
}
printf("ops %u\n",ops);
ops=0;
for (i = 0; i < 55; i++) {
for (j = 0; j < 11; j++) {
ipow2(i, j); // or ipow2
}
}
printf("ops %u\n",ops);
}
Assuming I counted all the major operations and didn't unfairly give one function more than the other:
ops 7865
ops 9515
ipow2 is 21% slower using this analysis.
I think the big killer is the 50% more trips through the loop. Granted it is data dependent; you might find inputs in a benchmark that make the difference between the functions larger or smaller than the 25% you are seeing.

Your functions are not "logically equal".
while (exp > 1){...}
is NOT logically equal to
while (exp){...}
Why do you say it is?

Does this really generate the same assembly code? When I tried (with gcc 4.5.1 on OpenSuse 11.4, I will admit) I found slight differences.
ipow1.s:
cmpl $1, -24(%rbp)
jg .L4
movl -20(%rbp), %eax
cltq
imulq -8(%rbp), %rax
leave
ipow2.s:
cmpl $0, -24(%rbp)
jne .L4
movq -8(%rbp), %rax
leave
Perhaps the processor's branch prediction is just more effective with jg than with jne? It seems unlikely that one branch instruction would run 25% faster than another (especially when cmpl has done most of the heavy lifting).

Compressing a 'char' array using bit packing in C [closed]

I have a large array (around 1 MB) of type unsigned char (i.e. uint8_t). I know that the bytes in it can only have one of 5 values (0, 1, 2, 3, 4). Moreover, we do not need to preserve '3's from the input; they can be safely lost when we encode/decode.
So I figured bit packing would be the simplest way to compress it: every byte can be converted to 2 bits (00, 01, 10, 11).
As mentioned, all elements of value 3 can be removed (i.e. saved as 0), which gives me the option to save '4' as '3'. While reconstructing (decompressing) I restore those 3's back to 4's.
I wrote a small function for the compression, but I feel it has too many operations and is just not efficient enough. Any code snippets or suggestions on how to make it more efficient or faster (hopefully keeping the readability) will be very much appreciated.
/// Compress by packing ...
void compressByPacking (uint8_t* out, uint8_t* in, uint32_t length)
{
for (int loop = 0; loop < length/4; loop ++, in += 4, out++)
{
uint8_t temp[4];
for (int small_loop = 0; small_loop < 4; small_loop++)
{
temp[small_loop] = *in; // Load into local variable
if (temp[small_loop] == 3) // 3's are discarded
temp[small_loop] = 0;
else if (temp[small_loop] == 4) // and 4's are converted to 3
temp[small_loop] = 3;
} // end small loop
// Pack the bits into write pointer
*out = (uint8_t)((temp[0] & 0x03) << 6) |
((temp[1] & 0x03) << 4) |
((temp[2] & 0x03) << 2) |
((temp[3] & 0x03));
} // end loop
}
Edited to make the problem clearer, as it looked like I was trying to save 5 values into 2 bits. Thanks to @Brian Cain for the suggested wording.
Cross-posted on Code Review.
Your function has a bug: when loading the small array, you should write:
temp[small_loop] = in[small_loop];
You can get rid of the tests with a lookup table, either on the source data, or more efficiently on some intermediary result:
In the code below, I use a small table lookup5 to convert the values 0,1,2,3,4 to 0,1,2,0,3, and a larger one to map groups of 4 3-bit values from the source array to the corresponding byte value in the packed format:
#include <stdint.h>
/// Compress by packing ...
void compressByPacking0(uint8_t *out, uint8_t *in, uint32_t length) {
static uint8_t lookup[4096];
static const uint8_t lookup5[8] = { 0, 1, 2, 0, 3, 0, 0, 0 };
static int lookup_initialized;
if (!lookup_initialized) {
/* initialize lookup table once; lookup[0] is legitimately 0 even after
   initialization, so it cannot serve as the "already initialized" marker */
lookup_initialized = 1;
for (int i = 0; i < 4096; i++) {
lookup[i] = (lookup5[(i >> 0) & 7] << 0) +
(lookup5[(i >> 3) & 7] << 2) +
(lookup5[(i >> 6) & 7] << 4) +
(lookup5[(i >> 9) & 7] << 6);
}
}
for (; length >= 4; length -= 4, in += 4, out++) {
*out = lookup[(in[0] << 9) + (in[1] << 6) + (in[2] << 3) + (in[3] << 0)];
}
uint8_t last = 0;
switch (length) {
case 3:
last |= lookup5[in[2]] << 2;
/* fall through */
case 2:
last |= lookup5[in[1]] << 4;
/* fall through */
case 1:
last |= lookup5[in[0]] << 6;   /* same bit positions as in the main loop */
*out = last;
break;
}
}
Notes:
The code assumes the array does not contain values outside the specified range. Extra protection against spurious input can be achieved at a minimal cost.
The dummy << 0 shifts are here only for symmetry and compile to no extra code.
The lookup table could be initialized statically, via a build time script or a set of macros.
You might want to unroll this loop 4 or more times, or let the compiler decide.
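Regarding the unrolling note, here is a rough sketch of what a manual 4x unroll of the main loop could look like (my illustration, not benchmarked; the table is passed in only to keep the sketch self-contained):

    #include <stdint.h>

    void compressByPacking0_unrolled(uint8_t *out, const uint8_t *in, uint32_t length,
                                     const uint8_t lookup[4096]) {
        for (; length >= 16; length -= 16, in += 16, out += 4) {
            out[0] = lookup[(in[0]  << 9) + (in[1]  << 6) + (in[2]  << 3) + in[3]];
            out[1] = lookup[(in[4]  << 9) + (in[5]  << 6) + (in[6]  << 3) + in[7]];
            out[2] = lookup[(in[8]  << 9) + (in[9]  << 6) + (in[10] << 3) + in[11]];
            out[3] = lookup[(in[12] << 9) + (in[13] << 6) + (in[14] << 3) + in[15]];
        }
        for (; length >= 4; length -= 4, in += 4, out++) {
            /* leftover groups of 4, same as the original loop */
            *out = lookup[(in[0] << 9) + (in[1] << 6) + (in[2] << 3) + in[3]];
        }
        /* a final group of 1-3 bytes would be handled as in compressByPacking0 */
    }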
You could also use this simpler solution with a smaller lookup table accessed more often. Careful benchmarking will tell you which is more efficient on your target system:
/// Compress by packing ...
void compressByPacking1(uint8_t *out, uint8_t *in, uint32_t length) {
static const uint8_t lookup[4][5] = {
{ 0 << 6, 1 << 6, 2 << 6, 0 << 6, 3 << 6 },
{ 0 << 4, 1 << 4, 2 << 4, 0 << 4, 3 << 4 },
{ 0 << 2, 1 << 2, 2 << 2, 0 << 2, 3 << 2 },
{ 0 << 0, 1 << 0, 2 << 0, 0 << 0, 3 << 0 },
};
for (; length >= 4; length -= 4, in += 4, out++) {
*out = lookup[0][in[0]] + lookup[1][in[1]] +
lookup[2][in[2]] + lookup[3][in[3]];
}
uint8_t last = 0;
switch (length) {
case 3:
last |= lookup[2][in[2]];
/* fall through */
case 2:
last |= lookup[1][in[1]];
/* fall through */
case 1:
last |= lookup[0][in[0]];
*out = last;
break;
}
}
Here is yet another approach, without any tables:
/// Compress by packing ...
void compressByPacking2(uint8_t *out, uint8_t *in, uint32_t length) {
#define BITS ((1 << 2) + (2 << 4) + (3 << 8))
for (; length >= 4; length -= 4, in += 4, out++) {
*out = ((BITS << 6 >> (in[0] + in[0])) & 0xC0) +
((BITS << 4 >> (in[1] + in[1])) & 0x30) +
((BITS << 2 >> (in[2] + in[2])) & 0x0C) +
((BITS << 0 >> (in[3] + in[3])) & 0x03);
}
uint8_t last = 0;
switch (length) {
case 3:
last |= (BITS << 2 >> (in[2] + in[2])) & 0x0C;
/* fall through */
case 2:
last |= (BITS << 4 >> (in[1] + in[1])) & 0x30;
/* fall through */
case 1:
last |= (BITS << 6 >> (in[0] + in[0])) & 0xC0;
*out = last;
break;
}
}
Here is a comparative benchmark on my system, a MacBook Pro running OS X, with clang -O2:
compressByPacking(1MB) -> 0.867ms
compressByPacking0(1MB) -> 0.445ms
compressByPacking1(1MB) -> 0.538ms
compressByPacking2(1MB) -> 0.824ms
The compressByPacking0 variant is fastest, almost twice as fast as your code.
It is a little disappointing, but the code is portable. You might squeeze more performance using handcoded SSE optimizations.
I have a large array (around 1 MB)
Either this is a typo, your target is seriously aging, or this compression operation is invoked repeatedly in the critical path of your application.
Any code snippets or suggestion on how to make it more efficient or
faster (hopefully keeping the readability) will be very much helpful.
In general, you will find the best information by empirically measuring the performance and inspecting the generated code. Using profilers to determine what code is executing, where there are cache misses and pipeline stalls -- these can help you tune your algorithm.
For example, you chose a stride of 4 elements. Is that just because you are mapping four input elements to a single byte? Can you use native SIMD instructions/intrinsics to operate on more elements at a time?
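For illustration, a minimal SSSE3 sketch of that idea (my code, not the OP's). It assumes every input byte is in 0..4, length is a multiple of 16, and a little-endian x86 target with SSSE3 (compile with -mssse3); the tail handling is omitted:

    #include <stdint.h>
    #include <string.h>
    #include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8, _mm_maddubs_epi16 */

    void compressByPacking_ssse3(uint8_t *out, const uint8_t *in, uint32_t length)
    {
        /* maps input values 0,1,2,3,4 to the 2-bit codes 0,1,2,0,3 */
        const __m128i map = _mm_setr_epi8(0, 1, 2, 0, 3, 0, 0, 0,
                                          0, 0, 0, 0, 0, 0, 0, 0);
        const __m128i w1 = _mm_set1_epi16(0x0104);     /* byte weights (4,1) per pair */
        const __m128i w2 = _mm_set1_epi32(0x00010010); /* word weights (16,1) per quad */
        const __m128i gather = _mm_setr_epi8(0, 4, 8, 12, -128, -128, -128, -128,
                                             -128, -128, -128, -128, -128, -128, -128, -128);

        for (; length >= 16; length -= 16, in += 16, out += 4) {
            __m128i v = _mm_loadu_si128((const __m128i *)in);
            v = _mm_shuffle_epi8(map, v);                     /* 16 values -> 16 2-bit codes */
            __m128i pairs = _mm_maddubs_epi16(v, w1);         /* c0*4 + c1 for each pair */
            __m128i quads = _mm_madd_epi16(pairs, w2);        /* pair0*16 + pair1 per group of 4 */
            __m128i packed = _mm_shuffle_epi8(quads, gather); /* keep the low byte of each group */
            uint32_t four = (uint32_t)_mm_cvtsi128_si32(packed);
            memcpy(out, &four, 4);                            /* 4 output bytes, in order */
        }
    }

Each group of four codes ends up as c0<<6 | c1<<4 | c2<<2 | c3, the same layout as the scalar code.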
Also, how are you compiling for your target and how well is your compiler able to optimize your code?
Let's ask clang whether it finds any problems trying to optimize your code:
$ clang -fvectorize -O3 -Rpass-missed=licm -c tryme.c
tryme.c:11:28: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
temp[small_loop] = *in; // Load into local variable
^
tryme.c:21:25: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
*out = (uint8_t)((temp[0] & 0x03) << 6) |
^
tryme.c:22:25: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
((temp[1] & 0x03) << 4) |
^
tryme.c:23:25: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
((temp[2] & 0x03) << 2) |
^
tryme.c:24:25: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
((temp[3] & 0x03));
^
I'm not sure but maybe alias analysis is what makes it think it can't move this load. Try playing with __restrict__ to see if that has any effect.
$ clang -fvectorize -O3 -Rpass-analysis=loop-vectorize -c tryme.c
tryme.c:13:13: remark: loop not vectorized: loop contains a switch statement [-Rpass-analysis=loop-vectorize]
if (temp[small_loop] == 3) // 3's are discarded
I can't think of anything obvious that you can do about this one unless you change your algorithm. If the compression ratio is satisfactory without deleting the 3s, you could perhaps eliminate this.
So what's the generated code look like? Take a look below. How could you write it better by hand? If you can write it better yourself, either do that or feed it back into your algorithm to help guide the compiler.
Does the compiled code take advantage of your target's instruction set and registers?
Most importantly -- try executing it and see where you're spending the most cycles. Stalls from branch misprediction, unaligned loads? Maybe you can do something about those. Use what you know about the frequency of your input data to give the compiler hints about the branches in your encoder.
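For example (my sketch, GCC/Clang-specific), __builtin_expect can encode that kind of frequency knowledge, say if value 3 is known to be rare in the input:

    #include <stdint.h>

    #define unlikely(x) __builtin_expect(!!(x), 0)

    /* hypothetical helper: maps one input value to its 2-bit code,
       telling the compiler the 3 branch is rarely taken */
    uint8_t map_value(uint8_t v)
    {
        if (unlikely(v == 3))   /* 3's are discarded */
            return 0;
        if (v == 4)             /* 4's are stored as code 3 */
            return 3;
        return v;               /* 0, 1, 2 map to themselves */
    }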
$ objdump -d --source tryme.o
...
0000000000000000 <compressByPacking>:
#include <stdint.h>
void compressByPacking (uint8_t* out, uint8_t* in, uint32_t length)
{
for (int loop = 0; loop < length/4; loop ++, in += 4, out++)
0: c1 ea 02 shr $0x2,%edx
3: 0f 84 86 00 00 00 je 8f <compressByPacking+0x8f>
9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
{
uint8_t temp[4];
for (int small_loop = 0; small_loop < 4; small_loop++)
{
temp[small_loop] = *in; // Load into local variable
10: 8a 06 mov (%rsi),%al
if (temp[small_loop] == 3) // 3's are discarded
12: 3c 04 cmp $0x4,%al
14: 74 3a je 50 <compressByPacking+0x50>
16: 3c 03 cmp $0x3,%al
18: 41 88 c0 mov %al,%r8b
1b: 75 03 jne 20 <compressByPacking+0x20>
1d: 45 31 c0 xor %r8d,%r8d
20: 3c 04 cmp $0x4,%al
22: 74 33 je 57 <compressByPacking+0x57>
24: 3c 03 cmp $0x3,%al
26: 88 c1 mov %al,%cl
28: 75 02 jne 2c <compressByPacking+0x2c>
2a: 31 c9 xor %ecx,%ecx
2c: 3c 04 cmp $0x4,%al
2e: 74 2d je 5d <compressByPacking+0x5d>
30: 3c 03 cmp $0x3,%al
32: 41 88 c1 mov %al,%r9b
35: 75 03 jne 3a <compressByPacking+0x3a>
37: 45 31 c9 xor %r9d,%r9d
3a: 3c 04 cmp $0x4,%al
3c: 74 26 je 64 <compressByPacking+0x64>
3e: 3c 03 cmp $0x3,%al
40: 75 24 jne 66 <compressByPacking+0x66>
42: 31 c0 xor %eax,%eax
44: eb 20 jmp 66 <compressByPacking+0x66>
46: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4d: 00 00 00
50: 41 b0 03 mov $0x3,%r8b
53: 3c 04 cmp $0x4,%al
55: 75 cd jne 24 <compressByPacking+0x24>
57: b1 03 mov $0x3,%cl
59: 3c 04 cmp $0x4,%al
5b: 75 d3 jne 30 <compressByPacking+0x30>
5d: 41 b1 03 mov $0x3,%r9b
60: 3c 04 cmp $0x4,%al
62: 75 da jne 3e <compressByPacking+0x3e>
64: b0 03 mov $0x3,%al
temp[small_loop] = 3;
} // end small loop
// Pack the bits into write pointer
*out = (uint8_t)((temp[0] & 0x03) << 6) |
66: 41 c0 e0 06 shl $0x6,%r8b
((temp[1] & 0x03) << 4) |
6a: c0 e1 04 shl $0x4,%cl
6d: 80 e1 30 and $0x30,%cl
temp[small_loop] = 3;
} // end small loop
// Pack the bits into write pointer
*out = (uint8_t)((temp[0] & 0x03) << 6) |
70: 44 08 c1 or %r8b,%cl
((temp[1] & 0x03) << 4) |
((temp[2] & 0x03) << 2) |
73: 41 c0 e1 02 shl $0x2,%r9b
77: 41 80 e1 0c and $0xc,%r9b
((temp[3] & 0x03));
7b: 24 03 and $0x3,%al
} // end small loop
// Pack the bits into write pointer
*out = (uint8_t)((temp[0] & 0x03) << 6) |
((temp[1] & 0x03) << 4) |
7d: 44 08 c8 or %r9b,%al
((temp[2] & 0x03) << 2) |
80: 08 c8 or %cl,%al
temp[small_loop] = 3;
} // end small loop
// Pack the bits into write pointer
*out = (uint8_t)((temp[0] & 0x03) << 6) |
82: 88 07 mov %al,(%rdi)
#include <stdint.h>
void compressByPacking (uint8_t* out, uint8_t* in, uint32_t length)
{
for (int loop = 0; loop < length/4; loop ++, in += 4, out++)
84: 48 83 c6 04 add $0x4,%rsi
88: 48 ff c7 inc %rdi
8b: ff ca dec %edx
8d: 75 81 jne 10 <compressByPacking+0x10>
((temp[1] & 0x03) << 4) |
((temp[2] & 0x03) << 2) |
((temp[3] & 0x03));
} // end loop
}
8f: c3 retq
In all the excitement about performance, functionality got overlooked. The code is broken.
// temp[small_loop] = *in; // Load into local variable
temp[small_loop] = in[small_loop];
Alternative:
How about a simple tight loop?
Use const and restrict to allow various optimizations.
void compressByPacking1(uint8_t* restrict out, const uint8_t* restrict in,
uint32_t length) {
static const uint8_t t[5] = { 0, 1, 2, 0, 3 };
uint32_t length4 = length / 4;
unsigned v = 0;
uint32_t i;
for (i = 0; i < length4; i++) {
for (unsigned j=0; j < 4; j++) {
v <<= 2;
v |= t[*in++];
}
out[i] = (uint8_t) v;
}
if (length & 3) {
v = 0;
for (unsigned j = 0; j < 4; j++) {
v <<= 2;
if (j < (length & 3)) {
v |= t[*in++];
}
}
out[i] = (uint8_t) v;
}
}
Tested and found this code to be about 270% as fast (41 vs 15) (YMMV).
Tested and found to produce the same output as the OP's (corrected) code.
Update: Tested
The unsafe version is the fastest, faster than the ones in the other answers. Tested with VS2017.
const uint8_t table[4][5] =
{ { 0 << 0,1 << 0,2 << 0,0 << 0,3 << 0 },
{ 0 << 2,1 << 2,2 << 2,0 << 2,3 << 2 },
{ 0 << 4,1 << 4,2 << 4,0 << 4,3 << 4 },
{ 0 << 6,1 << 6,2 << 6,0 << 6,3 << 6 },
};
void code(uint8_t *in, uint8_t *out, uint32_t len)
{
memset(out, 0, len / 4 + 1);
for (uint32_t i = 0; i < len; i++)
out[i / 4] |= table[i & 3][in[i] % 5];
}
void code_unsafe(uint8_t *in, uint8_t *out, uint32_t len)
{
for (uint32_t i = 0; i < len; i += 4, in += 4, out++)
{
*out = table[0][in[0]] | table[1][in[1]] | table[2][in[2]] | table[3][in[3]];
}
}
To check what code gets generated, it is enough to compile it, even online:
https://godbolt.org/g/Z75NQV
These are my small, very simple coding functions, included just for comparison of the compiler-generated code; they are not tested.
Does this look clearer?
#include <assert.h>
#include <stdint.h>
void compressByPacking (uint8_t* out, uint8_t* in, uint32_t length)
{
assert( 0 == length % 4 );
for (uint32_t loop = 0; loop < length; loop += 4)
{
uint8_t temp = 0;
for (int small_loop = 0; small_loop < 4; small_loop++)
{
uint8_t inv = *in++; // get next input value and advance the read pointer
switch(inv)
{
case 0: // encode as 00
case 3: // 3 is dropped, also encoded as 00
break;
case 1:
temp |= 1 << (6 - small_loop*2); // 1 encoded as '01'
break;
case 2:
temp |= 2 << (6 - small_loop*2); // 2 encoded as '10'
break;
case 4:
temp |= 3 << (6 - small_loop*2); // 4 encoded as '11'
break;
default:
assert(0);
}
} // end inner loop; the first value of each group lands in the top two bits, as in the question
*out++ = temp; // advance the write pointer
} // end outer loop
}

Operators * and + produce wrong result in Digital Mars

I am trying to calculate the number which produces the longest Collatz sequence. But here is a strange problem: 3n+1 becomes 38654705674 when n is 3. I do not see an error. Here is the full code:
/* 6.c -- calculates Longest Collatz sequence */
#include <stdio.h>
long long get_collatz_length(long long);
int main(void)
{
long long i;
long long current, current_count, count;
current_count = 1;
current = 1;
for(i=2;i<1000000;i++)
{
// works fine when i is 2; the next line takes an eternity when i is 3
count = get_collatz_length(i);
if(current_count <= count)
{
current = i;
current_count = count;
}
}
printf("%lld %lld\n", current, current_count);
return 0;
}
long long get_collatz_length(long long num)
{
long long count;
count = 1;
while(num != 1)
{
printf("%lld\n", num);
if(num%2)
{
num = num*3+1; // here it is;
}
else
{
num/=2;
}
count++;
}
puts("");
return count;
}
It seems to be a bug in the dmc compiler, which fails to handle the long long type correctly. Here is a narrowed test case:
#include <stdio.h>
int main(void)
{
long long num = 3LL;
/*printf("%lld\n", num);*/
num = num * 3LL;
char *t = (char *) &num;
for (int i = 0; i < 8; i++)
printf("%x\t", t[i]);
putchar('\n');
/*printf("%lld\n", num);*/
return 0;
}
It produces (little endian, so 0x900000009 == 38 654 705 673):
9 0 0 0 9 0 0 0
From the disassembly it looks like it stores the 64-bit integer in two 32-bit registers:
.data:0x000000be 6bd203 imul edx,edx,0x3
.data:0x000000c1 6bc803 imul ecx,eax,0x3
.data:0x000000c4 03ca add ecx,edx
.data:0x000000c6 ba03000000 mov edx,0x3
.data:0x000000cb f7e2 mul edx
.data:0x000000cd 03d1 add edx,ecx
.data:0x000000cf 31c0 xor eax,eax
I additionally tested it with the objconv tool, which just confirms my initial diagnosis:
#include <stdio.h>
void mul(void)
{
long long a;
long long c;
a = 5LL;
c = a * 3LL;
printf("%llx\n", c);
}
int main(void)
{
mul();
return 0;
}
disassembly (single section):
>objconv.exe -fmasm ..\dm\bin\check.obj
_mul PROC NEAR
mov eax, 5 ; 0000 _ B8, 00000005
cdq ; 0005 _ 99
imul edx, edx, 3 ; 0006 _ 6B. D2, 03
imul ecx, eax, 3 ; 0009 _ 6B. C8, 03
add ecx, edx ; 000C _ 03. CA
mov edx, 3 ; 000E _ BA, 00000003
mul edx ; 0013 _ F7. E2
add edx, ecx ; 0015 _ 03. D1
push edx ; 0017 _ 52
push eax ; 0018 _ 50
push offset FLAT:?_001 ; 0019 _ 68, 00000000(segrel)
call _printf ; 001E _ E8, 00000000(rel)
add esp, 12 ; 0023 _ 83. C4, 0C
ret ; 0026 _ C3
_mul ENDP
Note that mul edx operates implicitly on eax. The result is stored in both registers: the higher part (in this case 0) is stored in edx, while the lower part is in eax.
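For comparison, here is how the 64-bit multiply ought to decompose into 32-bit operations (my sketch in C, mirroring the correct instruction sequence; as far as I can tell, the generated code above goes wrong by also folding the truncated low product into the high word, which is where the extra 9 in the upper half comes from):

    #include <stdio.h>
    #include <stdint.h>

    /* hypothetical helper showing the correct lowering of a 64-bit multiply by 3
       on a 32-bit target: the widening lo*3 supplies the carry into the high word,
       and hi*3 is added to the high word only */
    static uint64_t mul64_by3(uint64_t a)
    {
        uint32_t lo = (uint32_t)a;
        uint32_t hi = (uint32_t)(a >> 32);
        uint64_t lo3 = (uint64_t)lo * 3;            /* mul: edx:eax = lo*3 */
        uint32_t hi3 = hi * 3;                      /* imul: hi*3 (truncated) */
        return lo3 + ((uint64_t)hi3 << 32);         /* add hi*3 into the high word */
    }

    int main(void)
    {
        printf("%llx\n", (unsigned long long)mul64_by3(3));   /* prints 9, not 900000009 */
        return 0;
    }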

Find memory address in hex string in kernel module

I'm writing a kernel module to find the memory address of do_debug (0xffffffff8134f709) by first searching for the hex bytes next to the address. I'm not sure I am using the correct hex bytes: "\xe8\x61\x07\x00\x00" (I wish to stick to C and not assembly.)
struct desc_ptr idt_register;
store_idt(&idt_register);
printk("idt_register.address: %lx\n", idt_register.address); // same as in /boot/System.map*
gate_desc *idt_table = (gate_desc *)idt_register.address;
unsigned char *debug = (unsigned char *)gate_offset(idt_table[0x1]);
printk("debug: %lx\n", (unsigned long)debug); // same as in /boot/System.map*
int count = 0;
while(count < 150){
if( (*(debug) == 0xe8) && (*(debug + 1) == 0x61) && (*(debug + 2) == 0x07) && (*(debug + 3) == 0x00) && (*(debug + 4) == 0x00) ){
debug += 5;
unsigned long *do_debug = (unsigned long *)(0xffffffff00000000 | *((unsigned long *)(debug)));
if((unsigned long)do_debug != 0xffffffff8134f709){
printk("do_debug: %lx\n", (unsigned long)do_debug); // wrong address !
return;
}
break;
}
debug++;
count++;
}
gdb:
gdb ./vmlinux-3.2.0-4-amd64
...
(gdb) info line debug
Line 54 of "/build/linux-s5x2oE/linux-3.2.46/drivers/pci/hotplug/pci_hotplug_core.c" is at address 0xffffffff811cf02b <power_read_file+2> but contains no code.
Line 1322 of "/build/linux-s5x2oE/linux-3.2.46/arch/x86/kernel/entry_64.S"
starts at address 0xffffffff8134ef80 and ends at 0xffffffff8134efc0.
(gdb) disas 0xffffffff8134ef80,0xffffffff8134efc0
Dump of assembler code from 0xffffffff8134ef80 to 0xffffffff8134efc0:
0xffffffff8134ef80: callq *0x2c687a(%rip) # 0xffffffff81615800
0xffffffff8134ef86: pushq $0xffffffffffffffff
0xffffffff8134ef88: sub $0x78,%rsp
0xffffffff8134ef8c: callq 0xffffffff8134ed40
0xffffffff8134ef91: mov %rsp,%rdi
0xffffffff8134ef94: xor %esi,%esi
0xffffffff8134ef96: subq $0x1000,%gs:0x1137c
0xffffffff8134efa3: callq 0xffffffff8134f709 <do_debug>
0xffffffff8134efa8: addq $0x1000,%gs:0x1137c
0xffffffff8134efb5: jmpq 0xffffffff8134f160
0xffffffff8134efba: nopw 0x0(%rax,%rax,1)
End of assembler dump.
(gdb) x/i 0xffffffff8134efa3
0xffffffff8134efa3: callq 0xffffffff8134f709 <do_debug>
(gdb) x/xw 0xffffffff8134efa3
0xffffffff8134efa3: 0x000761e8
(gdb)
readelf:
ffffffff8134ef96: 65 48 81 2c 25 7c 13 subq $0x1000,%gs:0x1137c
ffffffff8134ef9d: 01 00 00 10 00 00
ffffffff8134efa3: e8 61 07 00 00 callq ffffffff8134f709 <do_debug>
ffffffff8134efa8: 65 48 81 04 25 7c 13 addq $0x1000,%gs:0x1137c
ffffffff8134efaf: 01 00 00 10 00 00
EDIT:
(gdb) print do_debug
$1 = {void (struct pt_regs *, long int)} 0xffffffff8134f709 <do_debug>
(gdb)
I am not familiar with the Linux kernel.
In the while loop, when you go into the first if statement, debug then points to ffffffff8134efa8:
ffffffff8134efa8: 65 48 81 04 25 7c 13 addq $0x1000,%gs:0x1137c # debug point here
ffffffff8134efaf: 01 00 00 10 00 00
the next statement
unsigned long *do_debug = (unsigned long *)(0xffffffff00000000 |
*((unsigned long *)(debug)));
will produce this result:
do_debug = (unsigned long *)(0xffffffff00000000 | 0x01137c2504814865) = 0xffffffff04814865
If you want to get the do_debug address, you should do this instead:
if( (*(debug) == 0xe8) && (*(debug + 1) == 0x61) && (*(debug + 2) == 0x07) && (*(debug + 3) == 0x00) && (*(debug + 4) == 0x00) ){
//debug += 5;
int32_t offset = *(int32_t *)(debug+1); //the signed rel32 of the call; it is stored little-endian, same as x86, so no ntohl byte swap is needed
unsigned long *do_debug = (unsigned long *)((debug+5)+offset); // next_code_instruction + offset
//debug+5 point to ffffffff8134efa8
//offset is 0x761
//so ffffffff8134efa8+0x761 = 0xffffffff8134f709
if((unsigned long)do_debug != 0xffffffff8134f709){
printk("do_debug: %lx\n", (unsigned long)do_debug); // wrong address !
return;
}
break;
}
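A small user-space illustration (my sketch, using the addresses from the readelf dump above, and assuming a little-endian host like x86) of how the e8 rel32 call target is computed:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        /* bytes of "callq ffffffff8134f709" from the readelf output */
        unsigned char insn[5] = { 0xe8, 0x61, 0x07, 0x00, 0x00 };
        uint64_t insn_addr = 0xffffffff8134efa3ULL;

        int32_t rel;
        memcpy(&rel, insn + 1, sizeof rel);               /* rel32, little-endian: 0x761 */
        uint64_t target = insn_addr + 5 + (int64_t)rel;   /* next instruction + offset */

        printf("target: %llx\n", (unsigned long long)target);   /* ffffffff8134f709 */
        return 0;
    }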

Converting a floating point number in C to IEEE standard

I am trying to write code in C that generates random integers, performs a simple calculation, and then prints the values to a file in IEEE format, but I am unable to do so. Please help.
I am unable to print the values in hexadecimal/binary, which is very important.
If I typecast the values in fprintf, I am getting this error: expected expression before 'double'.
int main (int argc, char *argv) {
int limit = 20 ; double a[limit], b[limit]; //Inputs
double result[limit] ; int i , k ; //Outputs
printf("limit = %d", limit ); double q;
for (i= 0 ; i< limit;i++)
{
a[i]= rand();
b[i]= rand();
printf ("A= %x B = %x\n",a[i],b[i]);
}
char op;
printf("Enter the operand used : add,subtract,multiply,divide\n");
scanf ("%c", &op); switch (op) {
case '+': {
for (k= 0 ; k< limit ; k++)
{
result [k]= a[k] + b[k];
printf ("result= %f\n",result[k]);
}
}
break;
case '*': {
for (k= 0 ; k< limit ; k++)
{
result [k]= a[k] * b[k];
}
}
break;
case '/': {
for (k= 0 ; k< limit ; k++)
{
result [k]= a[k] / b[k];
}
}
break;
case '-': {
for (k= 0 ; k< limit ; k++)
{
result [k]= a[k] - b[k];
}
}
break; }
FILE *file; file = fopen("tb.txt","w"); for(k=0;k<limit;k++) {
fprintf (file,"%x\n%x\n%x\n\n",double(a[k]),double(b[k]),double(result[k]) );
}
fclose(file); /*done!*/
}
If your C compiler supports IEEE-754 floating point format directly (because the CPU supports it) or fully emulates it, you may be able to print doubles simply as bytes. And that is the case for the x86/64 platform.
Here's an example:
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <float.h>
void PrintDoubleAsCBytes(double d, FILE* f)
{
unsigned char a[sizeof(d)];
unsigned i;
memcpy(a, &d, sizeof(d));
for (i = 0; i < sizeof(a); i++)
fprintf(f, "%0*X ", (CHAR_BIT + 3) / 4, a[i]);
}
int main(void)
{
PrintDoubleAsCBytes(0.0, stdout); puts("");
PrintDoubleAsCBytes(0.5, stdout); puts("");
PrintDoubleAsCBytes(1.0, stdout); puts("");
PrintDoubleAsCBytes(2.0, stdout); puts("");
PrintDoubleAsCBytes(-2.0, stdout); puts("");
PrintDoubleAsCBytes(DBL_MIN, stdout); puts("");
PrintDoubleAsCBytes(DBL_MAX, stdout); puts("");
PrintDoubleAsCBytes(INFINITY, stdout); puts("");
#ifdef NAN
PrintDoubleAsCBytes(NAN, stdout); puts("");
#endif
return 0;
}
Output (ideone):
00 00 00 00 00 00 00 00
00 00 00 00 00 00 E0 3F
00 00 00 00 00 00 F0 3F
00 00 00 00 00 00 00 40
00 00 00 00 00 00 00 C0
00 00 00 00 00 00 10 00
FF FF FF FF FF FF EF 7F
00 00 00 00 00 00 F0 7F
00 00 00 00 00 00 F8 7F
If IEEE-754 isn't supported directly, the problem becomes more complex. However, it can still be solved.
Here are a few related questions and answers that can help:
How do I handle byte order differences when reading/writing floating-point types in C?
Is there a tool to know whether a value has an exact binary representation as a floating point variable?
C dynamically printf double, no loss of precision and no trailing zeroes
And, of course, all the IEEE-754 related info can be found in Wikipedia.
Try this in your fprintf part:
fprintf (file,"%x\n%x\n%x\n\n",*((int*)(&a[k])),*((int*)(&b[k])),*((int*)(&result[k])));
That reinterprets the double as an integer so its raw IEEE representation is printed.
But if you're running your program on a machine on which int is 32-bit and double is 64-bit, I suppose you should use:
fprintf (file,"%x%x\n%x%x\n%x%x\n\n",*((int*)(&a[k])),*((int*)(&a[k])+1),*((int*)(&b[k])),*((int*)(&b[k])+1),*((int*)(&result[k])),*((int*)(&result[k])+1));
In C, there are two ways to get at the bytes in a float value: a pointer cast, or a union. I recommend a union.
I just tested this code with GCC and it worked:
#include <stdio.h>
typedef unsigned char BYTE;
int
main()
{
float f = 3.14f;
int i = sizeof(float) - 1;
BYTE *p = (BYTE *)(&f);
p[i] = p[i] | 0x80; // set the sign bit
printf("%f\n", f); // prints -3.140000
}
We are taking the address of the variable f, then assigning it to a pointer to BYTE (unsigned char). We use a cast to force the pointer.
If you try to compile code with optimizations enabled and you do the pointer cast shown above, you might run into the compiler complaining about "type-punned pointer" issues. I'm not exactly sure when you can do this and when you can't. But you can always use the other way to get at the bits: put the float into a union with an array of bytes.
#include <stdio.h>
typedef unsigned char BYTE;
typedef union
{
float f;
BYTE b[sizeof(float)];
} UFLOAT;
int
main()
{
UFLOAT u;
int const i = sizeof(float) - 1;
u.f = 3.14f;
u.b[i] = u.b[i] | 0x80; // set the sign bit
printf("%f\n", u.f); // prints -3.140000
}
What definitely will not work is to try to cast the float value directly to an unsigned integer or something like that. C doesn't know you just want to reinterpret the bits, so it converts the value, discarding the fraction.
float f = 3.14;
unsigned int i = (unsigned int)f;
if (i == 3)
printf("yes\n"); // will print "yes"
P.S. Discussion of "type-punned" pointers here:
Dereferencing type-punned pointer will break strict-aliasing rules

multiplication using SSE (x*x*x)+(y*y*y)

I'm trying to optimize this function using SIMD but I don't know where to start.
long sum(int x,int y)
{
return x*x*x+y*y*y;
}
The disassembled function looks like this:
4007a0: 48 89 f2 mov %rsi,%rdx
4007a3: 48 89 f8 mov %rdi,%rax
4007a6: 48 0f af d6 imul %rsi,%rdx
4007aa: 48 0f af c7 imul %rdi,%rax
4007ae: 48 0f af d6 imul %rsi,%rdx
4007b2: 48 0f af c7 imul %rdi,%rax
4007b6: 48 8d 04 02 lea (%rdx,%rax,1),%rax
4007ba: c3 retq
4007bb: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
The calling code looks like this:
do {
for (i = 0; i < maxi; i++) {
j = nextj[i];
long sum = cubeSum(i,j);
while (sum <= p) {
long x = sum & (psize - 1);
int flag = table[x];
if (flag <= guard) {
table[x] = guard+1;
} else if (flag == guard+1) {
table[x] = guard+2;
count++;
}
j++;
sum = cubeSum(i,j);
}
nextj[i] = j;
}
p += psize;
guard += 3;
} while (p <= n);
Fill one SSE register with (x|y|0|0) (since each SSE register holds four 32-bit elements). Let's call it r1.
then make a copy of that register to another register r2
Do r2 * r1, storing the result in, say r2.
Do r2 * r1 again storing the result in r2
Now in r2 you have (x*x*x|y*y*y|0|0)
Unpack the lower two elements of r2 into separate registers, add them (SSE3 has horizontal add instructions, but only for floats and doubles).
In the end, I'd actually be surprised if this turned out to be any faster than the simple code the compiler has already generated for you. SIMD is more useful when you have arrays of data you want to operate on.
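A sketch of those steps with intrinsics (my illustration, names are mine). It assumes SSE4.1 for _mm_mullo_epi32, extracts the two lanes at the end instead of using unpack/hadd, and its 32-bit lanes overflow for larger inputs than 64-bit scalar arithmetic would:

    #include <smmintrin.h>   /* SSE4.1: _mm_mullo_epi32, _mm_extract_epi32 */
    #include <stdio.h>

    long cube_sum_sse(int x, int y)
    {
        __m128i r1 = _mm_set_epi32(0, 0, y, x);   /* (x | y | 0 | 0) */
        __m128i r2 = _mm_mullo_epi32(r1, r1);     /* (x*x | y*y | 0 | 0) */
        r2 = _mm_mullo_epi32(r2, r1);             /* (x*x*x | y*y*y | 0 | 0) */
        int x3 = _mm_extract_epi32(r2, 0);
        int y3 = _mm_extract_epi32(r2, 1);
        return (long)x3 + (long)y3;
    }

    int main(void)
    {
        printf("%ld\n", cube_sum_sse(3, 4));      /* 27 + 64 = 91 */
        return 0;
    }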
This particular case is not a good fit for SIMD (SSE or otherwise). SIMD really only works well when you have contiguous arrays that you can access sequentially and process homogeneously.
However you can at least get rid of some of the redundant operations in the scalar code, e.g. repeatedly calculating i * i * i when i is invariant:
do {
for (i = 0; i < maxi; i++) {
int i3 = i * i * i;
int j = nextj[i];
int j3 = j * j * j;
long sum = i3 + j3;
while (sum <= p) {
long x = sum & (psize - 1);
int flag = table[x];
if (flag <= guard) {
table[x] = guard+1;
} else if (flag == guard+1) {
table[x] = guard+2;
count++;
}
j++;
j3 = j * j * j;
sum = i3 + j3;
}
nextj[i] = j;
}
p += psize;
guard += 3;
} while (p <= n);
