multiplication using SSE (x*x*x)+(y*y*y)

multiplication using SSE (x*x*x)+(y*y*y) - c

I'm trying to optimize this function using SIMD but I don't know where to start.
long sum(int x,int y)
{
return x*x*x+y*y*y;
}
The disassembled function looks like this:
4007a0: 48 89 f2 mov %rsi,%rdx
4007a3: 48 89 f8 mov %rdi,%rax
4007a6: 48 0f af d6 imul %rsi,%rdx
4007aa: 48 0f af c7 imul %rdi,%rax
4007ae: 48 0f af d6 imul %rsi,%rdx
4007b2: 48 0f af c7 imul %rdi,%rax
4007b6: 48 8d 04 02 lea (%rdx,%rax,1),%rax
4007ba: c3 retq
4007bb: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
The calling code looks like this:
do {
for (i = 0; i < maxi; i++) {
j = nextj[i];
long sum = cubeSum(i,j);
while (sum <= p) {
long x = sum & (psize - 1);
int flag = table[x];
if (flag <= guard) {
table[x] = guard+1;
} else if (flag == guard+1) {
table[x] = guard+2;
count++;
}
j++;
sum = cubeSum(i,j);
}
nextj[i] = j;
}
p += psize;
guard += 3;
} while (p <= n);

Fill one SSE register with (x|y|0|0) (since each SSE register holds 4 32-bit elements). Lets call it r1
then make a copy of that register to another register r2
Do r2 * r1, storing the result in, say r2.
Do r2 * r1 again storing the result in r2
Now in r2 you have (x*x*x|y*y*y|0|0)
Unpack the lower two elements of r2 into separate registers, add them (SSE3 has horizontal add instructions, but only for floats and doubles).
In the end, I'd actually be surprised if this turned out to be any faster than the simple code the compiler has already generated for you. SIMD is more useful if you have arrays of data you want to operate on..

This particular case is not a good fit for SIMD (SSE or otherwise). SIMD really only works well when you have contiguous arrays that you can access sequentially and process heterogeneously.
However you can at least get rid of some of the redundant operations in the scalar code, e.g. repeatedly calculating i * i * i when i is invariant:
do {
for (i = 0; i < maxi; i++) {
int i3 = i * i * i;
int j = nextj[i];
int j3 = j * j * j;
long sum = i3 + j3;
while (sum <= p) {
long x = sum & (psize - 1);
int flag = table[x];
if (flag <= guard) {
table[x] = guard+1;
} else if (flag == guard+1) {
table[x] = guard+2;
count++;
}
j++;
j3 = j * j * j;
sum = i3 + j3;
}
nextj[i] = j;
}
p += psize;
guard += 3;
} while (p <= n);

Related

How to make a 2D char array completely empty?

I am creating a 2D array using malloc and when I iterate through the array there are characters and symbols that I do not want in there. Shouldn't the array be completely empty?
I have tried to assign the null character to the beginning of each row but that doesn't change anything.
char **structure;
structure = malloc(sizeof *structure * 2);
if (structure) {
for (size_t i = 0; i < 2; i++) {
structure[i] = malloc(sizeof *structure[i] * 20);
structure[i][0] = '\0';
}
}
for (int i = 0; i <= 2; i++) {
for (int j = 0; j < 20; j++) {
printf("%c ", structure[i][j]);
}
printf("\n");
}
I expected the output to just be blank spaces but this is what appeared:
Z Ñ P Ñ l L O
Z Ñ P Ñ N U M B

You should use the calloc function; this is what I often use. It does the same work as malloc() but it initializes all allocated bits with 0.
calloc documentation

As you are accessing each character you must clear all position explicitly to get your desired output.
char **structure;
structure = (char**)malloc( (sizeof *structure) * 2 );
if (structure)
{
for(size_t i = 0; i < 2; i++){
// structure[i];
structure[i] =(char *) malloc( sizeof *structure[i] * 20 );
for(int j=0; j<20; j++)
structure[i][j] = '\0';
}
}
for(int i = 0; i < 2; i++){ //you can not access position i==2
for(int j = 0; j < 20; j++)
{
printf("%c ", structure[i][j]);
}
printf("\n");
}
You can also use printf("%s", structure[i]) to make it work with your current code. it will work because you have made first position of both strings NULL('\0').So printf function will terminate for both character arrays without printing anything. But remember, other position than the first one of both arrays will contain garbage value.

When you are allocating memory the data inside the memory is undefined. the malloc function just giving you the memory.
if you want the memory to be initialized to 0 you can always use the standard calloc function.
If you want to initialize it after allocation you can always use the memset function. in your case
memset(structure, 0, (sizeof *structure) * 2)

You are not really allocating a 2D array of char, but an array of 2 pointers to arrays of 20 char. It is much simpler to allocate a true 2D array:
// allocate a 2D array and initialize it to null bytes
char (*structure)[20] = calloc(sizeof(*structure), 2);
for (int i = 0; i < 2; i++) {
for (int j = 0; j < 20; j++) {
printf("%02X ", structure[i][j]);
}
printf("\n");
}
Output:
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Compressing a 'char' array using bit packing in C [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 5 years ago.
Improve this question
I have a large array (around 1 MB) of type unsigned char (i.e. uint8_t). I know that the bytes in it can have only one of 5 values (i.e. 0, 1, 2, 3, 4). Moreover we do not need to preserve '3's from the input, they can be safely lost when we encode/decode.
So I guessed bit packing would be the simplest way to compress it, so every byte can be converted to 2 bits (00, 01..., 11).
As mentioned all elements of value 3 can be removed (i.e. saved as 0). Which gives me option to save '4' as '3'. While reconstructing (decompressing) I restore 3's to 4's.
I wrote a small function for the compression but I feel this has too many operations and just not efficient enough. Any code snippets or suggestion on how to make it more efficient or faster (hopefully keeping the readability) will be very much helpful.
/// Compress by packing ...
void compressByPacking (uint8_t* out, uint8_t* in, uint32_t length)
{
for (int loop = 0; loop < length/4; loop ++, in += 4, out++)
{
uint8_t temp[4];
for (int small_loop = 0; small_loop < 4; small_loop++)
{
temp[small_loop] = *in; // Load into local variable
if (temp[small_loop] == 3) // 3's are discarded
temp[small_loop] = 0;
else if (temp[small_loop] == 4) // and 4's are converted to 3
temp[small_loop] = 3;
} // end small loop
// Pack the bits into write pointer
*out = (uint8_t)((temp[0] & 0x03) << 6) |
((temp[1] & 0x03) << 4) |
((temp[2] & 0x03) << 2) |
((temp[3] & 0x03));
} // end loop
}
Edited to make the problem more clear as it looked like I'm trying to save 5 values into 2 bits. Thanks to #Brian Cain for suggested wording.
Cross-posted on Code Review.

Your function has a bug: when loading the small array, you should write:
temp[small_loop] = in[small_loop];
You can get rid of the tests with a lookup table, either on the source data, or more efficiently on some intermediary result:
In the code below, I use a small table lookup5 to convert the values 0,1,2,3,4 to 0,1,2,0,3, and a larger one to map groups of 4 3-bit values from the source array to the corresponding byte value in the packed format:
#include <stdint.h>
/// Compress by packing ...
void compressByPacking0(uint8_t *out, uint8_t *in, uint32_t length) {
static uint8_t lookup[4096];
static const uint8_t lookup5[8] = { 0, 1, 2, 0, 3, 0, 0, 0 };
if (lookup[0] == 0) {
/* initialize lookup table */
for (int i = 0; i < 4096; i++) {
lookup[i] = (lookup5[(i >> 0) & 7] << 0) +
(lookup5[(i >> 3) & 7] << 2) +
(lookup5[(i >> 6) & 7] << 4) +
(lookup5[(i >> 9) & 7] << 6);
}
}
for (; length >= 4; length -= 4, in += 4, out++) {
*out = lookup[(in[0] << 9) + (in[1] << 6) + (in[2] << 3) + (in[3] << 0)];
}
uint8_t last = 0;
switch (length) {
case 3:
last |= lookup5[in[2]] << 4;
/* fall through */
case 2:
last |= lookup5[in[1]] << 2;
/* fall through */
case 1:
last |= lookup5[in[0]] << 0;
*out = last;
break;
}
}
Notes:
The code assumes the array does not contain values outside the specified range. Extra protection against spurious input can be achieved at a minimal cost.
The dummy << 0 are here only for symmetry and compile to no extra code.
The lookup table could be initialized statically, via a build time script or a set of macros.
You might want to unroll this loop 4 or more times, or let the compiler decide.
You could also use this simpler solution with a smaller lookup table accessed more often. Careful benchmarking will tell you which is more efficient on your target system:
/// Compress by packing ...
void compressByPacking1(uint8_t *out, uint8_t *in, uint32_t length) {
static const uint8_t lookup[4][5] = {
{ 0 << 6, 1 << 6, 2 << 6, 0 << 6, 3 << 6 },
{ 0 << 4, 1 << 4, 2 << 4, 0 << 4, 3 << 4 },
{ 0 << 2, 1 << 2, 2 << 2, 0 << 2, 3 << 2 },
{ 0 << 0, 1 << 0, 2 << 0, 0 << 0, 3 << 0 },
};
for (; length >= 4; length -= 4, in += 4, out++) {
*out = lookup[0][in[0]] + lookup[1][in[1]] +
lookup[2][in[2]] + lookup[3][in[3]];
}
uint8_t last = 0;
switch (length) {
case 3:
last |= lookup[2][in[2]];
/* fall through */
case 2:
last |= lookup[1][in[1]];
/* fall through */
case 1:
last |= lookup[0][in[0]];
*out = last;
break;
}
}
Here is yet another approach, without any tables:
/// Compress by packing ...
void compressByPacking2(uint8_t *out, uint8_t *in, uint32_t length) {
#define BITS ((1 << 2) + (2 << 4) + (3 << 8))
for (; length >= 4; length -= 4, in += 4, out++) {
*out = ((BITS << 6 >> (in[0] + in[0])) & 0xC0) +
((BITS << 4 >> (in[1] + in[1])) & 0x30) +
((BITS << 2 >> (in[2] + in[2])) & 0x0C) +
((BITS << 0 >> (in[3] + in[3])) & 0x03);
}
uint8_t last = 0;
switch (length) {
case 3:
last |= (BITS << 2 >> (in[2] + in[2])) & 0x0C;
/* fall through */
case 2:
last |= (BITS << 4 >> (in[1] + in[1])) & 0x30;
/* fall through */
case 1:
last |= (BITS << 6 >> (in[0] + in[0])) & 0xC0;
*out = last;
break;
}
}
Here is a comparative benchmark on my system, Macbook pro running OS/X, with clang -O2:
compressByPacking(1MB) -> 0.867ms
compressByPacking0(1MB) -> 0.445ms
compressByPacking1(1MB) -> 0.538ms
compressByPacking2(1MB) -> 0.824ms
The compressByPacking0 variant is fastest, almost twice as fast as your code.
It is a little disappointing, but the code is portable. You might squeeze more performance using handcoded SSE optimizations.

I have a large array (around 1 MB)
Either this is a typo, your target is seriously aging, or this compression operation is invoked repeatedly in the critical path of your application.
Any code snippets or suggestion on how to make it more efficient or
faster (hopefully keeping the readability) will be very much helpful.
In general, you will find the best information by empirically measuring the performance and inspecting the generated code. Using profilers to determine what code is executing, where there are cache misses and pipeline stalls -- these can help you tune your algorithm.
For example, you chose a stride of 4 elements. Is that just because you are mapping four input elements to a single byte? Can you use native SIMD instructions/intrinsics to operate on more elements at a time?
Also, how are you compiling for your target and how well is your compiler able to optimize your code?
Let's ask clang whether it finds any problems trying to optimize your code:
$ clang -fvectorize -O3 -Rpass-missed=licm -c tryme.c
tryme.c:11:28: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
temp[small_loop] = *in; // Load into local variable
^
tryme.c:21:25: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
*out = (uint8_t)((temp[0] & 0x03) << 6) |
^
tryme.c:22:25: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
((temp[1] & 0x03) << 4) |
^
tryme.c:23:25: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
((temp[2] & 0x03) << 2) |
^
tryme.c:24:25: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
((temp[3] & 0x03));
^
I'm not sure but maybe alias analysis is what makes it think it can't move this load. Try playing with __restrict__ to see if that has any effect.
$ clang -fvectorize -O3 -Rpass-analysis=loop-vectorize -c tryme.c
tryme.c:13:13: remark: loop not vectorized: loop contains a switch statement [-Rpass-analysis=loop-vectorize]
if (temp[small_loop] == 3) // 3's are discarded
I can't think of anything obvious that you can do about this one unless you change your algorithm. If the compression ratio is satisfactory without deleting the 3s, you could perhaps eliminate this.
So what's the generated code look like? Take a look below. How could you write it better by hand? If you can write it better yourself, either do that or feed it back into your algorithm to help guide the compiler.
Does the compiled code take advantage of your target's instruction set and registers?
Most importantly -- try executing it and see where you're spending the most cycles. Stalls from branch misprediction, unaligned loads? Maybe you can do something about those. Use what you know about the frequency of your input data to give the compiler hints about the branches in your encoder.
$ objdump -d --source tryme.o
...
0000000000000000 <compressByPacking>:
#include <stdint.h>
void compressByPacking (uint8_t* out, uint8_t* in, uint32_t length)
{
for (int loop = 0; loop < length/4; loop ++, in += 4, out++)
0: c1 ea 02 shr $0x2,%edx
3: 0f 84 86 00 00 00 je 8f <compressByPacking+0x8f>
9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
{
uint8_t temp[4];
for (int small_loop = 0; small_loop < 4; small_loop++)
{
temp[small_loop] = *in; // Load into local variable
10: 8a 06 mov (%rsi),%al
if (temp[small_loop] == 3) // 3's are discarded
12: 3c 04 cmp $0x4,%al
14: 74 3a je 50 <compressByPacking+0x50>
16: 3c 03 cmp $0x3,%al
18: 41 88 c0 mov %al,%r8b
1b: 75 03 jne 20 <compressByPacking+0x20>
1d: 45 31 c0 xor %r8d,%r8d
20: 3c 04 cmp $0x4,%al
22: 74 33 je 57 <compressByPacking+0x57>
24: 3c 03 cmp $0x3,%al
26: 88 c1 mov %al,%cl
28: 75 02 jne 2c <compressByPacking+0x2c>
2a: 31 c9 xor %ecx,%ecx
2c: 3c 04 cmp $0x4,%al
2e: 74 2d je 5d <compressByPacking+0x5d>
30: 3c 03 cmp $0x3,%al
32: 41 88 c1 mov %al,%r9b
35: 75 03 jne 3a <compressByPacking+0x3a>
37: 45 31 c9 xor %r9d,%r9d
3a: 3c 04 cmp $0x4,%al
3c: 74 26 je 64 <compressByPacking+0x64>
3e: 3c 03 cmp $0x3,%al
40: 75 24 jne 66 <compressByPacking+0x66>
42: 31 c0 xor %eax,%eax
44: eb 20 jmp 66 <compressByPacking+0x66>
46: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4d: 00 00 00
50: 41 b0 03 mov $0x3,%r8b
53: 3c 04 cmp $0x4,%al
55: 75 cd jne 24 <compressByPacking+0x24>
57: b1 03 mov $0x3,%cl
59: 3c 04 cmp $0x4,%al
5b: 75 d3 jne 30 <compressByPacking+0x30>
5d: 41 b1 03 mov $0x3,%r9b
60: 3c 04 cmp $0x4,%al
62: 75 da jne 3e <compressByPacking+0x3e>
64: b0 03 mov $0x3,%al
temp[small_loop] = 3;
} // end small loop
// Pack the bits into write pointer
*out = (uint8_t)((temp[0] & 0x03) << 6) |
66: 41 c0 e0 06 shl $0x6,%r8b
((temp[1] & 0x03) << 4) |
6a: c0 e1 04 shl $0x4,%cl
6d: 80 e1 30 and $0x30,%cl
temp[small_loop] = 3;
} // end small loop
// Pack the bits into write pointer
*out = (uint8_t)((temp[0] & 0x03) << 6) |
70: 44 08 c1 or %r8b,%cl
((temp[1] & 0x03) << 4) |
((temp[2] & 0x03) << 2) |
73: 41 c0 e1 02 shl $0x2,%r9b
77: 41 80 e1 0c and $0xc,%r9b
((temp[3] & 0x03));
7b: 24 03 and $0x3,%al
} // end small loop
// Pack the bits into write pointer
*out = (uint8_t)((temp[0] & 0x03) << 6) |
((temp[1] & 0x03) << 4) |
7d: 44 08 c8 or %r9b,%al
((temp[2] & 0x03) << 2) |
80: 08 c8 or %cl,%al
temp[small_loop] = 3;
} // end small loop
// Pack the bits into write pointer
*out = (uint8_t)((temp[0] & 0x03) << 6) |
82: 88 07 mov %al,(%rdi)
#include <stdint.h>
void compressByPacking (uint8_t* out, uint8_t* in, uint32_t length)
{
for (int loop = 0; loop < length/4; loop ++, in += 4, out++)
84: 48 83 c6 04 add $0x4,%rsi
88: 48 ff c7 inc %rdi
8b: ff ca dec %edx
8d: 75 81 jne 10 <compressByPacking+0x10>
((temp[1] & 0x03) << 4) |
((temp[2] & 0x03) << 2) |
((temp[3] & 0x03));
} // end loop
}
8f: c3 retq

In all the excitement about performance, functionality is overlooked. Code is broke.
// temp[small_loop] = *in; // Load into local variable
temp[small_loop] = in[small_loop];
Alternative:
How about a simple tight loop?
Use const and restrict to allow various optimizations.
void compressByPacking1(uint8_t* restrict out, const uint8_t* restrict in,
uint32_t length) {
static const uint8_t t[5] = { 0, 1, 2, 0, 3 };
uint32_t length4 = length / 4;
unsigned v = 0;
uint32_t i;
for (i = 0; i < length4; i++) {
for (unsigned j=0; j < 4; j++) {
v <<= 2;
v |= t[*in++];
}
out[i] = (uint8_t) v;
}
if (length & 3) {
v = 0;
for (unsigned j; j < 4; j++) {
v <<= 2;
if (j < (length & 3)) {
v |= t[*in++];
}
}
out[i] = (uint8_t) v;
}
}
Tested and found this code to be about 270% as fast (41 vs 15) (YMMV).
Tested and found to form the same output as OP's (corrected) code

Update: Tested
Unsafe version is the fastest - fastest than other ones in another answers. Tested with VS2017
const uint8_t table[4][5] =
{ { 0 << 0,1 << 0,2 << 0,0 << 0,3 << 0 },
{ 0 << 2,1 << 2,2 << 2,0 << 2,3 << 2 },
{ 0 << 4,1 << 4,2 << 4,0 << 4,3 << 4 },
{ 0 << 6,1 << 6,2 << 6,0 << 6,3 << 6 },
};
void code(uint8_t *in, uint8_t *out, uint32_t len)
{
memset(out, 0, len / 4 + 1);
for (uint32_t i = 0; i < len; i++)
out[i / 4] |= table[i & 3][in[i] % 5];
}
void code_unsafe(uint8_t *in, uint8_t *out, uint32_t len)
{
for (uint32_t i = 0; i < len; i += 4, in += 4, out++)
{
*out = table[0][in[0]] | table[1][in[1]] | table[2][in[2]] | table[3][in[3]];
}
}
To check how it is written it is enough to compile it - even online
https://godbolt.org/g/Z75NQV
There are small very simple my coding functions - just for comparition of the compiler generated code, not tested.

Does this look clearer?
void compressByPacking (uint8_t* out, uint8_t* in, uint32_t length)
{
assert( 0 == length % 4 );
for (int loop = 0; loop < length; loop += 4)
{
uint8_t temp = 0;
for (int small_loop = 0; small_loop < 4; small_loop++)
{
uint8_t inv = *in; // get next input value
switch(inv)
{
case 0: // encode as 00
case 3: // change to 0
break;
case 1:
temp |= (1 << smal_loop*2); // 1 encode as '01'
break;
case 2:
temp |= (2 << smal_loop*2); // 2 encode as '10'
break;
case 4:
temp |= (3 << smal_loop*2); // 4 encode as '11'
break;
default:
assert(0);
}
} // end inner loop
*out = temp;
} // end outer loop
}

Create Function: Every 8 increments = one paragraph

I want to create a function in c, which creates every 8 increments (to be exact, when an integer holds the value 8) a paragraph/new line (and also prints an offset).
In this case, i've got an array
for (int i = 0; i < sizeof(myarray); i++){
printf(" %02hhX", myarray[i]);
}
Now i want to implement my function like this
int row = 0;
for (int i = 0; i < sizeof(myarray); i++){
printf(" %02hhX", myarray[i]);
check_newline(row);
}
The function 'check_newline' has this structure:
void check_newline(int row){
row++;
if(row==8){
offset = offset + 8;
row= 0;
printf("\n%06X", offset);
}
}
Everytime, the integer 'row' reaches the value 8 a new offset will be printed and the value of 'row' will be reset to 0.
Now, i don't know how to implement the return; and with this code, my output looks like this
000008 E0 60 66 64 38 7D E0 60 66 64 38 7D 80000008 80 00 FF FF FF FF FF FF 000010 E0 60 66 64 38 7D E0 60 66 64 38 7D 80000010 80 00 FF FF FF FF FF FF
(totaly wrong)
When i 'put the function inside my code' (so basicaly don't use a function), everything is nice, because of the missing return statement.
for (int i = 0; i < sizeof(myarray); i++){
printf(" %02hhX", myarray[i]);
row++;
if(row==8){
offset = offset + 8;
row= 0;
printf("\n%06X", offset);
}
000000 80 00 FF FF FF FF FF FF
000008 E0 60 66 64 38 7D E0 60
I have to use this kind of calculation often in my code, so a function will be more sleek.

You are over complicating things. You don't need the extra variables as you can make use of i and the % (modulo operator) to work out when you're at the beginning or end of a row like this.
for (int i = 0; i < sizeof(myarray); i++) {
if (i % 8==0) {
printf("%06X",i);
}
printf(" %02hhX", myarray[i]);
if (i % 8==7) {
printf("\n");
}
}

Converting a floating point number in C to IEEE standard

I am trying to write a code in C that generates a random integer , performs simple calculation and then I am trying to print the values of the file in IEEE standard. But I am unable to do so , Please help.
I am unable to print it in Hexadecimal/Binary which is very important.
If I type cast the values in fprintf, I am getting this Error expected expression before double.
int main (int argc, char *argv) {
int limit = 20 ; double a[limit], b[limit]; //Inputs
double result[limit] ; int i , k ; //Outputs
printf("limit = %d", limit ); double q;
for (i= 0 ; i< limit;i++)
{
a[i]= rand();
b[i]= rand();
printf ("A= %x B = %x\n",a[i],b[i]);
}
char op;
printf("Enter the operand used : add,subtract,multiply,divide\n");
scanf ("%c", &op); switch (op) {
case '+': {
for (k= 0 ; k< limit ; k++)
{
result [k]= a[k] + b[k];
printf ("result= %f\n",result[k]);
}
}
break;
case '*': {
for (k= 0 ; k< limit ; k++)
{
result [k]= a[k] * b[k];
}
}
break;
case '/': {
for (k= 0 ; k< limit ; k++)
{
result [k]= a[k] / b[k];
}
}
break;
case '-': {
for (k= 0 ; k< limit ; k++)
{
result [k]= a[k] - b[k];
}
}
break; }
FILE *file; file = fopen("tb.txt","w"); for(k=0;k<limit;k++) {
fprintf (file,"%x\n
%x\n%x\n\n",double(a[k]),double(b[k]),double(result[k]) );
}
fclose(file); /*done!*/
}

If your C compiler supports IEEE-754 floating point format directly (because the CPU supports it) or fully emulates it, you may be able to print doubles simply as bytes. And that is the case for the x86/64 platform.
Here's an example:
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <float.h>
void PrintDoubleAsCBytes(double d, FILE* f)
{
unsigned char a[sizeof(d)];
unsigned i;
memcpy(a, &d, sizeof(d));
for (i = 0; i < sizeof(a); i++)
fprintf(f, "%0*X ", (CHAR_BIT + 3) / 4, a[i]);
}
int main(void)
{
PrintDoubleAsCBytes(0.0, stdout); puts("");
PrintDoubleAsCBytes(0.5, stdout); puts("");
PrintDoubleAsCBytes(1.0, stdout); puts("");
PrintDoubleAsCBytes(2.0, stdout); puts("");
PrintDoubleAsCBytes(-2.0, stdout); puts("");
PrintDoubleAsCBytes(DBL_MIN, stdout); puts("");
PrintDoubleAsCBytes(DBL_MAX, stdout); puts("");
PrintDoubleAsCBytes(INFINITY, stdout); puts("");
#ifdef NAN
PrintDoubleAsCBytes(NAN, stdout); puts("");
#endif
return 0;
}
Output (ideone):
00 00 00 00 00 00 00 00
00 00 00 00 00 00 E0 3F
00 00 00 00 00 00 F0 3F
00 00 00 00 00 00 00 40
00 00 00 00 00 00 00 C0
00 00 00 00 00 00 10 00
FF FF FF FF FF FF EF 7F
00 00 00 00 00 00 F0 7F
00 00 00 00 00 00 F8 7F
If IEEE-754 isn't supported directly, the problem becomes more complex. However, it can still be solved.
Here are a few related questions and answers that can help:
How do I handle byte order differences when reading/writing floating-point types in C?
Is there a tool to know whether a value has an exact binary representation as a floating point variable?
C dynamically printf double, no loss of precision and no trailing zeroes
And, of course, all the IEEE-754 related info can be found in Wikipedia.

Try this in your fprint part:
fprintf (file,"%x\n%x\n%x\n\n",*((int*)(&a[k])),*((int*)(&b[k])),*((int*)(&result[k])));
That would translate the double as an integer so it's printed in IEEE standard.
But if you're running your program on a 32-bit machine on which int is 32-bit and double is 64-bit, I suppose you should use:
fprintf (file,"%x%x\n%x%x\n%x%x\n\n",*((int*)(&a[k])),*((int*)(&a[k])+1),*((int*)(&b[k])),*((int*)(&b[k])+1),*((int*)(&result[k])),*((int*)(&result[k])+1));

In C, there are two ways to get at the bytes in a float value: a pointer cast, or a union. I recommend a union.
I just tested this code with GCC and it worked:
#include <stdio.h>
typedef unsigned char BYTE;
int
main()
{
float f = 3.14f;
int i = sizeof(float) - 1;
BYTE *p = (BYTE *)(&f);
p[i] = p[i] | 0x80; // set the sign bit
printf("%f\n", f); // prints -3.140000
}
We are taking the address of the variable f, then assigning it to a pointer to BYTE (unsigned char). We use a cast to force the pointer.
If you try to compile code with optimizations enabled and you do the pointer cast shown above, you might run into the compiler complaining about "type-punned pointer" issues. I'm not exactly sure when you can do this and when you can't. But you can always use the other way to get at the bits: put the float into a union with an array of bytes.
#include <stdio.h>
typedef unsigned char BYTE;
typedef union
{
float f;
BYTE b[sizeof(float)];
} UFLOAT;
int
main()
{
UFLOAT u;
int const i = sizeof(float) - 1;
u.f = 3.14f;
u.b[i] = u.b[i] | 0x80; // set the sign bit
printf("%f\n", u.f); // prints -3.140000
}
What definitely will not work is to try to cast the float value directly to an unsigned integer or something like that. C doesn't know you just want to override the type, so C tries to convert the value, causing rounding.
float f = 3.14;
unsigned int i = (unsigned int)f;
if (i == 3)
printf("yes\n"); // will print "yes"
P.S. Discussion of "type-punned" pointers here:
Dereferencing type-punned pointer will break strict-aliasing rules

while (n > 1) is 25% faster than while (n)?

I have two logically equivalent functions:
long ipow1(int base, int exp) {
// HISTORICAL NOTE:
// This wasn't here in the original question, I edited it in,
if (exp == 0) return 1;
long result = 1;
while (exp > 1) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
}
return result * base;
}
long ipow2(int base, int exp) {
long result = 1;
while (exp) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
}
return result;
}
NOTICE:
These loops are equivalent because in the former case we are returning result * base (handling the case when exp is or has been reduced to 1) but in the second case we are returning result.
Strangely enough, both with -O3 and -O0 ipow1 consequently outperforms ipow2 by about 25%. How is this possible?
I'm on Windows 7, x64, gcc 4.5.2 and compiling with gcc ipow.c -O0 -std=c99.
And this is my profiling code:
int main(int argc, char *argv[]) {
LARGE_INTEGER ticksPerSecond;
LARGE_INTEGER tick;
LARGE_INTEGER start_ticks, end_ticks, cputime;
double totaltime = 0;
int repetitions = 10000;
int rep = 0;
int nopti = 0;
for (rep = 0; rep < repetitions; rep++) {
if (!QueryPerformanceFrequency(&ticksPerSecond)) printf("\tno go QueryPerformance not present");
if (!QueryPerformanceCounter(&tick)) printf("no go counter not installed");
QueryPerformanceCounter(&start_ticks);
/* start real code */
for (int i = 0; i < 55; i++) {
for (int j = 0; j < 11; j++) {
nopti = ipow1(i, j); // or ipow2
}
}
/* end code */
QueryPerformanceCounter(&end_ticks);
cputime.QuadPart = end_ticks.QuadPart - start_ticks.QuadPart;
totaltime += (double)cputime.QuadPart / (double)ticksPerSecond.QuadPart;
}
printf("\tTotal elapsed CPU time: %.9f sec with %d repetitions - %ld:\n", totaltime, repetitions, nopti);
return 0;
}

No, really, the two ARE NOT equivalent. ipow2 returns correct results when ipow1 doesn't.
http://ideone.com/MqyqU
P.S. I don't care how many comments you leave "explaining" why they're the same, it takes only a single counter-example to disprove your claims.
P.P.S. -1 on the question for your insufferable arrogance toward everyone who already tried to point this out to you.

It's becouse with while (exp > 1) the for will run from exp to 2 (it will execute with exp = 2, decrement it to 1 and then end the loop).
With while (exp), the for will run from exp to 1 (it will execute with exp = 1, decrement it to 0 and then end the loop).
So with while (exp) you have an extra iteration, which takes the extra time to run.
EDIT: Even with the multiplication after the loop with the exp>1 while, keep in mind that the multiplication is not the only thing in the loop.

If you dont want to read all of this skip to the bottom, I come up with a 21% difference just by analysis of the code.
Different systems, versions of the compiler, same compiler version built by different folks/distros will give different instruction mixes, this is just one example of what you might get.
long ipow1(int base, int exp) {
long result = 1;
while (exp > 1) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
}
return result * base;
}
long ipow2(int base, int exp) {
long result = 1;
while (exp) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
}
return result;
}
0000000000000000 <ipow1>:
0: 83 fe 01 cmp $0x1,%esi
3: ba 01 00 00 00 mov $0x1,%edx
8: 7e 1d jle 27 <ipow1+0x27>
a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
10: 40 f6 c6 01 test $0x1,%sil
14: 74 07 je 1d <ipow1+0x1d>
16: 48 63 c7 movslq %edi,%rax
19: 48 0f af d0 imul %rax,%rdx
1d: d1 fe sar %esi
1f: 0f af ff imul %edi,%edi
22: 83 fe 01 cmp $0x1,%esi
25: 7f e9 jg 10 <ipow1+0x10>
27: 48 63 c7 movslq %edi,%rax
2a: 48 0f af c2 imul %rdx,%rax
2e: c3 retq
2f: 90 nop
0000000000000030 <ipow2>:
30: 85 f6 test %esi,%esi
32: b8 01 00 00 00 mov $0x1,%eax
37: 75 0a jne 43 <ipow2+0x13>
39: eb 19 jmp 54 <ipow2+0x24>
3b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
40: 0f af ff imul %edi,%edi
43: 40 f6 c6 01 test $0x1,%sil
47: 74 07 je 50 <ipow2+0x20>
49: 48 63 d7 movslq %edi,%rdx
4c: 48 0f af c2 imul %rdx,%rax
50: d1 fe sar %esi
52: 75 ec jne 40 <ipow2+0x10>
54: f3 c3 repz retq
Isolating the loops:
while (exp > 1) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
}
//if exp & 1 not true jump to 1d to skip
10: 40 f6 c6 01 test $0x1,%sil
14: 74 07 je 1d <ipow1+0x1d>
//result *= base
16: 48 63 c7 movslq %edi,%rax
19: 48 0f af d0 imul %rax,%rdx
//exp>>=1
1d: d1 fe sar %esi
//base *= base
1f: 0f af ff imul %edi,%edi
//while(exp>1) stayin the loop
22: 83 fe 01 cmp $0x1,%esi
25: 7f e9 jg 10 <ipow1+0x10>
Comparing something to zero normally saves you an instruction and you can see that here
while (exp) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
}
//base *= base
40: 0f af ff imul %edi,%edi
//if exp & 1 not true jump to skip
43: 40 f6 c6 01 test $0x1,%sil
47: 74 07 je 50 <ipow2+0x20>
//result *= base
49: 48 63 d7 movslq %edi,%rdx
4c: 48 0f af c2 imul %rdx,%rax
//exp>>=1
50: d1 fe sar %esi
//no need for a compare
52: 75 ec jne 40 <ipow2+0x10>
Your timing method is going to generate a lot of error/chaos. Depending on the beat frequency of the loop and the accuracy of the timer you can create a lot of gain in one and a lot of loss in another. This method normally gives better accuracy:
starttime = ...
for(rep=bignumber;rep;rep--)
{
//code under test
...
}
endtime = ...
total = endtime - starttime;
Of course if you are running this on an operating system timing it is going to have a decent amount of error in it anyway.
Also you want to use volatile variables for your timer variables, helps the compiler to not re-arrange the order of execution. (been there seen that).
If we look at this from the perspective of the base multiplies:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
unsigned int mults;
long ipow1(int base, int exp) {
long result = 1;
while (exp > 1) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
mults++;
}
result *= base;
return result;
}
long ipow2(int base, int exp) {
long result = 1;
while (exp) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
mults++;
}
return result;
}
int main ( void )
{
int i;
int j;
mults = 0;
for (i = 0; i < 55; i++) {
for (j = 0; j < 11; j++) {
ipow1(i, j); // or ipow2
}
}
printf("mults %u\n",mults);
mults=0;
for (i = 0; i < 55; i++) {
for (j = 0; j < 11; j++) {
ipow2(i, j); // or ipow2
}
}
printf("mults %u\n",mults);
}
there are
mults 1045
mults 1595
50% more for ipow2(). Actually it is not just the multiplies it is that you are going through the loop 50% more times.
ipow1() gets a little back on the other multiplies:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
unsigned int mults;
long ipow1(int base, int exp) {
long result = 1;
while (exp > 1) {
if (exp & 1) mults++;
exp >>= 1;
base *= base;
}
mults++;
return result;
}
long ipow2(int base, int exp) {
long result = 1;
while (exp) {
if (exp & 1) mults++;
exp >>= 1;
base *= base;
}
return result;
}
int main ( void )
{
int i;
int j;
mults = 0;
for (i = 0; i < 55; i++) {
for (j = 0; j < 11; j++) {
ipow1(i, j); // or ipow2
}
}
printf("mults %u\n",mults);
mults=0;
for (i = 0; i < 55; i++) {
for (j = 0; j < 11; j++) {
ipow2(i, j); // or ipow2
}
}
printf("mults %u\n",mults);
}
ipow1() performs the result*=base a different number (more) times than ipow2()
mults 990
mults 935
being a long * int can make these more expensive. not enough to make up for the losses around the loop in ipow2().
Even without disassembling, making a rough guess on the operations/instructions you hope the compiler uses. Accounting here for processors in general not necessarily x86, some processors will run this code better than others (from a number of instructions executed perspective not counting all the other factors).
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
unsigned int ops;
long ipow1(int base, int exp) {
long result = 1;
ops++; //result = immediate
while (exp > 1) {
ops++; // compare exp - 1
ops++; // conditional jump
//if (exp & 1)
ops++; //exp&1
ops++; //conditional jump
if (exp & 1)
{
result *= base;
ops++;
}
exp >>= 1;
ops++;
//ops+=?; //using a signed number can cost you this on some systems
//always use unsigned unless you have a specific reason to use signed.
//if this had been a short or char variable it might cost you even more
//operations
//if this needs to be signed it is what it is, just be aware of
//the cost
base *= base;
ops++;
}
result *= base;
ops++;
return result;
}
long ipow2(int base, int exp) {
long result = 1;
ops++;
while (exp) {
//ops++; //cmp exp-0, often optimizes out;
ops++; //conditional jump
//if (exp & 1)
ops++;
ops++;
if (exp & 1)
{
result *= base;
ops++;
}
exp >>= 1;
ops++;
//ops+=?; //right shifting a signed number
base *= base;
ops++;
}
return result;
}
int main ( void )
{
int i;
int j;
ops = 0;
for (i = 0; i < 55; i++) {
for (j = 0; j < 11; j++) {
ipow1(i, j); // or ipow2
}
}
printf("ops %u\n",ops);
ops=0;
for (i = 0; i < 55; i++) {
for (j = 0; j < 11; j++) {
ipow2(i, j); // or ipow2
}
}
printf("ops %u\n",ops);
}
Assuming I counted all the major operations and didnt unfairly give one function more than another:
ops 7865
ops 9515
ipow2 is 21% slower using this analysis.
I think the big killer is the 50% more times through the loop. Granted it is data dependent, you might find inputs in a benchmark test that make the difference between functions greater or worse than the 25% you are seeing.

Your functions are not "logically equal".
while (exp > 1){...}
is NOT logically equal to
while (exp){...}
Why do you say it is?

Does this really generate the same assembly code? When I tried (with gcc 4.5.1 on OpenSuse 11.4, I will admit) I found slight differences.
ipow1.s:
cmpl $1, -24(%rbp)
jg .L4
movl -20(%rbp), %eax
cltq
imulq -8(%rbp), %rax
leave
ipow2.s:
cmpl $0, -24(%rbp)
jne .L4
movq -8(%rbp), %rax
leave
Perhaps the processor's branch prediction is just more effective with jg than with jne? It seems unlikely that one branch instruction would run 25% faster than another (especially when cmpl has done most of the heavy lifting)

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

multiplication using SSE (xxx)+(yyy) - c

Related

How to make a 2D char array completely empty?

Compressing a 'char' array using bit packing in C [closed]

Create Function: Every 8 increments = one paragraph

Converting a floating point number in C to IEEE standard

while (n > 1) is 25% faster than while (n)?

Categories

Resources