I have a large array (around 1 MB) of type unsigned char (i.e. uint8_t). I know that the bytes in it can have only one of 5 values (0, 1, 2, 3 or 4). Moreover, we do not need to preserve the 3s from the input; they can safely be lost when we encode/decode.
So I guessed bit packing would be the simplest way to compress it: every byte can be converted to 2 bits (00, 01, 10, 11).
As mentioned, all elements of value 3 can be dropped (i.e. saved as 0), which frees the 2-bit code 3 to store the value 4. While reconstructing (decompressing) I restore 3s to 4s.
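For reference, the decoding side described above could look roughly like this. It is only a sketch: the name decompressByUnpacking is hypothetical, and it assumes the first byte of each group of four is stored in the two most significant bits, as the packing function below does.

```c
#include <stdint.h>

/* Hypothetical counterpart to compressByPacking: expands each packed byte
 * back into four values. The 2-bit code 3 is restored to 4; original 3s
 * come back as 0 because they were discarded when packing.
 * Assumes `length` (the number of unpacked elements) is a multiple of 4. */
void decompressByUnpacking(uint8_t *out, const uint8_t *in, uint32_t length)
{
    static const uint8_t restore[4] = { 0, 1, 2, 4 };

    for (uint32_t i = 0; i < length / 4; i++, out += 4) {
        uint8_t packed = in[i];
        out[0] = restore[(packed >> 6) & 0x03];  /* first element: top two bits */
        out[1] = restore[(packed >> 4) & 0x03];
        out[2] = restore[(packed >> 2) & 0x03];
        out[3] = restore[packed & 0x03];         /* fourth element: low two bits */
    }
}
```

For example, the packed byte 0x6C (binary 01 10 11 00) would expand to 1, 2, 4, 0.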
I wrote a small function for the compression, but I feel it has too many operations and is just not efficient enough. Any code snippets or suggestions on how to make it more efficient or faster (hopefully keeping the readability) would be very helpful.
/// Compress by packing ...
void compressByPacking (uint8_t* out, uint8_t* in, uint32_t length)
{
for (int loop = 0; loop < length/4; loop ++, in += 4, out++)
{
uint8_t temp[4];
for (int small_loop = 0; small_loop < 4; small_loop++)
{
temp[small_loop] = *in; // Load into local variable
if (temp[small_loop] == 3) // 3's are discarded
temp[small_loop] = 0;
else if (temp[small_loop] == 4) // and 4's are converted to 3
temp[small_loop] = 3;
} // end small loop
// Pack the bits into write pointer
*out = (uint8_t)((temp[0] & 0x03) << 6) |
((temp[1] & 0x03) << 4) |
((temp[2] & 0x03) << 2) |
((temp[3] & 0x03));
} // end loop
}
Edited to make the problem clearer, as it looked like I was trying to save 5 values into 2 bits. Thanks to @Brian Cain for the suggested wording.
Cross-posted on Code Review.
Your function has a bug: when loading the small array, you should write:
temp[small_loop] = in[small_loop];
You can get rid of the tests with a lookup table, either on the source data or, more efficiently, on some intermediary result.
In the code below, I use a small table lookup5 to convert the values 0,1,2,3,4 to 0,1,2,0,3, and a larger one to map groups of four 3-bit values from the source array to the corresponding byte value in the packed format:
#include <stdint.h>
/// Compress by packing ...
void compressByPacking0(uint8_t *out, uint8_t *in, uint32_t length) {
static uint8_t lookup[4096];
static const uint8_t lookup5[8] = { 0, 1, 2, 0, 3, 0, 0, 0 };
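if (lookup[1] == 0) {
/* initialize lookup table on the first call only: lookup[1] becomes 1 once
   the table is built, whereas lookup[0] is legitimately 0 and cannot be
   used as the "already initialized" flag */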
for (int i = 0; i < 4096; i++) {
lookup[i] = (lookup5[(i >> 0) & 7] << 0) +
(lookup5[(i >> 3) & 7] << 2) +
(lookup5[(i >> 6) & 7] << 4) +
(lookup5[(i >> 9) & 7] << 6);
}
}
for (; length >= 4; length -= 4, in += 4, out++) {
*out = lookup[(in[0] << 9) + (in[1] << 6) + (in[2] << 3) + (in[3] << 0)];
}
uint8_t last = 0;
switch (length) {
case 3:
last |= lookup5[in[2]] << 2;
/* fall through */
case 2:
last |= lookup5[in[1]] << 4;
/* fall through */
case 1:
last |= lookup5[in[0]] << 6; /* same bit positions as the main loop: in[0] goes in the top two bits */
*out = last;
break;
}
}
Notes:
The code assumes the array does not contain values outside the specified range. Extra protection against spurious input can be achieved at a minimal cost.
The dummy << 0 are here only for symmetry and compile to no extra code.
The lookup table could be initialized statically, via a build time script or a set of macros.
You might want to unroll this loop 4 or more times, or let the compiler decide.
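As a rough illustration of the "set of macros" option mentioned in the notes, the 4096-entry table could be generated by the preprocessor. This is only a sketch: the PK_* macro names and the array name packed_lookup are made up for the example, and the table is indexed exactly like lookup in compressByPacking0.

```c
#include <stdint.h>

/* Map one 3-bit source value to its 2-bit code: 0,1,2,3,4 -> 0,1,2,0,3
 * (5, 6 and 7 never occur in valid input and map to 0). */
#define PK_MAP(v) ((v) == 1 ? 1 : (v) == 2 ? 2 : (v) == 4 ? 3 : 0)

/* Table entry for the 12-bit index i = (in[0] << 9) + (in[1] << 6) + (in[2] << 3) + in[3]. */
#define PK_ENTRY(i) (uint8_t)((PK_MAP(((i) >> 9) & 7) << 6) | \
                              (PK_MAP(((i) >> 6) & 7) << 4) | \
                              (PK_MAP(((i) >> 3) & 7) << 2) | \
                               PK_MAP((i) & 7))

/* Each level quadruples the number of generated entries. */
#define PK_E4(i)    PK_ENTRY(i), PK_ENTRY((i) + 1), PK_ENTRY((i) + 2), PK_ENTRY((i) + 3)
#define PK_E16(i)   PK_E4(i), PK_E4((i) + 4), PK_E4((i) + 8), PK_E4((i) + 12)
#define PK_E64(i)   PK_E16(i), PK_E16((i) + 16), PK_E16((i) + 32), PK_E16((i) + 48)
#define PK_E256(i)  PK_E64(i), PK_E64((i) + 64), PK_E64((i) + 128), PK_E64((i) + 192)
#define PK_E1024(i) PK_E256(i), PK_E256((i) + 256), PK_E256((i) + 512), PK_E256((i) + 768)

/* Fully constant table: no run-time initialization and no "is it built yet" test. */
static const uint8_t packed_lookup[4096] = {
    PK_E1024(0), PK_E1024(1024), PK_E1024(2048), PK_E1024(3072)
};
```

With such a table the initialization block at the top of compressByPacking0 would no longer be needed; whether that changes the timings is something only a benchmark on the target can tell.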
You could also use this simpler solution with a smaller lookup table accessed more often. Careful benchmarking will tell you which is more efficient on your target system:
/// Compress by packing ...
void compressByPacking1(uint8_t *out, uint8_t *in, uint32_t length) {
static const uint8_t lookup[4][5] = {
{ 0 << 6, 1 << 6, 2 << 6, 0 << 6, 3 << 6 },
{ 0 << 4, 1 << 4, 2 << 4, 0 << 4, 3 << 4 },
{ 0 << 2, 1 << 2, 2 << 2, 0 << 2, 3 << 2 },
{ 0 << 0, 1 << 0, 2 << 0, 0 << 0, 3 << 0 },
};
for (; length >= 4; length -= 4, in += 4, out++) {
*out = lookup[0][in[0]] + lookup[1][in[1]] +
lookup[2][in[2]] + lookup[3][in[3]];
}
uint8_t last = 0;
switch (length) {
case 3:
last |= lookup[2][in[2]];
/* fall through */
case 2:
last |= lookup[1][in[1]];
/* fall through */
case 1:
last |= lookup[0][in[0]];
*out = last;
break;
}
}
Here is yet another approach, without any tables:
/// Compress by packing ...
void compressByPacking2(uint8_t *out, uint8_t *in, uint32_t length) {
#define BITS ((1 << 2) + (2 << 4) + (3 << 8))
for (; length >= 4; length -= 4, in += 4, out++) {
*out = ((BITS << 6 >> (in[0] + in[0])) & 0xC0) +
((BITS << 4 >> (in[1] + in[1])) & 0x30) +
((BITS << 2 >> (in[2] + in[2])) & 0x0C) +
((BITS << 0 >> (in[3] + in[3])) & 0x03);
}
uint8_t last = 0;
switch (length) {
case 3:
last |= (BITS << 2 >> (in[2] + in[2])) & 0x0C;
/* fall through */
case 2:
last |= (BITS << 4 >> (in[1] + in[1])) & 0x30;
/* fall through */
case 1:
last |= (BITS << 6 >> (in[0] + in[0])) & 0xC0;
*out = last;
break;
}
}
Here is a comparative benchmark on my system, a MacBook Pro running OS X, compiled with clang -O2:
compressByPacking(1MB) -> 0.867ms
compressByPacking0(1MB) -> 0.445ms
compressByPacking1(1MB) -> 0.538ms
compressByPacking2(1MB) -> 0.824ms
The compressByPacking0 variant is fastest, almost twice as fast as your code.
It is a little disappointing, but the code is portable. You might squeeze more performance using handcoded SSE optimizations.
"I have a large array (around 1 MB)"
Either this is a typo, your target is seriously aging, or this compression operation is invoked repeatedly in the critical path of your application.
"Any code snippets or suggestion on how to make it more efficient or faster (hopefully keeping the readability) will be very much helpful."
In general, you will find the best information by empirically measuring the performance and inspecting the generated code. Using profilers to determine what code is executing, where there are cache misses and pipeline stalls -- these can help you tune your algorithm.
For example, you chose a stride of 4 elements. Is that just because you are mapping four input elements to a single byte? Can you use native SIMD instructions/intrinsics to operate on more elements at a time?
Also, how are you compiling for your target and how well is your compiler able to optimize your code?
Let's ask clang whether it finds any problems trying to optimize your code:
$ clang -fvectorize -O3 -Rpass-missed=licm -c tryme.c
tryme.c:11:28: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
temp[small_loop] = *in; // Load into local variable
^
tryme.c:21:25: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
*out = (uint8_t)((temp[0] & 0x03) << 6) |
^
tryme.c:22:25: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
((temp[1] & 0x03) << 4) |
^
tryme.c:23:25: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
((temp[2] & 0x03) << 2) |
^
tryme.c:24:25: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
((temp[3] & 0x03));
^
I'm not sure but maybe alias analysis is what makes it think it can't move this load. Try playing with __restrict__ to see if that has any effect.
$ clang -fvectorize -O3 -Rpass-analysis=loop-vectorize -c tryme.c
tryme.c:13:13: remark: loop not vectorized: loop contains a switch statement [-Rpass-analysis=loop-vectorize]
if (temp[small_loop] == 3) // 3's are discarded
I can't think of anything obvious that you can do about this one unless you change your algorithm. If the compression ratio is satisfactory without deleting the 3s, you could perhaps eliminate this.
So what does the generated code look like? Take a look below. How could you write it better by hand? If you can write it better yourself, either do that or feed it back into your algorithm to help guide the compiler.
Does the compiled code take advantage of your target's instruction set and registers?
Most importantly -- try executing it and see where you're spending the most cycles. Stalls from branch misprediction, unaligned loads? Maybe you can do something about those. Use what you know about the frequency of your input data to give the compiler hints about the branches in your encoder.
$ objdump -d --source tryme.o
...
0000000000000000 <compressByPacking>:
#include <stdint.h>
void compressByPacking (uint8_t* out, uint8_t* in, uint32_t length)
{
for (int loop = 0; loop < length/4; loop ++, in += 4, out++)
0: c1 ea 02 shr $0x2,%edx
3: 0f 84 86 00 00 00 je 8f <compressByPacking+0x8f>
9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
{
uint8_t temp[4];
for (int small_loop = 0; small_loop < 4; small_loop++)
{
temp[small_loop] = *in; // Load into local variable
10: 8a 06 mov (%rsi),%al
if (temp[small_loop] == 3) // 3's are discarded
12: 3c 04 cmp $0x4,%al
14: 74 3a je 50 <compressByPacking+0x50>
16: 3c 03 cmp $0x3,%al
18: 41 88 c0 mov %al,%r8b
1b: 75 03 jne 20 <compressByPacking+0x20>
1d: 45 31 c0 xor %r8d,%r8d
20: 3c 04 cmp $0x4,%al
22: 74 33 je 57 <compressByPacking+0x57>
24: 3c 03 cmp $0x3,%al
26: 88 c1 mov %al,%cl
28: 75 02 jne 2c <compressByPacking+0x2c>
2a: 31 c9 xor %ecx,%ecx
2c: 3c 04 cmp $0x4,%al
2e: 74 2d je 5d <compressByPacking+0x5d>
30: 3c 03 cmp $0x3,%al
32: 41 88 c1 mov %al,%r9b
35: 75 03 jne 3a <compressByPacking+0x3a>
37: 45 31 c9 xor %r9d,%r9d
3a: 3c 04 cmp $0x4,%al
3c: 74 26 je 64 <compressByPacking+0x64>
3e: 3c 03 cmp $0x3,%al
40: 75 24 jne 66 <compressByPacking+0x66>
42: 31 c0 xor %eax,%eax
44: eb 20 jmp 66 <compressByPacking+0x66>
46: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4d: 00 00 00
50: 41 b0 03 mov $0x3,%r8b
53: 3c 04 cmp $0x4,%al
55: 75 cd jne 24 <compressByPacking+0x24>
57: b1 03 mov $0x3,%cl
59: 3c 04 cmp $0x4,%al
5b: 75 d3 jne 30 <compressByPacking+0x30>
5d: 41 b1 03 mov $0x3,%r9b
60: 3c 04 cmp $0x4,%al
62: 75 da jne 3e <compressByPacking+0x3e>
64: b0 03 mov $0x3,%al
temp[small_loop] = 3;
} // end small loop
// Pack the bits into write pointer
*out = (uint8_t)((temp[0] & 0x03) << 6) |
66: 41 c0 e0 06 shl $0x6,%r8b
((temp[1] & 0x03) << 4) |
6a: c0 e1 04 shl $0x4,%cl
6d: 80 e1 30 and $0x30,%cl
temp[small_loop] = 3;
} // end small loop
// Pack the bits into write pointer
*out = (uint8_t)((temp[0] & 0x03) << 6) |
70: 44 08 c1 or %r8b,%cl
((temp[1] & 0x03) << 4) |
((temp[2] & 0x03) << 2) |
73: 41 c0 e1 02 shl $0x2,%r9b
77: 41 80 e1 0c and $0xc,%r9b
((temp[3] & 0x03));
7b: 24 03 and $0x3,%al
} // end small loop
// Pack the bits into write pointer
*out = (uint8_t)((temp[0] & 0x03) << 6) |
((temp[1] & 0x03) << 4) |
7d: 44 08 c8 or %r9b,%al
((temp[2] & 0x03) << 2) |
80: 08 c8 or %cl,%al
temp[small_loop] = 3;
} // end small loop
// Pack the bits into write pointer
*out = (uint8_t)((temp[0] & 0x03) << 6) |
82: 88 07 mov %al,(%rdi)
#include <stdint.h>
void compressByPacking (uint8_t* out, uint8_t* in, uint32_t length)
{
for (int loop = 0; loop < length/4; loop ++, in += 4, out++)
84: 48 83 c6 04 add $0x4,%rsi
88: 48 ff c7 inc %rdi
8b: ff ca dec %edx
8d: 75 81 jne 10 <compressByPacking+0x10>
((temp[1] & 0x03) << 4) |
((temp[2] & 0x03) << 2) |
((temp[3] & 0x03));
} // end loop
}
8f: c3 retq
In all the excitement about performance, functionality has been overlooked. The code is broken:
// temp[small_loop] = *in; // Load into local variable
temp[small_loop] = in[small_loop];
Alternative:
How about a simple tight loop?
Use const and restrict to allow various optimizations.
void compressByPacking1(uint8_t* restrict out, const uint8_t* restrict in,
uint32_t length) {
static const uint8_t t[5] = { 0, 1, 2, 0, 3 };
uint32_t length4 = length / 4;
unsigned v = 0;
uint32_t i;
for (i = 0; i < length4; i++) {
for (unsigned j=0; j < 4; j++) {
v <<= 2;
v |= t[*in++];
}
out[i] = (uint8_t) v;
}
if (length & 3) {
v = 0;
for (unsigned j = 0; j < 4; j++) {
v <<= 2;
if (j < (length & 3)) {
v |= t[*in++];
}
}
out[i] = (uint8_t) v;
}
}
Tested and found this code to be about 270% as fast (41 vs 15) (YMMV).
Tested and found to form the same output as OP's (corrected) code
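A check of this kind might be set up roughly as follows. This is a sketch rather than the harness actually used: referencePack, tablePack and packersAgree are hypothetical names, the reference is the question's algorithm with the in[small_loop] fix applied, and only lengths that are a multiple of 4 are compared.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Reference: the question's algorithm with the in[small_loop] fix applied. */
static void referencePack(uint8_t *out, const uint8_t *in, uint32_t length)
{
    for (uint32_t i = 0; i < length / 4; i++, in += 4) {
        uint8_t temp[4];
        for (int k = 0; k < 4; k++) {
            temp[k] = in[k];
            if (temp[k] == 3)       /* 3s are discarded */
                temp[k] = 0;
            else if (temp[k] == 4)  /* 4s are stored as 3 */
                temp[k] = 3;
        }
        out[i] = (uint8_t)((temp[0] << 6) | (temp[1] << 4) | (temp[2] << 2) | temp[3]);
    }
}

/* Candidate: table-driven packing with the same bit layout. */
static void tablePack(uint8_t *out, const uint8_t *in, uint32_t length)
{
    static const uint8_t t[5] = { 0, 1, 2, 0, 3 };
    for (uint32_t i = 0; i < length / 4; i++, in += 4)
        out[i] = (uint8_t)((t[in[0]] << 6) | (t[in[1]] << 4) | (t[in[2]] << 2) | t[in[3]]);
}

/* Returns 1 if both packers produce identical output for `length`
 * pseudo-random values in 0..4, and 0 otherwise (or on allocation failure).
 * `length` is assumed to be a non-zero multiple of 4. */
int packersAgree(uint32_t length, unsigned seed)
{
    uint8_t *in = malloc(length);
    uint8_t *a = malloc(length / 4);
    uint8_t *b = malloc(length / 4);
    int same = 0;

    if (in && a && b) {
        srand(seed);
        for (uint32_t i = 0; i < length; i++)
            in[i] = (uint8_t)(rand() % 5);
        referencePack(a, in, length);
        tablePack(b, in, length);
        same = memcmp(a, b, length / 4) == 0;
    }
    free(in);
    free(a);
    free(b);
    return same;
}
```

Calling packersAgree(1u << 20, 42u) compares the two on 1 MB of input; swapping tablePack for any other candidate keeps the check the same.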
Update: Tested
The unsafe version is the fastest, faster than the ones in the other answers. Tested with VS2017.
const uint8_t table[4][5] =
{ { 0 << 0,1 << 0,2 << 0,0 << 0,3 << 0 },
{ 0 << 2,1 << 2,2 << 2,0 << 2,3 << 2 },
{ 0 << 4,1 << 4,2 << 4,0 << 4,3 << 4 },
{ 0 << 6,1 << 6,2 << 6,0 << 6,3 << 6 },
};
void code(uint8_t *in, uint8_t *out, uint32_t len)
{
memset(out, 0, len / 4 + 1);
for (uint32_t i = 0; i < len; i++)
out[i / 4] |= table[i & 3][in[i] % 5];
}
void code_unsafe(uint8_t *in, uint8_t *out, uint32_t len)
{
for (uint32_t i = 0; i < len; i += 4, in += 4, out++)
{
*out = table[0][in[0]] | table[1][in[1]] | table[2][in[2]] | table[3][in[3]];
}
}
To see what the compiler generates, it is enough to compile it, even online:
https://godbolt.org/g/Z75NQV
These are my small, very simple encoding functions, written just to compare the compiler-generated code (originally posted untested; see the update above).
Does this look clearer?
#include <assert.h>
#include <stdint.h>

// Note: element 0 of each group lands in the low two bits here,
// which is the reverse of the bit order used in the question's code.
void compressByPacking (uint8_t* out, uint8_t* in, uint32_t length)
{
    assert( 0 == length % 4 );
    for (uint32_t loop = 0; loop < length; loop += 4)
    {
        uint8_t temp = 0;
        for (int small_loop = 0; small_loop < 4; small_loop++)
        {
            uint8_t inv = *in++; // get next input value and advance
            switch(inv)
            {
            case 0: // encode as 00
            case 3: // change to 0
                break;
            case 1:
                temp |= (1 << small_loop*2); // 1 encode as '01'
                break;
            case 2:
                temp |= (2 << small_loop*2); // 2 encode as '10'
                break;
            case 4:
                temp |= (3 << small_loop*2); // 4 encode as '11'
                break;
            default:
                assert(0);
            }
        } // end inner loop
        *out++ = temp; // store the packed byte and advance
    } // end outer loop
}
I am trying to find the number that produces the longest Collatz sequence, but I have hit a strange problem: 3n+1 becomes 38654705674 when n is 3. I do not see the error. Here is the full code:
/* 6.c -- calculates Longest Collatz sequence */
#include <stdio.h>
long long get_collatz_length(long long);
int main(void)
{
long long i;
long long current, current_count, count;
current_count = 1;
current = 1;
for(i=2;i<1000000;i++)
{
// works fine when i is 2 the next line take eternity when i is 3;
count = get_collatz_length(i);
if(current_count <= count)
{
current = i;
current_count = count;
}
}
printf("%lld %lld\n", current, current_count);
return 0;
}
long long get_collatz_length(long long num)
{
long long count;
count = 1;
while(num != 1)
{
printf("%lld\n", num);
if(num%2)
{
num = num*3+1; // here it is;
}
else
{
num/=2;
}
count++;
}
puts("");
return count;
}
It seems to be a bug in the dmc compiler, which fails to handle the long long type correctly. Here is a narrowed test case:
#include <stdio.h>
int main(void)
{
long long num = 3LL;
/*printf("%lld\n", num);*/
num = num * 3LL;
char *t = (char *) &num;
for (int i = 0; i < 8; i++)
printf("%x\t", t[i]);
putchar('\n');
/*printf("%lld\n", num);*/
return 0;
}
It produces (little endian, so 0x900000009 == 38 654 705 673):
9 0 0 0 9 0 0 0
From the disassembly, it looks like it stores the 64-bit integer in two 32-bit registers:
.data:0x000000be 6bd203 imul edx,edx,0x3
.data:0x000000c1 6bc803 imul ecx,eax,0x3
.data:0x000000c4 03ca add ecx,edx
.data:0x000000c6 ba03000000 mov edx,0x3
.data:0x000000cb f7e2 mul edx
.data:0x000000cd 03d1 add edx,ecx
.data:0x000000cf 31c0 xor eax,eax
I additionally tested it with the objconv tool, which just confirms my initial diagnosis:
#include <stdio.h>
void mul(void)
{
long long a;
long long c;
a = 5LL;
c = a * 3LL;
printf("%llx\n", c);
}
int main(void)
{
mul();
return 0;
}
disassembly (single section):
>objconv.exe -fmasm ..\dm\bin\check.obj
_mul PROC NEAR
mov eax, 5 ; 0000 _ B8, 00000005
cdq ; 0005 _ 99
imul edx, edx, 3 ; 0006 _ 6B. D2, 03
imul ecx, eax, 3 ; 0009 _ 6B. C8, 03
add ecx, edx ; 000C _ 03. CA
mov edx, 3 ; 000E _ BA, 00000003
mul edx ; 0013 _ F7. E2
add edx, ecx ; 0015 _ 03. D1
push edx ; 0017 _ 52
push eax ; 0018 _ 50
push offset FLAT:?_001 ; 0019 _ 68, 00000000(segrel)
call _printf ; 001E _ E8, 00000000(rel)
add esp, 12 ; 0023 _ 83. C4, 0C
ret ; 0026 _ C3
_mul ENDP
Note that mul edx operates implicitly on eax. The result is stored in both registers: the higher part (in this case 0) is stored in edx, and the lower part in eax.
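To make the arithmetic concrete, here is a small C model of a 64-bit multiply built from 32-bit halves, next to a model of what the listing appears to compute. It is a sketch based on my reading of the disassembly (both imul instructions use the constant 3, so the cross term becomes hi*3 + lo*3 instead of hi*3 + lo*0); the function names are made up for the example.

```c
#include <stdint.h>

/* Correct 64 x 64 -> 64 bit multiply built from 32-bit halves. */
uint64_t mul64_from_halves(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t low = (uint64_t)a_lo * b_lo;        /* what `mul` leaves in edx:eax */
    uint32_t cross = a_hi * b_lo + a_lo * b_hi;  /* only affects the upper 32 bits */
    return low + ((uint64_t)cross << 32);
}

/* Model of the listing for a * 3: the low half of `a` is also multiplied
 * by 3 where it should be multiplied by the high half of the constant (0). */
uint64_t mul3_as_in_listing(uint64_t a)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);

    uint64_t low = (uint64_t)a_lo * 3u;
    uint32_t cross = a_hi * 3u + a_lo * 3u;      /* should be a_hi * 3u + a_lo * 0u */
    return low + ((uint64_t)cross << 32);
}
```

Under this model, mul3_as_in_listing(3) gives 0x900000009, the value seen in the byte dump above, while mul64_from_halves(3, 3) gives 9.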
I'm writing a kernel module to find the memory address of do_debug (0xffffffff8134f709) by first searching for the hex bytes next to the address. I'm not sure I am using the correct hex bytes: "\xe8\x61\x07\x00\x00" (I wish to stick to C and not assembly.)
struct desc_ptr idt_register;
store_idt(&idt_register);
printk("idt_register.address: %lx\n", idt_register.address); // same as in /boot/System.map*
gate_desc *idt_table = (gate_desc *)idt_register.address;
unsigned char *debug = (unsigned char *)gate_offset(idt_table[0x1]);
printk("debug: %lx\n", (unsigned long)debug); // same as in /boot/System.map*
int count = 0;
while(count < 150){
if( (*(debug) == 0xe8) && (*(debug + 1) == 0x61) && (*(debug + 2) == 0x07) && (*(debug + 3) == 0x00) && (*(debug + 4) == 0x00) ){
debug += 5;
unsigned long *do_debug = (unsigned long *)(0xffffffff00000000 | *((unsigned long *)(debug)));
if((unsigned long)do_debug != 0xffffffff8134f709){
printk("do_debug: %lx\n", (unsigned long)do_debug); // wrong address !
return;
}
break;
}
debug++;
count++;
}
gdb:
gdb ./vmlinux-3.2.0-4-amd64
...
(gdb) info line debug
Line 54 of "/build/linux-s5x2oE/linux-3.2.46/drivers/pci/hotplug/pci_hotplug_core.c" is at address 0xffffffff811cf02b <power_read_file+2> but contains no code.
Line 1322 of "/build/linux-s5x2oE/linux-3.2.46/arch/x86/kernel/entry_64.S"
starts at address 0xffffffff8134ef80 and ends at 0xffffffff8134efc0.
(gdb) disas 0xffffffff8134ef80,0xffffffff8134efc0
Dump of assembler code from 0xffffffff8134ef80 to 0xffffffff8134efc0:
0xffffffff8134ef80: callq *0x2c687a(%rip) # 0xffffffff81615800
0xffffffff8134ef86: pushq $0xffffffffffffffff
0xffffffff8134ef88: sub $0x78,%rsp
0xffffffff8134ef8c: callq 0xffffffff8134ed40
0xffffffff8134ef91: mov %rsp,%rdi
0xffffffff8134ef94: xor %esi,%esi
0xffffffff8134ef96: subq $0x1000,%gs:0x1137c
0xffffffff8134efa3: callq 0xffffffff8134f709 <do_debug>
0xffffffff8134efa8: addq $0x1000,%gs:0x1137c
0xffffffff8134efb5: jmpq 0xffffffff8134f160
0xffffffff8134efba: nopw 0x0(%rax,%rax,1)
End of assembler dump.
(gdb) x/i 0xffffffff8134efa3
0xffffffff8134efa3: callq 0xffffffff8134f709 <do_debug>
(gdb) x/xw 0xffffffff8134efa3
0xffffffff8134efa3: 0x000761e8
(gdb)
readelf:
ffffffff8134ef96: 65 48 81 2c 25 7c 13 subq $0x1000,%gs:0x1137c
ffffffff8134ef9d: 01 00 00 10 00 00
ffffffff8134efa3: e8 61 07 00 00 callq ffffffff8134f709 <do_debug>
ffffffff8134efa8: 65 48 81 04 25 7c 13 addq $0x1000,%gs:0x1137c
ffffffff8134efaf: 01 00 00 10 00 00
EDIT:
(gdb) print do_debug
$1 = {void (struct pt_regs *, long int)} 0xffffffff8134f709 <do_debug>
(gdb)
I am not familiar with the Linux kernel.
In the while loop, once the first if statement matches, debug (after debug += 5) points to ffffffff8134efa8:
ffffffff8134efa8: 65 48 81 04 25 7c 13 addq $0x1000,%gs:0x1137c # debug point here
ffffffff8134efaf: 01 00 00 10 00 00
The next statement
unsigned long *do_debug = (unsigned long *)(0xffffffff00000000 | *((unsigned long *)(debug)));
then reads the 8 bytes at that address (the start of the addq instruction) and produces this result:
do_debug = (unsigned long *)(0xffffffff00000000 | 0x01137c2504814865) = 0xffffffff04814865
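If you want to get the do_debug address, decode the call's relative offset instead. The four bytes after the 0xe8 opcode are a signed 32-bit displacement, stored little-endian and relative to the next instruction. x86 is little-endian as well, so the value can be read directly; it must not be passed through ntohl(), which would byte-swap 0x00000761 into 0x61070000:
if( (*(debug) == 0xe8) && (*(debug + 1) == 0x61) && (*(debug + 2) == 0x07) && (*(debug + 3) == 0x00) && (*(debug + 4) == 0x00) ){
    //debug += 5;
    int32_t offset = *(int32_t *)(debug + 1); // signed rel32 displacement of the call
    unsigned long *do_debug = (unsigned long *)(debug + 5 + offset); // next instruction + offset
    //debug+5 points to ffffffff8134efa8
    //offset is 0x761
    //so ffffffff8134efa8+0x761 = 0xffffffff8134f709
    if((unsigned long)do_debug != 0xffffffff8134f709){
        printk("do_debug: %lx\n", (unsigned long)do_debug); // wrong address !
        return;
    }
    break;
}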
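The same calculation can be tried outside the kernel. The helper below is a user-space sketch (call_rel32_target is a hypothetical name, not a kernel API): it assembles the displacement byte by byte, so it does not depend on the host's byte order or on alignment.

```c
#include <stdint.h>

/* Returns the destination of a 5-byte `call rel32` instruction (opcode 0xE8)
 * whose raw bytes are at `insn` and which is located at address `insn_addr`.
 * Returns 0 if the opcode does not match. */
uint64_t call_rel32_target(const unsigned char *insn, uint64_t insn_addr)
{
    if (insn[0] != 0xE8)
        return 0;

    /* The displacement is stored least significant byte first. */
    uint32_t raw = (uint32_t)insn[1]
                 | ((uint32_t)insn[2] << 8)
                 | ((uint32_t)insn[3] << 16)
                 | ((uint32_t)insn[4] << 24);
    int32_t rel = (int32_t)raw;  /* signed: calls can also go backwards */

    /* Target = address of the next instruction + sign-extended displacement. */
    return insn_addr + 5 + (uint64_t)(int64_t)rel;
}
```

Feeding it the bytes e8 61 07 00 00 with the address 0xffffffff8134efa3 from the dump yields 0xffffffff8134f709.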
I am trying to write C code that generates random integers, performs a simple calculation, and then writes the values to a file in IEEE-754 format, but I am unable to do so. Please help.
I am unable to print them in hexadecimal/binary, which is very important.
If I typecast the values in fprintf, I get this error: expected expression before double.
int main (int argc, char *argv) {
int limit = 20 ; double a[limit], b[limit]; //Inputs
double result[limit] ; int i , k ; //Outputs
printf("limit = %d", limit ); double q;
for (i= 0 ; i< limit;i++)
{
a[i]= rand();
b[i]= rand();
printf ("A= %x B = %x\n",a[i],b[i]);
}
char op;
printf("Enter the operand used : add,subtract,multiply,divide\n");
scanf ("%c", &op); switch (op) {
case '+': {
for (k= 0 ; k< limit ; k++)
{
result [k]= a[k] + b[k];
printf ("result= %f\n",result[k]);
}
}
break;
case '*': {
for (k= 0 ; k< limit ; k++)
{
result [k]= a[k] * b[k];
}
}
break;
case '/': {
for (k= 0 ; k< limit ; k++)
{
result [k]= a[k] / b[k];
}
}
break;
case '-': {
for (k= 0 ; k< limit ; k++)
{
result [k]= a[k] - b[k];
}
}
break; }
FILE *file; file = fopen("tb.txt","w"); for(k=0;k<limit;k++) {
fprintf (file,"%x\n
%x\n%x\n\n",double(a[k]),double(b[k]),double(result[k]) );
}
fclose(file); /*done!*/
}
If your C compiler supports IEEE-754 floating point format directly (because the CPU supports it) or fully emulates it, you may be able to print doubles simply as bytes. And that is the case for the x86/64 platform.
Here's an example:
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <float.h>
void PrintDoubleAsCBytes(double d, FILE* f)
{
unsigned char a[sizeof(d)];
unsigned i;
memcpy(a, &d, sizeof(d));
for (i = 0; i < sizeof(a); i++)
fprintf(f, "%0*X ", (CHAR_BIT + 3) / 4, a[i]);
}
int main(void)
{
PrintDoubleAsCBytes(0.0, stdout); puts("");
PrintDoubleAsCBytes(0.5, stdout); puts("");
PrintDoubleAsCBytes(1.0, stdout); puts("");
PrintDoubleAsCBytes(2.0, stdout); puts("");
PrintDoubleAsCBytes(-2.0, stdout); puts("");
PrintDoubleAsCBytes(DBL_MIN, stdout); puts("");
PrintDoubleAsCBytes(DBL_MAX, stdout); puts("");
PrintDoubleAsCBytes(INFINITY, stdout); puts("");
#ifdef NAN
PrintDoubleAsCBytes(NAN, stdout); puts("");
#endif
return 0;
}
Output (ideone):
00 00 00 00 00 00 00 00
00 00 00 00 00 00 E0 3F
00 00 00 00 00 00 F0 3F
00 00 00 00 00 00 00 40
00 00 00 00 00 00 00 C0
00 00 00 00 00 00 10 00
FF FF FF FF FF FF EF 7F
00 00 00 00 00 00 F0 7F
00 00 00 00 00 00 F8 7F
If IEEE-754 isn't supported directly, the problem becomes more complex. However, it can still be solved.
Here are a few related questions and answers that can help:
How do I handle byte order differences when reading/writing floating-point types in C?
Is there a tool to know whether a value has an exact binary representation as a floating point variable?
C dynamically printf double, no loss of precision and no trailing zeroes
And, of course, all the IEEE-754 related info can be found in Wikipedia.
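Try this in your fprintf part:
fprintf (file,"%x\n%x\n%x\n\n",*((int*)(&a[k])),*((int*)(&b[k])),*((int*)(&result[k])));
That reinterprets the bytes of the double as an integer, so its IEEE-754 bit pattern is printed rather than a converted value. Note that it only shows the first sizeof(int) bytes, and that the pointer cast technically violates the strict-aliasing rules.
If int is 32-bit and double is 64-bit (the usual case), print both halves, zero-padded, with the high word first. On a little-endian machine the high word is the second one:
fprintf (file,"%08x%08x\n%08x%08x\n%08x%08x\n\n",*((unsigned*)(&a[k])+1),*((unsigned*)(&a[k])),*((unsigned*)(&b[k])+1),*((unsigned*)(&b[k])),*((unsigned*)(&result[k])+1),*((unsigned*)(&result[k])));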
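A variant that avoids the pointer casts is to copy the bytes into a 64-bit integer with memcpy and print that. The sketch below assumes that double is a 64-bit IEEE-754 type and that integers and doubles share the same byte order (true on common platforms); double_bits and fprint_double_hex are names chosen for the example.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Copies the object representation of d into a 64-bit integer.
 * Assumes sizeof(double) == 8 (IEEE-754 binary64). */
uint64_t double_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Writes d as 16 hex digits, most significant digit first, plus a newline. */
void fprint_double_hex(FILE *f, double d)
{
    fprintf(f, "%016" PRIx64 "\n", double_bits(d));
}
```

In the loop from the question this would be used as fprint_double_hex(file, a[k]); and so on for b[k] and result[k]. For 1.0 the bits are 0x3ff0000000000000.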
In C, there are two ways to get at the bytes in a float value: a pointer cast, or a union. I recommend a union.
I just tested this code with GCC and it worked:
#include <stdio.h>
typedef unsigned char BYTE;
int
main()
{
float f = 3.14f;
int i = sizeof(float) - 1;
BYTE *p = (BYTE *)(&f);
p[i] = p[i] | 0x80; // set the sign bit
printf("%f\n", f); // prints -3.140000
}
We are taking the address of the variable f, then assigning it to a pointer to BYTE (unsigned char). We use a cast to force the pointer. Index sizeof(float) - 1 selects the byte that holds the sign bit on a little-endian machine such as x86.
A cast to unsigned char * as shown above is always allowed, because character types may alias any object. If you cast to some other type instead (say unsigned int *) and compile with optimizations enabled, you might run into the compiler complaining about "type-punned pointer" issues, because that breaks the strict-aliasing rules. You can always use the other way to get at the bits: put the float into a union with an array of bytes.
#include <stdio.h>
typedef unsigned char BYTE;
typedef union
{
float f;
BYTE b[sizeof(float)];
} UFLOAT;
int
main()
{
UFLOAT u;
int const i = sizeof(float) - 1;
u.f = 3.14f;
u.b[i] = u.b[i] | 0x80; // set the sign bit
printf("%f\n", u.f); // prints -3.140000
}
What definitely will not work is to try to cast the float value directly to an unsigned integer or something like that. C doesn't know you just want to reinterpret the bits, so it converts the value instead, discarding the fractional part.
float f = 3.14;
unsigned int i = (unsigned int)f;
if (i == 3)
printf("yes\n"); // will print "yes"
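A third way to reinterpret the bits, besides the byte pointer and the union, is memcpy into an integer of the same size. The sketch below sets the sign bit that way; it assumes a 32-bit IEEE-754 float with the same byte order as integers, and with_sign_bit_set is a name made up for the example. Unlike the byte-index versions above, it does not need to know which byte holds the sign bit.

```c
#include <stdint.h>
#include <string.h>

/* Returns f with its sign bit set, i.e. -|f| for ordinary values.
 * Assumes sizeof(float) == 4 (IEEE-754 binary32). */
float with_sign_bit_set(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* reinterpret, no value conversion */
    bits |= UINT32_C(0x80000000);     /* bit 31 is the sign bit */
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

With f = 3.14f this returns -3.14f, matching the two programs above.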
P.S. Discussion of "type-punned" pointers here:
Dereferencing type-punned pointer will break strict-aliasing rules
I'm trying to optimize this function using SIMD but I don't know where to start.
long sum(int x,int y)
{
return x*x*x+y*y*y;
}
The disassembled function looks like this:
4007a0: 48 89 f2 mov %rsi,%rdx
4007a3: 48 89 f8 mov %rdi,%rax
4007a6: 48 0f af d6 imul %rsi,%rdx
4007aa: 48 0f af c7 imul %rdi,%rax
4007ae: 48 0f af d6 imul %rsi,%rdx
4007b2: 48 0f af c7 imul %rdi,%rax
4007b6: 48 8d 04 02 lea (%rdx,%rax,1),%rax
4007ba: c3 retq
4007bb: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
The calling code looks like this:
do {
for (i = 0; i < maxi; i++) {
j = nextj[i];
long sum = cubeSum(i,j);
while (sum <= p) {
long x = sum & (psize - 1);
int flag = table[x];
if (flag <= guard) {
table[x] = guard+1;
} else if (flag == guard+1) {
table[x] = guard+2;
count++;
}
j++;
sum = cubeSum(i,j);
}
nextj[i] = j;
}
p += psize;
guard += 3;
} while (p <= n);
Fill one SSE register with (x|y|0|0) (since each SSE register holds four 32-bit elements). Let's call it r1.
Then make a copy of that register in another register, r2.
Do r2 * r1, storing the result in, say, r2.
Do r2 * r1 again, storing the result in r2.
Now in r2 you have (x*x*x|y*y*y|0|0).
Unpack the lower two elements of r2 into separate registers and add them (SSE3 has horizontal add instructions, but only for floats and doubles).
In the end, I'd actually be surprised if this turned out to be any faster than the simple code the compiler has already generated for you. SIMD is more useful if you have arrays of data you want to operate on.
This particular case is not a good fit for SIMD (SSE or otherwise). SIMD really only works well when you have contiguous arrays that you can access sequentially and process homogeneously, i.e. with the same operation applied to every element.
However you can at least get rid of some of the redundant operations in the scalar code, e.g. repeatedly calculating i * i * i when i is invariant:
do {
for (i = 0; i < maxi; i++) {
long i3 = (long)i * i * i; /* widen first so the cube cannot overflow int */
int j = nextj[i];
long j3 = (long)j * j * j;
long sum = i3 + j3;
while (sum <= p) {
long x = sum & (psize - 1);
int flag = table[x];
if (flag <= guard) {
table[x] = guard+1;
} else if (flag == guard+1) {
table[x] = guard+2;
count++;
}
j++;
j3 = (long)j * j * j;
sum = i3 + j3;
}
nextj[i] = j;
}
p += psize;
guard += 3;
} while (p <= n);
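Going one step further, the cube of j does not have to be recomputed with multiplications on every iteration, because consecutive cubes differ by (j+1)^3 - j^3 = 3j^2 + 3j + 1, and that difference itself grows by 6(j+1) each step. The sketch below shows the idea in isolation; CubeIter and its functions are hypothetical names, and whether this beats two multiplications on a given CPU would have to be measured.

```c
/* Tracks j, j^3 and the gap to the next cube so that advancing needs
 * only additions and one small multiplication by a constant. */
typedef struct {
    long j;     /* current base */
    long cube;  /* j^3 */
    long diff;  /* (j+1)^3 - j^3 = 3j^2 + 3j + 1 */
} CubeIter;

CubeIter cube_iter_start(long j)
{
    CubeIter it = { j, j * j * j, 3 * j * j + 3 * j + 1 };
    return it;
}

void cube_iter_next(CubeIter *it)
{
    it->cube += it->diff;   /* now (j+1)^3 */
    it->j += 1;
    it->diff += 6 * it->j;  /* the gap grows by 6(j+1) per step */
}
```

In the inner loop above, j++ and the recomputation of j3 would then be replaced by a single cube_iter_next call, with sum = i3 + it.cube.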