Compressing a 'char' array using bit packing in C [closed] - c

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 5 years ago.
Improve this question
I have a large array (around 1 MB) of type unsigned char (i.e. uint8_t). I know that the bytes in it can have only one of 5 values (i.e. 0, 1, 2, 3, 4). Moreover we do not need to preserve '3's from the input, they can be safely lost when we encode/decode.
So I guessed bit packing would be the simplest way to compress it, so every byte can be converted to 2 bits (00, 01..., 11).
As mentioned all elements of value 3 can be removed (i.e. saved as 0). Which gives me option to save '4' as '3'. While reconstructing (decompressing) I restore 3's to 4's.
I wrote a small function for the compression but I feel this has too many operations and just not efficient enough. Any code snippets or suggestion on how to make it more efficient or faster (hopefully keeping the readability) will be very much helpful.
/// Compress by packing ...
void compressByPacking (uint8_t* out, uint8_t* in, uint32_t length)
{
for (int loop = 0; loop < length/4; loop ++, in += 4, out++)
{
uint8_t temp[4];
for (int small_loop = 0; small_loop < 4; small_loop++)
{
temp[small_loop] = *in; // Load into local variable
if (temp[small_loop] == 3) // 3's are discarded
temp[small_loop] = 0;
else if (temp[small_loop] == 4) // and 4's are converted to 3
temp[small_loop] = 3;
} // end small loop
// Pack the bits into write pointer
*out = (uint8_t)((temp[0] & 0x03) << 6) |
((temp[1] & 0x03) << 4) |
((temp[2] & 0x03) << 2) |
((temp[3] & 0x03));
} // end loop
}
Edited to make the problem more clear as it looked like I'm trying to save 5 values into 2 bits. Thanks to #Brian Cain for suggested wording.
Cross-posted on Code Review.

Your function has a bug: when loading the small array, you should write:
temp[small_loop] = in[small_loop];
You can get rid of the tests with a lookup table, either on the source data, or more efficiently on some intermediary result:
In the code below, I use a small table lookup5 to convert the values 0,1,2,3,4 to 0,1,2,0,3, and a larger one to map groups of 4 3-bit values from the source array to the corresponding byte value in the packed format:
#include <stdint.h>
/// Compress by packing ...
void compressByPacking0(uint8_t *out, uint8_t *in, uint32_t length) {
static uint8_t lookup[4096];
static const uint8_t lookup5[8] = { 0, 1, 2, 0, 3, 0, 0, 0 };
if (lookup[0] == 0) {
/* initialize lookup table */
for (int i = 0; i < 4096; i++) {
lookup[i] = (lookup5[(i >> 0) & 7] << 0) +
(lookup5[(i >> 3) & 7] << 2) +
(lookup5[(i >> 6) & 7] << 4) +
(lookup5[(i >> 9) & 7] << 6);
}
}
for (; length >= 4; length -= 4, in += 4, out++) {
*out = lookup[(in[0] << 9) + (in[1] << 6) + (in[2] << 3) + (in[3] << 0)];
}
uint8_t last = 0;
switch (length) {
case 3:
last |= lookup5[in[2]] << 4;
/* fall through */
case 2:
last |= lookup5[in[1]] << 2;
/* fall through */
case 1:
last |= lookup5[in[0]] << 0;
*out = last;
break;
}
}
Notes:
The code assumes the array does not contain values outside the specified range. Extra protection against spurious input can be achieved at a minimal cost.
The dummy << 0 are here only for symmetry and compile to no extra code.
The lookup table could be initialized statically, via a build time script or a set of macros.
You might want to unroll this loop 4 or more times, or let the compiler decide.
You could also use this simpler solution with a smaller lookup table accessed more often. Careful benchmarking will tell you which is more efficient on your target system:
/// Compress by packing ...
void compressByPacking1(uint8_t *out, uint8_t *in, uint32_t length) {
static const uint8_t lookup[4][5] = {
{ 0 << 6, 1 << 6, 2 << 6, 0 << 6, 3 << 6 },
{ 0 << 4, 1 << 4, 2 << 4, 0 << 4, 3 << 4 },
{ 0 << 2, 1 << 2, 2 << 2, 0 << 2, 3 << 2 },
{ 0 << 0, 1 << 0, 2 << 0, 0 << 0, 3 << 0 },
};
for (; length >= 4; length -= 4, in += 4, out++) {
*out = lookup[0][in[0]] + lookup[1][in[1]] +
lookup[2][in[2]] + lookup[3][in[3]];
}
uint8_t last = 0;
switch (length) {
case 3:
last |= lookup[2][in[2]];
/* fall through */
case 2:
last |= lookup[1][in[1]];
/* fall through */
case 1:
last |= lookup[0][in[0]];
*out = last;
break;
}
}
Here is yet another approach, without any tables:
/// Compress by packing ...
void compressByPacking2(uint8_t *out, uint8_t *in, uint32_t length) {
#define BITS ((1 << 2) + (2 << 4) + (3 << 8))
for (; length >= 4; length -= 4, in += 4, out++) {
*out = ((BITS << 6 >> (in[0] + in[0])) & 0xC0) +
((BITS << 4 >> (in[1] + in[1])) & 0x30) +
((BITS << 2 >> (in[2] + in[2])) & 0x0C) +
((BITS << 0 >> (in[3] + in[3])) & 0x03);
}
uint8_t last = 0;
switch (length) {
case 3:
last |= (BITS << 2 >> (in[2] + in[2])) & 0x0C;
/* fall through */
case 2:
last |= (BITS << 4 >> (in[1] + in[1])) & 0x30;
/* fall through */
case 1:
last |= (BITS << 6 >> (in[0] + in[0])) & 0xC0;
*out = last;
break;
}
}
Here is a comparative benchmark on my system, Macbook pro running OS/X, with clang -O2:
compressByPacking(1MB) -> 0.867ms
compressByPacking0(1MB) -> 0.445ms
compressByPacking1(1MB) -> 0.538ms
compressByPacking2(1MB) -> 0.824ms
The compressByPacking0 variant is fastest, almost twice as fast as your code.
It is a little disappointing, but the code is portable. You might squeeze more performance using handcoded SSE optimizations.

I have a large array (around 1 MB)
Either this is a typo, your target is seriously aging, or this compression operation is invoked repeatedly in the critical path of your application.
Any code snippets or suggestion on how to make it more efficient or
faster (hopefully keeping the readability) will be very much helpful.
In general, you will find the best information by empirically measuring the performance and inspecting the generated code. Using profilers to determine what code is executing, where there are cache misses and pipeline stalls -- these can help you tune your algorithm.
For example, you chose a stride of 4 elements. Is that just because you are mapping four input elements to a single byte? Can you use native SIMD instructions/intrinsics to operate on more elements at a time?
Also, how are you compiling for your target and how well is your compiler able to optimize your code?
Let's ask clang whether it finds any problems trying to optimize your code:
$ clang -fvectorize -O3 -Rpass-missed=licm -c tryme.c
tryme.c:11:28: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
temp[small_loop] = *in; // Load into local variable
^
tryme.c:21:25: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
*out = (uint8_t)((temp[0] & 0x03) << 6) |
^
tryme.c:22:25: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
((temp[1] & 0x03) << 4) |
^
tryme.c:23:25: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
((temp[2] & 0x03) << 2) |
^
tryme.c:24:25: remark: failed to move load with loop-invariant address because the loop may invalidate its value [-Rpass-missed=licm]
((temp[3] & 0x03));
^
I'm not sure but maybe alias analysis is what makes it think it can't move this load. Try playing with __restrict__ to see if that has any effect.
$ clang -fvectorize -O3 -Rpass-analysis=loop-vectorize -c tryme.c
tryme.c:13:13: remark: loop not vectorized: loop contains a switch statement [-Rpass-analysis=loop-vectorize]
if (temp[small_loop] == 3) // 3's are discarded
I can't think of anything obvious that you can do about this one unless you change your algorithm. If the compression ratio is satisfactory without deleting the 3s, you could perhaps eliminate this.
So what's the generated code look like? Take a look below. How could you write it better by hand? If you can write it better yourself, either do that or feed it back into your algorithm to help guide the compiler.
Does the compiled code take advantage of your target's instruction set and registers?
Most importantly -- try executing it and see where you're spending the most cycles. Stalls from branch misprediction, unaligned loads? Maybe you can do something about those. Use what you know about the frequency of your input data to give the compiler hints about the branches in your encoder.
$ objdump -d --source tryme.o
...
0000000000000000 <compressByPacking>:
#include <stdint.h>
void compressByPacking (uint8_t* out, uint8_t* in, uint32_t length)
{
for (int loop = 0; loop < length/4; loop ++, in += 4, out++)
0: c1 ea 02 shr $0x2,%edx
3: 0f 84 86 00 00 00 je 8f <compressByPacking+0x8f>
9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
{
uint8_t temp[4];
for (int small_loop = 0; small_loop < 4; small_loop++)
{
temp[small_loop] = *in; // Load into local variable
10: 8a 06 mov (%rsi),%al
if (temp[small_loop] == 3) // 3's are discarded
12: 3c 04 cmp $0x4,%al
14: 74 3a je 50 <compressByPacking+0x50>
16: 3c 03 cmp $0x3,%al
18: 41 88 c0 mov %al,%r8b
1b: 75 03 jne 20 <compressByPacking+0x20>
1d: 45 31 c0 xor %r8d,%r8d
20: 3c 04 cmp $0x4,%al
22: 74 33 je 57 <compressByPacking+0x57>
24: 3c 03 cmp $0x3,%al
26: 88 c1 mov %al,%cl
28: 75 02 jne 2c <compressByPacking+0x2c>
2a: 31 c9 xor %ecx,%ecx
2c: 3c 04 cmp $0x4,%al
2e: 74 2d je 5d <compressByPacking+0x5d>
30: 3c 03 cmp $0x3,%al
32: 41 88 c1 mov %al,%r9b
35: 75 03 jne 3a <compressByPacking+0x3a>
37: 45 31 c9 xor %r9d,%r9d
3a: 3c 04 cmp $0x4,%al
3c: 74 26 je 64 <compressByPacking+0x64>
3e: 3c 03 cmp $0x3,%al
40: 75 24 jne 66 <compressByPacking+0x66>
42: 31 c0 xor %eax,%eax
44: eb 20 jmp 66 <compressByPacking+0x66>
46: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4d: 00 00 00
50: 41 b0 03 mov $0x3,%r8b
53: 3c 04 cmp $0x4,%al
55: 75 cd jne 24 <compressByPacking+0x24>
57: b1 03 mov $0x3,%cl
59: 3c 04 cmp $0x4,%al
5b: 75 d3 jne 30 <compressByPacking+0x30>
5d: 41 b1 03 mov $0x3,%r9b
60: 3c 04 cmp $0x4,%al
62: 75 da jne 3e <compressByPacking+0x3e>
64: b0 03 mov $0x3,%al
temp[small_loop] = 3;
} // end small loop
// Pack the bits into write pointer
*out = (uint8_t)((temp[0] & 0x03) << 6) |
66: 41 c0 e0 06 shl $0x6,%r8b
((temp[1] & 0x03) << 4) |
6a: c0 e1 04 shl $0x4,%cl
6d: 80 e1 30 and $0x30,%cl
temp[small_loop] = 3;
} // end small loop
// Pack the bits into write pointer
*out = (uint8_t)((temp[0] & 0x03) << 6) |
70: 44 08 c1 or %r8b,%cl
((temp[1] & 0x03) << 4) |
((temp[2] & 0x03) << 2) |
73: 41 c0 e1 02 shl $0x2,%r9b
77: 41 80 e1 0c and $0xc,%r9b
((temp[3] & 0x03));
7b: 24 03 and $0x3,%al
} // end small loop
// Pack the bits into write pointer
*out = (uint8_t)((temp[0] & 0x03) << 6) |
((temp[1] & 0x03) << 4) |
7d: 44 08 c8 or %r9b,%al
((temp[2] & 0x03) << 2) |
80: 08 c8 or %cl,%al
temp[small_loop] = 3;
} // end small loop
// Pack the bits into write pointer
*out = (uint8_t)((temp[0] & 0x03) << 6) |
82: 88 07 mov %al,(%rdi)
#include <stdint.h>
void compressByPacking (uint8_t* out, uint8_t* in, uint32_t length)
{
for (int loop = 0; loop < length/4; loop ++, in += 4, out++)
84: 48 83 c6 04 add $0x4,%rsi
88: 48 ff c7 inc %rdi
8b: ff ca dec %edx
8d: 75 81 jne 10 <compressByPacking+0x10>
((temp[1] & 0x03) << 4) |
((temp[2] & 0x03) << 2) |
((temp[3] & 0x03));
} // end loop
}
8f: c3 retq

In all the excitement about performance, functionality is overlooked. Code is broke.
// temp[small_loop] = *in; // Load into local variable
temp[small_loop] = in[small_loop];
Alternative:
How about a simple tight loop?
Use const and restrict to allow various optimizations.
void compressByPacking1(uint8_t* restrict out, const uint8_t* restrict in,
uint32_t length) {
static const uint8_t t[5] = { 0, 1, 2, 0, 3 };
uint32_t length4 = length / 4;
unsigned v = 0;
uint32_t i;
for (i = 0; i < length4; i++) {
for (unsigned j=0; j < 4; j++) {
v <<= 2;
v |= t[*in++];
}
out[i] = (uint8_t) v;
}
if (length & 3) {
v = 0;
for (unsigned j; j < 4; j++) {
v <<= 2;
if (j < (length & 3)) {
v |= t[*in++];
}
}
out[i] = (uint8_t) v;
}
}
Tested and found this code to be about 270% as fast (41 vs 15) (YMMV).
Tested and found to form the same output as OP's (corrected) code

Update: Tested
Unsafe version is the fastest - fastest than other ones in another answers. Tested with VS2017
const uint8_t table[4][5] =
{ { 0 << 0,1 << 0,2 << 0,0 << 0,3 << 0 },
{ 0 << 2,1 << 2,2 << 2,0 << 2,3 << 2 },
{ 0 << 4,1 << 4,2 << 4,0 << 4,3 << 4 },
{ 0 << 6,1 << 6,2 << 6,0 << 6,3 << 6 },
};
void code(uint8_t *in, uint8_t *out, uint32_t len)
{
memset(out, 0, len / 4 + 1);
for (uint32_t i = 0; i < len; i++)
out[i / 4] |= table[i & 3][in[i] % 5];
}
void code_unsafe(uint8_t *in, uint8_t *out, uint32_t len)
{
for (uint32_t i = 0; i < len; i += 4, in += 4, out++)
{
*out = table[0][in[0]] | table[1][in[1]] | table[2][in[2]] | table[3][in[3]];
}
}
To check how it is written it is enough to compile it - even online
https://godbolt.org/g/Z75NQV
There are small very simple my coding functions - just for comparition of the compiler generated code, not tested.

Does this look clearer?
void compressByPacking (uint8_t* out, uint8_t* in, uint32_t length)
{
assert( 0 == length % 4 );
for (int loop = 0; loop < length; loop += 4)
{
uint8_t temp = 0;
for (int small_loop = 0; small_loop < 4; small_loop++)
{
uint8_t inv = *in; // get next input value
switch(inv)
{
case 0: // encode as 00
case 3: // change to 0
break;
case 1:
temp |= (1 << smal_loop*2); // 1 encode as '01'
break;
case 2:
temp |= (2 << smal_loop*2); // 2 encode as '10'
break;
case 4:
temp |= (3 << smal_loop*2); // 4 encode as '11'
break;
default:
assert(0);
}
} // end inner loop
*out = temp;
} // end outer loop
}

Related

Why final pointer is being aligned to size of int?

Here's the code under consideration:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
char buffer[512];
int pos;
int posf;
int i;
struct timeval *tv;
int main(int argc, char **argv)
{
pos = 0;
for (i = 0; i < 512; i++) buffer[i] = 0;
for (i = 0; i < 4; i++)
{
printf("pos = %d\n", pos);
*(int *)(buffer + pos + 4) = 0x12345678;
pos += 9;
}
for (i = 0; i < 9 * 4; i++)
{
printf(" %02X", (int)(unsigned char)*(buffer + i));
if ((i % 9) == 8) printf("\n");
}
printf("\n");
// ---
pos = 0;
for (i = 0; i < 512; i++) buffer[i] = 0;
*(int *)(buffer + 4) = 0x12345678;
*(int *)(buffer + 9 + 4) = 0x12345678;
*(int *)(buffer + 18 + 4) = 0x12345678;
*(int *)(buffer + 27 + 4) = 0x12345678;
for (i = 0; i < 9 * 4; i++)
{
printf(" %02X", (int)(unsigned char)*(buffer + i));
if ((i % 9) == 8) printf("\n");
}
printf("\n");
return 0;
}
And the output of code is
pos = 0
pos = 9
pos = 18
pos = 27
00 00 00 00 78 56 34 12 00
00 00 00 78 56 34 12 00 00
00 00 78 56 34 12 00 00 00
00 78 56 34 12 00 00 00 00
00 00 00 00 78 56 34 12 00
00 00 00 00 78 56 34 12 00
00 00 00 00 78 56 34 12 00
00 00 00 00 78 56 34 12 00
I can not get why
*(int *)(buffer + pos + 4) = 0x12345678;
is being placed into the address aligned to size of int (4 bytes). I expect the following actions during the execution of this command:
pointer to buffer, which is char*, increased by the value of pos (0, 9, 18, 27) and then increased by 4. The resulting pointer is char* pointing to char array index [pos + 4];
char* pointer in the brackets is being converted to the int*, causing resulting pointer addressing integer of 4 bytes size at base location (buffer + pos + 4) and integer array index [0];
resulting int* location is being stored with bytes 78 56 34 12 in this order (little endian system).
Instead I see pointer in brackets being aligned to size of int (4 bytes), however direct addressing using constants (see second piece of code) works properly as expected.
target CPU is i.MX287 (ARM9);
target operating system is OpenWrt Linux [...] 3.18.29 #431 Fri Feb 11 15:57:31 2022 armv5tejl GNU/Linux;
compiled on Linux [...] 4.15.0-142-generic #146~16.04.1-Ubuntu SMP Tue Apr 13 09:27:15 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux, installed in Virtual machine;
GCC compiler version gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
I compile as a part of whole system image compilation, flags are CFLAGS = -Os -Wall -Wmissing-declarations -g3.
Update: thanks to Andrew Henle, I now replace
*(int*)(buffer + pos + 4) = 0x12345678;
with
buffer[pos + 4] = value & 0xff;
buffer[pos + 5] = (value >> 8) & 0xff;
buffer[pos + 6] = (value >> 16) & 0xff;
buffer[pos + 7] = (value >> 24) & 0xff;
and can't believe I must do it on 32-bit microprocessor system, whatever architecture it has, and that GCC is not able to properly slice int into bytes or partial int words and perform RMW for those parts.
char* pointer in the brackets is being converted to the int*, causing resulting pointer addressing integer of 4 bytes size at base location (buffer + pos + 4) and integer array index [0]
This incurs undefined behavior (UB) when the alignments requirements of int * are not met.
Instead copy with memcpy(). A good compiler will emit valid optimized code.
// *(int*)(buffer + pos + 4) = 0x12345678;
memcpy(buffer + pos + 4, &(int){0x12345678}, sizeof (int));

How is the ospf checksum calculated?

I'm having trouble getting an accurate checksum calculated. My router is rejecting my Hello messages because of it. Here's the package in hex (starting from the ospf header)
vv vv <-- my program's checksum
0000 02 01 00 30 c0 a8 03 0a 00 00 00 00 f3 84 00 00 ...0............
0010 00 00 00 00 00 00 00 00 ff ff ff 00 00 0a 00 01 ................
0020 00 00 00 28 c0 a8 03 0a c0 a8 03 0a ...(........
Wireshark calls 0xf384 bogus and says the expected value is 0xb382. I've confirmed that my algorythm picks the bytes in pairs and in the right order and then adds them together:
0201 + 0030 + c0a8 + ... + 030a
There is no authentication to worry about. We weren't even taught how to set it up. Lastly, here's my shot at calculating the checksum:
OspfHeader result;
result.Type = type;
result.VersionNumber = 0x02;
result.PacketLength = htons(24 + content_len);
result.AuthType = htons(auth_type);
result.AuthValue = auth_val;
result.AreaId = area_id;
result.RouterId = router_id;
uint16_t header_buffer[12];
memcpy(header_buffer, &result, 24);
uint32_t checksum;
for(int i = 0; i < 8; i++)
{
checksum += ntohs(header_buffer[i]);
checksum = (checksum & 0xFFFF) + (checksum >> 16);
}
for(int i = 0; i + 1 < content_len; i += 2)
{
checksum += (content_buffer[i] << 8) + content_buffer[i+1];
checksum = (checksum & 0xFFF) + (checksum >> 16);
}
result.Checksum = htons(checksum xor 0xFFFF);
return result;
I'm not sure where exactly I messed up here. Any clues?
Mac_3.2.57$cat ospfCheckSum2.c
#include <stdio.h>
int main(void){
// array was wrong len
unsigned short header_buffer[22] = {
0x0201, 0x0030, 0xc0a8, 0x030a,
0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000,
0xffff, 0xff00, 0x000a, 0x0001,
0x0000, 0x0028, 0xc0a8, 0x030a,
0xc0a8, 0x030a
};
// note ospf len is also wrong, BTW
// was not init'd
unsigned int checksum = 0;
// was only scanning 8 vs 22
for(int i = 0; i < 22; i++)
{
checksum += header_buffer[i];
checksum = (checksum & 0xFFFF) + (checksum >> 16);
}
// len not odd, so this is not needed
//for(int i = 0; i + 1 < content_len; i += 2)
//{
// checksum += (content_buffer[i] << 8) + content_buffer[i+1];
// checksum = (checksum & 0xFFF) + (checksum >> 16);
//}
printf("result is %04x\n", checksum ^ 0xFFFF); // no "xor" for me (why not do ~ instead?)
// also: where/what is: content_len, content_buffer, result, OspfHeader
return(0);
}
Mac_3.2.57$cc ospfCheckSum2.c
Mac_3.2.57$./a.out
result is b382
Mac_3.2.57$

Why doesn't my bit left shift?

I'm using embedded C on Keil. I'm trying to program such that it stores a bit, bit shifts and then it stores it again and it repeats until all eight bits are stored.
However, when I debug (maybe debug wrongly), the value only shows "01 00 00 00 00 00 00...". When it stores logic'1' and then when it shift left, it shows "02 00 00 00 00 00 00...". When the loop repeats, it shows the same thing over and over again. What I expected was "01 01 01 01 01 01 01..." (Let's say all the input bits was '1'). How do I solve this problem?
#include <reg51.h>
sbit Tsignal = P1^2;
unsigned char xdata x[500];
for(u=0; u<8; u++)
{
x[i] = x[i] << 1;
x[i] = Tsignal; //Store Tsignal in x
}
Ah, I have solved it already.
unsigned int u;
unsigned char p;
unsigned char xdata x[500];
for(u=0; u<8; u++) //Bit Shift Loop
{
x[i] = x[i] <<1; //Left Bit Shift by 1
p = Tsignal; //Store Tsignal to Buffer p
x[i] |= p;
} //End Bitshift loop
I think you want to do something like this:
for(u=0;u<8;u++)
{
// Update Tsignal.
//Tsignal = GetBitValue();
// Store it to x.
x = (x << 1) | (Tsignal & 0x1)
}

optimized itoa function

I am thinking on how to implement the conversion of an integer (4byte, unsigned) to string with SSE instructions. The usual routine is to divide the number and store it in a local variable, then invert the string (the inversion routine is missing in this example):
char *convert(unsigned int num, int base) {
static char buff[33];
char *ptr;
ptr = &buff[sizeof(buff) - 1];
*ptr = '\0';
do {
*--ptr="0123456789abcdef"[num%base];
num /= base;
} while(num != 0);
return ptr;
}
But inversion will take extra time. Is there any other algorithm than can be used preferably with SSE instruction to parallelize the function?
Terje Mathisen invented a very fast itoa() that does not require lookup tables. If you're not interested in the explanation of how it works, skip down to Performance or Implementation.
More than 15 years ago Terje Mathisen came up with a parallelized itoa() for base 10. The idea is to take a 32-bit value and break it into two chunks of 5 digits. (A quick Google search for "Terje Mathisen itoa" gave this post: http://computer-programming-forum.com/46-asm/7aa4b50bce8dd985.htm)
We start like so:
void itoa(char *buf, uint32_t val)
{
lo = val % 100000;
hi = val / 100000;
itoa_half(&buf[0], hi);
itoa_half(&buf[5], lo);
}
Now we can just need an algorithm that can convert any integer in the domain [0, 99999] to a string. A naive way to do that might be:
// 0 <= val <= 99999
void itoa_half(char *buf, uint32_t val)
{
// Move all but the first digit to the right of the decimal point.
float tmp = val / 10000.0;
for(size_t i = 0; i < 5; i++)
{
// Extract the next digit.
int digit = (int) tmp;
// Convert to a character.
buf[i] = '0' + (char) digit;
// Remove the lead digit and shift left 1 decimal place.
tmp = (tmp - digit) * 10.0;
}
}
Rather than use floating-point, we will use 4.28 fixed-point math because it is significantly faster in our case. That is, we fix the binary point at the 28th bit position such that 1.0 is represented as 2^28. To convert into fixed-point, we simply multiply by 2^28. We can easily round down to the nearest integer by masking with 0xf0000000, and we can extract the fractional portion by masking with 0x0fffffff.
(Note: Terje's algorithm differs slightly in the choice of fixed-point format.)
So now we have:
typedef uint32_t fix4_28;
// 0 <= val <= 99999
void itoa_half(char *buf, uint32_t val)
{
// Convert `val` to fixed-point and divide by 10000 in a single step.
// N.B. we would overflow a uint32_t if not for the parentheses.
fix4_28 tmp = val * ((1 << 28) / 10000);
for(size_t i = 0; i < 5; i++)
{
int digit = (int)(tmp >> 28);
buf[i] = '0' + (char) digit;
tmp = (tmp & 0x0fffffff) * 10;
}
}
The only problem with this code is that 2^28 / 10000 = 26843.5456, which is truncated to 26843. This causes inaccuracies for certain values. For example, itoa_half(buf, 83492) produces the string "83490". If we apply a small correction in our conversion to 4.28 fixed-point, then the algorithm works for all numbers in the domain [0, 99999]:
// 0 <= val <= 99999
void itoa_half(char *buf, uint32_t val)
{
fix4_28 const f1_10000 = (1 << 28) / 10000;
// 2^28 / 10000 is 26843.5456, but 26843.75 is sufficiently close.
fix4_28 tmp = val * ((f1_10000 + 1) - (val / 4);
for(size_t i = 0; i < 5; i++)
{
int digit = (int)(tmp >> 28);
buf[i] = '0' + (char) digit;
tmp = (tmp & 0x0fffffff) * 10;
}
}
Terje interleaves the itoa_half part for the low & high halves:
void itoa(char *buf, uint32_t val)
{
fix4_28 const f1_10000 = (1 << 28) / 10000;
fix4_28 tmplo, tmphi;
lo = val % 100000;
hi = val / 100000;
tmplo = lo * (f1_10000 + 1) - (lo / 4);
tmphi = hi * (f1_10000 + 1) - (hi / 4);
for(size_t i = 0; i < 5; i++)
{
buf[i + 0] = '0' + (char)(tmphi >> 28);
buf[i + 5] = '0' + (char)(tmplo >> 28);
tmphi = (tmphi & 0x0fffffff) * 10;
tmplo = (tmplo & 0x0fffffff) * 10;
}
}
There is an additional trick that makes the code slightly faster if the loop is fully unrolled. The multiply by 10 is implemented as either a LEA+SHL or LEA+ADD sequence. We can save 1 instruction by multiplying instead by 5, which requires only a single LEA. This has the same effect as shifting tmphi and tmplo right by 1 position each pass through the loop, but we can compensate by adjusting our shift counts and masks like this:
uint32_t mask = 0x0fffffff;
uint32_t shift = 28;
for(size_t i = 0; i < 5; i++)
{
buf[i + 0] = '0' + (char)(tmphi >> shift);
buf[i + 5] = '0' + (char)(tmplo >> shift);
tmphi = (tmphi & mask) * 5;
tmplo = (tmplo & mask) * 5;
mask >>= 1;
shift--;
}
This only helps if the loop is fully-unrolled because you can precalculate the value of shift and mask for each iteration.
Finally, this routine produces zero-padded results. You can get rid of the padding by returning a pointer to the first character that is not 0 or the last character if val == 0:
char *itoa_unpadded(char *buf, uint32_t val)
{
char *p;
itoa(buf, val);
p = buf;
// Note: will break on GCC, but you can work around it by using memcpy() to dereference p.
if (*((uint64_t *) p) == 0x3030303030303030)
p += 8;
if (*((uint32_t *) p) == 0x30303030)
p += 4;
if (*((uint16_t *) p) == 0x3030)
p += 2;
if (*((uint8_t *) p) == 0x30)
p += 1;
return min(p, &buf[15]);
}
There is one additional trick applicable to 64-bit (i.e. AMD64) code. The extra, wider registers make it efficient to accumulate each 5-digit group in a register; after the last digit has been calculated, you can smash them together with SHRD, OR them with 0x3030303030303030, and store to memory. This improves performance for me by about 12.3%.
Vectorization
We could execute the above algorithm as-is on the SSE units, but there is almost no gain in performance. However, if we split the value into smaller chunks, we can take advantage of SSE4.1 32-bit multiply instructions. I tried three different splits:
2 groups of 5 digits
3 groups of 4 digits
4 groups of 3 digits
The fastest variant was 4 groups of 3 digits. See below for the results.
Performance
I tested many variants of Terje's algorithm in addition to the algorithms suggested by vitaut and Inge Henriksen. I verified through exhaustive testing of inputs that each algorithm's output matches itoa().
My numbers are taken from a Westmere E5640 running Windows 7 64-bit. I benchmark at real-time priority and locked to core 0. I execute each algorithm 4 times to force everything into the cache. I time 2^24 calls using RDTSCP to remove the effect of any dynamic clock speed changes.
I timed 5 different patterns of inputs:
itoa(0 .. 9) -- nearly best-case performance
itoa(1000 .. 1999) -- longer output, no branch mispredicts
itoa(100000000 .. 999999999) -- longest output, no branch mispredicts
itoa(256 random values) -- varying output length
itoa(65536 random values) -- varying output length and thrashes L1/L2 caches
The data:
ALG TINY MEDIUM LARGE RND256 RND64K NOTES
NULL 7 clk 7 clk 7 clk 7 clk 7 clk Benchmark overhead baseline
TERJE_C 63 clk 62 clk 63 clk 57 clk 56 clk Best C implementation of Terje's algorithm
TERJE_ASM 48 clk 48 clk 50 clk 45 clk 44 clk Naive, hand-written AMD64 version of Terje's algorithm
TERJE_SSE 41 clk 42 clk 41 clk 34 clk 35 clk SSE intrinsic version of Terje's algorithm with 1/3/3/3 digit grouping
INGE_0 12 clk 31 clk 71 clk 72 clk 72 clk Inge's first algorithm
INGE_1 20 clk 23 clk 45 clk 69 clk 96 clk Inge's second algorithm
INGE_2 18 clk 19 clk 32 clk 29 clk 36 clk Improved version of Inge's second algorithm
VITAUT_0 9 clk 16 clk 32 clk 35 clk 35 clk vitaut's algorithm
VITAUT_1 11 clk 15 clk 33 clk 31 clk 30 clk Improved version of vitaut's algorithm
LIBC 46 clk 128 clk 329 clk 339 clk 340 clk MSVCRT12 implementation
My compiler (VS 2013 Update 4) produced surprisingly bad code; the assembly version of Terje's algorithm is just a naive translation, and it's a full 21% faster. I was also surprised at the performance of the SSE implementation, which I expected to be slower. The big surprise was how fast INGE_2, VITAUT_0, and VITAUT_1 were. Bravo to vitaut for coming up with a portable solution that bests even my best effort at the assembly level.
Note: INGE_1 is a modified version of Inge Henriksen's second algorithm because the original has a bug.
INGE_2 is based on the second algorithm that Inge Henriksen gave. Rather than storing pointers to the precalculated strings in a char*[] array, it stores the strings themselves in a char[][5] array. The other big improvement is in how it stores characters in the output buffer. It stores more characters than necessary and uses pointer arithmetic to return a pointer to the first non-zero character. The result is substantially faster -- competitive even with the SSE-optimized version of Terje's algorithm. It should be noted that the microbenchmark favors this algorithm a bit because in real-world applications the 600K data set will constantly blow the caches.
VITAUT_1 is based on vitaut's algorithm with two small changes. The first change is that it copies character pairs in the main loop, reducing the number of store instructions. Similar to INGE_2, VITAUT_1 copies both final characters and uses pointer arithmetic to return a pointer to the string.
Implementation
Here I give code for the 3 most interesting algorithms.
TERJE_ASM:
; char *itoa_terje_asm(char *buf<rcx>, uint32_t val<edx>)
;
; *** NOTE ***
; buf *must* be 8-byte aligned or this code will break!
itoa_terje_asm:
MOV EAX, 0xA7C5AC47
ADD RDX, 1
IMUL RAX, RDX
SHR RAX, 48 ; EAX = val / 100000
IMUL R11D, EAX, 100000
ADD EAX, 1
SUB EDX, R11D ; EDX = (val % 100000) + 1
IMUL RAX, 214748 ; RAX = (val / 100000) * 2^31 / 10000
IMUL RDX, 214748 ; RDX = (val % 100000) * 2^31 / 10000
; Extract buf[0] & buf[5]
MOV R8, RAX
MOV R9, RDX
LEA EAX, [RAX+RAX] ; RAX = (RAX * 2) & 0xFFFFFFFF
LEA EDX, [RDX+RDX] ; RDX = (RDX * 2) & 0xFFFFFFFF
LEA RAX, [RAX+RAX*4] ; RAX *= 5
LEA RDX, [RDX+RDX*4] ; RDX *= 5
SHR R8, 31 ; R8 = buf[0]
SHR R9, 31 ; R9 = buf[5]
; Extract buf[1] & buf[6]
MOV R10, RAX
MOV R11, RDX
LEA EAX, [RAX+RAX] ; RAX = (RAX * 2) & 0xFFFFFFFF
LEA EDX, [RDX+RDX] ; RDX = (RDX * 2) & 0xFFFFFFFF
LEA RAX, [RAX+RAX*4] ; RAX *= 5
LEA RDX, [RDX+RDX*4] ; RDX *= 5
SHR R10, 31 - 8
SHR R11, 31 - 8
AND R10D, 0x0000FF00 ; R10 = buf[1] << 8
AND R11D, 0x0000FF00 ; R11 = buf[6] << 8
OR R10D, R8D ; R10 = buf[0] | (buf[1] << 8)
OR R11D, R9D ; R11 = buf[5] | (buf[6] << 8)
; Extract buf[2] & buf[7]
MOV R8, RAX
MOV R9, RDX
LEA EAX, [RAX+RAX] ; RAX = (RAX * 2) & 0xFFFFFFFF
LEA EDX, [RDX+RDX] ; RDX = (RDX * 2) & 0xFFFFFFFF
LEA RAX, [RAX+RAX*4] ; RAX *= 5
LEA RDX, [RDX+RDX*4] ; RDX *= 5
SHR R8, 31 - 16
SHR R9, 31 - 16
AND R8D, 0x00FF0000 ; R8 = buf[2] << 16
AND R9D, 0x00FF0000 ; R9 = buf[7] << 16
OR R8D, R10D ; R8 = buf[0] | (buf[1] << 8) | (buf[2] << 16)
OR R9D, R11D ; R9 = buf[5] | (buf[6] << 8) | (buf[7] << 16)
; Extract buf[3], buf[4], buf[8], & buf[9]
MOV R10, RAX
MOV R11, RDX
LEA EAX, [RAX+RAX] ; RAX = (RAX * 2) & 0xFFFFFFFF
LEA EDX, [RDX+RDX] ; RDX = (RDX * 2) & 0xFFFFFFFF
LEA RAX, [RAX+RAX*4] ; RAX *= 5
LEA RDX, [RDX+RDX*4] ; RDX *= 5
SHR R10, 31 - 24
SHR R11, 31 - 24
AND R10D, 0xFF000000 ; R10 = buf[3] << 24
AND R11D, 0xFF000000 ; R11 = buf[7] << 24
AND RAX, 0x80000000 ; RAX = buf[4] << 31
AND RDX, 0x80000000 ; RDX = buf[9] << 31
OR R10D, R8D ; R10 = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24)
OR R11D, R9D ; R11 = buf[5] | (buf[6] << 8) | (buf[7] << 16) | (buf[8] << 24)
LEA RAX, [R10+RAX*2] ; RAX = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24) | (buf[4] << 32)
LEA RDX, [R11+RDX*2] ; RDX = buf[5] | (buf[6] << 8) | (buf[7] << 16) | (buf[8] << 24) | (buf[9] << 32)
; Compact the character strings
SHL RAX, 24 ; RAX = (buf[0] << 24) | (buf[1] << 32) | (buf[2] << 40) | (buf[3] << 48) | (buf[4] << 56)
MOV R8, 0x3030303030303030
SHRD RAX, RDX, 24 ; RAX = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24) | (buf[4] << 32) | (buf[5] << 40) | (buf[6] << 48) | (buf[7] << 56)
SHR RDX, 24 ; RDX = buf[8] | (buf[9] << 8)
; Store 12 characters. The last 2 will be null bytes.
OR R8, RAX
LEA R9, [RDX+0x3030]
MOV [RCX], R8
MOV [RCX+8], R9D
; Convert RCX into a bit pointer.
SHL RCX, 3
; Scan the first 8 bytes for a non-zero character.
OR EDX, 0x00000100
TEST RAX, RAX
LEA R10, [RCX+64]
CMOVZ RAX, RDX
CMOVZ RCX, R10
; Scan the next 4 bytes for a non-zero character.
TEST EAX, EAX
LEA R10, [RCX+32]
CMOVZ RCX, R10
SHR RAX, CL ; N.B. RAX >>= (RCX % 64); this works because buf is 8-byte aligned.
; Scan the next 2 bytes for a non-zero character.
TEST AX, AX
LEA R10, [RCX+16]
CMOVZ RCX, R10
SHR EAX, CL ; N.B. RAX >>= (RCX % 32)
; Convert back to byte pointer. N.B. this works because the AMD64 virtual address space is 48-bit.
SAR RCX, 3
; Scan the last byte for a non-zero character.
TEST AL, AL
MOV RAX, RCX
LEA R10, [RCX+1]
CMOVZ RAX, R10
RETN
INGE_2:
uint8_t len100K[100000];
char str100K[100000][5];
void itoa_inge_2_init()
{
memset(str100K, '0', sizeof(str100K));
for(uint32_t i = 0; i < 100000; i++)
{
char buf[6];
itoa(i, buf, 10);
len100K[i] = strlen(buf);
memcpy(&str100K[i][5 - len100K[i]], buf, len100K[i]);
}
}
char *itoa_inge_2(char *buf, uint32_t val)
{
char *p = &buf[10];
uint32_t prevlen;
*p = '\0';
do
{
uint32_t const old = val;
uint32_t mod;
val /= 100000;
mod = old - (val * 100000);
prevlen = len100K[mod];
p -= 5;
memcpy(p, str100K[mod], 5);
}
while(val != 0);
return &p[5 - prevlen];
}
VITAUT_1:
static uint16_t const str100p[100] = {
0x3030, 0x3130, 0x3230, 0x3330, 0x3430, 0x3530, 0x3630, 0x3730, 0x3830, 0x3930,
0x3031, 0x3131, 0x3231, 0x3331, 0x3431, 0x3531, 0x3631, 0x3731, 0x3831, 0x3931,
0x3032, 0x3132, 0x3232, 0x3332, 0x3432, 0x3532, 0x3632, 0x3732, 0x3832, 0x3932,
0x3033, 0x3133, 0x3233, 0x3333, 0x3433, 0x3533, 0x3633, 0x3733, 0x3833, 0x3933,
0x3034, 0x3134, 0x3234, 0x3334, 0x3434, 0x3534, 0x3634, 0x3734, 0x3834, 0x3934,
0x3035, 0x3135, 0x3235, 0x3335, 0x3435, 0x3535, 0x3635, 0x3735, 0x3835, 0x3935,
0x3036, 0x3136, 0x3236, 0x3336, 0x3436, 0x3536, 0x3636, 0x3736, 0x3836, 0x3936,
0x3037, 0x3137, 0x3237, 0x3337, 0x3437, 0x3537, 0x3637, 0x3737, 0x3837, 0x3937,
0x3038, 0x3138, 0x3238, 0x3338, 0x3438, 0x3538, 0x3638, 0x3738, 0x3838, 0x3938,
0x3039, 0x3139, 0x3239, 0x3339, 0x3439, 0x3539, 0x3639, 0x3739, 0x3839, 0x3939, };
char *itoa_vitaut_1(char *buf, uint32_t val)
{
char *p = &buf[10];
*p = '\0';
while(val >= 100)
{
uint32_t const old = val;
p -= 2;
val /= 100;
memcpy(p, &str100p[old - (val * 100)], sizeof(uint16_t));
}
p -= 2;
memcpy(p, &str100p[val], sizeof(uint16_t));
return &p[val < 10];
}
The first step to optimizing your code is getting rid of the arbitrary base support. This is because dividing by a constant is almost surely multiplication, but dividing by base is division, and because '0'+n is faster than "0123456789abcdef"[n] (no memory involved in the former).
If you need to go beyond that, you could make lookup tables for each byte in the base you care about (e.g. 10), then vector-add the (e.g. decimal) results for each byte. As in:
00 02 00 80 (input)
0000000000 (place3[0x00])
+0000131072 (place2[0x02])
+0000000000 (place1[0x00])
+0000000128 (place0[0x80])
==========
0000131200 (result)
This post compares several methods of integer to string conversion aka itoa. The fastest method reported there is fmt::format_int from the {fmt} library which is 5-18 times faster than sprintf/std::stringstream and ~4 times faster than a naive ltoa/itoa implementation (the actual numbers may of course vary depending on platform).
Unlike most other methods fmt::format_int does one pass over the digits. It also minimizes the number of integer divisions using the idea from Alexandrescu's talk Fastware. The implementation is available here.
This is of course if C++ is an option and you are not restricted by the itoa's API.
Disclaimer: I'm the author of this method and the fmt library.
http://sourceforge.net/projects/itoa/
Its uses a big static const array of all 4-digits integers and uses it for 32-bits or 64-bits conversion to string.
Portable, no need of a specific instruction set.
The only faster version I could find was in assembly code and limited to 32 bits.
Interesting problem. If you're interested in a 10 radix only itoa() then I have made a 10 times as fast example and a 3 times as fast example as the typical itoa() implementation.
First example (3x performance)
The first, which is 3 times as fast as itoa(), uses a single-pass non-reversal software design pattern and is based on the open source itoa() implementation found in groff.
// itoaSpeedTest.cpp : Defines the entry point for the console application.
//
#pragma comment(lib, "Winmm.lib")
#include "stdafx.h"
#include "Windows.h"
#include <iostream>
#include <time.h>
using namespace std;
#ifdef _WIN32
/** a signed 32-bit integer value type */
#define _INT32 __int32
#else
/** a signed 32-bit integer value type */
#define _INT32 long int // Guess what a 32-bit integer is
#endif
/** minimum allowed value in a signed 32-bit integer value type */
#define _INT32_MIN -2147483647
/** maximum allowed value in a signed 32-bit integer value type */
#define _INT32_MAX 2147483647
/** maximum allowed number of characters in a signed 32-bit integer value type including a '-' */
#define _INT32_MAX_LENGTH 11
#ifdef _WIN32
/** Use to init the clock */
#define TIMER_INIT LARGE_INTEGER frequency;LARGE_INTEGER t1, t2;double elapsedTime;QueryPerformanceFrequency(&frequency);
/** Use to start the performance timer */
#define TIMER_START QueryPerformanceCounter(&t1);
/** Use to stop the performance timer and output the result to the standard stream */
#define TIMER_STOP QueryPerformanceCounter(&t2);elapsedTime=(t2.QuadPart-t1.QuadPart)*1000.0/frequency.QuadPart;wcout<<elapsedTime<<L" ms."<<endl;
#else
/** Use to init the clock */
#define TIMER_INIT
/** Use to start the performance timer */
#define TIMER_START clock_t start;double diff;start=clock();
/** Use to stop the performance timer and output the result to the standard stream */
#define TIMER_STOP diff=(clock()-start)/(double)CLOCKS_PER_SEC;wcout<<fixed<<diff<<endl;
#endif
/** Array used for fast number character lookup */
const char numbersIn10Radix[10] = {'0','1','2','3','4','5','6','7','8','9'};
/** Array used for fast reverse number character lookup */
const char reverseNumbersIn10Radix[10] = {'9','8','7','6','5','4','3','2','1','0'};
const char *reverseArrayEndPtr = &reverseNumbersIn10Radix[9];
/*!
\brief Converts a 32-bit signed integer to a string
\param i [in] Integer
\par Software design pattern
Uses a single pass non-reversing algorithm and is 3x as fast as \c itoa().
\returns Integer as a string
\copyright GNU General Public License
\copyright 1989-1992 Free Software Foundation, Inc.
\date 1989-1992, 2013
\author James Clark<jjc#jclark.com>, 1989-1992
\author Inge Eivind Henriksen<inge#meronymy.com>, 2013
\note Function was originally a part of \a groff, and was refactored & optimized in 2013.
\relates itoa()
*/
const char *Int32ToStr(_INT32 i)
{
// Make room for a 32-bit signed integers digits and the '\0'
char buf[_INT32_MAX_LENGTH + 2];
char *p = buf + _INT32_MAX_LENGTH + 1;
*--p = '\0';
if (i >= 0)
{
do
{
*--p = numbersIn10Radix[i % 10];
i /= 10;
} while (i);
}
else
{
// Negative integer
do
{
*--p = reverseArrayEndPtr[i % 10];
i /= 10;
} while (i);
*--p = '-';
}
return p;
}
int _tmain(int argc, _TCHAR* argv[])
{
TIMER_INIT
// Make sure we are playing fair here
if (sizeof(int) != sizeof(_INT32))
{
cerr << "Error: integer size mismatch; test would be invalid." << endl;
return -1;
}
const int steps = 100;
{
char intBuffer[20];
cout << "itoa() took:" << endl;
TIMER_START;
for (int i = _INT32_MIN; i < i + steps ; i += steps)
itoa(i, intBuffer, 10);
TIMER_STOP;
}
{
cout << "Int32ToStr() took:" << endl;
TIMER_START;
for (int i = _INT32_MIN; i < i + steps ; i += steps)
Int32ToStr(i);
TIMER_STOP;
}
cout << "Done" << endl;
int wait;
cin >> wait;
return 0;
}
On 64-bit Windows the result from running this example is:
itoa() took:
2909.84 ms.
Int32ToStr() took:
991.726 ms.
Done
On 32-bit Windows the result from running this example is:
itoa() took:
3119.6 ms.
Int32ToStr() took:
1031.61 ms.
Done
Second example (10x performance)
If you don't mind spending some time initializing some buffers then it's possible to optimize the function above to be 10x faster than the typical itoa() implementation. What you need to do is to create string buffers rather than character buffers, like this:
// itoaSpeedTest.cpp : Defines the entry point for the console application.
//
#pragma comment(lib, "Winmm.lib")
#include "stdafx.h"
#include "Windows.h"
#include <iostream>
#include <time.h>
using namespace std;
#ifdef _WIN32
/** a signed 32-bit integer value type */
#define _INT32 __int32
/** a signed 8-bit integer value type */
#define _INT8 __int8
/** an unsigned 8-bit integer value type */
#define _UINT8 unsigned _INT8
#else
/** a signed 32-bit integer value type */
#define _INT32 long int // Guess what a 32-bit integer is
/** a signed 8-bit integer value type */
#define _INT8 char
/** an unsigned 8-bit integer value type */
#define _UINT8 unsigned _INT8
#endif
/** minimum allowed value in a signed 32-bit integer value type */
#define _INT32_MIN -2147483647
/** maximum allowed value in a signed 32-bit integer value type */
#define _INT32_MAX 2147483647
/** maximum allowed number of characters in a signed 32-bit integer value type including a '-' */
#define _INT32_MAX_LENGTH 11
#ifdef _WIN32
/** Use to init the clock */
#define TIMER_INIT LARGE_INTEGER frequency;LARGE_INTEGER t1, t2;double elapsedTime;QueryPerformanceFrequency(&frequency);
/** Use to start the performance timer */
#define TIMER_START QueryPerformanceCounter(&t1);
/** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
#define TIMER_STOP QueryPerformanceCounter(&t2);elapsedTime=(t2.QuadPart-t1.QuadPart)*1000.0/frequency.QuadPart;wcout<<elapsedTime<<L" ms."<<endl;
#else
/** Use to init the clock to get better precision that 15ms on Windows */
#define TIMER_INIT timeBeginPeriod(10);
/** Use to start the performance timer */
#define TIMER_START clock_t start;double diff;start=clock();
/** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
#define TIMER_STOP diff=(clock()-start)/(double)CLOCKS_PER_SEC;wcout<<fixed<<diff<<endl;
#endif
/* Set this as large or small as you want, but has to be in the form 10^n where n >= 1, setting it smaller will
make the buffers smaller but the performance slower. If you want to set it larger than 100000 then you
must add some more cases to the switch blocks. Try to make it smaller to see the difference in
performance. It does however seem to become slower if larger than 100000 */
static const _INT32 numElem10Radix = 100000;
/** Array used for fast lookup number character lookup */
const char *numbersIn10Radix[numElem10Radix] = {};
_UINT8 numbersIn10RadixLen[numElem10Radix] = {};
/** Array used for fast lookup number character lookup */
const char *reverseNumbersIn10Radix[numElem10Radix] = {};
_UINT8 reverseNumbersIn10RadixLen[numElem10Radix] = {};
void InitBuffers()
{
char intBuffer[20];
for (int i = 0; i < numElem10Radix; i++)
{
itoa(i, intBuffer, 10);
size_t numLen = strlen(intBuffer);
char *intStr = new char[numLen + 1];
strcpy(intStr, intBuffer);
numbersIn10Radix[i] = intStr;
numbersIn10RadixLen[i] = numLen;
reverseNumbersIn10Radix[numElem10Radix - 1 - i] = intStr;
reverseNumbersIn10RadixLen[numElem10Radix - 1 - i] = numLen;
}
}
/*!
\brief Converts a 32-bit signed integer to a string
\param i [in] Integer
\par Software design pattern
Uses a single pass non-reversing algorithm with string buffers and is 10x as fast as \c itoa().
\returns Integer as a string
\copyright GNU General Public License
\copyright 1989-1992 Free Software Foundation, Inc.
\date 1989-1992, 2013
\author James Clark<jjc#jclark.com>, 1989-1992
\author Inge Eivind Henriksen, 2013
\note This file was originally a part of \a groff, and was refactored & optimized in 2013.
\relates itoa()
*/
const char *Int32ToStr(_INT32 i)
{
/* Room for INT_DIGITS digits, - and '\0' */
char buf[_INT32_MAX_LENGTH + 2];
char *p = buf + _INT32_MAX_LENGTH + 1;
_INT32 modVal;
*--p = '\0';
if (i >= 0)
{
do
{
modVal = i % numElem10Radix;
switch(numbersIn10RadixLen[modVal])
{
case 5:
*--p = numbersIn10Radix[modVal][4];
case 4:
*--p = numbersIn10Radix[modVal][3];
case 3:
*--p = numbersIn10Radix[modVal][2];
case 2:
*--p = numbersIn10Radix[modVal][1];
default:
*--p = numbersIn10Radix[modVal][0];
}
i /= numElem10Radix;
} while (i);
}
else
{
// Negative integer
const char **reverseArray = &reverseNumbersIn10Radix[numElem10Radix - 1];
const _UINT8 *reverseArrayLen = &reverseNumbersIn10RadixLen[numElem10Radix - 1];
do
{
modVal = i % numElem10Radix;
switch(reverseArrayLen[modVal])
{
case 5:
*--p = reverseArray[modVal][4];
case 4:
*--p = reverseArray[modVal][3];
case 3:
*--p = reverseArray[modVal][2];
case 2:
*--p = reverseArray[modVal][1];
default:
*--p = reverseArray[modVal][0];
}
i /= numElem10Radix;
} while (i);
*--p = '-';
}
return p;
}
int _tmain(int argc, _TCHAR* argv[])
{
InitBuffers();
TIMER_INIT
// Make sure we are playing fair here
if (sizeof(int) != sizeof(_INT32))
{
cerr << "Error: integer size mismatch; test would be invalid." << endl;
return -1;
}
const int steps = 100;
{
char intBuffer[20];
cout << "itoa() took:" << endl;
TIMER_START;
for (int i = _INT32_MIN; i < i + steps ; i += steps)
itoa(i, intBuffer, 10);
TIMER_STOP;
}
{
cout << "Int32ToStr() took:" << endl;
TIMER_START;
for (int i = _INT32_MIN; i < i + steps ; i += steps)
Int32ToStr(i);
TIMER_STOP;
}
cout << "Done" << endl;
int wait;
cin >> wait;
return 0;
}
On 64-bit Windows the result from running this example is:
itoa() took:
2914.12 ms.
Int32ToStr() took:
306.637 ms.
Done
On 32-bit Windows the result from running this example is:
itoa() took:
3126.12 ms.
Int32ToStr() took:
299.387 ms.
Done
Why do you use reverse string lookup buffers?
It's possible to do this without the reverse string lookup buffers (thus saving 1/2 the internal memory), but this makes it significantly slower (timed at about 850 ms on 64-bit and 380 ms on 32-bit systems). It's not clear to me exactly why it's so much slower - especially on 64-bit systems, to test this further yourself you can change simply the following code:
#define _UINT32 unsigned _INT32
...
static const _UINT32 numElem10Radix = 100000;
...
void InitBuffers()
{
char intBuffer[20];
for (int i = 0; i < numElem10Radix; i++)
{
_itoa(i, intBuffer, 10);
size_t numLen = strlen(intBuffer);
char *intStr = new char[numLen + 1];
strcpy(intStr, intBuffer);
numbersIn10Radix[i] = intStr;
numbersIn10RadixLen[i] = numLen;
}
}
...
const char *Int32ToStr(_INT32 i)
{
char buf[_INT32_MAX_LENGTH + 2];
char *p = buf + _INT32_MAX_LENGTH + 1;
_UINT32 modVal;
*--p = '\0';
_UINT32 j = i;
do
{
modVal = j % numElem10Radix;
switch(numbersIn10RadixLen[modVal])
{
case 5:
*--p = numbersIn10Radix[modVal][4];
case 4:
*--p = numbersIn10Radix[modVal][3];
case 3:
*--p = numbersIn10Radix[modVal][2];
case 2:
*--p = numbersIn10Radix[modVal][1];
default:
*--p = numbersIn10Radix[modVal][0];
}
j /= numElem10Radix;
} while (j);
if (i < 0) *--p = '-';
return p;
}
That's part of my code in asm. It works only for range 255-0 It can be faster however here you can find direction and main idea.
4 imuls
1 memory read
1 memory write
You can try to reduce 2 imule's and use lea's with shifting. However you can't find anything faster in C/C++/Python ;)
void itoa_asm(unsigned char inVal, char *str)
{
__asm
{
// eax=100's -> (some_integer/100) = (some_integer*41) >> 12
movzx esi,inVal
mov eax,esi
mov ecx,41
imul eax,ecx
shr eax,12
mov edx,eax
imul edx,100
mov edi,edx
// ebx=10's -> (some_integer/10) = (some_integer*205) >> 11
mov ebx,esi
sub ebx,edx
mov ecx,205
imul ebx,ecx
shr ebx,11
mov edx,ebx
imul edx,10
// ecx = 1
mov ecx,esi
sub ecx,edx // -> sub 10's
sub ecx,edi // -> sub 100's
add al,'0'
add bl,'0'
add cl,'0'
//shl eax,
shl ebx,8
shl ecx,16
or eax,ebx
or eax,ecx
mov edi,str
mov [edi],eax
}
}
#Inge Henriksen
I believe your code has a bug:
IntToStr(2701987) == "2701987" //Correct
IntToStr(27001987) == "2701987" //Incorrect
Here's why your code is wrong:
modVal = i % numElem10Radix;
switch (reverseArrayLen[modVal])
{
case 5:
*--p = reverseArray[modVal][4];
case 4:
*--p = reverseArray[modVal][3];
case 3:
*--p = reverseArray[modVal][2];
case 2:
*--p = reverseArray[modVal][1];
default:
*--p = reverseArray[modVal][0];
}
i /= numElem10Radix;
There should be a leading 0 before "1987", which is "01987". But after the first iteration, you get 4 digits instead of 5.
So,
IntToStr(27000000) = "2700" //Incorrect
For unsigned 0 to 9,999,999 with terminating null. (99,999,999 without)
void itoa(uint64_t u, char *out) // up to 9,999,999 with terminating zero
{
*out = 0;
do {
uint64_t n0 = u;
*((uint64_t *)out) = (*((uint64_t *)out) << 8) | (n0 + '0' - (u /= 10) * 10);
} while (u);
}

while (n > 1) is 25% faster than while (n)?

I have two logically equivalent functions:
long ipow1(int base, int exp) {
// HISTORICAL NOTE:
// This wasn't here in the original question, I edited it in,
if (exp == 0) return 1;
long result = 1;
while (exp > 1) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
}
return result * base;
}
long ipow2(int base, int exp) {
long result = 1;
while (exp) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
}
return result;
}
NOTICE:
These loops are equivalent because in the former case we are returning result * base (handling the case when exp is or has been reduced to 1) but in the second case we are returning result.
Strangely enough, both with -O3 and -O0 ipow1 consequently outperforms ipow2 by about 25%. How is this possible?
I'm on Windows 7, x64, gcc 4.5.2 and compiling with gcc ipow.c -O0 -std=c99.
And this is my profiling code:
int main(int argc, char *argv[]) {
LARGE_INTEGER ticksPerSecond;
LARGE_INTEGER tick;
LARGE_INTEGER start_ticks, end_ticks, cputime;
double totaltime = 0;
int repetitions = 10000;
int rep = 0;
int nopti = 0;
for (rep = 0; rep < repetitions; rep++) {
if (!QueryPerformanceFrequency(&ticksPerSecond)) printf("\tno go QueryPerformance not present");
if (!QueryPerformanceCounter(&tick)) printf("no go counter not installed");
QueryPerformanceCounter(&start_ticks);
/* start real code */
for (int i = 0; i < 55; i++) {
for (int j = 0; j < 11; j++) {
nopti = ipow1(i, j); // or ipow2
}
}
/* end code */
QueryPerformanceCounter(&end_ticks);
cputime.QuadPart = end_ticks.QuadPart - start_ticks.QuadPart;
totaltime += (double)cputime.QuadPart / (double)ticksPerSecond.QuadPart;
}
printf("\tTotal elapsed CPU time: %.9f sec with %d repetitions - %ld:\n", totaltime, repetitions, nopti);
return 0;
}
No, really, the two ARE NOT equivalent. ipow2 returns correct results when ipow1 doesn't.
http://ideone.com/MqyqU
P.S. I don't care how many comments you leave "explaining" why they're the same, it takes only a single counter-example to disprove your claims.
P.P.S. -1 on the question for your insufferable arrogance toward everyone who already tried to point this out to you.
It's becouse with while (exp > 1) the for will run from exp to 2 (it will execute with exp = 2, decrement it to 1 and then end the loop).
With while (exp), the for will run from exp to 1 (it will execute with exp = 1, decrement it to 0 and then end the loop).
So with while (exp) you have an extra iteration, which takes the extra time to run.
EDIT: Even with the multiplication after the loop with the exp>1 while, keep in mind that the multiplication is not the only thing in the loop.
If you dont want to read all of this skip to the bottom, I come up with a 21% difference just by analysis of the code.
Different systems, versions of the compiler, same compiler version built by different folks/distros will give different instruction mixes, this is just one example of what you might get.
long ipow1(int base, int exp) {
long result = 1;
while (exp > 1) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
}
return result * base;
}
long ipow2(int base, int exp) {
long result = 1;
while (exp) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
}
return result;
}
0000000000000000 <ipow1>:
0: 83 fe 01 cmp $0x1,%esi
3: ba 01 00 00 00 mov $0x1,%edx
8: 7e 1d jle 27 <ipow1+0x27>
a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
10: 40 f6 c6 01 test $0x1,%sil
14: 74 07 je 1d <ipow1+0x1d>
16: 48 63 c7 movslq %edi,%rax
19: 48 0f af d0 imul %rax,%rdx
1d: d1 fe sar %esi
1f: 0f af ff imul %edi,%edi
22: 83 fe 01 cmp $0x1,%esi
25: 7f e9 jg 10 <ipow1+0x10>
27: 48 63 c7 movslq %edi,%rax
2a: 48 0f af c2 imul %rdx,%rax
2e: c3 retq
2f: 90 nop
0000000000000030 <ipow2>:
30: 85 f6 test %esi,%esi
32: b8 01 00 00 00 mov $0x1,%eax
37: 75 0a jne 43 <ipow2+0x13>
39: eb 19 jmp 54 <ipow2+0x24>
3b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
40: 0f af ff imul %edi,%edi
43: 40 f6 c6 01 test $0x1,%sil
47: 74 07 je 50 <ipow2+0x20>
49: 48 63 d7 movslq %edi,%rdx
4c: 48 0f af c2 imul %rdx,%rax
50: d1 fe sar %esi
52: 75 ec jne 40 <ipow2+0x10>
54: f3 c3 repz retq
Isolating the loops:
while (exp > 1) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
}
//if exp & 1 not true jump to 1d to skip
10: 40 f6 c6 01 test $0x1,%sil
14: 74 07 je 1d <ipow1+0x1d>
//result *= base
16: 48 63 c7 movslq %edi,%rax
19: 48 0f af d0 imul %rax,%rdx
//exp>>=1
1d: d1 fe sar %esi
//base *= base
1f: 0f af ff imul %edi,%edi
//while(exp>1) stayin the loop
22: 83 fe 01 cmp $0x1,%esi
25: 7f e9 jg 10 <ipow1+0x10>
Comparing something to zero normally saves you an instruction and you can see that here
while (exp) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
}
//base *= base
40: 0f af ff imul %edi,%edi
//if exp & 1 not true jump to skip
43: 40 f6 c6 01 test $0x1,%sil
47: 74 07 je 50 <ipow2+0x20>
//result *= base
49: 48 63 d7 movslq %edi,%rdx
4c: 48 0f af c2 imul %rdx,%rax
//exp>>=1
50: d1 fe sar %esi
//no need for a compare
52: 75 ec jne 40 <ipow2+0x10>
Your timing method is going to generate a lot of error/chaos. Depending on the beat frequency of the loop and the accuracy of the timer you can create a lot of gain in one and a lot of loss in another. This method normally gives better accuracy:
starttime = ...
for(rep=bignumber;rep;rep--)
{
//code under test
...
}
endtime = ...
total = endtime - starttime;
Of course if you are running this on an operating system timing it is going to have a decent amount of error in it anyway.
Also you want to use volatile variables for your timer variables, helps the compiler to not re-arrange the order of execution. (been there seen that).
If we look at this from the perspective of the base multiplies:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
unsigned int mults;
long ipow1(int base, int exp) {
long result = 1;
while (exp > 1) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
mults++;
}
result *= base;
return result;
}
long ipow2(int base, int exp) {
long result = 1;
while (exp) {
if (exp & 1) result *= base;
exp >>= 1;
base *= base;
mults++;
}
return result;
}
int main ( void )
{
int i;
int j;
mults = 0;
for (i = 0; i < 55; i++) {
for (j = 0; j < 11; j++) {
ipow1(i, j); // or ipow2
}
}
printf("mults %u\n",mults);
mults=0;
for (i = 0; i < 55; i++) {
for (j = 0; j < 11; j++) {
ipow2(i, j); // or ipow2
}
}
printf("mults %u\n",mults);
}
there are
mults 1045
mults 1595
50% more for ipow2(). Actually it is not just the multiplies it is that you are going through the loop 50% more times.
ipow1() gets a little back on the other multiplies:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
unsigned int mults;
long ipow1(int base, int exp) {
long result = 1;
while (exp > 1) {
if (exp & 1) mults++;
exp >>= 1;
base *= base;
}
mults++;
return result;
}
long ipow2(int base, int exp) {
long result = 1;
while (exp) {
if (exp & 1) mults++;
exp >>= 1;
base *= base;
}
return result;
}
int main ( void )
{
int i;
int j;
mults = 0;
for (i = 0; i < 55; i++) {
for (j = 0; j < 11; j++) {
ipow1(i, j); // or ipow2
}
}
printf("mults %u\n",mults);
mults=0;
for (i = 0; i < 55; i++) {
for (j = 0; j < 11; j++) {
ipow2(i, j); // or ipow2
}
}
printf("mults %u\n",mults);
}
ipow1() performs the result*=base a different number (more) times than ipow2()
mults 990
mults 935
being a long * int can make these more expensive. not enough to make up for the losses around the loop in ipow2().
Even without disassembling, making a rough guess on the operations/instructions you hope the compiler uses. Accounting here for processors in general not necessarily x86, some processors will run this code better than others (from a number of instructions executed perspective not counting all the other factors).
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
unsigned int ops;
long ipow1(int base, int exp) {
long result = 1;
ops++; //result = immediate
while (exp > 1) {
ops++; // compare exp - 1
ops++; // conditional jump
//if (exp & 1)
ops++; //exp&1
ops++; //conditional jump
if (exp & 1)
{
result *= base;
ops++;
}
exp >>= 1;
ops++;
//ops+=?; //using a signed number can cost you this on some systems
//always use unsigned unless you have a specific reason to use signed.
//if this had been a short or char variable it might cost you even more
//operations
//if this needs to be signed it is what it is, just be aware of
//the cost
base *= base;
ops++;
}
result *= base;
ops++;
return result;
}
long ipow2(int base, int exp) {
long result = 1;
ops++;
while (exp) {
//ops++; //cmp exp-0, often optimizes out;
ops++; //conditional jump
//if (exp & 1)
ops++;
ops++;
if (exp & 1)
{
result *= base;
ops++;
}
exp >>= 1;
ops++;
//ops+=?; //right shifting a signed number
base *= base;
ops++;
}
return result;
}
int main ( void )
{
int i;
int j;
ops = 0;
for (i = 0; i < 55; i++) {
for (j = 0; j < 11; j++) {
ipow1(i, j); // or ipow2
}
}
printf("ops %u\n",ops);
ops=0;
for (i = 0; i < 55; i++) {
for (j = 0; j < 11; j++) {
ipow2(i, j); // or ipow2
}
}
printf("ops %u\n",ops);
}
Assuming I counted all the major operations and didnt unfairly give one function more than another:
ops 7865
ops 9515
ipow2 is 21% slower using this analysis.
I think the big killer is the 50% more times through the loop. Granted it is data dependent, you might find inputs in a benchmark test that make the difference between functions greater or worse than the 25% you are seeing.
Your functions are not "logically equal".
while (exp > 1){...}
is NOT logically equal to
while (exp){...}
Why do you say it is?
Does this really generate the same assembly code? When I tried (with gcc 4.5.1 on OpenSuse 11.4, I will admit) I found slight differences.
ipow1.s:
cmpl $1, -24(%rbp)
jg .L4
movl -20(%rbp), %eax
cltq
imulq -8(%rbp), %rax
leave
ipow2.s:
cmpl $0, -24(%rbp)
jne .L4
movq -8(%rbp), %rax
leave
Perhaps the processor's branch prediction is just more effective with jg than with jne? It seems unlikely that one branch instruction would run 25% faster than another (especially when cmpl has done most of the heavy lifting)

Resources