Profiling a Set implementation on 64-bit machines (C)

Relevant Information on my system:
Core2Duo T6500
gcc (GCC) 4.4.1 20090725 (Red Hat 4.4.1-2)
Using the basic set implementation, where each stored set is really just a machine word whose bit pattern is the lexicographical index of the set, you can use standard bit operations for set operations like Union, Intersection, elementQ, etc. (sketched below).
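For concreteness, a minimal sketch of that bit-mask representation, assuming a universe of at most 64 elements packed into one unsigned long long (the macro names are illustrative only, not from any particular library):

#include <stdio.h>

typedef unsigned long long set_t;                /* one bit per possible element */

#define SET_UNION(a, b)        ((a) | (b))
#define SET_INTERSECTION(a, b) ((a) & (b))
#define SET_ELEMENTQ(s, e)     (((s) >> (e)) & 1ULL)   /* is element e in s? */

int main(void)
{
    set_t a = 0x0FULL;                           /* {0,1,2,3} */
    set_t b = 0x0CULL;                           /* {2,3}     */
    printf("%llx %llx %d\n",
           SET_UNION(a, b), SET_INTERSECTION(a, b), (int)SET_ELEMENTQ(a, 3));
    return 0;                                    /* prints: f c 1 */
}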
My question is about determining the size of the set. Implementations like Cliquer use a
static int set_bit_count[256]
to store how many bits are set in every possible 8-bit string, and then the algorithm goes through the set 8 bits at a time to determine its size.
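Roughly, that lookup-table approach looks like this (a sketch; the table would normally be filled in once at startup or generated at compile time):

static int set_bit_count[256];                   /* set_bit_count[b] = number of set bits in byte b */

static void init_set_bit_count(void)
{
    for (int b = 0; b < 256; b++) {
        int c = 0;
        for (int x = b; x; x >>= 1)
            c += x & 1;
        set_bit_count[b] = c;
    }
}

static int table_popcount(unsigned long long x)  /* walk the word 8 bits at a time */
{
    int count = 0;
    while (x) {
        count += set_bit_count[x & 0xFF];
        x >>= 8;
    }
    return count;
}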
I have two problems with that way:
If registers are more than 8x faster than cache or RAM, the table lookups would waste speed.
On a 64-bit machine, aren't int operations slower than, say, unsigned long long int operations, which I assume are the native word size on 64-bit CPUs?
But I would imagine just using a simple
while (x) {
    x &= x - 1;
    ++count;
}
could be faster, as everything can stay in registers. But on the downside, is there anything besides the obvious (roughly 8x as many operations) that would slow it down?
Also, there are so many different combinations of int, uint, unsigned long, and unsigned long long that I have no idea where to start testing.
Do you know any essays on this topic?
Do you know any other SO questions on this topic?
Do you have any insights to this?
Do you have any suggestions on how to profile this? I've never used gprof. And when I use time.h, I can't get finer than a second of granularity.
I would be very grateful if you did.

Most likely (though I'm too lazy to test right now), the fastest would be
int popcount(unsigned x) {
    int count;
#if defined(__GNUC__)
    __asm__("popcnt %1, %0" : "=r" (count) : "r" (x));
#elif defined(_MSC_VER)
    __asm {
        mov    eax, x
        popcnt eax, eax
        mov    count, eax
    };
#else
    /* blah, who cares */
    for (count = 0; x; count += x & 1, x >>= 1)
        ;
#endif
    return count;
}
(Though this will explode if the CPU doesn't support SSE4.2.) Of course, it would be better (and more portable) to use the compilers' built-in intrinsics, and in general I would trust the compiler to choose whatever implementation is best for the current target platform.
int popcount(unsigned x);

#if defined(__GNUC__)
#  define popcount __builtin_popcount
#elif defined(_MSC_VER)
#  define popcount __popcnt
#else
/* fallback implementation */
#endif

I would profile the two different implementations using a random number generator to create the bit patterns. I would loop over many iterations, accumulating something during each iteration (e.g., the exclusive-OR of the bit counts), and print the accumulator at the end of the loop. The accumulating and printing are necessary so that the compiler doesn't optimize away anything of importance; a rough harness is sketched below.
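Something along these lines (popcount_impl is a stand-in for whichever implementation is being measured; rand() may produce fewer than 32 random bits on some systems, which is fine for this purpose):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

unsigned popcount_impl(unsigned x);              /* stand-in: the implementation under test */

int main(void)
{
    unsigned acc = 0;
    srand(12345);                                /* fixed seed so runs are comparable */
    clock_t start = clock();
    for (long i = 0; i < 100000000L; i++) {
        unsigned x = (unsigned)rand();           /* rand() overhead is the same for every candidate */
        acc ^= popcount_impl(x);                 /* accumulate so the call isn't optimized away */
    }
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("acc=%u  %.3f s\n", acc, secs);       /* print so acc isn't dead code */
    return 0;
}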

Related

Should I use the stdint.h integer types on 32/64 bit machines?

One thing that bugs me about the regular C integer declarations is that their names are strange, "long long" being the worst. I am only building for 32- and 64-bit machines, so I do not necessarily need the portability that the library offers; however, I like that the name for each type is a single word of similar length with no ambiguity in size.
// multiple word types are hard to read
// long integers can be 32 or 64 bits depending on the machine
unsigned long int foo = 64;
long int bar = -64;
// easy to read
// no ambiguity
uint64_t foo = 64;
int64_t bar = -64;
On 32 and 64 bit machines:
1) Can using a smaller integer such as int16_t be slower than something higher such as int32_t?
2) If I needed a for loop to run just 10 times, is it ok to use the smallest integer that can handle it instead of the typical 32 bit integer?
for (int8_t i = 0; i < 10; i++) {
}
3) Whenever I use an integer that I know will never be negative, is it ok to prefer using the unsigned version even if I do not need the extra range it provides?
// instead of the one above
for (uint8_t i = 0; i < 10; i++) {
}
4) Is it safe to use a typedef for the types included from stdint.h?
typedef int32_t signed_32_int;
typedef uint32_t unsigned_32_int;
edit: both answers were equally good and I couldn't really lean towards one so I just picked the answerer with lower rep
1) Can using a smaller integer such as int16_t be slower than something higher such as int32_t?
Yes, it can be slower. Use int_fast16_t instead. Profile the code as needed. Performance is very implementation dependent. A prime benefit of int16_t is its small, well-defined size (it must also be two's complement), which matters in structures and arrays, not so much for speed.
The typedef name int_fastN_t designates the fastest signed integer type with a width of at least N. C11 §7.20.1.3 2
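A small illustration of that distinction (hypothetical names): int16_t where the exact size and layout matter, int_fast16_t where only speed and a minimum width matter:

#include <stdint.h>

struct sample {
    int16_t value;                               /* exactly 16 bits: good for arrays, files, wire formats */
};

int_fast16_t scale(int_fast16_t v)
{
    return (int_fast16_t)(v * 3);                /* at least 16 bits wide, whatever width is fastest */
}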
2) If I needed a for loop to run just 10 times, is it ok to use the smallest integer that can handle it instead of the typical 32 bit integer?
Yes but that savings in code and speed is questionable. Suggest int instead. Emitted code tends to be optimal in speed/size with the native int size.
3) Whenever I use an integer that I know will never be negative, is it OK to prefer using the unsigned version even if I do not need the extra range it provides?
Using some unsigned type is preferred when the math is strictly unsigned (such as array indexing with size_t), yet code needs to watch for careless application like
for (unsigned i = 10; i >= 0; i--) // infinite loop
4) Is it safe to use a typedef for the types included from stdint.h
Almost always. Types like int16_t are optional. Maximum portability uses the required types uint_least16_t and uint_fast16_t for code to run on rare platforms with bit widths like 9, 18, etc.
Can using a smaller integer such as int16_t be slower than something higher such as int32_t?
Yes. Some CPUs do not have dedicated 16-bit arithmetic instructions; arithmetic on 16-bit integers must be emulated with an instruction sequence along the lines of:
r1 = r2 + r3
r1 = r1 & 0xffff
The same principle applies to 8-bit types.
Use the "fast" integer types in <stdint.h> to avoid this -- for instance, int_fast16_t will give you an integer that is at least 16 bits wide, but may be wider if 16-bit types are nonoptimal.
If I needed a for loop to run just 10 times, is it ok to use the smallest integer that can handle it instead of the typical 32 bit integer?
Don't bother; just use int. Using a narrower type doesn't actually save any space, and may cause you issues down the line if you decide to increase the number of iterations to over 127 and forget that the loop variable is using a narrow type.
Whenever I use an integer that I know will never be negative, is it ok to prefer using the unsigned version even if I do not need the extra range it provides?
Best avoided. Certain C idioms do not work properly on unsigned integers; for instance, you cannot write a loop of the form:
for (i = 100; i >= 0; i--) { … }
if i is an unsigned type, because i >= 0 will always be true!
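If you really want an unsigned count-down loop, one common workaround is to test and decrement in the condition so that the body still sees zero:

for (unsigned i = 101; i-- > 0; ) {
    /* body sees i = 100, 99, ..., 0 and the loop still terminates */
}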
Is it safe to use a typedef for the types included from stdint.h
Safe from a technical perspective, but it'll annoy other developers who have to work with your code.
Get used to the <stdint.h> names. They're standardized and reasonably easy to type.
Absolutely possible, yes. On my laptop (Intel Haswell), in a microbenchmark that counts up and down between 0 and 65535 on two registers 2 billion times, this takes
1.313660150s - ax dx (16-bit)
1.312484805s - eax edx (32-bit)
1.312270238s - rax rdx (64-bit)
Minuscule but repeatable differences in timing. (I wrote the benchmark in assembly, because C compilers may optimize it to a different register size.)
It will work, but you'll have to keep it up to date if you change the bounds, and the C compiler will probably optimize it to the same assembly code anyway.
As long as it's correct C, that's totally fine. Keep in mind that unsigned overflow is defined and signed overflow is undefined, and compilers do take advantage of that for optimization. For example,
void foo(int start, int count) {
    for (int i = start; i < start + count; i++) {
        // With unsigned arithmetic, this will execute 0 times if
        // "start + count" overflows to a number smaller than "start".
        // With signed arithmetic, that may happen, or the compiler
        // may assume this loop always runs "count" times.
        // For defined behavior, avoid signed overflow.
    }
}
Yes. Also, <inttypes.h> (required by C99 and POSIX) extends stdint.h with some useful format macros and functions.

Create own type of variable

Is it possible to create a custom type of variable in C/C++? I want something like "super long int" that occupies, let's say, 40 bytes and allows the same operations as a usual int (+, -, /, %, <, >, etc.).
There's nothing built-in for something like that, at least not in C. You'll need to use a big-number library like GMP. It doesn't allow for using the normal set of operators, but it can handle numbers of an arbitrarily large size.
EDIT:
If you're targeting C++, GMP does have overloaded operators that will allow you to use the standard set of operators like you would with a regular int. See the manual for more details.
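A minimal GMP sketch in C (link with -lgmp); the arbitrary-precision mpz_t type comfortably covers a 40-byte (~96 decimal digit) integer:

#include <gmp.h>
#include <stdio.h>

int main(void)
{
    mpz_t a, b, sum;
    mpz_init_set_str(a, "123456789012345678901234567890"
                        "123456789012345678901234567890"
                        "123456789012345678901234567890123456", 10);
    mpz_init_set_ui(b, 42);
    mpz_init(sum);
    mpz_add(sum, a, b);                          /* sum = a + b, at arbitrary precision */
    gmp_printf("%Zd\n", sum);
    mpz_clears(a, b, sum, NULL);
    return 0;
}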
Some CPUs have support for working with very large numbers. With SSE on the x86/x86-64 architecture you can work with 128-bit values (16 bytes) and do arithmetic on them more or less normally.
With AVX this limitation extends to 256 bits (32 bytes). The upcoming AVX-512 extension is supposed to have 512 bits (64 bytes), thus enabling "super large" integers.
But there are two caveats to these extensions:
The compiler has to support it (GCC, for example, uses immintrin.h for AVX support and xmmintrin.h for SSE support). Alternatively you can try to implement the abstractions via inline assembler, but then the assembler has to understand these instructions (GCC uses AS, as far as I know).
The machine you are running the compiled code on has to support these instructions. If the CPU does not support AVX or SSE (depending on what you want to do), the application will crash on these instructions, as the CPU does not understand them.
AVX/SSE is used in the implementations of memset, memcpy, etc., since they also let you reduce the number of memory accesses by a good deal (keep in mind that, while your cache line is going to be loaded into cache only once, reading from it still takes some cycles, and AVX/SSE help you eliminate a good chunk of those costs as well).
Here is a working example (compiles with GCC 4.9.3; you have to add -mavx to your compiler options):
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t i;
    /*********************************************************************
    ** Hack-ish way to ensure that malloc's alignment does not screw with
    ** us. On this box it aligns to 0x10 bytes, but AVX needs 0x20.
    *********************************************************************/
#define AVX_BASE (0x20ULL)
    uint64_t *real_raw = malloc(128);
    uint64_t *raw = (uint64_t*)((uintptr_t)real_raw + (AVX_BASE - ((uintptr_t)real_raw % AVX_BASE)));
    __m256i value = _mm256_setzero_si256();
    for (i = 0; i < 10; i++)
    {
        /* No special function here to do the math. */
        value += i * i;
        /*************************************************************
        ** Extract the value from the register and print the lowest
        ** 64-bit lane.
        *************************************************************/
        _mm256_store_si256((__m256i*)raw, value);
        printf("%lu\n", raw[0]);
    }
    _mm256_store_si256((__m256i*)raw, value);
    printf("End: %lu\n", raw[0]);
    free(real_raw);
    return 0;
}

Multiword addition in C

I have a C program which uses GCC's __uint128_t which is great, but now my needs have grown beyond it.
What are my options for fast arithmetic with 192 or 256 bits?
The only operation I need is addition (and I don't need the carry bit, i.e., I will be working mod 2^192 or 2^256).
Speed is important, so I don't want to move to a general multi-precision if at all possible. (In fact my code does use multi-precision in some places, but this is in the critical loop and will run tens of billions of times. So far the multi-precision needs to run only tens of thousands of times.)
Maybe this is simple enough to code directly, or maybe I need to find some appropriate library.
What is your advice, Oh great Stack Overflow?
Clarification: GMP is too slow for my needs. Although I actually use multi-precision in my code, it's not in the inner loop and runs less than 10^5 times. The hot loop runs more like 10^12 times. When I changed my code (increasing a size parameter) so that the multi-precision part ran more often vs. the single-precision, I had a 100-fold slowdown (mostly due to memory management issues, I think, rather than extra µops). I'd like to get that down to a 4-fold slowdown or better.
256-bit version
__uint128_t a[2], b[2], c[2]; // c = a + b
c[0] = a[0] + b[0]; // add low part
c[1] = a[1] + b[1] + (c[0] < a[0]); // add high part and carry
Edit: 192-bit version. This way you can eliminate the 128-bit comparison, as @harold stated:
struct uint192_t {
    __uint128_t H;
    uint64_t L;
} a, b, c; // c = a + b
c.L = a.L + b.L;
c.H = a.H + b.H + (c.L < a.L);
Alternatively you can use the integer overflow builtins or checked arithmetic builtins
bool carry = __builtin_uaddl_overflow(a.L, b.L, &c.L);
c.H = a.H + b.H + carry;
Demo on Godbolt
If you do a lot of additions in a loop you should consider using SIMD and/or running them in parallel with multithreading. For SIMD you may need to change the layout of the type so that you can add all the low parts at once and all the high parts at once. One possible solution is an array-of-struct-of-arrays layout, as suggested here: practical BigNum AVX/SSE possible?
SSE2: llhhllhhllhhllhh
AVX2: llllhhhhllllhhhh
AVX512: llllllllhhhhhhhh
With AVX-512 you can add eight 64-bit values at once. So you can add eight 192-bit values in 3 instructions plus a few more for the carry. For more information read Is it possible to use SSE and SSE2 to make a 128-bit wide integer?
With AVX2 or AVX-512 you may also have a very fast horizontal add, so it may also be worth a try for 256-bit even if you don't have parallel addition chains. But for 192-bit addition, 3 add/adc instructions would be much faster.
There are also many libraries with a fixed-width integer type, for example Boost.Multiprecision:
#include <boost/multiprecision/cpp_int.hpp>
using namespace boost::multiprecision;
uint256_t myUnsignedInt256 = 1;
Some other libraries:
ttmath: ttmath::UInt<3> (an int type with 3 limbs, which is 192 bits on 64-bit computers)
uint256_t
See also
C++ 128/256-bit fixed size integer types
You could test whether the "add (low < oldlow) to simulate carry" technique from this answer is fast enough. It's slightly complicated by the fact that low is an __uint128_t here, which could hurt code generation. You might try it with 4 uint64_t's as well; I don't know whether that'll be better or worse (a sketch with four 64-bit limbs is below).
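A sketch of the four-uint64_t variant of that carry technique (made-up names; on x86-64 a good compiler may turn the compare-based carries into add/adc):

#include <stdint.h>

typedef struct { uint64_t w[4]; } u256;          /* w[0] is the least significant limb */

static u256 add256(u256 a, u256 b)               /* returns (a + b) mod 2^256 */
{
    u256 c;
    uint64_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t t = a.w[i] + carry;
        carry = (t < carry);                     /* carry out of adding the incoming carry */
        c.w[i] = t + b.w[i];
        carry += (c.w[i] < t);                   /* carry out of adding b; at most one of these fires */
    }
    return c;
}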
If that's not good enough, drop to inline assembly, and directly use the carry flag - it doesn't get any better than that, but you'd have the usual downsides of using inline assembly.

Efficiency of different integer sizes on a 64-bit CPU

In a 64-bit CPU, if the int is 32 bits whereas the long is 64 bits, would the long be more efficient than the int?
The main problem with your question is that you did not define "efficient". There are several possible efficiency related differences.
Of course if you need to use 64 bits, then there's no question. But sometimes you could use 32 bits and you wonder if it would be better to use 64 bits instead.
Data Size Efficiency
Using 32 bits will use less memory. This is more efficient, especially if you use a lot of them. Not only is it more efficient in the sense that you may avoid swapping, but also in the sense that you'll have fewer cache misses. If you use just a few, the efficiency difference is irrelevant.
Code Size Efficiency
This is heavily dependent on the architecture. Some architectures will need longer instructions to manipulate 32-bit values, others will need longer instructions to manipulate 64-bit values, and others will make no difference. On Intel processors, for example, 32 bits is the default operand size even in 64-bit code. Smaller code may have a little advantage both in cache behavior and in pipeline usage. But it is dependent on the architecture which operand size will produce smaller code.
Execution Speed Efficiency
In general there should be no difference beyond the one implied by code size. Once the instruction has been decoded, the timing for mere execution is generally identical. However, once again, this is in fact architecture specific. There are architectures that do not have native 32-bit arithmetic, for example.
My suggestion:
If it's just some local variables or data in small structures that you do not allocate in huge quantities, use int, and do it in a way that does not assume a size, so that a new version of the compiler or a different compiler that uses a different size for int will still work.
However, if you have huge arrays or matrices, use the smallest type you can and make sure its size is explicit; a sketch is below.
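For instance, a small sketch of that split (names are made up): plain int or a natural-size type for throwaway locals and loop counters, an explicit fixed-width type for the big array whose memory footprint actually matters:

#include <stddef.h>
#include <stdint.h>

static int32_t samples[1 << 20];                 /* large array: exact size matters (4 MiB here) */

int64_t sum_samples(void)
{
    int64_t sum = 0;                             /* wide accumulator to avoid overflow */
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        sum += samples[i];                       /* loop counter: natural size, no assumption baked in */
    return sum;
}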
On the common x86-64 architecture, 32-bit arithmetic is never slower than 64-bit arithmetic. So int is always the same speed or faster than long. On other architectures that don't actually have built-in 32-bit arithmetic, such as the MMIX, this might not hold.
Basic wisdom holds: Write it without considering such micro-optimizations and if necessary, profile and optimize.
If you are trying to store 64 bits of data, use a long. If you aren't going to need the 64 bits use the regular 32 bit int.
Yes, a 64-bit number would be more efficient than a 32-bit number.
On a 64-bit CPU most compilers would give you 64 bits if you ask for a long int, though.
To see the size with your current compiler:
#include <stdio.h>
int main(int argc, char **argv) {
    long int foo;
    printf("The size of a long int is: %zu bytes\n", sizeof(foo));
    printf("The size of a long int is: %zu bits\n", sizeof(foo) * 8);
    return 0;
}
If your CPU is running in 64-bit mode you can expect that it will use 64-bit registers regardless of what you ask for. All the registers are 64 bits and the operations are 64-bit, so if you want a 32-bit result it will generally convert the 64-bit result to 32 bits for you.
The limits.h on my system defines the limits of long int as:
/* Minimum and maximum values a `signed long int' can hold. */
# if __WORDSIZE == 64
# define LONG_MAX 9223372036854775807L
# else
# define LONG_MAX 2147483647L
# endif
# define LONG_MIN (-LONG_MAX - 1L)

On embedded platforms, is it more efficient to use unsigned int instead of (implicitly signed) int?

I've got into this habit of always using unsigned integers where possible in my code, because the processor can do divides by powers of two on unsigned types, which it can't with signed types. Speed is critical for this project. The processor operates at up to 40 MIPS.
My processor has an 18 cycle divide, but it takes longer than the single cycle barrel shifter. So is it worth using unsigned integers here to speed things up or do they bring other disadvantages? I'm using a dsPIC33FJ128GP802 - a member of the dsPIC33F series by Microchip. It has single cycle multiply for both signed and unsigned ints. It also has sign and zero extend instructions.
For example, it produces this code when mixing signed and unsigned integers.
026E4 97E80F mov.b [w15-24],w0
026E6 FB0000 se w0,w0
026E8 97E11F mov.b [w15-31],w2
026EA FB8102 ze w2,w2
026EC B98002 mul.ss w0,w2,w0
026EE 400600 add.w w0,w0,w12
026F0 FB8003 ze w3,w0
026F2 100770 subr.w w0,#16,w14
I'm using C (GCC for dsPIC.)
I think we all need to know a lot more about the peculiarities of your processor to answer this question. Why can't it do divides by powers of two on signed integers? As far as I remember the operation is the same for both. I.e.
10/2 = 00001010 goes to 00000101
-10/2 = 11110110 goes to 11111011
Maybe you should write some simple code doing an unsigned divide and a signed divide and compare the compiled output.
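For example, two tiny functions like these (hypothetical names) compiled with -S are usually enough to compare what the compiler emits for signed versus unsigned division by a power of two:

int sdiv8(int x)
{
    return x / 8;        /* signed: often a shift plus a sign-fixup, or a real divide */
}

unsigned int udiv8(unsigned int x)
{
    return x / 8u;       /* unsigned: typically a plain logical right shift */
}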
Also, benchmarking is a good idea. It doesn't need to be precise. Just have an array of a few thousand numbers, start a timer, divide them a few million times, and time how long it takes. Maybe do a few billion divisions if your processor is fast. E.g.
int s_numbers[] = { etc. etc. };
int s_array_size = sizeof(s_numbers) / sizeof(s_numbers[0]);
unsigned int u_numbers[] = { etc. etc. };
unsigned int u_array_size = sizeof(u_numbers) / sizeof(u_numbers[0]);
int i;
int s_result;
unsigned int u_result;
/* Start timer. */
for (i = 0; i < 100000000; i++)
{
    s_result = s_numbers[i % s_array_size] / s_numbers[(i + 1) % s_array_size];
}
/* Stop timer and print difference. */
/* Repeat for unsigned integers. */
Written in a hurry to show the principle, please forgive any errors.
It won't give precise benchmarking but should give a general idea of which is faster.
I don't know much about the instruction set available on your processor, but a quick look makes me think that it has instructions that can be used for both arithmetic and logical shifts. That should mean that shifting a signed value costs about the same as shifting an unsigned value, and dividing by powers of 2 using those shifts should also cost the same for each. (My knowledge about this is from a quick glance at some intrinsic functions for a C compiler that targets your processor family.)
That being said, if you are working with values which are to be interpreted as unsigned then you might as well declare them as unsigned. For the last few years I've been using the types from stdint.h more and more, and usually I end up using the unsigned versions because my values are either inherently unsigned or I'm just using them as bit arrays.
Generate assembly both ways and count cycles.
I'm going to guess that unsigned divides by powers of two are faster because they can simply be done with a right shift, without needing to worry about sign extension.
As for disadvantages: detecting arithmetic overflows, accidentally wrapping around because you didn't realize a value was unsigned, etc. Nothing blocking, just different things to watch out for. For example:
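A minimal sketch of one classic surprise when a value "that can never be negative" is declared unsigned:

#include <stdio.h>

int main(void)
{
    unsigned int produced = 5, consumed = 7;
    unsigned int backlog = produced - consumed;  /* wraps to a huge value, not -2 */
    if (produced - consumed > 0)                 /* true whenever produced != consumed */
        printf("backlog = %u\n", backlog);       /* prints 4294967294 with 32-bit unsigned int */
    return 0;
}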

Resources