Create own type of variable - c

Is it possible to create a custom type of variable in C/C++? I want something like a "super long int" that occupies, let's say, 40 bytes and allows the same operations as a usual int (+, -, /, %, <, >, etc.).

There's nothing built-in for something like that, at least not in C. You'll need to use a big-number library like GMP. It doesn't allow for using the normal set of operators, but it can handle numbers of an arbitrarily large size.
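For example, a minimal sketch of GMP's C interface (these are real GMP calls; link with -lgmp):
#include <gmp.h>
#include <stdio.h>

int main(void)
{
    mpz_t a, b, sum;
    mpz_init_set_str(a, "123456789012345678901234567890", 10);
    mpz_init_set_str(b, "987654321098765432109876543210", 10);
    mpz_init(sum);
    mpz_add(sum, a, b);          /* explicit function call instead of '+' */
    gmp_printf("%Zd\n", sum);
    mpz_clears(a, b, sum, NULL); /* free all three */
    return 0;
}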
EDIT:
If you're targeting C++, GMP does have overloaded operators that will allow you to use the standard set of operators like you would with a regular int. See the manual for more details.

Some CPUs have support for working with very large values. With SSE on the x86-64 architecture you get 128-bit registers (16 bytes) to compute with; with AVX this extends to 256 bits (32 bytes). The upcoming AVX-512 extension is supposed to have 512-bit registers (64 bytes), thus enabling "super large" integers. Keep in mind, though, that these are SIMD registers: the arithmetic instructions operate on packed lanes (e.g. four 64-bit values at once), not on one single wide integer, so carry propagation across lanes is up to you.
But there are two caveats to these extensions:
The compiler has to support it (GCC, for example, uses immintrin.h for AVX support and xmmintrin.h for SSE support). Alternatively you can try to implement the abstractions via inline assembler, but then the assembler has to understand these instructions (GCC uses AS, as far as I know).
The machine you are running the compiled code on has to support these instructions. If the CPU does not support AVX or SSE (depending on what you want to do), the application will crash with an illegal-instruction fault when it hits them, as the CPU does not understand them.
AVX/SSE is used in the implementations of memset, memcpy, etc., since they also let you reduce the number of memory accesses by a good deal (keep in mind that, while your cache line is loaded into cache once, each access to it still takes up some cycles, and AVX/SSE help you eliminate a good chunk of those costs as well).
Here is a working example (compiles with GCC 4.9.3; you have to add -mavx to your compiler options):
#include <immintrin.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t i;
    /*********************************************************************
    ** Hack-ish way to ensure that malloc's alignment does not screw with
    ** us. On this box it aligns to 0x10 bytes, but AVX needs 0x20.
    ** (C11's aligned_alloc(32, 128) would avoid this dance.)
    *********************************************************************/
#define AVX_BASE (0x20ULL)
    uint64_t *real_raw = malloc(128);
    uint64_t *raw = (uint64_t *)((uintptr_t)real_raw + (AVX_BASE - ((uintptr_t)real_raw % AVX_BASE)));

    __m256i value = _mm256_setzero_si256();
    for (i = 0; i < 10; i++)
    {
        /*****************************************************************
        ** No special function here to do the math: GCC's vector
        ** extensions broadcast the scalar and add it to each 64-bit lane.
        *****************************************************************/
        value += i * i;
        /*****************************************************************
        ** Extract the value from the register and print the first
        ** 64-bit lane.
        *****************************************************************/
        _mm256_store_si256((__m256i *)raw, value);
        printf("%" PRIu64 "\n", raw[0]);
    }
    _mm256_store_si256((__m256i *)raw, value);
    printf("End: %" PRIu64 "\n", raw[0]);
    free(real_raw);
    return 0;
}

Related

String length function is unstable

So I made this strlen a while ago and everything seemed fine. But I started noticing bugs in my codebase, and after a while I tracked them down to this strlen function. I used SIMD instructions to write it, and I am new to writing intrinsics, so the code probably isn't the best it could be either.
Here is the function:
inline size_t strlen(const char* data) {
    const __m256i terminationCharacters = _mm256_setzero_si256();
    const size_t shiftAmount = ((size_t)&data) & 31;
    const __m256i* pointer = (const __m256i*) (data - shiftAmount);
    size_t length = 0;

    for (;; length += 32, ++pointer) {
        const __m256i comparingData = _mm256_load_si256(pointer);
        const __m256i comparison = _mm256_cmpeq_epi8(comparingData, terminationCharacters);

        if (!_mm256_testc_si256(terminationCharacters, comparison)) {
            const auto mask = _mm256_movemask_epi8(comparison);
            return length + _tzcnt_u32(mask >> shiftAmount);
        }
    }
}
Your attempt to combine startup handling into the aligned-vector loop has at least 2 showstopper bugs:
You exit the loop if your aligned load finds any zero bytes, even if they're from before the proper start of the string. (@James Griffin spotted this in the comments.) You need to do mask >>= shiftAmount and check that for non-zero to see if there were any matches in the part of the load that comes after the start of the string. (Don't use _mm256_testc_si256, just movemask and check.)
_tzcnt_u32(mask >> shiftAmount); is buggy for any vectors after the first. The whole vector comes from bytes after the start of the string, so you need tzcnt to see all of the bits. Instead, you want _tzcnt_u32(mask) - shiftAmount, I think.
Make yourself some test cases with 0 bytes before the actual string but inside the first aligned vector, and test cases with the final 0 in different places relative to a vector boundary, and test your version against libc strlen. (Maybe even use some randomized 0-positions within the first 32 bytes, and then within the first 64 bytes after that.)
Your strategy for handling unaligned startup should work, if you separate it from the loop. (Is it safe to read past the end of a buffer within the same page on x86 and x64?).
Another option is a page-cross check before a first unaligned vector load from the actual start of the string. (But then you need a fallback to something else). Then go aligned: overlap is fine; as long as you calculate the final length correctly, it doesn't matter if you check the same byte twice for being zero.
(You also don't really want the compiler to be wasting instructions inside the loop incrementing a pointer and a separate length, so check the resulting asm. A pointer subtraction after the loop should do the trick; cast to uintptr_t if needed.
Also, you can subtract the final zero-position from the initial function arg instead of from the aligned pointer, so instead of subtracting shiftAmount twice you're just not using it at all except for the initial alignment.)
Don't use the vptest intrinsic (_mm256_testc_si256) at all, even in the main loop when you should be checking all the bytes; it's no better for _mm_cmp* results. vptest is 2 uops and can't macro-fuse with a branch instruction. But vpmovmskb eax, ymm0 is 1 uop, and test eax,eax / jz .loop is one more macro-fused uop. Even better, you actually need the integer movemask result after the loop, so you already have it.
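Putting those fixes together, here is a minimal sketch (my code, not the asker's, and only lightly checked, so do run it through the test cases above): it handles the unaligned start separately, uses movemask everywhere instead of vptest, and computes the final length from the original argument. It assumes AVX2 plus BMI1 for _tzcnt_u32, and the function name strlen_avx2 is mine:
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

size_t strlen_avx2(const char *data)
{
    const __m256i zero = _mm256_setzero_si256();
    uintptr_t addr = (uintptr_t)data;
    size_t shift = addr & 31;                 /* misalignment of the string */
    const __m256i *p = (const __m256i *)(addr - shift);

    /* Startup: one aligned load covering the string's first byte. Reading
       the bytes before the string is safe because an aligned 32-byte load
       can never cross into another page. */
    uint32_t mask = (uint32_t)_mm256_movemask_epi8(
        _mm256_cmpeq_epi8(_mm256_load_si256(p), zero));
    mask >>= shift;                           /* drop matches before the string */
    if (mask)
        return _tzcnt_u32(mask);

    /* Main loop: every byte of each vector is now part of the string. */
    for (;;) {
        ++p;
        mask = (uint32_t)_mm256_movemask_epi8(
            _mm256_cmpeq_epi8(_mm256_load_si256(p), zero));
        if (mask)   /* subtract the original arg, not the aligned pointer */
            return (size_t)((const char *)p - data) + _tzcnt_u32(mask);
    }
}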
Related
Is it safe to read past the end of a buffer within the same page on x86 and x64?
Why does glibc's strlen need to be so complicated to run quickly? (includes links to hand-written x86-64 asm for glibc's strlen implementation.) Unless you're on a platform with a worse C library, normally you should use that, because glibc uses CPU detection during dynamic linking to select a good version of strlen (and memcpy, etc.) for your CPU. Unaligned-startup for strlen is somewhat tricky, and glibc I think makes reasonable choices, unless the function-call overhead is a big problem. It also has good loop-unrolling techniques for big strings (like _mm256_min_epu8 to get a zero in a vector element if either of 2 input vectors had a zero, so it can amortize the actual movemask/branch work over a whole cache-line of data). It might be too aggressive in ramping up to that for medium-length strings though.
Note that glibc's licence is the LGPL, so you can't just copy code from glibc into your project unless your license is compatible. Even writing an intrinsics equivalent of its asm might be questionable.
Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled? - a simple SSE2 strlen that doesn't handle misalignment, in hand-written asm. And comments on benchmarking.
https://agner.org/optimize/ - guides and instruction tables, and his subroutine library (in hand-written asm) includes a strlen. (But note it's GPL licensed.)
I assume some of the BSDs and MacOS have an asm strlen under a more permissive license you could use / look at if your project isn't GPL-compatible.
No offense but
size_t strlen(char *p)
{
    size_t ret_val = 0;
    while (*p++) ret_val++;
    return ret_val;
}
has done its job very well since long ago. Also, today's optimizing compilers generate very tight code for it, and your code is nearly impossible to read.

Multiword addition in C

I have a C program which uses GCC's __uint128_t which is great, but now my needs have grown beyond it.
What are my options for fast arithmetic with 192 or 256 bits?
The only operation I need is addition (and I don't need the carry bit, i.e., I will be working mod 2^192 or 2^256).
Speed is important, so I don't want to move to a general multi-precision library if at all possible. (In fact my code does use multi-precision in some places, but this is in the critical loop and will run tens of billions of times, whereas so far the multi-precision code needs to run only tens of thousands of times.)
Maybe this is simple enough to code directly, or maybe I need to find some appropriate library.
What is your advice, Oh great Stack Overflow?
Clarification: GMP is too slow for my needs. Although I actually use multi-precision in my code it's not in the inner loop and runs less than 10^5 times. The hot loop runs more like 10^12 times. When I changed my code (increasing a size parameter) so that the multi-precision part ran more often vs. the single-precision, I had a 100-fold slowdown (mostly due to memory management issues, I think, rather than extra µops). I'd like to get that down to a 4-fold slowdown or better.
256-bit version
__uint128_t a[2], b[2], c[2]; // c = a + b
c[0] = a[0] + b[0]; // add low part
c[1] = a[1] + b[1] + (c[0] < a[0]); // add high part and carry
Edit: 192-bit version. This way you can eliminate the 128-bit comparison, as @harold stated:
struct uint192_t {
    __uint128_t H;
    uint64_t L;
} a, b, c; // c = a + b

c.L = a.L + b.L;
c.H = a.H + b.H + (c.L < a.L);
Alternatively you can use the integer overflow builtins or checked arithmetic builtins:
bool carry = __builtin_uaddl_overflow(a.L, b.L, &c.L);
c.H = a.H + b.H + carry;
Demo on Godbolt
If you do a lot of additions in a loop you should consider using SIMD and/or running them in parallel with multithreading. For SIMD you may need to change the layout of the type so that you can add all the low parts at once and all the high parts at once. One possible solution is an array-of-struct-of-arrays, as suggested here: practical BigNum AVX/SSE possible?
SSE2: llhhllhhllhhllhh
AVX2: llllhhhhllllhhhh
AVX512: llllllllhhhhhhhh
With AVX-512 you can add eight 64-bit values at once. So you can add eight 192-bit values in 3 instructions plus a few more for the carry. For more information read Is it possible to use SSE and SSE2 to make a 128-bit wide integer?
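To make that concrete, here is a hedged sketch of eight 192-bit additions at once, assuming AVX-512F and the llllllllhhhhhhhh-style layout above extended to three planes of eight 64-bit limbs (low, mid, high); the function name and exact layout are my own, not from the linked answer. Compile with -mavx512f:
#include <immintrin.h>
#include <stdint.h>

/* c = a + b (mod 2^192) for eight numbers at once.
 * Plane 0 = eight low limbs, plane 1 = eight mid limbs, plane 2 = eight high limbs. */
void add192x8(uint64_t c[3][8], const uint64_t a[3][8], const uint64_t b[3][8])
{
    __m512i one = _mm512_set1_epi64(1);
    __m512i alo = _mm512_loadu_si512(a[0]), blo = _mm512_loadu_si512(b[0]);
    __m512i amd = _mm512_loadu_si512(a[1]), bmd = _mm512_loadu_si512(b[1]);
    __m512i ahi = _mm512_loadu_si512(a[2]), bhi = _mm512_loadu_si512(b[2]);

    __m512i clo = _mm512_add_epi64(alo, blo);
    __mmask8 k0 = _mm512_cmplt_epu64_mask(clo, alo);          /* carry out of low */

    __m512i cmd = _mm512_add_epi64(amd, bmd);
    __mmask8 k1 = _mm512_cmplt_epu64_mask(cmd, amd);          /* carry out of mid add */
    __m512i cmd2 = _mm512_mask_add_epi64(cmd, k0, cmd, one);  /* apply carry-in */
    /* adding the carry-in wraps to 0 only if cmd was all-ones */
    __mmask8 k2 = k0 & _mm512_cmpeq_epi64_mask(cmd2, _mm512_setzero_si512());

    __m512i chi = _mm512_add_epi64(ahi, bhi);
    chi = _mm512_mask_add_epi64(chi, (__mmask8)(k1 | k2), chi, one); /* top carry dropped: mod 2^192 */

    _mm512_storeu_si512(c[0], clo);
    _mm512_storeu_si512(c[1], cmd2);
    _mm512_storeu_si512(c[2], chi);
}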
With AVX2 or AVX-512 you may also have a very fast horizontal add, so it may also be worth a try for 256-bit even if you don't have parallel addition chains. But for 192-bit addition, 3 add/adc instructions would be much faster.
There are also many libraries with a fixed-width integer type. For example Boost.Multiprecision
#include <boost/multiprecision/cpp_int.hpp>
using namespace boost::multiprecision;
uint256_t myUnsignedInt256 = 1;
Some other libraries:
ttmath: ttmath::UInt<3> (an int type with 3 limbs, which is 192 bits on 64-bit computers)
uint256_t
See also
C++ 128/256-bit fixed size integer types
You could test if the "add (low < oldlow) to simulate carry"-technique from this answer is fast enough. It's slightly complicated by the fact that low is an __uint128_t here, that could hurt code generation. You might try it with 4 uint64_t's as well, I don't know whether that'll be better or worse.
If that's not good enough, drop to inline assembly, and directly use the carry flag - it doesn't get any better than that, but you'd have the usual downsides of using inline assembly.
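A middle ground between the pure-C carry trick and raw inline asm: GCC, Clang, and MSVC all expose the carry flag through the _addcarry_u64 intrinsic (<immintrin.h> on GCC/Clang, <intrin.h> on MSVC). A minimal 256-bit sketch, assuming x86-64; the u256 type and limb order are my own choices:
#include <immintrin.h>
#include <stdint.h>

typedef struct { uint64_t w[4]; } u256;   /* w[0] = least significant limb */

static inline void add256(u256 *c, const u256 *a, const u256 *b)
{
    /* The casts paper over uint64_t being unsigned long on LP64 while the
     * intrinsic takes unsigned long long*; same representation either way. */
    unsigned char cf = 0;   /* carry flag chained through the limbs */
    cf = _addcarry_u64(cf, a->w[0], b->w[0], (unsigned long long *)&c->w[0]);
    cf = _addcarry_u64(cf, a->w[1], b->w[1], (unsigned long long *)&c->w[1]);
    cf = _addcarry_u64(cf, a->w[2], b->w[2], (unsigned long long *)&c->w[2]);
    (void)_addcarry_u64(cf, a->w[3], b->w[3], (unsigned long long *)&c->w[3]);
    /* final carry discarded: arithmetic is mod 2^256, as the question wants */
}
Compilers typically turn this chain into the add/adc sequence you'd write by hand, without the downsides of inline asm.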

Efficiency of different integer sizes on a 64-bit CPU

On a 64-bit CPU, if the int is 32 bits whereas the long is 64 bits, would the long be more efficient than the int?
The main problem with your question is that you did not define "efficient". There are several possible efficiency related differences.
Of course if you need to use 64 bits, then there's no question. But sometimes you could use 32 bits and you wonder if it would be better to use 64 bits instead.
Data Size Efficiency
Using 32 bits will use less memory. This is more efficient especially if you use a lot of them. Not only is it more efficient in the sense that you may avoid swapping, but you'll also have fewer cache misses. If you use just a few, the efficiency difference is irrelevant.
Code Size Efficiency
This is heavily dependent on the architecture. Some architectures need longer instructions to manipulate 32-bit values, others need longer instructions to manipulate 64-bit values, and for others it makes no difference. On Intel processors, for example, 32 bits is the default operand size even for 64-bit code. Smaller code may have a little advantage both in cache behavior and in pipeline usage. But which operand size produces smaller code depends on the architecture.
Execution Speed Efficiency
In general there should be no difference beyond the one implied by code size. Once the instruction has been decoded, the timing for mere execution is generally identical. However, once again, this is in fact architecture specific. There are architectures that do not have native 32-bit arithmetic, for example.
My suggestion:
If it's just some local variables or data in small structures that you do not allocate in huge quantities, use int and do it in a way that does not assume a size, so that a new version of the compiler or a different compiler that uses a different size for int will still work.
However if you have huge arrays or matrixes, then use the smallest type you can use and make sure its size is explicit.
On the common x86-64 architecture, 32-bit arithmetic is never slower than 64-bit arithmetic. So int is always the same speed or faster than long. On other architectures that don't actually have built-in 32-bit arithmetic, such as MMIX, this might not hold.
Basic wisdom holds: Write it without considering such micro-optimizations and if necessary, profile and optimize.
If you are trying to store 64 bits of data, use a long. If you aren't going to need the 64 bits, use the regular 32-bit int.
Yes, a 64-bit number would be more efficient than a 32-bit number.
On a 64-bit CPU, most compilers will give you 64 bits if you ask for a long int, though.
To see the size with your current compiler:
#include <stdio.h>
int main(int argc, char **argv)
{
    long int foo;
    printf("The size of a long int is: %zu bytes\n", sizeof(foo));
    printf("The size of a long int is: %zu bits\n", sizeof(foo) * 8);
    return 0;
}
If your CPU is running in 64-bit mode, you can expect it to use 64-bit operations regardless of what you ask for. All the registers are 64-bit and the operations are 64-bit, so if you want a 32-bit result it will generally truncate the 64-bit result for you.
The limits.h on my system defines long int as:
/* Minimum and maximum values a `signed long int' can hold. */
# if __WORDSIZE == 64
# define LONG_MAX 9223372036854775807L
# else
# define LONG_MAX 2147483647L
# endif
# define LONG_MIN (-LONG_MAX - 1L)

Profiling a Set implementation on 64-bit machines

Relevant Information on my system:
Core2Duo T6500
gcc (GCC) 4.4.1 20090725 (Red Hat 4.4.1-2)
Using the basic set implementation, where each stored set is really just a bit string whose bits mark membership, you can use standard bit operations for set operations like Union, Intersection, elementQ, etc.
My question is about determining the size of the set. Implementations like Cliquer use a
static int set_bit_count[256]
to store how many bits are set in every possible 8-bit string, and then the algorithm goes through the word 8 bits at a time to determine the set's size.
I have two problems with that way:
If registers are more than 8x faster than cache or RAM, this would waste speed.
On a 64-bit machine, aren't int operations slower than, say, unsigned long long int operations, which I assume are the native integer size on 64-bit CPUs?
But I would imagine just using a simple
while (x) {
    x &= x - 1;  /* clear the lowest set bit */
    ++count;
}
could be faster, as everything can be stored in registers. But on the downside, could it cost something beyond the obvious 8x as many operations?
Also, there are so many different combinations of int, uint, unsigned long, unsigned long long that I have no idea where to start testing.
Do you know any essays on this topic?
Do you know any other SO questions on this topic?
Do you have any insights to this?
Do you have any suggestions on how to profile this? I've never used gprof. And when I use time.h, I can't get finer than a second of granularity.
I would be very grateful if you did.
Most likely (though I'm too lazy to test right now), the fastest would be
int popcount(unsigned x) {
    int count;
#if defined(__GNUC__)
    __asm__("popcnt %1, %0" : "=r" (count) : "r" (x));
#elif defined(_MSC_VER)
    /* 32-bit MSVC inline asm; popcnt needs a register destination */
    __asm {
        mov eax, x
        popcnt eax, eax
        mov count, eax
    };
#else
    /* blah, who cares */
    for (count = 0; x; count += x & 1, x >>= 1);
#endif
    return count;
}
(Though this will explode if the CPU doesn't support SSE4.2.) Of course, it would be better (and more portable) to use the compilers' built-in intrinsics, and in general I would trust the compiler to choose whatever implementation is best for the current target platform.
int popcount(unsigned x);

#if defined(__GNUC__)
# define popcount __builtin_popcount
#elif defined(_MSC_VER)
# include <intrin.h>
# define popcount __popcnt
#else
/* fallback implementation */
static int popcount(unsigned x)
{
    int count = 0;
    for (; x; x &= x - 1)  /* clear the lowest set bit per iteration */
        count++;
    return count;
}
#endif
I would profile the two different implementations using a random number generator to create the bit patterns. I would loop over many iterations, accumulating something during each iteration (e.g., the exclusive-OR of the bit counts), which I'd print out at the end of the loop. The accumulating and printing are necessary so that the compiler doesn't optimize away anything of importance.
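A minimal sketch of that harness, assuming POSIX clock_gettime; the buffer size, seed, and iteration count are arbitrary choices of mine. Pre-generating the random values keeps rand() out of the timed loop:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int popcount(unsigned x);   /* the implementation under test */

int main(void)
{
    enum { BUF_LEN = 1 << 20, ITERS = 100 };
    static unsigned buf[BUF_LEN];
    unsigned acc = 0;

    srand(12345);            /* fixed seed: reproducible input */
    for (size_t i = 0; i < BUF_LEN; i++)
        buf[i] = (unsigned)rand() ^ ((unsigned)rand() << 16);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int it = 0; it < ITERS; it++)
        for (size_t i = 0; i < BUF_LEN; i++)
            acc ^= (unsigned)popcount(buf[i]);   /* accumulate so nothing is optimized away */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("acc=%u  %.3f s  %.2f ns/call\n",
           acc, secs, secs * 1e9 / ((double)BUF_LEN * ITERS));
    return 0;
}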

Is there memset() that accepts integers larger than char?

Is there a version of memset() which sets a value that is larger than 1 byte (char)? For example, let's say we have a memset32() function, so using it we can do the following:
int32_t array[10];
memset32(array, 0xDEADBEEF, sizeof(array));
This will set the value 0xDEADBEEF in all the elements of array. Currently it seems to me this can only be done with a loop.
Specifically, I am interested in a 64 bit version of memset(). Know anything like that?
#include <stdint.h>
#include <string.h>

void memset64(void *dest, uint64_t value, uintptr_t size)
{
    uintptr_t i;
    /* Bulk fill, 8 bytes at a time. */
    for (i = 0; i < (size & (~7)); i += 8)
    {
        memcpy((char *)dest + i, &value, 8);
    }
    /* Tail: sizes that aren't a multiple of 8 bytes. */
    for (; i < size; i++)
    {
        ((char *)dest)[i] = ((char *)&value)[i & 7];
    }
}
(Explanation, as requested in the comments: when you assign to a pointer, the compiler assumes that the pointer is aligned to the type's natural alignment; for uint64_t, that is 8 bytes. memcpy() makes no such assumption. On some hardware unaligned accesses are impossible, so assignment is not a suitable solution unless you know unaligned accesses work on the hardware with small or no penalty, or know that they will never occur, or both. The compiler will replace small memcpy()s and memset()s with more suitable code, so it is not as horrible as it looks; but if you do know enough to guarantee that assignment will always work and your profiler tells you it is faster, you can replace the memcpy with an assignment. The second for() loop is present in case the amount of memory to be filled is not a multiple of 64 bits. If you know it always will be, you can simply drop that loop.)
There's no standard library function afaik. So if you're writing portable code, you're looking at a loop.
If you're writing non-portable code then check your compiler/platform documentation, but don't hold your breath because it's rare to get much help here. Maybe someone else will chip in with examples of platforms which do provide something.
The way you'd write your own depends on whether you can define in the API that the caller guarantees the dst pointer will be sufficiently aligned for 64-bit writes on your platform (or platforms if portable). On any platform that has a 64-bit integer type at all, malloc at least will return suitably-aligned pointers.
If you have to cope with non-alignment, then you need something like moonshadow's answer. The compiler may inline/unroll that memcpy with a size of 8 (and use 32- or 64-bit unaligned write ops if they exist), so the code should be pretty nippy, but my guess is it probably won't special-case the whole function for the destination being aligned. I'd love to be corrected, but fear I won't be.
So if you know that the caller will always give you a dst with sufficient alignment for your architecture, and a length which is a multiple of 8 bytes, then do a simple loop writing a uint64_t (or whatever the 64-bit int is in your compiler) and you'll probably (no promises) end up with faster code. You'll certainly have shorter code.
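For illustration, a minimal sketch under exactly those assumptions (the caller promises 8-byte alignment and a size that's a multiple of 8; the function name is mine):
#include <stdint.h>
#include <stddef.h>

void memset64_aligned(void *dest, uint64_t value, size_t size)
{
    uint64_t *d = dest;              /* alignment promised by the caller */
    size_t n = size / 8;             /* size is a multiple of 8 bytes */
    for (size_t i = 0; i < n; i++)
        d[i] = value;
}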
Whatever the case, if you do care about performance then profile it. If it's not fast enough try again with more optimisation. If it's still not fast enough, ask a question about an asm version for the CPU(s) on which it's not fast enough. memcpy/memset can get massive performance increases from per-platform optimisation.
Just for the record, the following fills an array using memcpy(..) in a doubling pattern. Suppose we want to fill an array with 20 integers:
--------------------
First copy one:
N-------------------
Then copy it to the neighbour:
NN------------------
Then copy them to make four:
NNNN----------------
And so on:
NNNNNNNN------------
NNNNNNNNNNNNNNNN----
Then copy enough to fill the array:
NNNNNNNNNNNNNNNNNNNN
This takes O(lg(num)) applications of memcpy(..).
int *memset_int(int *ptr, int value, size_t num) {
    if (num < 1) return ptr;
    memcpy(ptr, &value, sizeof(int));
    size_t start = 1, step = 1;
    for ( ; start + step <= num; start += step, step *= 2)
        memcpy(ptr + start, ptr, sizeof(int) * step);
    if (start < num)
        memcpy(ptr + start, ptr, sizeof(int) * (num - start));
    return ptr;
}
I thought it might be faster than a loop if memcpy(..) was optimised using some hardware block memory copy functionality, but it turns out that a simple loop is faster than the above with -O2 and -O3. (At least using MinGW GCC on Windows with my particular hardware.) Without the -O switch, on a 400 MB array the code above is about twice as fast as an equivalent loop, and takes 417 ms on my machine, while with optimisation they both go to about 300 ms. Which means that it takes approximately the same number of nanoseconds as bytes, and a clock cycle is about a nanosecond. So either there is no hardware block memory copy functionality on my machine, or the memcpy(..) implementation does not take advantage of it.
Check your OS documentation for a local version, then consider just using the loop.
The compiler probably knows more about optimizing memory access on any particular architecture than you do, so let it do the work.
Wrap it up as a library and compile it with all the speed improving optimizations the compiler allows.
wmemset(3) is the wide (wchar_t, which is 16-bit on Windows but 32-bit on Linux) version of memset. I think that's the closest you're going to get in C, without a loop.
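A minimal usage sketch (standard C; note that wmemset counts wchar_t elements, not bytes):
#include <wchar.h>
#include <stdio.h>

int main(void)
{
    wchar_t buf[8];
    wmemset(buf, L'A', 8);        /* fills 8 wchar_t elements */
    wprintf(L"%.8ls\n", buf);     /* precision caps output; buf isn't terminated */
    return 0;
}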
If you're just targeting an x86 compiler you could try something like (VC++ example):
inline void memset32(void *buf, uint32_t n, int32_t c)
{
__asm {
mov ecx, n
mov eax, c
mov edi, buf
rep stosd
}
}
Otherwise just make a simple loop and trust the optimizer to know what it's doing, something like:
for (uint32_t i = 0; i < n; i++)
{
    ((int32_t *)buf)[i] = c;
}
If you make it complicated chances are it will end up slower than simpler to optimize code, not to mention harder to maintain.
You should really let the compiler optimize this for you as someone else suggested. In most cases that loop will be negligible.
But if this some special situation and you don't mind being platform specific, and really need to get rid of the loop, you can do this in an assembly block.
//pseudo code
asm
{
rep stosq ...
}
You can google the stosq assembly instruction for the specifics. It shouldn't be more than a few lines of code.
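For reference, here is a hedged sketch of that idea using GCC extended asm instead of an MSVC-style block (x86-64 GCC/Clang only; the constraints pin the operands to the registers rep stosq expects, and the function name is mine):
#include <stdint.h>
#include <stddef.h>

/* Fills 'count' 64-bit words at dest with value. */
static void memset64_stosq(void *dest, uint64_t value, size_t count)
{
    __asm__ volatile ("rep stosq"
                      : "+D" (dest), "+c" (count)   /* rdi = dest, rcx = count */
                      : "a" (value)                 /* rax = fill value */
                      : "memory");
}
Whether this beats a plain loop (which the compiler may vectorize on its own) is exactly the kind of thing to profile, per the answers above.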
write your own; it's trivial even in asm.
