What is the difference between loadu/lddqu and assignment operator?

What is the difference between loadu/lddqu and assignment operator? - c

I am using SIMD vector to do some computations, and curious the difference of them, as followings.
__m128i vector2 = vector1;
__m128i vector2 = _mm_loadu_si128(&vector1);
So, what's the difference of above two statements?

Like Peter Cordes said in his comment, if vector1 really is a __m128i it's just unnecessarily complicated.
However, that's not the use case for _mm_loadu_si128. While the type of the argument is __m128i const*, that's really more about a lack of good options and poor decisions.
_mm_loadu_si128 is really meant to load any 16 bytes of data into a vector register. If you want to load data which is already aligned to 16 byte boundaries you should use _mm_load_si128 instead. If your data isn't aligned to 16 byte boundaries it's not really a __m128i, so the type of the parameter is misleading at best.
The reason Intel (or whoever) chose to use __m128i const* isn't completely clear, but to be honest there aren't a lot of good options. int const* wouldn't really make sense because what we're trying to load isn't always 32-bit signed integers. They could have made an _mm_loadu_epi8, _mm_loadu_epi16, _mm_loadu_epi32, etc., but even that wouldn't be quite right because the data doesn't need to be aligned to _Alignof(int), _Alignof(short), etc., though char actually work well here.
The right choice would probably be to make the argument void*, but I guess Intel wanted to signal that they really wanted 16 bytes of data. char mem_addr[16] would be okay in C99+, but not in C++, and while SSE2 came out at around the same time as C99 many compilers didn't support C99 (MSVC still doesn't!).
Basically, for this function, ignore the types of the parameters and read the description.

Related

Comparison uint8_t vs uint16_t while declaring a counter

Assuming to have a counter which counts from 0 to 100, is there an advantage of declaring the counter variable as uint16_t instead of uint8_t.
Obviously if I use uint8_t I could save some space. On a processor with natural wordsize of 16 bits access times would be the same for both I guess. I couldn't think why I would use a uint16_t if uint8_t can cover the range.

Using a wider type than necessary can allow the compiler to avoid having to mask the higher bits.
Suppose you were working on a 16 bit architecture, then using uint16_t could be more efficient, however if you used uint16_t instead of uint8_t on a 32 bit architecture then you would still have the mask instructions but just masking a different number of bits.
The most efficient type to use in a cross-platform portable way is just plain int or unsigned int, which will always be the correct type to avoid the need for masking instructions, and will always be able to hold numbers up to 100.
If you are in a MISRA or similar regulated environment that forbids the use of native types, then the correct standard-compliant type to use is uint_fast8_t. This guarantees to be the fastest unsigned integer type that has at least 8 bits.
However, all of this is nonsense really. Your primary goal in writing code should be to make it readable, not to make it as fast as possible. Penny-pinching instructions like this makes code convoluted and more likely to have bugs. Also because it is harder to read, the bugs are less likely to be found during code review.
You should only try to optimize like this once the code is finished and you have tested it and found the particular part which is the bottleneck. Masking a loop counter is very unlikely to be the bottleneck in any real code.

Obviously if I use uint8_t I could save some space.
Actually, that's not necessarily obvious! A loop index variable is likely to end up in a register, and if it does there's no memory to be saved. Also, since the definition of the C language says that much arithmetic takes place using type int, it's possible that using a variable smaller than int might actually end up costing you space in terms of extra code emitted by the compiler to convert back and forth between int and your smaller variable. So while it could save you some space, it's not at all guaranteed that it will — and, in any case, the actual savings are going to be almost imperceptibly small in the grand scheme of things.
If you have an array of some number of integers in the range 0-100, using uint8_t is a fine idea if you want to save space. For an individual variable, on the other hand, the arguments are pretty different.
In general, I'd say that there are two reasons not to use type uint8_t (or, equivalently, char or unsigned char) as a loop index:
It's not going to save much data space (if at all), and it might cost code size and/or speed.
If the loop runs over exactly 256 elements (yours didn't, but I'm speaking more generally here), you may have introduced a bug (which you'll discover soon enough): your loop may run forever.
The interviewer was probably expecting #1 as an answer. It's not a guaranteed answer — under plenty of circumstances, using the smaller type won't cost you anything, and evidently there are microprocessors where it can actually save something — but as a general rule, I agree that using an 8-bit type as a loop index is, well, silly. And whether or not you agree, it's certainly an issue to be aware of, so I think it's a fair interview question.
See also this question, which discusses the same sorts of issues.

The interview question doesn't make much sense from a platform-generic point of view. If we look at code such as this:
for(uint8_t i=0; i<n; i++)
array[i] = x;
Then the expression i<n will get carried out on type int or larger because of implicit promotion. Though the compiler may optimize it to use a smaller type if it doesn't affect the result.
As for array[i], the compiler is likely to use a type corresponding to whatever address size the system is using.
What the interviewer was fishing for is likely that uint32_t on a 32 bitter tend to generate faster code in some situations. For those cases you can use uint_fast8_t, but more likely the compiler will perform optimizations no matter.
The only optimization uint8_t blocks the compiler from doing, is to allocate a larger variable than 8 bits on the stack. It doesn't however block the compiler from optimizing out the variable entirely and using a register instead. Such as for example storing it in an index register with the same width as the address bus.
Example with gcc x86_64: https://godbolt.org/z/vYscf3KW9. The disassembly is pretty painful to read, but the compiler just picked CPU registers to store anything regardless of the type of i, giving identical machine code between uint8_t anduint16_t. I would have been surprised if it didn't.
On a processor with natural wordsize of 16 bits access times would be the same for both I guess.
Yes this is true for all mainstream 16 bitters. Some might even manage faster code if given 8 bits instead of 16. Some exotic systems like DSP exist, but in case of lets say a 1 byte=16 bits DSP, then the compiler doesn't even provide you with uint8_t to begin with - it is an optional type. One generally doesn't bother with portability to wildly exotic systems, since doing so is a waste of everyone's time and money.
The correct answer: it is senseless to do manual optimization without a specific system in mind. uint8_t is perfectly fine to use for generic, portable code.

How to extract bytes from an SSE2 __m128i structure?

I'm a beginner with SIMD intrinsics, so I'll thank everyone for their patience in advance. I have an application involving absolute difference comparison of unsigned bytes (I'm working with greyscale images).
I tried AVX, more modern SSE versions etc, but eventually decided SSE2 seems sufficient and has the most support for individual bytes - please correct me if I'm wrong.
I have two questions: first, what's the right way to load 128-bit registers? I think I'm supposed to pass the load intrinsics data aligned to multiples of 128, but will that work with 2D array code like this:
greys = aligned_alloc(16, xres * sizeof(int8_t*));
for (uint32_t x = 0; x < xres; x++)
{
greys[x] = aligned_alloc(16, yres * sizeof(int8_t*));
}
(The code above assumes xres and yres are the same, and are powers of two). Does this turn into a linear, unbroken block in memory? Could I then, as I loop, just keep passing addresses (incrementing them by 128) to the SSE2 load intrinsics? Or does something different need to be done for 2D arrays like this one?
My second question: once I've done all my vector processing, how the heck do I extract the modified bytes from the __m128i ? Looking through the Intel Intrinsics Guide, instructions that convert a vector type to a scalar one are rare. The closest I've found is int _mm_movemask_epi8 (__m128i a) but I don't quite understand how to use it.
Oh, and one third question - I assumed _mm_load_si128 only loads signed bytes? And I couldn't find any other byte loading function, so I guess you're just supposed to subtract 128 from each and account for it later?
I know these are basic questions for SIMD experts, but I hope this one will be useful to beginners like me. And if you think my whole approach to the application is wrong, or I'd be better off with more modern SIMD extensions, I'd love to know. I'd just like to humbly warn I've never worked with assembly and all this bit-twiddling stuff requires a lot of explication if it's to help me.
Nevertheless, I'm grateful for any clarification available.
In case it makes a difference: I'm targeting a low-power i7 Skylake architecture. But it'd be nice to have the application run on much older machines too (hence SSE2).

Least obvious question first:
once I've done all my vector processing, how the heck do I extract the modified bytes from the __m128i
Extract the low 64 bits to an integer with int64_t _mm_cvtsi128_si64x(__m128i), or the low 32 bits with int _mm_cvtsi128_si32 (__m128i a).
If you want other parts of the vector, not the low element, your options are:
Shuffle the vector to create a new __m128i with the data you want in the low element, and use the cvt intrinsics (MOVD or MOVQ in asm).
Use SSE2 int _mm_extract_epi16 (__m128i a, int imm8), or the SSE4.1 similar instructions for other element sizes such as _mm_extract_epi64(v, 1); (PEXTRB/W/D/Q) are not the fastest instructions, but if you only need one high element, they're about equivalent to a separate shuffle and MOVD, but smaller machine code.
_mm_store_si128 to an aligned temporary array and access the members: compilers will often optimize this into just a shuffle or pextr* instruction if you compile with -msse4.1 or -march=haswell or whatever. print a __m128i variable shows an example, including Godbolt compiler output showing _mm_store_si128 into an alignas(16) uint64_t tmp[2]
Or use union { __m128i v; int64_t i64[2]; } or something. Union-based type punning is legal in C99, but only as an extension in C++. This is compiles the same as a tmp array, and is generally not easier to read.
An alternative to the union that would also work in C++ would be memcpy(&my_int64_local, 8 + (char*)my_vector, 8); to extract the high half, but that seems more complicated and less clear, and more likely to be something a compiler wouldn't "see through". Compilers are usually pretty good about optimizing away small fixed-size memcpy when it's an entire variable, but this is just half of the variable.
If the whole high half of a vector can go directly into memory unmodified (instead of being needed in an integer register), a smart compiler might optimize to use MOVHPS to store the high half of a __m128i with the above union stuff.
Or you can use _mm_storeh_pi((__m64*)dst, _mm_castsi128_ps(vec)). That only requires SSE1, and is more efficient than SSE4.1 pextrq on most CPUs. But don't do this for a scalar integer you're about to use again right away; if SSE4.1 isn't available it's likely the compiler will actually MOVHPS and integer reload, which usually isn't optimal. (And some compilers like MSVC don't optimize intrinsics.)
Does this turn into a linear, unbroken block in memory?
No, it's an array of pointers to separate blocks of memory, introducing an extra level of indirection vs. a proper 2D array. Don't do that.
Make one large allocation, and do the index calculation yourself (using array[x*yres + y]).
And yes, load data from it with _mm_load_si128, or loadu if you need to load from an offset.
assumed _mm_load_si128 only loads signed bytes
Signed or unsigned isn't an inherent property of a byte, it's only how you interpret the bits. You use the same load intrinsic for loading two 64-bit elements, or a 128-bit bitmap.
Use intrinsics that are appropriate for your data. It's a little bit like assembly language: everything is just bytes, and the machine will do what you tell it with your bytes. It's up to you to choose a sequence of instructions / intrinsics that produces meaningful results.
The integer load intrinsics take __m128i* pointer args, so you have to use _mm_load_si128( (const __m128i*) my_int_pointer ) or similar. This looks like pointer aliasing (e.g. reading an array of int through a short *), which is Undefined Behaviour in C and C++. However, this is how Intel says you're supposed to do it, so any compiler that implements Intel's intrinsics is required to make this work correctly. gcc does so by defining __m128i with __attribute__((may_alias)).
See also Loading data for GCC's vector extensions which points out that you can use Intel intrinsics for GNU C native vector extensions, and shows how to load/store.
To learn more about SIMD with SSE, there are some links in the sse tag wiki, including some intro / tutorial links.
The x86 tag wiki has some good x86 asm / performance links.

Use int or char in arrays?

Suppose I have an array in C with 5 elements which hold integers in the range [0..255], would it be generally better to useunsigned char, unsigned int or int regarding performance? Because a char would be only one byte, but an int is easier to handle for the processor, as far as I know. Or does it mostly depend on how the elements are accessed?
EDIT: Measuring is quite difficult, because the code belongs to a library, and the array is accessed external.
Also, I encounter this problem not only in this very case, so I'm asking for a more general answer

While the answer really depends on the CPU and how it handles loading storing small integers, you can assume that the byte array will be faster on most modern systems:
A char only takes 1/4 of the space that an int takes (on most systems), which means that working on a char array takes only a quarter of the memory bandwidth. And most codes are memory bound on modern hardware.

This one is quite impossible to answer, because it depends on the code, compiler, and processor.
However, one suggestion is to use uint8_t insted of unsigned char. (The code will be the same, but the explicit version conveys the meaning much better.)
Then there is one more thing to consider. You might be best off by packing four 8-bit integers into one 32-bit integer. Most arithmetic and bitwise logical operations work fine as long as there are no overflows (division being the notable exception).

The golden rule: Measure it. If you can't be bothered measuring it, then it isn't worth optimising. So measure it one way, change it, measure it the other way, take whatever is faster. Be aware that when you switch to a different compiler, or a different processor, like a processor that is introduced in 2015 or 2016, the other code might now be faster.
Alternatively, don't measure it, but write the most readable and maintainable code.
And consider using a single 64 bit integer and shift operations :-)

What is the point behind unions in C?

I'm going through O'Reilly's Practical C Programming book, and having read the K&R book on the C programming language, and I am really having trouble grasping the concept behind unions.
They take the size of the largest data type that makes them up...and the most recently assigned one overwrites the rest...but why not just use / free memory as needed?
The book mentions that it's used in communication, where you need to set flags of the same size; and on a googled website, that it can eliminate odd-sized memory chunks...but is it of any use in a modern, non-embedded memory space?
Is there something crafty you can do with it and CPU registers? Is it simply a hold over from an earlier era of programming? Or does it, like the infamous goto, still have some powerful use (possibly in tight memory spaces) that makes it worth keeping around?

Well, you almost answered your question: Memory.
Back in the days memory was rather low, and even saving a few kbytes has been useful.
But even today there are scenarios where unions would be useful. For example, if you'd like to implement some kind of variant datatype. The best way to do this is using a union.
This doesn't sound like much, but let's just assume you want to use a variable either storing a 4 character string (like an ID) or a 4 byte number (which could be some hash or indeed just a number).
If you use a classic struct, this would be 8 bytes long (at least, if you're unlucky there are filling bytes as well). Using an union it's only 4 bytes. So you're saving 50% memory, which isn't a lot for one instance, but imagine having a million of these.
While you can achieve similar things by casting or subclassing a union is still the easiest way to do this.

One use of unions is having two variables occupy the same space, and a second variable in the struct decide what data type you want to read it as.
e.g. you could have a boolean 'isDouble', and a union 'doubleOrLong' which has both a double and a long. If isDouble == true interpret the union as a double else interpret it as a long.
Another use of unions is accessing data types in different representations. For instance, if you know how a double is laid out in memory, you could put a double in a union, access it as a different data type like a long, directly access its bits, its mantissa, its sign, its exponent, whatever, and do some direct manipulation with it.
You don't really need this nowadays since memory is so cheap, but in embedded systems it has its uses.

The Windows API makes use of unions quite a lot. LARGE_INTEGER is an example of such a usage. Basically, if the compiler supports 64-bit integers, use the QuadPart member; otherwise, set the low DWORD and the high DWORD manually.

It's not really a hold over, as the C language was created in 1972, when memory was a real concern.
You could make the argument that in modern, non-embedded space, you might not want to use C as a programming language to begin with. If you've chosen C as your language choice for implementation, you're looking to harness the benefits of C: it's efficient, close-to-metal, which results in tight, fast binaries.
As such, when choosing to use C, you'd still want to take advantage of it's benefits, which includes memory-space efficiency. To which, the Union works very well; allowing you to have some degree of type safety, while enforcing the smallest memory foot print available.

One place where I have seen it used is in the Doom 3/idTech 4 Fast Inverse Square Root implementation.
For those unfamiliar with this algorithm, it essentially requires treating a floating point number as an integer. The old Quake (and earlier) version of the code does this by the following:
float y = 2.0f;
// treat the bits of y as an integer
long i = * ( long * ) &y;
// do some stuff with i
// treat the bits of i as a float
y = * ( float * ) &i;
original source on GitHub
This code takes the address of a floating point number y, casts it to a pointer to a long (ie, a 32 bit integer in Quake days), and derefences it into i. Then it does some incredibly bizarre bit-twiddling stuff, and the reverse.
There are two disadvantages of doing it this way. One is that the convoluted address-of, cast, dereference process forces the value of y to be read from memory, rather than from a register1, and ditto on the way back. On Quake-era computers, however, floating point and integer registers were completely separate so you pretty much had to push to memory and back to deal with this restriction.
The second is that, at least in C++, doing such casting is deeply frowned upon, even when doing what amounts to voodoo such as this function does. I'm sure there are more compelling arguments, however I'm not sure what they are :)
So, in Doom 3, id included the following bit in their new implementation (which uses a different set of bit twiddling, but a similar idea):
union _flint {
dword i;
float f;
};
...
union _flint seed;
seed.i = /* look up some tables to get this */;
double r = seed.f; // <- access the bits of seed.i as a floating point number
original source on GitHub
Theoretically, on an SSE2 machine, this can be accessed through a single register; I'm not sure in practice whether any compiler would do this. It's still somewhat cleaner code in my opinion than the casting games in the earlier Quake version.
1 - ignoring "sufficiently advanced compiler" arguments

Smart typedefs

I've always used typedef in embedded programming to avoid common mistakes:
int8_t - 8 bit signed integer
int16_t - 16 bit signed integer
int32_t - 32 bit signed integer
uint8_t - 8 bit unsigned integer
uint16_t - 16 bit unsigned integer
uint32_t - 32 bit unsigned integer
The recent embedded muse (issue 177, not on the website yet) introduced me to the idea that it's useful to have some performance specific typedefs. This standard suggests having typedefs that indicate you want the fastest type that has a minimum size.
For instance, one might declare a variable using int_fast16_t, but it would actually be implemented as an int32_t on a 32 bit processor, or int64_t on a 64 bit processor as those would be the fastest types of at least 16 bits on those platforms. On an 8 bit processor it would be int16_t bits to meet the minimum size requirement.
Having never seen this usage before I wanted to know
Have you seen this in any projects, embedded or otherwise?
Any possible reasons to avoid this sort of optimization in typedefs?

For instance, one might declare a
variable using int_fast16_t, but it
would actually be implemented as an
int32_t on a 32 bit processor, or
int64_t on a 64 bit processor as those
would be the fastest types of at least
16 bits on those platforms
That's what int is for, isn't it? Are you likely to encounter an 8-bit CPU any time soon, where that wouldn't suffice?
How many unique datatypes are you able to remember?
Does it provide so much additional benefit that it's worth effectively doubling the number of types to consider whenever I create a simple integer variable?
I'm having a hard time even imagining the possibility that it might be used consistently.
Someone is going to write a function which returns a int16fast_t, and then someone else is going to come along and store that variable into an int16_t.
Which means that in the obscure case where the fast variants are actually beneficial, it may change the behavior of your code. It may even cause compiler errors or warnings.

Check out stdint.h from C99.

The main reason I would avoid this typedef is that it allows the type to lie to the user. Take int16_t vs int_fast16_t. Both type names encode the size of the value into the name. This is not an uncommon practice in C/C++. I personally use the size specific typedefs to avoid confusion for myself and other people reading my code. Much of our code has to run on both 32 and 64 bit platforms and many people don't know the various sizing rules between the platforms. Types like int32_t eliminate the ambiguity.
If I had not read the 4th paragraph of your question and instead just saw the type name, I would have assumed it was some scenario specific way of having a fast 16 bit value. And I obviously would have been wrong :(. For me it would violate the "don't surprise people" rule of programming.
Perhaps if it had another distinguishing verb, letter, acronym in the name it would be less likely to confuse users. Maybe int_fast16min_t ?

When I am looking at int_fast16_t, and I am not sure about the native width of the CPU in which it will run, it may make things complicated, for example the ~ operator.
int_fast16_t i = 10;
int_16_t j = 10;
if (~i != ~j) {
// scary !!!
}
Somehow, I would like to willfully use 32 bit or 64 bit based on the native width of the processor.

I'm actually not much of a fan of this sort of thing.
I've seen this done many times (in fact, we even have these typedefs at my current place of employment)... For the most part, I doubt their true usefulness... It strikes me as change for changes sake... (and yes, I know the sizes of some of the built ins can vary)...

I commonly use size_t, it happens to be the fastest address size, a tradition I picked up in embedding. And it never caused any issues or confusion in embedded circles, but it actually began causing me problems when I began working on 64bit systems.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight