Working with Intel SSE SIMD intrinsics - c

I have a question regarding the various arithmetic operations for Intel SSE intrinsics.
What is the difference between _mm_add_ps vs. _mm_add_epi8/16/32? I want to make sure that my data is aligned at all times.
In a sample code when I do this:
__m128 u1 = _mm_load_ps(&V[(i-1)]);
I get a segmentation fault. But when I do this:
__m128 u1 = _mm_loadu_ps(&V[(i-1)]);
It works fine.
Since I want my data aligned, I declared the array like this:
posix_memalign((void**)&V, 16, dx*sizeof(float));
Can someone help explain this?

_mm_add_ps adds packed floats together, whereas _mm_add_epi8/16/32 adds packed integers (8-, 16-, or 32-bit elements), which are not floating point numbers.
_mm_loadu_ps does not require your floats to be 16-byte (128-bit) aligned, whereas _mm_load_ps does require 16-byte alignment.
So if you get a seg fault on the first one, the alignment of the address you load from is wrong.
On the posix_memalign page it says this:
The posix_memalign() function shall fail if:
[EINVAL] The value of the alignment parameter is not a power of two
multiple of sizeof( void *).
Note that this restriction applies to the alignment argument, not to the size: 16 is a power-of-two multiple of sizeof(void *) on both 32-bit (4-byte pointers) and 64-bit (8-byte pointers) systems, so your call is valid and V itself is 16-byte aligned.
The crash therefore comes from the load address, not the allocation. _mm_load_ps requires the pointer you pass it to be 16-byte aligned, and &V[i-1] is 16-byte aligned only when (i-1) is a multiple of 4, since four 4-byte floats fill 16 bytes. For any other index the aligned load faults, which is exactly why _mm_loadu_ps works fine.
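For illustration, a minimal sketch along those lines (dx and V are carried over from the question; the indices 4 and 5 are assumed example offsets):

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <xmmintrin.h>

int main(void)
{
    float *V;
    size_t dx = 64;
    if (posix_memalign((void **)&V, 16, dx * sizeof(float)) != 0)
        return 1;                      /* allocation failed */

    for (size_t i = 0; i < dx; i++)
        V[i] = (float)i;

    __m128 a = _mm_load_ps(&V[4]);     /* OK: 4 floats = 16 bytes, aligned */
    __m128 b = _mm_loadu_ps(&V[5]);    /* &V[5] is not 16-byte aligned     */
    __m128 sum = _mm_add_ps(a, b);     /* packed single-precision add      */

    (void)sum;
    free(V);
    return 0;
}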

Related

Casting from uchar* to uint* makes unpredictable results

I'm compiling this code with C99 (with some different type definitions, like uint32 instead of uint32_t) for an old ARM architecture.
uint32 x2 = *((uint32 *) &data[t]);
uint32 x3;
memcpy(&x3, &data[t], 4);
printf("%d %d %d %d %d %d", x2, x3, data[t], data[t + 1], data[t + 2], data[t + 3]);
(data is uchar* and has length > t + 4)
but surprisingly the output is this:
-268435454 2 2 0 0 0
What is wrong with this cast?
The x2 line causes undefined behavior. Firstly data[t] might not have 32-bit alignment, and secondly, it's probably a strict aliasing violation to read a 32-bit value from that location.
Just remove that line and use the x3 version.
As noted in the other answers, the problem comes from an unaligned access. While unaligned access is fully supported on x86 and x86-64, you cannot say the same categorically for ARM, because support depends on the ARM version.
There are three possibilities:
Up to and including ARMv5: ARM does not support unaligned access, which is what uint32 x2 = *((uint32 *) &data[t]); performs (from LDR's point of view, the 32-bit load is unaligned unless &data[t] happens to be a multiple of 4), so the result is undefined (hence the garbage value). Given that, the problem has to be fixed in software (on ARM compilers, __packed for unaligned pointers or structs is useful).
From ARMv7 onward: unaligned access is allowed*, so the code should run without faulting (performance is a totally different topic, and I am pretty sure it will be slower than an aligned 32-bit access, but that deserves its own entry).
ARMv6: as usual, ARM added something to make things more fun, and alignment is no exception. Here a control bit lets you select which of the two behaviours you prefer (ARMv5-style or ARMv7-style).
Considering the "for an old arm architecture" comment, your case looks like the first one, but if you include your assembly output and exact architecture, a complete answer becomes possible.
*some instructions will still fault, as they cannot support unaligned addresses (e.g. STM)
You have several issues here. Do not use pointer type punning; use a union (or memcpy, as in your x3 line) for it.
Your printf format is wrong as well: you should use %u instead of %d, since the values are unsigned.
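As a minimal sketch of that advice (read_u32 is an illustrative helper name, not from the question):

#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <inttypes.h>

/* Read a 32-bit value (in the machine's byte order) from a buffer
   at any alignment. */
static uint32_t read_u32(const unsigned char *p)
{
    union {
        unsigned char bytes[4];
        uint32_t word;
    } u;
    memcpy(u.bytes, p, 4);   /* memcpy has no alignment requirement */
    return u.word;
}

int main(void)
{
    unsigned char data[8] = {2, 0, 0, 0, 5, 0, 0, 0};
    printf("%" PRIu32 " %" PRIu32 "\n", read_u32(&data[0]), read_u32(&data[4]));
    return 0;
}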

Alignment on a 64-bit machine is not 8 bytes

I am trying to find out the alignment on my 64-bit machine (Win10 on an Intel Core i7). I thought of this experiment:
#include <stdio.h>

void check_alignment(char c1, char c2)
{
    printf("delta=%d\n", (int)&c2 - (int)&c1); // prints 4 instead of 8
}

int main(void)
{
    check_alignment('a', 'b');
}
I was expecting delta=8. Since it's a 64-bit machine, char c1 and char c2 should be stored at multiples of 8. Isn't that right?
Even if we assume the compiler has optimized them into less space, why not just store them back to back (delta=1)? Why 4-byte alignment?
I repeated the experiment with float parameters, and it still gives delta=4:
void check_alignment(float f1, float f2)
{
    printf("delta=%d\n", (int)&f2 - (int)&f1); // prints 4
}

int main(void)
{
    check_alignment(1.0f, 1.1f);
}
Firstly, if your platform is 64-bit, why are you casting your pointer values to int? Is int 64 bits wide on your platform? If not, the subtraction is likely to produce a meaningless result. Use intptr_t or ptrdiff_t for that purpose, not int.
Secondly, in a typical implementation a 1-byte type will be aligned at a 1-byte boundary, regardless of whether your platform is 64-bit or not. To see 8-byte alignment you'd need an 8-byte type. And in order to see how an object is aligned you have to inspect the numeric value of its address (i.e. whether it is divisible by 1, 2, 4, 8, etc.), not analyze how far apart two variables are spaced.
Thirdly, how far apart c1 and c2 are in memory has little to do with the alignment requirements of char. It is determined by how char values are passed (or stored locally) on your platform. In your case they are apparently allocated 4-byte storage cells each. That's perfectly fine. Nobody ever promised you that two unrelated objects with 1-byte alignment will be packed next to each other as tightly as possible.
If you want to determine alignment by measuring how far from each other two objects are stored, declare an array. Do not try to measure the distance between two independent objects - this is meaningless.
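A short sketch of the array approach (double is just an example element type):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    double arr[2];
    /* Array elements are contiguous, so their spacing is exactly sizeof(double). */
    printf("spacing = %zu\n", (size_t)((char *)&arr[1] - (char *)&arr[0]));
    /* Inspect the address value itself to see the alignment actually in effect. */
    printf("addr %% 8 == %u\n", (unsigned)((uintptr_t)&arr[0] % 8));
    return 0;
}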
To determine the greatest fundamental alignment in your C implementation, use:
#include <stdio.h>
#include <stddef.h>
int main(void)
{
    printf("%zu bytes\n", _Alignof(max_align_t));
}
To determine the alignment requirement of any particular type, replace max_align_t above with that type.
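For example, a minimal sketch applying that suggestion to a few concrete types (the printed values are platform-dependent):

#include <stdio.h>
#include <stddef.h>

int main(void)
{
    printf("char:        %zu\n", _Alignof(char));
    printf("int:         %zu\n", _Alignof(int));
    printf("double:      %zu\n", _Alignof(double));
    printf("max_align_t: %zu\n", _Alignof(max_align_t));
}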
Alignment is not purely a function of the processor or other hardware. Hardware might support aligned or unaligned accesses with different performance effects, and some instructions might support unaligned access while others do not. A particular C implementation might choose to require or not require certain alignment in combination with choosing to use or not use various instructions. Additionally, on some hardware, whether unaligned access is supported is configurable by the operating system.

Convert __m256d to __m256i

Since casts like this:
__m256d a;
uint64_t t[4];
_mm256_store_si256( (__m256i*)t, (__m256i)a );/* Cast of 'a' to __m256i not allowed */
are not allowed when compiling under Visual Studio, I thought I could use some intrinsic functions to convert a __m256d value into a __m256i before passing it to _mm256_store_si256 and thus, avoiding the cast which causes the error.
But after looking through that list, I couldn't find a function taking a __m256d argument and returning a __m256i value. So maybe you could help me write my own function, or find the function I'm looking for: one that stores the bit patterns of four 64-bit doubles to an array of four 64-bit integers.
EDIT:
After further research, I found _mm256_cvtpd_epi64 which seems to be exactly what I want. But, my CPU doesn't support AVX512 instructions set...
What is left for me to do here?
You could use _mm256_store_pd( (double*)t, a). I'm pretty sure this is strict-aliasing safe because you're not directly dereferencing the pointer after casting it. The _mm256_store_pd intrinsic wraps the store with any necessary may-alias stuff.
(With AVX512, Intel switched to using void* for the load/store intrinsics instead of float*, double*, or __m512i*, to remove the need for these clunky casts and make it more clear that intrinsics can alias anything.)
The other option is to use _mm256_castpd_si256 to reinterpret the bits of your __m256d as a __m256i:
alignas(32) uint64_t t[4];
_mm256_store_si256( (__m256i*)t, _mm256_castpd_si256(a));
This cast compiles to zero asm instructions; it's just a type-pun to keep the C compiler happy.
If you read from t[] right away, your compiler might optimize away the store/reload and just shuffle or pextrq rax, xmm0, 1 to extract FP bit patterns directly into integer registers. You could write this manually with intrinsics. Store/reload is not bad, though, especially if you want more than one of the double bit-patterns as scalar integers.
You could instead use union m256_elements { uint64_t u64[4]; __m256d vecd; };, but there's no guarantee that will compile efficiently.
If you wanted to actually round packed double to the nearest signed or unsigned 64-bit integer, and have the result in 2's complement or unsigned binary instead of IEEE754 binary64, you need AVX512DQ _mm256/512_cvtpd_epi64 (vcvtpd2qq) for it to be efficient. SSE2 + x86-64 can do it for scalar, or you can use some packed FP hacks for numbers in the [0..2^52] range: How to efficiently perform double/int64 conversions with SSE/AVX?.
BTW, storeu doesn't require an aligned destination, but store does. If the destination is a local, you should normally align it instead of using an unaligned store, at least if the store happens in a loop, or if this function can inline into a larger function.
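Putting the first option together, a minimal self-contained sketch (assumes compiling with AVX enabled, e.g. gcc -mavx; the element values are arbitrary):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    __m256d a = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);  /* elements given high to low */
    uint64_t t[4];

    /* Store the raw bits of the four doubles into the integer array;
       storeu has no alignment requirement on the destination. */
    _mm256_storeu_pd((double *)t, a);

    for (int i = 0; i < 4; i++)
        printf("t[%d] = 0x%016llx\n", i, (unsigned long long)t[i]);
    return 0;
}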

Can I cast pointers like this?

Code:
unsigned char array_add[8]={0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00};
...
if ((*((uint32_t*)array_add)!=0)||(*((uint32_t*)array_add+1)!=0))
{
...
}
I want to check whether the array is all zero. So naturally I thought of casting the address of the array, which is also the address of its first member, to uint32_t*, so I'd only need two comparisons, since it's a 64-bit, 8-byte array. The problem is, it compiles successfully but the program crashes every time around here.
I'm running my program on a Cortex-M0 microcontroller (a 32-bit core).
How wrong am I?
In theory this could work, but in practice there is a thing you aren't considering: aligned memory accesses.
If a uint32_t requires aligned memory access (e.g. to 4 bytes), then casting an array of unsigned char, which has a 1-byte alignment requirement, to uint32_t* produces a potentially misaligned uint32_t pointer.
According to documentation:
There is no support for unaligned accesses on the Cortex-M0 processor. Any attempt to perform an unaligned memory access operation results in a HardFault exception.
In practice this is just dangerous and fragile code which invokes undefined behavior in certain circumstances, as pointed out by Olaf and better explained here.
To test multiple bytes at once, code could use memcmp().
How speedy this is depends mostly on the compiler, as an optimizing compiler may simply emit code that does a quick 8-byte (or two 4-byte) compare. Even memcmp() might not be too slow on a small processor. Profiling the code helps.
Take care with micro-optimizations, as they too often are not an efficient use of a coder's time for significant optimizations.
unsigned char array_add[8] = ...
const unsigned char array_zero[8]={0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00};
if (memcmp(array_zero, array_add, 8) == 0) ...
Another method uses a union. Be careful not to assume whether add.array8[0] is the most or the least significant byte.
union {
    uint8_t array8[8];
    uint64_t array64;
} add;

/* The test below checks whether all 8 bytes of add.array8[] are zero. */
if (add.array64 == 0)
In general, focus on writing clear code and reserve such small optimizations to very select cases.
I am not sure, but if your array has 8 bytes then you could just read it as a single 64-bit value and compare that to 0. That should solve your problem of checking whether the array is all 0.
Edit 1: After Olaf's comment I would say: replace long long with int64_t. However, why not use a simple loop to iterate over the array and check? Eight chars is all you need to compare.
Edit 2: The other approach could be to OR all elements of the array and then compare the result with 0. If all are 0, the OR will be zero. I do not know whether CMP or OR will be faster; please refer to the Cortex-M0 docs for exact CPU cycle requirements, though I would expect CMP to be slower.
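A minimal sketch of the OR approach from Edit 2 (all_zero is an illustrative name, not from the question); since it reads byte by byte, it works at any alignment and sidesteps the HardFault entirely:

#include <stdint.h>
#include <stddef.h>

/* Returns nonzero if all n bytes of p are zero; no alignment requirement. */
static int all_zero(const unsigned char *p, size_t n)
{
    unsigned char acc = 0;
    for (size_t i = 0; i < n; i++)
        acc |= p[i];        /* OR every byte into one accumulator */
    return acc == 0;
}

/* usage: if (all_zero(array_add, 8)) { ... } */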

How do I force the program to use unaligned addresses?

I've heard reads and writes of aligned ints are atomic and safe. I wonder: when does the system place non-malloc'd globals at unaligned addresses, other than in packed structures and via casting/pointer arithmetic on byte buffers?
[x86-64 Linux] In all of my normal cases, the system always chooses integer locations that don't get word-torn (torn meaning, for example, two bytes in one word and the other two bytes in the next). Can anyone post a program/snippet (C or assembly) that forces a global variable onto an unaligned address, such that the integer gets torn and the system has to use two reads to load one integer value?
When I run the program below, the addresses are close to each other, such that multiple variables sit within 64 bits, but never once is word tearing seen (smartness in the system or the compiler?)
#include <stdio.h>

int a;
char b;
char c;
int d;
int e = 0;

int isaligned(void *p, int N)
{
    if (((int)p % N) == 0)
        return 1;
    else
        return 0;
}

int main()
{
    printf("processor is %d byte mode \n", sizeof(int *));
    printf("a=%p/b=%p/c=%p/d=%p/f=%p\n", &a, &b, &c, &d, &e);
    printf(" check for 64bit alignment of test result of 0x80 = %d \n", isaligned(0x80, 64));
    printf(" check for 64bit alignment of a result = %d \n", isaligned(&a, 64));
    printf(" check for 64bit alignment of d result = %d \n", isaligned(&e, 64));
    return 0;
}
Output:
processor is 8 byte mode
a=0x601038/b=0x60103c/c=0x60103d/d=0x601034/f=0x601030
check for 64bit alignment of test result of 0x80 = 1
check for 64bit alignment of a result = 0
check for 64bit alignment of d result = 0
How does a read of a char happen in the above case? Does it read from an 8-byte aligned boundary (in my case 0x601030) and then go to 0x60103c?
Is memory access granularity always the word size?
Thx.
1) Yes, there is no guarantee that unaligned accesses are atomic, because [at least sometimes, on certain types of processors] the data may be written as two separate writes - for example if you cross a memory page boundary [I'm not talking about 4KB pages for virtual memory, I'm talking about DDR2/3/4 pages, which are some fraction of the total memory size, typically 16 Kbits times whatever the width is of the actual memory chip - which will vary depending on the memory stick itself]. Equally, on processors other than x86, you get a trap for reading unaligned memory, which will either cause the program to abort or the read to be emulated in software as multiple reads to "fix" the unaligned access.
2) You could always make an unaligned memory region with something like this:
char *ptr = malloc(sizeof(long long) * (number + 1)); /* one element of slack */
long long *unaligned = (long long *)&ptr[2];          /* 2-byte offset: misaligned */
for (i = 0; i < number; i++)
    temp = unaligned[i];                              /* each load is unaligned */
By the way, your alignment check checks if the address is aligned to 64 bytes, not 64 bits. You'll have to divide by 8 to check that it's aligned to 64 bits.
3) A char is a single-byte read, and the address will be the actual address of the byte itself. The memory read actually performed is probably for a full cache line, starting at the target address and then wrapping around, so for example:
0x60103d is the target address, so the processor will read a cache line of 32 bytes, starting at the 64-bit word we want, 0x601038 (and as soon as that's completed, the processor goes on to the next instruction - meanwhile further reads are performed to fill the rest of the cache line); the line is then filled with 0x601020, 0x601028, 0x601030. But should we turn the cache off [if you want your 3GHz latest x86 processor to be slightly slower than a 66MHz 486, disabling the cache is a good way to achieve that], the processor would just read one byte at 0x60103d.
4) Not on x86 processors: they have byte addressing. But for normal memory, reads are done on a cache-line basis, as explained above.
Note also that "may not be atomic" is not at all the same as "will not be atomic" - so you'll probably have a hard time making it go wrong on purpose. You really need to get the timing of two different threads just right, and straddle cache lines, straddle memory page boundaries, and so on, to make it go wrong - it will happen when you don't want it to happen, but trying to make it go wrong can be darn hard [trust me, I've been there, done that].
It probably doesn't, outside of those cases.
In assembly it's trivial. Something like:
.org 0x2
myglobal:
.word SOME_NUMBER
But on Intel, the processor can safely read unaligned memory. It might not be atomic, but that might not be apparent from the generated code.
Intel, right? The Intel ISA has single-byte read/write opcodes. Disassemble your program and see what it's using.
Not necessarily - you might have a mismatch between memory word size and processor word size.
1) This answer is platform-specific. In general, though, the compiler will align variables unless you force it to do otherwise.
2) The following will require two reads to load one variable when run on a 32-bit CPU:
uint64_t huge_variable;
The variable is larger than a register, so it will require multiple operations to access. You can also do something similar by using packed structures:
struct __attribute__((packed)) unaligned
{
    char buffer[2];
    int unaligned;      /* starts at offset 2: misaligned for a 4-byte int */
    char buffer2[2];
} sample_struct;
3) This answer is platform-specific. Some platforms may behave like you describe. Some platforms have instructions capable of fetching a half-register or quarter-register of data. I recommend examining the assembly emitted by your compiler for more details (make sure you turn off all compiler optimizations first).
4) The C language allows you to access memory with byte-sized granularity. How this is implemented under the hood and how much data your CPU fetches to read a single byte is platform-specific. For many CPUs, this is the same as the size of a general-purpose register.
The C standard guarantees that malloc(3) returns a memory area that satisfies the strictest alignment requirements, so this just can't happen in that case. If data is unaligned, it is probably read/written in pieces (that depends on the exact guarantees the architecture provides).
On some architectures unaligned access is allowed; on others it is a fatal error. When allowed, it is normally much slower than aligned access; when not allowed, the compiler must fetch the pieces and splice them together, which is slower still.
Characters (really bytes) are normally allowed to have any byte address. The instructions working with bytes just get/store the individual byte in that case.
No, memory access is according to the width of the data. But real memory access is in terms of cache lines (read up on CPU cache for this).
Non-aligned objects can never come into existence without you invoking undefined behavior. In other words, there is no sequence of actions, all having well-defined behavior, which a program can take that will result in a non-aligned pointer coming into existence. In particular, there is no portable way to get the compiler to give you misaligned objects. The closest thing is the "packed structure" many compilers have, but that only applies to structure members, not independent objects.
Further, there is no way to test alignedness in portable C. You can use the implementation-defined conversions of pointers to integers and inspect the low bits, but there is no fundamental requirement that "aligned" pointers have zeros in the low bits, or that the low bits after conversion to integer even correspond to the "least significant" bits of the pointer, whatever that would mean. In other words, conversions between pointers and integers are not required to commute with arithmetic operations.
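As a hedged illustration of that implementation-defined test (it assumes the common flat memory model where the low bits of the converted pointer reflect the address; the function name is illustrative):

#include <stdint.h>
#include <stdalign.h>

/* Implementation-defined, but works on typical platforms: inspect the
   low bits of the pointer-to-integer conversion. */
static int probably_aligned_for_int(const void *p)
{
    return ((uintptr_t)p % alignof(int)) == 0;
}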
If you really want to make some misaligned pointers, the easiest way to do it, assuming alignof(int)>1, is something like:
char buf[2*sizeof(int)+1];
int *p1 = (int *)buf, *p2 = (int *)(buf+sizeof(int)+1);
It's impossible for both buf and buf+sizeof(int)+1 to be simultaneously aligned for int if alignof(int) is greater than 1. Thus at least one of the two (int *) casts gets applied to a misaligned pointer, invoking undefined behavior, and the typical result is a misaligned pointer.
