I'm compiling this code with C99 (with some different type definitions like uint32 instead of uint32_t) for an old arm architecture.
uint32 x2 = *((uint32 *) &data[t]);
uint32 x3;
memcpy(&x3, &data[t], 4);
printf("%d %d %d %d %d %d", x2, x3, data[t], data[t + 1], data[t + 2], data[t + 3]);
(data is uchar* and have length > t + 4)
but surprisingly the output is this:
-268435454 2 2 0 0 0
what is wrong with this casting?
The x2 line causes undefined behavior. Firstly data[t] might not have 32-bit alignment, and secondly, it's probably a strict aliasing violation to read a 32-bit value from that location.
Just remove that line and use the x3 version.
As commented by the rest of the answers, the problem comes from an unaligned access. While unaligned access is absolutely supported by x64 or x86, you cannot say that it is fully (un/-)supported by ARM, because it is ARM version dependent.
There are three posibilities:
before ARMv5 (included). ARM didn't support unaligned access which is what uint32 x2 = *((uint32 *) &data[t]); is doing (from LDR point of view, three of the four bytes of the 32bit variable are unaligned to 4, and just one is aligned), so the result is undefined (hence an error). Given that, the problem has to be fixed by software (on ARM, __packed for unaligned pointers or structs is useful).
after ARMv7 (included), they do allow unaligned access*, so the code should be valid, without errors (however, performance is a total different topic, and I am pretty sure it will be slower compared with aligned access to 32bit, but this topic deserves its own entry).
ARMv6. Usually, ARM has something which makes things funnier, and alignment is not going to be different. Here, they added a bit in order to select which way you prefer (ARMv5 or ARMv7).
Considering "for an old arm architecture" comment, your case looks like the first one, but if you include your assembly code and architecture, that would be great for a complete answer.
*some instructions will fail as they cannot support it (i.e. STM)
You have several issues here. Do not use pointer punning, use unions for it.
Your printf format is wrong as well. You should use %u instead of %d.
Related
I am trying to implement a VERY efficient 2x2 matrix multiplication in C code for operation in an ARM Cortex-M4. The function accepts 3 pointers to 2x2 arrays, 2 for the inputs to be multiplied and an output buffer passed by the using function. Here is what I have so far...
static inline void multiply_2x2_2x2(int16_t a[2][2], int16_t b[2][2], int32_t c[2][2])
{
int32_t a00a01, a10a11, b00b01, b01b11;
a00a01 = a[0][0] | a[0][1]<<16;
b00b10 = b[0][0] | b[1][0]<<16;
b01b11 = b[0][1] | b[1][1]<<16;
c[0][0] = __SMUAD(a00a01, b00b10);
c[0][1] = __SMUAD(a00a01, b01b11);
a10a11 = a[1][0] | a[1][1]<<16;
c[1][0] = __SMUAD(a10a11, b00b10);
c[1][1] = __SMUAD(a10a11, b01b11);
}
Basically, my strategy is to to use the ARM Cortex-M4 __SMUAD() function to do the actual multiply accumulates. But this requires me to build the inputs a00a01, a10a11, b00b10, and b01b11 ahead of time. My question is, given that the C array should be a continuous in memory, is there a more efficient wat to pass the data into the functions directly without the intermediate variables? Secondary question, am I overthinking this and I should just let the compiler do its job as it is smarter than I am? I tend to do that a lot.
Thanks!
You could break the strict aliasing rules and load the matrix row directly into the 32-bit register, using a int16_t* to int32_t* typecast. An expression such as a00a01 = a[0][0] | a[0][1]<<16 just takes some consecutive bits from RAM and arranges them into other consecutive bits in registers. Consult your compiler manual for the flag to disable its strict aliasing assumptions, and make the cast safely usable.
You could also perhaps avoid transposing matrix columns into registers, by generating b in transposed format in the first place.
The best way to learn about the compiler, and get a sense of the cases for which it's smarter than you, is to disassemble its results and compare the instruction sequence to your intentions.
The first main concern is that some_signed_int << 16 invokes undefined behavior for negative numbers. So you have bugs all over. And then bitwise OR of two int16_t where either is negative does not necessarily form a valid int32_t either. Do you actually need the sign or can you drop it?
ARM examples use unsigned int, which in turn supposedly contains 2x int16_t in raw binary form. This is what you actually want too.
Also it would seem that it shouldn't matter for SMUAD which 16 bit word you place where. So the a[0][0] | a[0][1]<<16; just serves to needlessly swap data around in memory. It will confuse the compiler which can't optimize such code well. Sure, shifts etc are always very fast, but this is pointless overhead.
(As someone noted, this whole thing is likely much easier to write in pure assembler without concern of all the C type rules and undefined behavior.)
To avoid all these issues you could define your own union type:
typedef union
{
int16_t i16 [2][2];
uint32_t u32 [2];
} mat2x2_t;
u32[0] corresponds to i16[0][0] and i16[0][1]
u32[1] corresponds to i16[1][0] and i16[1][1]
C actually lets you "type pun" between these types pretty wildly (unlike C++). Unions also dodge the brittle strict aliasing rules.
The function can then become something along the lines of this pseudo code:
static uint32_t mat_mul16 (mat2x2_t a, mat2x2_t b)
{
uint32_t c0 = __SMUAD(a.u32[0], b.u32[0]);
...
}
Supposedly each such line should give 2x signed 16 multiplications as per the SMUAD instruction.
As for if this actually gives some revolutionary performance increase compared to some default MUL, I kind of doubt it. Disassemble and count CPU ticks.
am I overthinking this and I should just let the compiler do its job as it is smarter than I am?
Most likely :) The old rule of thumb: benchmark and then only manually optimize at the point when you've actually found a performance bottleneck.
I am trying to find out alignment on my 64bit machine(Win10 on Intel iCore7). I thought of this experiment:
void check_alignment(char c1, char c2 )
{
printf("delta=%d\n", (int)&c2 - (int)&c1); // prints 4 instead of 8
}
void main(){
check_alignment('a','b');
}
I was expecting delta=8. Since it's 64bit machine char c1 and char c2 should be stored on multiples of 8. Isn't that right?
Even if we assume compiler has done optimization to have them stored in less space, why not just store them back to back delta=1? Why 4 byte alignment?
I repeated the above experiment with float types, and still gives delta=4
void check_alignment(float f1, float f2 )
{
printf("delta=%d\n", (int)&c2 - (int)&c1); // prints 4
}
void main(){
check_alignment(1.0,1.1);
}
Firstly, if your platform is 64-bit, then why are you casting your pointer values to int? Is int 64-bit wide on your platform? If not, your subtraction is likely to produce a meaningless result. Use intptr_t or ptrdiff_t for that purpose, not int.
Secondly, in a typical implementation a 1-byte type will typically be aligned at 1-byte boundary, regardless of whether your platform is 64-bit or not. To see a 8-byte alignment you'd need a 8-byte type. And in order to see how it is aligned you have to inspect the physical value of the address (i.e. whether it is divisible by 1, 2, 4, 8 etc.), not analyze how far apart two variables are spaced.
Thirdly, how far apart c1 and c2 are in memory has little to do with alignment requirements of char type. It is determined by how char values are passed (or stored locally) on your platform. In your case they are apparently allocated 4-byte storage cells each. That's perfectly fine. Nobody every promised you that two unrelated objects with 1-byte alignment will be packed next to each other as tightly as possible.
If you want to determine alignment by measuring how far from each other two objects are stored, declare an array. Do not try to measure the distance between two independent objects - this is meaningless.
To determine the greatest fundamental alignment in your C implementation, use:
#include <stdio.h>
#include <stddef.h>
int main(void)
{
printf("%zd bytes\n", _Alignof(max_align_t));
}
To determine the alignment requirement of any particular type, replace max_align_t above with that type.
Alignment is not purely a function of the processor or other hardware. Hardware might support aligned or unaligned accesses with different performance effects, and some instructions might support unaligned access while others do not. A particular C implementation might choose to require or not require certain alignment in combination with choosing to use or not use various instructions. Additionally, on some hardware, whether unaligned access is supported is configurable by the operating system.
I've heard reads and writes of aligned int's are atomic and safe, I wonder when does the system make non malloc'd globals unaligned other than packed structures and casting/pointer arithmetic byte buffers?
[X86-64 linux] In all of my normal cases, the system always chooses integer locations that don't get word torn, for example, two byte on one word and the other two bytes on the other word. Can any one post a program/snip (C or assembly) that forces the global variable to unaligned address such that the integer gets torn and the system has to use two reads to load one integer value ?
When I print the below program, the addresses are close to each other such that multiple variables are within 64bits but never once word tearing is seen (smartness in the system or compiler ?)
#include <stdio.h>
int a;
char b;
char c;
int d;
int e = 0;
int isaligned(void *p, int N)
{
if (((int)p % N) == 0)
return 1;
else
return 0;
}
int main()
{
printf("processor is %d byte mode \n", sizeof(int *));
printf ( "a=%p/b=%p/c=%p/d=%p/f=%p\n", &a, &b, &c, &d, &e );
printf ( " check for 64bit alignment of test result of 0x80 = %d \n", isaligned( 0x80, 64 ));
printf ( " check for 64bit alignment of a result = %d \n", isaligned( &a, 64 ));
printf ( " check for 64bit alignment of d result = %d \n", isaligned( &e, 64 ));
return 0;}
Output:
processor is 8 byte mode
a=0x601038/b=0x60103c/c=0x60103d/d=0x601034/f=0x601030
check for 64bit alignment of test result of 0x80 = 1
check for 64bit alignment of a result = 0
check for 64bit alignment of d result = 0
How does a read of a char happen in the above case ? Does it read from 8 byte aligned boundary (in my case 0x601030 ) and then go to 0x60103c ?
Memory access granularity is always word size isn't it ?
Thx.
1) Yes, there is no guarantee that unaligned accesses are atomic, because [at least sometimes, on certain types of processors] the data may be written as two separate writes - for example if you cross over a memory page boundary [I'm not talking about 4KB pages for virtual memory, I'm talking about DDR2/3/4 pages, which is some fraction of the total memory size, typically 16Kbits times whatever the width is of the actual memory chip - which will vary depending on the memory stick itself]. Equally, on other processors than x86, you get a trap for reading unaligned memory, which would either cause the program to abort, or the read be emulated in software as multiple reads to "fix" the unaligned read.
2) You could always make an unaligned memory region by something like this:
char *ptr = malloc(sizeof(long long) * number+1);
long long *unaligned = (long long *)&ptr[2];
for(i = 0; i < number; i++)
temp = unaligned[i];
By the way, your alignment check checks if the address is aligned to 64 bytes, not 64 bits. You'll have to divide by 8 to check that it's aligned to 64 bits.
3) A char is a single byte read, and the address will be on the actual address of the byte itself. The actual memory read performed is probably for a full cache-line, starting at the target address, and then cycling around, so for example:
0x60103d is the target address, so the processor will read a cache line of 32 bytes, starting at the 64-bit word we want: 0x601038 (and as soon as that's completed the processor goes on to the next instruction - meanwhile the next read will be performed to fill the cacheline), then cacheline is filled with 0x601020, 0x601028, 0x601030. But should we turn the cache off [if you want your 3GHz latest x86 processor to be slightly slower than a 66MHz 486, disabling the cache is a good way to achieve that], the processor would just read one byte at 0x60103d.
4) Not on x86 processors, they have byte addressing - but for normal memory, reads are done on a cacheline basis, as explained above.
Note also that "may not be atomic" is not at all the same as "will not be atomic" - so you'll probably have a hard time making it go wrong by will - you really need to get all the timings of two different threads just right, and straddle cachelines, straddle memory page boundaries, and so on to make it go wrong - this will happen if you don't want it to happen, but trying to make it go wrong can be darn hard [trust me, I've been there, done that].
It probably doesn't, outside of those cases.
In assembly it's trivial. Something like:
.org 0x2
myglobal:
.word SOME_NUMBER
But on Intel, the processor can safely read unaligned memory. It might not be atomic, but that might not be apparent from the generated code.
Intel, right? The Intel ISA has single-byte read/write opcodes. Disassemble your program and see what it's using.
Not necessarily - you might have a mismatch between memory word size and processor word size.
1) This answer is platform-specific. In general, though, the compiler will align variables unless you force it to do otherwise.
2) The following will require two reads to load one variable when run on a 32-bit CPU:
uint64_t huge_variable;
The variable is larger than a register, so it will require multiple operations to access. You can also do something similar by using packed structures:
struct unaligned __attribute__ ((packed))
{
char buffer[2];
int unaligned;
char buffer2[2];
} sample_struct;
3) This answer is platform-specific. Some platforms may behave like you describe. Some platforms have instructions capable of fetching a half-register or quarter-register of data. I recommend examining the assembly emitted by your compiler for more details (make sure you turn off all compiler optimizations first).
4) The C language allows you to access memory with byte-sized granularity. How this is implemented under the hood and how much data your CPU fetches to read a single byte is platform-specific. For many CPUs, this is the same as the size of a general-purpose register.
The C standards guarantee that malloc(3) returns a memory area that complies to the strictest alignment requirements, so this just can't happen in that case. If there are unaligned data, it is probably read/written by pieces (that depends on the exact guarantees the architecture provides).
On some architectures unaligned access is allowed, on others it is a fatal error. When allowed, it is normally much slower than aligned access; when not allowed the compiler must take the pieces and splice them together, and that is even much slower.
Characters (really bytes) are normally allowed to have any byte address. The instructions working with bytes just get/store the individual byte in that case.
No, memory access is according to the width of the data. But real memory access is in terms of cache lines (read up on CPU cache for this).
Non-aligned objects can never come into existence without you invoking undefined behavior. In other words, there is no sequence of actions, all having well-defined behavior, which a program can take that will result in a non-aligned pointer coming into existence. In particular, there is no portable way to get the compiler to give you misaligned objects. The closest thing is the "packed structure" many compilers have, but that only applies to structure members, not independent objects.
Further, there is no way to test alignedness in portable C. You can use the implementation-defined conversions of pointers to integers and inspect the low bits, but there is no fundamental requirement that "aligned" pointers have zeros in the low bits, or that the low bits after conversion to integer even correspond to the "least significant" bits of the pointer, whatever that would mean. In other words, conversions between pointers and integers are not required to commute with arithmetic operations.
If you really want to make some misaligned pointers, the easiest way to do it, assuming alignof(int)>1, is something like:
char buf[2*sizeof(int)+1];
int *p1 = (int *)buf, *p2 = (int *)(buf+sizeof(int)+1);
It's impossible for both buf and buf+sizeof(int)+1 to be simultaneously aligned for int if alignof(int) is greater than 1. Thus at least one of the two (int *) casts gets applied to a misaligned pointer, invoking undefined behavior, and the typical result is a misaligned pointer.
I have a question regarding the various arithmetic operations for Intel SSE intrinsics.
what is the difference between doing a _mm_add_ps Vs. _mm_add_epi8/16/32? I want to make sure that my data is aligned at all times.
In a sample code when I do this:
__m128 u1 = _mm_load_ps(&V[(i-1)]);
I get a segmentation fault. But when I do this:
__m128 u1 = _mm_loadu_ps(&V[(i-1)]);
It works fine.
Since I want my data aligned i declared the array like this:
posix_memalign((void**)&V, 16, dx*sizeof(float));
Can someone help explain this.
_mm_add_ps add floats together, where _mm_add_epi8/16/32 adds integers, which are not floating point numbers.
_mm_loadu_ps does not require your floats to be 16byte (128bit) aligned, whereas _mm_load_ps does require 16byte alignment.
So if you get a seg fault on the first one, your alignment is wrong.
On the posix_memalign page it says this:
The posix_memalign() function shall fail if:
[EINVAL] The value of the alignment parameter is not a power of two
multiple of sizeof( void *).
I'm not sure that sizeof(float) == sizeof(void*) ??
Per this, it seems to be the same in C (on a 32bit system). Ok, a little trickery here, because the size of a pointer is normally the size of the CPU register width, 32bit or 64bit (8 bytes) depending on the system used, whereas a float would normally be 32bit (4 bytes)
Your aligned allocation should look more like this:
posix_memalign((void**)&V, 16, dx*sizeof(void*)); //since it will the correct size for your platform. You can always cast to `float` later on.
If I have a int32 type integer in the 8-bit processor's memory, say, 8051, how could I identify the endianess of that integer? Is it compiler specific? I think this is important when sending multybyte data through serial lines etc.
With an 8 bit microcontroller that has no native support for wider integers, the endianness of integers stored in memory is indeed up to the compiler writer.
The SDCC compiler, which is widely used on 8051, stores integers in little-endian format (the user guide for that compiler claims that it is more efficient on that architecture, due to the presence of an instruction for incrementing a data pointer but not one for decrementing).
If the processor has any operations that act on multi-byte values, or has an multi-byte registers, it has the possibility to have an endian-ness.
http://69.41.174.64/forum/printable.phtml?id=14233&thread=14207 suggests that the 8051 mixes different endian-ness in different places.
The endianness is specific to the CPU architecture. Since a compiler needs to target a particular CPU, the compiler would have knowledge of the endianness as well. So if you need to send data over a serial connection, network, etc you may wish to use build-in functions to put data in network byte order - especially if your code needs to support multiple architectures.
For more information, see: http://www.gnu.org/s/libc/manual/html_node/Byte-Order.html
It's not just up to the compiler - '51 has some native 16-bit registers (DPTR, PC in standard, ADC_IN, DAC_OUT and such in variants) of given endianness which the compiler has to obey - but outside of that, the compiler is free to use any endianness it prefers or one you choose in project configuration...
An integer does not have endianness in it. You can't determine just from looking at the bytes whether it's big or little endian. You just have to know: For example if your 8 bit processor is little endian and you're receiving a message that you know to be big endian (because, for example, the field bus system defines big endian), you have to convert values of more than 8 bits. You'll need to either hard-code that or to have some definition on the system on which bytes to swap.
Note that swapping bytes is the easy thing. You may also have to swap bits in bit fields, since the order of bits in bit fields is compiler-specific. Again, you basically have to know this at build time.
unsigned long int x = 1;
unsigned char *px = (unsigned char *) &x;
*px == 0 ? "big endian" : "little endian"
If x is assigned the value 1 then the value 1 will be in the least significant byte.
If we then cast x to be a pointer to bytes, the pointer will point to the lowest memory location of x. If that memory location is 0 it is big endian, otherwise it is little endian.
#include <stdio.h>
union foo {
int as_int;
char as_bytes[sizeof(int)];
};
int main() {
union foo data;
int i;
for (i = 0; i < sizeof(int); ++i) {
data.as_bytes[i] = 1 + i;
}
printf ("%0x\n", data.as_int);
return 0;
}
Interpreting the output is up to you.