Code:
unsigned char array_add[8]={0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00};
...
if ((*((uint32_t*)array_add)!=0)||(*((uint32_t*)array_add+1)!=0))
{
...
}
I want to check whether the array is all zero. So naturally I thought of casting the address of the array, which also happens to be the address of its first member, to a 32-bit unsigned integer pointer; that way I'd only need to do the comparison twice, since it's a 64-bit, 8-byte array. The problem is that it compiles successfully, but the program crashes every time around here.
I'm running my program on a Cortex-M0 microcontroller (a 32-bit core).
How wrong am I?
In theory this could work, but in practice there is something you aren't considering: aligned memory accesses.
If a uint32_t requires aligned memory access (e.g. to 4 bytes), then casting an array of unsigned char, which has a 1-byte alignment requirement, to uint32_t* produces a pointer to a potentially unaligned array of uint32_t.
According to documentation:
There is no support for unaligned accesses on the Cortex-M0 processor. Any attempt to perform an unaligned memory access operation results in a HardFault exception.
In practice this is just dangerous and fragile code which invokes undefined behavior in certain circumstances, as pointed out by Olaf and better explained here.
To test multiple bytes at once, code could use memcmp().
How speedy this is depends on the compiler, as an optimizing compiler may simply emit code that does a quick 8-byte (or two 4-byte) compare. Even memcmp() might not be too slow on a small processor; profiling the code helps.
Take care with micro-optimizations: they are often not an efficient use of a coder's time compared to more significant optimizations.
#include <string.h>  /* for memcmp() */

unsigned char array_add[8] = ...
const unsigned char array_zero[8]={0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00};
if (memcmp(array_zero, array_add, 8) == 0) ...
Another method uses a union. Be careful not to assume whether add.array8[0] is the most or the least significant byte.
union {
uint8_t array8[8];
uint64_t array64;
} add;
// The code below checks whether all 8 bytes of add.array8[] are zero.
if (add.array64 == 0)
In general, focus on writing clear code and reserve such small optimizations to very select cases.
I am not sure, but if your array has 8 bytes, then just cast its base address to a pointer to long long and compare the pointed-to value with 0. That should solve your problem of checking whether the array is all 0.
Edit 1: After Olaf's comment I would say: replace long long with int64_t. However, why not use a simple loop to iterate over the array and check each element? 8 chars is all you need to compare.
Edit 2: Another approach could be to OR all elements of the array together and then compare the result with 0. If all are 0, the OR will be zero; see the sketch below. I do not know whether CMP or OR will be faster. Please refer to the Cortex-M0 docs for the exact CPU cycle requirements; however, I would expect CMP to be slower.
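A minimal sketch of the OR idea (the loop from Edit 1 looks the same with a comparison in place of the OR):

#include <stdint.h>

/* OR all 8 bytes together; the accumulator is zero only if every byte is zero */
static int all_zero(const uint8_t a[8])
{
    uint8_t acc = 0;
    for (int i = 0; i < 8; i++)
        acc |= a[i];
    return acc == 0;
}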
Sorry, I am not sure if I wrote the title accurately.
But first, here are my constraints:
Array[], used as a register map, is declared as an unsigned 8-bit array (uint8_t), so that indexing (the offset) is per byte.
Data to be read/written into the array has varying widths (8-bit, 16-bit, 32-bit and 64-bit).
Memory is very limited, and speed is a must.
What are the caveats of doing the following?
uint8_t some_function(uint16_t offset_addr) // 16-bit address
{
    uint8_t Array[0x100];
    uint8_t data_byte = 0xAA;
    uint16_t data_word;
    uint32_t data_double = 0xBEEFFACE;
    // A. Storing wider data into the array
    *((uint32_t *) &Array[offset_addr]) = data_double;
    // B. Reading multiple bytes from the array
    data_word = *((uint16_t *) &Array[offset_addr]);
    return 0;
}
I know I could try writing the data byte by byte, but that would be slow due to the bit shifting.
Is there going to be a significant problem with this usage?
I have run this on my hardware and have not seen any problems so far, but I want to take note of potential problems this implementation might cause.
Is there going to be a significant problem with this usage?
It produces undefined behavior. Therefore, even if in practice it manifests as you intend on your current C implementation, hardware, program, and data, you might find that it breaks unexpectedly when something (anything) changes.
Even if the compiler implements the cast and dereference in the obvious way (which it is not obligated to do, because of the UB), the misaligned accesses resulting from your approach will at least slow down many CPUs, and will produce traps on some.
The standard-conforming way to do what you want is this:
#include <string.h>  /* for memcpy() */

uint8_t some_function(uint16_t offset_addr) {
    uint8_t Array[0x100];
    uint8_t data_byte = 0xAA;
    uint16_t data_word;
    uint32_t data_double = 0xBEEFFACE;
    // A. Storing wider data into the array
    memcpy(Array + offset_addr, &data_double, sizeof data_double);
    // B. Reading multiple bytes from the array
    memcpy(&data_word, Array + offset_addr, sizeof data_word);
    return 0;
}
This is not necessarily any slower than your version, and it has defined behavior as long as you do not overrun the bounds of your array.
This is probably fine. Many have done things like this. C performs well with this kind of thing.
Two things to watch out for:
Buffer overruns. You know those zero-days like EternalBlue and attacks like WannaCry? Many of them exploited bugs in code like yours: malicious input caused the code to write too much data into structures like your uint8_t Array[0x100]. Be careful. Avoid allocating buffers on the stack (as function-local variables) as you have done, because clobbering the stack is exploitable. Make the buffers big enough, and check that you don't overrun them.
Machine byte ordering vs. network byte ordering, aka endianness. If these data structures move from machine to machine over the net you may get into trouble.
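For the buffer-overrun point, a sketch of what a bounds check around the memcpy from the earlier answer might look like (store_u32 and MAP_SIZE are hypothetical names for illustration):

#include <stdint.h>
#include <string.h>

#define MAP_SIZE 0x100u

/* Reject any write that would run past the end of the register map. */
static int store_u32(uint8_t map[], uint16_t offset, uint32_t value)
{
    if (offset > MAP_SIZE - sizeof value)
        return -1;              /* would overrun the buffer */
    memcpy(map + offset, &value, sizeof value);
    return 0;
}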
Suppose I have a char array and an associated length, Arr and Len. This is not a string but a char array; there is no null terminator. I have to copy the array data into an integer of type int64_t. Here's how it's done, and for the purposes of this question I'm assuming Len will not exceed 8:
int64_t Word = 0;
memcpy(&Word, Arr, Len);
Is this actually the proper way to do this? I am copying memory, but is there a faster way to do it inline, for example, so that Word can live in a register?
The problem with a type pun is that it assumes Arr has 8 bytes allocated. No: Arr has at most 8 bytes allocated. It could have 5, so casting Arr to an int64_t * and then dereferencing it could try to access three illegal bytes past the end, resulting in a segfault.
Is the proper way to do what I describe a memcpy() call, or is there a faster or better way?
Since you specify that Len is at most 8, it's reasonable to assume little-endian storage, i.e., the least significant byte at Arr[0].
If Len were fixed at 8, the compiler might be able to replace the memcpy simply by loading the value from memory. That would also depend on whether the platform can do unaligned reads (if the compiler can't prove alignment) and may involve something like the bswap instruction on x86-64 if the stored data is big-endian.
The fact that Len is a run-time value will likely force a call to memcpy, and the overhead of the call itself is not trivial. All things considered, it's probably best just to handle this in an endian-independent way using byte arithmetic. The code below assumes 8-bit bytes, which seems consistent with your question.
uint64_t Word = 0;
while (Len--)                        /* Arr[0] ends up as the least significant byte */
    Word = (Word << 8) | Arr[Len];
On more exotic platforms, where CHAR_BIT > 8, you can replace the right-hand side of the OR expression with (Arr[Len] & 0xff). In fact, this mask is optimised away on platforms with 8-bit bytes, so you might as well add it for completeness; or just keep these issues in mind.
There are platforms with conforming C implementations where char, short and int are all 32-bit values, for example. These are quite common in the embedded world.
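For completeness, the defensive variant of the loop body mentioned above would be:

Word = (Word << 8) | (Arr[Len] & 0xff);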
Does the return type matter when returning from a function?
This is kind of a 2-part question.
I believe an 8-bit operation would take the same time as a 32-bit operation.
I believe an 8-bit value is operated on in a 32-bit register, so it will be promoted to a 32-bit value and then cast back down to an 8-bit value.
unsigned char SomeFunc() <- Quickest and least memory.
unsigned short SomeFunc()
unsigned long SomeFunc()
"All operations should be performed on the smallest variable wherever possible, this saves both time and space" True or False?
On a 32-bit operating system, I don't believe it would matter, since the return register is 32 bits anyway, whether it holds a value or an address.
So it would neither save time nor space.
I do understand that there might be a need to return a char/byte if that's all you're dealing with, but you could still return a long and cast it.
I think you're still casting either way, whether before or after you leave the function. I almost think it is easier and faster to deal with 32-bit values than with 16- or 8-bit values.
Second part.
In the following function, I don't believe it would be any quicker or save any more space if I were to return an unsigned short instead.
unsigned long SomeFunc(unsigned char a, unsigned char b)
{
    unsigned long c = a + b;
    return c;
}
or
unsigned long SomeFunc(unsigned char a)
{
    // This will be promoted to a 32-bit value anyway.
    return a & 0x1;
}
Would the following function somehow be quicker and take up less memory?
unsigned char SomeFunc(unsigned char a)
{
    // This will be promoted to a 32-bit value anyway.
    return a & 0x1;
}
tl;dr: Trust your optimizer. Don't fight the type system.
Yes, you should use the smallest type.
Depending on your compiler, they may compile differently, and your compiler might have good reason to do that. If you lie to your compiler about the types, you're interfering with its ability to optimize your code.
And you don't know how the return value will be used.
Not all memory is a single variable. Consider arrays and structs: they are allocated in large blocks well beyond the 32- or 64-bit native sizes. If you return a larger type than necessary, you force any array or struct that stores the result to use more memory. For example, if you return an int where you should be returning a char, and I have to store those values in an array, that array will be four to eight times larger; see the sketch below.
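A quick sketch of that storage difference for a thousand stored return values:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t  as_bytes[1000];   /* 1000 bytes */
    uint32_t as_ints[1000];    /* 4000 bytes on typical platforms */
    printf("%zu vs %zu\n", sizeof as_bytes, sizeof as_ints);
    return 0;
}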
"I do understand that there might be a need to return a char/byte if that's all you're dealing with, but you could still return a long and cast it. I think you're still casting either way, whether before or after you leave the function. I almost think it is easier and faster to deal with 32-bit values than with 16- or 8-bit values."
The return type tells people reading the code what type to use to store the return value. If you habitually use larger types than necessary, then knowing which ones can safely be cast to smaller types is hidden information only you possess. And if it's only in your head, you're going to forget.
Gratuitous casting defeats the safety of the type system. Type checks let you know if you're putting data of the wrong type or size into the wrong place. Casting tells the compiler "I know this looks wrong, but trust me I know what I'm doing". This should be done only when necessary. If you're gratuitously casting you lose this help from the compiler. If you make a mistake the compiler cannot help you.
Finally, it will befuddle anyone reading your code. They'll scratch their heads and wonder why you're always shoving longs into shorts and ints into chars. They'll never know which are ok and which are mistakes.
As to unsigned long vs unsigned char, compiling your two functions with clang -O3 -S and diffing the assembly reveals a slight difference:
- movq %rdi, %rax
+ movl %edi, %eax
The unsigned char implementation will use a 32-bit register while unsigned long will use a 64-bit register. Does this matter? Dunno, probably not; definitely not enough to justify defeating the type system.
I've heard that reads and writes of aligned ints are atomic and safe. I wonder: when does the system place non-malloc'd globals at unaligned addresses, other than in packed structures and through casting/pointer arithmetic on byte buffers?
[x86-64 Linux] In all of my normal cases, the system always chooses integer locations that don't get word-torn, for example two bytes on one word and the other two bytes on the next word. Can anyone post a program/snippet (C or assembly) that forces a global variable to an unaligned address such that the integer gets torn and the system has to use two reads to load one integer value?
When I run the program below, the addresses are close to each other, such that multiple variables sit within the same 64 bits, but word tearing is never seen (smartness in the system or the compiler?).
#include <stdio.h>
#include <stdint.h>

int a;
char b;
char c;
int d;
int e = 0;

int isaligned(void *p, int N)
{
    if (((uintptr_t)p % N) == 0)
        return 1;
    else
        return 0;
}

int main()
{
    printf("processor is %zu byte mode \n", sizeof(int *));
    printf("a=%p/b=%p/c=%p/d=%p/e=%p\n", (void *)&a, (void *)&b, (void *)&c, (void *)&d, (void *)&e);
    printf(" check for 64bit alignment of test result of 0x80 = %d \n", isaligned((void *)0x80, 64));
    printf(" check for 64bit alignment of a result = %d \n", isaligned(&a, 64));
    printf(" check for 64bit alignment of d result = %d \n", isaligned(&d, 64));
    return 0;
}
Output:
processor is 8 byte mode
a=0x601038/b=0x60103c/c=0x60103d/d=0x601034/e=0x601030
check for 64bit alignment of test result of 0x80 = 1
check for 64bit alignment of a result = 0
check for 64bit alignment of d result = 0
How does a read of a char happen in the above case? Does it read from the 8-byte aligned boundary (in my case 0x601038) and then narrow down to 0x60103c?
Memory access granularity is always the word size, isn't it?
Thanks.
1) Yes: there is no guarantee that unaligned accesses are atomic, because [at least sometimes, on certain types of processors] the data may be written as two separate writes, for example if you cross a memory page boundary. [I'm not talking about the 4 KB pages used for virtual memory; I'm talking about DDR2/3/4 pages, which are some fraction of the total memory size, typically 16 Kbits times whatever the width is of the actual memory chip, which varies depending on the memory stick itself.] Equally, on processors other than x86, you get a trap for reading unaligned memory, which will either cause the program to abort or cause the read to be emulated in software as multiple reads that "fix" the unaligned access.
2) You could always make an unaligned memory region by something like this:
/* allocate 2 extra bytes so that `number` long longs still fit
   after the pointer is offset by 2 */
char *ptr = malloc(sizeof(long long) * number + 2);
long long *unaligned = (long long *)&ptr[2];
long long temp;
for (size_t i = 0; i < number; i++)
    temp = unaligned[i];
By the way, your alignment check tests whether the address is aligned to 64 bytes, not 64 bits. You'll have to divide that by 8, i.e. pass 8, to check for 64-bit alignment.
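In other words, a 64-bit (8-byte) alignment check against the isaligned() from the question would be:

printf("64-bit aligned: %d\n", isaligned(&a, 8));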
3) A char is a single-byte read, and the address will be the actual address of the byte itself. The memory read actually performed is probably a full cache line, starting at the target address and then wrapping around, so for example:
0x60103d is the target address, so the processor will read a cache line of 32 bytes, starting with the 64-bit word we want, 0x601038 (as soon as that read completes, the processor goes on to the next instruction while the remaining reads fill out the cache line from 0x601020, 0x601028 and 0x601030). But if we turn the cache off [if you want your 3 GHz latest-model x86 processor to be slightly slower than a 66 MHz 486, disabling the cache is a good way to achieve that], the processor would just read the one byte at 0x60103d.
4) Not on x86 processors; they have byte addressing. But for normal memory, reads are done on a cache-line basis, as explained above.
Note also that "may not be atomic" is not at all the same as "will not be atomic": you'll probably have a hard time making it go wrong deliberately. You really need to get the timing of two different threads just right, straddle cache lines, straddle memory page boundaries, and so on. It will happen when you don't want it to, but trying to make it go wrong can be darn hard [trust me, I've been there, done that].
It probably doesn't, outside of those cases.
In assembly it's trivial. Something like:
.org 0x2
myglobal:
.word SOME_NUMBER
But on Intel, the processor can safely read unaligned memory. It might not be atomic, but that might not be apparent from the generated code.
Intel, right? The Intel ISA has single-byte read/write opcodes. Disassemble your program and see what it's using.
Not necessarily - you might have a mismatch between memory word size and processor word size.
1) This answer is platform-specific. In general, though, the compiler will align variables unless you force it to do otherwise.
2) The following will require two reads to load one variable when run on a 32-bit CPU:
uint64_t huge_variable;
The variable is larger than a register, so it will require multiple operations to access. You can also do something similar by using packed structures:
struct __attribute__((packed)) unaligned
{
    char buffer[2];
    int unaligned;
    char buffer2[2];
} sample_struct;
3) This answer is platform-specific. Some platforms may behave like you describe. Some platforms have instructions capable of fetching a half-register or quarter-register of data. I recommend examining the assembly emitted by your compiler for more details (make sure you turn off all compiler optimizations first).
4) The C language allows you to access memory with byte-sized granularity. How this is implemented under the hood and how much data your CPU fetches to read a single byte is platform-specific. For many CPUs, this is the same as the size of a general-purpose register.
The C standard guarantees that malloc(3) returns a memory area that satisfies the strictest alignment requirements, so this just can't happen in that case. If there is unaligned data, it is probably read/written in pieces (that depends on the exact guarantees the architecture provides).
On some architectures unaligned access is allowed; on others it is a fatal error. When allowed, it is normally much slower than aligned access; when not allowed, the compiler must fetch the pieces and splice them together, and that is slower still.
Characters (really bytes) are normally allowed to have any byte address; the instructions working with bytes just get/store the individual byte in that case.
No, memory access is according to the width of the data. But real memory traffic happens in terms of cache lines (read up on CPU caches for this).
Non-aligned objects can never come into existence without you invoking undefined behavior. In other words, there is no sequence of actions, all having well-defined behavior, which a program can take that will result in a non-aligned pointer coming into existence. In particular, there is no portable way to get the compiler to give you misaligned objects. The closest thing is the "packed structure" many compilers have, but that only applies to structure members, not independent objects.
Further, there is no way to test alignedness in portable C. You can use the implementation-defined conversions of pointers to integers and inspect the low bits, but there is no fundamental requirement that "aligned" pointers have zeros in the low bits, or that the low bits after conversion to integer even correspond to the "least significant" bits of the pointer, whatever that would mean. In other words, conversions between pointers and integers are not required to commute with arithmetic operations.
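That said, the usual non-portable check looks like this (a sketch relying on the implementation-defined pointer-to-integer conversion just described, so it is only meaningful on common flat-address platforms):

#include <stddef.h>
#include <stdint.h>

/* Treats a pointer as aligned if the converted value is a multiple of
   the alignment; implementation-defined, not guaranteed by the standard. */
static int probably_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}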
If you really want to make some misaligned pointers, the easiest way to do it, assuming alignof(int)>1, is something like:
char buf[2*sizeof(int)+1];
int *p1 = (int *)buf, *p2 = (int *)(buf+sizeof(int)+1);
It's impossible for both buf and buf+sizeof(int)+1 to be simultaneously aligned for int if alignof(int) is greater than 1. Thus at least one of the two (int *) casts gets applied to a misaligned pointer, invoking undefined behavior, and the typical result is a misaligned pointer.
I need to allocate an array of uint64_t[1e9] to count something, and I know the items are between 0 and 2^39.
So I want to calloc 5*1e9 bytes for the array.
Then I found that, if I want to keep the uint64_t meaningful, it is difficult to bypass the byte order.
There should be two ways.
One is to check the endianness first, so that we can memcpy the 5 bytes to either the first or the last part of the whole 8 bytes.
The other is to use 5 bit-shifts and then bit-OR them together.
I think the former should be faster.
So, under GCC, libc or the GNU system, is there any header file that indicates whether the current system is little-endian or big-endian? I know x86_64 is little-endian, but I don't want to write unportable code.
Of course, any other ideas are welcome.
Added:
I need to use the array to count many strings using d-left hashing. I plan to use 21 bits for the key and 18 bits for counting.
When you say "faster"... how often is this code executed? Five shifts plus an OR probably cost less than 100 ns, so even executed 10,000 times that only adds up to about a millisecond.
If implementing an endian-clean solution takes you longer than the total time it will ever save, you're wasting everyone's time.
That said, the solution to figure out the endianess is simple:
#include <stdbool.h>  /* for bool */

int a = 1;
char *ptr = (char *)&a;
bool littleEndian = *ptr == 1;
Now all you need is a big-endian machine and a couple of test cases to make sure your memcpy solution works. Note that in one of the two cases you need to call memcpy five times, once per byte, to reorder the bytes.
Or you could simply shift and OR five times...
EDIT: I guess I misunderstood your question a bit. You're saying that you want to use the lowest 5 bytes (= 40 bits) of the uint64_t as a counter, yes?
So the operation will be executed many, many times. Again, memcpy is utterly useless. Let's take the number 0x12345678 (32-bit). In memory, that looks like this:
0x12 0x34 0x56 0x78 big endian
0x78 0x56 0x34 0x12 little endian
As you can see, the bytes are swapped. So to convert between the two, you must use either bit-shifting or byte swapping; memcpy doesn't work.
But that doesn't actually matter, since the CPU will do the decoding for you. All you have to do is shift the bits into the right place:
key = item & 0x1FFFFF;
count = item >> 21;
to read, and
item = (count << 21) | key;
to write. Now you just need to build the key from the five bytes and you're done:
key = (((((uint64_t)hash[0] << 8 | hash[1]) << 8 | hash[2]) << 8 | hash[3]) << 8) | hash[4];
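Putting those pieces together, a sketch of the helpers (hypothetical names; layout as described above, 21-bit key in the low bits and the count above it):

#include <stdint.h>

#define KEY_BITS 21u
#define KEY_MASK ((1u << KEY_BITS) - 1)   /* 0x1FFFFF */

static uint64_t pack_item(uint64_t count, uint32_t key)
{
    return (count << KEY_BITS) | (key & KEY_MASK);
}

static uint32_t item_key(uint64_t item)   { return (uint32_t)(item & KEY_MASK); }
static uint64_t item_count(uint64_t item) { return item >> KEY_BITS; }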
EDIT 2
It seems you have an array of 40-bit ints and you want to read/write that array.
I have two solutions: using memcpy should work as long as the data isn't copied between CPUs of different endianness (read: when you save/load the data to/from disk). But the function call might be too slow for such a huge array.
The other solution is to use two arrays:
uint32_t lower[];  /* bits 0..31 */
uint8_t upper[];   /* bits 32..39 */
That is: save the top 8 bits in a second array. To read/write the values, one shift plus an OR is necessary, as sketched below.
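A sketch of that split-array access (assuming uint32_t for the low 32 bits; allocation is left out):

#include <stddef.h>
#include <stdint.h>

static uint32_t *lower;  /* bits 0..31, one entry per counter  */
static uint8_t  *upper;  /* bits 32..39, one entry per counter */
/* e.g. lower = calloc(n, sizeof *lower); upper = calloc(n, sizeof *upper); */

static uint64_t get40(size_t i)
{
    return ((uint64_t)upper[i] << 32) | lower[i];
}

static void set40(size_t i, uint64_t v)
{
    lower[i] = (uint32_t)v;
    upper[i] = (uint8_t)(v >> 32);
}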
If you treat numbers as numbers, and not as an array of bytes, your code will be endianness-agnostic. Hence, I would go for the shift-and-OR solution.
Having said that, I didn't really catch what you are trying to do. Do you really need one billion entries, each five bytes long? If the data you are sampling is sparse, you might get away with allocating far less memory.
Well, I just found that the kernel headers come with <asm/byteorder.h>.
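If you'd rather not rely on a kernel header, GCC also predefines byte-order macros, so a compile-time check can be sketched as:

#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
#define IS_LITTLE_ENDIAN 1
#else
#define IS_LITTLE_ENDIAN 0
#endif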
Inlining the memcpy as a byte-copy loop such as while (i < x + 3) { *i++ = *j++; } may still be slower, since memory operations are slower than register operations.
Another way to do the memcpy is via a union:
union dat {
uint64_t a;
char b[8];
} d;