How to efficiently compare two network endian u16 or u32 - c

The straightforward way to compare two network endian u16 or u32 would be to convert both of them into host endian and then compare.
But I'm working on a performance critical program and we have lots of those cases. So I'm wondering would it help if we just write a macro to compare them byte by byte from the MSB? In other words, by adding extra one (for u16) or extra three (for u32) comparisons, we can avoid two ntoh calls.
Would it help? Or would it depend on the hardware or the compiler? Is there any better way to do that?
Thanks
PS:
I understand the extra complexity needed while the performance enhancement may be small compared to the whole program. I'm just interested in how the hardware is working and how to push it to the extreme :P

I will assume that you only need this code to run on one processor, which most likely will be little endian.
You need 4 compare functions, which you can write as macros. Two that compare the whole word (short or long) when network order matches processor order, and two that compare byte by byte for the other case. It's faster to compare directly than convert and then compare.
If you need individual compares for EQ, LT, GT etc and for signed/unsigned you may need a lot more combinations to get peak performance. I assume you know how to write the code so I won't try.
Naturally having done this you should benchmark the whole thing to make sure it was actually worth it! Unit tests are pretty important too, so not a trivial project.

Related

C: Most efficient way to store variables where every bit matters

To start off: this might be a duplicate, but i can't seem to find a definitive answer on this question after having searched for it on google.
For a project i am designing a script that makes 2 ATMega328p chips communicate. At this moment i'm testing the best speed to do this, but my goal is to achieve really high baudrates. I have plenty of experience with making code efficient, but not with the memory management part. The problem:
I want to store a multiple of 8 bits (ex.: 48 bits). My first thought was to use an array of length 6 and type uint8_t, but I don't know how efficient arrays are compared to other types. Some people say pointers are more efficient and others say it doesn't matter, but I cant find a definitive answer on what the case is for really small amounts of memory. last quesion: I know the size of the sent bits will never be bigger than 64 bits, so would it matter if i just Always jused uint64_t?
Edit:
to clarify: My goal is to minimize the storage size, not the transmission size
Edit2:
What i meant by having a varying size: The size is determined on compile time, not while running the program.
The ATmega328p is a 8bit processor. All of its instructions are 8bit. Nothing will be faster than simply having an uint8_t array.
What you can do is, when you compile, look at your .lss file, it will show you the assmebly code, then you can look up the AVR instruction set and see the clock cycles each one will take. I think you will find using a uint64_t will just add unncessary overhead unless you are very careful with the way you are putting the bytes into it.
If the length of your packages might vary, the most efficient approach would be to compress the package before communication.
For example the first 3 bits of each package, could determine the size of that package.
The compressed packages are communicated faster, and use up less memory space.

Fastest use of a dataset of just over 64 bytes?

Structure: I have 8 64-bit integers (512 bits = 64 bytes, the assumed cache line width) that I would like to compare to another, single 64-bit integer, in turn, without cache misses. The data set is, unfortunately, absolutely inflexible -- it's already as small as possible.
Access pattern: Each uint64_t is in fact an array of 4x4x4 bits, each bit representing the presence or absence of a voxel. This means sometimes I will be using half of one chunk and half of another, or even corners of 8 different 64-bit chunks.... I guess what this means is there is a high likelihood of a lack of alignment.
How can I do this as fast as possible i.e. without thrashing the cache?
P.S. The idea is that this code will ultimately run on a fairly wide range of architectures of at least a 64B cache line width, so I'd prefer this were absolutely as fast as possible. This also means I can't rely on MOVNTDQA, which anyway may incur a performance hit of it's own inspite of loading the 9th element directly to the CPU.
P.P.S. My knowledge of this area is fairly limited so please take it easy on me. But please spare me the premature optimisation comments; be sure that this is the 3% of this application that really counts.
I wouldn't worry about it. If your dataset is really only 9 integers, most of it will likely be stored in registers anyway. Also, there isn't really any way to optimize cache usage without specifying an architecture, since cache structure is architecture dependent. If you can list several target architectures you may be able to find some commonalities that you can optimize toward, but without knowing those architectures, I don't think there's much we can do for you.
Lastly, this seems like a good example of optimizing too early. I would suggest you take the following steps:
Decide what your maximum acceptable run time is
Finish your program in C
Compile for all of your target architectures
For those platforms that don't meet your speed spec, hand-optimize the intermediate assembly files and recompile until you meet your spec.
Are you sure you get cache-misses?
Even if the comparing value is not in an register, i think your first uint64 array should be on one cache stage (or what ever it is called) and your other data in another.
Your cache surely has some n-way associativity, that prevents your data row from being removed from the cache just by accessing your compare value.
Do not lose your time on Micro Optimizations. Improve your algorithms and data structures.

Use int or char in arrays?

Suppose I have an array in C with 5 elements which hold integers in the range [0..255], would it be generally better to useunsigned char, unsigned int or int regarding performance? Because a char would be only one byte, but an int is easier to handle for the processor, as far as I know. Or does it mostly depend on how the elements are accessed?
EDIT: Measuring is quite difficult, because the code belongs to a library, and the array is accessed external.
Also, I encounter this problem not only in this very case, so I'm asking for a more general answer
While the answer really depends on the CPU and how it handles loading storing small integers, you can assume that the byte array will be faster on most modern systems:
A char only takes 1/4 of the space that an int takes (on most systems), which means that working on a char array takes only a quarter of the memory bandwidth. And most codes are memory bound on modern hardware.
This one is quite impossible to answer, because it depends on the code, compiler, and processor.
However, one suggestion is to use uint8_t insted of unsigned char. (The code will be the same, but the explicit version conveys the meaning much better.)
Then there is one more thing to consider. You might be best off by packing four 8-bit integers into one 32-bit integer. Most arithmetic and bitwise logical operations work fine as long as there are no overflows (division being the notable exception).
The golden rule: Measure it. If you can't be bothered measuring it, then it isn't worth optimising. So measure it one way, change it, measure it the other way, take whatever is faster. Be aware that when you switch to a different compiler, or a different processor, like a processor that is introduced in 2015 or 2016, the other code might now be faster.
Alternatively, don't measure it, but write the most readable and maintainable code.
And consider using a single 64 bit integer and shift operations :-)

C fastest way to compare two bitmaps

There are two arrays of bitmaps in the form of char arrays with millions of records. What could be fastest way to compare them using C.
I can imagine to use bitwise operator xor 1 byte at a time in a for loop.
Important point about bitmaps:
1% to 10% of times algorithm is run, bitmaps can differ. Most of the time they will be same. When hey can differ, they can as much as 100%. There is high probability of change of bits in continuous streak.
Both bitmaps are of same length.
Aim:
Check do they differ and if yes then where.
Be correct every time (probability of detecting error if there is one should be 1).
This answer assumes you mean 'bitmap' as a sequence of 0/1 values rather than 'bitmap image format'
If you simply have two bitmaps of the same length and wish to compare them quickly, memcmp() will be effective as someone suggested in the comments. You could if you want try using SSE type optimizations, but these are not as easy as memcmp(). memcmp() is assuming you simply want to know 'they are different' and nothing more.
If you want to know how many bits they are different by, e.g. 615 bits differ, then again you have little option except to XOR every byte and count the number of differences. As others have noted, you probably want to do this more at 32/64 or even 256 bits at a time, depending on your platform. However, if the arrays are millions of bytes long, then the biggest delay (with current CPUs) will be the time to transfer main memory to the CPU, and it wont matter terribly what the CPU does (lots of caveats here)
If you question is more asking about comparing A to B, but really you are doing this lots of times, such as A to B and C,D,E etc, then you can do a couple of things
A. Store a checksum of each array and first compare the checksums, if these are the same then there is a high chance the arrays are the same. Obviously there is a risk here that checksums can be equal but the data can differ, so make sure that a false result in this case will not have dramatic side effects. And, if you cannot withstand false results, do not use this technique.
B. if the arrays have structure, such as they are image data, then leverage specific tools for this, how is beyond this answer to explain.
C. If the image data can be compressed effectively, then compress each array and compare using the compressed form. If you use ZIP type of compression you cannot tell directly from zip how many bits differ, but other techniques such as RLE can be effective to quickly count bit differences (but are a lot of work to build and get correct and fast)
D. If the risk with (a) is acceptable, then you can checksum each chunk of say 262144 bits, and only count differences where checksums differ. This heavily reduces main memory access and will go lots faster.
All of the options A..D are about reducing main memory access as this is the nub of any performance gain (for problem as stated)

Is it faster to use an array or bit access for multiple boolean values?

1) On a 32-bit CPU is it faster to acccess an array of 32 boolean values or to access the 32 bits within one word? (Assume we want to check the value of the Nth element and can use either a bit-mask (Nth bit is set) or the integer N as an array index.)
It seems to me that the array would be faster because all common computer architectures natively work at the word level (32 bits, 64 bits, etc., processed in parallel) and accessing the sub-word bits takes extra work.
I know different compilers will represent things differently, but it seems that the underlying hardware architecture would dictate the answer. Or does the answer depend on the language and compiler?
And,
2) Is the speed answer reversed if this array represents a state that I pass between client and server?
This question came to mind when reading question "How use bit/bit-operator to control object state?"
P.S. Yes, I could write code to test this myself, but then the SO community wouldn't get to play along!
Bear in mind that a theoretically faster solution that doesn't fit into a cache line might be slower than a theoretically slower one that does, depending on a whole host of things. If this is actually something that needs to be fast, as determined by profiling, test both ways and see. If it doesn't, do whatever looks like cleaner code, which is probably the array.
It depends on the compiler and the access patterns and the platform. Raymond Chen has an excellent cost-benefit analysis: http://blogs.msdn.com/oldnewthing/archive/2008/11/26/9143050.aspx .
Even on non x86 platforms the use of bits can be prohibitive as at least one PPC platform out there uses microcoded instructions to perform a variable shift which can do nasty things with other hardware threads.
So it can be a win, but you need to understand the context in which it will be good and bad. (Which is a general thing anyway.)
For question #1: Yes, on most 32-bit platforms, an array of boolean values should be faster, because you will just be loading each 32-bit-aligned value in the array and testing it against 0. If you use a single word, you will have all that work plus the overhead of bit-fiddling.
For question #2: Again, yes, since sending data over a network is significantly slower than operating on data in the CPU and main memory, the overhead of sending even one word will strongly outweigh any performance gain or loss you get by aligning words or bit fiddling.
This is the code generated by 0 != (value & (1 << index)) to test a bit:
00401000 mov eax,1
00401005 shl eax,cl
00401007 and eax,1
And this by values[index] to test a bool[]:
00401000 movzx eax,byte ptr [ecx+eax]
Can't figure out how to put a loop around it that doesn't get optimized away, I'll vote bool[].
If you are going to check more than one value at a time, doing it in parallel will obviously be faster. If you're only checking one value, it's probably the same.
If you need a better answer than that, write some tests and get back to us.
I think a byte array is probably better than a full-word array for simple random access.
It will give better cache locality than using the full word size, and I don't think byte access is any slower on most/all common architectures.

Resources