Most efficient way to XOR a byte array in VB.NET

I am making an application in VB.NET that will read files using a byte array as a buffer, and create parity files for them by XORing the data... What would be the most efficient way of XORing a byte array? I have thought of converting the byte array to a BitArray, running it through an XOR operation, and turning it back into a byte array, but that sounds like a very processing-expensive task, and I am worried that it might impact read/write speed... is there a better way to do this? Thanks...
To avoid confusion:
The application reads half the file to location 1 and the other half to location 2, then writes a parity file (the XOR of the two parts) to location 3...

To XOR two byte arrays, simply use a For loop and the Xor operator.
VB.NET's Xor will compile to the CIL xor opcode, which should subsequently be JIT-compiled to the very fast x86 XOR processor instruction.
The cost of the Xor operation is likely to be negligible in comparison to the cost of file I/O.
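As a rough sketch of that loop (shown here in C for brevity; the VB.NET version is a direct For-loop translation, and the name xor_buffers is just an illustration):

#include <stddef.h>

/* Hypothetical helper: XOR two equal-length buffers into a parity buffer.
   The VB.NET equivalent is a For loop doing parity(i) = a(i) Xor b(i). */
static void xor_buffers(const unsigned char *a, const unsigned char *b,
                        unsigned char *parity, size_t len)
{
    for (size_t i = 0; i < len; i++)
        parity[i] = (unsigned char)(a[i] ^ b[i]);
}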

Depending on what you mean by "efficient", I think you can do better than a simple loop. My initial thought is to break the processing into multiple threads so it will complete faster. You know the size of the input array, and therefore of the output array, so it would be easy to divide the load and let each thread fill the appropriate part of the result.
There is a C# article on CodeProject that does something similar: Binary operations on byte arrays with parallelism.
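A minimal sketch of that partitioning idea, written in C with OpenMP purely for illustration (the linked article is C#, and the function name here is hypothetical):

#include <stddef.h>

/* Each iteration writes a distinct element of parity, so no locking is
   needed. Build with -fopenmp (GCC/Clang) or /openmp (MSVC); without
   OpenMP the pragma is ignored and the loop runs serially. */
void xor_buffers_parallel(const unsigned char *a, const unsigned char *b,
                          unsigned char *parity, size_t len)
{
    #pragma omp parallel for
    for (long long i = 0; i < (long long)len; i++)
        parity[i] = (unsigned char)(a[i] ^ b[i]);
}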

Related

Efficient File Saving in C, writing certain bits to a binary file

I'm assured that I get numbers from 0-7, and I'm interested in making the code as efficient as possible.
I want to write only the three least significant bits into the binary file, and not the whole byte.
Is there any way I can write only 3 bits? I get a huge number of numbers...
The other way I found is to try to mash up the numbers (00000001 shl 3 & next number),
though there's always an odd one out.
Files work at the byte level; there's no way to output individual bits¹. You have to read the original bytes containing the bits of interest, fix them up with the bits you need to modify (using bitwise operations), and write them back where they were.
¹ And it would not be efficient to do so anyway. Hard disks work best with large chunks of writes; flash disks actually require working with large blocks (so a single-bit change requires a full block erase and rewrite); these are some of the reasons why operating systems and disk controllers do a lot of write caching.
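A sketch of the bit-packing idea in plain C, using a hypothetical BitWriter helper: 3-bit values are accumulated MSB-first and flushed a whole byte at a time.

#include <stdio.h>

/* Hypothetical bit writer: packs 3-bit values (0-7) into bytes and writes
   full bytes to a file. The bit ordering convention is up to you, as long
   as the reader and writer agree. */
typedef struct {
    FILE *fp;
    unsigned int acc;   /* pending bits, right-aligned */
    int nbits;          /* number of pending bits (0..7 between calls) */
} BitWriter;

static void put3(BitWriter *w, unsigned int value)
{
    w->acc = (w->acc << 3) | (value & 0x7u);   /* append 3 bits */
    w->nbits += 3;
    while (w->nbits >= 8) {                    /* flush full bytes */
        w->nbits -= 8;
        fputc((int)((w->acc >> w->nbits) & 0xFFu), w->fp);
    }
    w->acc &= (1u << w->nbits) - 1u;           /* drop bits already written */
}

static void bw_flush(BitWriter *w)
{
    if (w->nbits > 0) {                        /* pad the final partial byte */
        fputc((int)((w->acc << (8 - w->nbits)) & 0xFFu), w->fp);
        w->nbits = 0;
        w->acc = 0;
    }
}

Initialize with BitWriter w = { fp, 0, 0 }, call put3() for each value, and call bw_flush() once at the end; the reader needs to know the padding convention to recover the exact count.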

How to efficiently compare two network endian u16 or u32

The straightforward way to compare two network endian u16 or u32 would be to convert both of them into host endian and then compare.
But I'm working on a performance-critical program and we have lots of those cases. So I'm wondering: would it help if we just wrote a macro to compare them byte by byte, starting from the MSB? In other words, by adding one extra comparison (for u16) or three extra comparisons (for u32), we can avoid two ntoh calls.
Would it help? Or would it depend on the hardware or the compiler? Is there any better way to do that?
Thanks
PS:
I understand this adds extra complexity while the performance enhancement may be small compared to the whole program. I'm just interested in how the hardware works and how to push it to the extreme :P
I will assume that you only need this code to run on one processor, which most likely will be little endian.
You need 4 compare functions, which you can write as macros. Two that compare the whole word (short or long) when network order matches processor order, and two that compare byte by byte for the other case. It's faster to compare directly than convert and then compare.
If you need individual compares for EQ, LT, GT etc and for signed/unsigned you may need a lot more combinations to get peak performance. I assume you know how to write the code so I won't try.
Naturally having done this you should benchmark the whole thing to make sure it was actually worth it! Unit tests are pretty important too, so not a trivial project.
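As an illustration of the byte-by-byte case the question asks about, here is a sketch in C that operates directly on the big-endian bytes; the names be16_eq and be16_lt are hypothetical, and the u32 versions extend the same pattern.

#include <stdint.h>

/* Compare two u16 values stored in network (big-endian) byte order without
   converting either one to host order. Equality needs no byte order at all;
   ordering compares the most significant byte first and falls back to the
   low byte only on a tie. */
static int be16_eq(const unsigned char *a, const unsigned char *b)
{
    return a[0] == b[0] && a[1] == b[1];
}

/* Unsigned less-than: the MSB is stored first in network order. */
static int be16_lt(const unsigned char *a, const unsigned char *b)
{
    return a[0] != b[0] ? a[0] < b[0] : a[1] < b[1];
}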

C fastest way to compare two bitmaps

There are two arrays of bitmaps in the form of char arrays with millions of records. What would be the fastest way to compare them in C?
I can imagine using the bitwise XOR operator one byte at a time in a for loop.
Important points about the bitmaps:
In 1% to 10% of the runs of the algorithm, the bitmaps can differ; most of the time they will be the same. When they do differ, they can differ by as much as 100%. There is a high probability that changed bits occur in continuous streaks.
Both bitmaps are of the same length.
Aim:
Check whether they differ and, if yes, where.
Be correct every time (the probability of detecting a difference, if there is one, should be 1).
This answer assumes you mean 'bitmap' as a sequence of 0/1 values rather than the 'bitmap image format'.
If you simply have two bitmaps of the same length and wish to compare them quickly, memcmp() will be effective, as someone suggested in the comments. You could also try SSE-type optimizations, but these are not as easy to use as memcmp(). memcmp() assumes you simply want to know 'they are different' and nothing more.
If you want to know how many bits they differ by, e.g. 615 bits differ, then again you have little option except to XOR every byte and count the number of differences. As others have noted, you probably want to do this at 32, 64, or even 256 bits at a time, depending on your platform. However, if the arrays are millions of bytes long, then the biggest delay (with current CPUs) will be the time to transfer main memory to the CPU, and it won't matter terribly what the CPU does (lots of caveats here).
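A sketch of the XOR-and-count approach at 64 bits per step; __builtin_popcountll is a GCC/Clang builtin, so other compilers need their own popcount intrinsic or a table-based count.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count differing bits by XORing 64 bits at a time and popcounting. */
static size_t count_bit_diffs(const unsigned char *a, const unsigned char *b,
                              size_t len)
{
    size_t diffs = 0, i = 0;

    for (; i + 8 <= len; i += 8) {          /* whole 64-bit words */
        uint64_t wa, wb;
        memcpy(&wa, a + i, 8);              /* memcpy avoids alignment issues */
        memcpy(&wb, b + i, 8);
        diffs += (size_t)__builtin_popcountll(wa ^ wb);
    }
    for (; i < len; i++)                    /* leftover tail bytes */
        diffs += (size_t)__builtin_popcountll((uint64_t)(a[i] ^ b[i]));

    return diffs;
}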
If your question is really about comparing A to B many times, such as A to B, then A to C, D, E, etc., then you can do a couple of things:
A. Store a checksum of each array and compare the checksums first; if these are the same, there is a high chance the arrays are the same. Obviously there is a risk that the checksums can be equal while the data differs, so make sure that a false match will not have dramatic side effects. If you cannot tolerate false results, do not use this technique.
B. If the arrays have structure, such as image data, then leverage tools specific to that format; how to do so is beyond the scope of this answer.
C. If the image data can be compressed effectively, then compress each array and compare the compressed forms. With ZIP-style compression you cannot tell directly how many bits differ, but other techniques such as RLE can be effective for quickly counting bit differences (though they are a lot of work to build and to get correct and fast).
D. If the risk in (A) is acceptable, then you can checksum each chunk of, say, 262144 bits, and only count differences where the checksums differ (see the sketch after this list). This heavily reduces main memory access and will go a lot faster.
All of the options A-D are about reducing main memory access, as this is the nub of any performance gain (for the problem as stated).
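A sketch of option D, assuming a simple additive checksum is acceptable and that the per-chunk checksums of one array have already been precomputed and stored (a real implementation might use CRC32, which is often hardware-accelerated; all names here are hypothetical).

#include <stddef.h>
#include <stdint.h>

#define CHUNK_BYTES 32768u   /* 262144 bits per chunk, as in option D */

/* Cheap per-chunk checksum. */
static uint64_t chunk_sum(const unsigned char *p, size_t len)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = sum * 131 + p[i];
    return sum;
}

/* sums_a holds the precomputed checksum of each chunk of a. Only chunks
   whose checksums mismatch get the detailed XOR/popcount pass, so a is
   only touched where a difference is possible. */
static size_t count_diffs_chunked(const unsigned char *a,
                                  const unsigned char *b, size_t len,
                                  const uint64_t *sums_a)
{
    size_t diffs = 0, chunk = 0;
    for (size_t off = 0; off < len; off += CHUNK_BYTES, chunk++) {
        size_t n = (len - off < CHUNK_BYTES) ? len - off : CHUNK_BYTES;
        if (chunk_sum(b + off, n) != sums_a[chunk])
            diffs += count_bit_diffs(a + off, b + off, n);  /* earlier sketch */
    }
    return diffs;
}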

x86-64 integer vectorisation optimise

I am trying to vectorize a logical validation problem to run on Intel 64.
I will first try to describe the problem:
I have a static array v[] of 70-bit integers (approx. 400,000 of them), all of which are known at compile time.
A producer creates 70-bit integers a, a lot of them, very quickly.
For each a I need to find out if there exists an element from v for which v[i] & a == 0.
So far my implementation in C is something like this (simplified):
for (; *v; v++) {
    if (!(a & *v))
        return FOUND;
}
// a had no matching element in v
return NOT_FOUND;
I am looking into optimizing this using SSE/AVX to speed up the process and do more of those tests in parallel. I got as far as loading a and *v into an XMM register each and calling the PTEST instruction to do the validation.
I am wondering if there is a way to expand this to use all 256 bits of the new YMM registers?
Maybe packing 3x70 bits into a single register?
I can't quite figure out, though, how to pack/unpack them efficiently enough to justify not just using one register per test.
A couple of things that we know about the nature of the input:
All elements in v[] have very few bits set.
It is not possible to permute/compress v[] in any way to make it use less than 70 bits.
The FOUND condition is expected to be satisfied after checking approx. 20% of v[] on average.
It is possible to buffer more than one a before checking them in a batch.
I do not necessarily need to know which element of v[] matched, only whether one did or not.
Producing a requires very little memory, so anything left in L1 from the previous call is likely to still be there.
The resulting code is intended to be run on the newest generation of Intel Xeon processors supporting SSE4.2 and AVX instructions.
I will be happy to accept assembly or C that compiles with the Intel C compiler or at least GCC.
It sounds like what you really need is a better data structure to store v[], so that searches take less than linear time.
Consider that if (v[0] & v[1]) & a is not zero, then neither (v[0] & a) nor (v[1] & a) can be zero. This means it is possible to create a tree structure where the v[] are the leaves, and the parent nodes are the AND combination of their children. Then, if parentNode & a gives you a non-zero value, you can skip looking at the children.
However, this isn't necessarily helpful - the parent node only ends up testing the bits common to its children, so if there are only a few of those, you still end up testing lots of leaf nodes. But if you can find clusters in your data set and group many similar v[] under a common parent, this may drastically reduce the number of comparisons you have to do.
On the other hand, such a tree search involves a lot of conditional branches (expensive), and would be hard to vectorize. I'd first try if you can get away with just two levels: first do a vectorized search among the cluster parent nodes, then for each match do a search for the entries in that cluster.
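A scalar sketch of that two-level idea, representing a 70-bit value as a pair of 64-bit words (names hypothetical; vectorizing the outer cluster scan is left out):

#include <stdint.h>
#include <stddef.h>

/* Each cluster stores the AND of its members as a "parent" mask. If
   parent & a is nonzero, no member of that cluster can satisfy
   (v[i] & a) == 0, so the whole cluster is skipped. */
typedef struct { uint64_t lo, hi; } elem70;   /* hi uses only 6 bits */

typedef struct {
    elem70 parent;          /* AND of all members */
    const elem70 *members;
    size_t count;
} cluster;

static int disjoint(const elem70 *x, const elem70 *y)
{
    return ((x->lo & y->lo) | (x->hi & y->hi)) == 0;
}

static int find_match(const cluster *clusters, size_t nclusters, elem70 a)
{
    for (size_t c = 0; c < nclusters; c++) {
        if (!disjoint(&clusters[c].parent, &a))
            continue;                        /* parent & a != 0: prune */
        for (size_t i = 0; i < clusters[c].count; i++)
            if (disjoint(&clusters[c].members[i], &a))
                return 1;                    /* FOUND */
    }
    return 0;                                /* NOT_FOUND */
}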
Actually here's another idea, to help with the fact that 70 bits don't fit well into registers:
You could split v[] into 64 (=2^6) different arrays. Of the 70 bits in the original v[], the 6 most significant bits are used to determine which array will contain the value, and only the remaining 64 bits are actually stored in the array.
By testing the mask a against the array indices, you will know which of the 64 arrays to search (in the worst case, if a doesn't have any of the 6 highest bits set, that'll be all of them), and each individual array search deals only with 64 bits per element (much easier to pack).
In fact this second approach could be generalized into a tree structure as well, which would give you some sort of trie.
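And a scalar sketch of the 64-bucket split, again with hypothetical names; since each stored element is now a plain 64-bit word, the inner loop becomes much easier to pack and vectorize.

#include <stdint.h>
#include <stddef.h>

/* The top 6 bits of each 70-bit v select one of 64 arrays, and only the low
   64 bits are stored. A candidate a (split into a_top/a_lo) only needs to be
   tested against buckets whose index has no bit in common with a_top. */
typedef struct {
    const uint64_t *lo;   /* low 64 bits of each element in this bucket */
    size_t count;
} bucket;

static int find_match_split(const bucket buckets[64],
                            unsigned a_top,   /* top 6 bits of a */
                            uint64_t a_lo)    /* low 64 bits of a */
{
    for (unsigned t = 0; t < 64; t++) {
        if (t & a_top)
            continue;                 /* a high bit collides: skip bucket */
        for (size_t i = 0; i < buckets[t].count; i++)
            if ((buckets[t].lo[i] & a_lo) == 0)
                return 1;             /* FOUND */
    }
    return 0;                         /* NOT_FOUND */
}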

Is it faster to use an array or bit access for multiple boolean values?

1) On a 32-bit CPU, is it faster to access an array of 32 boolean values or to access the 32 bits within one word? (Assume we want to check the value of the Nth element and can use either a bit-mask (Nth bit is set) or the integer N as an array index.)
It seems to me that the array would be faster because all common computer architectures natively work at the word level (32 bits, 64 bits, etc., processed in parallel) and accessing the sub-word bits takes extra work.
I know different compilers will represent things differently, but it seems that the underlying hardware architecture would dictate the answer. Or does the answer depend on the language and compiler?
And,
2) Is the speed answer reversed if this array represents a state that I pass between client and server?
This question came to mind when reading question "How use bit/bit-operator to control object state?"
P.S. Yes, I could write code to test this myself, but then the SO community wouldn't get to play along!
Bear in mind that a theoretically faster solution that doesn't fit into a cache line might be slower than a theoretically slower one that does, depending on a whole host of things. If this is actually something that needs to be fast, as determined by profiling, test both ways and see. If it doesn't, do whatever looks like cleaner code, which is probably the array.
It depends on the compiler and the access patterns and the platform. Raymond Chen has an excellent cost-benefit analysis: http://blogs.msdn.com/oldnewthing/archive/2008/11/26/9143050.aspx .
Even on non-x86 platforms the use of bits can be prohibitive, as at least one PPC platform out there uses microcoded instructions to perform a variable shift, which can do nasty things with other hardware threads.
So it can be a win, but you need to understand the context in which it will be good and bad. (Which is a general thing anyway.)
For question #1: Yes, on most 32-bit platforms, an array of boolean values should be faster, because you will just be loading each 32-bit-aligned value in the array and testing it against 0. If you use a single word, you will have all that work plus the overhead of bit-fiddling.
For question #2: Again, yes, since sending data over a network is significantly slower than operating on data in the CPU and main memory, the overhead of sending even one word will strongly outweigh any performance gain or loss you get by aligning words or bit fiddling.
This is the code generated by 0 != (value & (1 << index)) to test a bit:
00401000 mov eax,1
00401005 shl eax,cl
00401007 and eax,1
And this by values[index] to test a bool[]:
00401000 movzx eax,byte ptr [ecx+eax]
Can't figure out how to put a loop around it that doesn't get optimized away, I'll vote bool[].
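For what it's worth, one way to keep such a loop from being optimized away is to make the results observable, e.g. by accumulating and returning them; a rough sketch in C of both access patterns (names hypothetical, timing scaffolding omitted):

#include <stdbool.h>
#include <stddef.h>

/* The results are summed and returned, so the loads and tests have an
   observable effect and cannot be discarded by the optimizer. */
static unsigned sum_bit_tests(unsigned value, const unsigned char *indices,
                              size_t n)
{
    unsigned hits = 0;
    for (size_t i = 0; i < n; i++)
        hits += (value & (1u << indices[i])) != 0;   /* bit-mask test */
    return hits;
}

static unsigned sum_bool_tests(const bool *values, const unsigned char *indices,
                               size_t n)
{
    unsigned hits = 0;
    for (size_t i = 0; i < n; i++)
        hits += values[indices[i]];                  /* bool[] test */
    return hits;
}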
If you are going to check more than one value at a time, doing it in parallel will obviously be faster. If you're only checking one value, it's probably the same.
If you need a better answer than that, write some tests and get back to us.
I think a byte array is probably better than a full-word array for simple random access.
It will give better cache locality than using the full word size, and I don't think byte access is any slower on most/all common architectures.