space optimize a large array with many duplicates - arrays

I have an array where the index doubles as an 'identifier for a collection of items' and the content of the array is a group number. The group numbers fall into a finite range 0..N, where N << length_of_the_array. Hence every entry will be duplicated a large number of times. Currently I have to use 2 bytes to represent each group number (which can be > 1000 but < 6500), which, due to all the duplication, ends up consuming a lot of memory.
Are there ways to space-optimize this array, as the complete array can grow to multiple MBs in size? I'd appreciate any pointers toward relevant optimization algorithms/techniques. FYI: the programming language I'm using is C++.

Do you still want efficient random-access to arbitrary elements? Or are you thinking about space-efficient serialization of the index->group map?
If you still want efficient random access, a single array lookup is not bad. It's at worst a single cache miss (well, really at worst a page fault, or more likely a TLB miss, but that's unlikely if it's only a couple of MB).
A sorted and run-length encoded list could be binary-searched (by searching an array of prefix-sums of the repeat-counts), but that only works if you can occasionally sort the list to keep duplicates together.
If the duplicates can't be at least somewhat grouped together, there's not much you can do that allows random access.
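For the case where runs can be kept together, here is a minimal sketch of the prefix-sum lookup described above (the class and member names are just illustrative):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Runs are stored as (group, run_end), where run_end is the running prefix
    // sum of the run lengths, so a random read is one binary search over run_end.
    struct RleGroups {
        std::vector<std::uint64_t> run_end;   // prefix sums of run lengths
        std::vector<std::uint16_t> group;     // group number of each run

        void append_run(std::uint16_t g, std::uint64_t len) {
            std::uint64_t prev = run_end.empty() ? 0 : run_end.back();
            run_end.push_back(prev + len);
            group.push_back(g);
        }

        // index must be < run_end.back(); cost is O(log #runs)
        std::uint16_t at(std::uint64_t index) const {
            auto it = std::upper_bound(run_end.begin(), run_end.end(), index);
            return group[it - run_end.begin()];
        }
    };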
Packed 13-bit entries (values up to ~6500 need 13 bits) are probably not worth the trouble, unless that were enough to significantly reduce cache misses. A couple of multiply instructions to generate the right address, and a shift and mask on the load containing the desired value, is not much overhead compared to a cache miss. Write access to packed bitfields is slower and isn't atomic, so that's a serious downside. Getting a compiler to pack bitfields using structs can be compiler-specific. Maybe just doing the packing manually into a char array would be best.
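If you do go the packed route, here is a minimal sketch of packing 13-bit values manually into a byte array (the class name and layout are illustrative, and it assumes a little-endian target):

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // A read is one possibly-unaligned 32-bit load plus a shift and a mask,
    // which is the small fixed cost weighed against cache misses above.
    class Packed13 {
    public:
        explicit Packed13(std::size_t n)
            : n_(n), bytes_((n * kBits + 7) / 8 + sizeof(std::uint32_t), 0) {}

        std::size_t size() const { return n_; }

        std::uint16_t get(std::size_t i) const {
            std::size_t bit = i * kBits;
            std::uint32_t w;
            std::memcpy(&w, &bytes_[bit / 8], sizeof w);   // unaligned-safe load
            return std::uint16_t((w >> (bit % 8)) & kMask);
        }

        void set(std::size_t i, std::uint16_t v) {         // note: not atomic
            std::size_t bit = i * kBits;
            std::uint32_t w;
            std::memcpy(&w, &bytes_[bit / 8], sizeof w);
            w = (w & ~(kMask << (bit % 8))) | (std::uint32_t(v & kMask) << (bit % 8));
            std::memcpy(&bytes_[bit / 8], &w, sizeof w);
        }

    private:
        static const unsigned kBits = 13;                  // group numbers < 8192
        static const std::uint32_t kMask = (1u << kBits) - 1;
        std::size_t n_;
        std::vector<std::uint8_t> bytes_;
    };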

Related

Optimizing Cython: 5-byte string or size_t?

I'm writing a Cython module that deals with large numbers of 5-byte keys retrieved from a key-value store. I settled on 5 bytes because it gives me enough room for key combinations (2^40, roughly a trillion) for very little space.
Result sets are retrieved and combined in contiguous memory blocks that I can scan by 5-byte strides. It's pretty fast so far, but the code could use some cleanup.
I am wondering if I can simplify my code by using size_t or long ints instead of 5-byte units, and deal with arrays of numbers instead of byte blocks. Aside from the 5- to 8-byte storage increase, I wonder what performance penalty I might incur, or whether I might even make my program more efficient on a 64-bit machine if some processor instructions favor 8-byte units.
I am quite new to this kind of low-level programming, so I thought I could use some advice before I start writing a proof of concept that covers all the operations I need (lookup, concatenation, insertion, etc.).
Tips are appreciated!
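For concreteness, here is a minimal sketch (in C++) of the two layouts being weighed here; the function names are illustrative:

    #include <cstdint>
    #include <cstring>

    // Packed: 5-byte keys stored back to back. Widened: one key per aligned
    // 8-byte slot, trading 3 extra bytes per key for simpler loads and compares.
    inline std::uint64_t load_key5(const unsigned char* base, std::size_t i) {
        std::uint64_t k = 0;
        std::memcpy(&k, base + i * 5, 5);   // unaligned 5-byte read; top 3 bytes stay 0
        return k;                           // note: numeric value is endian-dependent
    }

    inline std::uint64_t load_key8(const std::uint64_t* base, std::size_t i) {
        return base[i];                     // single aligned 8-byte load
    }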

Index vs. Pointer

I'm using arrays of elements, many of which reference each other, and I assumed that in this case it's more efficient to use pointers.
But in some cases I need to know the index of an element I have the pointer to. For example, I have p = &a[i] and I need to know the value of i. As I understand it, i can be computed as p - a. But this operation inherently involves a division, which is expensive, whereas computing an address from an array index involves a multiplication and is faster.
So my question is, is cross referencing with pointers in a case where you need the indexes as well even worth it?
But this operation inherently involves division, which is expensive, whereas computing an address from an array index involves a multiplication and is faster.
This operation requires a division only when the size of the element is not a power of two, i.e. when the element is not a pointer or one of the standard types, which have power-of-two sizes on most systems. Dividing by a power of two is done with a bit shift, which is extremely cheap.
computing an address from an array index involves a multiplication and is faster.
Same logic applies here, except the compiler shifts left instead of shifting right.
is cross referencing with pointers in a case where you need the indexes as well even worth it?
Counting CPU cycles without profiling is a case of premature optimization - a bad thing to consider when you are starting your design.
A more important consideration is that indexes are more robust, because they often survive array reallocation.
Consider an example: let's say you have an array that grows dynamically as you add elements to its back, an index into that array, and a pointer into that array. You add an element to the array, exhausting its capacity, so now it must grow. You call realloc, and get a new array (or an old array if there was enough extra memory after the "official" end). The pointer that you held is now invalid; the index, however, is still valid.
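A minimal sketch of that scenario (the sizes are illustrative):

    #include <cstdlib>

    // After a growing realloc, the saved pointer may dangle, while the saved
    // index still names the same logical element.
    int main() {
        std::size_t cap = 4;
        int* a = static_cast<int*>(std::malloc(cap * sizeof *a));
        for (std::size_t i = 0; i < cap; ++i) a[i] = int(i);

        std::size_t idx = 2;                       // survives reallocation
        int* ptr = &a[2];                          // may dangle after realloc

        int* grown = static_cast<int*>(std::realloc(a, 2 * cap * sizeof *a));
        if (!grown) { std::free(a); return 1; }
        a = grown;                                 // the block may have moved

        int ok = a[idx];                           // fine: the index is still valid
        // int bad = *ptr;                         // undefined behavior if it moved
        (void)ok; (void)ptr;
        std::free(a);
        return 0;
    }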
Indexing an array is dirt cheap, to the point where I've never found any kind of performance boost from using pointers directly instead. That includes some very performance-critical areas, like looping through every pixel of an image containing millions of them -- still no measurable performance difference between indices and pointers (though it does make a difference if you can access the image with one sequential loop instead of two).
I've actually found many cases of the opposite, where turning pointers into 32-bit indices boosted performance, once 64-bit hardware became available and there was a need to store a boatload of them.
One of the reasons is obvious: you can take half the space now with 32-bit indices (assuming you don't need more than ~4.3 billion elements). If you're storing a boatload of them and taking half the memory as in the case of a graph data structure like indexed meshes, then typically you end up with fewer cache misses when your links/adjacency data can be stored in half the memory space.
But on a deeper level, using indices allows a lot more options. You can use purely contiguous structures that realloc to new sizes without worrying about invalidation as dasblinkenlight points out. The indices will also tend to be more dense (as opposed to sparsely fragmented across the entire 64-bit addressing space), even if you leave holes in the array, allowing for effective compression (delta, frame of reference, etc) if you want to squash down memory usage. You can then also use parallel arrays to associate data to something in parallel without using something much more expensive like a hash table. That includes parallel bitsets which allow you to do things like set intersections in linear time. It also allows for SoA reps (also parallel arrays) which tend to be optimal for sequential access patterns using SIMD.
You get a lot more room to optimize with indices, and I'd consider it mostly just a waste of memory to keep pointers around on top of indices. The downside of indices for me is mostly convenience: we have to have access to the array we're indexing into on top of the index itself, while a pointer lets you access the element without access to its container. It's often more difficult and error-prone to write code and data structures revolving around indices, and also harder to debug, since we can't see the value of an element through an index. That said, if you accept the extra burden, you often get more room to optimize with indices, not less.
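A minimal sketch of the 32-bit-index, parallel-array idea (the mesh-flavoured field names are illustrative, not any particular library):

    #include <cstdint>
    #include <vector>

    // Adjacency links are 4-byte indices into the same arrays, so they survive
    // reallocation; per-vertex attributes live in parallel (SoA) arrays instead
    // of behind pointers.
    struct Mesh {
        // SoA vertex attributes: element i of each array describes vertex i.
        std::vector<float>         x, y, z;
        std::vector<std::uint32_t> first_edge;   // head of vertex i's edge list, or UINT32_MAX

        // Parallel edge arrays: element e of each array describes edge e.
        std::vector<std::uint32_t> edge_to;      // destination vertex index
        std::vector<std::uint32_t> edge_next;    // next edge around the source vertex

        std::uint32_t add_vertex(float vx, float vy, float vz) {
            x.push_back(vx); y.push_back(vy); z.push_back(vz);
            first_edge.push_back(UINT32_MAX);
            return std::uint32_t(x.size() - 1);  // an index, not a pointer
        }

        void add_edge(std::uint32_t from, std::uint32_t to) {
            edge_to.push_back(to);
            edge_next.push_back(first_edge[from]);           // push onto from's list
            first_edge[from] = std::uint32_t(edge_to.size() - 1);
        }
    };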

Efficient data structure for very long array of characters with insertions

What data structure and/or algorithm would be good for implementing an array of characters with insertion? The typical workload would be a loop of a few "random" reads followed by a "random" insert. The array would be huge, on the order of gigabytes.
Edit: In an extreme case the algorithm needs to be able to build up a gigabyte string with single-byte insertions efficiently.
Since this data structure should allow various extreme cases (like inserting either long strings or lots of single characters, representing a very long array, possibly memory constrained), it is unlikely that a single well-known data structure will suit all the needs.
You could consider these two data structures:
Rope. It works well if you are inserting relatively long strings, and not too many of them.
Tiered vector. It has no problems with lots of single-character (or short-string) insertions, needs very little additional memory, and allows very fast random reads. But it has pretty bad time complexity for insertions: O(sqrt(n)) (which could be improved if we allow some more additional memory). And it is inferior to a rope when inserting long strings.
Or you could use a custom-made data structure based either on rope or on tiered vector.
If you expect a lot of small insertions and want to avoid too many rope splits, you could use a tree of arrays: insert into the middle of an array while it is short, but once the array grows past some limit, the next insert should split it, just as in a rope. Arrays referenced by tree nodes should be rather large (about 1000 bytes or more) to strike a better balance between very expensive cache misses while accessing tree nodes (so we should minimize the number of nodes that do not fit in cache) and slightly less expensive memmoves.
The most appropriate memory allocation scheme for these arrays is as follows: when an array no longer fits in its allocated space, split it into two equal parts, allocate some fixed number of bytes (say 2000) for each half, then copy each half into the middle of its allocated space. While inserting characters near the end of such an array, move the tail characters to the right; when inserting characters near the beginning, move the preceding characters to the left. So the average length of a memmove is only 1/4 of the average array length. Two neighboring allocation spaces could share the unused bytes between them, so we only need to split when the bytes of one chunk are about to overwrite used bytes of the other.

This approach is simple but needs some additional space to allow the arrays to grow. We could use a general-purpose allocator to get only the space actually used by the arrays (or allow very limited growth space), but that is much slower and most likely leads to memory fragmentation and an even larger amount of unused memory. A possibly better way to save memory is to use several fixed allocation sizes (like 1500, 1700, 2000) and reserve some fixed number (determined experimentally) of chunks of each size. Another way to save memory is, instead of splitting one 2000-byte array into two 1000-byte arrays, to merge two adjacent arrays (say 2000+1600) and then split the result into three arrays (1200+1200+1200).
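A minimal sketch of one leaf array under that allocation scheme (the 2000-byte chunk size and the names are illustrative; the rope-style split itself is left to the caller):

    #include <cstddef>
    #include <cstring>

    // The used bytes sit centred in a fixed chunk, so an insert shifts whichever
    // side is shorter and has room, i.e. roughly a quarter of the leaf on average.
    struct Leaf {
        static const std::size_t kChunk = 2000;
        unsigned char buf[kChunk];
        std::size_t lo, hi;                       // used bytes are buf[lo, hi)

        Leaf() : lo(kChunk / 2), hi(kChunk / 2) {}

        // True when neither side has room for n more bytes: the caller should
        // split this leaf before inserting.
        bool full(std::size_t n) const { return lo < n && hi + n > kChunk; }

        // Insert s[0..n) at logical position pos (0 <= pos <= hi - lo).
        void insert(std::size_t pos, const unsigned char* s, std::size_t n) {
            std::size_t left  = pos;              // bytes before the insertion point
            std::size_t right = (hi - lo) - pos;  // bytes after it
            bool room_left  = lo >= n;
            bool room_right = hi + n <= kChunk;
            if (room_left && (left <= right || !room_right)) {
                std::memmove(buf + lo - n, buf + lo, left);              // shift head left
                lo -= n;
                std::memcpy(buf + lo + left, s, n);
            } else {                                                     // shift tail right
                std::memmove(buf + lo + pos + n, buf + lo + pos, right);
                std::memcpy(buf + lo + pos, s, n);
                hi += n;
            }
        }
    };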
You've mentioned "packing bits together" to reduce RAM usage. That's not impossible with such a data structure (if your data is compressible). Actually two compression algorithms may be used here without sacrificing too much performance: Huffman coding or LZ4. For Huffman coding you need a static frequency table (precomputed in advance). For reading you'll need to decode only about 1/4 of average array size go get to proper position plus length of the string to read. For insertion you'll need to decode the same 1/4 of average array size, then move bitstream of the same size, then encode the inserted string. With LZ4 there is no need to deal with bitstreams, only whole bytes are used; probably it is worth to increase size of the arrays to get better compression.
Tiered vector may be optimized to be more cache friendly. Add about 100-200 bytes of reserved space to each block. With each insertion memmove bytes to this space. And only after there is no space for next memmove, start exchanging data between blocks (not by single byte as in original data structure, but 100-200 bytes at once).
To improve O(sqrt(n)) insertion time, consider tiered vector as special case of a trie, with only 2 levels. We could add one more level. Then (after insertion) we could stop inter-block data exchange when we come to the end of second-level block, allocate additional first-level block, and place extra bytes there. When one or several additional blocks are filled up, we could continue data exchange on third-level block (root of the trie). In theory this may be extended to log(n) levels to guarantee O(log(n)) single-character inserts and reads. But in practice probably 3 or 4 levels is better choice, so that we have O(n1/3) or O(n1/4) amortized insertion complexity.
I would suggest using a rope data structure. The SGI C++ STL provides an implementation which is typically made available as an extension: <ext/rope>. If you need to reimplement this data structure, it might be a good idea to consult their implementation notes.
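A minimal usage sketch, assuming the libstdc++ extension is available (it is non-standard and may be missing or deprecated on some toolchains):

    #include <ext/rope>    // libstdc++ extension; availability varies by toolchain
    #include <iostream>

    int main() {
        __gnu_cxx::crope r("AAAABBBB");     // crope is rope<char>
        r.insert(4, "----");                // insert in the middle, roughly O(log n)
        std::cout << r.c_str() << '\n';     // AAAA----BBBB
        std::cout << r[5] << '\n';          // random read of a single character
        return 0;
    }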

C fastest way to compare two bitmaps

There are two arrays of bitmaps in the form of char arrays with millions of records. What would be the fastest way to compare them in C?
I can imagine using the bitwise XOR operator one byte at a time in a for loop.
Important points about the bitmaps:
In 1% to 10% of the runs of the algorithm the bitmaps can differ; most of the time they will be the same. When they do differ, they can differ by as much as 100%. There is a high probability that changed bits occur in continuous streaks.
Both bitmaps are the same length.
Aim:
Check whether they differ and, if so, where.
Be correct every time (the probability of detecting a difference when there is one should be 1).
This answer assumes you mean 'bitmap' as a sequence of 0/1 values rather than the 'bitmap image format'.
If you simply have two bitmaps of the same length and wish to compare them quickly, memcmp() will be effective, as someone suggested in the comments. You could, if you want, try SSE-type optimizations, but these are not as easy as memcmp(). memcmp() assumes you simply want to know "they are different" and nothing more.
If you want to know how many bits they differ by, e.g. 615 bits differ, then again you have little option except to XOR every byte and count the number of differences. As others have noted, you probably want to do this 32, 64 or even 256 bits at a time, depending on your platform. However, if the arrays are millions of bytes long, then the biggest delay (with current CPUs) will be the time to transfer main memory to the CPU, and it won't matter terribly what the CPU does (lots of caveats here).
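A minimal sketch of that word-at-a-time XOR-and-popcount pass (it assumes the GCC/Clang __builtin_popcountll builtin; substitute std::popcount from C++20 or a lookup table if that is unavailable):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Compares the two buffers 64 bits at a time and counts differing bits.
    std::size_t count_diff_bits(const unsigned char* a, const unsigned char* b,
                                std::size_t len) {
        std::size_t bits = 0, i = 0;
        for (; i + 8 <= len; i += 8) {
            std::uint64_t wa, wb;
            std::memcpy(&wa, a + i, 8);
            std::memcpy(&wb, b + i, 8);
            if (std::uint64_t x = wa ^ wb)
                bits += __builtin_popcountll(x);
        }
        for (; i < len; ++i)                            // remaining tail bytes
            bits += __builtin_popcount(unsigned(a[i] ^ b[i]));
        return bits;
    }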
If your question looks like comparing A to B, but really you are doing this lots of times, such as A against B, C, D, E, etc., then you can do a couple of things:
A. Store a checksum of each array and first compare the checksums, if these are the same then there is a high chance the arrays are the same. Obviously there is a risk here that checksums can be equal but the data can differ, so make sure that a false result in this case will not have dramatic side effects. And, if you cannot withstand false results, do not use this technique.
B. If the arrays have structure, such as image data, then leverage tools specific to that; how to do so is beyond the scope of this answer.
C. If the image data can be compressed effectively, then compress each array and compare the compressed forms. With ZIP-style compression you cannot tell directly from the compressed data how many bits differ, but other techniques such as RLE can be effective for quickly counting bit differences (though they are a lot of work to build and get correct and fast).
D. If the risk in (A) is acceptable, then you can checksum each chunk of, say, 262144 bits, and only count differences where the checksums differ. This heavily reduces main memory access and will run much faster.
All of the options A..D are about reducing main memory access, as that is the nub of any performance gain (for the problem as stated).
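A minimal sketch of option D (the chunk size and the FNV-1a hash are illustrative choices; the false-match caveat from option A applies per chunk, so it is unsuitable if you need a detection probability of exactly 1):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Per-chunk checksums are computed once per bitmap and cached; a comparison
    // then only needs the exact XOR/popcount pass on chunks whose cached
    // checksums disagree, which is where main-memory traffic is saved.
    struct ChunkedBitmap {
        static const std::size_t kChunk = 262144 / 8;     // 262144 bits = 32 KiB
        const unsigned char* data;
        std::size_t len;
        std::vector<std::uint64_t> sums;                  // one checksum per chunk

        void rehash() {                                   // call whenever the bitmap changes
            sums.clear();
            for (std::size_t off = 0; off < len; off += kChunk) {
                std::size_t n = (len - off < kChunk) ? len - off : kChunk;
                std::uint64_t h = 0xcbf29ce484222325ull;  // FNV-1a 64-bit
                for (std::size_t i = 0; i < n; ++i) { h ^= data[off + i]; h *= 1099511628211ull; }
                sums.push_back(h);
            }
        }
    };

    // Returns the byte offsets of chunks whose cached checksums disagree.
    std::vector<std::size_t> differing_chunks(const ChunkedBitmap& a, const ChunkedBitmap& b) {
        std::vector<std::size_t> out;
        std::size_t chunks = a.sums.size() < b.sums.size() ? a.sums.size() : b.sums.size();
        for (std::size_t c = 0; c < chunks; ++c)
            if (a.sums[c] != b.sums[c]) out.push_back(c * ChunkedBitmap::kChunk);
        return out;
    }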

Is it possible to create a float array of 10^13 elements in C?

I am writing a program in C to solve an optimisation problem, for which I need to create an array of type float on the order of 10^13 elements. Is it practically possible to do so on a machine with 20 GB of memory?
A float in C occupies 4 bytes (assuming IEEE floating point arithmetic, which is pretty close to universal nowadays). That means 10^13 elements are naïvely going to require 4×10^13 bytes of space. That's quite a bit (40 TB, a.k.a. quite a lot of disk for a desktop system, and rather more than most people can afford when it comes to RAM) so you need to find another approach.
Is the data sparse (i.e., mostly zeroes)? If it is, you can try using a hash table or tree to store only the non-zero values; if your data is sufficiently sparse, that will let you fit everything in. Also be aware that processing 10^13 elements will take a very long time. Even if you could process a billion items a second (very fast, even now) it would still take 10^4 seconds (several hours), and I'd be willing to bet that in any non-trivial situation you'll not be able to get anywhere near that speed. Can you find some way to make not just the data storage sparse but also the processing, so that you can leave that massive bulk of zeroes alone?
Of course, if the data is non-sparse then you're doomed. In that case, you might need to find a smaller, more tractable problem instead.
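For the sparse case, a minimal sketch of the hash-table idea above, keyed by element index (the class name is illustrative):

    #include <cstdint>
    #include <unordered_map>

    // Only non-zero entries are stored; absent keys read back as 0.0f. With,
    // say, 10^8 non-zeros this fits comfortably in 20 GB, whereas a dense
    // float array of 10^13 elements cannot.
    class SparseFloatArray {
    public:
        void set(std::uint64_t i, float v) {
            if (v == 0.0f) map_.erase(i);        // keep the table truly sparse
            else           map_[i] = v;
        }
        float get(std::uint64_t i) const {
            auto it = map_.find(i);
            return it == map_.end() ? 0.0f : it->second;
        }
    private:
        std::unordered_map<std::uint64_t, float> map_;
    };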
I suppose if you had a 64-bit machine with a lot of swap space, you could just declare an array of size 10^13 and it might work.
But for a data set of this size it becomes important to consider carefully the nature of the problem. Do you really need random access read and write operations for all 10^13 elements? Is the array at all sparse? Could you express this as a map/reduce problem? If so, sequential access to 10^13 elements is much more practical than random access.

Resources