Efficient data structure for very long array of characters with insertions - arrays

What data structure and/or algorithm would be good for implementing an array of characters with insertion? The typical workload would be a loop of a few "random" reads followed by a "random" insert. The array would be huge, on the order of gigabytes.
Edit: In an extreme case the algorithm needs to be able to build up a gigabyte string by single-byte insertions efficiently.

Since this data structure has to handle various extreme cases (inserting either long strings or lots of single characters, representing a very long array, possibly under memory constraints), it is unlikely that a single well-known data structure will suit all of these needs.
You could consider these two data structures:
Rope. It works well if you insert relatively long strings and not too many of them.
Tiered vector. It has no problem with lots of single-character (or short-string) insertions, needs very little additional memory, and allows very fast random reads (see the read sketch below). But it has fairly bad time complexity for insertions: O(sqrt(n)) (which could be improved if we allow some more additional memory). And it is inferior to a rope when inserting long strings.
Or you could use a custom-made data structure based either on rope or on tiered vector.
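For reference, a random read in a two-level tiered vector is just index arithmetic over per-block circular buffers. A minimal sketch in C (the names B, blocks and rot are mine, not taken from any particular implementation):

    #include <stddef.h>

    /* B is the block size (roughly sqrt(n)); rot[b] is block b's rotation offset. */
    static inline char tv_read(char **blocks, const size_t *rot, size_t B, size_t i)
    {
        size_t b = i / B;                        /* which block holds element i        */
        return blocks[b][(i % B + rot[b]) % B];  /* position inside the circular block */
    }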
If you expect many small insertions and want to avoid too many rope splits, you could use a tree of arrays: insert into the middle of an array while it is short, but once the array grows to some limit, the next insert splits it, just as in a rope. Arrays referenced by tree nodes should be rather large (about 1000 bytes or more) to get a better balance between very expensive cache misses when accessing tree nodes (so we should minimize the number of nodes that do not fit in cache) and the somewhat cheaper memmoves.
The most appropriate memory allocation scheme for these arrays is this: when an array does not fit into its allocated space, split it into two equal parts, allocate some fixed number of bytes (say 2000) for each half, then copy each half into the middle of its allocated space. When inserting characters near the end of such an array, move the tail characters to the right; when inserting near the beginning, move the preceding characters to the left. So the average memmove length is only 1/4 of the average array length. Two neighboring allocations could share the unused bytes between them, so a split is only needed when the bytes of one chunk are about to overwrite used bytes of the other.
This approach is simple but needs some additional space to allow the arrays to grow. We could use a general-purpose allocator to get only the space actually used by the arrays (or allow very limited growth space), but that is much slower and most likely leads to memory fragmentation and an even larger amount of unused memory. A possibly better way to save some memory is to use several fixed allocation sizes (like 1500, 1700, 2000) and reserve some fixed number (determined experimentally) of chunks of each size. Another way to save memory is, instead of splitting one 2000-byte array into two 1000-byte arrays, to merge two adjacent arrays (say 2000+1600) and then split the result into three arrays (1200+1200+1200).
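A rough C sketch of the "array centered inside a fixed allocation" idea described above (chunk_t, CHUNK_CAP and chunk_insert are names invented for this example; real code would also perform the split when a gap runs out and let neighbouring chunks share their slack):

    #include <string.h>
    #include <stddef.h>

    #define CHUNK_CAP 2000

    /* A chunk keeps its bytes centered inside a fixed allocation, so there is
       free space on both sides. A freshly split chunk starts with
       start == (CHUNK_CAP - len) / 2. */
    typedef struct {
        char   buf[CHUNK_CAP];
        size_t start;   /* index of the first used byte */
        size_t len;     /* number of used bytes         */
    } chunk_t;

    /* Insert c at logical position pos (0..len). Returns 0 on success,
       -1 when the relevant gap is exhausted and the chunk must be split. */
    int chunk_insert(chunk_t *ch, size_t pos, char c)
    {
        if (pos <= ch->len / 2) {
            if (ch->start == 0)
                return -1;                            /* left gap exhausted  */
            /* shift the shorter left part one byte to the left */
            memmove(ch->buf + ch->start - 1, ch->buf + ch->start, pos);
            ch->start--;
        } else {
            if (ch->start + ch->len >= CHUNK_CAP)
                return -1;                            /* right gap exhausted */
            /* shift the shorter right part one byte to the right */
            memmove(ch->buf + ch->start + pos + 1,
                    ch->buf + ch->start + pos, ch->len - pos);
        }
        ch->buf[ch->start + pos] = c;
        ch->len++;
        return 0;
    }

Because the shorter side is always the one moved, the average memmove is at most a quarter of the array length, as described above.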
You've mentioned "packing bits together" to reduce RAM usage. That is not impossible with such a data structure (if your data is compressible). Two compression algorithms may actually be used here without sacrificing too much performance: Huffman coding or LZ4. For Huffman coding you need a static frequency table (precomputed in advance). For reading you'll need to decode only about 1/4 of the average array size to get to the proper position, plus the length of the string to read. For insertion you'll need to decode the same 1/4 of the average array size, then move a bitstream of the same size, then encode the inserted string. With LZ4 there is no need to deal with bitstreams, only whole bytes are used; it is probably worth increasing the size of the arrays to get better compression.
A tiered vector may be optimized to be more cache friendly. Add about 100-200 bytes of reserved space to each block. With each insertion, memmove bytes into this space, and only when there is no room for the next memmove start exchanging data between blocks (not byte by byte as in the original data structure, but 100-200 bytes at a time).
To improve the O(sqrt(n)) insertion time, consider the tiered vector as a special case of a trie with only 2 levels. We could add one more level. Then (after an insertion) we could stop inter-block data exchange when we come to the end of a second-level block, allocate an additional first-level block, and place the extra bytes there. When one or several additional blocks are filled up, we could continue the data exchange on the third-level block (the root of the trie). In theory this may be extended to log(n) levels to guarantee O(log n) single-character inserts and reads. But in practice 3 or 4 levels is probably a better choice, giving O(n^(1/3)) or O(n^(1/4)) amortized insertion complexity.

I would suggest using a rope data structure. The SGI C++ STL provides an implementation which is typically made available as an extension: <ext/rope>. If you need to reimplement this data structure, it might be a good idea to consult their implementation notes.

Related

Index vs. Pointer

I'm using arrays of elements, many of which reference each other, and I assumed in that case it's more efficient to use pointers.
But in some cases I need to know the index of an element I have the pointer to. For example I have p = &a[i] and I need to know the value of i. As I understand it, i can be computed through p - a. But this operation inherently involves division, which is expensive, whereas computing an address from an array index involves a multiplication and is faster.
So my question is, is cross referencing with pointers in a case where you need the indexes as well even worth it?
But this operation inherently involves division, which is expensive, whereas computing an address from an array index involves a multiplication and is faster.
This operation requires a division only when the size of the element is not a power of two, i.e. when it is not a pointer, or some standard type on most systems. Dividing by a power of two is done using bit shifting, which is extremely cheap.
computing an address from an array index involves a multiplication and is faster.
Same logic applies here, except the compiler shifts left instead of shifting right.
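A small C example of both directions; because sizeof(long) is a power of two, the compiler emits a shift for the division hidden in p - a:

    #include <stdio.h>
    #include <stddef.h>

    int main(void)
    {
        long a[16];
        long *p = &a[5];        /* index -> pointer: base + i * sizeof(long)        */
        ptrdiff_t i = p - a;    /* pointer -> index: (p - a) / sizeof(long),
                                   compiled as a right shift, not a real division   */
        printf("%td\n", i);     /* prints 5 */
        return 0;
    }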
is cross referencing with pointers in a case where you need the indexes as well even worth it?
Counting CPU cycles without profiling is a case of premature optimization - a bad thing to consider when you are starting your design.
A more important consideration is that indexes are more robust, because they often survive array reallocation.
Consider an example: let's say you have an array that grows dynamically as you add elements to its back, an index into that array, and a pointer into that array. You add an element to the array, exhausting its capacity, so now it must grow. You call realloc, and get a new array (or an old array if there was enough extra memory after the "official" end). The pointer that you held is now invalid; the index, however, is still valid.
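A minimal sketch of that scenario (error handling omitted):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t cap = 4, len = 4;
        int *arr = malloc(cap * sizeof *arr);
        arr[2] = 7;

        int   *p = &arr[2];              /* pointer into the array */
        size_t i = 2;                    /* index into the array   */

        cap *= 2;                        /* capacity exhausted: grow */
        arr = realloc(arr, cap * sizeof *arr);

        /* If the block moved, p now dangles and must not be dereferenced;
           the index is still valid: */
        printf("%d\n", arr[i]);          /* prints 7 */
        arr[len++] = 42;
        (void)p;
        free(arr);
        return 0;
    }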
Indexing an array is dirt cheap in ways where I've never found any kind of performance boost by directly using pointers instead. That includes some very performance-critical areas like looping through each pixel of an image containing millions of them -- still no measurable performance difference between indices and pointers (though it does make a difference if you can access an image using one sequential loop instead of two).
I've actually found many opposite cases where turning pointers into 32-bit indices boosted performance, once 64-bit hardware became common and there was a need to store a boatload of them.
One of the reasons is obvious: you can take half the space now with 32-bit indices (assuming you don't need more than ~4.3 billion elements). If you're storing a boatload of them and taking half the memory as in the case of a graph data structure like indexed meshes, then typically you end up with fewer cache misses when your links/adjacency data can be stored in half the memory space.
But on a deeper level, using indices allows a lot more options. You can use purely contiguous structures that realloc to new sizes without worrying about invalidation as dasblinkenlight points out. The indices will also tend to be more dense (as opposed to sparsely fragmented across the entire 64-bit addressing space), even if you leave holes in the array, allowing for effective compression (delta, frame of reference, etc) if you want to squash down memory usage. You can then also use parallel arrays to associate data to something in parallel without using something much more expensive like a hash table. That includes parallel bitsets which allow you to do things like set intersections in linear time. It also allows for SoA reps (also parallel arrays) which tend to be optimal for sequential access patterns using SIMD.
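As a purely illustrative sketch of the parallel-array idea: positions and adjacency links for a mesh-like structure stored as 32-bit indices into arrays that share one index space, instead of 64-bit pointers (the names and fields here are made up, and allocation checks are omitted):

    #include <stdint.h>
    #include <stdlib.h>

    /* SoA layout: parallel arrays indexed by the same 32-bit vertex id. */
    typedef struct {
        float    *x, *y, *z;   /* positions, one entry per vertex                    */
        uint32_t *next;        /* e.g. next vertex in some chain: 4 bytes per link
                                  instead of 8 for a pointer on 64-bit hardware       */
        uint32_t  count;
    } vertex_soa;

    vertex_soa make_vertices(uint32_t n)
    {
        vertex_soa v;
        v.x    = malloc(n * sizeof *v.x);
        v.y    = malloc(n * sizeof *v.y);
        v.z    = malloc(n * sizeof *v.z);
        v.next = malloc(n * sizeof *v.next);
        v.count = n;
        return v;
    }

The same 32-bit ids can then index any number of additional parallel arrays (or bitsets) without touching the vertex data itself.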
You get a lot more room to optimize with indices, and I'd consider it mostly just a waste of memory if you keep pointers around on top of indices. The downside to indices for me is mostly just convenience. We have to have access to the array we're indexing on top of the index itself, while the pointer allows you to access the element without having access to its container. It's often more difficult and error-prone to write code and data structures revolving around indices and also harder to debug since we can't see the value of an element through an index. That said, if you accept the extra burden, then often you get more room to optimize with indices, not less.

using linked list vs large array for searching

I am implementing some algorithmic changes to the conventional game of life for an assignment.
Essentially, I currently have two options for implementing a multithreaded searching algorithm that improves on the efficiency of a previous algorithm.
Either search through a linked list using two threads and relay the data to two other threads to process (the application is running on a quad core),
or have a massive preallocated array which will remain largely empty, containing only pointers to predefined structures, in which case the searching could be done much faster and there would be no issues in syncing the threads.
Would a faster search outweigh memory requirements and reduce computing time?
It should be mentioned that the array will remain largely empty, but the overall memory allocated to it would be far larger than for the linked list; on the other hand, the index of the furthest non-empty array element could also be stored so as to prevent the program from searching the entire array.
I should also mention that the array stores pointers to live-cell coordinates, and as such is only kept so large as a worst-case measure. I am also planning on ignoring any NULL values in order to skip array elements that have been deleted.
Game Of Life and searching?????
If you want a multithreaded Game of Life: calculate line n/2 on its own but don't store it in the array, just in a buffer; run two threads that calculate and store lines 0 to n/2 - 1 and lines n/2 + 1 to n - 1 respectively; then copy line n/2 from the buffer into the result.
For four threads, calculate lines at n/4, n/2, 3n/4 first, give each thread a quarter of the job, then copy the three lines into the array.
If your array is as sparse as most GOL boards, then the list will likely be much, much faster. Having a pointer to the next piece of data is way better than scanning for it.
That said, the overall performance may not be better, as others have mentioned.

space optimize a large array with many duplicates

I have an array where the index doubles as an 'identifier for a collection of items' and the content of the array is a group number. The group numbers fall into a finite range 0..N, where N << length_of_the_array, hence every entry will be duplicated a large number of times. Currently I have to use 2 bytes to represent a group number (which can be > 1000 but < 6500), which, due to the duplication, ends up consuming a lot of memory.
Are there ways to space-optimize this array, as the complete array can get into multiple MBs in size? I'd appreciate any pointers toward a relevant optimization algorithm/technique. FYI: the programming language I'm using is C++.
Do you still want efficient random-access to arbitrary elements? Or are you thinking about space-efficient serialization of the index->group map?
If you still want efficient random access, a single array lookup is not bad. It's at worst a single cache miss (well, really, at worst a page fault, or more likely a TLB miss, but that's unlikely if the array is only a couple of MB).
A sorted and run-length encoded list could be binary-searched (by searching an array of prefix-sums of the repeat-counts), but that only works if you can occasionally sort the list to keep duplicates together.
If the duplicates can't be at least somewhat grouped together, there's not much you can do that allows random access.
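If the data can be kept sorted, the run-length + prefix-sum lookup mentioned above can be as small as this sketch (runs, prefix and group_at are names made up for the example; both arrays are assumed to have been built from the sorted, run-length encoded data):

    #include <stddef.h>
    #include <stdint.h>

    /* runs[k] is the group number of run k; prefix[k] is the total number of
       elements in runs 0..k-1, so prefix[nruns] equals the element count. */
    uint16_t group_at(size_t i, const uint16_t *runs,
                      const size_t *prefix, size_t nruns)
    {
        size_t lo = 0, hi = nruns;      /* find the run containing index i:
                                           largest lo with prefix[lo] <= i   */
        while (lo + 1 < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (prefix[mid] <= i)
                lo = mid;
            else
                hi = mid;
        }
        return runs[lo];
    }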
Packed 12-bit entries are probably not worth the trouble, unless that is enough to significantly reduce cache misses. A couple of multiply instructions to generate the right address, and a shift-and-mask on the 16-bit load containing the desired value, is not much overhead compared to a cache miss. Write access to packed bitfields is slower, and isn't atomic, so that's a serious downside. Getting a compiler to pack bitfields using structs can be compiler-specific; maybe just using a char array would be best.
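If you do try packing, read access to 12-bit entries could look roughly like the sketch below (two entries per 3 bytes; writes would need a read-modify-write, which is where the non-atomicity bites):

    #include <stdint.h>
    #include <stddef.h>

    /* Two 12-bit entries are packed into every 3 bytes of `packed`. */
    uint16_t get12(const uint8_t *packed, size_t i)
    {
        size_t byte = (i / 2) * 3;
        if (i % 2 == 0)
            /* even entry: byte 0 plus the low nibble of byte 1 */
            return (uint16_t)(packed[byte] | ((packed[byte + 1] & 0x0F) << 8));
        else
            /* odd entry: high nibble of byte 1 plus byte 2 */
            return (uint16_t)((packed[byte + 1] >> 4) | (packed[byte + 2] << 4));
    }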

Storing records in a byte array vs using an array of structs

I have 200 million records, some of which have variable sized fields (string, variable length array etc.). I need to perform some filters, aggregations etc. on them (analytics oriented queries).
I want to just lay them all out in memory (enough to fit in a big box) and then do linear scans over them. There are two approaches I can take, and I want to hear your opinions on which approach is better for maximizing speed:
Using an array of structs with char* and int* etc. to deal with variable length fields
Use a large byte array, scan the byte array like a binary stream, and then parse the records
Which approach would you recommend?
Update: Using C.
The unfortunate answer is that "it depends on the details which you haven't provided" which, while true, is not particularly useful. The general advice to approaching a problem like this is to start with the simplest/most obvious design and then profile and optimize it as needed. If it really matters you can start with a few very basic benchmark tests of a few designs using your exact data and use-cases to get a more accurate idea of what direction you should take.
Looking in general at a few specific designs and their general pros/cons:
One Big Buffer
char* pBuffer = malloc(200000000);
Assumes your data can fit into memory all at once.
Would work better for all text (or mostly text) data.
Wouldn't be my first choice for large data as it just mirrors the data on the disk. Better just to use the hardware/software file cache/read ahead and read data directly from the drive, or map it if needed.
For linear scans this is a good format but you lose a bit if it requires complex parsing (especially if you have to do multiple scans).
Potential for the least overhead assuming you can pack the structures one after the other.
Static Structure
typedef struct {
    char Data1[32];
    int  Data2[10];
} myStruct;

myStruct *pData = malloc(sizeof(myStruct) * 200000000);
Simplest design and likely the best potential for speed at a cost of memory (without actual profiling).
If your variable length arrays have a wide range of sizes you are going to waste a lot of memory. Since you have 200 million records you may not have enough memory to use this method.
For a linear scan this is likely the best memory structure due to memory cache/prefetching.
Dynamic Structure
typedef struct {
    char *pData1;
    int  *pData2;
} myStruct2;

myStruct2 *pData = malloc(sizeof(myStruct2) * 200000000);
With 200 million records this is going to require a lot of dynamic memory allocations, which is very likely to have a significant impact on speed.
Has the potential to be more memory efficient if your dynamic arrays have a wide range of sizes (though see next point).
Be aware of the overhead of the pointer sizes. On a 32-bit system this structure needs 8 bytes (ignoring padding) to store the pointers which is 1.6 GB alone for 200 million records! If your dynamic arrays are generally small (or empty) you may be spending more memory on the overhead than the actual data.
For a linear scan of data this type of structure will probably perform poorly as you are accessing memory in a non-linear manner which cannot be predicted by the prefetcher.
Streaming
If you only need to do one scan of the data then I would look at a streaming solution where you read a small bit of the data at a time from the file.
Works well with very large data sets that wouldn't fit into memory.
Main limitation here is the disk read speed and how complex your parsing is.
Even if you have to do multiple passes with file caching this might be comparable in speed to the other methods.
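A bare-bones streaming loop might look like the sketch below. It assumes a hypothetical on-disk format of a 4-byte length prefix per record and a placeholder process_record; error handling and buffering are trimmed:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Placeholder for whatever per-record filtering/aggregation you run. */
    static void process_record(const unsigned char *rec, uint32_t len)
    {
        (void)rec; (void)len;
    }

    int stream_records(const char *path)
    {
        FILE *f = fopen(path, "rb");
        if (!f) return -1;

        uint32_t len;
        unsigned char *rec = NULL;
        /* assume each record is stored as a 4-byte length followed by its bytes */
        while (fread(&len, sizeof len, 1, f) == 1) {
            rec = realloc(rec, len);
            if (fread(rec, 1, len, f) != len)
                break;                       /* truncated file */
            process_record(rec, len);
        }
        free(rec);
        fclose(f);
        return 0;
    }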
Which of these is "best" really depends on your specific case... I can think of situations where each one would be the preferred method.
You could use structs, indeed, but you'd have to be very careful about alignments and aliasing, and it would all need patching up when there's a variable length section. In particular, you could not use an array of such structs because all entries in an array must be constant size.
I suggest the flat array approach. Then add a healthy dose of abstraction; you don't want your "business logic" doing bit-twiddling.
Better still, if you need to do a single linear scan over the whole data set then you should treat it like a data stream, and de-serialize (copy) the records into proper, native structs, one at a time.
"Which approach would you recommend?" Neither actually. With this amount of data my recommendation would be something like a linked list of your structs. However, if you are 100% sure that you will be able to allocate the required amount of memory (with 1 malloc call) for all your data, then use array of structs.

Realloc Vs Linked List Scanning

I have to read an unknown number of rows from a file and save them into a structure (I would like to avoid a preprocessing pass to count the total number of elements).
After the reading phase I have to make some computations on each of the elements of these rows.
I figured out two ways:
Use realloc each time I read a row. This way the allocation phase is slow but the computation phase is easier thanks to the index access.
Use a linked list each time I read a row. This way the allocation phase is faster but the computation phase is slower.
What is better from a complexity point of view?
How often will you traverse the linked list? If it's only once, go for the linked list. A few other things: will there be a lot of small allocations? You could make a few smaller buffers for, say, 10 lines each and link those together. But that's all a question of profiling.
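The "buffer of ~10 lines per node" idea could look something like this (LINES_PER_CHUNK and the per-line representation are arbitrary choices for the sketch):

    #define LINES_PER_CHUNK 10

    /* One list node holds several lines, so there is one allocation per
       LINES_PER_CHUNK rows instead of one allocation per row. */
    typedef struct chunk {
        char         *lines[LINES_PER_CHUNK];
        int           used;     /* how many of the slots are filled */
        struct chunk *next;
    } chunk;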
I'd do the simplest thing first and see if it fits my needs; only then would I think about optimizing.
Sometimes one wastes too much time thinking about the optimum even when the second-best solution also fits the needs perfectly.
Without more details on how you are going to use the information, it is a bit tough to comment on the complexity. However, here are a few thoughts:
If you use realloc, it would likely be better to realloc to add "some" more items (rather than one each and every time). Typically, a good algorithm is to double the size each time (see the sketch after this list).
If you use a linked list, you could speed up the access in a simple post-processing step. Allocate an array of pointers to the items and traverse the list once setting the array elements to each item in the list.
If the items are of a fixed size in the file, you could pre-compute the size simply by seeking to the end of the file, determining the size, divide by the item size and you have the result. Even if it is not a fixed size, you could possibly use this as an estimate to get "close" to the necessary size and reduce the number of reallocs required.
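A sketch of the doubling strategy from the first point (strdup is POSIX; error handling is trimmed):

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char **rows;     /* one pointer per line read so far */
        size_t len;
        size_t cap;
    } row_buf;

    /* Append a copy of `line`; the capacity is doubled only when it runs out,
       so the total number of reallocs is O(log n) for n lines. */
    void row_buf_push(row_buf *b, const char *line)
    {
        if (b->len == b->cap) {
            b->cap  = b->cap ? b->cap * 2 : 16;
            b->rows = realloc(b->rows, b->cap * sizeof *b->rows);
        }
        b->rows[b->len++] = strdup(line);
    }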
As other users have already stated:
"Premature optimization is the root of all evil." (Donald Knuth)
I have a different proposal using realloc: in the C++ STL, the std::vector container grows whenever an object is inserted and there is not enough space available. How much it grows depends on the currently pre-allocated size and is implementation-specific. You could do something similar: keep track of the number of preallocated objects, and when the space runs out, call realloc with double the currently allocated amount. I hope this was somewhat understandable!
The caveat, of course, is that you will probably allocate more space than you actually consume and need.
