I'm using arrays of elements, many of which reference each other, and I assumed that in that case it's more efficient to use pointers.
But in some cases I need to know the index of an element I have the pointer to. For example I have p = &a[i] and I need to know the value of i. As I understand it, i can be computed through p - a. But this operation inherently involves division, which is expensive, whereas computing an address from an array index involves a multiplication and is faster.
So my question is, is cross referencing with pointers in a case where you need the indexes as well even worth it?
But this operation inherently involves division, which is expensive, whereas computing an address from an array index involves a multiplication and is faster.
This operation requires a division only when the size of the element is not a power of two. Pointers and the standard scalar types have power-of-two sizes on most systems, and dividing by a power of two is done using bit shifting, which is extremely cheap.
computing an address from an array index involves a multiplication and is faster.
Same logic applies here, except the compiler shifts left instead of shifting right.
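As a minimal sketch of the p - a computation being discussed (the Node type and the index_of helper are made up for illustration, not taken from the question):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical element type. On typical 64-bit ABIs its size is 16 bytes,
// a power of two, so the compiler can turn the division into a shift.
struct Node {
    Node* next;
    long long payload;
};

// Recover the index of an element from a pointer into the same array.
// The subtraction is only defined when both pointers refer to elements of
// the same array (or one past its end).
std::ptrdiff_t index_of(const Node* base, const Node* p) {
    return p - base;   // byte distance divided by sizeof(Node)
}
```

For example, with Node a[8], index_of(a, &a[5]) gives back 5.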
is cross referencing with pointers in a case where you need the indexes as well even worth it?
Counting CPU cycles without profiling is a case of premature optimization - a bad thing to consider when you are starting your design.
A more important consideration is that indexes are more robust, because they often survive array reallocation.
Consider an example: let's say you have an array that grows dynamically as you add elements to its back, an index into that array, and a pointer into that array. You add an element to the array, exhausting its capacity, so now it must grow. You call realloc, and get a new array (or an old array if there was enough extra memory after the "official" end). The pointer that you held is now invalid; the index, however, is still valid.
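A rough sketch of that scenario, using std::vector in place of a hand-rolled realloc (the function name is hypothetical): the index taken before the growth still finds the element afterwards, whereas a raw pointer taken at the same moment would dangle.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Appends values until the vector has to reallocate, then reads an element
// back through an index that was taken before the growth.
int read_by_index_after_growth() {
    std::vector<int> v = {10, 20, 30};
    std::size_t idx = 1;      // index of the element holding 20
    // int* p = &v[1];        // a pointer taken here would be invalid after reallocation

    const std::size_t old_capacity = v.capacity();
    while (v.capacity() == old_capacity) {
        v.push_back(0);       // eventually exceeds the capacity and reallocates
    }
    return v[idx];            // still refers to the same logical element
}
```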
Indexing an array is dirt cheap in ways where I've never found any kind of performance boost by directly using pointers instead. That includes some very performance-critical areas like looping through each pixel of an image containing millions of them -- still no measurable performance difference between indices and pointers (though it does make a difference if you can access an image using one sequential loop over two).
I've actually found many opposite cases where turning pointers into 32-bit indices boosted performance after 64-bit hardware started becoming available when there was a need to store a boatload of them.
One of the reasons is obvious: you can take half the space now with 32-bit indices (assuming you don't need more than ~4.3 billion elements). If you're storing a boatload of them and taking half the memory as in the case of a graph data structure like indexed meshes, then typically you end up with fewer cache misses when your links/adjacency data can be stored in half the memory space.
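To make the size argument concrete, here is a small sketch with made-up types (Vertex, EdgeByPointer, EdgeByIndex are assumptions for illustration). On a typical 64-bit target the pointer version takes 16 bytes per edge and the index version 8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Vertex { float x, y, z; };   // hypothetical payload type

// A link stored as two pointers: 2 * sizeof(void*) bytes per edge.
struct EdgeByPointer { Vertex* from; Vertex* to; };

// The same link stored as two 32-bit indices into a vertex array: 8 bytes.
struct EdgeByIndex { std::uint32_t from; std::uint32_t to; };

constexpr std::size_t edge_bytes_by_pointer(std::size_t n) { return n * sizeof(EdgeByPointer); }
constexpr std::size_t edge_bytes_by_index(std::size_t n)   { return n * sizeof(EdgeByIndex); }
```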
But on a deeper level, using indices allows a lot more options. You can use purely contiguous structures that realloc to new sizes without worrying about invalidation as dasblinkenlight points out. The indices will also tend to be more dense (as opposed to sparsely fragmented across the entire 64-bit addressing space), even if you leave holes in the array, allowing for effective compression (delta, frame of reference, etc) if you want to squash down memory usage. You can then also use parallel arrays to associate data to something in parallel without using something much more expensive like a hash table. That includes parallel bitsets which allow you to do things like set intersections in linear time. It also allows for SoA reps (also parallel arrays) which tend to be optimal for sequential access patterns using SIMD.
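A sketch of the parallel-bitset idea, assuming one bit per array element and that both sets were created for the same element count (all names here are hypothetical). The intersection is a single pass over the words:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One bit per array element: bit i set means "element i is in the set".
using Bitset = std::vector<std::uint64_t>;

Bitset make_bitset(std::size_t element_count) {
    return Bitset((element_count + 63) / 64, 0);
}

void bitset_insert(Bitset& s, std::uint32_t index) {
    s[index / 64] |= std::uint64_t{1} << (index % 64);
}

bool bitset_contains(const Bitset& s, std::uint32_t index) {
    return ((s[index / 64] >> (index % 64)) & 1u) != 0;
}

// Linear in the number of 64-bit words; assumes a and b have the same size.
Bitset bitset_intersect(const Bitset& a, const Bitset& b) {
    Bitset out(a.size());
    for (std::size_t w = 0; w < a.size(); ++w) {
        out[w] = a[w] & b[w];
    }
    return out;
}
```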
You get a lot more room to optimize with indices, and I'd consider it mostly just a waste of memory if you keep pointers around on top of indices. The downside to indices for me is mostly just convenience. We have to have access to the array we're indexing on top of the index itself, while the pointer allows you to access the element without having access to its container. It's often more difficult and error-prone to write code and data structures revolving around indices and also harder to debug since we can't see the value of an element through an index. That said, if you accept the extra burden, then often you get more room to optimize with indices, not less.
Suppose that I have N unsorted arrays, of integers. I'd like to find the intersection of those arrays.
There are two good ways to approach this problem.
One, I can sort the arrays in place with an nlogn sort, like QuickSort or MergeSort. Then I can put a pointer at the start of each array. Compare each array to the one below it, iterating the pointer of whichever array[pointer] is smaller, or if they're all equal, you've found an intersection.
This is an O(nlogn) solution, with constant memory (since everything is done in-place).
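A sketch of the sort-based approach, with one simplification: instead of advancing N cursors at once it folds a two-cursor sweep over the arrays one at a time. It also copies its input and builds an output vector, so it is not the constant-memory version described above. Function names are made up.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Two-cursor sweep over two sorted ranges; emits each common value once.
std::vector<int> intersect_two_sorted(const std::vector<int>& a,
                                      const std::vector<int>& b) {
    std::vector<int> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) {
            ++i;                       // advance the cursor at the smaller value
        } else if (b[j] < a[i]) {
            ++j;
        } else {
            if (out.empty() || out.back() != a[i]) out.push_back(a[i]);
            ++i;
            ++j;
        }
    }
    return out;
}

// Sorts every array, then intersects them one after another.
std::vector<int> intersect_by_sorting(std::vector<std::vector<int>> arrays) {
    if (arrays.empty()) return {};
    for (auto& arr : arrays) std::sort(arr.begin(), arr.end());
    std::vector<int> result = arrays[0];
    result.erase(std::unique(result.begin(), result.end()), result.end());
    for (std::size_t k = 1; k < arrays.size(); ++k) {
        result = intersect_two_sorted(result, arrays[k]);
    }
    return result;
}
```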
The second solution is to use a hash map, putting in the values that appear in the first array as keys, and then incrementing those values as you traverse through the remaining arrays (and then grabbing everything that had a value of N). This is an O(n) solution, with O(n) memory, where n is the total size of all of the arrays.
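A sketch of the hash-map approach with std::unordered_map (function name hypothetical). One detail is added that the description leaves out: a value is counted at most once per array, so duplicates inside one array do not inflate the count. The final sort is only there to make the output order deterministic and is not part of the O(n) argument.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

std::vector<int> intersect_by_hashing(const std::vector<std::vector<int>>& arrays) {
    if (arrays.empty()) return {};

    // value -> number of arrays (from the front) in which it has been seen
    std::unordered_map<int, std::size_t> seen_in;
    for (int v : arrays[0]) seen_in.emplace(v, 1);

    for (std::size_t k = 1; k < arrays.size(); ++k) {
        for (int v : arrays[k]) {
            auto it = seen_in.find(v);
            // Bump only if the value was present in all k earlier arrays and
            // has not yet been counted for this one.
            if (it != seen_in.end() && it->second == k) ++it->second;
        }
    }

    std::vector<int> out;
    for (const auto& kv : seen_in) {
        if (kv.second == arrays.size()) out.push_back(kv.first);
    }
    std::sort(out.begin(), out.end());   // deterministic order only
    return out;
}
```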
Theoretically, the former solution is O(nlogn), and the latter is O(n). However, hash maps do not have great locality, because collisions can leave items scattered randomly through the map. The other solution, although O(nlogn), traverses each array sequentially, exhibiting excellent locality. Since a CPU will tend to pull the array values next to the current index from memory into the cache, the O(nlogn) solution will be hitting the cache much more often than the hash map solution.
Therefore, given a significantly large array size (as the number of elements goes to infinity), is it feasible that the O(nlogn) solution is actually faster than the O(n) solution?
For integers you can use a non-comparison sort (see counting, radix sort). A large set might be encoded, e.g. sequential runs into ranges. That would compress the data set and allow for skipping past large blocks (see RoaringBitmaps). There is the potential to be hardware friendly and have O(n) complexity.
Complexity theory does not account for constants. As you suspect, there is always the potential for an algorithm with a higher complexity to be faster than the alternative, due to the hidden constants. By exploiting the nature of the problem, e.g. limiting the solution to integers, there are potential optimizations not available to a general-purpose approach. Good algorithm design often requires understanding and leveraging those constraints.
I have an array where the index doubles as 'identifier for a collection of items' and the content of the array is a group number. The group numbers fall into a finite range from 0..N, where N << length_of_the_array. Hence every entry will be duplicated a large number of times. Currently I have to use 2 bytes to represent a group number (which can be > 1000 but < 6500), which due to the duplicated nature ends up consuming a lot of memory.
Are there ways to space-optimize this array, as the complete array can get into multiple MBs in size? I'd appreciate any pointers toward a relevant optimization algorithm/technique. FYI: the programming language I'm using is C++.
Do you still want efficient random-access to arbitrary elements? Or are you thinking about space-efficient serialization of the index->group map?
If you still want efficient random access, a single array lookup is not bad. It's at worst a single cache miss. (Well really, at worst a page fault, or more likely a TLB miss, but that's unlikely if it's only a couple MB.)
A sorted and run-length encoded list could be binary-searched (by searching an array of prefix-sums of the repeat-counts), but that only works if you can occasionally sort the list to keep duplicates together.
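A sketch of that lookup, assuming the entries are already grouped into runs (the RunLengthMap type and both function names are made up). Each run stores its group number and the prefix sum of run lengths, and std::upper_bound does the binary search.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct RunLengthMap {
    std::vector<std::uint16_t> group;   // group number of each run
    std::vector<std::size_t> run_end;   // prefix sums: one past the last index of each run
};

RunLengthMap encode_runs(const std::vector<std::uint16_t>& groups) {
    RunLengthMap m;
    for (std::size_t i = 0; i < groups.size(); ++i) {
        if (m.group.empty() || m.group.back() != groups[i]) {
            m.group.push_back(groups[i]);   // start a new run
            m.run_end.push_back(i + 1);
        } else {
            m.run_end.back() = i + 1;       // extend the current run
        }
    }
    return m;
}

// Precondition: index is smaller than the number of encoded entries.
std::uint16_t group_at(const RunLengthMap& m, std::size_t index) {
    // First run whose end is strictly greater than index.
    auto it = std::upper_bound(m.run_end.begin(), m.run_end.end(), index);
    return m.group[static_cast<std::size_t>(it - m.run_end.begin())];
}
```

Lookup costs O(log R) for R runs, so this only pays off when R is much smaller than the array length.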
If the duplicates can't be at least somewhat grouped together, there's not much you can do that allows random access.
Packed 12-bit entries are probably not worth the trouble, unless that was enough to significantly reduce cache misses. A couple multiply instructions to generate the right address, and a shift and mask instruction on the 16b load containing the desired value, is not much overhead compared to a cache miss. Write access to packed bitfields is slower, and isn't atomic, so that's a serious downside. Getting a compiler to pack bitfields using structs can be compiler-specific. Maybe just using a char array would be best.
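For reference, a sketch of what packed 12-bit access over a byte array could look like (helper names are hypothetical). Note that 12 bits only cover 0..4095, so the range in the question (up to 6500) would actually need 13 bits, which makes the addressing messier than shown here. The read-modify-write in the setter is also why writes are slower and not atomic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes needed for n 12-bit entries (two entries share three bytes).
std::size_t packed12_bytes(std::size_t n) { return (n * 12 + 7) / 8; }

std::uint16_t packed12_get(const std::vector<std::uint8_t>& buf, std::size_t i) {
    const std::size_t byte = (i * 3) / 2;          // i * 12 bits / 8
    const unsigned shift = (i & 1) ? 4u : 0u;      // odd entries start mid-byte
    const unsigned word = buf[byte] | (static_cast<unsigned>(buf[byte + 1]) << 8);
    return static_cast<std::uint16_t>((word >> shift) & 0xFFFu);
}

void packed12_set(std::vector<std::uint8_t>& buf, std::size_t i, std::uint16_t value) {
    const std::size_t byte = (i * 3) / 2;
    const unsigned shift = (i & 1) ? 4u : 0u;
    unsigned word = buf[byte] | (static_cast<unsigned>(buf[byte + 1]) << 8);
    // Clear the 12 target bits, then merge in the new value.
    word = (word & ~(0xFFFu << shift)) | ((value & 0xFFFu) << shift);
    buf[byte] = static_cast<std::uint8_t>(word & 0xFFu);
    buf[byte + 1] = static_cast<std::uint8_t>((word >> 8) & 0xFFu);
}
```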
I've noticed that it is very common (especially in interview questions and homework assignments) to implement a dynamic array; typically, I see the question phrased as something like:
Implement an array which doubles in capacity when full
Or something very similar. They almost always (in my experience) use the word double explicitly, rather than a more general
Implement an array which increases in capacity when full
My question is, why double? I understand why it would be a bad idea to use a constant value (thanks to this question) but it seems like it makes more sense to use a larger multiple than double; why not triple the capacity, or quadruple it, or square it?
To be clear, I'm not asking how to double the capacity of an array, I'm asking why doubling is the convention.
Yes, it is common practice.
Doubling is a good way to manage memory. Heap management algorithms are often based on the classic Buddy System; it's an easy way to deal with addressing, coalescing, and other challenges. Knowing this, it is good to stick with multiples of 2 when dealing with allocation (though there are hybrid algorithms, like the slab allocator, to help with fragmentation, so using the multiple isn't as important as it once was).
Knuth covers it in one of his books; I have it but forget the title.
See http://en.wikipedia.org/wiki/Buddy_memory_allocation
Another reason to double an array size is about the addition cost. You don't want each Add() operation to trigger a reallocation call. If you've filled N slots, there is a good chance you'll need some multiple of N anyway, history is a good indicator of future needs, so the object needs to "graduate" to the next arena size. By doubling, the frequency of reallocation falls off logarithmically (Log N). Doubling is just the most convenient multiple (being the smallest whole multiplier it is more memory efficient than 3*N or 4*N, plus it tends to follow heap memory management models closely).
The reason behind doubling is that it turns repeatedly appending an element into an amortized O(1) operation. Put another way, appending n elements takes O(n) time.
More accurately, increasing by any multiplicative factor greater than 1 achieves that, but doubling is a common choice. I've seen other choices, such as increasing by a factor of 1.5.
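To see the amortized argument in numbers, here is a small simulation sketch (both functions are made up for illustration). It counts how many element copies are caused by reallocation while appending n elements, once with geometric growth and once with a fixed increment.

```cpp
#include <cassert>
#include <cstddef>

// Copies caused by growth when capacity is multiplied by `factor` when full.
std::size_t copies_with_geometric_growth(std::size_t n, double factor) {
    std::size_t capacity = 1, size = 0, copies = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (size == capacity) {
            const std::size_t grown = static_cast<std::size_t>(capacity * factor);
            capacity = grown > capacity ? grown : capacity + 1;  // always make progress
            copies += size;   // existing elements move to the new block
        }
        ++size;
    }
    return copies;
}

// Copies caused by growth when capacity is increased by a fixed `step`.
std::size_t copies_with_constant_growth(std::size_t n, std::size_t step) {
    std::size_t capacity = step, size = 0, copies = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (size == capacity) {
            capacity += step;
            copies += size;
        }
        ++size;
    }
    return copies;
}
```

Appending 1024 elements with doubling copies 1023 elements in total, which is O(n); growing by one slot at a time copies 523776, which is O(n^2).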
Suppose in speed-critical code we have a pair of arrays that are frequently used together, where the exact size doesn't matter, it just needs to be set to something reasonable, e.g.
int a[256], b[256];
Is this potentially a pessimization because the low address bits being the same can make it harder for the cache to handle both arrays simultaneously? Would it be better to specify e.g. 300 instead of 256?
Moving my comment to an answer:
You are correct to suspect that powers-of-two could be problematic. But it usually only applies when you have more than 2 strides. It doesn't get really bad until you exceed your L1 cache associativity. But even before that you might run into false aliasing issues.
Here are two examples where powers-of-two actually become problematic:
Why are elementwise additions much faster in separate loops than in a combined loop?
Matrix multiplication: Small difference in matrix size, large difference in timings
In the first example, there are 4 arrays - all of which are aligned to the same offset from the start of a 4k page.
In the second example, the column-wise hopping of a matrix completely destroys performance when the size is a power-of-two.
In any case, note that the key concept is actually the alignment of the arrays, not the size of them. If you find that you are running into slow-downs, just add some padding between your arrays to break the alignment.
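If you want to check whether two arrays land on the same offset within a page, a rough helper could look like this. The 4096-byte page size is an assumption that matches the 4k case mentioned above, and the function name is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uintptr_t kPageSize = 4096;   // assumption: 4 KiB pages

// True when both addresses have the same offset from the start of their page,
// which is the alignment pattern that the linked examples run into.
bool same_page_offset(const void* p, const void* q) {
    const auto a = reinterpret_cast<std::uintptr_t>(p);
    const auto b = reinterpret_cast<std::uintptr_t>(q);
    return (a % kPageSize) == (b % kPageSize);
}
```

Two buffers exactly 4096 bytes apart report true; adding, say, 64 bytes of padding between them makes it false.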
I want to sort on the order of four million long longs in C. Normally I would just malloc() a buffer to use as an array and call qsort() but four million * 8 bytes is one huge chunk of contiguous memory.
What's the easiest way to do this? I rate ease over pure speed for this. I'd prefer not to use any libraries and the result will need to run on a modest netbook under both Windows and Linux.
Just allocate a buffer and call qsort. 32MB isn't so very big these days even on a modest netbook.
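A minimal sketch of the qsort call (written so it also compiles as C++; the function names are made up). The one thing to watch is the comparator: returning a - b can overflow for long long, so compare instead of subtracting.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Comparator for qsort: returns <0, 0 or >0 without risking overflow.
int compare_long_long(const void* lhs, const void* rhs) {
    const long long a = *static_cast<const long long*>(lhs);
    const long long b = *static_cast<const long long*>(rhs);
    return (a > b) - (a < b);
}

// Sorts `count` values in place, ascending.
void sort_long_longs(long long* values, std::size_t count) {
    std::qsort(values, count, sizeof(long long), compare_long_long);
}
```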
If you really must split it up: sort smaller chunks, write them to files, and merge them (a merge takes a single linear pass over each of the things being merged). But, really, don't. Just sort it.
(There's a good discussion of the sort-and-merge approach in volume 2 of Knuth, where it's called "external sorting". When Knuth was writing that, the external data would have been on magnetic tape, but the principles aren't very different with discs: you still want your I/O to be as sequential as possible. The tradeoffs are a bit different with SSDs.)
32 MB? That's not too big... quicksort should do the trick.
Your best option would be to prevent having the data unordered if possible. As has been mentioned, you'd be better off reading the data from disk (or network or whatever the source) directly into a self-organizing container (a tree, perhaps std::set will do).
That way, you'll never have to sort through the lot, or have to worry about memory management. If you know the required capacity of the container, you might squeeze out additional performance by using std::vector(initialcapacity) or call vector::reserve up front.
You'd then best be advised to use std::make_heap to heapify any existing elements, and then add element by element using push_heap (see also pop_heap). This essentially is the same paradigm as the self-ordering set but:
duplicates are ok
the storage is 'optimized' as a flat array (which is perfect for e.g. shared memory maps or memory mapped files)
(Oh, minor detail, note that sort_heap on the heap takes at most N log N comparisons, where N is the number of elements)
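A rough sketch of that heap workflow with the standard algorithms (the heap_sorted function and its two parameters are assumptions for illustration): heapify what is already there, push the incoming values one by one, then sort_heap at the end.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

std::vector<long long> heap_sorted(const std::vector<long long>& existing,
                                   const std::vector<long long>& incoming) {
    std::vector<long long> heap = existing;
    heap.reserve(existing.size() + incoming.size());   // avoid regrowth while pushing

    std::make_heap(heap.begin(), heap.end());           // heapify existing elements
    for (long long v : incoming) {
        heap.push_back(v);
        std::push_heap(heap.begin(), heap.end());       // O(log N) per insertion
    }
    std::sort_heap(heap.begin(), heap.end());           // turns the max-heap into ascending order
    return heap;
}
```

Duplicates are kept, and the storage stays a flat vector throughout.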
Let me know if you think this is an interesting approach. I'd really need a bit more info on the use case