Apologies if this question has an obvious answer, but I can't phrase it properly to find an answer online.
In Fortran, suppose I have an array (>100,000 elements) of real numbers. I will continuously access this array (in a consecutive manner) over and over again throughout one step of a time integration scheme. At each subsequent step, some elements of this array will no longer be needed. I do not know how many; it could be anywhere from none of them to all of them. My question is:
Is it better to: (1) go through this array every step and copy the remaining elements I need to a new array, even though only a very small percentage might need to be taken out, or (2) keep a new array of integer indexes, which I update every timestep, and use it to access this array? My understanding is that if the memory access is consecutive it should be very quick, and I think this should outweigh the cost of copying the array. On the other hand, updating the integer indexes would be very speedy, but the cost would be that the data would then be fragmented and accessing it would be slower.
Or is this the type of question with no definitive answer and I need to go and test both methodologies to find which is better for my application?
Hard to say beforehand, so the easy answer would indeed be *"Measure!"*
Some speculation might help with what to measure, though. Everything that follows is under the assumption that the code is indeed performance critical.
Memory Latency:
100k elements usually exceed L1 and L2 cache, so memory locality will play a role. OTOH, a linear scan is way better than scattered access.
If memory latency is significant compared to the per-element operations, and the majority of elements become "uninteresting" after a given number of iterations, I would aim at the following (a sketch follows below):

- mark individual elements as "to be skipped in future iterations"
- compact the memory (i.e. remove skippable elements) when ~50% of the elements have become skippable

(Test for the above conditions: for a naive implementation, does the time of a single iteration grow faster than linearly with the number of elements?)
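A minimal sketch of that mark-then-compact idea (in C++ rather than Fortran, with hypothetical names; the 50% threshold is just the heuristic mentioned above):

```cpp
#include <cstddef>
#include <vector>

// Keep a parallel "skip" flag per element and only rebuild the dense array
// once enough elements have been marked as no longer needed.
struct Field {
    std::vector<double> values;   // the data scanned every time step
    std::vector<bool>   skip;     // true = no longer needed
    std::size_t         skipped = 0;

    void mark_skippable(std::size_t i) {
        if (!skip[i]) { skip[i] = true; ++skipped; }
    }

    // Remove skippable elements in one linear pass once ~50% are dead, so the
    // cost of compaction is amortised over many time steps.
    void maybe_compact() {
        if (skipped * 2 < values.size()) return;
        std::size_t out = 0;
        for (std::size_t i = 0; i < values.size(); ++i)
            if (!skip[i]) values[out++] = values[i];
        values.resize(out);
        skip.assign(out, false);
        skipped = 0;
    }
};
```

The same pattern translates directly to Fortran with a LOGICAL mask array and the PACK intrinsic.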
Cache-friendly blocks:
If memory latency is an issue and it is possible to apply multiple operations to a small chunk (say, 32KiB of data), do so.
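For illustration, a blocked loop might look like the following (the block size and the operations are placeholders, not anything from the answer):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Instead of making several full passes over the whole array, apply all
// per-element operations to one block while it is still resident in cache.
void process_blocked(std::vector<double>& data) {
    constexpr std::size_t kBlock = 32 * 1024 / sizeof(double);  // ~32 KiB worth of doubles
    for (std::size_t start = 0; start < data.size(); start += kBlock) {
        const std::size_t end = std::min(start + kBlock, data.size());
        for (std::size_t i = start; i < end; ++i) data[i] *= 1.01;  // operation A
        for (std::size_t i = start; i < end; ++i) data[i] += 0.5;   // operation B
    }
}
```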
Parallelization:
(the elephant in the room). This can be added easily if you can process in cache-friendly blocks.
Related
Since the elements of an array are stored contiguously in memory, I understand that sequentially scanning all the elements of an array would be much faster than doing so in a linked list of the same size. For the former you only have to increment some index variable and then read the value at that index, whereas for a linked list you have to read pointers and then fetch data at non-contiguous memory addresses. My question is specifically how we would categorise these two cases in terms of time complexity?
For scanning the array, does performing n random accesses, i.e. n O(1) operations, mean that overall it becomes O(n)? In that case wouldn't both be O(n)?
Sorry if this question doesn't make sense, maybe my understanding of time complexities isn't so good.
You are correct that
sequentially scanning a linked list or an array takes time O(n), and that
it's much faster to scan an array than a linked list due to locality of reference.
How can this be? The issue has to do with what you're counting with that O(n). Typically, when doing an analysis of an algorithm we assume that looking up a location in memory takes time O(1), and since in both cases you're doing O(n) memory lookups the total time is O(n).
However, the assumption that all memory lookups take the same amount of time is not a very good one in practice, especially with linked structures. You sometimes see analyses performed that do this in a different way. We assume that, somewhere in the machine, there's a cache that can hold B elements at any one time. Every time we do a memory lookup, if it's in cache, it's (essentially) free, and if it's not in cache then we do some work to load that memory address - plus the contents of memory around that location - into cache. We then only care about how many times we have to load something into the cache, since that more accurately predicts the runtime.
In the case of a linked list, where cells can be scattered randomly throughout memory, we'd expect to do Θ(n) memory transfers when scanning a linked list, since we basically will never expect to have a linked list cell already in cache. However, with an array, every time we find an array element not in cache, we pull into cache all the adjacent memory locations, which then means the next few elements will definitely be in the cache. This means that only (roughly) one in every B lookups will trigger a cache miss, so we expect to do Θ(n / B) memory transfers. This predicts theoretically what we see empirically - it's much faster to sequentially scan an array than a linked list.
So to summarize, this is really an issue of what you're measuring and how you measure it. If you just count memory accesses, both sequential accesses will require O(n) memory accesses and thus take O(n) time. However, if you just care about cache misses, then sequential access of a dynamic array requires Θ(n / B) transfers while a linked list scan requires Θ(n) transfers, which is why the linked list appears to be slower.
This sort of analysis is often used when designing data structures for databases. The B-tree (and its relative the B+-tree), which are used extensively in databases and file systems, are specifically tuned around the size of a disk page to reduce memory transfers. More recent work has been done to design "cache-oblivious" data structures that always take optimal advantage of the cache even without knowing its size.
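To see the Θ(n) versus Θ(n / B) behaviour empirically, here is a rough benchmark sketch (my own, with purely illustrative sizes; exact timings depend on the machine and on how the allocator lays out the list nodes):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <list>
#include <numeric>
#include <vector>

// Sum the same n integers stored contiguously (vector) and as linked nodes
// (list). Both scans are O(n) element visits, but the contiguous scan
// typically triggers far fewer cache-line loads.
int main() {
    const std::size_t n = 10'000'000;
    std::vector<int> vec(n, 1);
    std::list<int>   lst(n, 1);

    auto time_sum = [](const auto& container, const char* name) {
        const auto t0 = std::chrono::steady_clock::now();
        const long long sum = std::accumulate(container.begin(), container.end(), 0LL);
        const auto t1 = std::chrono::steady_clock::now();
        const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("%s: sum=%lld, %lld ms\n", name, sum, static_cast<long long>(ms));
    };

    time_sum(vec, "vector (contiguous)");  // roughly n / B cache-line loads
    time_sum(lst, "list (pointer chase)"); // up to roughly n cache-line loads
    return 0;
}
```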
Unfortunately, you misunderstood how these things work.

Sequentially scanning all array elements is O(n), n being the size of the array, since you will visit each element. You need to calculate each address and then fetch the data, n times.

Sequentially scanning all linked list elements is O(n), n being the size of the linked list, since you will visit each element through the links.

Accessing one element of an array is O(1), since the access is one memory address calculation and one fetch.

Accessing one element of a linked list is O(n), n being the position you want to access, because you need to reach the n-th element by following each link until you arrive at the desired element.
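A tiny illustration of those last two points (my own C++ sketch, not taken from the answer):

```cpp
#include <cstddef>
#include <forward_list>
#include <iterator>
#include <vector>

// arr[i] is one address computation plus one load -> O(1).
int element_in_array(const std::vector<int>& arr, std::size_t i) {
    return arr[i];
}

// Reaching the i-th node means following i links one by one -> O(i).
int element_in_list(const std::forward_list<int>& lst, std::size_t i) {
    return *std::next(lst.begin(), static_cast<std::ptrdiff_t>(i));
}
```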
Accessing the value at a certain index, let's say 500, in an array is "immediate" (O(1)), while with a linked list you must iterate over 500 nodes to get the wanted one (O(n)).

Therefore, with an array, an index at the beginning or at the end of the container is accessible at the same speed, while with a linked list, the higher the index, the longer it takes to reach it.

On the contrary, inserting a node into a linked list is easy and fast, while doing the same in an array is slower.

So the question becomes: which is the more common operation, accessing indices (writing, reading) or manipulating the container structure (inserting, removing)? The answer seems obvious, but there could be cases where it's not.
Copy of static array is SO(N) (because the space required for this operation scales linearly with N)
Insert into static array is SO(1)
'because although it needs to COPY the first array and add the new element, after the copy it frees the space of the first array' - a quote from the source I'm using to learn about standard collections' time/space complexity.
When we are dealing with time complexity, if the algorithm contains an O(N) operation, the whole time complexity of the algorithm is AT LEAST O(N).

I struggle to understand how exactly we measure space complexity. I thought we look at 'the difference in memory after the completion of the algorithm's run to determine whether it scales with N' - that would make insertion into a static array SO(1).

But then, by that method, if I have an algorithm that through the course of its run uses N! space to get a single value, and at the very end cleans up the memory allocated for those N! items, the algorithm would be Space-O(1).

Actually, every single algorithm that does not directly deal with entities that remain after its run would be Space-O(1) (since we do not need the created entities after the algorithm, we can clean up the memory at the end).

Please help me understand the situation here. I know that in real-world complexity analysis we sometimes indulge in technical hypocrisy (like claiming the Big-O (worst case) of a get from a HashTable is Time-O(1) when it's actually O(N), but rare enough for us to dismiss it, or that insertion at the end of a dynamic array is Time-O(1) when it's also O(N), but amortized analysis shows it's rare enough to claim Time-O(1)).
Is it one of these situations, when the insertion into static array is actually Space-O(N), but we take it for granted it's Space-O(1), or do I misunderstand, how space complexity works?
Space complexity simply describes how memory consumption is related to the size of input data.
In the case of SO(1), the memory consumption of a program does not depend on the size of the input. Examples of this would be "grep", "sed" and similar tools that operate on streams; for a bigger dataset they simply run longer. In Java, for example, you have the SAX XML parser, which does not build a DOM model in memory but emits events for XML elements as it sees them; thus it can work with very large XML documents even on machines with limited memory.

In the case of SO(N), the memory consumption of a program grows linearly with the size of the input. So if a program consumes 500 MB for 200 records and 700 MB for 400 records, then you know it will consume 1 GB for 700 records. Most programs actually fall into this category. For example, an email that is twice as long as some other email will consume twice as much space.
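A small sketch of that distinction for one and the same task (counting lines); the function names are mine:

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// SO(1): memory use does not grow with the size of the input stream;
// only the current line is held at any moment.
std::size_t count_lines_streaming(std::istream& in) {
    std::string line;
    std::size_t count = 0;
    while (std::getline(in, line)) ++count;
    return count;
}

// SO(N): memory use grows linearly with the number of input lines,
// because every line is kept around.
std::size_t count_lines_buffered(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) lines.push_back(line);
    return lines.size();
}
```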
Programs that consume beyond SO(N) usually deal with complex data analysis, such as machine learning, AI, or data science. There you want to study relationships between individual records, so you sometimes build higher dimensional "cubes" and you get SO(N^2), SO(N^3) etc.
So technically it's not correct to say that "insertion into a static array is SO(N)". The correct thing to say would be that "an algorithm for XYZ has space complexity SO(N)".
Time and space complexity are important when you are sizing your infrastructure. For example, you have tested your system on 50 users, but you know that in production there will be 10000 users, so you ask yourself "how much RAM, disk and CPU do I need to have in my server".
In many cases, time and space can be traded for each other. There's no better way to see this than in databases. For example, a table can have indexes which consume some additional space, but give you a huge speedup for looking up things because you don't have to search through all the records.
Another example of trading time for space would be cracking password hashes using rainbow tables. There you are essentially pre-computing some data which later saves you a lot of work, as you don't have to perform those expensive computations during cracking, but only look up the already calculated result.
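In miniature, the same space-for-time trade looks like memoisation: spend memory on a table of already-computed results so that repeated queries become lookups (a hedged sketch, with a stand-in for the expensive work):

```cpp
#include <cstdint>
#include <unordered_map>

// Stand-in for some genuinely expensive computation.
std::uint64_t expensive(std::uint64_t n) {
    std::uint64_t acc = 1;
    for (std::uint64_t i = 2; i <= n; ++i) acc *= i;
    return acc;
}

// Extra space (the table) buys O(1) repeated lookups instead of recomputation.
std::uint64_t cached(std::uint64_t n) {
    static std::unordered_map<std::uint64_t, std::uint64_t> table;
    const auto it = table.find(n);
    if (it != table.end()) return it->second;
    return table[n] = expensive(n);
}
```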
So I am doing a project that requires me to find all the anagrams in a given file. Each file has one word on each line.
What I have done so far:
1.) sort the word (using Mergesort - I think this is the best in the worst case.. right?)
2.) place it into the hashtable using a hash function
3.) if there is a collision, move to the next available space further along the array (basically going down one by one until you see an empty spot in the hashtable) (is there a better way to do this? What I am doing is linear probing.)
Problem:
When it runs out of space in the hash table, what do I do? I came up with two solutions: either scan the file before inputting anything into the hash table and create it at exactly the right size, or keep resizing the array and rehashing as it gets more and more full. I don't know which one to choose. Any tips would be helpful.
A few suggestions:
Sorting is often a good idea, and I can think of a way to use it here, but there's no advantage to sorting items if all you do afterwards is put them into a hashtable. Hashtables are designed for constant-time insertions even when the sequence of inserted items is in no particular order.
Mergesort is one of several sorting algorithms with O(n log n) worst-case complexity, which is optimal if all you can do is compare two elements to see which is smaller. If you can do other operations, like index an array, O(n) sorting can be done with radix sort -- but it's almost certainly not worth your time to investigate this (especially as you may not even need to sort at all).
If you resize a hashtable by a constant factor when it gets full (e.g. doubling or tripling the size) then you maintain constant-time inserts. (If you resize by a constant amount, your inserts will degrade to linear time per insertion, or quadratic time over all n insertions.) This might seem wasteful of memory, but if your resize factor is k, then the proportion of wasted space will never be more than (k-1)/k (e.g. 50% for doubling). So there's never any asymptotic execution-time advantage to computing the exact size beforehand, although this may be some constant factor faster (or slower).
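As a worked version of that amortised argument (standard reasoning, not something spelled out in the post): with doubling, the element copies performed across all resizes during $n$ insertions sum to

$$
1 + 2 + 4 + \dots + \frac{n}{2} \;<\; n,
$$

so the total work is at most $n$ writes plus fewer than $n$ copies, i.e. fewer than $2n$ operations, which is $O(1)$ amortised per insertion.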
There are a variety of ways to handle hash collisions that trade off execution time versus maximum usable density in different ways.
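Putting the suggestions together, here is a sketch of the whole task using a standard hash map (which already resizes by a constant factor and handles collisions internally); the only sorting is of each word's letters, to form the anagram key:

```cpp
#include <algorithm>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Group words by their "sorted letters" key: all anagrams share a key.
int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: anagrams <file>\n"; return 1; }
    std::ifstream in(argv[1]);
    std::unordered_map<std::string, std::vector<std::string>> groups;

    std::string word;
    while (in >> word) {
        std::string key = word;
        std::sort(key.begin(), key.end());   // canonical form of the word
        groups[key].push_back(word);
    }

    for (const auto& entry : groups) {
        const auto& words = entry.second;
        if (words.size() < 2) continue;      // only report actual anagram groups
        for (const auto& w : words) std::cout << w << ' ';
        std::cout << '\n';
    }
    return 0;
}
```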
I know that I can simply use a bucket array for an associative container if I have uniformly distributed integer keys or keys that can be mapped into uniformly distributed integers. If I can create the array big enough to ensure a certain load factor (which assumes the collection is not too dynamic), then the expected number of collisions for a key will be bounded, because this is simply a hash table with the identity hash function.
Edit: I view strings as equivalent to positional fractions in the range [0..1]. So they can be mapped into any integer range by multiplication and taking floor of the result.
I can also do prefix queries efficiently, just like with tries. I presume (without knowing a proof) that the expected number of empty slots corresponding to a given prefix that have to be skipped sequentially before the first bucket with at least one element is reached is also going to be bounded by a constant (again depending on the chosen load factor).
And of course, I can do stabbing queries in worst-case constant time, and range queries in solely output sensitive linear expected time (if the conjecture of denseness from the previous paragraph is indeed true).
What are the advantages of tries, then?
If the distribution is uniform, I don't see anything that tries do better. But I may be wrong.
If the distribution has a large uncompensated skew (because we had no prior probabilities, or we are just looking at the worst case), the bucket array performs poorly, but tries also become heavily imbalanced and can have linear worst-case performance with strings of arbitrary length. So the use of either structure for your data is questionable.
So my question is - what are the performance advantages of tries over bucket arrays that can be formally demonstrated? What kind of distributions elicit those advantages?
I was thinking of distributions with self-similar structure at different scales. I believe those are called fractal distributions, of which I confess to know nothing. Maybe then, if the distribution is prone to clustering at every scale, tries can provide superior performance by keeping the load factor of each node similar, adding levels in dense regions as necessary - something that bucket arrays cannot do.
Thanks
Tries are good if your strings share common prefixes. In that case, the prefix is stored only once and can be queried with linear performance in the output string length. In a bucket array, all strings with the same prefixes would end up close together in your key space, so you have very skewed load where most buckets are empty and some are huge.
More generally, tries are also good if particular patterns (e.g. the letters t and h together) occur often. If there are many such patterns, the order of the trie's tree nodes will typically be small, and little storage is wasted.
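A minimal trie sketch (my own naming) that shows the shared-prefix storage: inserting "the", "then" and "they" creates the nodes t-h-e only once, and just the suffixes branch off:

```cpp
#include <map>
#include <memory>
#include <string>

struct TrieNode {
    bool is_word = false;
    std::map<char, std::unique_ptr<TrieNode>> children;  // one child per next character
};

void insert(TrieNode& root, const std::string& s) {
    TrieNode* node = &root;
    for (char c : s) {
        auto& child = node->children[c];
        if (!child) child = std::make_unique<TrieNode>();  // created only if this prefix is new
        node = child.get();
    }
    node->is_word = true;
}
```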
One of the advantages of tries I can think of is insertion. A bucket array may need to be resized at some point, and this is an expensive operation, so the worst-case insertion time into a trie is much better than into a bucket array.

Another thing is that you need to map the string to a fraction to use it with bucket arrays. So if you have short keys, a trie can theoretically be more efficient, because you don't need to do the mapping.
I have to implement an algorithm to decompose 3D volumes into voxels. The algorithm starts by identifying which vertices are on each side of the cutting plane, and in a second step which edges traverse the cutting plane.

This process could be optimized by taking advantage of a sorted list: identifying the split point is O(log n). But I have to maintain one such sorted list per axis, and this for both vertices and edges. Since this is to be implemented for the GPU, I also have some constraints on memory management (i.e. CUDA). Intrusive lists/trees and C are imposed.

With a complete "voxelization" I expect to end up with ~4000 points and 12000 edges. Fortunately this can be optimized by using a smarter strategy to get rid of processed voxels and by ordering the cutting of residual volumes to keep their number to a minimum. In this case I would expect to have fewer than 100 points and 300 edges. This makes the process more complex to manage, but could end up being more efficient.
The question is thus to help me identify the criteria to determine when the benefit of using a sorted data structure is worth the effort and complexity overhead compared to simple intrusive linked lists.
chmike, this really sounds like the sort of thing you want to do the simpler way first and see how it behaves. Any sort of GPU voxelization approach is pretty fragile to system details once you get into big volumes, at least (which you don't seem to have). In your shoes I'd definitely want the straightforward implementation first, if for no other reason than to check against....
The question will ALWAYS boil down to which operation is most common: accessing or adding.
If you have an unordered list, adding to it takes no time, and accessing particular items takes extra time.
If you have a sorted list, adding to it takes more time, but accessing it is quicker.
Most applications spend most of their time accessing the data rather than adding to it, which means that the (running) time overhead of creating a sorted list will usually be balanced or covered by the time saved in accessing the list.

If there is a lot of churn in your data (which it doesn't sound like there is), then maintaining a sorted list isn't necessarily advisable, because you will be constantly re-sorting the list at considerable CPU cost.
The complexity of the data structures only matters if they cannot be sorted in a useful way. If they can be sorted, then you'll have to go by the heuristic of
number of accesses:number of changes
to determine if sorting is a good idea.
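As a concrete (hedged) version of that trade-off, compare a sorted and an unsorted std::vector used as the list:

```cpp
#include <algorithm>
#include <vector>

// Sorted: accesses are cheap, changes are expensive.
bool contains_sorted(const std::vector<int>& v, int x) {
    return std::binary_search(v.begin(), v.end(), x);      // O(log n) access
}
void insert_sorted(std::vector<int>& v, int x) {
    v.insert(std::lower_bound(v.begin(), v.end(), x), x);  // O(n) insertion (shifts elements)
}

// Unsorted: changes are cheap, accesses are expensive.
bool contains_unsorted(const std::vector<int>& v, int x) {
    return std::find(v.begin(), v.end(), x) != v.end();    // O(n) access
}
void insert_unsorted(std::vector<int>& v, int x) {
    v.push_back(x);                                        // amortised O(1) insertion
}
```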
After considering all the answers, I found out that the latter method, used to avoid duplicate computation, would end up being less efficient because of the effort to maintain and navigate the data structure. Besides, the initial method is straightforward to parallelize with a few small kernel routines and is thus more appropriate for a GPU implementation.

Checking back on my initial method, I also found significant optimization opportunities that leave the volume-cut method well behind.

Since I had to pick one answer, I chose devinb's because he answers the question, but Simon's comment, backed up by Tobias Warre's comment, was just as valuable to me.

Thanks to all of you for helping me sort out this issue.

Stack Overflow is an impressive service.