Linear Probing vs Chaining - c

In Algorithm Design: Foundations, Analysis, and Internet Examples by Michael T. Goodrich and Roberto Tamassia, the last paragraph of section 2.5.5, Collision-Handling Schemes, says:
These open addressing schemes save some space over the separate
chaining method, but they are not necessarily faster. In experimental
and theoretical analysis, the chaining method is either competitive or
faster than the other methods, depending upon the load factor of the
methods.
But regarding speed, a previous SO answer says the exact opposite:

Linear probing wins when the load factor α = n/m is small, that is, when the number of elements is small compared to the number of slots. But exactly the reverse happens as the load factor tends to 1: the table becomes saturated, and each operation has to travel nearly the whole table, so the expected probe length blows up (roughly proportional to 1/(1 - α)). Chaining, on the other hand, still degrades only linearly with the load factor.
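To make the comparison concrete, here is a minimal C sketch (my own illustration, not from the book or the linked answer) that tabulates the classical expected probe counts for a successful search under uniform hashing: roughly (1 + 1/(1 - α))/2 for linear probing and 1 + α/2 for separate chaining.

    #include <stdio.h>

    /* Expected probes for a successful search under uniform hashing
       (classical results, e.g. Knuth, TAOCP Vol. 3). */
    static double linear_probing(double a)    { return 0.5 * (1.0 + 1.0 / (1.0 - a)); }
    static double separate_chaining(double a) { return 1.0 + a / 2.0; }

    int main(void) {
        const double loads[] = {0.25, 0.50, 0.75, 0.90, 0.99};
        printf("alpha   linear probing   chaining\n");
        for (int i = 0; i < 5; i++)
            printf("%.2f    %14.2f   %8.2f\n",
                   loads[i], linear_probing(loads[i]), separate_chaining(loads[i]));
        return 0;
    }

At α = 0.5 the two are close (about 1.5 vs 1.25 probes), while at α = 0.99 linear probing needs about 50 probes and chaining about 1.5, which matches the point made above.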

Related

Size of the hash table

Let the size of the hash table be static (I set it once). I want to set it according to the number of entries. Searching yielded that the size should be a prime number equal to 2*N (the closest prime number, I guess), where N is the number of entries.
For simplicity, assume that the hash table will not accept any new entries and won't delete any.
The number of entries will be 200, 2000, 20000 and 2000000.
However, setting the size to 2*N seems like too much to me. Isn't it? Why? If it is, which size should I pick?
I understand that we would like to avoid collisions. Also I understand that maybe there is no such thing as ideal size for the hash table, but I am looking for a starting point.
I'm using C and I want to build my own structure, to educate myself.
the size should be a prime number and equal to 2*N (the closest prime number I guess), where N is the number of entries.
It certainly doesn't have to be. This recommendation probably just means that a load factor of 0.5 is a good trade-off, at least by default.
As for the primality of the size, it depends on the collision-resolution algorithm you choose. Some algorithms require a prime table size (double hashing, quadratic probing); others don't, and can benefit from a power-of-two table size, because it allows very cheap modulo operations. However, when the closest "available" table sizes differ by a factor of two, the memory usage of the hash table may be hard to predict. So even with linear probing or separate chaining you can choose a non-power-of-two size. In that case it is worth choosing a prime size in particular, because:
If you pick a prime table size (either because the algorithm requires it, or because you are not satisfied with the memory-usage unpredictability implied by a power-of-two size), the slot computation (modulo by the table size) can be combined with the hashing itself. See this answer for more.
The claim that a power-of-two table size is undesirable when the hash function distribution is bad (from the answer by Neil Coffey) is impractical, because even if you have a bad hash function, avalanching it and still using a power-of-two size would be faster than switching to a prime table size: a single integer division is still slower on modern CPUs than the several multiplications and shift operations required by a good avalanching function, e.g. from MurmurHash3.
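As an illustration of that last point, here is a minimal sketch (my own example, assuming 32-bit hash values) that uses MurmurHash3's finalizer to avalanche a weak hash and then masks it down to a power-of-two table size:

    #include <stdint.h>

    /* MurmurHash3's 32-bit finalizer ("fmix32"): a few multiplications
       and shifts that avalanche the bits of h. */
    static uint32_t fmix32(uint32_t h) {
        h ^= h >> 16;
        h *= 0x85ebca6b;
        h ^= h >> 13;
        h *= 0xc2b2ae35;
        h ^= h >> 16;
        return h;
    }

    /* Slot index for a power-of-two table: the modulo is a cheap bitwise AND. */
    static uint32_t slot(uint32_t raw_hash, uint32_t table_size_pow2) {
        return fmix32(raw_hash) & (table_size_pow2 - 1);
    }

The mask only works because table_size_pow2 is a power of two; with a prime size you would need an actual division (the % operator) instead.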
The entries will be 200, 2000, 20000 and 2000000.
I don't understand what you mean by this.
However, setting the size to 2*N seems too much to me. It isn't? Why? If it is, which is the size I should pick?
The general rule is called the space-time trade-off: the more memory you allocate for the hash table, the faster the hash table operates. Here you can find some charts illustrating this. So if you think that by choosing a table size of ~2*N you would waste memory, you can freely choose a smaller size, but be ready for operations on the hash table to become slower on average.
I understand that we would like to avoid collisions. Also I understand that maybe there is no such thing as ideal size for the hash table, but I am looking for a starting point.
It's impossible to avoid collisions completely (remember the birthday paradox? :) A certain ratio of collisions is an ordinary situation. That ratio only affects the average operation speed; see the previous section.
The answer to your question depends somewhat on the quality of your hash function. If you have a good quality hash function (i.e. one where on average, the bits of the hash code will be "distributed evenly"), then:
the necessity to have a prime number of buckets disappears;
you can expect the number of items per bucket to be Poisson distributed.
So firstly, the advice to use a prime number of buckets is essentially a kludge to help alleviate situations where you have a poor hash function. Provided that you have a good-quality hash function, there aren't really any constraints per se on the number of buckets, and one common choice is to use a power of two so that the modulo is just a bitwise AND (though either way, it's not crucial nowadays). A good hash table implementation will include a secondary hash to try to alleviate the situation where the original hash function is of poor quality; see the source code to Java's HashTable for an example.
A common load factor is 0.75 (i.e. you have 100 buckets for every 75 entries). By the Poisson approximation this works out to roughly half the buckets being empty and most occupied buckets holding just a single entry, so it's good performance-wise, though of course the empty buckets also waste some space. What the "correct" load factor is for you depends on the time/space trade-off that you want to make.
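To see where those proportions come from, here is a minimal sketch (my own illustration) that tabulates the Poisson probabilities P(k) = λ^k e^(−λ) / k! for λ = 0.75, i.e. the expected fraction of buckets holding exactly k entries at that load factor:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        const double lambda = 0.75;  /* average entries per bucket (the load factor) */
        double p = exp(-lambda);     /* P(k = 0) = e^-lambda */
        for (int k = 0; k <= 4; k++) {
            printf("P(bucket holds %d entries) = %.3f\n", k, p);
            p *= lambda / (k + 1);   /* P(k+1) = P(k) * lambda / (k+1) */
        }
        return 0;
    }

Compiled with -lm, this prints roughly 0.47 for empty buckets, 0.35 for exactly one entry, and about 0.17 in total for two or more.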
In very high-performance applications, a potential design consideration is also how you actually organise the structure/buckets in memory to maximise CPU cache performance. (The answer to what is the "best" structure is essentially "the one that performs best in your experiments with your data".)

Is it wrong to account for malloc in an amortized analysis of a dynamic array?

I had points docked on a homework assignment for calculating the wrong total cost in an amortized analysis of a dynamic array. I think the grader probably only looked at the total and not the steps I had taken, and I think I accounted for malloc and their answer key did not.
Here is a section of my analysis:
The example we were shown did not account for malloc, but I saw a video that did, and it made a lot of sense, so I put it in there. I realize that although malloc is a relatively costly operation, it would probably be O(1) here, so I could have left it out.
But my question is this: Is there only 1 way to calculate cost when doing this type of analysis? Is there an objective right and wrong cost, or is the conclusion drawn what really matters?
You asked, "Is there only 1 way to calculate cost when doing this type of analysis?" The answer is no.
These analyses are on mathematical models of machines, not real ones. When we say things like "appending to a resizable array is O(1) amortized", we are abstracting away the costs of various procedures needed in the algorithm. The motivation is to be able to compare algorithms even when you and I own different machines.
In addition to different physical machines, however, there are also different models of machines. For instance, some models don't allow integers to be multiplied in constant time. Some models allow variables to be real numbers with infinite precision. In some models all computation is free and the only cost tracked is the latency of fetching data from memory.
As hardware evolves, computer scientists make arguments for new models to be used in the analysis of algorithms. See, for instance, the work of Tomasz Jurkiewicz, including "The Cost of Address Translation".
It sounds like your model included a concrete cost for malloc. That is neither wrong nor right. It might be a more accurate model on your computer and a less accurate one on the grader's.
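For concreteness, here is a minimal C sketch (my own example, not from the assignment) of the doubling dynamic array that such an analysis usually concerns; whether you also charge a unit cost for the allocation call itself, as the question did for malloc, is exactly the kind of modelling choice described above:

    #include <stdlib.h>

    typedef struct {
        int   *data;
        size_t len;   /* elements in use */
        size_t cap;   /* allocated slots */
    } dynarray;

    /* Append one element, doubling the capacity when full. Copying n elements
       on a resize is the expensive step that the amortized argument spreads
       over the preceding n cheap appends. Returns 0 on success, -1 on failure. */
    static int push(dynarray *a, int value) {
        if (a->len == a->cap) {
            size_t new_cap = a->cap ? a->cap * 2 : 1;
            int *p = realloc(a->data, new_cap * sizeof *p);
            if (!p) return -1;   /* the allocation call may or may not carry a
                                    cost of its own, depending on the model */
            a->data = p;
            a->cap  = new_cap;
        }
        a->data[a->len++] = value;
        return 0;
    }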

Time Complexity of Hash Tables in C

I'm fairly new to the concept of hash tables, and I've been reading up on different types of hash table lookup and insertion techniques.
I'm wondering what the difference is between the time complexities of linear probing, chaining, and quadratic probing?
I'm mainly interested in the insertion, deletion, and search of nodes in the hash table. So if I graph the system time per operation (insert/search/delete) against the number of operations performed, how would the graphs differ?
I'm guessing that:
- quadratic probing would be worst-case O(nlogn) or O(logn) for searching
- linear probing would be worst-case O(n) for search
- Not sure but I think O(n^2) for chaining
Could someone confirm this? Thanks!
It's actually surprisingly difficult to accurately analyze all of these different hashing schemes for a variety of reasons. First, unless you make very strong assumptions on your hash function, it is difficult to analyze the behavior of these hashing schemes accurately, because different types of hash functions can lead to different performances. Second, the interactions with processor caches mean that certain types of hash tables that are slightly worse in theory can actually outperform hash tables that are slightly better in theory because their access patterns are better.
If you assume that your hash function looks like a truly random function, and if you keep the load factor in the hash table to be at most a constant, all of these hashing schemes have expected O(1) lookup times. In other words, each scheme, on expectation, only requires you to do a constant number of lookups to find any particular element.
In theory, linear probing is a bit worse than quadratic hashing and chained hashing, because elements tend to cluster near one another unless the hash function has strong theoretical properties. However, in practice it can be extremely fast because of locality of reference: all of the lookups tend to be close to one another, so fewer cache misses occur. Quadratic probing has fewer collisions, but doesn't have as good locality. Chained hashing tends to have extremely few collisions, but tends to have poorer locality of reference because the chained elements are often not stored contiguously.
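To make the locality point concrete, here is a minimal sketch (my own illustration, assuming a power-of-two table of string keys where NULL marks an empty slot) of a linear-probing lookup; the probe sequence walks adjacent slots, which is what keeps it cache-friendly:

    #include <stddef.h>
    #include <string.h>

    #define TABLE_SIZE 1024                 /* power of two, for cheap masking */

    extern const char *slots[TABLE_SIZE];   /* NULL means "empty" */

    /* Returns the slot index of key, or -1 if it is absent. Each step moves
       to the next adjacent slot, so a short probe sequence usually stays
       within a few cache lines. */
    static long lookup(const char *key, unsigned long hash) {
        for (size_t i = 0; i < TABLE_SIZE; i++) {
            size_t idx = (hash + i) & (TABLE_SIZE - 1);
            if (slots[idx] == NULL)
                return -1;                  /* reached an empty slot: not present */
            if (strcmp(slots[idx], key) == 0)
                return (long)idx;
        }
        return -1;                          /* table full and key not found */
    }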
In the worst case, all of these data structures can take O(n) time to do a lookup. It's extremely unlikely for this to occur. In linear probing, this would require all the elements to be stored contiguously with no gaps and you would have to look up one of the first elements. With quadratic hashing, you'd have to have a very strange-looking set of buckets in order to get this behavior. With chained hashing, your hash function would have to dump every single element into the same bucket to get the absolute worst-case behavior. All of these are exponentially unlikely.
In short, if you pick any of these data structures, you are unlikely to get seriously burned unless you have a very bad hash function. I would suggest using chained hashing as a default since it has good performance and doesn't hit worst-case behavior often. If you know you have a good hash function, or have a small hash table, then linear probing might be a good option.
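If you do go with chained hashing as the default, its core is just an array of singly linked lists; here is a minimal sketch (my own illustration, using the same hypothetical string keys):

    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 1024

    struct node {
        const char  *key;
        struct node *next;
    };

    static struct node *buckets[NBUCKETS];

    /* Insert a key at the head of its bucket (no duplicate check, for brevity). */
    static int chain_insert(const char *key, unsigned long hash) {
        struct node *n = malloc(sizeof *n);
        if (!n) return -1;
        n->key  = key;
        n->next = buckets[hash % NBUCKETS];
        buckets[hash % NBUCKETS] = n;
        return 0;
    }

    /* Search a key: only the one bucket selected by the hash is walked. */
    static struct node *chain_find(const char *key, unsigned long hash) {
        for (struct node *n = buckets[hash % NBUCKETS]; n; n = n->next)
            if (strcmp(n->key, key) == 0)
                return n;
        return NULL;
    }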
Hope this helps!

When to switch from unordered lists to sorted lists? [optimization]

I have to implement an algorithm to decompose 3D volumes into voxels. The algorithm starts by identifying which vertices are on each side of the cutting plane and, in a second step, which edges cross the cutting plane.
This process could be optimized by using sorted lists: identifying the split point is O(log n). But I would have to maintain one such sorted list per axis, both for vertices and for edges. Since this is to be implemented for a GPU, I also have some constraints on memory management (i.e. CUDA). Intrusive lists/trees and C are imposed.
With a complete "voxelization" I expect to end up with ~4000 points and 12000 edges. Fortunately this can be optimized by using a smarter strategy to get rid of processed voxels and to order the cutting of residual volumes so as to keep their number to a minimum. In that case I would expect to have fewer than 100 points and 300 edges. This makes the process more complex to manage, but it could end up being more efficient.
The question is thus to help me identify the criteria for determining when the benefit of a sorted data structure is worth the effort and complexity overhead compared to simple intrusive linked lists.
chmike, this really sounds like the sort of thing you want to do the simpler way first and see how it behaves. Any sort of GPU voxelization approach is pretty fragile to system details once you get into big volumes, at least (which you don't seem to have). In your shoes I'd definitely want the straightforward implementation first, if for no other reason than to check against...
The question will ALWAYS boil down to which operation is most common: accessing or adding.
If you have an unordered list, adding to it takes no time, and accessing particular items takes extra time.
If you have a sorted list, adding to it takes more time, but accessing it is quicker.
Most applications spend most of their time accessing the data rather than adding to it, which means that the (running-time) overhead of maintaining a sorted list is usually balanced or outweighed by the time saved when accessing the list.
If there is a lot of churn in your data (which it doesn't sound like there is), then maintaining a sorted list isn't necessarily advisable, because you will be constantly re-sorting the list at considerable CPU cost.
The complexity of the data structures only matters if they cannot be sorted in a useful way. If they can be sorted, then you'll have to go by the heuristic of
number of accesses:number of changes
to determine if sorting is a good idea.
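As a rough illustration of that heuristic (my own sketch, not from the answers above): an unsorted array costs nothing extra on insertion but O(n) per search, while a sorted array costs O(n) per insertion (shifting elements) but only O(log n) per search, so the accesses:changes ratio decides which total is smaller.

    #include <string.h>

    /* Unsorted: append in O(1), search by linear scan in O(n). */
    static int find_unsorted(const int *a, int n, int key) {
        for (int i = 0; i < n; i++)
            if (a[i] == key) return i;
        return -1;
    }

    /* Sorted: insert in O(n) by shifting the tail (caller ensures capacity). */
    static void insert_sorted(int *a, int *n, int key) {
        int i = *n;
        while (i > 0 && a[i - 1] > key) i--;    /* find the insertion point */
        memmove(&a[i + 1], &a[i], (size_t)(*n - i) * sizeof a[0]);
        a[i] = key;
        (*n)++;
    }

    /* Sorted: search by binary search in O(log n). */
    static int find_sorted(const int *a, int n, int key) {
        int lo = 0, hi = n - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            if (a[mid] == key) return mid;
            if (a[mid] < key) lo = mid + 1; else hi = mid - 1;
        }
        return -1;
    }

Summing the per-operation costs over the expected number of accesses and changes for each variant is what gives the ratio above a concrete form.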
After considering all the answers, I found that the latter method, intended to avoid duplicate computation, would end up being less efficient because of the effort required to maintain and navigate the data structure. Besides, the initial method is straightforward to parallelize with a few small kernel routines and is thus more appropriate for a GPU implementation.
Checking back on my initial method, I also found significant optimization opportunities that leave the volume-cut method well behind.
Since I had to pick one answer I chose devinb's, because it answers the question, but Simon's comment, backed up by Tobias Warre's comment, was just as valuable to me.
Thanks to all of you for helping me sort out this issue.
Stack Overflow is an impressive service.

How do you measure SQL Fill Factor value

Usually when I'm creating indexes on tables, I set the Fill Factor based on an educated guess of how the table will be used (many reads or many writes).
Is there a more scientific way to determine a more accurate Fill Factor value?
You could try running a big list of realistic operations and looking at IO queues for the different actions.
There are a lot of variables that govern it, such as the size of each row and the number of writes vs reads.
Basically: high fill factor = quicker read, low = quicker write.
However it's not quite that simple, as almost all writes will be to a subset of rows that need to be looked up first.
For instance: set a fill factor to 10% and each single-row update will take 10 times as long to find the row it's changing, even though a page split would then be very unlikely.
Generally you see fill factors from 70% (very high write) to 95% (very high read).
It's a bit of an art form.
I find that a good way of thinking of fill factors is as pages in an address book - the more tightly you pack the addresses the harder it is to change them, but the slimmer the book. I think I explained it better on my blog.
I would tend to be of the opinion that if you're after performance improvements, your time is much better spent elsewhere, tweaking your schema, optimising your queries and ensuring good index coverage. Fill factor is one of those things that you only need to worry about when you know that everything else in your system is optimal. I don't know anyone that can say that.
