Constraint Programming CP Optimizer - How to model a cumulative capacity constraint

I am new to the CP Optimizer (from ILOG), and am trying to find the right keywords to get me on a track to modelling cumulative capacity constraints.
For example, I have two tasks:
A, which lasts for 6 days and contributes 5 units per day to a cumulative capacity constraint
B, which lasts for 3 days and contributes 10 units per day to a cumulative capacity constraint
I would like to put in a cumulative capacity constraint over an interval. For example, a constraint from days 0,...,4 with a maximum cumulative value of 50.
Scheduling A and B to both start at day 0 would yield a cumulative contribution of:
A: 5 + 5 + 5 + 5 + 5 = 25 (with the remaining 5 outside of this interval not hitting this constraint)
B: 10 + 10 + 10 = 30
This would exceed the cumulative capacity of 50 over that timespan.
However, if I move the intervals such that A starts on day 1 (rather than 0) and B starts on day 0, then
A: 5 + 5 + 5 + 5 = 20 (with the remaining 5+5 outside of this interval not hitting this constraint)
B: 10 + 10 + 10 = 30
This schedule would satisfy this constraint.
Can someone please help me find the right functions / words to solve this? I realize this could be trivial for experts in CP. I don't necessarily need full code, just pointing me in the right direction would be extremely helpful!
I have tried to use StepAtEnd for A and B; however, this lets task A slide completely under the capacity of 50 units. StepAtStart will push A or B outside of the constraint window. I understand that an alternative approach would be to put a constraint of 10 units per day, but I am intentionally trying to model this as a cumulative constraint (in the bigger picture, letting daily usage flex above 10 while ultimately staying under 50 cumulatively over a specific window).

For computing the integral of a cumul function over the period T, you can do as follows.
Let f be the cumul function (in your case pulse(A,5) + pulse(B,10)).
Let the period T = [t0, t0+k).
Let z_i be a fixed interval of size 1, starting at t0+i, for i in 0..k-1.
(the z_i's cover T)
Let g be a new cumul function
defined by: g=sum(pulse(z_i, 0, MAX))
and constrained by: g+f==MAX.
Then F, the integral of f over the period T, is:
F = k*MAX - sum(heightAtStart(z_i, g))
And for your problem, you just need to constrain F to be less than or equal to 50.
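A rough sketch of this construction in the docplex.cp Python API (the same API used in the answer further below). The window, the value of MAX (any upper bound on the total daily rate; here 5 + 10 = 15), and the variable names are illustrative assumptions, not tested code:
from docplex.cp.model import CpoModel

m = CpoModel()
t0, k, MAX = 0, 5, 15
A = m.interval_var(size=6, name='A')
B = m.interval_var(size=3, name='B')
f = m.pulse(A, 5) + m.pulse(B, 10)

# Unit-size fixed intervals z_i covering the window, each carrying a variable-height pulse
z = [m.interval_var(start=t0 + i, size=1, name='z_%d' % i) for i in range(k)]
g = sum(m.pulse(zi, (0, MAX)) for zi in z)

# Force f + g == MAX at every point of the window [t0, t0+k)
m.add(m.always_in(f + g, (t0, t0 + k), MAX, MAX))

# F, the integral of f over the window, constrained to at most 50
F = k * MAX - sum(m.height_at_start(zi, g) for zi in z)
m.add(F <= 50)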

In scheduling, the concept of a cumulative constraint is usually reserved for resources that have a maximum usage at each time index, as you talk about in the final sentence.
What you are talking about seems to be more of a sliding sum constraint, which limits the sum of variables in any contiguous subsequence of a certain length.

If you want to reason about the daily contributions, you can model each task as a chain of consecutive intervals, each a single day long and each contributing its daily amount at the start (or end) of the day. For example, using the Python API:
from docplex.cp.model import CpoModel
m = CpoModel()
cumul_contribs = []
# Task A
task_a = m.interval_var_list(6, start=(0,100), end=(0,100), size=1, name='a')
for (ivl1, ivl2) in zip(task_a, task_a[1:]):
    m.add(m.end_at_start(ivl1, ivl2))
for ivl in task_a:
    cumul_contribs.append(m.step_at_start(ivl, 5))
# Task B
task_b = m.interval_var_list(3, start=(0,100), end=(0,100), size=1, name='b')
for (ivl1, ivl2) in zip(task_b, task_b[1:]):
    m.add(m.end_at_start(ivl1, ivl2))
for ivl in task_b:
    cumul_contribs.append(m.step_at_start(ivl, 10))
# Constrain the cumulative resource
m.add(m.always_in(sum(cumul_contribs), (0, 5), 0, 50))
msol = m.solve()
msol.write()


Why does this code generate random duplicates?

Let me start by saying I'm new to python/pyspark
I've got a dataframe of 100 items. I'm slicing that up into batches of 25, then for each batch I need to do work on each row. I'm getting duplicate values in the last "do work" step. I've verified that my original list does not contain duplicates and that my slice step generates 4 distinct lists.
batchsize = 25
sliced = []
emailLog = []
for i in range(1, bc_df.count(), batchsize):
    sliced.append({"slice": bc_df.filter(bc_df.Index >= i).limit(batchsize).rdd.collect()})
for s in sliced:
    for r in s['slice']:
        emailLog.append({"email": r['emailAddress']})
re = sc.parallelize(emailLog)
re_df = sqlContext.createDataFrame(re)
re_df.createOrReplaceTempView('email_logView')
%sql
select count(distinct(email)) from email_logView
My expectation is to have 100 distinct email addresses, but I sometimes get 75, 52, 96, or 100.
Your issue is caused by this line because it is not deterministic and allows duplicates:
sliced.append({"slice":bc_df.filter(bc_df.Index >= i).limit(batchsize).rdd.collect()})
Let's take a closer look at what is happening (I assume that the index column ranges from 1 to 100).
Your range function generates four values for i (1,26,51 and 76).
During the first iteration you request all rows whose index is 1 or greater (i.e. [1,100]) and take 25 of them.
During the second iteration you request all rows whose index is 26 or greater (i.e. [26,100]) and take 25 of them.
During the third iteration you request all rows whose index is 51 or greater (i.e. [51,100]) and take 25 of them.
During the fourth iteration you request all rows whose index is 76 or greater (i.e. [76,100]) and take 25 of them.
You already see that the intervals are overlapping. That means that the email addresses of an iteration could also have been taken by previous iterations.
You can fix this by simply extending your filter with an upper limit. For example:
sliced.append({"slice":bc_df.filter((bc_df.Index >= i) & (bc_df.Index < i + batchsize)).rdd.collect()})
That is just a quick fix to solve your problem. As general advice, I recommend avoiding .collect() as much as possible because it does not scale horizontally.
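As an illustration of that advice, here is a hedged sketch (not part of the original fix): the batch number can be derived from the Index column with a column expression, so the rows never need to be collected to the driver.
from pyspark.sql import functions as F

# Assign each row to a batch of `batchsize` based on its Index (assumes Index runs from 1 to 100)
batched_df = bc_df.withColumn("batch", F.floor((F.col("Index") - 1) / batchsize))

# Per-batch work can then be expressed as grouped, distributed operations, e.g.:
batched_df.groupBy("batch").count().show()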

Rank and unrank Combination with constraints

I want to rank and unrank combinations with an element distance constraint. Selected elements cannot be repeated.
For example:
n := 10 elements choose from
k := 5 elements being choosen
d := 3 max distance between 2 choosen elements
1,3,5,8,9 matches the constraint
1,5,6,7,8 does not match the constraint
How can I rank combinations under the given distance constraint, so that 1,2,3,4,5 is smaller than 1,2,3,4,6? Is there a way to do the ranking without computing all combinations of smaller rank?
You can do this by first creating and populating a two-dimensional array, which I will call NVT for "number of valid tails", to record the number of valid "tails" that start at a certain position with a given value. For example, NVT[4][6] = 3, because a combination that has 6 in position #4 can end in 3 distinct ways: (…, 6, 7), (…, 6, 8), and (…, 6, 9).
To populate NVT, start with NVT[k][…], which is just a row of all 1s. Then work your way back to earlier positions; for example, NVT[2][5] = NVT[3][6] + NVT[3][7] + NVT[3][8], because a "tail" starting at position #2 with value 5 will consist of that 5 plus a "tail" starting at position #3 with value 6, 7, or 8.
Note that we don't care whether there's actually a valid way to reach a given tail. For example, NVT[4][1] = 3 because of the valid tails (1, 2), (1, 3), and (1, 4), even though there are no complete combinations of the form (…, 1, _).
Once you've done that, you can compute the rank of a combination C as follows:
For position #1, count up all the valid tails starting at position #1 with a value less than C[1]. For example, if C starts with 3, then this will be NVT[1][1] + NVT[1][2], representing all combinations that start with 1 or 2.
Then do the same for each subsequent position #p, but only over the values that could legally follow the shared prefix, i.e. values strictly between C[p-1] and C[p]. (Values at or below C[p-1] are excluded because no increasing combination with that prefix can use them.) These represent combinations that agree with C up to position #p-1 but have a lesser value at position #p.
For example, if C is (1, 3, 5, 8, 9), this comes out to
0 +
NVT[2][2] +
NVT[3][4] +
(NVT[4][6] + NVT[4][7]) +
0
= 26 + 9 + (3 + 3) + 0 = 41, i.e. 41 valid combinations precede (1, 3, 5, 8, 9).
Conversely, you can find the combination C with a given rank r as follows:
Create a temporary variable rr, for "remaining rank", initially equal to r.
To find C[1] — the value in position #1 — count up valid tails starting at position #1, starting with the least possible value (namely 1), stopping once this would exceed rr. For example, since NVT[1][1] = 66 and NVT[1][2] = 50, the combination with rank 75 must start with 2 (because 75 ≥ 66 and 75 < 66 + 50). Then subtract this sum from rr (in this case leaving 75 − 66 = 9).
Then do the same for all subsequent positions, making sure to keep in mind the least possible value given what was in the previous position. Continuing our example with r = 75, C[1] = 2, and rr = 9, we know that C[2] ≥ 3; and since NVT[2][3] = 23 > rr, we immediately find that C[2] = 3.
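A compact Python sketch of the above (an illustration under the assumptions stated here: values run from 1 to n, consecutive chosen values differ by at most d, ranks are zero-based, and all function and variable names are mine):
def build_nvt(n, k, d):
    # nvt[pos][val] = number of valid tails starting at position pos with value val
    nvt = [[0] * (n + 1) for _ in range(k + 1)]
    for v in range(1, n + 1):
        nvt[k][v] = 1                     # a tail of length 1 is always valid
    for pos in range(k - 1, 0, -1):
        for v in range(1, n + 1):
            nvt[pos][v] = sum(nvt[pos + 1][w] for w in range(v + 1, min(v + d, n) + 1))
    return nvt

def rank(comb, nvt):
    # Zero-based rank: number of valid combinations lexicographically smaller than comb
    r, prev = 0, 0
    for pos, v in enumerate(comb, start=1):
        lo = 1 if pos == 1 else prev + 1  # smallest value allowed at this position
        r += sum(nvt[pos][w] for w in range(lo, v))
        prev = v
    return r

def unrank(r, nvt, n, k, d):
    # Inverse of rank: the combination whose zero-based rank is r
    comb, prev = [], 0
    for pos in range(1, k + 1):
        lo = 1 if pos == 1 else prev + 1
        hi = n if pos == 1 else min(prev + d, n)
        for v in range(lo, hi + 1):
            if r < nvt[pos][v]:           # the remaining rank falls inside this value's block
                comb.append(v)
                prev = v
                break
            r -= nvt[pos][v]
    return comb

nvt = build_nvt(10, 5, 3)
print(rank((1, 3, 5, 8, 9), nvt))   # 41
print(unrank(75, nvt, 10, 5, 3))    # [2, 3, 5, 6, 7]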
Complexity analysis:
Space:
NVT requires O(nk) space.
Returning a combination as a length-k array inherently requires O(k) space; but if we return the combination one value at a time (by invoking a callback or printing to a console or something), then the computation itself doesn't actually depend on this array, and only requires O(1) extra space.
Aside from that, everything else can be managed in O(1) space; we don't need any recursion or temporary arrays or anything.
Time:
Populating NVT takes O(nkd) time. (Note: if d is greater than n, then we can just set d equal to n.)
Given NVT, computing the rank of a given combination takes worst-case O(nk) time.
Given NVT, computing the combination with a given rank takes worst-case O(nk) time.
Implementation note: The details of the above are a bit fiddly; it would be easy to get an off-by-one error, or mix up two variables, or whatnot, if you don't have concrete data to look at. Since there are only 168 valid combinations for your example, I recommend generating all of them, so that you can reference them during debugging.
Possible additional optimization: If you expect n to be quite large, and you expect to do a lot of queries to "rank" and "unrank" combinations, then you might find it useful to create a second array, which I will call NVTLT for "number of valid tails less than", to record the number of valid "tails" that start at a certain position with a value less than a given value. For example, NVTLT[3][5] = NVT[3][1] + NVT[3][2] + NVT[3][3] + NVT[3][4], or if you prefer, NVTLT[3][5] = NVTLT[3][4] + NVT[3][4]. (You can do this as an in-place transformation, completely overwriting NVT, so it's an O(nk) pass with no additional space.) Using NVTLT instead of NVT for your queries will let you do binary search for values, rather than linear search, giving worst-case O(k log n) time instead of O(nk) time. Note that this optimization is even trickier than the above, so even if you intend to perform this optimization, I recommend starting with the above, getting it working perfectly, and only then adding this optimization.
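Building on the sketch above, the in-place NVTLT transformation could look roughly like this (again just an illustration; the bisect-based replacement for the linear scan is only indicated in a comment):
import bisect

def to_nvtlt(nvt):
    # In-place prefix sums: afterwards nvt[pos][v] holds the number of valid
    # tails at position pos whose value is strictly less than v
    for row in nvt:
        total = 0
        for v in range(len(row)):
            row[v], total = total, total + row[v]
    return nvt

# Inside unrank, the linear scan over values lo..hi can then become roughly:
#   v = bisect.bisect_right(nvtlt[pos], r + nvtlt[pos][lo], lo, hi + 1) - 1
#   r -= nvtlt[pos][v] - nvtlt[pos][lo]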

Query on an array

Assume that I have an array A = {a, b, c, d, e, f, g, h, ...} and Q queries. In each query I will be asked to do one of the following operations:
1 i j -> increase the i-th element by 1 and decrease the j-th element by 1
2 x -> tell the number of elements of the array which are less than x
If there were no update operations I could have done this with lower_bound. I can still do it by sorting the array and finding the lower bound, but the complexity will be too high since the size of the array A and the number of queries Q can both be 10^5. Is there any faster algorithm or way to do this?
The simplest way is to use std::count_if.
What complexity bound do you have to meet? (10^5)^2 is still only 10^10.
If you have to do better than that, I suspect you have to have a "value" which has back pointers to the "index", and an "index" which is a pointer to the value. Sort the values initially, and then when you update, move the value to the right point. (Probably best to see if the value needs to move at all before searching).
Then the query is still a lower bound operation.
Once you sort the array (O(n log n) complexity), a query "LESS(X)" will run in log n time, since you can use binary search. Once you know that element X (or the next larger element in A) is found at position k (counting from zero), you know that k is your answer (k elements are less than X).
The (i, j) command implies a partial reorder of the array between the element which is immediately less than min(A[i]+1, A[j]-1) and the one which is immediately after max(A[i], A[j]). You can find both of these in log n time, worst case log n + n; the following is close to the worst case:
k    0   1   2   3   4   5   6   7   8   9      command: (4, 5)
v    7  14  14  15  15  15  16  16  16  18
                     ^   ^
            becomes 16   becomes 14 -- does it go before index 3 or before index 1?
The re-sort is then worst case n, since your array is already almost sorted except for two elements, which means you'll do well by using two runs of insertion sort.
So with m update queries and q simple queries you can expect to have
n log n + m*2*(log n + 2*n) + q * log n
complexity. Average case (no pathological arrays, reasonable sparseness, no pathological updates, (j-i) = d << n) will be
( n + 2m + q ) * log n + 2m*d
which is linearithmic. With n = m = q = 10^5, you get an overall complexity which is still below 10^7 unless you've got pathological arrays and ad hoc queries, in which case the complexity should be quadratic (or maybe even cubic; I haven't examined it closely).
In a real world scenario, you can also conceivably employ some tricks. Remember the last values of the modified indexes of i and j, and the last location query k. This costs little. Now on the next query, chances are that you will be able to use one of the three values to prime your binary search and shave some time.
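For concreteness, here is a rough Python sketch of the "keep a sorted copy and repair it after each update" idea described above (the class and names are mine; bisect.insort and list.pop shift elements, which corresponds to the linear worst case discussed above):
import bisect

class ArrayWithQueries:
    def __init__(self, a):
        self.a = list(a)            # values by index
        self.sorted_a = sorted(a)   # the same values, kept sorted

    def _move(self, old, new):
        # Remove one occurrence of old and insert new, keeping sorted order
        self.sorted_a.pop(bisect.bisect_left(self.sorted_a, old))
        bisect.insort(self.sorted_a, new)

    def update(self, i, j):
        # Operation "1 i j": increase A[i] by 1 and decrease A[j] by 1
        self._move(self.a[i], self.a[i] + 1); self.a[i] += 1
        self._move(self.a[j], self.a[j] - 1); self.a[j] -= 1

    def count_less_than(self, x):
        # Operation "2 x": number of elements strictly less than x
        return bisect.bisect_left(self.sorted_a, x)

q = ArrayWithQueries([7, 14, 14, 15, 15, 15, 16, 16, 16, 18])
q.update(4, 5)                 # A[4] -> 16, A[5] -> 14 (0-based indices)
print(q.count_less_than(15))   # 4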

How do I check to see if two (or more) elements of an array/vector are the same?

For one of my homework problems, we had to write a function that creates an array containing n random numbers between 1 and 365. (Done). Then, check if any of these n birthdays are identical. Is there a shorter way to do this than doing several loops or several logical expressions?
Thank you!
CODE SO FAR, NOT DONE YET!!
function [prob] = bdayprob(N, n)
N = input('Please enter the number of experiments performed: N = ');
n = input('Please enter the sample size: n = ');
count = 0;
for(i=1:n)
    x(i) = randi(365);
    if(x(i) == x)
        count = count + 1;
    end
end
return
If I'm interpreting your question properly, you want to check to see if generating n integers or days results in n unique numbers. Given your current knowledge in MATLAB, it's as simple as doing:
n = 30; %// Define sample size
N = 10; %// Define number of trials
%// Define logical array where each location tells you whether
%// birthdays were repeated for a trial
check = false(1, N);
%// For each trial...
for idx = 1 : N
    %// Generate sample size random numbers
    days = randi(365, n, 1);
    %// Check to see if the total number of unique birthdays
    %// are equal to the sample size
    check(idx) = numel(unique(days)) == n;
end
Whoa! Let's go through the code slowly, shall we? We first define the sample size and the number of trials. We then specify a logical array where each location tells you whether or not there were repeated birthdays generated for that trial. Now, we start a loop where for each trial we generate a vector of n (the sample size) random numbers from 1 to 365. We then use unique to figure out all unique integers that were generated from this random generation. If all of the birthdays are unique, then the total number of unique birthdays generated should equal the sample size. If it doesn't, then we have repeats. For example, if we generated a sample of [1 1 1 2 2], the output of unique would be [1 2], and the total number of unique elements is 2. Since this doesn't equal 5, the sample size, we know that the birthdays generated weren't unique. However, if we had [1 3 4 6 7], unique would give back the same vector, and since its length equals the sample size, we know that all of the days are unique.
So, we check to see if this number is equal to the sample size for each iteration. If it is, then we output true. If not, we output false. When I run this code on my end, this is what I get for check. I set the sample size to 30 and the number of trials to be 10.
check =
0 0 1 1 0 0 0 0 1 0
Take note that if you increase the sample size, there is a higher probability that you will get duplicates, because randi can be considered as sampling with replacement. Therefore, the larger the sample size, the higher the chance of getting duplicate values. I made the sample size small on purpose so that we can see that it's possible to get unique days. However, if you set it to something like 100, or 200, you will most likely get check to be all false as there will most likely be duplicates per trial.
Here are some more approaches that avoid loops. Let
n = 20; %// define sample size
x = randi(365,n,1); %// generate n values between 1 and 365
Any of the following code snippets returns true (or 1) if there are two identical values in x, and false (or 0) otherwise:
Sort and then check if any two consecutive elements are the same:
result = any(diff(sort(x))==0);
Do all pairwise comparisons manually; remove self-pairs and duplicate pairs; and check if any of the remaining comparisons is true:
result = nnz(tril(bsxfun(@eq, x, x.'),-1))>0;
Compute the distance between distinct values, considering each pair just once, and then check if any distance is 0:
result = any(pdist(x(:))==0);
Find the number of occurrences of the most common value (mode):
[~, occurs] = mode(x);
result = occurs>1;
I don't know if I'm supposed to solve the problem for you, but perhaps a few hints may lead you in the right direction (besides, I'm not a MATLAB expert, so this will be in general terms):
Maybe not, but you have to ask yourself what they expect of you. The solution you propose requires you to loop through the array in two nested loops, which means n*(n-1)/2 passes through the inner comparison (i.e. quadratic time complexity).
There are a number of ways you can improve the time complexity of the problem. The most straightforward would be to have a 365-element table where you keep track of whether a particular number has been seen yet, which would require only a single loop (i.e. linear time complexity), but perhaps that's not what they're looking for either. Maybe that solution is a little bit ad hoc? What we're basically looking for is a fast lookup of whether a particular number has been seen before; there exist more memory-efficient structures that allow lookup in O(1) or O(log n) time (if you know these, you have an arsenal of tools to use).
Then of course you could use the pigeonhole principle to provide the answer much faster in some special cases (remember that you are only asked to determine whether two or more numbers are equal or not).

Efficient histogram implementation using a hash function

Is there a more efficient approach to computing a histogram than a binary search for a non-linear bin distribution?
I'm actually only interested in the bit of the algorithm that matches the key (value) to the bin (the transfer function?), i.e. for a bunch of floating point values I just want to know the appropriate bin index for each value.
I know that for a linear bin distribution you can get O(1) by dividing the value by the bin width, and that for non-linear bins a binary search gets you O(log N). My current implementation uses a binary search on unequal bin widths.
In the spirit of improving efficiency I was curious as to whether you could use a hash function to map a value to its appropriate bin and achieve O(1) time complexity when you have bins of unequal widths?
In some simple cases you can get O(1).
Suppose, your values are 8-bit, from 0 to 255.
If you split them into 8 bins of sizes 2, 2, 4, 8, 16, 32, 64, 128, then the bin value ranges will be: 0-1, 2-3, 4-7, 8-15, 16-31, 32-63, 64-127, 128-255.
In binary these ranges look like:
0000000x (bin 0)
0000001x
000001xx
00001xxx
0001xxxx
001xxxxx
01xxxxxx
1xxxxxxx (bin 7)
So, if you can quickly (in O(1)) count how many most significant zero bits there are in the value, you can get the bin number from it.
In this particular case you may precalculate a look-up table of 256 elements, containing the bin number and finding the appropriate bin for a value is just one table look-up.
Actually, with 8-bit values you can use bins of arbitrary sizes since the look-up table is small.
If you were to go with bins of sizes of powers of 2, you could reuse this look-up table for 16-bit values as well. And you'd need two look-ups. You can extend it to even longer values.
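A small sketch of that 256-entry lookup table (Python used only for illustration; int.bit_length plays the role of counting the leading zero bits):
# Bin sizes 2, 2, 4, 8, 16, 32, 64, 128 -> bin 0 for 0-1, bin 1 for 2-3, ..., bin 7 for 128-255
LUT = [max(v.bit_length() - 1, 0) for v in range(256)]

def bin_index(value):
    return LUT[value & 0xFF]   # a single table lookup, O(1)

assert bin_index(0) == 0 and bin_index(3) == 1 and bin_index(130) == 7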
Ordinary hash functions are intended to scatter different values quite randomly across some range. A single-bit difference in arguments may lead to dozens of bits different in results. For that reason, ordinary hash functions are not suitable for the situation described in the question.
An alternative is to build an array P with entries that index into the table B of bin limits. Given some value x, we find the bin j it belongs to (or sometimes a nearby bin) via j = P[⌊x·r⌋] where r is a ratio that depends on the size of P and the maximum value in B. The effectiveness of this approach depends on the values in B and the size of P.
The behavior of functions like P[⌊x·r⌋] can be seen via the python code shown below. (The method is about the same in any programming language. However, tips for Python-to-C are given below.) Suppose the code is stored in file histobins.py and loaded into the ipython interpreter with the command import histobins as hb. Then a command like hb.betterparts(27, 99, 9, 80,155) produces output like
At 80 parts, steps = 20 = 7+13
At 81 parts, steps = 16 = 7+9
At 86 parts, steps = 14 = 6+8
At 97 parts, steps = 13 = 12+1
At 108 parts, steps = 12 = 3+9
At 109 parts, steps = 12 = 8+4
At 118 parts, steps = 12 = 6+6
At 119 parts, steps = 10 = 7+3
At 122 parts, steps = 10 = 3+7
At 141 parts, steps = 10 = 5+5
At 142 parts, steps = 10 = 4+6
At 143 parts, steps = 9 = 7+2
These parameters to betterparts set nbins=27, topsize=99, seed=9, plo=80, phi=155 which creates a test set of 27 bins for values from 0 to 99, with random seed 9, and size of P from 80 to 155-1. The number of “steps” is the number of times the two while loops in testparts() operated during a test with 10*nbins values from 0 to topsize. Eg, “At 143 parts, steps = 9 = 7+2” means that when the size of P is 143, out of 270 trials, 261 times P[⌊x·r⌋] produced the correct index at once; 7 times the index had to be decreased, and twice it had to be increased.
The general idea of the method is to trade off space for time. Another tradeoff is preparation time versus operation time. If you are going to be doing billions of lookups, it is worthwhile to do a few thousand trials to find a good value of |P|, the size of P. If you are going to be doing only a few millions of lookups, it might be better to just pick some large value of |P| and run with it, or perhaps just run betterparts over a narrow range. Instead of doing 75 tests as above, if we start with larger |P| fewer tests may give a good enough result. For example, 10 tests via “hb.betterparts(27, 99, 9, 190,200)” produces
At 190 parts, steps = 11 = 5+6
At 191 parts, steps = 5 = 3+2
At 196 parts, steps = 5 = 4+1
As long as P fits into some level of cache (along with other relevant data) making |P| larger will speed up access. So, making |P| as large as practical is a good idea. As |P| gets larger, the difference in performance between one value of |P| and the next gets smaller and smaller. The limiting factors on speed then include time to multiply and time to set up while loops. One approach for faster multiplies may be to choose a power of 2 as a multiplier; compute |P| to match; then use shifts or adds to exponents instead of multiplies. One approach to spending less time setting up while loops is to move the statement if bins[bin] <= x < bins[bin+1]: (or its C equivalent, see below) to before the while statements and do the while's only if the if statement fails.
Python code is shown below. Note, in translating from Python to C,
• # begins a comment
• def begins a function
• a statement like ntest, right, wrong, x = 10*nbins, 0, 0, 0 assigns values to respective identifiers
• a statement like return (ntest, right, wrong, stepdown, stepup) returns a tuple of 5 values that the caller can assign to a tuple or to respective identifiers
• the scope of a def, while, or if ends with a line not indented farther than the def, while, or if
• bins = [0] initializes a list (an extendible indexable array) with value 0 as its initial entry
• bins.append(t) appends value t at the end of list bins
• for i,j in enumerate(p): runs a loop over the elements of iterable p (in this case, p is a list), making the index i and corresponding entry j == p[i] available inside the loop
• range(nparts) stands for a list of the values 0, 1, ... nparts-1
• range(plo, phi) stands for a list of the values plo, plo+1, ... phi-1
• if bins[bin] <= x < bins[bin+1] means if ((bins[bin] <= x) && (x < bins[bin+1]))
• int(round(x*float(nparts)/topsize)) actually rounds x·r, instead of computing ⌊x·r⌋ as advertised above
import random   # module-level import so that makebins below can use random.random()

def makebins(nbins, topsize):
    bins, t = [0], 0
    for i in range(nbins):
        t += random.random()
        bins.append(t)
    for i in range(nbins+1):
        bins[i] *= topsize/t
    bins.append(topsize+1)
    return bins
#________________________________________________________________
def showbins(bins):
    print ''.join('{:6.2f} '.format(x) for x in bins)

def showparts(nbins, bins, topsize, nparts, p):
    ratio = float(topsize)/nparts
    for i,j in enumerate(p):
        print '{:3d}. {:3d} {:6.2f} {:7.2f} '.format(i, j, bins[j], i*ratio)
    print 'nbins: {} topsize: {} nparts: {} ratio: {}'.format(nbins, topsize, nparts, ratio)
    print 'p = ', p
    print 'bins = ',
    showbins(bins)
#________________________________________________________________
def testparts(nbins, topsize, nparts, seed):
    # Make bins and make lookup table p
    import random
    if seed > 0: random.seed(seed)
    bins = makebins(nbins, topsize)
    ratio, j, p = float(topsize)/nparts, 0, range(nparts)
    for i in range(nparts):
        while j < nbins and i*ratio >= bins[j+1]:
            j += 1
        p[i] = j
    p.append(j)
    #showparts(nbins, bins, topsize, nparts, p)
    # Count # of hits and steps with avg. of 10 items per bin
    ntest, right, wrong, x = 10*nbins, 0, 0, 0
    delta, stepdown, stepup = topsize/float(ntest), 0, 0
    for i in range(ntest):
        bin = p[min(nparts, max(0, int(round(x*float(nparts)/topsize))))]
        while bin < nbins and x >= bins[bin+1]:
            bin += 1; stepup += 1
        while bin > 0 and x < bins[bin]:
            bin -= 1; stepdown += 1
        if bins[bin] <= x < bins[bin+1]:  # Test if bin is correct
            right += 1
        else:
            wrong += 1
            print 'Wrong bin {} {:7.3f} at x={:7.3f} Too {}'.format(bin, bins[bin], x, 'high' if bins[bin] > x else 'low')
        x += delta
    return (ntest, right, wrong, stepdown, stepup)
#________________________________________________________________
def betterparts(nbins, topsize, seed, plo, phi):
    beststep = 1e9
    for parts in range(plo, phi):
        ntest, right, wrong, stepdown, stepup = testparts(nbins, topsize, parts, seed)
        if wrong: print 'Error with ', parts, ' parts'
        steps = stepdown + stepup
        if steps <= beststep:
            beststep = steps
            print 'At {:3d} parts, steps = {:d} = {:d}+{:d}'.format(parts, steps, stepdown, stepup)
#________________________________________________________________
Interpolation search is your friend. It's kind of an optimistic, predictive binary search where it guesses where the bin should be based on a linear assumption about the distribution of inputs, rather than just splitting the search space in half at each step. It will be O(1) if the linear assumption is true, but still works (though more slowly) when the assumption is not. To the degree that its predictions are accurate, the search is fast.
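For illustration, a predecessor-style interpolation search over sorted bin limits could look roughly like this (a hedged sketch, assuming limits[i] is the lower edge of bin i and x >= limits[0]):
def find_bin(limits, x):
    lo, hi = 0, len(limits) - 1
    while lo < hi:
        # Guess the bin by assuming the limits are roughly evenly spaced
        span = limits[hi] - limits[lo]
        guess = lo + int((hi - lo) * (x - limits[lo]) / span) if span else lo
        guess = min(max(guess, lo), hi)
        upper = limits[guess + 1] if guess + 1 < len(limits) else float('inf')
        if limits[guess] <= x < upper:
            return guess
        if limits[guess] > x:
            hi = guess - 1
        else:
            lo = guess + 1
    return lo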
It depends on the implementation of the hashing and the type of data you're working with. For smaller data sets a simpler algorithm like binary search might outperform constant-time lookup if the lookup overhead of hashing is larger on average.
The usual implementation of hashing consists of an array of linked lists and a hash function that maps a key to an index in the array of linked lists. There's a quantity called the load factor, which is the number of elements in the hash map divided by the length of the linked-list array. For load factors < 1 you'll achieve constant lookup in the best case, because then no linked list contains more than one element.
There's only one way to find out which is better - implement a hash map and see for yourself. You should be able to get something near constant lookup :)
