Need some help calculating percentile - c

An RPC server receives millions of requests a day. Each request i takes processing time Ti to get processed. We want to find the 65th percentile processing time (when processing times are sorted in increasing order) at any moment. We cannot store the processing times of all past requests because the number of requests is very large, so the answer need not be the exact 65th percentile; an approximate answer, i.e. a processing time close to the exact 65th percentile, is fine.
Hint: it has something to do with how a histogram (i.e. an overview) of a very large data set can be maintained without storing all of the data.

Take one day's data and use it to figure out what size to make your buckets. Say one day's data shows that the vast majority (95%?) of your values fall within 0.5 seconds of 1 second (ridiculous values, but hang in there).
To get the 65th percentile you'll want at least 20 buckets in that range, but be generous and make it 80. So you divide your 1-second window (-0.5 seconds to +0.5 seconds around the 1-second center) into 80 buckets, each 1/80th of a second wide.
Bucket 0 runs from (center - deviation) = (1 - 0.5) = 0.5 seconds up to 0.5 + 1/80 seconds. Bucket 1 runs from 0.5 + 1/80 to 0.5 + 2/80 seconds, and so on.
For every value, find out which bucket it falls in, and increment a counter for that bucket.
To find the 65th percentile, take the total count and walk the buckets from zero until you reach 65% of that total.
Whenever you want to reset, set all the counters to zero.
If you always want good data available, keep two of these and alternate resetting them, treating the one you reset least recently as the one with the more useful data.
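As a rough illustration, here is a minimal Python sketch of that scheme (the 0.5-1.5 s range, the 80 buckets, and the clamping of out-of-range values into the end buckets are assumptions made for the example, not part of the answer above):

class BucketPercentile:
    """Approximate percentile tracking with fixed-width buckets."""
    def __init__(self, lo=0.5, hi=1.5, nbuckets=80):
        self.lo, self.hi, self.nbuckets = lo, hi, nbuckets
        self.width = (hi - lo) / nbuckets
        self.counts = [0] * nbuckets

    def add(self, t):
        i = int((t - self.lo) / self.width)
        i = max(0, min(self.nbuckets - 1, i))   # clamp out-of-range values into the end buckets
        self.counts[i] += 1

    def percentile(self, pct=65):
        total = sum(self.counts)
        if total == 0:
            return None
        target = pct / 100.0 * total
        running = 0
        for i, c in enumerate(self.counts):
            running += c
            if running >= target:
                return self.lo + (i + 1) * self.width   # report the bucket's upper edge
        return self.hi

Call add(Ti) for every request and percentile(65) whenever you need the estimate; for values that land inside the chosen range, the error is at most one bucket width.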

Use an updown filter:
if q < x:
    q += .01 * (x - q)    # up a little
else:
    q += .005 * (x - q)   # down a little
Here a quantile estimator q tracks the x stream, moving a little towards each x. If both factors were .01, it would move up as often as down, tracking the 50th percentile. With .01 up and .005 down, it floats up to the 67th percentile; in general, it tracks the up / (up + down) th percentile. Bigger up/down factors track faster but are noisier -- you'll have to experiment on your real data. (I have no idea how to analyze updowns, would appreciate a link.)
The updown() below works on long vectors X, Q in order to plot them:
#!/usr/bin/env python
from __future__ import division
import sys
import numpy as np
import pylab as pl

def updown( X, Q, up=.01, down=.01 ):
    """ updown filter: running ~ up / (up + down) th percentile
        here vecs X in, Q out to plot
    """
    q = X[0]
    for j, x in np.ndenumerate(X):
        if q < x:
            q += up * (x - q)    # up a little
        else:
            q += down * (x - q)  # down a little
        Q[j] = q
    return q

#...............................................................................
if __name__ == "__main__":
    N = 1000
    up = .01
    down = .005
    plot = 0
    seed = 1
    exec "\n".join( sys.argv[1:] )  # python this.py N= up= down=
    np.random.seed(seed)
    np.set_printoptions( 2, threshold=100, suppress=True )  # .2f
    title = "updown random.exponential: N %d up %.2g down %.2g" % (N, up, down)
    print title

    X = np.random.exponential( size=N )
    Q = np.zeros(N)
    updown( X, Q, up=up, down=down )
    # M = np.zeros(N)
    # updown( X, M, up=up, down=up )
    print "last 10 Q:", Q[-10:]

    if plot:
        fig = pl.figure( figsize=(8,3) )
        pl.title(title)
        x = np.arange(N)
        pl.plot( x, X, "," )
        pl.plot( x, Q )
        pl.ylim( 0, 2 )
        png = "updown.png"
        print >>sys.stderr, "writing", png
        pl.savefig( png )
        pl.show()

An easier way to get the value that represents a given percentile of a list or array is the scoreatpercentile function in the scipy.stats module.
>>> import scipy.stats as ss
>>> ss.scoreatpercentile(v, 65)
There's a sibling, percentileofscore, to return the percentile given the value.
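For example (a small self-contained sketch; v here is random stand-in data, and plain numpy.percentile covers the same ground in newer NumPy/SciPy versions):

import numpy as np
import scipy.stats as ss

v = np.random.exponential(size=10000)   # stand-in for observed processing times
print(ss.scoreatpercentile(v, 65))      # value at the 65th percentile
print(ss.percentileofscore(v, 1.0))     # percentile rank of the value 1.0
print(np.percentile(v, 65))             # NumPy equivalent of scoreatpercentile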

You will need to store a running sum and a total count.
Then check out standard deviation calculations.
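A minimal sketch of what that suggestion might look like (my reading of it): keep a running count, mean, and variance with Welford's method, then read the 65th percentile off a normal approximation (z of about 0.385). The normality assumption is mine, and processing times are usually skewed, so treat this as a rough estimate only.

class RunningStats:
    """Online mean/variance (Welford); percentile via a normal approximation."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0          # running sum of squared deviations

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def percentile65(self):
        if self.n < 2:
            return self.mean
        std = (self.m2 / (self.n - 1)) ** 0.5
        return self.mean + 0.385 * std   # inverse normal CDF at 0.65 is about 0.385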

Related

How to find contiguous subarray of integers in an array from n arrays such that the sum of elements of such contiguous subarrays is minimum

Input: n arrays of integers of length p.
Output: An array of p integers built by copying contiguous subarrays of the input arrays into matching indices of the output, satisfying the following conditions.
At most one subarray is used from each input array.
Every index of the output array is filled from exactly one subarray.
The output array has the minimum possible sum.
Suppose I have 2 arrays:
[1,7,2]
[2,1,8]
So if I choose the subarray [1,7] from array 1 and the subarray [8] from array 2: these two subarrays do not overlap at any index, each is contiguous, and we have not taken more than one subarray from any single array.
The number of elements in the chosen subarrays is 2 + 1 = 3, which is the same as the length of each individual array (i.e. len(array 1), which is 3). So this collection is valid.
The sum here for [1,7] and [8] is 1 + 7 + 8 = 16.
We have to find a collection of such subarrays such that the total sum of the elements of subarrays is minimum.
A solution to the above 2 arrays would be a collection [2,1] from array 1 and [2] from array 2.
This is a valid collection and the sum is 2 + 1 + 2 = 5 which is the minimum sum for any such collection in this case.
I cannot think of any optimal or correct approach, so I need help.
Some Ideas:
I tried a greedy approach: choose the minimum element across all arrays for a particular index. Since the index always increases after a valid choice (so choices never overlap), I don't have to store minimum-value indices for every array. But this approach is clearly not correct, since it can visit the same array twice.
Another method I thought of was to start at index 0 for all arrays and keep, for each array, the running sum of its first k elements (the number of arrays is finite, so these sums fit in an array). Take the minimum across these sums; the subarray giving this minimum (the first k elements of that array) is a candidate for a valid subarray of size k. If adding the (k+1)-th element to every array's running sum leaves the original minimum in place, the step can be repeated; when the minimum fails, the subarray up to the index for which it held is a valid starting subarray. However, this approach also clearly fails, because there could exist another subarray of size < k that yields the minimum together with the remaining indices covered from elsewhere.
Sorting is not possible either, since sorting would break the contiguity condition.
Of course, there is a brute force method too.
I am thinking that working through a greedy approach might lead to progress.
I have searched other Stack Overflow posts but couldn't find anything that helps with my problem.
To get you started, here's a recursive branch-&-bound backtracking - and potentially exhaustive - search. Ordering heuristics can have a huge effect on how efficient these are, but without mounds of "real life" data to test against there's scant basis for picking one over another. This incorporates what may be the single most obvious ordering rule.
Because it's a work in progress, it prints stuff as it goes along: all solutions found, whenever they meet or beat the current best; and the index at which a search is cut off early, when that happens (because it becomes obvious that the partial solution at that point can't be extended to meet or beat the best full solution known so far).
For example,
>>> crunch([[5, 6, 7], [8, 0, 3], [2, 8, 7], [8, 2, 3]])
displays
new best
L2[0:1] = [2] 2
L1[1:2] = [0] 2
L3[2:3] = [3] 5
sum 5
cut at 2
L2[0:1] = [2] 2
L1[1:3] = [0, 3] 5
sum 5
cut at 2
cut at 2
cut at 2
cut at 1
cut at 1
cut at 2
cut at 2
cut at 2
cut at 1
cut at 1
cut at 1
cut at 0
cut at 0
So it found two ways to get a minimal sum 5, and the simple ordering heuristic was effective enough that all other paths to full solutions were cut off early.
def disp(lists, ixs):
    from itertools import groupby
    total = 0
    i = 0
    for k, g in groupby(ixs):
        j = i + len(list(g))
        chunk = lists[k][i:j]
        total += sum(chunk)
        print(f"L{k}[{i}:{j}] = {chunk} {total}")
        i = j

def crunch(lists):
    n = len(lists[0])
    assert all(len(L) == n for L in lists)
    # Start with a sum we know can be beat.
    smallest_sum = sum(lists[0]) + 1
    smallest_ixs = [None] * n
    ixsofar = [None] * n

    def inner(i, sumsofar, freelists):
        nonlocal smallest_sum
        assert sumsofar <= smallest_sum
        if i == n:
            print()
            if sumsofar < smallest_sum:
                smallest_sum = sumsofar
                smallest_ixs[:] = ixsofar
                print("new best")
                disp(lists, ixsofar)
                print("sum", sumsofar)
            return
        # Simple greedy heuristic: try available lists in the order
        # of smallest-to-largest at index i.
        for lix in sorted(freelists, key=lambda lix: lists[lix][i]):
            L = lists[lix]
            newsum = sumsofar
            freelists.remove(lix)
            # Try all slices in L starting at i.
            for j in range(i, n):
                newsum += L[j]
                # ">" to find all smallest answers;
                # ">=" to find just one (potentially faster)
                if newsum > smallest_sum:
                    print("cut at", j)
                    break
                ixsofar[j] = lix
                inner(j + 1, newsum, freelists)
            freelists.add(lix)

    inner(0, 0, set(range(len(lists))))
How bad is brute force?
Bad. A brute force way to compute it: say there are n lists each with p elements. The code's ixsofar vector contains p integers each in range(n). The only constraint is that all occurrences of any integer that appears in it must be consecutive. So a brute force way to compute the total number of such vectors is to generate all p-tuples and count the number that meet the constraints. This is woefully inefficient, taking O(n**p) time, but is really easy, so hard to get wrong:
def countb(n, p):
    from itertools import product, groupby
    result = 0
    seen = set()
    for t in product(range(n), repeat=p):
        seen.clear()
        for k, g in groupby(t):
            if k in seen:
                break
            seen.add(k)
        else:
            #print(t)
            result += 1
    return result
For small arguments, we can use that as a sanity check on the next function, which is efficient. This builds on common "stars and bars" combinatorial arguments to deduce the result:
def count(n, p):
    # n lists of length p
    # for r regions, r from 1 through min(p, n)
    # number of ways to split up: comb((p - r) + r - 1, r - 1)
    # for each, ff(n, r) ways to spray in list indices = comb(n, r) * r!
    from math import comb, prod
    total = 0
    for r in range(1, min(n, p) + 1):
        total += comb(p-1, r-1) * prod(range(n, n-r, -1))
    return total
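As a quick sanity check of my own (not part of the original answer), the brute-force counter and the formula agree on small arguments:

for n in range(1, 5):
    for p in range(1, 6):
        assert countb(n, p) == count(n, p), (n, p)
print(count(20, 20))   # 1399496554158060983080, the value quoted below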
Faster
Following is the best code I have for this so far. It builds in more "smarts" to the code I posted before. In one sense, it's very effective. For example, for randomized p = n = 20 inputs it usually finishes within a second. That's nothing to sneeze at, since:
>>> count(20, 20)
1399496554158060983080
>>> _.bit_length()
71
That is, trying every possible way would effectively take forever. The number of cases to try doesn't even fit in a 64-bit int.
On the other hand, boost n (the number of lists) to 30, and it can take an hour. At 50, I haven't seen a non-contrived case finish yet, even if left to run overnight. The combinatorial explosion eventually becomes overwhelming.
OTOH, I'm looking for the smallest sum, period. If you needed to solve problems like this in real life, you'd either need a much smarter approach, or settle for iterative approximation algorithms.
Note: this is still a work in progress, so isn't polished, and prints some stuff as it goes along. Mostly that's been reduced to running a "watchdog" thread that wakes up every 10 minutes to show the current state of the ixsofar vector.
def crunch(lists):
    import datetime
    now = datetime.datetime.now
    start = now()
    n = len(lists[0])
    assert all(len(L) == n for L in lists)
    # Start with a sum we know can be beat.
    smallest_sum = min(map(sum, lists)) + 1
    smallest_ixs = [None] * n
    ixsofar = [None] * n

    import threading
    def watcher(stop):
        if stop.wait(60):
            return
        lix = ixsofar[:]
        while not stop.wait(timeout=600):
            print("watch", now() - start, smallest_sum)
            nlix = ixsofar[:]
            for i, (a, b) in enumerate(zip(lix, nlix)):
                if a != b:
                    nlix.insert(i, "--- " + str(i) + " -->")
                    print(nlix)
                    del nlix[i]
                    break
            lix = nlix
    stop = threading.Event()
    w = threading.Thread(target=watcher, args=[stop])
    w.start()

    def inner(i, sumsofar, freelists):
        nonlocal smallest_sum
        assert sumsofar <= smallest_sum
        if i == n:
            print()
            if sumsofar < smallest_sum:
                smallest_sum = sumsofar
                smallest_ixs[:] = ixsofar
                print("new best")
                disp(lists, ixsofar)
                print("sum", sumsofar, now() - start)
            return
        # If only one input list is still free, we have to take all
        # of its tail. This code block isn't necessary, but gives a
        # minor speedup (skips layers of do-nothing calls),
        # especially when the length of the lists is greater than
        # the number of lists.
        if len(freelists) == 1:
            lix = freelists.pop()
            L = lists[lix]
            for j in range(i, n):
                ixsofar[j] = lix
                sumsofar += L[j]
                if sumsofar >= smallest_sum:
                    break
            else:
                inner(n, sumsofar, freelists)
            freelists.add(lix)
            return
        # Peek ahead. The smallest completion we could possibly get
        # would come from picking the smallest element in each
        # remaining column (restricted to the lists - rows - still
        # available). This probably isn't achievable, but is an
        # absolute lower bound on what's possible, so can be used to
        # cut off searches early.
        newsum = sumsofar
        for j in range(i, n):  # pick smallest from column j
            newsum += min(lists[lix][j] for lix in freelists)
        if newsum >= smallest_sum:
            return
        # Simple greedy heuristic: try available lists in the order
        # of smallest-to-largest at index i.
        sortedlix = sorted(freelists, key=lambda lix: lists[lix][i])
        # What's the next int in the previous slice? As soon as we
        # hit an int at least that large, we can do at least as well
        # by just returning, to let the caller extend the previous
        # slice instead.
        if i:
            prev = lists[ixsofar[i-1]][i]
        else:
            prev = lists[sortedlix[-1]][i] + 1
        for lix in sortedlix:
            L = lists[lix]
            if prev <= L[i]:
                return
            freelists.remove(lix)
            newsum = sumsofar
            # Try all non-empty slices in L starting at i.
            for j in range(i, n):
                newsum += L[j]
                if newsum >= smallest_sum:
                    break
                ixsofar[j] = lix
                inner(j + 1, newsum, freelists)
            freelists.add(lix)

    inner(0, 0, set(range(len(lists))))
    stop.set()
    w.join()
Bounded by DP
I've had a lot of fun with this :-) Here's the approach they were probably looking for, using dynamic programming (DP). I have several programs that run faster in "smallish" cases, but none that can really compete on a non-contrived 20x50 case. The runtime is O(2**n * n**2 * p). Yes, that's more than exponential in n! But it's still a minuscule fraction of what brute force can require (see above), and is a hard upper bound.
Note: this is just a loop nest slinging machine-size integers, and using no "fancy" Python features. It would be easy to recode in C, where it would run much faster. As is, this code runs over 10x faster under PyPy (as opposed to the standard CPython interpreter).
Key insight: suppose we're going left to right, have reached column j, the last list we picked from was D, and before that we picked columns from lists A, B, and C. How can we proceed? Well, we can pick the next column from D too, and the "used" set {A, B, C} doesn't change. Or we can pick some other list E, the "used" set changes to {A, B, C, D}, and E becomes the last list we picked from.
Now in all these cases, the details of how we reached state "used set {A, B, C} with last list D at column j" make no difference to the collection of possible completions. It doesn't matter how many columns we picked from each, or the order in which A, B, C were used: all that matters to future choices is that A, B, and C can't be used again, and D can be but - if so - must be used immediately.
Since all ways of reaching this state have the same possible completions, the cheapest full solution must have the cheapest way of reaching this state.
So we just go left to right, one column at a time, and remember for each state in the column the smallest sum reaching that state.
This isn't cheap, but it's finite ;-) Since states are subsets of row indices, combined with (the index of) the last list used, there are 2**n * n possible states to keep track of. In fact, there are only half that, since the way sketched above never includes the index of the last-used list in the used set, but catering to that would probably cost more than it saves.
As is, states here are not represented explicitly. Instead there's just a large list of sums-so-far, of length 2**n * n. The state is implied by the list index: index i represents the state where:
i >> n is the index of the last-used list.
The last n bits of i are a bitset, where bit 2**j is set if and only if list index j is in the set of used list indices.
You could, e.g., represent these by dicts mapping (frozenset, index) pairs to sums instead, but then memory use explodes, runtime zooms, and PyPy becomes much less effective at speeding it.
Sad but true: like most DP algorithms, this finds "the best" answer but retains scant memory of how it was reached. Adding code to allow for that is harder than what's here, and typically explodes memory requirements. Probably easiest here: write new to disk at the end of each outer-loop iteration, one file per column. Then memory use isn't affected. When it's done, those files can be read back in again, in reverse order, and mildly tedious code can reconstruct the path it must have taken to reach the winning state, working backwards one column at a time from the end.
def dumbdp(lists):
    import datetime
    _min = min
    now = datetime.datetime.now
    start = now()
    n = len(lists)
    p = len(lists[0])
    assert all(len(L) == p for L in lists)
    rangen = range(n)
    USEDMASK = (1 << n) - 1
    HUGE = sum(sum(L) for L in lists) + 1
    new = [HUGE] * (2**n * n)
    for i in rangen:
        new[i << n] = lists[i][0]
    for j in range(1, p):
        print("working on", j, now() - start)
        old = new
        new = [HUGE] * (2**n * n)
        for key, g in enumerate(old):
            if g == HUGE:
                continue
            i = key >> n
            new[key] = _min(new[key], g + lists[i][j])
            newused = (key & USEDMASK) | (1 << i)
            for i in rangen:
                mask = 1 << i
                if newused & mask == 0:
                    newkey = newused | (i << n)
                    new[newkey] = _min(new[newkey],
                                       g + lists[i][j])
    result = min(new)
    print("DONE", result, now() - start)
    return result
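As a quick check (mine, not part of the original write-up), the DP agrees with the branch-and-bound example shown earlier; note that dumbdp() prints a progress line per column as it goes:

lists = [[5, 6, 7], [8, 0, 3], [2, 8, 7], [8, 2, 3]]
assert dumbdp(lists) == 5   # same minimal sum that crunch() found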

Probability: Estimating NoSQL Query Size / COUNT Using Random Samples

I have a very large NoSQL database. Each item in the database is assigned a uniformly distributed random value between 0 and 1. This database is so large that performing a COUNT on queries does not yield acceptable performance, but I'd like to use the random values to estimate COUNT.
The idea is this:
Run a query and order the query by the random value. Random values are indexed, so it's fast.
Grab the lowest N values, and see how big the largest value is, say R.
Estimate COUNT as N / R
The question is two-fold:
Is N / R the best way to estimate COUNT? Maybe it should be (N+1)/R? Maybe we could look at the other values (average, variance, etc), and not just the largest value to get a better estimate?
What is the error margin on this estimated value of COUNT?
Note: I thought about posting this in the math stack exchange, but given this is for databases, I thought it would be more appropriate here.
This actually would be better on math or statistics stack exchange.
The reasonable estimate is that if R is large and x is your order statistic, then R is approximately n/x - 1. (Note the change of notation: here R is the total count, i.e. the question's COUNT, and x plays the role of the question's R, the largest of the n smallest random values.) About 95% of the time the error will be within 2R/sqrt(n) of this. So looking at the 100th element will estimate the right answer to within about 20%. Looking at the 10,000th element will estimate it to within about 2%. And the millionth element will get you the right answer to within about 0.2%.
To see this, start with the fact that the n-th order statistic of R uniform values has a Beta distribution with parameters α = n and β = R + 1 - n. This means that the mean value of the n-th smallest of R values is n/(R+1), and its variance is αβ / ((α + β)^2 (α + β + 1)). If we assume that R is much larger than n, this is approximately nR / R^3 = n / R^2, so the standard deviation is sqrt(n) / R.
If x is our order statistic, this means that n/x - 1 is a reasonable estimate of R. And how much is it off by? Use the tangent-line approximation: the function n/x - 1 has derivative -n/x^2, and at x = n/(R+1) that derivative is -(R + 1)^2 / n, which for large R is roughly R^2 / n in magnitude. Multiply by our standard deviation of sqrt(n)/R and we come up with an error proportional to R/sqrt(n). Since a 95% confidence interval is about 2 standard deviations, you will typically see an error of around 2R/sqrt(n).
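A small simulation sketch of my own (NumPy), just to illustrate the estimator and its error bar; the values of R and n are made up:

import numpy as np

rng = np.random.default_rng(0)
R = 1_000_000      # true COUNT (unknown in practice)
n = 10_000         # number of smallest random values fetched by the query

x = np.sort(rng.random(R))[n - 1]       # n-th smallest value, i.e. the largest value returned
estimate = n / x - 1
error_bar = 2 * estimate / np.sqrt(n)   # ~95% interval, using the estimate in place of R

print(f"estimate = {estimate:.0f} +/- {error_bar:.0f} (true count = {R})")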

Best way to pick random elements from an array with at least a min diff in R

I would like to randomly choose a certain number of elements from an array in such a way that they always respect a minimum limit on their mutual distance.
For example, having a vector a <- seq(1,1000), how can I pick 20 elements with a minimum distance of 15 between each other?
For now, I am using a simple iteration in which I reject a choice whenever it is too close to any already-chosen element, but it is cumbersome and tends to be slow when the number of elements to pick is high. Is there a best practice or a function for this?
EDIT - Summary of answers and analysis
So far I have had two working answers, which I wrapped in two specific functions.
# dash2 approach
# ---------------
rand_pick_min <- function(ar, min.dist, n.picks){
  stopifnot(is.numeric(min.dist),
            is.numeric(n.picks), n.picks%%1 == 0)
  if(length(ar)/n.picks < min.dist)
    stop('The number of picks exceeds the maximum number of divisions that the array allows which is: ',
         floor(length(ar)/min.dist))
  picked <- array(NA, n.picks)
  copy <- ar
  for (i in 1:n.picks) {
    stopifnot(length(copy) > 0)
    picked[i] <- sample(copy, 1)
    copy <- copy[ abs(copy - picked[i]) >= min.dist ]
  }
  return(picked)
}
# denis approach
# ---------------
rand_pick_min2 <- function(ar, min.dist, n.picks){
  require(Surrogate)
  stopifnot(is.numeric(min.dist),
            is.numeric(n.picks), n.picks%%1 == 0)
  if(length(ar)/n.picks < min.dist)
    stop('The number of picks exceeds the maximum number of divisions that the array allows which is: ',
         floor(length(ar)/min.dist))
  lar <- length(ar)
  dist <- Surrogate::RandVec(a=min.dist, b=(lar-(n.picks)*min.dist),
                             s=lar, n=(n.picks+1), m=1, Seed=sample(1:lar, size = 1))$RandVecOutput
  return(cumsum(round(dist))[1:n.picks])
}
Using the same example proposed above, I ran three tests. First, the effective validity of the minimum-distance limit:
# Libs
require(ggplot2)
require(microbenchmark)
# Inputs
a <- seq(1, 1000) # test vector
md <- 15 # min distance
np <- 20 # number of picks
# Run
dist_vec <- c(sapply(1:500, function(x) c(dist(rand_pick_min(a, md, np))))) # sol 1
dist_vec2 <- c(sapply(1:500, function(x) c(dist(rand_pick_min2(a, md, np))))) # sol 2
# Tests - break the min
cat('Any distance breaking the min in sol 1?', any(dist_vec < md), '\n') # FALSE
cat('Any distance breaking the min in sol 2?', any(dist_vec2 < md), '\n') # FALSE
Secondly, I tested for the distribution of the resulting distances, obtaining the first two plots in order of solution (sol1 [A] is dash2's sol, while sol2 [B] is denis' one).
pa <- ggplot() + theme_classic() +
  geom_density(aes_string(x = dist_vec), fill = 'lightgreen') +
  geom_vline(aes_string(xintercept = mean(dist_vec)), col = 'darkred') + xlab('Distances')
pb <- ggplot() + theme_classic() +
  geom_density(aes_string(x = dist_vec2), fill = 'lightgreen') +
  geom_vline(aes_string(xintercept = mean(dist_vec2)), col = 'darkred') + xlab('Distances')
print(pa)
print(pb)
Lastly, I measured the computation times needed by the two approaches as follows, obtaining the last figure.
comp_times <- microbenchmark::microbenchmark(
'solution_1' = rand_pick_min(a, md, np),
'solution_2' = rand_pick_min2(a, md, np),
times = 500
)
ggplot2::autoplot(comp_times); ggsave('stckoverflow2.png')
In light of these results, I am asking myself whether this distance distribution is to be expected or whether it is a deviation due to the methods applied.
EDIT2 - Answer to the last question, following the comment made by denis
Using many more sampling procedures (5000), I produced a pdf of the resulting positions and indeed your approach contains some artefact that makes your solution (B) deviate from the one I needed. Nonetheless, it would be interesting to have the ability to enforce a specific final distribution of positions.
If you want to avoid the hit and miss methods, you will have to translate your problem into a sampling of distances with constraints on the sum of your distances.
Basically, this is how I translate what you want: your N sampled positions are equivalent to N+1 distances, each ranging from the minimum distance up to (the size of your vector) - N*mindist (the case where all your samples are packed together). You then need to constrain the sum of the distances to be equal to 1000 (the size of your vector).
In this case the solution uses Surrogate::RandVec from the Surrogate package (see Random sampling to give an exact sum), which allows sampling with a fixed sum.
library(Surrogate)
a <- seq(1,1000)
mind <- 15
N <- 20
dist <- Surrogate::RandVec(a=mind, b=(1000-(N)*mind), s=1000, n=(N+1), m=1, Seed=sample(1:1000, size = 1))$RandVecOutput
pos <- cumsum(round(dist))[1:20]
pos
> pos
[1] 22 59 76 128 204 239 289 340 389 440 489 546 567 607 724 773 808 843 883 927
dist is the sample of distances. You reconstruct the positions by taking the cumulative sum of the distances; this gives you pos, the vector of your index positions.
The advantage is that you can get any value, and that the sampling is supposed to be random. As for speed I don't know; you'll need to compare against your method for your big-data case.
Here is a histogram of 1000 tries:
I think the best solution, which guarantees randomness in some sense (I'm not exactly sure what sense!) may be:
1. Pick a random element
2. Remove all elements that are too close to that element
3. Pick another element
4. Return to step 2.
So:
min_dist <- 15
a <- seq(1, 1000)
picked <- integer(20)
copy <- a
for (i in 1:20) {
  stopifnot(length(copy) > 0)
  picked[i] <- sample(copy, 1)
  copy <- copy[ abs(copy - picked[i]) >= min_dist ]
}
Whether this is faster than sample-and-reject may depend on the characteristics of the original vector. Also, as you can see, you are not guaranteed to be able to get all the elements you want, though in your particular case there won't be a problem because 19 intervals of width 30 could never cover the whole of seq(1, 1000).

Efficient histogram implementation using a hash function

Is there a more efficient approach to computing a histogram than a binary search for a non-linear bin distribution?
I'm actually only interested in the bit of the algorithm that matches the key (value) to the bin (the transfer function?), i.e. for a bunch of floating point values I just want to know the appropriate bin index for each value.
I know that for a linear bin distribution you can get O(1) by dividing the value by the bin width, and that for non linear bins a binary search gets you O(logN). My current implementation uses a binary search on unequal bin widths.
In the spirit of improving efficiency I was curious as to whether you could use a hash function to map a value to its appropriate bin and achieve O(1) time complexity when you have bins of unequal widths?
In some simple cases you can get O(1).
Suppose, your values are 8-bit, from 0 to 255.
If you split them into 8 bins of sizes 2, 2, 4, 8, 16, 32, 64, 128, then the bin value ranges will be: 0-1, 2-3, 4-7, 8-15, 16-31, 32-63, 64-127, 128-255.
In binary these ranges look like:
0000000x (bin 0)
0000001x
000001xx
00001xxx
0001xxxx
001xxxxx
01xxxxxx
1xxxxxxx (bin 7)
So, if you can quickly (in O(1)) count how many most significant zero bits there are in the value, you can get the bin number from it.
In this particular case you may precalculate a look-up table of 256 elements, containing the bin number and finding the appropriate bin for a value is just one table look-up.
Actually, with 8-bit values you can use bins of arbitrary sizes since the look-up table is small.
If you were to go with bins of sizes of powers of 2, you could reuse this look-up table for 16-bit values as well. And you'd need two look-ups. You can extend it to even longer values.
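A small Python sketch of my own to illustrate both variants for the 8-bit example above (the 256-entry look-up table, and a count-leading-zeros style shortcut via bit_length(), which works because the bin sizes are powers of two):

# Bins of sizes 2, 2, 4, 8, 16, 32, 64, 128 covering 8-bit values 0..255.
BIN_EDGES = [0, 2, 4, 8, 16, 32, 64, 128, 256]

# Precompute a 256-entry table: value -> bin index (one table look-up per query).
LUT = [0] * 256
for b in range(8):
    for v in range(BIN_EDGES[b], BIN_EDGES[b + 1]):
        LUT[v] = b

def bin_of(value):
    return LUT[value & 0xFF]

# For these power-of-two bin sizes the table isn't even needed:
def bin_of_clz(value):
    return max(0, value.bit_length() - 1)   # 0,1 -> 0; 2,3 -> 1; 4..7 -> 2; ...; 128..255 -> 7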
Ordinary hash functions are intended to scatter different values quite randomly across some range. A single-bit difference in arguments may lead to dozens of bits different in results. For that reason, ordinary hash functions are not suitable for the situation described in the question.
An alternative is to build an array P with entries that index into the table B of bin limits. Given some value x, we find the bin j it belongs to (or sometimes a nearby bin) via j = P[⌊x·r⌋] where r is a ratio that depends on the size of P and the maximum value in B. The effectiveness of this approach depends on the values in B and the size of P.
The behavior of functions like P[⌊x·r⌋] can be seen via the python code shown below. (The method is about the same in any programming language. However, tips for Python-to-C are given below.) Suppose the code is stored in file histobins.py and loaded into the ipython interpreter with the command import histobins as hb. Then a command like hb.betterparts(27, 99, 9, 80,155) produces output like
At 80 parts, steps = 20 = 7+13
At 81 parts, steps = 16 = 7+9
At 86 parts, steps = 14 = 6+8
At 97 parts, steps = 13 = 12+1
At 108 parts, steps = 12 = 3+9
At 109 parts, steps = 12 = 8+4
At 118 parts, steps = 12 = 6+6
At 119 parts, steps = 10 = 7+3
At 122 parts, steps = 10 = 3+7
At 141 parts, steps = 10 = 5+5
At 142 parts, steps = 10 = 4+6
At 143 parts, steps = 9 = 7+2
These parameters to betterparts set nbins=27, topsize=99, seed=9, plo=80, phi=155 which creates a test set of 27 bins for values from 0 to 99, with random seed 9, and size of P from 80 to 155-1. The number of “steps” is the number of times the two while loops in testparts() operated during a test with 10*nbins values from 0 to topsize. Eg, “At 143 parts, steps = 9 = 7+2” means that when the size of P is 143, out of 270 trials, 261 times P[⌊x·r⌋] produced the correct index at once; 7 times the index had to be decreased, and twice it had to be increased.
The general idea of the method is to trade off space for time. Another tradeoff is preparation time versus operation time. If you are going to be doing billions of lookups, it is worthwhile to do a few thousand trials to find a good value of |P|, the size of P. If you are going to be doing only a few millions of lookups, it might be better to just pick some large value of |P| and run with it, or perhaps just run betterparts over a narrow range. Instead of doing 75 tests as above, if we start with larger |P| fewer tests may give a good enough result. For example, 10 tests via “hb.betterparts(27, 99, 9, 190,200)” produces
At 190 parts, steps = 11 = 5+6
At 191 parts, steps = 5 = 3+2
At 196 parts, steps = 5 = 4+1
As long as P fits into some level of cache (along with other relevant data) making |P| larger will speed up access. So, making |P| as large as practical is a good idea. As |P| gets larger, the difference in performance between one value of |P| and the next gets smaller and smaller. The limiting factors on speed then include time to multiply and time to set up while loops. One approach for faster multiplies may be to choose a power of 2 as a multiplier; compute |P| to match; then use shifts or adds to exponents instead of multiplies. One approach to spending less time setting up while loops is to move the statement if bins[bin] <= x < bins[bin+1]: (or its C equivalent, see below) to before the while statements and do the while's only if the if statement fails.
Python code is shown below. Note, in translating from Python to C,
• # begins a comment
• def begins a function
• a statement like ntest, right, wrong, x = 10*nbins, 0, 0, 0 assigns values to respective identifiers
• a statement like return (ntest, right, wrong, stepdown, stepup) returns a tuple of 5 values that the caller can assign to a tuple or to respective identifiers
• the scope of a def, while, or if ends with a line not indented farther than the def, while, or if
• bins = [0] initializes a list (an extendible indexable array) with value 0 as its initial entry
• bins.append(t) appends value t at the end of list bins
• for i,j in enumerate(p): runs a loop over the elements of iterable p (in this case, p is a list), making the index i and corresponding entry j == p[i] available inside the loop
• range(nparts) stands for a list of the values 0, 1, ... nparts-1
• range(plo, phi) stands for a list of the values plo, plo+1, ... phi-1
• if bins[bin] <= x < bins[bin+1] means if ((bins[bin] <= x) && (x < bins[bin+1]))
• int(round(x*float(nparts)/topsize))) actually rounds x·r, instead of computing ⌊x·r⌋ as advertised above
import random   # module-level import: makebins() below uses random.random()

def makebins(nbins, topsize):
    bins, t = [0], 0
    for i in range(nbins):
        t += random.random()
        bins.append(t)
    for i in range(nbins+1):
        bins[i] *= topsize/t
    bins.append(topsize+1)
    return bins
#________________________________________________________________
def showbins(bins):
    print ''.join('{:6.2f} '.format(x) for x in bins)

def showparts(nbins, bins, topsize, nparts, p):
    ratio = float(topsize)/nparts
    for i,j in enumerate(p):
        print '{:3d}. {:3d} {:6.2f} {:7.2f} '.format(i, j, bins[j], i*ratio)
    print 'nbins: {} topsize: {} nparts: {} ratio: {}'.format(nbins, topsize, nparts, ratio)
    print 'p = ', p
    print 'bins = ',
    showbins(bins)
#________________________________________________________________
def testparts(nbins, topsize, nparts, seed):
    # Make bins and make lookup table p
    import random
    if seed > 0: random.seed(seed)
    bins = makebins(nbins,topsize)
    ratio, j, p = float(topsize)/nparts, 0, range(nparts)
    for i in range(nparts):
        while j<nbins and i*ratio >= bins[j+1]:
            j += 1
        p[i] = j
    p.append(j)
    #showparts(nbins, bins, topsize, nparts, p)

    # Count # of hits and steps with avg. of 10 items per bin
    ntest, right, wrong, x = 10*nbins, 0, 0, 0
    delta, stepdown, stepup = topsize/float(ntest), 0, 0
    for i in range(ntest):
        bin = p[min(nparts, max(0, int(round(x*float(nparts)/topsize))))]
        while bin < nbins and x >= bins[bin+1]:
            bin += 1; stepup += 1
        while bin > 0 and x < bins[bin]:
            bin -= 1; stepdown += 1
        if bins[bin] <= x < bins[bin+1]:  # Test if bin is correct
            right += 1
        else:
            wrong += 1
            print 'Wrong bin {} {:7.3f} at x={:7.3f} Too {}'.format(bin, bins[bin], x, 'high' if bins[bin] > x else 'low')
        x += delta
    return (ntest, right, wrong, stepdown, stepup)
#________________________________________________________________
def betterparts(nbins, topsize, seed, plo, phi):
    beststep = 1e9
    for parts in range(plo, phi):
        ntest, right, wrong, stepdown, stepup = testparts(nbins, topsize, parts, seed)
        if wrong: print 'Error with ', parts, ' parts'
        steps = stepdown + stepup
        if steps <= beststep:
            beststep = steps
            print 'At {:3d} parts, steps = {:d} = {:d}+{:d}'.format(parts, steps, stepdown, stepup)
#________________________________________________________________
Interpolation search is your friend. It's kind of an optimistic, predictive binary search where it guesses where the bin should be based on a linear assumption about the distribution of inputs, rather than just splitting the search space in half at each step. It will be O(1) if the linear assumption is true, but still works (though more slowly) when the assumption is not. To the degree that its predictions are accurate, the search is fast.
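A minimal interpolation-search sketch over a sorted array of bin edges (my own illustration, assuming strictly increasing edges and edges[0] <= x < edges[-1]):

def find_bin(edges, x):
    """Return i such that edges[i] <= x < edges[i+1]."""
    lo, hi = 0, len(edges) - 1
    while hi - lo > 1:
        # Guess the position as if the edges were linearly spaced between lo and hi.
        frac = (x - edges[lo]) / (edges[hi] - edges[lo])
        mid = lo + int(frac * (hi - lo))
        mid = min(max(mid, lo + 1), hi - 1)   # keep the probe strictly inside (lo, hi)
        if x < edges[mid]:
            hi = mid
        else:
            lo = mid
    return lo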
It depends on the implementation of the hashing and the type of data you're working with. For smaller data sets a simpler algorithm like binary search might outperform constant-time lookup if the per-lookup overhead of hashing is larger on average.
The usual implementation of hashing consists of an array of linked lists and a hash function that maps a string to an index into that array. The load factor is the number of elements in the hash map divided by the length of the linked-list array. For load factors < 1 you'll achieve constant lookup in the best case, because no linked list will contain more than one element.
There's only one way to find out which is better - implement a hash map and see for yourself. You should be able to get something near constant lookup :)

Calculate initial velocity to move a set distance with inertia

I want to move something a set distance. However, in my system there is inertia/drag/negative acceleration. I'm using a simple calculation like this for it:
v = oldV + ((targetV - oldV) * inertia)
Applying that over a number of frames makes the movement 'ramp up' or decay, eg:
v = 10 + ((0 - 10) * 0.25) = 7.5 // velocity changes from 10 to 7.5 this frame
So I know the distance I want to travel and the acceleration, but not the initial velocity that will get me there. Maybe a better explanation is I want to know how hard to hit a billiard ball so that it stops on a certain point.
I've been looking at Equations of motion (http://en.wikipedia.org/wiki/Equations_of_motion) but can't work out what the correct one for my problem is...
Any ideas? Thanks - I am from a design not science background.
Update: Fiirhok has a solution with a fixed acceleration value; HTML+jQuery demo:
http://pastebin.com/ekDwCYvj
Is there any way to do this with a fractional value or an easing function? The benefit of that in my experience is that fixed acceleration and frame based animation sometimes overshoots the final point and needs to be forced, creating a slight snapping glitch.
This is a simple kinematics problem.
At some time t, the velocity (v) of an object under constant acceleration is described by:
v = v0 + at
Where v0 is the initial velocity and a is the acceleration. In your case, the final velocity is zero (the object is stopped) so we can solve for t:
t = -v0/a
To find the total distance traveled, we take the integral of the velocity (the first equation) over time. I haven't done an integral in years, but I'm pretty sure this one works out to:
d = v0*t + 1/2 * a*t^2
We can substitute in the equation for t we developed earlier:
d = -v0^2/a + 1/2 * v0^2/a = -v0^2/(2a)
And then solve for v0:
v0 = sqrt(-2ad)
Or, in a more programming-language format:
initialVelocity = sqrt( -2 * acceleration * distance );
The acceleration in this case is negative (the object is slowing down), and I'm assuming that it's constant, otherwise this gets more complicated.
If you want to use this inside a loop with a finite number of steps, you'll need to be a little careful. Each iteration of the loop represents a period of time. The object will move an amount equal to the average velocity times the length of time. A sample loop with the length of time of an iteration equal to 1 would look something like this:
position = 0;
currentVelocity = initialVelocity;
while( currentVelocity > 0 )
{
    averageVelocity = currentVelocity + (acceleration / 2);
    position = position + averageVelocity;
    currentVelocity += acceleration;
}
If you want to move a set distance, use the following:
Distance travelled is just the integral of velocity with respect to time. Integrate your velocity expression over time, from the initial velocity v down to zero, and this will give you an expression for distance in terms of v (the initial velocity).
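Carrying that out for the decay model in the question: with targetV = 0, each frame does v *= (1 - inertia), so (assuming the object moves by v before each update, which is my assumption about the frame order) the total distance is the geometric series v0 * (1 + (1-inertia) + (1-inertia)^2 + ...) = v0 / inertia. A small sketch:

def initial_velocity(distance, inertia):
    # Total distance = v0 * sum((1 - inertia)**k for k >= 0) = v0 / inertia,
    # so the per-frame starting velocity is simply:
    return distance * inertia

# Quick check: simulate the frames and watch the position approach the target.
d, inertia = 100.0, 0.25
v = initial_velocity(d, inertia)
pos = 0.0
for _ in range(200):
    pos += v
    v *= 1 - inertia
print(pos)   # close to 100.0

Unlike the constant-acceleration version, this never overshoots; the position only approaches the target asymptotically, so in practice you still snap to the final value once the remaining velocity drops below some small threshold.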
