Modeling memory access on Z3 - arrays

I'm modelling a program's memory accesses using Z3 and I have a performance question I'd like to share.
I wanted to model, in a compact way, something like:
memset(dst, 0, 1000);
My first try was to use the array theory, but that meant either creating a thousand terms like (assert (and (= (select mem 0) 0) (= (select mem 1) 0) ..., or a thousand similar stores, or a quantified formula:
(forall (x Int) (implies (and (>= x 0) (< x 1000)) (= (select mem x) 0))
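In z3py that quantified version would look roughly like this (just a sketch; mem, the solver and the 1000 bound stand in for my real setup):
from z3 import *

# quantified array encoding of memset(dst, 0, 1000) over an Int -> Int memory
mem = Array('mem', IntSort(), IntSort())
x = Int('x')
s = Solver()
s.add(ForAll(x, Implies(And(x >= 0, x < 1000), Select(mem, x) == 0)))
print(s.check())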
But I was told to avoid quantifiers while using arrays.
Next idea was to define a UF:
(define-fun newmemfun ((idx Int)) Int
  (ite (and (>= idx 0) (< idx 1000)) 0 (prevmemfun idx)))
But that means I need to define a new function for each memory write operation (even for individual stores, not just bulk stores like memset or memcpy), which would end up creating a deeply nested ITE structure that even keeps "old" values for the same index, i.e.:
mem[0] = 1;
mem[0] = 2;
would be:
(ite (= idx 0) 2 (ite (= idx 0) 1 ...
This is functionally correct, but the size of the expression (and, I guess, the AST generated for it) grows very fast, and I'm not sure whether Z3 is optimized to detect and handle this case.
So, the question is: what would be the most performant way to encode memory operations so that they cope both with large multiple stores like the example above and with individual stores?
Thanks,
pablo.
PS: non-closed and non-matching parenthesis intended :P.

Without knowing a bit more about your end goal, aside from modeling memory accesses (e.g., are you going to be doing verification, test case generation, etc.?), it's somewhat hard to answer, as you have many options. However, you may have the most flexibility to control performance issues if you rely on one of the APIs. For example, you can define your own memory accesses as follows (link to z3py script: http://rise4fun.com/Z3Py/gO6i ):
from z3 import *

address_bits = 7
data_bits = 8
s = Solver()
# mem is a list with one entry per program step; each entry is a list of
# length 2^address_bits of bitvectors of width data_bits
mem = []
# modify a single address addr to value at program step step
def modifyAddr(addr, value, step):
    mem.append([]) # add new step
    for i in range(0, 2**address_bits):
        mem[step+1].append( BitVec('m' + str(step + 1) + '_' + str(i), data_bits) )
        if i != addr:
            s.add(mem[step+1][i] == mem[step][i])
        else:
            s.add(mem[step+1][i] == value)
# set all memory addresses to a specified value at program step step
def memSet(value, step):
    mem.append([])
    for i in range(0, 2**address_bits):
        mem[step+1].append( BitVec('m' + str(step + 1) + '_' + str(i), data_bits) )
        s.add(mem[step+1][i] == value)
modaddr = 23 # example address
step = -1
# initialize all addresses to 0
memSet(0, step)
step += 1
print s.check()
for i in range(0,step+1): print s.model()[mem[i][modaddr]] # print all step values for modaddr
modifyAddr(modaddr,3,step)
step += 1
print s.check()
for i in range(0,step+1): print s.model()[mem[i][modaddr]]
modifyAddr(modaddr,4,step)
step += 1
print s.check()
for i in range(0,step+1): print s.model()[mem[i][modaddr]]
modifyAddr(modaddr,2**6,step)
step += 1
print s.check()
for i in range(0,step+1): print s.model()[mem[i][modaddr]]
memSet(1,step)
step += 1
print s.check()
for i in range(0,step+1): print s.model()[mem[i][modaddr]]
for a in range(0,2**address_bits): # set all address values to their address number
    modifyAddr(a,a,step)
    step += 1
print s.check()
print "values for modaddr at all steps"
for i in range(0,step+1): print s.model()[mem[i][modaddr]] # print all values at each step for modaddr
print "values at final step"
for i in range(0,2**address_bits): print s.model()[mem[step][i]] # print all memory addresses at final step
This naive implementation allows you to either (a) set all memory addresses to some value (like your memset), or (b) modify a single memory address, constraining all other addresses to have the same value. For me, it took a few seconds to run and encoded about 128 steps of 128 addresses, so it had around 20000 bitvector expressions of 8 bits each.
Now, depending on what you are doing (e.g., do you allow atomic writes to several addresses like this memset, or do you want to model them all as individual writes?), you could add further functions, like modifying a subset of addresses to some values in a program step. This gives you some flexibility to trade off modeling accuracy for performance (e.g., atomic writes to blocks of memory versus modifying single addresses at a time, which will run into performance problems). Also, nothing about this implementation requires the APIs; you could encode this as an SMT-LIB file as well, but you will probably have more flexibility (e.g., let's say you want to interact with models to constrain future sat checks) if you use one of the APIs.
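For instance, a block write covering only a sub-range of addresses (a memcpy/memset over part of memory) could be added along the same lines. This is a hypothetical helper, reusing the mem, s, address_bits and data_bits globals from the script above:
# write values[0], values[1], ... starting at base_addr in one program step,
# constraining every other address to keep its previous value
def modifyRange(base_addr, values, step):
    mem.append([]) # add new step
    for i in range(0, 2**address_bits):
        mem[step+1].append( BitVec('m' + str(step + 1) + '_' + str(i), data_bits) )
        if base_addr <= i < base_addr + len(values):
            s.add(mem[step+1][i] == values[i - base_addr]) # written address
        else:
            s.add(mem[step+1][i] == mem[step][i]) # unchanged address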

How to find contiguous subarray of integers in an array from n arrays such that the sum of elements of such contiguous subarrays is minimum

Input: n arrays of integers of length p.
Output: An array of p integers built by copying contiguous subarrays of the input arrays into matching indices of the output, satisfying the following conditions.
At most one subarray is used from each input array.
Every index of the output array is filled from exactly one subarray.
The output array has the minimum possible sum.
Suppose I have 2 arrays:
[1,7,2]
[2,1,8]
So if I choose the subarray [1,7] from array 1 and the subarray [8] from array 2, these two subarrays do not overlap at any index and are contiguous. We also never take more than one subarray from an array we have already chosen from.
The number of elements in the subarrays of this collection is 2 + 1 = 3, which is the same as the length of each individual array (i.e. len(array 1), which is equal to 3). So this collection is valid.
The sum here for [1,7] and [8] is 1 + 7 + 8 = 16
We have to find a collection of such subarrays such that the total sum of the elements of subarrays is minimum.
A solution for the above 2 arrays would be the collection [2,1] from array 2 (indices 0-1) and [2] from array 1 (index 2).
This is a valid collection and the sum is 2 + 1 + 2 = 5 which is the minimum sum for any such collection in this case.
I cannot think of any optimal or correct approach, so I need help.
Some Ideas:
I tried a greedy approach: choose the minimum element across all arrays at each index. Since the index always increases (non-overlapping) after a valid choice, I don't have to bother storing minimum-value indices for every array. But this approach is clearly not correct, since it may take from the same array twice.
Another method I thought of was to start from the 0th index of all arrays and keep, for each array, the running sum of its first k elements; since the number of arrays is finite, I can store these sums in an array. Now I take the minimum across these sums, and the subarray giving this minimum (i.e. the first k elements of that array) is a candidate valid subarray of size k. If we take this subarray, we can add the (k+1)-th element of every array to its corresponding sum, and if the original minimum still holds we repeat this step. When the minimum fails, we keep the subarray up to the index for which the minimum held, and that is a valid starting subarray. However, this approach will also clearly fail, because there could exist another subarray of size < k giving the minimum along with the remaining index elements from our subarray of size k.
Sorting is not possible either, since sorting would break the contiguity condition.
Of course, there is a brute force method too.
I am thinking that working through a greedy approach might lead to some progress.
I have searched other Stack Overflow posts, but couldn't find anything that helps with my problem.
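For concreteness, the brute force method mentioned in the question can be sketched as follows (a reference implementation for tiny inputs only; it tries every assignment of a list index to each output position and keeps the valid ones):
from itertools import groupby, product

def brute_force(lists):
    # exponential in the list length; only usable for tiny inputs
    p = len(lists[0])
    best = None
    for choice in product(range(len(lists)), repeat=p):
        runs = [k for k, _ in groupby(choice)]
        if len(runs) != len(set(runs)): # some list index reused non-contiguously
            continue
        total = sum(lists[k][i] for i, k in enumerate(choice))
        if best is None or total < best:
            best = total
    return best
On the example above, brute_force([[1, 7, 2], [2, 1, 8]]) returns 5.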
To get you started, here's a recursive branch-&-bound backtracking - and potentially exhaustive - search. Ordering heuristics can have a huge effect on how efficient these are, but without mounds of "real life" data to test against there's scant basis for picking one over another. This incorporates what may be the single most obvious ordering rule.
Because it's a work in progress, it prints stuff as it goes along: all solutions found, whenever they meet or beat the current best; and the index at which a search is cut off early, when that happens (because it becomes obvious that the partial solution at that point can't be extended to meet or beat the best full solution known so far).
For example,
>>> crunch([[5, 6, 7], [8, 0, 3], [2, 8, 7], [8, 2, 3]])
displays
new best
L2[0:1] = [2] 2
L1[1:2] = [0] 2
L3[2:3] = [3] 5
sum 5
cut at 2
L2[0:1] = [2] 2
L1[1:3] = [0, 3] 5
sum 5
cut at 2
cut at 2
cut at 2
cut at 1
cut at 1
cut at 2
cut at 2
cut at 2
cut at 1
cut at 1
cut at 1
cut at 0
cut at 0
So it found two ways to get a minimal sum 5, and the simple ordering heuristic was effective enough that all other paths to full solutions were cut off early.
def disp(lists, ixs):
    from itertools import groupby
    total = 0
    i = 0
    for k, g in groupby(ixs):
        j = i + len(list(g))
        chunk = lists[k][i:j]
        total += sum(chunk)
        print(f"L{k}[{i}:{j}] = {chunk} {total}")
        i = j

def crunch(lists):
    n = len(lists[0])
    assert all(len(L) == n for L in lists)
    # Start with a sum we know can be beat.
    smallest_sum = sum(lists[0]) + 1
    smallest_ixs = [None] * n
    ixsofar = [None] * n

    def inner(i, sumsofar, freelists):
        nonlocal smallest_sum
        assert sumsofar <= smallest_sum
        if i == n:
            print()
            if sumsofar < smallest_sum:
                smallest_sum = sumsofar
                smallest_ixs[:] = ixsofar
                print("new best")
            disp(lists, ixsofar)
            print("sum", sumsofar)
            return
        # Simple greedy heuristic: try available lists in the order
        # of smallest-to-largest at index i.
        for lix in sorted(freelists, key=lambda lix: lists[lix][i]):
            L = lists[lix]
            newsum = sumsofar
            freelists.remove(lix)
            # Try all slices in L starting at i.
            for j in range(i, n):
                newsum += L[j]
                # ">" to find all smallest answers;
                # ">=" to find just one (potentially faster)
                if newsum > smallest_sum:
                    print("cut at", j)
                    break
                ixsofar[j] = lix
                inner(j + 1, newsum, freelists)
            freelists.add(lix)

    inner(0, 0, set(range(len(lists))))
How bad is brute force?
Bad. A brute force way to compute it: say there are n lists each with p elements. The code's ixsofar vector contains p integers each in range(n). The only constraint is that all occurrences of any integer that appears in it must be consecutive. So a brute force way to compute the total number of such vectors is to generate all p-tuples and count the number that meet the constraints. This is woefully inefficient, taking O(n**p) time, but is really easy, so hard to get wrong:
def countb(n, p):
    from itertools import product, groupby
    result = 0
    seen = set()
    for t in product(range(n), repeat=p):
        seen.clear()
        for k, g in groupby(t):
            if k in seen:
                break
            seen.add(k)
        else:
            #print(t)
            result += 1
    return result
For small arguments, we can use that as a sanity check on the next function, which is efficient. This builds on common "stars and bars" combinatorial arguments to deduce the result:
def count(n, p):
    # n lists of length p
    # for r regions, r from 1 through min(p, n)
    # number of ways to split up: comb((p - r) + r - 1, r - 1)
    # for each, ff(n, r) ways to spray in list indices = comb(n, r) * r!
    from math import comb, prod
    total = 0
    for r in range(1, min(n, p) + 1):
        total += comb(p-1, r-1) * prod(range(n, n-r, -1))
    return total
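As a quick sanity check on small arguments (arbitrary small ranges, just to illustrate the agreement), the two functions can be compared directly:
for n in range(1, 5):
    for p in range(1, 6):
        assert count(n, p) == countb(n, p)
print(count(2, 3)) # 6 valid index vectors for 2 lists of length 3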
Faster
Following is the best code I have for this so far. It builds more "smarts" into the code I posted before. In one sense, it's very effective. For example, for randomized p = n = 20 inputs it usually finishes within a second. That's nothing to sneeze at, since:
>>> count(20, 20)
1399496554158060983080
>>> _.bit_length()
71
That is, trying every possible way would effectively take forever. The number of cases to try doesn't even fit in a 64-bit int.
On the other hand, boost n (the number of lists) to 30, and it can take an hour. At 50, I haven't seen a non-contrived case finish yet, even if left to run overnight. The combinatorial explosion eventually becomes overwhelming.
OTOH, I'm looking for the smallest sum, period. If you needed to solve problems like this in real life, you'd either need a much smarter approach, or settle for iterative approximation algorithms.
Note: this is still a work in progress, so isn't polished, and prints some stuff as it goes along. Mostly that's been reduced to running a "watchdog" thread that wakes up every 10 minutes to show the current state of the ixsofar vector.
def crunch(lists):
    import datetime
    now = datetime.datetime.now
    start = now()
    n = len(lists[0])
    assert all(len(L) == n for L in lists)
    # Start with a sum we know can be beat.
    smallest_sum = min(map(sum, lists)) + 1
    smallest_ixs = [None] * n
    ixsofar = [None] * n

    import threading
    def watcher(stop):
        if stop.wait(60):
            return
        lix = ixsofar[:]
        while not stop.wait(timeout=600):
            print("watch", now() - start, smallest_sum)
            nlix = ixsofar[:]
            for i, (a, b) in enumerate(zip(lix, nlix)):
                if a != b:
                    nlix.insert(i, "--- " + str(i) + " -->")
                    print(nlix)
                    del nlix[i]
                    break
            lix = nlix
    stop = threading.Event()
    w = threading.Thread(target=watcher, args=[stop])
    w.start()

    def inner(i, sumsofar, freelists):
        nonlocal smallest_sum
        assert sumsofar <= smallest_sum
        if i == n:
            print()
            if sumsofar < smallest_sum:
                smallest_sum = sumsofar
                smallest_ixs[:] = ixsofar
                print("new best")
            disp(lists, ixsofar)
            print("sum", sumsofar, now() - start)
            return
        # If only one input list is still free, we have to take all
        # of its tail. This code block isn't necessary, but gives a
        # minor speedup (skips layers of do-nothing calls),
        # especially when the length of the lists is greater than
        # the number of lists.
        if len(freelists) == 1:
            lix = freelists.pop()
            L = lists[lix]
            for j in range(i, n):
                ixsofar[j] = lix
                sumsofar += L[j]
                if sumsofar >= smallest_sum:
                    break
            else:
                inner(n, sumsofar, freelists)
            freelists.add(lix)
            return
        # Peek ahead. The smallest completion we could possibly get
        # would come from picking the smallest element in each
        # remaining column (restricted to the lists - rows - still
        # available). This probably isn't achievable, but is an
        # absolute lower bound on what's possible, so can be used to
        # cut off searches early.
        newsum = sumsofar
        for j in range(i, n): # pick smallest from column j
            newsum += min(lists[lix][j] for lix in freelists)
        if newsum >= smallest_sum:
            return
        # Simple greedy heuristic: try available lists in the order
        # of smallest-to-largest at index i.
        sortedlix = sorted(freelists, key=lambda lix: lists[lix][i])
        # What's the next int in the previous slice? As soon as we
        # hit an int at least that large, we can do at least as well
        # by just returning, to let the caller extend the previous
        # slice instead.
        if i:
            prev = lists[ixsofar[i-1]][i]
        else:
            prev = lists[sortedlix[-1]][i] + 1
        for lix in sortedlix:
            L = lists[lix]
            if prev <= L[i]:
                return
            freelists.remove(lix)
            newsum = sumsofar
            # Try all non-empty slices in L starting at i.
            for j in range(i, n):
                newsum += L[j]
                if newsum >= smallest_sum:
                    break
                ixsofar[j] = lix
                inner(j + 1, newsum, freelists)
            freelists.add(lix)

    inner(0, 0, set(range(len(lists))))
    stop.set()
    w.join()
Bounded by DP
I've had a lot of fun with this :-) Here's the approach they were probably looking for, using dynamic programming (DP). I have several programs that run faster in "smallish" cases, but none that can really compete on a non-contrived 20x50 case. The runtime is O(2**n * n**2 * p). Yes, that's more than exponential in n! But it's still a minuscule fraction of what brute force can require (see above), and is a hard upper bound.
Note: this is just a loop nest slinging machine-size integers, and using no "fancy" Python features. It would be easy to recode in C, where it would run much faster. As is, this code runs over 10x faster under PyPy (as opposed to the standard CPython interpreter).
Key insight: suppose we're going left to right, have reached column j, the last list we picked from was D, and before that we picked columns from lists A, B, and C. How can we proceed? Well, we can pick the next column from D too, and the "used" set {A, B, C} doesn't change. Or we can pick some other list E, the "used" set changes to {A, B, C, D}, and E becomes the last list we picked from.
Now in all these cases, the details of how we reached state "used set {A, B, C} with last list D at column j" make no difference to the collection of possible completions. It doesn't matter how many columns we picked from each, or the order in which A, B, C were used: all that matters to future choices is that A, B, and C can't be used again, and D can be but - if so - must be used immediately.
Since all ways of reaching this state have the same possible completions, the cheapest full solution must have the cheapest way of reaching this state.
So we just go left to right, one column at a time, and remember for each state in the column the smallest sum reaching that state.
This isn't cheap, but it's finite ;-) Since states are subsets of row indices, combined with (the index of) the last list used, there are 2**n * n possible states to keep track of. In fact, there are only half that, since the way sketched above never includes the index of the last-used list in the used set, but catering to that would probably cost more than it saves.
As is, states here are not represented explicitly. Instead there's just a large list of sums-so-far, of length 2**n * n. The state is implied by the list index: index i represents the state where:
i >> n is the index of the last-used list.
The last n bits of i are a bitset, where bit 2**j is set if and only if list index j is in the set of used list indices.
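In code, packing and unpacking such a state key amounts to a couple of one-liners (hypothetical helpers, shown only to make the bit layout concrete):
def encode_state(last, used, n):
    # 'last' = index of the last-used list, 'used' = bitset of used list indices
    return (last << n) | used

def decode_state(key, n):
    return key >> n, key & ((1 << n) - 1)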
You could, e.g., represent these by dicts mapping (frozenset, index) pairs to sums instead, but then memory use explodes, runtime zooms, and PyPy becomes much less effective at speeding it up.
Sad but true: like most DP algorithms, this finds "the best" answer but retains scant memory of how it was reached. Adding code to allow for that is harder than what's here, and typically explodes memory requirements. Probably easiest here: write new to disk at the end of each outer-loop iteration, one file per column. Then memory use isn't affected. When it's done, those files can be read back in again, in reverse order, and mildly tedious code can reconstruct the path it must have taken to reach the winning state, working backwards one column at a time from the end.
def dumbdp(lists):
    import datetime
    _min = min
    now = datetime.datetime.now
    start = now()
    n = len(lists)
    p = len(lists[0])
    assert all(len(L) == p for L in lists)
    rangen = range(n)
    USEDMASK = (1 << n) - 1
    HUGE = sum(sum(L) for L in lists) + 1
    # state key = (index of last-used list << n) | bitset of used lists;
    # new[key] = smallest sum reaching that state in the current column
    new = [HUGE] * (2**n * n)
    for i in rangen:
        new[i << n] = lists[i][0] # column 0: start with list i, empty used set
    for j in range(1, p):
        print("working on", j, now() - start)
        old = new
        new = [HUGE] * (2**n * n)
        for key, g in enumerate(old):
            if g == HUGE:
                continue
            i = key >> n
            # extend the slice taken from the last-used list i
            new[key] = _min(new[key], g + lists[i][j])
            newused = (key & USEDMASK) | (1 << i)
            # or switch to any list not used yet
            for i in rangen:
                mask = 1 << i
                if newused & mask == 0:
                    newkey = newused | (i << n)
                    new[newkey] = _min(new[newkey],
                                       g + lists[i][j])
    result = min(new)
    print("DONE", result, now() - start)
    return result
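On the two-array example from the question, it returns the expected minimum:
print(dumbdp([[1, 7, 2], [2, 1, 8]])) # prints "DONE 5 ..." and returns 5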

Define a vector with random steps

I want to create an array that has incremental random steps; I've used this simple code:
t_inici=(0:10*rand:100);
The problem is that the random number stays unchanged between steps. Is there any simple way to change the seed of the random number within each step?
If you have a set number of points, say nPts, then you could do the following
nPts = 10; % Could use 'randi' here for random number of points
lims = [0, 10] % Start and end points
x = rand(1, nPts); % Create random numbers
% Sort and scale x to fit your limits and be ordered
x = diff(lims) * ( sort(x) - min(x) ) / diff(minmax(x)) + lims(1)
This approach always includes your end point, which a 0:dx:10 approach would not necessarily.
If you had some maximum number of points, say nPtsMax, then you could do the following
nPtsMax = 1000; % Max number of points
lims = [0,10]; % Start and end points
% Could do 10* or any other multiplier as in your example in front of 'rand'
x = lims(1) + [0 cumsum(rand(1, nPtsMax))];
x(x > lims(2)) = []; % remove values above maximum limit
This approach may be slower, but is still fairly quick and better represents the behaviour in your question.
My first approach to this would be to randomly generate N-2 samples, where N is the desired number of samples, sort them, and add the extrema:
N=50;
endpoint=100;
initpoint=0;
randsamples=sort(rand(1, N-2)*(endpoint-initpoint)+initpoint);
t_inici=[initpoint randsamples endpoint];
However, I'm not sure how "uniformly random" this is, as you are "faking" the last 2 data points to have the extrema included. This will distort pure randomness somewhat (I think). If you are not necessarily interested in including the extrema, then just remove the last line and generate N points. That will make sure that they are indeed random (or as random as MATLAB can create them).
Here is an alternative solution that stays "uniformly random":
[initpoint,endpoint,coef]=deal(0,100,10);
t_inici(1)=initpoint;
while(t_inici(end)<endpoint)
    t_inici(end+1)=t_inici(end)+rand()*coef;
end
t_inici(end)=[];
In my view, this fits your attempt well: it has unknown steps and starts from 0, but it does not necessarily end at 100.
From your code it seems you want a uniformly random step that varies between each two entries. This implies that the number of entries that the vector will have is unknown in advance.
A way to do that is as follows. This is similar to Hunter Jiang's answer but adds entries in batches instead of one by one, in order to reduce the number of loop iterations.
1. Guess a number of required entries, n. Any value will do, but a large value will result in fewer iterations and will probably be more efficient.
2. Initialize result to the first value.
3. Generate n entries and concatenate them to the (temporary) result.
4. See if the current entries are already too many.
5. If they are, cut as needed and output the (final) result. Else go back to step 3.
Code:
lower_value = 0;
upper_value = 100;
step_scale = 10;
n = 5*(upper_value-lower_value)/step_scale*2; % STEP 1. The number 5 here is arbitrary.
% It's probably more efficient to err with too many than with too few
result = lower_value; % STEP 2
done = false;
while ~done
    result = [result result(end)+cumsum(step_scale*rand(1,n))]; % STEP 3. Include n new entries
    ind_final = find(result>upper_value,1)-1; % STEP 4. Index of first entry exceeding upper_value, if any
    if ind_final % STEP 5. If non-empty, we're done
        result = result(1:ind_final-1);
        done = true;
    end
end

Effective way of calculating Correlation/Covariance Matrix row-wise

I have around 3000 files. Each file has around 55,000 rows/identifiers and around 100 columns. I need to calculate row-wise correlation or weighted covariance for each file (depending upon the number of columns in the file). The number of rows is the same in all the files. I would like to know the most effective way to calculate the correlation matrix for each file. I have tried Perl and C++ but it is taking a lot of time to process a file -- Perl takes 6 days, C takes more than a day. Typically, I don't want to take more than 15-20 minutes per file.
Now, I would like to know if I could process it faster using some trick or something. Here is my pseudo code:
while (using the file handler)
reading the file line by line
Storing the column values in hash1 where the key is the identifier
Storing the mean and ssxx (Sum of Squared Deviations of x to the mean) in hash2 and hash3 respectively (I used a hash of hashes in Perl) by calling the mean and ssxx functions
end
close file handler
for loop traversing the hash (this is a nested for loop, as I need values of 2 different identifiers to calculate the correlation coefficient)
calculate ssxy by calling the ssxy function, i.e. Sum of Squared Deviations of x and y to their mean
calculate correlation coefficient.
end
Now, I am calculating the correlation coefficient for a pair only once, and I am not calculating the correlation coefficient for the same identifier. I have taken care of that using my nested for loop. Do you think there is a way to calculate the correlation coefficient faster? Any hints/advice would be great. Thanks!
EDIT1:
My Input File looks like this -- for the first 10 identifiers:
"Ident_01" 6453.07 8895.79 8145.31 6388.25 6779.12
"Ident_02" 449.803 367.757 302.633 318.037 331.55
"Ident_03" 16.4878 198.937 220.376 91.352 237.983
"Ident_04" 26.4878 398.937 130.376 92.352 177.983
"Ident_05" 36.4878 298.937 430.376 93.352 167.983
"Ident_06" 46.4878 498.937 560.376 94.352 157.983
"Ident_07" 56.4878 598.937 700.376 95.352 147.983
"Ident_08" 66.4878 698.937 990.376 96.352 137.983
"Ident_09" 76.4878 798.937 120.376 97.352 117.983
"Ident_10" 86.4878 898.937 450.376 98.352 127.983
EDIT2: here are the snippets/subroutines (functions) that I wrote in Perl
use List::Util qw(sum);

## Pearson Correlation Coefficient
sub correlation {
    my( $arr1, $arr2 ) = @_;
    my $ssxy = ssxy( $arr1->{string}, $arr2->{string}, $arr1->{mean}, $arr2->{mean} );
    my $cor = $ssxy / sqrt( $arr1->{ssxx} * $arr2->{ssxx} );
    return $cor;
}

## Mean
sub mean {
    my $arr1 = shift;
    my $mu_x = sum(@$arr1) / scalar(@$arr1);
    return($mu_x);
}

## Sum of Squared Deviations of x to the mean i.e. ssxx
sub ssxx {
    my ( $arr1, $mean_x ) = @_;
    my $ssxx = 0;
    ## looping over all the samples
    for( my $i = 0; $i < @$arr1; $i++ ){
        $ssxx = $ssxx + ( $arr1->[$i] - $mean_x )**2;
    }
    return($ssxx);
}

## Sum of Squared Deviations of xy to the mean i.e. ssxy
sub ssxy {
    my( $arr1, $arr2, $mean_x, $mean_y ) = @_;
    my $ssxy = 0;
    ## looping over all the samples
    for( my $i = 0; $i < @$arr1; $i++ ){
        $ssxy = $ssxy + ( $arr1->[$i] - $mean_x ) * ( $arr2->[$i] - $mean_y );
    }
    return ($ssxy);
}
Have you searched CPAN? The method gsl_stats_correlation computes Pearson's correlation. It is in Math::GSL::Statistics, which binds to the GNU Scientific Library.
gsl_stats_correlation($data1, $stride1, $data2, $stride2, $n) - This function efficiently computes the Pearson correlation coefficient between the array references $data1 and $data2, which must both be of the same length $n:
r = \frac{\mathrm{cov}(x, y)}{\hat\sigma_x \hat\sigma_y} = \frac{\frac{1}{n-1} \sum_i (x_i - \hat x)(y_i - \hat y)}{\sqrt{\frac{1}{n-1} \sum_i (x_i - \hat x)^2}\ \sqrt{\frac{1}{n-1} \sum_i (y_i - \hat y)^2}}
While minor improvements might be possible, I would suggest investing in learning PDL. The documentation on matrix operations may be useful.
@Sinan and @Praveen have the right idea for how to do this within Perl. I would suggest that the overhead inherent in Perl means you will never get the efficiency that you are looking for. I would suggest that you work on optimizing your C code.
First step would be to set the -O3 flag for maximum code optimization.
From there, I would change your ssxx code so that it subtracts the mean from each data point in place: x[i] -= mean. This means that you no longer need to subtract the mean in your ssxy code, so you do the subtraction once instead of 55,001 times.
I would check the disassembly to guarantee that the (x-mean)**2 is compiled to a multiplication, instead of 2^(2 * log(x - mean)), or just write it that way instead.
What sort of data structure are you using for your data? A double** with memory allocated for each row will lead to extra calls to (the slow function) malloc. Also, it is more likely to lead to memory thrashing with the allocated memory being located in different places. Ideally, you should have as few calls to malloc for as large as possible blocks of memory, and using pointer arithmetic to traverse the data.
More optimizations should be possible. If you post your code, I can make some suggestions.

Does Lua optimize the ".." operator?

I have to execute the following code:
local filename = dir .. "/" .. base
thousands of times in a loop (it's a recursion that prints a directory tree).
Now, I wonder whether Lua concatenates the 3 strings (dir, "/", base) in one go (i.e., by allocating a string long enough to hold their total lengths) or whether it does this the inefficient way by doing it internally in two steps:
local filename = (dir .. "/") -- step1
.. base -- step2
This last way would be inefficient memory-wise because two strings are allocated instead of just one.
I don't care much about CPU cycles: I care mainly about memory consumption.
Finally, let me generalize the question:
Does Lua allocate only one string, or 4, when it executes the following code?
local result = str1 .. str2 .. str3 .. str4 .. str5
BTW, I know that I could do:
local filename = string.format("%s/%s", dir, base)
But I've yet to benchmark it (memory & CPU wise).
(BTW, I know about table:concat(). This has the added overhead of creating a table so I guess it won't be beneficial in all use cases.)
A bonus question:
In case Lua doesn't optimize the ".." operator, would it be a good idea to define a C function for concatenating strings, e.g. utils.concat(dir, "/", base, ".", extension)?
Although Lua performs a simple optimization on .. usage, you should still be careful about using it in a tight loop, especially when joining very large strings, because this will create lots of garbage and thus impact performance.
The best way to concatenate many strings is with table.concat.
table.concat lets you use a table as a temporary buffer for all the strings to be concatenated and perform the concatenation only when you are done adding strings to the buffer, like in the following silly example:
local buf = {}
for i = 1, 10000 do
    buf[#buf+1] = get_a_string_from_somewhere()
end
local final_string = table.concat( buf )
The simple optimization for .. can be seen analyzing the disassembled bytecode of the following script:
-- file "lua_06.lua"
local a = "hello"
local b = "cruel"
local c = "world"
local z = a .. " " .. b .. " " .. c
print(z)
the output of luac -l -p lua_06.lua is the following (for Lua 5.2.2 - edit: the same bytecode is output also in Lua 5.3.6):
main (13 instructions at 003E40A0)
0+ params, 8 slots, 1 upvalue, 4 locals, 5 constants, 0 functions
1 [3] LOADK 0 -1 ; "hello"
2 [4] LOADK 1 -2 ; "cruel"
3 [5] LOADK 2 -3 ; "world"
4 [7] MOVE 3 0
5 [7] LOADK 4 -4 ; " "
6 [7] MOVE 5 1
7 [7] LOADK 6 -4 ; " "
8 [7] MOVE 7 2
9 [7] CONCAT 3 3 7
10 [9] GETTABUP 4 0 -5 ; _ENV "print"
11 [9] MOVE 5 3
12 [9] CALL 4 2 1
13 [9] RETURN 0 1
You can see that only a single CONCAT opcode is generated, although many .. operators are used in the script.
To fully understand when to use table.concat you must know that Lua strings are immutable. This means that whenever you try to concatenate two strings you are indeed creating a new string (unless the resulting string is already interned by the interpreter, but this is usually unlikely). For example, consider the following fragment:
local s = s .. "hello"
and assume that s already contains a huge string (say, 10MB). Executing that statement creates a new string (10MB + 5 characters) and discards the old one. So you have just created a 10MB dead object for the garbage collector to cope with. If you do this repeatedly you end up hogging the garbage collector. This is the real problem with .. and this is the typical use case where it is necessary to collect all the pieces of the final string in a table and to use table.concat on it: this won't avoid the generation of garbage (all the pieces will be garbage after the call to table.concat), but you will greatly reduce unnecessary garbage.
Conclusions
Use .. whenever you concatenate few, possibly short, strings, or you are not in a tight loop. In this case table.concat could give you worse performance because:
you must create a table (which usually you would throw away);
you have to call the function table.concat (the function call overhead impacts performance more than using the built-in .. operator a few times).
Use table.concat, if you need to concatenate many strings, especially if one or more of the following conditions are met:
you must do it in subsequent steps (the .. optimization works only inside the same expression);
you are in a tight loop;
the strings are large (say, several kBs or more).
Note that these are just rules of thumb. Where performance is really paramount you should profile your code.
Anyway Lua is quite fast compared with other scripting languages when dealing with strings, so usually you don't need to care so much.
In your example, whether the .. operator is optimized hardly matters for performance; you don't have to worry about memory or CPU. And there's table.concat for concatenating many strings (see Programming in Lua for the use of table.concat).
Back to your question, in this piece of code
local result = str1 .. str2 .. str3 .. str4 .. str5
Lua allocates only one new string, check out this loop from Lua's relevant source in luaV_concat:
do { /* concat all strings */
    size_t l = tsvalue(top-i)->len;
    memcpy(buffer+tl, svalue(top-i), l * sizeof(char));
    tl += l;
} while (--i > 0);
setsvalue2s(L, top-n, luaS_newlstr(L, buffer, tl));
total -= n-1; /* got 'n' strings to create 1 new */
L->top -= n-1; /* popped 'n' strings and pushed one */
You can see that Lua concatenates n strings in this loop but only pushes one string back onto the stack at the end, which is the result string.
BTW, I know about table:concat(). This has the added overhead of creating a table so I guess it won't be beneficial in all use cases.
In this particular use case (and similar ones), you could consider reusing a table if you're concerned with creating lots of garbage tables:
local path = {}
...
-- someplace else, in a loop or function:
path[1], path[2] = dir, base
local filename = table.concat(path, "/")
path[1], path[2] = nil
...
you could even generalize this to a "concat" utility:
local rope = {}
function string_concat(...)
    for i = 1, select("#", ...) do rope[i] = select(i, ...) end -- prepare rope
    local res = table.concat(rope)
    for i = 1, select("#", ...) do rope[i] = nil end -- clear rope
    return res
end

Efficient histogram implementation using a hash function

Is there a more efficient approach to computing a histogram than a binary search for a non-linear bin distribution?
I'm actually only interested in the bit of the algorithm that matches the key (value) to the bin (the transfer function?), i.e. for a bunch of floating-point values I just want to know the appropriate bin index for each value.
I know that for a linear bin distribution you can get O(1) by dividing the value by the bin width, and that for non-linear bins a binary search gets you O(log N). My current implementation uses a binary search on unequal bin widths.
In the spirit of improving efficiency I was curious as to whether you could use a hash function to map a value to its appropriate bin and achieve O(1) time complexity when you have bins of unequal widths?
In some simple cases you can get O(1).
Suppose, your values are 8-bit, from 0 to 255.
If you split them into 8 bins of sizes 2, 2, 4, 8, 16, 32, 64, 128, then the bin value ranges will be: 0-1, 2-3, 4-7, 8-15, 16-31, 32-63, 64-127, 128-255.
In binary these ranges look like:
0000000x (bin 0)
0000001x
000001xx
00001xxx
0001xxxx
001xxxxx
01xxxxxx
1xxxxxxx (bin 7)
So, if you can quickly (in O(1)) count how many most significant zero bits there are in the value, you can get the bin number from it.
In this particular case you may precalculate a look-up table of 256 elements, containing the bin number and finding the appropriate bin for a value is just one table look-up.
Actually, with 8-bit values you can use bins of arbitrary sizes since the look-up table is small.
If you were to go with bins of sizes of powers of 2, you could reuse this look-up table for 16-bit values as well. And you'd need two look-ups. You can extend it to even longer values.
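Here's a sketch of that 256-entry table in Python, assuming exactly the power-of-two bin sizes listed above (the bin index is just the position of the value's highest set bit, with values 0 and 1 both landing in bin 0):
# table[v] gives the bin index for an 8-bit value v
table = [max(v.bit_length() - 1, 0) for v in range(256)]

assert table[0] == 0 and table[1] == 0 # bin 0: 0-1
assert table[4] == 2 and table[7] == 2 # bin 2: 4-7
assert table[128] == 7 and table[255] == 7 # bin 7: 128-255

def find_bin(value):
    return table[value] # one table look-up, O(1)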
Ordinary hash functions are intended to scatter different values quite randomly across some range. A single-bit difference in arguments may lead to dozens of bits different in results. For that reason, ordinary hash functions are not suitable for the situation described in the question.
An alternative is to build an array P with entries that index into the table B of bin limits. Given some value x, we find the bin j it belongs to (or sometimes a nearby bin) via j = P[⌊x·r⌋] where r is a ratio that depends on the size of P and the maximum value in B. The effectiveness of this approach depends on the values in B and the size of P.
The behavior of functions like P[⌊x·r⌋] can be seen via the python code shown below. (The method is about the same in any programming language. However, tips for Python-to-C are given below.) Suppose the code is stored in file histobins.py and loaded into the ipython interpreter with the command import histobins as hb. Then a command like hb.betterparts(27, 99, 9, 80,155) produces output like
At 80 parts, steps = 20 = 7+13
At 81 parts, steps = 16 = 7+9
At 86 parts, steps = 14 = 6+8
At 97 parts, steps = 13 = 12+1
At 108 parts, steps = 12 = 3+9
At 109 parts, steps = 12 = 8+4
At 118 parts, steps = 12 = 6+6
At 119 parts, steps = 10 = 7+3
At 122 parts, steps = 10 = 3+7
At 141 parts, steps = 10 = 5+5
At 142 parts, steps = 10 = 4+6
At 143 parts, steps = 9 = 7+2
These parameters to betterparts set nbins=27, topsize=99, seed=9, plo=80, phi=155 which creates a test set of 27 bins for values from 0 to 99, with random seed 9, and size of P from 80 to 155-1. The number of “steps” is the number of times the two while loops in testparts() operated during a test with 10*nbins values from 0 to topsize. Eg, “At 143 parts, steps = 9 = 7+2” means that when the size of P is 143, out of 270 trials, 261 times P[⌊x·r⌋] produced the correct index at once; 7 times the index had to be decreased, and twice it had to be increased.
The general idea of the method is to trade off space for time. Another tradeoff is preparation time versus operation time. If you are going to be doing billions of lookups, it is worthwhile to do a few thousand trials to find a good value of |P|, the size of P. If you are going to be doing only a few millions of lookups, it might be better to just pick some large value of |P| and run with it, or perhaps just run betterparts over a narrow range. Instead of doing 75 tests as above, if we start with larger |P| fewer tests may give a good enough result. For example, 10 tests via “hb.betterparts(27, 99, 9, 190,200)” produces
At 190 parts, steps = 11 = 5+6
At 191 parts, steps = 5 = 3+2
At 196 parts, steps = 5 = 4+1
As long as P fits into some level of cache (along with other relevant data) making |P| larger will speed up access. So, making |P| as large as practical is a good idea. As |P| gets larger, the difference in performance between one value of |P| and the next gets smaller and smaller. The limiting factors on speed then include time to multiply and time to set up while loops. One approach for faster multiplies may be to choose a power of 2 as a multiplier; compute |P| to match; then use shifts or adds to exponents instead of multiplies. One approach to spending less time setting up while loops is to move the statement if bins[bin] <= x < bins[bin+1]: (or its C equivalent, see below) to before the while statements and do the while's only if the if statement fails.
Python code is shown below. Note, in translating from Python to C,
• # begins a comment
• def begins a function
• a statement like ntest, right, wrong, x = 10*nbins, 0, 0, 0 assigns values to respective identifiers
• a statement like return (ntest, right, wrong, stepdown, stepup) returns a tuple of 5 values that the caller can assign to a tuple or to respective identifiers
• the scope of a def, while, or if ends with a line not indented farther than the def, while, or if
• bins = [0] initializes a list (an extendible indexable array) with value 0 as its initial entry
• bins.append(t) appends value t at the end of list bins
• for i,j in enumerate(p): runs a loop over the elements of iterable p (in this case, p is a list), making the index i and corresponding entry j == p[i] available inside the loop
• range(nparts) stands for a list of the values 0, 1, ... nparts-1
• range(plo, phi) stands for a list of the values plo, plo+1, ... phi-1
• if bins[bin] <= x < bins[bin+1] means if ((bins[bin] <= x) && (x < bins[bin+1]))
• int(round(x*float(nparts)/topsize)) actually rounds x·r, instead of computing ⌊x·r⌋ as advertised above
import random

def makebins(nbins, topsize):
    bins, t = [0], 0
    for i in range(nbins):
        t += random.random()
        bins.append(t)
    for i in range(nbins+1):
        bins[i] *= topsize/t
    bins.append(topsize+1)
    return bins
#________________________________________________________________
def showbins(bins):
    print ''.join('{:6.2f} '.format(x) for x in bins)

def showparts(nbins, bins, topsize, nparts, p):
    ratio = float(topsize)/nparts
    for i,j in enumerate(p):
        print '{:3d}. {:3d} {:6.2f} {:7.2f} '.format(i, j, bins[j], i*ratio)
    print 'nbins: {} topsize: {} nparts: {} ratio: {}'.format(nbins, topsize, nparts, ratio)
    print 'p = ', p
    print 'bins = ',
    showbins(bins)
#________________________________________________________________
def testparts(nbins, topsize, nparts, seed):
    # Make bins and make lookup table p
    if seed > 0: random.seed(seed)
    bins = makebins(nbins,topsize)
    ratio, j, p = float(topsize)/nparts, 0, range(nparts)
    for i in range(nparts):
        while j<nbins and i*ratio >= bins[j+1]:
            j += 1
        p[i] = j
    p.append(j)
    #showparts(nbins, bins, topsize, nparts, p)
    # Count # of hits and steps with avg. of 10 items per bin
    ntest, right, wrong, x = 10*nbins, 0, 0, 0
    delta, stepdown, stepup = topsize/float(ntest), 0, 0
    for i in range(ntest):
        bin = p[min(nparts, max(0, int(round(x*float(nparts)/topsize))))]
        while bin < nbins and x >= bins[bin+1]:
            bin += 1; stepup += 1
        while bin > 0 and x < bins[bin]:
            bin -= 1; stepdown += 1
        if bins[bin] <= x < bins[bin+1]: # Test if bin is correct
            right += 1
        else:
            wrong += 1
            print 'Wrong bin {} {:7.3f} at x={:7.3f} Too {}'.format(bin, bins[bin], x, 'high' if bins[bin] > x else 'low')
        x += delta
    return (ntest, right, wrong, stepdown, stepup)
#________________________________________________________________
def betterparts(nbins, topsize, seed, plo, phi):
    beststep = 1e9
    for parts in range(plo, phi):
        ntest, right, wrong, stepdown, stepup = testparts(nbins, topsize, parts, seed)
        if wrong: print 'Error with ', parts, ' parts'
        steps = stepdown + stepup
        if steps <= beststep:
            beststep = steps
            print 'At {:3d} parts, steps = {:d} = {:d}+{:d}'.format(parts, steps, stepdown, stepup)
#________________________________________________________________
Interpolation search is your friend. It's kind of an optimistic, predictive binary search where it guesses where the bin should be based on a linear assumption about the distribution of inputs, rather than just splitting the search space in half at each step. It will be O(1) if the linear assumption is true, but still works (though more slowly) when the assumption is not. To the degree that its predictions are accurate, the search is fast.
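Here is a sketch of that idea over a sorted list of bin edges (assuming edges[0] <= x < edges[-1]; when the linear guess misses, the range is narrowed and the guess is recomputed):
def find_bin(edges, x):
    # returns j such that edges[j] <= x < edges[j+1]
    lo, hi = 0, len(edges) - 2
    while lo < hi:
        span = edges[hi + 1] - edges[lo]
        # linear guess of where x falls between edges[lo] and edges[hi + 1]
        guess = lo + int((x - edges[lo]) * (hi - lo) / span) if span > 0 else lo
        guess = min(max(guess, lo), hi)
        if x < edges[guess]:
            hi = guess - 1
        elif x >= edges[guess + 1]:
            lo = guess + 1
        else:
            return guess
    return lo
For example, find_bin([0, 2, 4, 8, 16], 5) returns 2 (the 4-7 bin).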
It depends on the implementation of the hashing and the type of data you're working with. For smaller data sets a simpler algorithm like binary search might outperform constant lookup if the lookup overhead of hashing is larger on average.
The usual implementation of hashing consists of an array of linked lists and a hash function that maps a key to an index in the array of linked lists. There's a thing called the load factor, which is the number of elements in the hash map divided by the length of the linked-list array. Thus for load factors < 1 you'll achieve constant lookup in the best case, because no linked list will contain more than one element.
There's only one way to find out which is better - implement a hash map and see for yourself. You should be able to get something near constant lookup :)
