I have two arrays: the first contains the areas of flats and the second their prices. The values of the arrays form a chart and will be used to calculate the result of a cost function. The main task is to find the parameter of the cost function that minimizes its result. This is what the cost function looks like:
It was suggested to loop from 1 to 10,000 and pick the parameter that gives the smallest result. The complexity of this algorithm is 10,000 * the size of the arrays.
I proposed calculating the differences between corresponding elements of the arrays and putting the results into an array, then taking the average of all elements of this array. The obtained average is the parameter that should give a better result for our cost function. The algorithm is much more efficient than the previous one and can provide more accurate results.
Is my algorithm applicable?
The cost function that you're proposing is the mean squared error of fitting a linear function to a collection of data points. This is a well-studied problem, and in fact there's a closed-form solution that will tell you the mathematically optimal value of a that you should pick. In that sense, I would recommend using neither of the solutions proposed here and instead just solving things directly.
The cost function you have is a function purely of the variable a, so taking the derivative with respect to a, setting that derivative to zero, and solving should give you the optimal choice of a.
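$$\mathrm{Cost}(a) = \frac{1}{2m} \sum_{i=0} (a x_i - y_i)^2$$
$$\mathrm{Cost}'(a) = \frac{1}{2m} \sum_{i=0} 2 (a x_i - y_i) x_i$$
$$\mathrm{Cost}'(a) = \frac{1}{2m} \sum_{i=0} (2 a x_i^2 - 2 x_i y_i)$$
Setting this expression to 0 and simplifying tells us that
$$0 = \frac{1}{2m} \sum_{i=0} (2 a x_i^2 - 2 x_i y_i)$$
$$0 = \sum_{i=0} (2 a x_i^2 - 2 x_i y_i)$$
$$0 = 2a \sum_{i=0} x_i^2 - 2 \sum_{i=0} x_i y_i$$
$$a \sum_{i=0} x_i^2 = \sum_{i=0} x_i y_i$$
$$a = \frac{\sum_{i=0} x_i y_i}{\sum_{i=0} x_i^2}$$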
You should be able to compute this pretty easily in time O(n) by making a single pass over the arrays and computing the numerator and denominator.
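As a rough illustration of that single pass, here is a minimal Python sketch. The function names and the sample data are made up for the example; only the formula a = Σ x_i y_i / Σ x_i^2 comes from the derivation above.

```python
def best_slope(areas, prices):
    """Return a = sum(x*y) / sum(x*x), accumulated in one pass over both arrays."""
    numerator = 0.0
    denominator = 0.0
    for x, y in zip(areas, prices):
        numerator += x * y
        denominator += x * x
    if denominator == 0:
        raise ValueError("all x values are zero; the slope is undefined")
    return numerator / denominator


def cost(a, areas, prices):
    """Cost(a) = (1 / 2m) * sum((a*x_i - y_i)^2), as in the derivation."""
    m = len(areas)
    return sum((a * x - y) ** 2 for x, y in zip(areas, prices)) / (2 * m)


# Hypothetical data, not from the question.
areas = [30.0, 50.0, 70.0]
prices = [61.0, 99.0, 141.0]
a = best_slope(areas, prices)
print(a, cost(a, areas, prices))
```

Because the cost is a convex quadratic in a, nudging a in either direction from this value should only increase the cost, which is an easy sanity check on your own data.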
Say I have some array of length n where arr[k] represents how much of object k I want. I also have some arbitrary number of arrays which I can sum integer multiples of in any combination - my goal being to minimise the sum of the absolute differences across each element.
So as a dumb example if my target was [2,1] and my options were A = [2,3] and B = [0,1], then I could take A - 2B and have a cost of 0
I’m wondering if there is an efficient algorithm for approximating something like this? It has a weird knapsack-y flavour to it; is it maybe just intractable for large n? It doesn’t seem very amenable to DP methods.
This is the (NP-hard) closest vector problem. There's an algorithm due to Fincke and Pohst ("Improved methods for calculating vectors of short length in a lattice, including a complexity analysis"), but I haven't personally worked with it.
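To make the problem statement concrete, here is a small exhaustive search in Python. To be clear, this is not the Fincke-Pohst algorithm; it is a brute-force sketch that only tries coefficients in a bounded range (max_coeff is an assumption of mine), so it is exponential in the number of option arrays and only useful for tiny instances.

```python
from itertools import product


def closest_combination(target, options, max_coeff=3):
    """Try every integer coefficient vector in [-max_coeff, max_coeff]^len(options)."""
    best_cost, best_coeffs = None, None
    for coeffs in product(range(-max_coeff, max_coeff + 1), repeat=len(options)):
        # Element-wise sum of the scaled option arrays.
        combo = [sum(c * opt[k] for c, opt in zip(coeffs, options))
                 for k in range(len(target))]
        # Sum of absolute differences to the target.
        cost = sum(abs(t - v) for t, v in zip(target, combo))
        if best_cost is None or cost < best_cost:
            best_cost, best_coeffs = cost, coeffs
    return best_cost, best_coeffs


# The example from the question: target [2,1], A = [2,3], B = [0,1].
cost, coeffs = closest_combination([2, 1], [[2, 3], [0, 1]])
print(cost, coeffs)  # cost 0 with coefficients (1, -2), i.e. A - 2B
```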
Given an array of values,
arr = [8,10,4,5,3,7,6,0,1,9,13,2]
X is an array of values chosen from arr, where X.length != 0 and X.length < arr.length
The chosen values are then fed into a function, score(), which will return a score based on the array of selected values.
Example 1:
X = [8]
score(X) = 71
Example 2:
X = [4]
score(X) = 36
Example 3:
X = [8,10,7]
score(X) = 51
Example 4:
X = [5,9,0]
score(X) = 4
The function score() here is a black box and we can't modify how it works; we just provide an input and the function returns the score.
My problem: How to get the lowest score for each set of numbers?
Meaning, if X is an array that has only 1 value, and I feed all the different values in arr, each value will return me a different score value, and I find which arr value provides the lowest score.
If X is an array of 3 values, I feed a combination of all the different possible values in arr, with each different set of 3 values returning a different score and finding the lowest score.
This is simple enough to do if my arr is small. However, if I have an array of 50 or even 100 values, how can I create an algorithm that would provide the lowest score based on the number of input values?
tl;dr: If you don't know anything about score, then you can't speed it up.
In order to optimize score itself, you would have to know how it works. After all, "optimizing" simply means "doing the same thing more efficiently", but how can you know if it really does "the same thing" if you don't know what "the same thing" is? Plus, speeding up score will not help you with the combinatorial explosion anyway. The number of combinations grows so fast that any speedups to score will be quickly eaten up by slightly larger inputs.
In order to optimize how you apply score, you would again need to know something about it. If you knew something about score, you could, for example, only generate combinations that you know will yield different values, or combinations that you know will only yield larger values. In other words, you could exploit some structure in the output of score in order to reduce the input size. However, we don't know the structure of the output of score, in fact, we don't even know if there is some structure at all! So we can't exploit it. Plus, there would have to be some extreme redundancy and regularity in the structure, in order for a significant reduction in input size.
In his comment, @ndn suggested applying some form of machine learning to discover structure in the output. How well this works depends on what kind of structure the output has. And of course, this again assumes that there even is some structure to discover, which we don't know. And again, even if there were some structure, it would have to be very redundant and regular to make up for the combinatorial explosion of the input space.
Really, brute force is the only way. Our last straw is going to be parallelization. Maybe, if we distribute the problem across enough CPU cores, we can tackle it? Unfortunately, the combinatorial explosion in the input space is still really going to hurt you:
If we assume that we have a 10THz CPU (i.e. a thousand times faster than the fastest currently available CPU), and we assume that we can compute score in a single clock cycle, and we assume that we have a computer with 10 million cores (again, that's a thousand times larger than the largest supercomputers), it's still going to take over 400 years to find the optimal selection for an input array as small as 100 numbers. And even if we make our CPU a billion times faster and the computer a billion times bigger, simply doubling the size of the array to 200 items will increase the runtime to 500 trillion years.
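The arithmetic behind those figures can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it counts all 2^n subsets of the array (the empty and full sets make no practical difference) and assumes one score evaluation per core per clock cycle, as stated above.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_enumerate(n_items, clock_hz, cores):
    """Years needed to score every subset at one evaluation per core per cycle."""
    subsets = 2 ** n_items
    evaluations_per_second = clock_hz * cores
    return subsets / evaluations_per_second / SECONDS_PER_YEAR


# 10 THz clock, 10 million cores, 100 numbers.
small = years_to_enumerate(100, 10e12, 10e6)
# A billion times faster and a billion times bigger, 200 numbers.
large = years_to_enumerate(200, 10e12 * 1e9, 10e6 * 1e9)
print(f"{small:.0f} years, {large:.2e} years")
```

The first figure comes out at a little over 400 years and the second at roughly 5 * 10^14 years, which matches the numbers quoted.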
There is a reason why we call combinatorial explosion "combinatorial explosion", after all.
I have a vector of numbers like this:
myVec= [ 1 2 3 4 5 6 7 8 ...]
and I have a custom function which takes the input of one number, performs an algorithm and returns another number.
cust(1)= 55, cust(2)= 497, cust(3)= 14, etc.
I want to be able to return the number in the first vector which yielded the highest outcome.
My current thought is to generate a second vector, outcomeVec, which contains the output from the custom function, and then find the index of that vector that has max(outcomeVec), then match that index to myVec. I am wondering, is there a more efficient way of doing this?
What you described is a good way to do it.
outcomeVec = myfunc(myVec);
[~,ndx] = max(outcomeVec);
myVec(ndx) % input that produces max output
Another option is to do it with a loop. This saves a little memory, but may be slower.
maxOutputValue = -Inf;
maxOutputNdx = NaN;
for ndx = 1:length(myVec)
    output = myfunc(myVec(ndx));
    if output > maxOutputValue
        maxOutputValue = output;
        maxOutputNdx = ndx;
    end
end
myVec(maxOutputNdx) % input that produces max output
Those are pretty much your only options.
You could make it fancy by writing a general purpose function that takes in a function handle and an input array. That method would implement one of the techniques above and return the input value that produces the largest output.
Depending on the size of the range of discrete numbers you are searching over, you may find a solution with a golden section algorithm works more efficiently. I tried for instance to minimize the following:
bf = -21;
f = @(x) round(x-bf).^2;
within the range [-100 100] with a routine based on a script from the Mathworks file exchange. This specific file exchange script does not appear to implement the golden section correctly as it makes two function calls per iteration. After fixing this the number of calls required is reduced to 12, which certainly beats evaluating the function 200 times prior to a "dumb" call to min. The gains can quickly become dramatic. For instance, if the search region is [-100000 100000], golden finds the minimum in 25 function calls as opposed to 200000 - the dependence of the number of calls in golden section on the range is logarithmic, not linear.
So if the range is sufficiently large, other methods can definitely beat min by requiring fewer function calls. Minimization search routines sometimes incorporate such a search in early steps. However, you will have a problem with convergence (termination) criteria, which you will have to modify so that the routine knows when to stop. The best option is probably to narrow the search region for application of min by starting out with a few iterations of golden section.
An important caveat is that golden section is guaranteed to work only with unimodal regions, that is, displaying a single minimum. In a region containing multiple minima it's likely to get stuck in one and may miss the global minimum. In that sense min is a sure bet.
Note also that the function in the example here rounds input x, whereas your function takes an integer input. This means you would have to place a wrapper around your function which rounds the input passed by the calling golden routine.
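To illustrate the logarithmic behaviour on an integer domain, here is a Python sketch. Note that I have swapped in an integer ternary search rather than a true golden section: it is a simpler relative that is easy to get right on integers, but it spends two function calls per step where a proper golden section would reuse one. It assumes the function is strictly unimodal on the integers in [lo, hi]; the names and the call counter are made up for the example.

```python
def integer_ternary_min(f, lo, hi):
    """Minimise a strictly unimodal f over the integers lo..hi.

    Returns (argmin, number of calls made to f).
    """
    calls = 0

    def counted(x):
        nonlocal calls
        calls += 1
        return f(x)

    while hi - lo > 2:
        third = (hi - lo) // 3
        m1, m2 = lo + third, hi - third
        if counted(m1) < counted(m2):
            hi = m2 - 1   # the minimum lies strictly left of m2
        else:
            lo = m1 + 1   # the minimum lies strictly right of m1
    # At most three candidates remain; check them directly.
    best = min(range(lo, hi + 1), key=counted)
    return best, calls


bf = -21
f = lambda k: (k - bf) ** 2   # integer analogue of the example function
best, calls = integer_ternary_min(f, -100, 100)
big_best, big_calls = integer_ternary_min(f, -100000, 100000)
print(best, calls, big_best, big_calls)
```

The unimodality caveat above applies here just as much: with several local minima the search can discard the region containing the global one.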
Others appear to have used genetic algorithms to perform such a search, although I did not research this.
There are some related questions that I've come across (like this, this, this, and this) but they all deal with fitting data to a known curve. Is there a way to fit given data to an unknown curve? By which I mean, given some data the algorithm will give me a fit which is one function or a sum of functions. I'm programming in C, but I'm at a complete loss on how to use the gsl package to do this. I'm open to using anything that can (ideally) be piped through C. But any help on what direction I should look will be greatly appreciated.
EDIT: This is basically experimental (physics) data that I've collected, so the data will have some trend modified by additive gaussian distributed noise. In general the trend will be non-linear, so I guess that a linear regression fitting method will be unsuitable. As for the ordering, the data is time-ordered, so the curve necessarily has to be fit in that order.
You might be looking for polynomial interpolation, in the field of numerical analysis.
In polynomial interpolation - given a set of points (x,y) - you are trying to find the polynomial that best fits these points. One way to do it is using Newton interpolation, which is fairly easy to program.
The field of numerical analysis, and interpolation in particular, is widely studied, and you might be able to get a nice upper bound on the error of the polynomial.
Note, however, that because you are looking for a polynomial that best fits your data, and the underlying function is not really a polynomial, the error grows very quickly once you move away from your initial training set.
Also note that your data set is finite, and there are infinitely many (in fact, uncountably many) functions that can fit the data (exactly or approximately) - so which one of these is the best might be specific to what you are actually trying to achieve.
If you are looking for a model to fit your data, note that linear regression and polynomial interpolation are at opposite ends of the scale: polynomial interpolation might overfit the data, while linear regression might underfit it. What exactly should be used is case specific and varies from one application to another.
Simple polynomial interpolation example:
Let's say we have (0,1),(1,2),(3,10) as our data.
The table (1) we get using Newton's method is:
x | y  | 1st differences | 2nd differences
0 | 1  |                 |
1 | 2  | (2-1)/(1-0)=1   |
3 | 10 | (10-2)/(3-1)=4  | (4-1)/(3-0)=1
Now, the polynomial we get is the "diagonal" that ends with the last element:
1 + 1*(x-0) + 1*(x-0)(x-1) = 1 + x + x^2 - x = x^2 +1
(and that is a perfect fit indeed to the data we used)
(1) The table is recursively created: The first 2 columns are the x,y values - and each next column is based on the prior one. It is really easy to implement once you get it, the full explanation is in the wikipedia page for newton interpolation.
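For what it's worth, the table above can be built in a few lines of Python. This is only a sketch of Newton's divided differences; the function names are my own, and it keeps just the "diagonal" of the table by updating a single list in place.

```python
def newton_coefficients(xs, ys):
    """Return the diagonal of the divided-difference table for the points (xs, ys)."""
    n = len(xs)
    coeffs = list(ys)
    for level in range(1, n):
        # Work from the bottom up so that coeffs[i - 1] still holds the
        # previous column's value when coeffs[i] is updated.
        for i in range(n - 1, level - 1, -1):
            coeffs[i] = (coeffs[i] - coeffs[i - 1]) / (xs[i] - xs[i - level])
    return coeffs


def newton_eval(xs, coeffs, x):
    """Evaluate c0 + c1*(x-x0) + c2*(x-x0)*(x-x1) + ... by nested multiplication."""
    result = coeffs[-1]
    for c, xi in zip(reversed(coeffs[:-1]), reversed(xs[:-1])):
        result = result * (x - xi) + c
    return result


xs, ys = [0, 1, 3], [1, 2, 10]
coeffs = newton_coefficients(xs, ys)
print(coeffs)                      # the diagonal 1, 1, 1 from the table
print(newton_eval(xs, coeffs, 2))  # x^2 + 1 at x = 2
```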
Another alternative is using linear regression, but multi-dimensional.
The trick here is to artificially generate extra dimensions. You can do so by simply applying some functions to the original data set. A common use is generating polynomials to match the data, so here the functions you apply are f(x) = x^i for i = 2, ..., k (where k is the degree of the polynomial you want to get).
For example, for the data set (0,2),(2,3) with k = 3 you will get 2 extra dimensions, and your data set will be: (0,2,4,8),(2,3,9,27).
The linear-regression algorithm will find the values a_0,a_1,...,a_k for the polynomial p(x) = a_0 + a_1*x + ... + a_k * x^k that minimize the error between each data point and the model's prediction (the value of p(x)).
Now, the problem is that when you start increasing the dimension, you are moving from underfitting (1-dimensional linear regression) to overfitting (when k==n, you are effectively getting polynomial interpolation).
To choose the best k value, you can use cross-validation and choose the k that minimizes the error according to your cross-validation.
Note that this process can be fully automated: all you need is to iteratively check all k values in the desired range (1), and choose the model with the k that minimizes the error according to the cross-validation.
(1) The range could be [1,n] - though it will probably be way too time consuming, I'd go for [1,sqrt(n)] or even [1,log(n)] - but it is just a hunch.
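Here is a rough Python sketch of that automated loop, using leave-one-out cross-validation as one possible choice of cross-validation scheme. Everything here is illustrative: the data is made up, the function names are mine, and the fit solves the normal equations with plain Gaussian elimination, which is fine for small k but numerically fragile for high degrees (a real implementation would use a proper least-squares routine).

```python
def fit_polynomial(xs, ys, k):
    """Least-squares coefficients a_0..a_k for features 1, x, ..., x^k."""
    size = k + 1
    # Augmented normal equations [A^T A | A^T y].
    rows = [[sum(x ** (i + j) for x in xs) for j in range(size)]
            + [sum(y * x ** i for x, y in zip(xs, ys))]
            for i in range(size)]
    for col in range(size):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, size), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(size):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][size] / rows[i][i] for i in range(size)]


def predict(coeffs, x):
    return sum(a * x ** i for i, a in enumerate(coeffs))


def loo_error(xs, ys, k):
    """Mean squared error when each point is predicted from all the others."""
    total = 0.0
    for i in range(len(xs)):
        coeffs = fit_polynomial(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], k)
        total += (predict(coeffs, xs[i]) - ys[i]) ** 2
    return total / len(xs)


def choose_degree(xs, ys, degrees):
    return min(degrees, key=lambda k: loo_error(xs, ys, k))


# Hypothetical data: roughly x^2 + 1 with a little noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.1, 1.9, 5.1, 9.9, 17.1, 25.9, 37.1, 49.9]
print(choose_degree(xs, ys, range(1, 4)))
```

On data like this a straight line (k = 1) has a much larger cross-validation error than a quadratic, so the loop moves away from the underfitting end on its own.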
You might want to use (Fast) Fourier Transforms to convert data to frequency domain.
With the result of the transform (a set of amplitudes and phases and frequencies) even the most twisted set of data can be represented by several functions (harmonics) of the form:
r * cos(f * t - p)
where r is the harmonic amplitude, f is the frequency and p the phase.
Finally, the unknown data curve is the sum of all harmonics.
I have done this in R (you have some examples of it) but I believe C has enough tools to manage it. It is also possible to pipe C and R but don't know much about it. This might be of help.
This method is really good for large chunks of data because it has complexities of:
1) decompose the data with the Fast Fourier Transform (FFT) = O(n log n)
2) build the function from the resulting components = O(n)
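To show how the amplitudes, frequencies and phases fit together, here is a small Python sketch. Be aware that it uses a naive O(n^2) discrete Fourier transform rather than an FFT, purely to keep the example self-contained; the sample values and function names are made up. For real-valued samples, summing r * cos(f * t - p) over all harmonics reproduces the data at the sample points.

```python
import cmath
import math


def harmonics(samples):
    """Naive DFT; returns a list of (r, f, p) with one entry per harmonic."""
    n = len(samples)
    parts = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples))
        r = abs(coeff) / n           # amplitude
        f = 2 * math.pi * k / n      # angular frequency per sample
        p = -cmath.phase(coeff)      # phase, with the sign used in r*cos(f*t - p)
        parts.append((r, f, p))
    return parts


def reconstruct(parts, t):
    """Sum of all harmonics at time index t."""
    return sum(r * math.cos(f * t - p) for r, f, p in parts)


samples = [1.0, 3.0, -2.0, 0.5, 4.0]   # hypothetical measurements
parts = harmonics(samples)
rebuilt = [reconstruct(parts, t) for t in range(len(samples))]
print(rebuilt)
```

With noisy data you would normally keep only the few harmonics with the largest amplitudes, which trades exactness at the sample points for a smoother curve.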
I have a simple machine learning question:
I have n (~110) elements, and a matrix of all the pairwise distances. I would like to choose the 10 elements that are farthest apart. That is, I want to
Choose 10 different elements.
Return min distance over (all pairings within the 10).
My distance metric is symmetric and respects the triangle inequality.
What kind of algorithm can I use? My first instinct is to do the following:
Cluster the n elements into 20
Replace each cluster with just the element of that cluster that is furthest from the mean element of the original n.
Use brute force to solve the problem on the remaining 20 candidates. Luckily, 20 choose 10 is only 184,756.
Edit: thanks to etarion's insightful comment, changed "Return sum of (distances)" to "Return min distance" in the optimization problem statement.
Here's how you might approach this combinatorial optimization problem by taking the convex relaxation.
Let D be an upper triangular matrix with your distances on the upper triangle. I.e. where i < j, D_i,j is the distance between elements i and j. (Presumably, you'll have zeros on the diagonal, as well.)
Then your objective is to maximize x'*D*x, where x is binary valued with 10 elements set to 1 and the rest to 0. (Setting the ith entry in x to 1 is analogous to selecting the ith element as one of your 10 elements.)
The "standard" convex optimization thing to do with a combinatorial problem like this is to relax the constraints such that x need not be discrete valued. Doing so gives us the following problem:
maximize y'*D*y
subject to: 0 <= y_i <= 1 for all i, 1'*y = 10
This is (morally) a quadratic program. (If we replace D with D + D', it'll become a bona fide quadratic program and the y you get out should be no different.) You can use an off-the-shelf QP solver, or just plug it in to the convex optimization solver of your choice (e.g. cvx).
The y you get out need not be (and probably won't be) a binary vector, but you can convert the scalar values to discrete ones in a bunch of ways. (The simplest is probably to let x be 1 in the 10 entries where y_i is highest, but you might need to do something a little more complicated.) In any case, y'*D*y with the y you get out does give you an upper bound for the optimal value of x'*D*x, so if the x you construct from y has x'*D*x very close to y'*D*y, you can be pretty happy with your approximation.
Let me know if any of this is unclear, notation or otherwise.
Nice question.
I'm not sure if it can be solved exactly in an efficient manner, and your clustering-based solution seems reasonable. Another direction to look at would be local search methods such as simulated annealing and hill climbing.
Here's an obvious baseline I would compare any other solution against:
Repeat 100 times:
Greedily select the datapoint whose removal decreases the objective function the least and remove it.
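A minimal Python sketch of this baseline follows. It assumes the objective is the minimum pairwise distance among the kept elements (the edited version of the question), takes the distance matrix as a list of lists, and needs k >= 2. The function names and the toy data are mine. It recomputes the objective from scratch for every candidate, so it is slow but easy to check.

```python
def min_pairwise(dist, chosen):
    """Smallest distance between any two of the chosen indices."""
    return min(dist[i][j] for a, i in enumerate(chosen) for j in chosen[a + 1:])


def greedy_removal(dist, k):
    """Repeatedly drop the element whose removal leaves the best objective."""
    chosen = list(range(len(dist)))
    while len(chosen) > k:
        victim = max(
            chosen,
            key=lambda c: min_pairwise(dist, [i for i in chosen if i != c]),
        )
        chosen.remove(victim)
    return chosen


# Toy example: five points on a line, keep three.
points = [0, 1, 2, 10, 20]
dist = [[abs(p - q) for q in points] for p in points]
kept = greedy_removal(dist, 3)
print([points[i] for i in kept])  # keeps 0, 10 and 20
```

With n around 110 and k = 10 the outer loop runs 100 times, which matches the "Repeat 100 times" above.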