How to efficiently search a structured numeric array

I have a virtual array of GB size which is m by n, and in which higher values are to the right and towards the top. By virtual I mean that values are returned by another program for the coordinates I supply, but the underlying functions for a given run are not known to the programmer. It is guaranteed that the given number is in the array.
{It has since turned out that the number is the product of two primes, so finding it amounts to factoring; see the note below.}
I looked at Efficient search of sorted numerical values, but it doesn't reflect the multiple-row structure I need. I tried a "spiral" approach, but it sometimes takes a long time to traverse (looking at more than half the possible slots). Rows typically have regular gaps, but the gap differs from row to row. Columns tend to follow (different) arithmetic progressions.
The rows are sorted. The leftmost value in a row is less than the leftmost value in the next higher row, and the rightmost value in a row is less than the rightmost value in the next higher row. See the example data below.
What I have tried is to first eliminate rows which cannot hold the target value, then pick the "middle" row of those remaining, do a binary search on that row, and then move up or down according to whether the next row is likely (a guess) to have more values in range. The target value is likely to be placed randomly among the possible slots.
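A minimal sketch of that approach, in Python for illustration (get_value is a stand-in for the virtual-array lookup, and fanning outwards from the middle surviving row replaces the up/down guess):

def binary_search_row(get_value, r, n_cols, target):
    # Standard binary search along one sorted row, probing the virtual array.
    lo, hi = 0, n_cols - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        v = get_value(r, mid)
        if v == target:
            return mid
        if v < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None

def find_target(get_value, n_rows, n_cols, target):
    # A row can hold the target only if its leftmost value <= target <= its rightmost value.
    rows = [r for r in range(n_rows)
            if get_value(r, 0) <= target <= get_value(r, n_cols - 1)]
    # Try the middle surviving row first and fan outwards.
    mid = len(rows) // 2
    for i in sorted(range(len(rows)), key=lambda i: abs(i - mid)):
        c = binary_search_row(get_value, rows[i], n_cols, target)
        if c is not None:
            return rows[i], c
    return None

In the worst case this still binary-searches every surviving row, so it costs O(m log n) probes plus two probes per row for the elimination step.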
Here is some sample data
1008 1064 1120 1176 1232
999 1053 1107 1161 1215
988 1040 1092 1144 1196
975 1025 1075 1125 1175
960 1008 1056 1104 1152
Any ideas please?

This is equivalent to factorization when the target number is the product of just two primes, which turns out to be the case {that wasn't clear at the time of posting}.
Factorization has no known polynomial-time algorithm, so the general problem should not be expected to be easy.
An interesting sidelight on factorization and decision theory is here
https://cstheory.stackexchange.com/questions/25466/factoring-as-a-decision-problem
and here
http://rjlipton.wordpress.com/2011/01/23/is-factoring-really-in-bqp-really/

Related

Given a sorted array, find the number of all unique pairs whose sum is greater than X

Is there any solution with complexity better than O(n²)? Linear complexity would be best, but O(n log n) is also great.
I've tried to use binary search for each element, which would be O(n log n), but I'm always missing something.
For example, given the sorted array: 2 8 8 9 10 20 21
and the number X=18
[2,20][2,21][8,20][8,21][8,20][8,21][9,10][9,20][9,21][10,20][10,21][20,21]
The function should return 12.
For your problem, for any given set of numbers, there may be some which are guaranteed to form a qualifying pair with any other number, because they are themselves greater than the target sum.
Identify those. Any pair involving one of them is a solution. With your example data, 21 and 20 satisfy this, so all pairs involving 21 (six of them) and all remaining pairs involving 20 (five of them) satisfy your requirement (11 results so far).
Now treat your sorted array as a subset that excludes the above set (if non-empty).
From this new subset, find all numbers which can be paired with another number in the same subset to satisfy your requirement. Start from the highest number (for "complexity" it may not matter, but for ease of coding, knowing early that you have none often helps), remembering that there need to be at least two such numbers for any more results to be added.
Take your data in successive adjacent pairs. Once you reach a sum which does not meet your requirement, you know you have found the bottom limit for your table.
With this sequence:
1 2 12 13 14 15 16 17 18 19
19 already meets the criteria. The remaining subset then has nine numbers. Take 18 and 17: meets the criteria. Take 17 and 16: meets. 16 and 15: meets... 13 and 12: meets. 12 and 2: does not meet. So 12 is the lowest value in the range, and the number of items in the range is seven.
This process yields 10 and 9 with your original data: one more combination, giving the 12 combinations that are sought.
Determining the number of combinations should be abstracted out, as it is simple to calculate and, depending on the actual situation, perhaps faster to pre-calculate.
A good rule-of-thumb is that if you, yourself, can get the answer by looking at the data, it is pretty easy for a computer. Doing it the same way is a good starting point, though not always the end.
Find the subset of the highest elements which would cause the sum to be greater than the desired value when paired with any other member of the array.
Taking the subset which excludes those, use the highest element and work backwards to find the lowest element which gives you success. Have that as your third subset.
Your combinations are: every pairing of a first-subset member with any other member of the entire group, plus the pairings within the third subset alone.
You are the student, you'll have to work out how complex that is.
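For comparison, the usual way to beat O(n²) here is to sort and then walk two pointers inwards; this is the standard technique rather than the exact procedure above, and a rough Python sketch looks like this:

def count_pairs_greater(a, x):
    # Count index pairs (i, j), i < j, with a[i] + a[j] > x.
    a = sorted(a)                    # O(n log n); the walk below is O(n)
    lo, hi = 0, len(a) - 1
    count = 0
    while lo < hi:
        if a[lo] + a[hi] > x:
            # Every element from a[lo] up to a[hi-1] pairs with a[hi] to exceed x.
            count += hi - lo
            hi -= 1
        else:
            lo += 1
    return count

print(count_pairs_greater([2, 8, 8, 9, 10, 20, 21], 18))   # prints 12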

A way to effectively remove outliers from a big array in MATLAB

So in my software that I am developing, at some point, I have a big array of around 250 elements. I am taking the average of those elements to obtain one mean value. The problem is I have outliers in this big array at the beginning and at the end. So for instance the array could be:
A = [150 200 250 300 1100 1106 1130 1132 1120 1125 1122 1121 1115 2100 2500 2400 2300]
So in this case I would like to remove 150 200 250 300 2100 2500 2400 2300 from the array...
I know I could set those indexes to zero, but I need a way for the software to remove those outliers automatically, no matter how many there are at the start or at the end.
Can anyone suggest a robust way of removing those outliers?
You can do something like:
A(A>(mean(A)-std(A)) & A<(mean(A)+std(A)))
> ans = 1100 1106 1130 1132 1120 1125 1122 1121 1115
Normally a robust estimator works better with outliers (https://en.wikipedia.org/wiki/Robust_statistics). The estimated mean and std will change a lot if the outliers are very large. I prefer to use the median and the median absolute deviation (https://en.wikipedia.org/wiki/Median_absolute_deviation).
med = median(A)
mad = median(abs(med-A))
out = (A < med - 3*mad) | (A > med + 3*mad)
A(out) = []
It also depends a lot on what your data represents and how the distribution looks (hist(A)). For example, if your data is skewed towards large values you could remove the values above the 0.95 quantile, or something similar. Sometimes a transformation that makes the distribution resemble a normal distribution works better; for example, if the distribution is skewed to the right, use a log transform.
I use a reference approach in this case: pick, say, 15 elements from the middle of the array, calculate their average/median, and then compare the rest against it using the std or diff(A(end-1:end)). In any case, try using the median instead of the mean.
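A rough sketch of that reference idea, in Python/NumPy rather than MATLAB (the 15-element window and the 3x threshold are arbitrary choices for illustration, and the median absolute deviation replaces the std, as suggested above):

import numpy as np

def remove_edge_outliers(a, ref_size=15, k=3.0):
    # Take a reference window from the middle of the array, then keep only
    # values close to that window's median (closeness measured in MADs).
    a = np.asarray(a, dtype=float)
    mid = len(a) // 2
    ref = a[max(0, mid - ref_size // 2): mid + ref_size // 2 + 1]
    med = np.median(ref)
    mad = np.median(np.abs(ref - med)) or 1.0   # guard against a zero MAD
    return a[np.abs(a - med) <= k * mad]

A = [150, 200, 250, 300, 1100, 1106, 1130, 1132, 1120, 1125,
     1122, 1121, 1115, 2100, 2500, 2400, 2300]
print(remove_edge_outliers(A))   # only the ~1100-1132 values survive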

How would I organise a clustered set of 2D coordinates into groups of close together sets?

I have a large amount of 2D sets of coordinates on a 6000x6000 plane (2116 sets), available here: http://pastebin.com/kiMQi7yu (the context isn't really important so I just pasted the raw data).
I need to write an algorithm to group together coordinates that are close to each other by some threshold. The coordinates in my list are already in groups on that plane, but the order is very scattered.
Despite this task being rather brain-melting to me at first, I didn't admit defeat instantly; this is what I tried:
First sort the list by the Y value, then sort it by the X value. Run through the list checking the distance between the current set and the previous. If they are close enough (100 units) then add them to the same group.
This method didn't really work out (as I expected). There are still objects that are pretty close that are in different groups, because I'm only comparing the next set in the list and the list is sorted by the X position.
I'm out of ideas! The language I'm using is C but I suppose that's not really relevant since all I need is an idea for how the algorithm should work. Thanks!
Though I haven't looked at the data set, it seems that you already know how many groups there are. Have you considered using k-means? http://en.m.wikipedia.org/wiki/K-means_clustering
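If the number of groups really is known (or can be estimated), a k-means sketch in Python might look like this; scikit-learn, the file name and the value of k are assumptions you would have to supply:

import numpy as np
from sklearn.cluster import KMeans

points = np.loadtxt("points.txt")    # hypothetical file with one "x y" pair per line
k = 20                               # number of groups: must be known or estimated
labels = KMeans(n_clusters=k, n_init=10).fit_predict(points)
# labels[i] is the group index of points[i]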
I'm just thinking this through as I write.
Tile the "arena" with squares that have the diameter of your distance (200) as their diagonal.
If there are any points within a square (x,y), they are tentatively part of Cluster(x,y).
Within each square (x,y), there are (up to) 4 areas where the circles of Cluster(x-1,y), Cluster(x+1,y), Cluster(x, y-1) and Cluster(x,y+1) overlap "into" the square; of these consider only those Clusters that are tentatively non-empty.
If all points of Cluster(x,y) are in the (up to 4) overlapping segments of non-empty neighbouring clusters: reallocate these points to the pertaining Cluster and remove Cluster(x,y) from the set of non-empty Clusters.
Added later: Regarding 3., the set of points to be investigated for one neighbour can be coarsely but quickly (!) determined by looking at the rectangle enclosing the segment. [End of addition]
This is just an idea - I can't claim that I've ever done anything remotely like this.
A simple, often used method for spatially grouping points, is to calculate the distance between each unique pair of points. If the distance does not exceed some predefined limit, then the points belong to the same group.
One way to think about this algorithm, is to consider each point as a limit-diameter ball (made of soft foam, so that balls can intersect each other). All balls that are in contact belong to the same group.
In practice, you calculate the squared distance, (x2 - x1)^2 + (y2 - y1)^2, to avoid the relatively slow square root operation. (Just remember to square the limit, too.)
To track which group each point belongs to, a disjoint-set data structure is used.
If you have many points (a few thousand is not many), you can use partitioning or other methods to limit the number of pairs to consider. Partitioning is probably the most used, as it is very simple to implement: just divide the space into squares of limit size, and then you only need to consider points within each square, and between points in neighboring squares.
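A compact sketch of that approach in Python (grid partitioning plus a small disjoint-set; the point format and the limit are assumptions):

from collections import defaultdict

def cluster(points, limit):
    # Group points whose pairwise distance is <= limit, using a grid of
    # limit-sized cells so only neighbouring cells need to be compared.
    parent = list(range(len(points)))

    def find(i):                          # disjoint-set find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    grid = defaultdict(list)              # bucket points into limit x limit cells
    for idx, (x, y) in enumerate(points):
        grid[(int(x // limit), int(y // limit))].append(idx)

    limit2 = limit * limit                # compare squared distances, no sqrt
    for (cx, cy), cell in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for i in cell:
                    for j in grid.get((cx + dx, cy + dy), []):
                        if i < j:
                            x1, y1 = points[i]
                            x2, y2 = points[j]
                            if (x2 - x1) ** 2 + (y2 - y1) ** 2 <= limit2:
                                union(i, j)

    groups = defaultdict(list)            # collect point indices by their set root
    for idx in range(len(points)):
        groups[find(idx)].append(idx)
    return list(groups.values())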
I wrote a small awk script to find the groups (no partitioning, about 84 lines of awk code; it also numbers the groups consecutively from 1 onwards, and outputs each input point, its group number, and the number of points in its group). Here are the results, summarized:
Limit Singles Pairs Triplets Clusters (of four or more points)
1.0 1313 290 29 24
2.0 1062 234 50 52
3.0 904 179 53 75
4.0 767 174 55 81
5.0 638 173 52 84
10.0 272 99 41 99
20.0 66 20 8 68
50.0 21 11 3 39
100.0 13 6 2 29
200.0 6 5 0 23
300.0 3 1 0 20
400.0 1 0 0 18
500.0 0 0 0 15
where Limit is the maximum distance at which the points are considered to belong to the same group.
If the data set is very detailed, you can have intertwined but separate groups. You can easily have a separate group in the hole of a donut-shaped group (or hollow ball in 3D). This is important to remember, so you don't make wrong assumptions on how the groups are separated.
Questions?
You can use a space-filling curve, i.e. a Z-curve, a.k.a. Morton curve. Basically you translate the x and y values to binary and then interleave the bits of the two coordinates. The spatial index puts close coordinates together. You can verify it with the upper bounds and the most significant bits.
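A small sketch of that bit interleaving in Python (13 bits per coordinate covers 0..8191, so it fits the 6000x6000 plane):

def morton_key(x, y, bits=13):
    # Interleave the bits of x and y to get a Z-order / Morton index.
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)        # x bits go to even positions
        key |= ((y >> b) & 1) << (2 * b + 1)    # y bits go to odd positions
    return key

# Sorting by the Morton key places spatially close points near each other
# in the list (with occasional exceptions across cell boundaries).
points = [(120, 4500), (125, 4503), (5900, 30)]
points.sort(key=lambda p: morton_key(*p))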

Is the Leptonica implementation of 'Modified Median Cut' not using the median at all?

I'm playing around a bit with image processing and decided to read up on how color quantization worked and after a bit of reading I found the Modified Median Cut Quantization algorithm.
I've been reading the code of the C implementation in the Leptonica library and came across something I thought was a bit odd.
Now I want to stress that I am far from an expert in this area, nor am I a math-head, so I am predicting that this all comes down to me not understanding all of it, and not that the implementation of the algorithm is wrong.
The algorithm states that the vbox should be split along the largest axis, and that it should be split using the following logic:
The largest axis is divided by locating the bin with the median pixel
(by population), selecting the longer side, and dividing in the center
of that side. We could have simply put the bin with the median pixel
in the shorter side, but in the early stages of subdivision, this
tends to put low density clusters (that are not considered in the
subdivision) in the same vbox as part of a high density cluster that
will outvote it in median vbox color, even with future median-based
subdivisions. The algorithm used here is particularly important in
early subdivisions, and 3is useful for giving visible but low
population color clusters their own vbox. This has little effect on
the subdivision of high density clusters, which ultimately will have
roughly equal population in their vboxes.
For the sake of the argument, let's assume that we have a vbox that we are in the process of splitting and that the red axis is the largest. In the Leptonica algorithm, on line 01297, the code appears to do the following
Iterate over all the possible green and blue variations of the red color
For each iteration it adds to the total number of pixels (population) it has found along the red axis
For each red color it sums up the population of the current red and the previous ones, thus storing an accumulated value for each red
note: when I say 'red' I mean each point along the axis that is covered by the iteration, the actual color may not be red but contains a certain amount of red
So for the sake of illustration, assume we have 9 "bins" along the red axis and that they have the following populations
4 8 20 16 1 9 12 8 8
After the iteration of all red bins, the partialsum array will contain the following count for the bins mentioned above
4 12 32 48 49 58 70 78 86
And total would have a value of 86
Once that's done, it's time to perform the actual median cut, and for the red axis this is performed on line 01346.
It iterates over the bins and checks their accumulated sum. And here's the part that throws me off from the description of the algorithm: it looks for the first bin that has a value greater than total/2.
Wouldn't total/2 mean that it is looking for a bin that has a value greater than the average value and not the median? The median of the above bins would be 49.
The use of 43 or 49 could potentially have a huge impact on how the boxes are split, even though the algorithm then proceeds by moving to the center of the larger side of where the matched value was.
Another thing that puzzles me a bit is that the paper specifies that the bin with the median value should be located, but does not mention how to proceed if there is an even number of bins; the median would be the result of (a+b)/2, and it's not guaranteed that any of the bins contains that population count. So this is what makes me think that there are some approximations going on that are negligible because of how the split actually takes place at the center of the larger side of the selected bin.
Sorry if it got a bit long-winded, but I wanted to be as thorough as I could, because it's been driving me nuts for a couple of days now ;)
In the 9-bin example, 49 is the number of pixels in the first 5 bins. 49 is the median number in the set of 9 partial sums, but we want the median pixel in the set of 86 pixels, which is 43 (or 44), and it resides in the 4th bin.
Inspection of the modified median cut algorithm in colorquant2.c of leptonica shows that the actual cut location for the 3d box does not necessarily occur adjacent to the bin containing the median pixel. The reasons for this are explained in the function medianCutApply(). This is one of the "modifications" to Paul Heckbert's original method. The other significant modification is to make the decision of which 3d box to cut next based on a combination of both population and the product (population * volume), thus permitting splitting of large but sparsely populated regions of color space.
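A tiny illustration of that distinction, using the 9-bin example from the question (plain Python, not the Leptonica code):

populations = [4, 8, 20, 16, 1, 9, 12, 8, 8]

partial = []                      # running (partial) sums: 4 12 32 48 49 58 70 78 86
total = 0
for p in populations:
    total += p
    partial.append(total)

# The median pixel is pixel number total // 2 = 43 of 86; the first bin whose
# partial sum exceeds total / 2 is the bin that contains it.
median_bin = next(i for i, s in enumerate(partial) if s > total / 2)
print(median_bin)                 # 3, i.e. the 4th bin (partial sum 48 > 43)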
I do not know the algo, but I would assume your array contains the population of each red; let's explain this with an example:
Assume you have four gradations of red: A,B,C and D
And you have the following sequence of red values:
AABDCADBBBAAA
To find the median, you would have to sort them according to red value and take the middle:
      median
      v
AAAAAABBBBCDD
Now let's use their approach:
A:6 => 6
B:4 => 10
C:1 => 11
D:2 => 13
13/2 = 6.5 => B
I think the mismatch happened because you are counting the population; the average color would be:
(6*A+4*B+1*C+2*D)/13

Simplified resample algorithm in MATLAB

I am generating variable-size rows of samples from a DSP algorithm.
I mean each row contains a random number of elements (well, depending on the input).
I would like to resize each row to a specific number of samples.
Ex: column count in each row: 15 24 41 09 27
Say I would like to make it 30 elements per row.
Each row is a set of digitized curve samples.
I'm interested in making every row contain the same number of samples.
I think you need to resample your row values; the idea is roughly like this:
interpolate each row to a continuous curve
quantize each curve to a fixed number of values (30)
Obviously, for rows with more than 30 values, you will lose some information.
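A rough sketch of that idea in Python/NumPy (MATLAB's interp1 would do the same job; linear interpolation and the sample rows are assumptions):

import numpy as np

def resample_row(row, n_out=30):
    # Linearly interpolate a row of samples onto n_out evenly spaced points.
    row = np.asarray(row, dtype=float)
    old_x = np.linspace(0.0, 1.0, len(row))   # original sample positions
    new_x = np.linspace(0.0, 1.0, n_out)      # target positions
    return np.interp(new_x, old_x, row)

rows = [[0, 1, 4, 9, 16], [0, 2, 4, 6, 8, 10, 12]]    # made-up, ragged rows
fixed = [resample_row(r) for r in rows]               # every row now has 30 samples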
