I am generating variable-size rows of samples from a DSP algorithm.
I mean each row contains a different number of elements (depending on the input).
I would like to resize each row to a specific number of samples.
Ex: column count in each row: 15 24 41 09 27
Say I would like to make it 30 elements per row.
Each row is a set of digitized curve samples.
I'm interested in making every row contain the same number of equally spaced samples.
I think you need to resample your row values; the idea is roughly this:
interpolate each row to a continuous curve
sample each curve at a fixed number of points (30)
Obviously, for rows with more than 30 values, you will lose some information.
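For example, a minimal linear-interpolation resampler in C could look like the sketch below (the function name and types are just illustrative; it assumes n_in >= 1 and n_out >= 2):

/* A minimal linear-interpolation resampler: maps n_in input samples
   onto n_out equally spaced output samples. */
void resample_row(const double *in, unsigned n_in, double *out, unsigned n_out)
{
    for (unsigned i = 0; i < n_out; ++i) {
        /* Position of output sample i on the input index axis [0, n_in - 1]. */
        double pos  = (double)i * (double)(n_in - 1) / (double)(n_out - 1);
        unsigned k  = (unsigned)pos;          /* left neighbour index */
        double frac = pos - (double)k;        /* fractional part      */
        if (k + 1 < n_in)
            out[i] = in[k] * (1.0 - frac) + in[k + 1] * frac;
        else
            out[i] = in[n_in - 1];            /* clamp at the last sample */
    }
}

Calling it once per row with n_out = 30 stretches shorter rows and decimates longer ones, with the information loss noted above.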
If 15 is the lowest number and 200 the highest number, what formula do I need to use to define the pattern that fills in the 28 values needed to complete the series between them?
I would like to learn how to create this pattern. I tried it through percentages but was not successful, precisely because there is both a minimum and a maximum; if there were only a minimum or only a maximum, I could just multiply by the percentage.
If you want to paste those values in the same column, then the formula will be:
=ARRAYFORMULA(INDIRECT("A"&
MAX(IF(A3:A<>"", ROW(A3:A), )))+SORT(ROW(INDIRECT("A1:A"&
MAX(IF(A3:A<>"", ROW(A3:A), ))-2)), 1, 0)*(A1-INDIRECT("A"&
MAX(IF(A3:A<>"", ROW(A3:A), ))))/(
MAX(IF(A3:A<>"", ROW(A3:A), ))-1))
Try this formula in A2 =$A1+($A$30-$A$1)/29 and then drag down to A29
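For reference, the underlying step is plain linear interpolation: with minimum m, maximum M and N evenly spaced values, the k-th value (counting from 0) is m + k*(M - m)/(N - 1). Here, assuming A1 holds 15 and A30 holds 200, that gives a step of (200 - 15)/29 = 185/29 ≈ 6.379 per row, which is exactly what the formula above adds each time you drag it down.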
Referring to the snippet of a pivot table below in the image, there are 6,000 J####### models (e.g. J2253993, J2254008, J2254014, etc.).
How can the difference between the last Odometer reading and the first Odometer reading for each model be calculated? There is no consistency in the number of recorded months for each model and there is no consistency between the first and last timestamps for each model.
i.e.
For model J2253993:
Desired answer is: 378
Because 2501 minus 2123
For model J2254008:
Desired answer is: 178
Because 1231 minus 1053
... And so on for the remaining 6,000 models
Would a dynamic array be needed?
Messy SUM/INDIRECT Solution
EDIT: A similar formula for Max-Min in column B (my first idea):
=INDEX(INDIRECT("B"&MATCH(E4,A$1:A$50000,0)+1&":B50000"),MATCH("",INDIRECT("B"&MATCH(E4,A$1:A$50000,0)+1&":B50000"),0)-1)-INDEX(B$1:B$50000,MATCH(E4,A$1:A$50000,0)+1)
I abandoned it because the image wasn't showing any empty cells.
EDIT-END
The formula calculates the column C sums. A drawback is that you have to insert ="" in all the empty cells of column C, unless you know a way to make the MATCH function return an empty cell. In column E write the IDs starting from the 4th row, and in F4 write the formula:
=SUM(INDIRECT("C"&MATCH(E4,A$1:A$50000,0)+2&":C"&MATCH("",INDIRECT("C"&MATCH(E4,A$1:A$50000,0)+2&":C44"),0)-1+MATCH(E4,A$1:A$50000,0)+2))
Copy/Paste down.
If I am understanding you correctly, it looks like you just need to add a sum of the "Odometer Reading Change" column in your pivot table. When I sum them for J2253993 I get 378 like you say.
The pivot table will total all of the rows by model, based on the way you have built it, no matter how many rows there are.
I have a virtual array of GB size which is m by n and in which higher values are to the right and towards the top. By virtual I mean that values are returned by another program for given coordinates, but the generating functions on a given run are not known to the programmer. It is guaranteed that the given number is in the array.
{It now turns out that the number is the product of two primes, so the problem amounts to factoring}
I looked at Efficient search of sorted numerical values,
but it doesn't have the multiple-row structure I need to reflect. I tried a "spiral" approach but it sometimes takes a long time to traverse (looking at more than half of the possible slots). Typically rows have regular gaps, but these will be different for each row. Columns tend to follow (different) arithmetic progressions.
The rows are sorted. The left-most value in a row is less than the left-most value in the next higher row, and the right-most value in a row is less than the right-most value in the next higher row. See the example data below.
What I have tried is to first eliminate rows which cannot hold the target value and then pick the "middle" row of those remaining. Do a binary search on that row, then go up or down according to whether the next row is likely (a guess) to have more values in range or not. The target value is likely to be randomly placed within the possible slots available.
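A rough C sketch of that approach (it skips the up/down guessing heuristic and simply binary-searches every candidate row; the lookup callback and all names are hypothetical, standing in for the external program):

#include <stddef.h>

/* Hypothetical callback provided by the external program: returns the
   value stored at (row, col) of the virtual m-by-n array. */
typedef long long (*cell_fn)(size_t row, size_t col);

/* Binary search one sorted row; returns the column of `target`, or -1. */
static long search_row(cell_fn cell, size_t row, size_t n, long long target)
{
    size_t lo = 0, hi = n;                /* half-open interval [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        long long v = cell(row, mid);
        if (v == target) return (long)mid;
        if (v < target)  lo = mid + 1;
        else             hi = mid;
    }
    return -1;
}

/* Scan only the rows whose [first, last] range can contain the target,
   binary-searching each candidate row. Returns 1 and fills *row/*col
   on success, 0 otherwise. */
int find_in_virtual_array(cell_fn cell, size_t m, size_t n,
                          long long target, size_t *row, size_t *col)
{
    for (size_t r = 0; r < m; ++r) {
        if (cell(r, 0) <= target && target <= cell(r, n - 1)) {
            long c = search_row(cell, r, n, target);
            if (c >= 0) { *row = r; *col = (size_t)c; return 1; }
        }
    }
    return 0;
}

The range test costs two lookups per row, and each candidate row then costs O(log n) further lookups.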
Here is some sample data
1008 1064 1120 1176 1232
999 1053 1107 1161 1215
988 1040 1092 1144 1196
975 1025 1075 1125 1175
960 1008 1056 1104 1152
Any ideas please?
This is equivalent to factorization when the target number is a product of only two primes, which turns out to be the case {that wasn't clear at the time of posting}.
Factorization is widely believed to be computationally hard; no polynomial-time algorithm is known.
An interesting sidelight on factorization and decision theory is here
https://cstheory.stackexchange.com/questions/25466/factoring-as-a-decision-problem
and here
http://rjlipton.wordpress.com/2011/01/23/is-factoring-really-in-bqp-really/
I have a large amount of 2D sets of coordinates on a 6000x6000 plane (2116 sets), available here: http://pastebin.com/kiMQi7yu (the context isn't really important so I just pasted the raw data).
I need to write an algorithm to group together coordinates that are close to each other by some threshold. The coordinates in my list are already in groups on that plane, but the order is very scattered.
Despite this task being rather brain-melting to me at first, I didn't admit defeat instantly; this is what I tried:
First sort the list by the Y value, then sort it by the X value. Run through the list checking the distance between the current set and the previous. If they are close enough (100 units) then add them to the same group.
This method didn't really work out (as I expected). There are still objects that are pretty close that are in different groups, because I'm only comparing the next set in the list and the list is sorted by the X position.
I'm out of ideas! The language I'm using is C but I suppose that's not really relevant since all I need is an idea for how the algorithm should work. Thanks!
Though I haven't looked at the data set, it seems that you already know how many groups there are. Have you considered using k-means? http://en.m.wikipedia.org/wiki/K-means_clustering
I'm just thinking this through as I write.
1. Tile the "arena" with squares that have the diameter of your distance (200) as their diagonal.
2. If there are any points within a square (x,y), they are tentatively part of Cluster(x,y).
3. Within each square (x,y), there are (up to) 4 areas where the circles of Cluster(x-1,y), Cluster(x+1,y), Cluster(x,y-1) and Cluster(x,y+1) overlap "into" the square; of these, consider only those Clusters that are tentatively non-empty.
4. If all points of Cluster(x,y) are in the (up to 4) overlapping segments of non-empty neighbouring clusters: reallocate these points to the pertaining Cluster and remove Cluster(x,y) from the set of non-empty Clusters.
Added later: Regarding 3., the set of points to be investigated for one neighbour can be coarsely but quickly (!) determined by looking at the rectangle enclosing the segment. [End of addition]
This is just an idea - I can't claim that I've ever done anything remotely like this.
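For what it's worth, a tiny sketch in C of just the tiling/assignment part (steps 1 and 2); the names are made up and the clustering distance is taken as the diameter, as described above:

#include <math.h>

/* Side of a square whose diagonal equals the clustering distance. */
static double cell_side(double limit_diameter)
{
    return limit_diameter / sqrt(2.0);
}

/* Tentatively assign a point to the square (cell) that contains it. */
static void cell_of(double x, double y, double side, int *cx, int *cy)
{
    *cx = (int)floor(x / side);
    *cy = (int)floor(y / side);
}

Steps 3 and 4, reallocating the points that fall into the overlap regions of neighbouring squares, are the fiddly part and are not sketched here.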
A simple, often used method for spatially grouping points, is to calculate the distance between each unique pair of points. If the distance does not exceed some predefined limit, then the points belong to the same group.
One way to think about this algorithm, is to consider each point as a limit-diameter ball (made of soft foam, so that balls can intersect each other). All balls that are in contact belong to the same group.
In practice, you calculate the squared distance, (x2 - x1)² + (y2 - y1)², to avoid the relatively slow square root operation. (Just remember to square the limit, too.)
To track which group each point belongs to, a disjoint-set data structure is used.
If you have many points (a few thousand is not many), you can use partitioning or other methods to limit the number of pairs to consider. Partitioning is probably the most used, as it is very simple to implement: just divide the space into squares of limit size, and then you only need to consider points within each square, and between points in neighboring squares.
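For illustration, a minimal C sketch of the pairwise version with a disjoint-set structure (no partitioning; the array bound and names are made up):

#define MAX_POINTS 4096

static int parent[MAX_POINTS];

/* Disjoint-set "find" with path halving. */
static int find(int i)
{
    while (parent[i] != i) {
        parent[i] = parent[parent[i]];
        i = parent[i];
    }
    return i;
}

static void unite(int a, int b)
{
    a = find(a);
    b = find(b);
    if (a != b) parent[b] = a;
}

/* Group points whose distance is <= limit. After the call,
   find(i) == find(j) exactly when points i and j are in the same group. */
void group_points(const double *x, const double *y, int n, double limit)
{
    double limit2 = limit * limit;       /* compare squared distances */
    for (int i = 0; i < n; ++i)
        parent[i] = i;
    for (int i = 0; i < n; ++i) {
        for (int j = i + 1; j < n; ++j) {
            double dx = x[i] - x[j], dy = y[i] - y[j];
            if (dx * dx + dy * dy <= limit2)
                unite(i, j);
        }
    }
}

Partitioning would replace the inner j loop with a scan over the point's own grid cell and its eight neighbours.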
I wrote a small awk script to find the groups (no partitioning, about 84 lines of awk code; it also numbers the groups consecutively from 1 onwards and outputs each input point, its group number, and the number of points in each group). Here are the results, summarized:
Limit Singles Pairs Triplets Clusters (of four or more points)
1.0 1313 290 29 24
2.0 1062 234 50 52
3.0 904 179 53 75
4.0 767 174 55 81
5.0 638 173 52 84
10.0 272 99 41 99
20.0 66 20 8 68
50.0 21 11 3 39
100.0 13 6 2 29
200.0 6 5 0 23
300.0 3 1 0 20
400.0 1 0 0 18
500.0 0 0 0 15
where Limit is the maximum distance at which the points are considered to belong to the same group.
If the data set is very detailed, you can have intertwined but separate groups. You can easily have a separate group in the hole of a donut-shaped group (or hollow ball in 3D). This is important to remember, so you don't make wrong assumptions on how the groups are separated.
Questions?
You can use a space-filling curve, i.e. a Z-order curve, a.k.a. Morton curve. Basically you translate the x and y values to binary and then interleave the bits of the two coordinates. The resulting spatial index puts close coordinates together. You can verify it with the upper bounds and the most significant bits.
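For illustration, the bit-interleaving step for 16-bit coordinates (enough for the 6000x6000 plane here) could look like this in C; the function names are made up:

#include <stdint.h>

/* Spread the 16 bits of v so that they occupy the even bit positions. */
static uint32_t spread_bits(uint32_t v)
{
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

/* 2D Morton (Z-order) code: interleave the bits of x and y. */
uint32_t morton2d(uint16_t x, uint16_t y)
{
    return spread_bits(x) | (spread_bits(y) << 1);
}

Sorting the points by morton2d(x, y) places spatially nearby points close together in the sorted order, which can then be scanned for coarse grouping.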
I'm playing around a bit with image processing and decided to read up on how color quantization worked and after a bit of reading I found the Modified Median Cut Quantization algorithm.
I've been reading the code of the C implementation in Leptonica library and came across something I thought was a bit odd.
Now I want to stress that I am far from an expert in this area, nor am I a math-head, so I am predicting that this all comes down to me not understanding all of it, and not that the implementation of the algorithm is wrong at all.
The algorithm states that the vbox should be split along the largest axis and that it should be split using the following logic:
The largest axis is divided by locating the bin with the median pixel
(by population), selecting the longer side, and dividing in the center
of that side. We could have simply put the bin with the median pixel
in the shorter side, but in the early stages of subdivision, this
tends to put low density clusters (that are not considered in the
subdivision) in the same vbox as part of a high density cluster that
will outvote it in median vbox color, even with future median-based
subdivisions. The algorithm used here is particularly important in
early subdivisions, and is useful for giving visible but low
population color clusters their own vbox. This has little effect on
the subdivision of high density clusters, which ultimately will have
roughly equal population in their vboxes.
For the sake of the argument, let's assume that we have a vbox that we are in the process of splitting and that the red axis is the largest. In the Leptonica algorithm, on line 01297, the code appears to do the following
Iterate over all the possible green and blue variations of the red color
For each iteration, add the number of pixels (population) found to a running total along the red axis
For each red color, sum up the population of the current red and the previous ones, thus storing an accumulated value for each red
note: when I say 'red' I mean each point along the axis that is covered by the iteration; the actual color may not be red but contains a certain amount of red
So for the sake of illustration, assume we have 9 "bins" along the red axis and that they have the following populations
4 8 20 16 1 9 12 8 8
After the iteration of all red bins, the partialsum array will contain the following count for the bins mentioned above
4 12 32 48 49 58 70 78 86
And total would have a value of 86
Once that's done it's time to perform the actual median cut and for the red axis this is performed on line 01346
It iterates over the bins and checks their accumulated sums. And here's the part that throws me off from the description of the algorithm: it looks for the first bin that has a value greater than total/2.
Wouldn't total/2 mean that it is looking for a bin whose value is greater than the average value and not the median? The median for the above bins would be 49.
The use of 43 or 49 could potentially have a huge impact on how the boxes are split, even though the algorithm then proceeds by moving to the center of the larger side of where the matched value was.
Another thing that puzzles me a bit is that the paper specifies that the bin with the median value should be located, but does not mention how to proceed if there is an even number of bins. The median would be the result of (a+b)/2 and it's not guaranteed that any of the bins contains that population count. So this is what makes me think that there are some approximations going on that are negligible because of how the split actually takes place at the center of the larger side of the selected bin.
Sorry if it got a bit long-winded, but I wanted to be as thorough as I could, because it's been driving me nuts for a couple of days now ;)
In the 9-bin example, 49 is the number of pixels in the first 5 bins. 49 is the median number in the set of 9 partial sums, but we want the median pixel in the set of 86 pixels, which is 43 (or 44), and it resides in the 4th bin.
Inspection of the modified median cut algorithm in colorquant2.c of leptonica shows that the actual cut location for the 3d box does not necessarily occur adjacent to the bin containing the median pixel. The reasons for this are explained in the function medianCutApply(). This is one of the "modifications" to Paul Heckbert's original method. The other significant modification is to make the decision of which 3d box to cut next based on a combination of both population and the product (population * volume), thus permitting splitting of large but sparsely populated regions of color space.
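To make the total/2 step concrete, here is a small standalone C sketch (not the actual Leptonica code) using the 9-bin example above; the median pixel, number total/2 = 43, falls in the first bin whose running total exceeds 43, i.e. the 4th bin:

#include <stdio.h>

int main(void)
{
    int bins[] = {4, 8, 20, 16, 1, 9, 12, 8, 8};
    int nbins = sizeof bins / sizeof bins[0];
    int partial[9], total = 0;

    /* Accumulate the running (partial) sums, as described in the question. */
    for (int i = 0; i < nbins; ++i) {
        total += bins[i];
        partial[i] = total;
    }

    /* The median pixel lies in the first bin whose partial sum
       exceeds total/2 (bins numbered from 1 in the output). */
    for (int i = 0; i < nbins; ++i) {
        if (partial[i] > total / 2) {
            printf("median pixel falls in bin %d (partial sum %d)\n",
                   i + 1, partial[i]);
            break;
        }
    }
    return 0;
}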
I do not know the algo, but I would assume your array contains the population of each red; let's explain this with an example:
Assume you have four gradations of red: A,B,C and D
And you have the following sequence of red values:
AABDCADBBBAAA
To find the median, you would have to sort them according to red value and take the middle:
      median
      v
AAAAAABBBBCDD
Now let's use their approach:
A:6 => 6
B:4 => 10
C:1 => 11
D:2 => 13
13/2 = 6.5 => B
I think the mismatch happened because you are counting the population; the average color would be:
(6*A+4*B+1*C+2*D)/13