I have the following structure:
T = struct('Time',{20, 40, 50, 80, 120, 150, 190, 210, 250, 260, 270, 320, 350, 380, 385, 390, 395},...
'Trial',{'correct','incorrect','incorrect','correct','correct','correct','incorrect','incorrect','correct','correct','correct','incorrect','incorrect','correct','correct','incorrect','incorrect'});
I would like to perform the following two tasks:
I want to get the probability of an 'incorrect' trial in each 100 ms time window (interval).
For example, in the first time window, the first 100 ms, there are 4 trials and 2 of them are 'incorrect', so it would be 2/4 = 0.5.
I want to plot a bar graph of the probabilities for each 100 ms time window. The x-axis would be time, each bar's width would be 100 ms, and its height would be the probability for that specific window.
I really appreciate any help.
This goes against my policy of not answering questions that show no effort by the poser, but this seems like an interesting question, so I'll make an exception.
First, split up the Time and Trial fields so that they're in separate arrays. For the Trial field, I'm going to convert the strings into numeric labels 1 and 2 to denote correct and incorrect for ease of implementation:
time = [T.Time].';
trial = {T.Trial}.';
[~,~,trial_ID] = unique(trial);
Next, what you can do is take each entry in the time array and divide by 100 while taking the floor. Entries that produce the same value belong to the same 100 ms window. Note that we also need to add 1 for the next step... you'll see why:
groups = floor(time/100) + 1;
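For this example's data, that works out to the following window indices (one per trial):

groups.'   % ->  1 1 1 1 2 2 2 3 3 3 3 4 4 4 4 4 4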
Now, here's probably one of the most beautiful functions you can ever use in MATLAB: accumarray. accumarray groups portions of an array based on an ID and applies a function to all of the values in each group. In our case, we want to group the correct/incorrect labels based on the groups array, then determine the fraction of values that are incorrect per group.
Specifically, for each group of values specified in groups, we look at the numeric labels and determine how many were incorrect by summing how many are equal to 2 in each group, then dividing by the size of that group. The groups need to start at index 1, which is why we had to add 1 to groups; without it, the first group would start at 0, and MATLAB starts indexing at 1, hence the offset:
per = accumarray(groups, trial_ID, [], @(x) sum(x == 2) / numel(x));
per contains the fraction that were incorrect per group, and we get:
>> per
per =
0.5000
0.3333
0.2500
0.6667
Very nice! Doing a quick hand calculation will demonstrate that you get the correct results.
Now the last part is to plot the probabilities on a bar graph. That's very simple:
bar(100*(1:numel(per)), per);
xlabel('Time (ms)');
ylabel('Probability');
I create a vector that starts from 100 and goes up in multiples of 100, up to as many groups as we have; each bar is thus placed at the end time of its window. In our case we have 4 groups, as the times go up to 395 ms.
As such, we get the desired bar graph.
In a data frame that has 2 columns, name and pvalue, I need to write a loop to get 20 samples (the samples are gene set names, which are sometimes quite long) from different ranges of p-values, including:
Less than or equal to 0.001
Between 0.001 and 0.01
Between 0.01 and 0.05
Between 0.05 and 0.10
Between 0.10 and 0.20
Between 0.20 and 0.50
Larger than 0.50
and then for each sampling range, I want to find these 20 samples' names in another dataset and draw a histogram for each sample on one sheet. Finally, I need to arrange the histograms of these 20 names in 4 rows and 5 columns. I would like to write a loop to do this in a smart way, as I need to repeat this process several times; I am new to R programming, I am not familiar with writing loops, and what I want to do is a little bit complicated for me. I appreciate any help. Thank you!
I think I have to start with getting 20 samples.
MAIN <- sample(DATA$name[DATA$pvalue < 0.001], 20, replace = FALSE)
It gives me the names of 20 samples.
Now I want to find each name in a new dataset. The new dataset is like the previous one, including name and pvalue, but each name is repeated about 100 times. I want to draw a histogram for each name, so in total I would like to have 20 histograms on one sheet. I don't have any idea for this part; the rough shape I'm imagining is something like the sketch below.
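A minimal sketch of what I mean (assuming the second dataset is called DATA2; that name is just a placeholder):

# DATA2 is assumed to have the same name/pvalue columns as DATA,
# with each name repeated ~100 times.
par(mfrow = c(4, 5))   # 4 rows x 5 columns of plots on one sheet
for (nm in MAIN) {
  hist(DATA2$pvalue[DATA2$name == nm], main = nm, xlab = "p-value")
}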
I'm not trying to find the average of an array; I'm trying to create an array that will roughly average to a desired number.
My use case is that I have 2 stepper motors that each need to perform a smooth movement over roughly the same amount of time. Steppers move in discrete steps with an integer millisecond delay between these steps. I need to be able to control the speed of the "faster" motor (i.e. the one that needs to take more total steps with a smaller, constant delay between steps), and the speed of the "slower" motor should adjust as needed.
Consider the case where Motor A needs to take 100 steps and Motor B needs to take 150. The delay between Motor B's steps must be 1ms so the delay between Motor A's steps would then be 1.5ms. This doesn't work since the step delay must be an integer.
To that end, I believe you can solve this problem by generating an array with a length equal to the number of total steps where each element is an integer that, overall, averages to that 1.5ms delay. The example for this case would simply be:
motor_a_step_delays = [1, 2, 1, 2, 1, 2 ... 100 elements total ...]
My issue is that I can't seem to find a good way to create this array. The integer elements should be "close" (for lack of a better word). Something like [51, 1, 1, ... 97 more 1's...] would be correct, but not result in smooth, even movement.
This problem feels like it's been solved, but I don't know how to even start searching for it. This seems like it'd have utility in CNC, robotics, or game design applications.
As usual, the act of typing out my issue made me stop and think about what's actually happening.
Fundamentally, the array can only contain the floor and ceil of the desired average delay. If the desired average were 2.25, the final array would be some combination of 2 and 3, but never 1 or 4. Once I realized that, it became simple: the proportion of ceils to floors is just how far the desired delay sits above its floor. In other words, 2.25 would need an array of 75% 2's and 25% 3's. Easy!
Here is what I ended up with (Elixir):
def generate_step_delays(steps, desired_delay) do
  desired_delay_ceil = ceil(desired_delay)
  desired_delay_floor = floor(desired_delay)

  # The ratio of ceils to floors needed
  ratio = desired_delay - desired_delay_floor

  ceil_list = List.duplicate(desired_delay_ceil, round(steps * ratio))
  floor_list = List.duplicate(desired_delay_floor, steps - length(ceil_list))

  ceil_list
  |> Enum.concat(floor_list)
  |> Enum.shuffle()
end
This implementation randomizes the final array, since that works best for my case. However, it would be simple to swap out Enum.shuffle and distribute the numbers evenly if needed; a sketch of that variant follows.
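For example, an evenly spaced variant could accumulate the fractional part of the delay and emit a ceil whenever the running error crosses a whole step (my own untested sketch, not part of the original solution):

def generate_even_step_delays(steps, desired_delay) do
  floor_delay = floor(desired_delay)
  ratio = desired_delay - floor_delay

  {delays, _err} =
    Enum.map_reduce(1..steps, 0.0, fn _step, err ->
      # Carry the fractional delay forward; emit a ceil once it
      # adds up to a whole step, a floor otherwise.
      err = err + ratio

      if err >= 1.0 do
        {floor_delay + 1, err - 1.0}
      else
        {floor_delay, err}
      end
    end)

  delays
end

For steps = 100 and desired_delay = 1.5 this yields the alternating [1, 2, 1, 2, ...] pattern from the example above.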
I am very far from mathematics, but I would like advice from knowledgeable people.
Imagine a large group of numbers from 1 to 1 exabyte.
Within this group, we have to find a small hidden group of 1000 consecutive numbers (no gaps), starting at, let's say, 1 petabyte.
I understand that the group is large and there is probably no way to find the coordinates of this small group directly.
But..
How should I scan the large group so that I hit at least one number from the small group?
It's clear to me that testing random numbers is the worst way.
That leaves another option.
Take 1 exabyte and divide this number incrementally, testing the resulting set of coordinates each time:
1 exabyte / 3 -> we will have 3 coordinates to test, adding the result to 1 each time.
1 exabyte / 4 -> we will have 4 coordinates to test, adding the result to 1 each time.
....
Is there a better way?
Maybe some pseudocode, or code in C?
P.S. I cannot explain the problem in detail. I have not yet mentioned that I can increase the size of the small group, from 1000 to 1,000,000 (for example), but that is harder for a computer to calculate. With your help, a random solution plus increasing the small group seems to be a good choice now. **Thank you all for your ideas!**
You don't mention whether the numbers are sorted or not, so for either case:
Check every nth position, where n = 1000 or whatever the size of the 'small hidden group' is.
When a member is found, check either side (any amount to the side, in the range [1, n]).
If correct, check the rest of the range until it fails.
If incorrect (at any point), check the other side.
If the other side is incorrect too, move on to the next nth position.
e.g., as a runnable Python sketch:

def find_group(N, n, small_set):
    # Probe every n-th position: a run of n consecutive members
    # cannot fit between two probes without covering one of them.
    for i in range(0, N, n):
        if i in small_set:
            left = test(i, -1, small_set)    # first non-member to the left
            right = test(i, +1, small_set)   # first non-member to the right
            if right - left - 1 == n:        # the run fills (left, right)
                return left + 1              # first member of the hidden group
    return None

def test(position, move, small_set):
    # Walk in one direction until we fall off the group.
    while position in small_set:
        position += move
    return position
I do not claim this is the most efficient, but it is certainly simple, and the most efficient I can think of quickly.
Furthermore, you are incorrect in saying:
...the random number test is the worst way
On average, I believe the least efficient method would be a sequential test: if the first number is not in the group, then it is totally pointless to check the next n-1, since any group overlapping them would also cover position n+1. Roughly speaking, a random probe hits the group with probability n/N and so needs about N/n tests on average, a sequential scan needs about N/2, and probing every n-th position needs at most N/n tests and about N/(2n) on average.
If it is an ordered set and the small group is consecutive, then we can do exactly the same as above, except there is no need to check the whole range: just check i and i + n - 1, where i is the position of the first element of the smaller set in the larger set.
Problem:
The city of Siruseri is impeccably planned. The city is divided into a rectangular array of cells with M rows and N columns. Each cell has a metro station. There is one train running left to right and back along each row, and one running top to bottom and back along each column. Each train starts at some time T and goes back and forth along its route (a row or a column) forever.
Ordinary trains take two units of time to go from one station to the next. There are some fast trains that take only one unit of time to go from one station to the next. Finally, there are some slow trains that take three units of time to go from one station to the next. You may assume that the halting time at any station is negligible.
Here is a description of a metro system with 3 rows and 4 columns:
     S(1) F(2) O(2) F(4)
F(3)   .    .    .    .
S(2)   .    .    .    .
O(2)   .    .    .    .
The label at the beginning of each row/column indicates the type of train (F for fast, O for ordinary, S for slow) and its starting time. Thus, the train that travels along row 1 is a fast train and it starts at time 3. It starts at station (1,1) and moves right, visiting the stations along this row at times 3, 4, 5 and 6 respectively. It then returns, visiting the stations from right to left at times 6, 7, 8 and 9. It again moves right, now visiting the stations at times 9, 10, 11 and 12, and so on. Similarly, the train along column 3 is an ordinary train starting at time 2. So, starting at station (1,3), it visits the three stations on column 3 at times 2, 4 and 6, returns back to the top of the column visiting them at times 6, 8 and 10, and so on.
Given a starting station, the starting time and a destination station, your task is to determine the earliest time at which one can reach the destination using these trains.
For example suppose we start at station (2,3) at time 8 and our aim is to reach the station (1,1). We may take the slow train of the second row at time 8 and reach (2,4) at time 11. It so happens that at time 11, the fast train on column 4 is at (2,4) travelling upwards, so we can take this fast train and reach (1,4) at time 12. Once again we are lucky and at time 12 the fast train on row 1 is at (1,4), so we can take this fast train and reach (1,1) at time 15. An alternative route would be to take the ordinary train on column 3 from (2,3) at time 8 and reach (1,3) at time 10. We then wait there till time 13 and take the fast train on row 1 going left, reaching (1,1) at time 15. You can verify that there is no way of reaching (1,1) earlier than that.
Test Data: You may assume that M, N ≤ 50.
Time Limit: 3 seconds
As the sizes of M and N are very small, we can try to solve it by recursion.
At every station, we take the two trains that can bring us nearer to our destination. For example, if we want to go from (2,3) to (1,1), we take the trains that bring us nearer to (1,1) and get off at the station nearest to our destination, while keeping track of the time taken. If we reach the destination, and the time taken is less than the minimum so far, we update the minimum.
We can determine which station a train is at a particular time using this method:
/* S is the starting time of the train and N is the number of stations it
   visits; T is the time at which we want to find the train's station.
   T is always greater than S. */
T = T - S + 1
Station(T) = T % N;  if T % N == 0, then Station(T) = N
Here is my question:
How do we determine the earliest time when a particular train reaches the station we want in the direction we want?
As my algorithm above uses a greedy strategy, will it give an accurate answer? If not, how should I approach this problem?
P.S : This is not homework, it is an online judge problem.
I believe a greedy solution will fail here, though it is a bit hard to construct a counter-example.
This problem is meant to be solved using Dijkstra's algorithm. The edges are the connections between adjacent stations, and their weights depend on the type of train and its starting time. You also don't need to compute the whole graph: only compute the edges for the node you are currently considering. I have solved numerous similar problems and this is the way to solve them; I also tried greedy several times before I learned that it never passes. A rough sketch of the idea is below.
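Here is an illustrative Python sketch of what I mean (my own, untested; wiring the M row trains and N column trains into the neighbors function is left out). train_position encodes the back-and-forth timetable, ride finds the next usable departure by scanning one period, and earliest_arrival is the standard Dijkstra loop with edges generated on the fly:

import heapq

def train_position(t, t0, speed, n):
    # Triangle-wave schedule: the train leaves station 1 at time t0 and
    # bounces between stations 1 and n, taking `speed` time units per hop
    # (1 = fast, 2 = ordinary, 3 = slow). Returns None when the train is
    # not at a station (or has not started) at time t.
    if t < t0 or (t - t0) % speed != 0:
        return None
    k = ((t - t0) // speed) % (2 * (n - 1))  # hops into the current round trip
    return k + 1 if k < n else 2 * (n - 1) - k + 1

def ride(t, t0, speed, n, src, dst):
    # Earliest arrival at dst (adjacent to src on this train's line) if we
    # stand at src from time t onward; scanning one full period suffices.
    period = 2 * (n - 1) * speed
    for dep in range(max(t, t0), max(t, t0) + period + 1):
        if (train_position(dep, t0, speed, n) == src and
                train_position(dep + speed, t0, speed, n) == dst):
            return dep + speed
    return None

def earliest_arrival(start, t_start, dest, neighbors):
    # Dijkstra where a node's "distance" is the earliest time we can be
    # standing at that station; neighbors(u, t) yields pairs of
    # (adjacent station, arrival time via the connecting train).
    best = {start: t_start}
    heap = [(t_start, start)]
    while heap:
        t, u = heapq.heappop(heap)
        if u == dest:
            return t
        if t > best.get(u, float("inf")):
            continue  # stale heap entry
        for v, t_arr in neighbors(u, t):
            if t_arr is not None and t_arr < best.get(v, float("inf")):
                best[v] = t_arr
                heapq.heappush(heap, (t_arr, v))
    return None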
Hope this helps.
I'm playing around a bit with image processing and decided to read up on how color quantization works; after a bit of reading I found the Modified Median Cut Quantization algorithm.
I've been reading the code of the C implementation in Leptonica library and came across something I thought was a bit odd.
Now I want to stress that I am far from an expert in this area, nor am I a math-head, so I suspect this all comes down to me not understanding all of it, and not that the implementation of the algorithm is wrong.
The algorithm states that the vbox should be split along the largest axis, and that it should be split using the following logic:
The largest axis is divided by locating the bin with the median pixel
(by population), selecting the longer side, and dividing in the center
of that side. We could have simply put the bin with the median pixel
in the shorter side, but in the early stages of subdivision, this
tends to put low density clusters (that are not considered in the
subdivision) in the same vbox as part of a high density cluster that
will outvote it in median vbox color, even with future median-based
subdivisions. The algorithm used here is particularly important in
early subdivisions, and is useful for giving visible but low
population color clusters their own vbox. This has little effect on
the subdivision of high density clusters, which ultimately will have
roughly equal population in their vboxes.
For the sake of argument, let's assume that we have a vbox that we are in the process of splitting, and that the red axis is the largest. In the Leptonica implementation, at line 01297, the code appears to do the following:
Iterate over all the possible green and blue variations of the red color
For each iteration, add the population found there to the running total of pixels along the red axis
For each red value, sum up the population of the current red and the previous ones, thus storing an accumulated value for each red
note: when I say 'red' I mean each point along the axis that is covered by the iteration; the actual color may not be red, but it contains a certain amount of red
So for the sake of illustration, assume we have 9 "bins" along the red axis and that they have the following populations
4 8 20 16 1 9 12 8 8
After the iteration of all red bins, the partialsum array will contain the following count for the bins mentioned above
4 12 32 48 49 58 70 78 86
And total would have a value of 86
Once that's done, it's time to perform the actual median cut, and for the red axis this is performed at line 01346.
It iterates over the bins and checks their accumulated sums, and here's the part that throws me off relative to the description of the algorithm: it looks for the first bin whose value is greater than total/2.
Wouldn't total/2 mean that it is looking for a bin whose value is greater than the average value, and not the median? The median for the above bins would be 49.
The use of 43 versus 49 could potentially have a huge impact on how the boxes are split, even though the algorithm then proceeds by moving to the center of the larger side of where the matched value was.
Another thing that puzzles me a bit is that the paper specifies that the bin with the median value should be located, but does not mention how to proceed if there is an even number of bins. The median would be the result of (a+b)/2, and it's not guaranteed that any of the bins contains that population count. This is what makes me think that there are some approximations going on that are negligible because the split actually takes place at the center of the larger side of the selected bin.
Sorry if this got a bit long-winded, but I wanted to be as thorough as I could, because it's been driving me nuts for a couple of days now ;)
In the 9-bin example, 49 is the number of pixels in the first 5 bins. 49 is the median number in the set of 9 partial sums, but we want the median pixel in the set of 86 pixels, which is 43 (or 44), and it resides in the 4th bin.
Inspection of the modified median cut algorithm in colorquant2.c of leptonica shows that the actual cut location for the 3d box does not necessarily occur adjacent to the bin containing the median pixel. The reasons for this are explained in the function medianCutApply(). This is one of the "modifications" to Paul Heckbert's original method. The other significant modification is to make the decision of which 3d box to cut next based on a combination of both population and the product (population * volume), thus permitting splitting of large but sparsely populated regions of color space.
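To make the arithmetic concrete, here is a small illustrative snippet (my own, in Python, using the bin populations from the question):

bins = [4, 8, 20, 16, 1, 9, 12, 8, 8]
total = sum(bins)        # 86 pixels overall
half = total / 2         # 43.0, where the median pixel sits

running = 0
for i, count in enumerate(bins, start=1):
    running += count     # partial sum up to and including bin i
    if running > half:   # first partial sum past 43 is 48, at bin 4
        print("median pixel lies in bin", i)
        break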
I do not know the algorithm, but I would assume your array contains the population of each red; let's explain this with an example.
Assume you have four gradations of red: A, B, C and D.
And you have the following sequence of red values:
AABDCADBBBAAA
To find the median, you would have to sort them according to red value and take the middle:
   median
      v
AAAAAABBBBCDD
Now let's use their approach:
A:6 => 6
B:4 => 10
C:1 => 11
D:2 => 13
13/2 = 6.5 => B
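A quick throwaway Python check that both views pick the same letter:

reds = "AABDCADBBBAAA"
total = len(reds)                    # 13 values

# Median by sorting and taking the middle element:
print(sorted(reds)[total // 2])      # -> B

# Median by cumulative population counts (the approach above):
running = 0
for c in "ABCD":
    running += reds.count(c)         # running totals: 6, 10, 11, 13
    if running > total / 2:          # first total past 6.5
        print(c)                     # -> B
        break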
I think the mismatch happened because you are counting the population; the average color would be:
(6*A+4*B+1*C+2*D)/13