How can I write a loop to get samples from different ranges of a variable and draw histograms for them in another dataset?

In a data frame that has 2 columns, name and pvalue, I need to write a loop to get 20 samples (the samples are gene-set names, which are sometimes quite long) from different ranges of p-values, including:
Less than or equal to 0.001
Between 0.001 and 0.01
Between 0.01 and 0.05
Between 0.05 and 0.10
Between 0.10 and 0.20
Between 0.20 and 0.50
Larger than 0.50
and then, for each sampling range, I want to find these 20 samples' names in another dataset and draw a histogram for each sample on one sheet. Finally, I need to draw the histograms of these 20 names in 4 rows and 5 columns. I would like to write a loop to do this in a smart way, as I need to repeat this process several times; also, I am new to R programming, I am not familiar with writing loops, and what I want to do is a little bit complicated for me. I appreciate any help. Thank you!
I think I have to start with getting 20 samples.
MAIN <- sample(DATA$name[DATA$pvalue <= 0.001], 20, replace = FALSE)
It gives me the names of 20 samples.
Now I want to find each name in a new dataset. The new dataset is like the previous one, containing name and pvalue, but each name is repeated about 100 times. I want to draw a histogram for each name; in total I would like to have 20 histograms on one sheet. I don't have any idea for this part.
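As a rough sketch of the sampling step, shown here in Python for illustration (the data, names, and bin edges below are made up; the same logic transfers directly to R):

```python
import random

random.seed(0)

# Hypothetical stand-in for the DATA frame: (name, pvalue) pairs.
data = [(f"gene_set_{i}", random.random()) for i in range(500)]

# The p-value ranges from the question, as (low, high] bins.
bins = [(0.0, 0.001), (0.001, 0.01), (0.01, 0.05), (0.05, 0.10),
        (0.10, 0.20), (0.20, 0.50), (0.50, 1.0)]

def sample_names(lo, hi, k=20):
    """Sample up to k names whose p-value lies in (lo, hi]."""
    pool = [name for name, p in data if lo < p <= hi]
    return random.sample(pool, min(k, len(pool)))

# One draw of up to 20 names per range.
samples_per_bin = {b: sample_names(*b) for b in bins}
for (lo, hi), picked in samples_per_bin.items():
    print(f"({lo}, {hi}]: {len(picked)} names")
```

In R, the equivalent of `sample_names` is `sample(DATA$name[DATA$pvalue > lo & DATA$pvalue <= hi], 20)`, and the 4 x 5 sheet can be produced with `par(mfrow = c(4, 5))` followed by one `hist()` call per sampled name, looking each name up in the second dataset.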


Vlookup an array of formulas in Excel

I have one table with two columns
ID Probability
A 1%
B 2%
C 3%
D 4%
I have another table, with some IDs and corresponding weights:
ID Weight
A 50%
D 25%
A 15%
B 5%
B 5%
What I'm looking for is a way, in a single formula, to find the corresponding probabilities for each of the IDs in the second table using the data from the first, multiply each by their respective weights from the second table, then sum the results.
I recognise a simple way to solve it would be to add a proxy column to the second table and list corresponding probabilities using a vlookup and multiplying by the weight, then summing the results, but I feel like there must be a more elegant solution.
I've tried entering the second table IDs as an array in both Vlookup and Index/Match formulas, but while both accept a range as a lookup value, both only execute for the first value of the range instead of cycling through the whole array.
I guess ideally the formula would
set an 1 x 5 array for the IDs,
populate a new 1 x 5 array based on the probabilities from the first table
multiply the new array by the existing 1x5 array for weights
Sum whatever is the result
[edit] So for the above example, the final result would be (50% x 1%)+(25% x 4%) + (15% x 1%) + (5% x 2%) + (5% x 2%) = 1.85%
The real tables are much, much bigger than the examples I've given so a simple Sum() function for individual vlookups is out.
I'd love to hear of any clever solutions.
Using the same ranges as given by Trương Ngọc Đăng Khoa:
=SUMPRODUCT(SUMIF(A1:A4,D1:D5,B1:B4),E1:E5)
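For anyone checking the arithmetic outside Excel, this Python sketch mirrors what the SUMPRODUCT/SUMIF combination computes (the dict and list simply stand in for the two tables):

```python
# First table: ID -> probability.
prob = {"A": 0.01, "B": 0.02, "C": 0.03, "D": 0.04}

# Second table: (ID, weight) rows; IDs may repeat.
weights = [("A", 0.50), ("D", 0.25), ("A", 0.15), ("B", 0.05), ("B", 0.05)]

# SUMIF looks up each ID's probability; SUMPRODUCT multiplies each
# by its weight and sums the results in one pass.
total = sum(prob[i] * w for i, w in weights)
print(f"{total:.2%}")  # 1.85%
```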
Regards
You can use this formula :
{=SUM(LOOKUP(D1:D5;A1:A4;B1:B4)*E1:E5)}
With the table laid out like this:
   A  B    C  D  E
1  A  1%      A  50%
2  B  2%      D  25%
3  C  3%      A  15%
4  D  4%      B   5%
5             B   5%
Great response, thanks guys!
XOR LX, your answer seemed to work in all cases, which is what I was looking for (and seems like it was much simpler than I'd originally thought). I think I misunderstood the way the SUMIF function works.
In case anyone is interested, I also found my own (stupidly complex) solution:
=SUM(IF(A1:A4=TRANSPOSE(D1:D5),1,0)*TRANSPOSE(E1:E5)*B1:B4)
Which basically works by transforming the thing into a 4 x 5 matrix instead. I think I still prefer the XOR LX solution for its simplicity.
Appreciate the help, everyone!

A way to effectively remove outliers from a big array in matlab

So in my software that I am developing, at some point, I have a big array of around 250 elements. I am taking the average of those elements to obtain one mean value. The problem is I have outliers in this big array at the beginning and at the end. So for instance the array could be:
A = [150 200 250 300 1100 1106 1130 1132 1120 1125 1122 1121 1115 2100 2500 2400 2300]
So in this case I would like to remove 150 200 250 300 2100 2500 2400 2300 from the array...
I know I could set those indexes to zero, but I need a way for the software to automatically remove those outliers no matter how many there are at the start or at the end.
Can anyone suggest a robust way of removing those outliers?
You can do something like:
A(A>(mean(A)-std(A)) & A<(mean(A)+std(A)))
> ans = 1100 1106 1130 1132 1120 1125 1122 1121 1115
Normally a robust estimator works better with outliers (https://en.wikipedia.org/wiki/Robust_statistics). The estimated mean and std will change a lot if the outliers are very large. I prefer to use the median and the median absolute deviation (https://en.wikipedia.org/wiki/Median_absolute_deviation).
med = median(A)
mad = median(abs(med-A))
out = (A <med - 3*mad) | (A > med + 3*mad)
A(out) = []
It also depends a lot on what your data represents and what the distribution looks like (hist(A)). For example, if your data is skewed towards large values, you could remove everything above the 0.95 quantile, or something similar. Sometimes applying a transformation to make the distribution resemble a normal distribution works better; for example, if the distribution is skewed to the right, use a log transform.
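For reference, the same median/MAD filter from the MATLAB snippet above, sketched in Python using only the standard library (the small median helper is just to keep the example self-contained):

```python
# Median absolute deviation (MAD) outlier filter on the example array.
A = [150, 200, 250, 300, 1100, 1106, 1130, 1132, 1120, 1125,
     1122, 1121, 1115, 2100, 2500, 2400, 2300]

def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

med = median(A)
mad = median([abs(x - med) for x in A])

# Keep values within med +/- 3*mad, as in the MATLAB version.
keep = [x for x in A if med - 3 * mad <= x <= med + 3 * mad]
print(keep)  # [1100, 1106, 1130, 1132, 1120, 1125, 1122, 1121, 1115]
```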
I use a reference approach in this case: pick, e.g., 15 elements from the middle of the array, calculate their average/median, and then compare the remaining values against it (using the std, or diff(A(end-1:end))). In general, try using the median instead of the mean.

Calculating the probability of incorrect events within independent groups

I have the following structure:
T = struct('Time',{20, 40, 50, 80, 120, 150, 190, 210, 250, 260, 270, 320, 350, 380, 385, 390, 395},...
'Trial',{'correct','incorrect','incorrect','correct','correct','correct','incorrect','incorrect','correct','correct','correct','incorrect','incorrect','correct','correct','incorrect','incorrect'});
I would like to perform the following two tasks:
I want to get the probability of having an 'incorrect' per each 100 ms time window (interval).
For example, for the first time window (the first 100 ms) there are 4 trials and 2 of them are 'incorrect', so it would be 2/4 = 0.5.
I want to plot a bar graph of the probabilities for each 100 ms time window. The x axis would be time and each bar's width would be 100 ms and its height is the probability for that specific window.
I really appreciate any help.
This goes against my policy in answering questions without any effort made by the question poser, but this seems like an interesting question, so I'll make an exception.
First, split up each of the Time and Trial fields so that they're in separate arrays. For the Trial fields, I'm going to convert them into labels 1 and 2 to denote correct and incorrect for ease of implementation:
time = [T.Time].';
trial = {T.Trial}.';
[~,~,trial_ID] = unique(trial);
Next, what you can do is take each entry in the time array and divide by 100 while taking the floor. Values that belong to the same ID mean that they belong to a group of 100 ms. Note that we also need to add 1 for the next step... you'll see why:
groups = floor(time/100) + 1;
Now, here's probably one of the most beautiful functions you can ever use in MATLAB: accumarray. accumarray groups portions of an array based on an ID and you apply a function to all of the values per group. In our case, we want to group all of the correct and incorrect IDs based on the groups array, then from there we determine the total fraction of values that are incorrect per group.
Specifically, what we're going to do is for each group of values specified in groups, we will take a look at the correct and incorrect numeric labels and determine how many were incorrect by summing over how many were equal to 2 for each group, then dividing by how many there were per group. The groups need to start at index 1, which is why we had to add 1 to groups. Without it, the first group would actually start at 0, and MATLAB starts indexing at 1, hence the offset:
per = accumarray(groups, trial_ID, [], @(x) sum(x == 2) / numel(x));
per contains the fraction that were incorrect per group, and we get:
>> per
per =
0.5000
0.3333
0.2500
0.6667
Very nice! Doing a quick hand calculation will demonstrate that you get the correct results.
Now the last part is to plot the probabilities on a bar graph. That's very simple:
bar(100*(1:numel(per)), per);
xlabel('Time (ms)');
ylabel('Probability');
I create a vector that starts at 100 and goes up in multiples of 100, one entry per group. In our case we have 4 groups, as the time goes up to 395 ms.
As such, we get:
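To sanity-check the accumarray result outside MATLAB, here is a rough Python equivalent of the grouping step, using the same times and trial labels as the question:

```python
from collections import defaultdict

times = [20, 40, 50, 80, 120, 150, 190, 210, 250, 260, 270,
         320, 350, 380, 385, 390, 395]
trials = ["correct", "incorrect", "incorrect", "correct", "correct",
          "correct", "incorrect", "incorrect", "correct", "correct",
          "correct", "incorrect", "incorrect", "correct", "correct",
          "incorrect", "incorrect"]

# Group each trial into its 100 ms window (0-based here; MATLAB's
# accumarray needed the +1 offset because it indexes from 1).
counts = defaultdict(lambda: [0, 0])   # window -> [incorrect, total]
for t, outcome in zip(times, trials):
    window = t // 100
    counts[window][0] += outcome == "incorrect"
    counts[window][1] += 1

per = [inc / tot for _, (inc, tot) in sorted(counts.items())]
print(per)  # [0.5, 0.3333333333333333, 0.25, 0.6666666666666666]
```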

random number generator using odds

So we have a group project that's due at the end of the day, and everyone has done their part except for one person. We don't have much time left and I haven't heard from that person, so I decided to just write that person's part myself in case they never upload it.
The problem is... I have no idea how to do this part.
There are three horses, and I have to make it so that one horse wins the race randomly. That's easy, just use:
#include <stdlib.h>   /* srand, rand */
#include <time.h>     /* time */

srand((unsigned)time(NULL));
int winner = 1 + rand() % (3 - 1 + 1);   /* uniform value in 1..3 */
The problem is that each horse should have a different probability
horse 1 has a 45% chance of winning
horse 2 has 30%
horse 3 25%
(notice these add up to 100)
Can somebody please help me figure out how to make a horse randomly win using these probabilities?
Due to the circumstances I have until the end of the day to figure this out on my own :(
Pick a random number in the range 1 to 100, uniformly distributed. 1 to 45 is horse 1, 46 to 75 is horse 2, and 76 to 100 is horse 3.
Adjust algorithm to zero based indexing if you prefer.
Generate a U(0,1), call it u. If u <= 0.45 horse 1 wins, else if u <= 0.75 (i.e., between 0.45 and 0.75) it's horse 2, else it's horse 3. This is conceptually like the integer-based solution proposed by David Heffernan, but can handle arbitrary probabilities that don't map nicely to integer sets, such as 1/pi or 1/e.
Split the range of numbers generated by the RNG into pieces with the respective sizes. Then figure out which piece a number lies in to decide which horse wins.
It sounds like the number of horses is small, so a simple if else chain will probably work fine. If you have many horses and speed is important, put the boundaries of the pieces in an array and find the correct one with binary search.
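Both suggestions above (cumulative boundaries, plus binary search when there are many horses) can be sketched like this; the Python below is illustrative, and the `pick_winner` helper is not part of the original C code:

```python
import bisect

probs = [0.45, 0.30, 0.25]          # horses 1, 2, 3

# Cumulative boundaries: [0.45, 0.75, 1.0].
bounds = []
acc = 0.0
for p in probs:
    acc += p
    bounds.append(acc)

def pick_winner(u):
    """Map a uniform draw u in [0, 1] to a 1-based horse number
    by binary-searching the cumulative boundaries."""
    return bisect.bisect_left(bounds, u) + 1

# u <= 0.45 -> horse 1; 0.45 < u <= 0.75 -> horse 2; else horse 3.
print(pick_winner(0.30), pick_winner(0.60), pick_winner(0.90))
```

With only three horses a plain if/else chain does the same job; the binary search only starts to pay off with many outcomes.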
Try using the random number generator function.
Include the header file
#include <stdlib.h>
or, in C++,
#include <cstdlib>
Assign the function's return value to a variable in which the value will be stored, e.g.
r = rand();
Then you can have each horse as an element of an array.
Sort this array using a sorting technique to find the largest value.

simplified _resample_ algorithm in matlab

I am generating variable-size rows of samples from a DSP algorithm.
I mean each row contains a random number of elements (well, depending on the input).
I would like to resize each row to a specific number of samples.
Ex: column count in each row: 15 24 41 09 27
Say I would like to make it 30 elements per row.
Each row holds the samples of a digitized curve.
I'm interested in making the rows contain equally sized sets of samples.
I think you need to resample your row values, the idea is roughly like this:
interpolate each row to a continuous curve
quantize each curve to a fixed number of values (30)
Obviously, for rows with more than 30 values, you will lose some information.
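As a sketch of the interpolate-then-quantize idea (in Python with made-up example rows; in MATLAB, interp1 over a fixed query grid plays the same role):

```python
def resample(row, n=30):
    """Linearly interpolate a row of samples to exactly n values."""
    if len(row) == 1:
        return list(row) * n
    step = (len(row) - 1) / (n - 1)       # spacing in original-index units
    out = []
    for i in range(n):
        x = i * step
        j = min(int(x), len(row) - 2)     # left neighbour index
        frac = x - j
        out.append(row[j] * (1 - frac) + row[j + 1] * frac)
    return out

# Ragged input rows of different lengths, all resampled to 30 columns.
rows = [[1, 2, 3], [5, 4, 3, 2, 1, 0, 1, 2, 3]]
fixed = [resample(r) for r in rows]
print([len(r) for r in fixed])  # [30, 30]
```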
