constraining values to a range in an array - arrays

I can limit an array to values less than or greater than using individual values but how can I limit an array of values to a specific range.
Example snippet of code below:
arrayphase_sort=sortrows(arrayphase,4); %sort by phase in deg low to high
arrayphase_sort_limit_idx=arrayphase_sort(:,4)<45;% idx to limit array to phase angles under 45 degree
arrayphase_sort_limit=arrayphase_sort(arrayphase_sort_limit_idx,:); %limit array to phase angles under 45 degree
but I tried adding &>10 to see if I could get the array to show everything greater than 10 and less than 45 example below: (but I get an error)
arrayphase_sort_limit_idx=arrayphase_sort(:,4)<45**&>10**;
I know it's a syntax issues but I'm not sure the proper syntax.
Any idea the proper syntax to accomplish what I'm trying to do.
Thanks

You do it like this:
A = round(180 * rand(10, 10))
A(A > 10 & A < 45)
First line generates a 10x10 matrix of random data, the second line extracts numbers between 10 and 45.

Related

How to count for 2 different arrays how many times the elements are repeated, in MATLAB?

I have array A (44x1) and B (41x1), and I want to count for both arrays how many times the elements are repeated. And if the repeated values are present in both arrays, I want their counting to be divided (for instance: value 0.5 appears 500 times in A and 350 times in B, so now divide 500 by 350).
I have to do this for bigger arrays as well, so I was thinking about using a looping (but no idea how to do it on MATLAB).
I got what I want on python:
import pandas as pd
data1 = pd.read_excel('C:/Users/Desktop/Python/data1.xlsx')
data2 = pd.read_excel('C:/Users/Desktop/Python/data2.xlsx')
for i in data1['Mag'].value_counts() & data2['Mag'].value_counts():
a = data1['Mag'].value_counts()/data2['Mag'].value_counts()
print(a)
break
Any idea of how to do the same on MATLAB? Thanks!
Since you can enumerate all valid earthquake magnitude values, you could use:
% Make up some data
A=randi([2 58],[100 1])/10;
B=randi([2 58],[20 1])/10;
% Round data to nearest tenth
%A=round(A,1); %uncomment if necessary
%B=round(B,1); %same
% Divide frequencies
validmags=0.2:0.1:5.8;
Afreqs=sum(double( abs(A-validmags)<1e-6 ),1); %relies on implicit expansion; A must be a column vector and validmags must be a row vector; dimension argument to sum() only to remind user; double() not really needed
Bfreqs=sum(double( abs(B-validmags)<1e-6 ),1); %same
Bfreqs./Afreqs, %for a fancier version: [{'Magnitude'} num2cell(validmags) ; {'Freq(B)/Freq(A)'} num2cell(Bfreqs./Afreqs)].'
The last line will produce NaN for 0/0, +Inf for nn/0, and 0 for 0/nn.
You could also use uniquetol, align the unique values of each vector, and divide the respective absolute frequencies. But I think the above approach is cleaner and easier to understand.

Why does this code generate random duplicates?

Let me start by saying I'm new to python/pyspark
I've got a dataframe of 100 items, I'm slicing that up into batches of 25 then for each batch I need to do work on each row. I'm getting duplicate values in the last do work step. I've verified my original list does not contain duplicates, my slice step generates 4 distinct lists
batchsize = 25
sliced = []
emailLog = []
for i in range(1,bc_df.count(),batchsize):
sliced.append({"slice":bc_df.filter(bc_df.Index >= i).limit(batchsize).rdd.collect()})
for s in sliced:
for r in s['slice']:
emailLog.append({"email":r['emailAddress']})
re = sc.parallelize(emailLog)
re_df = sqlContext.createDataFrame(re)
re_df.createOrReplaceTempView('email_logView')
%sql
select count(distinct(email)) from email_logView
My expectation is to have 100 distinct email addresses, I sometiems get 75, 52, 96, 100
Your issue is caused by this line because it is not deterministic and allows duplicates:
sliced.append({"slice":bc_df.filter(bc_df.Index >= i).limit(batchsize).rdd.collect()})
Let's take a closer look at what is happening (I assume that the index column ranges from 1 to 100).
Your range function generates four values for i (1,26,51 and 76).
During the first iteration you request all rows which index is 1 or greater (i.e. [1,100]) and take 25 of them.
During the second iteration you request all rows which index is 26 or greater (i.e. [26,100]) and take 25 of them.
During the third iteration you request all rows which index is 51 or greater (i.e. [51,100]) and take 25 of them.
During the fourth iteration you request all rows which index is 76 or greater (i.e. [76,100]) and take 25 of them.
You already see that the intervals are overlapping. That means that the email addresses of an iteration could also have been taken by previous iterations.
You can fix this by simply extending your filter with an upper limit. For example:
sliced.append({"slice":bc_df.filter((bc_df.Index >= i) & (bc_df.Index < i + batchsize)).rdd.collect()})
That is just a quick fix to solve your problem. As general advise I recommend you to avoid .collect() as often as possible because it does not scale horizontaly.

Using an Array to get the average of a selection of numbers

I have 12 different numbers on a row in an excel sheet (repeated few times on different row) and I would like to get the average of the 8 bigger numbers (4 numbers won't be used). I thought to use an array, but how to select the 8 bigger numbers in this array?
I would like to do it in VBA. Do you have any ideas/directions how to proceed, because I am not really sure how to start.
Example: 12,3,5,6,8,11,9,7,5,8,10,1 (Results=8.875), if I didn't make a mistake!
Thanks.
I know you asked for VBA, but here's a formula. If this works, you can quickly throw in VBA (use the macro recorder to see how if you're unsure).
If that's in row 1, you can use:
=AVERAGE(LARGE(A1:L1,{1,2,3,4,5,6,7,8}))
You should be able to change A1:l1 to be the range where your numbers are, i.e.
=AVERAGE(LARGE(A1:F2,{1,2,3,4,5,6,7,8}))
Edit: Per your comment, also no need for VBA. Just use conditional highlighting. Highlight your range of numbers, then go to Conditional Highlighting --> Top/Bottom Rules --> Top 10 Items. Then just change the 10 to 8, and choose a color/format. Then Apply!
(each row in the above is treated as it's own range).
Edt2: bah, my image has duplicate numbers which is why the top 8 rule is highlighting 9 numbers sometimes. If you request I'll try to edit and tweak
You can use a simple Excel function to compute the answer
=AVERAGE(LARGE(A1:L1, {1,2,3,4,5,6,7,8}))
It is a bit more complex in VBA, but here is a function that does the same computation (explicit typing per Mat's Mug):
Public Function AvgTop8(r As Range) As Double
Dim Sum As Double, j1 As Integer
For j1 = 1 To 12
If WorksheetFunction.Rank(r(1, j1), r) <= 8 Then
Sum = Sum + r(1, j1)
End If
Next j1
AvgTop8 = Sum / 8#
End Function
The Excel RANK function returns the rank of a number in a list of numbers, with the largest rank 1 and so on. So you can search through the 12 numbers for the 8 largest, sum them and divide by 8.
You don't need VBA for this. You can use:
=AVERAGE(LARGE(A1:L1,{1;2;3;4;5;6;7;8}))

Matlab- Create cell confusion matrix

I have the following cell matrix, which will be used as a confusion matrix:
confusion=cell(25,25);
Then, I have two other cell arrays, on which each line contains predicted labels (array output) and another cell matrix containing the real labels (array groundtruth).
whos output
Name Size Bytes Class Attributes
output 702250x1 80943902 cell
whos groundtruth
Name Size Bytes Class Attributes
groundtruth 702250x1 84270000 cell
Then, I created the following script to create the confusion matrix
function confusion=write_confusion_matrix(predict, groundtruth)
confusion=cell(25,25);
for i=1:size(predict,1)
confusion{groundtruth{i},predict{i}}=confusion{groundtruth{i}, predict{i}}+1;
end
end
But when I run it in matlab I have the following error:
Index exceeds matrix dimensions.
Error in write_confusion_matrix (line 4)
confusion{groundtruth{i},predict{i}}=confusion{groundtruth{i}, predict{i}}+1;
I was curious to print output's and groundtruth's values to see what was happening
output{1}
ans =
2
groundtruth{1}
ans =
1
So, nothing seems to be wrong with the values, so what is wrong here? is the confusion matrix's indexing right in the code?
The error occurs in a for loop. Checking the first iteration of the loop is not sufficient in this case. Index exceeds matrix dimensions means there exists an i in the range of 1:size(output,1) for which either groundtruth{i} or output{i} is greater than 25.
You can find out which one has at least one element bigger than the range:
% 0 means no, there is none above 25. 1 means yes, there exists at least one:
hasoutlier = any(cellfun(#(x) x > 25, groundtruth)) % similar for 'output'
Or you can count them:
outliercount = sum(cellfun(#(x) x > 25, groundtruth))
Maybe you also want to find these elements:
outlierindex = find(cellfun(#(x) x > 25, groundtruth))
By the way, I am wondering why are you working with cell arrays in this case? Why not numeric arrays?

Sorting elements of a single array into different subarrays

I have an 1000 element array with values ranging from 1 - 120. I want to split this array into 6 different subarrays with respect to the value range
for ex:
array1 with values from ranges 0-20.
array 2 with values from range 20-40........100-120 etc.
At the end I would like to plot a histogram with X-axis as the range and each bar depicting the number of elements in that range. I dont know of any other way for 'this' kind of plotting.
Thanks
In other words, you want to create a histogram. Matlab's hist() will do this for you.
If you only need the histogram, you can achieve the result using histc, like this:
edges = 0:20:120; % edges to compute histogram
n = histc(array,edges);
n = n(1:end-1); % remove last (no needed in your case; see "histc" doc)
bar(edges(1:end-1)+diff(edges)/2, n); % do the plot. For x axis use
% mean value of each bin

Resources