Why does this code generate random duplicates? - loops

Let me start by saying I'm new to python/pyspark.
I've got a dataframe of 100 items. I'm slicing that up into batches of 25, then for each batch I need to do work on each row. I'm getting duplicate values in the last "do work" step. I've verified that my original list does not contain duplicates and that my slice step generates 4 distinct lists.
batchsize = 25
sliced = []
emailLog = []
for i in range(1,bc_df.count(),batchsize):
    sliced.append({"slice":bc_df.filter(bc_df.Index >= i).limit(batchsize).rdd.collect()})
for s in sliced:
    for r in s['slice']:
        emailLog.append({"email":r['emailAddress']})
re = sc.parallelize(emailLog)
re_df = sqlContext.createDataFrame(re)
re_df.createOrReplaceTempView('email_logView')
%sql
select count(distinct(email)) from email_logView
My expectation is to have 100 distinct email addresses, but I sometimes get 75, 52, 96 or 100.

Your issue is caused by this line because it is not deterministic and allows duplicates:
sliced.append({"slice":bc_df.filter(bc_df.Index >= i).limit(batchsize).rdd.collect()})
Let's take a closer look at what is happening (I assume that the index column ranges from 1 to 100).
Your range function generates four values for i (1,26,51 and 76).
During the first iteration you request all rows whose index is 1 or greater (i.e. [1,100]) and take 25 of them.
During the second iteration you request all rows whose index is 26 or greater (i.e. [26,100]) and take 25 of them.
During the third iteration you request all rows whose index is 51 or greater (i.e. [51,100]) and take 25 of them.
During the fourth iteration you request all rows whose index is 76 or greater (i.e. [76,100]) and take 25 of them.
You can already see that the intervals overlap. Without an explicit ordering, limit(batchsize) makes no guarantee about which 25 of the matching rows it returns, so the email addresses taken in one iteration could also have been taken in a previous one.
You can fix this by simply extending your filter with an upper limit. For example:
sliced.append({"slice":bc_df.filter((bc_df.Index >= i) & (bc_df.Index < i + batchsize)).rdd.collect()})
That is just a quick fix to solve your problem. As general advice, I recommend avoiding .collect() as much as possible, because it pulls all the data onto the driver and does not scale horizontally.
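To make the overlap described above concrete, here is a minimal plain-Python simulation. It does not use Spark at all: the list `indices` is a stand-in for the Index column (assumed to run from 1 to 100), and `random.sample` is only a rough model of "limit returns some 25 matching rows", not how Spark actually chooses them.

```python
import random

batchsize = 25
indices = list(range(1, 101))  # assumed stand-in for the Index column (1..100)

def open_ended_batch(start, rng):
    # Rough model of filter(Index >= start).limit(batchsize) with no ordering:
    # any batchsize rows among the matches may come back.
    candidates = [i for i in indices if i >= start]
    return rng.sample(candidates, min(batchsize, len(candidates)))

def bounded_batch(start):
    # Model of filter((Index >= start) & (Index < start + batchsize)).
    return [i for i in indices if start <= i < start + batchsize]

rng = random.Random(0)
starts = list(range(1, len(indices), batchsize))  # [1, 26, 51, 76]

open_rows = [i for s in starts for i in open_ended_batch(s, rng)]
bounded_rows = [i for s in starts for i in bounded_batch(s)]

# Both approaches collect 100 rows, but only the bounded filter
# is guaranteed to collect 100 distinct ones.
print(len(open_rows), len(set(open_rows)))
print(len(bounded_rows), len(set(bounded_rows)))
```

With the open-ended filter the number of distinct rows varies from run to run (change the seed to see it), which matches the varying counts reported in the question.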

Related

Python: Finding the row index of a value in 2D array when a condition is met

I have a 2D array PointAndTangent of dimension 8500 x 5. The data is row-wise with 8500 data rows and 5 data values for each row. I need to extract the row index of an element in 4th column when this condition is met, for any s:
abs(PointAndTangent[:,3] - s) <= 0.005
I just need the row index of the first match for the above condition. I tried using the following:
index = np.all([[abs(s - PointAndTangent[:, 3])<= 0.005], [abs(s - PointAndTangent[:, 3]) <= 0.005]], axis=0)
i = int(np.where(np.squeeze(index))[0])
which doesn't work. I get the following error:
i = int(np.where(np.squeeze(index))[0])
TypeError: only size-1 arrays can be converted to Python scalars
I am not so proficient with NumPy in Python. Any suggestions would be great. I am trying to avoid using for loop as this is small part of a huge simulation that I am trying.
Thanks!
Possible Solution
I used the following
idx = (np.abs(PointAndTangent[:,3] - s)).argmin()
It seems to work. It returns the row index of the nearest value to s in the 4th column.
You were almost there. np.where is one of the most abused functions in numpy. Half the time, you really want np.nonzero, and the other half, you want to use the boolean mask directly. In your case, you want np.flatnonzero or np.argmax:
mask = abs(PointAndTangent[:,3] - s) <= 0.005
mask is a 1D boolean array that is True where the condition is met and False elsewhere. You can get the indices of all the True entries with flatnonzero and select the first one:
index = np.flatnonzero(mask)[0]
Alternatively, you can select the first one directly with argmax:
index = np.argmax(mask)
The solutions behave differently when no rows meet your condition. The former indexes into the result, so it will raise an error. The latter will return zero, which can also be a real result.
Both can be written as a one-liner by replacing mask with the expression that was assigned to it.
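The difference in the no-match case is easy to miss, so here is a small sketch of both behaviours. The array contents and the helper name `first_match` are made up for illustration; only the fourth column matters.

```python
import numpy as np

# Hypothetical stand-in for PointAndTangent: 6 rows, 5 columns.
PointAndTangent = np.zeros((6, 5))
PointAndTangent[:, 3] = [0.0, 0.5, 1.0, 1.002, 1.5, 2.0]

def first_match(s, tol=0.005):
    """Row index of the first value within tol of s, or None if there is none."""
    mask = np.abs(PointAndTangent[:, 3] - s) <= tol
    hits = np.flatnonzero(mask)
    return int(hits[0]) if hits.size else None

print(first_match(1.0))  # rows 2 and 3 both match; the first is returned
print(first_match(3.0))  # no row matches

# argmax on an all-False mask silently returns 0, which looks like a valid row.
no_match_mask = np.abs(PointAndTangent[:, 3] - 3.0) <= 0.005
print(np.argmax(no_match_mask))
```

Checking `hits.size` before indexing avoids the IndexError from flatnonzero, and it also avoids mistaking argmax's 0 for a real match.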

Issues with SUMPRODUCT in Excel: Trying to count the number of average subtractions above a given threshold

I have a fairly simple issue that I cannot seem to work out. It may be familiar to some of you now.
I have the following matrix (which I will refer to as two arrays):
     F  G  H  I  J  ...  R  S  T  U  V
1    1  0  0  1  1
2    4  4  2  3  5       1  2  3  1  2
3    2     2  3  1       2  0  1
4    2  1  0  0  4       0  0  3  0  0
I would like to take the difference between the average of each row in array 1 (columns F:J) and the average of each row in array 2 (columns R:V). For example, the average of F2:J2 = 3.6, the average of R2:V2 = 1.8, and the overall difference = 1.8. I would then like to count the number of overall differences which exceed a given threshold (e.g., 1), but I want to ignore rows which have no entries (see R1:V1) and/or partially missing entries (see the 2nd entry in row F3:J3 and 4th and 5th entry in row R3:V3).
I was lucky enough to be introduced to array formulae by @Tom Sharpe, and have attempted to adapt his code for a similar issue I had, e.g.:
=SUMPRODUCT(--((SUBTOTAL(1,OFFSET(F1,ROW(F1:F4)-ROW(F1),0,1,COLUMNS(F1:J1)))-SUBTOTAL(1,OFFSET(R1,ROW(R1:R4)-ROW(R1),0,1,COLUMNS(R1:V1)))>1)*(SUBTOTAL(2,OFFSET(F1,ROW(F1:F4)-ROW(F1),0,1,COLUMNS(F1:J1)))=COLUMNS(F1:J1))*(SUBTOTAL(2,OFFSET(R1,ROW(R1:R4)-ROW(R1),0,1,COLUMNS(R1:V1)))=COLUMNS(R1:V1))>0))
From what I understand, the code attempts to count the number of differences between the averages of each row in each array that exceed 1, so long as the product of the two "all columns filled" checks is >0 (i.e. both ranges have full data). However, it keeps throwing the #DIV/0! error, which I believe stems from the fact that it is still trying to subtract the averages of F1:J1 and R1:V1 (i.e., the row with the empty range), which would produce this kind of error. The correct answer for the current example is 1 (i.e., F2:J2 [3.6] - R2:V2 [1.8] = 1.8, and 1.8 > 1).
Does anyone have any ideas as to how the code can be adapted for the current purposes, and perhaps a very brief explanation of what is going awry in the current code?
You're right, SUBTOTAL falls over when it's trying to find the average of a range containing only empty cells.
If you want to persevere and try and do it the same way as before with an array formula, you have to turn it round and put the condition for all the cells in both ranges to be non-blank in an if statement so that it doesn't try and take the average unless both ranges have no blanks:
=SUM(IF((SUBTOTAL(2,OFFSET(F1,ROW(F1:F4)-ROW(F1),0,1,COLUMNS(F1:J1)))=COLUMNS(F1:J1))*(SUBTOTAL(2,OFFSET(R1,ROW(R1:R4)-ROW(R1),0,1,COLUMNS(R1:V1)))=COLUMNS(R1:V1)),
--(SUBTOTAL(1,OFFSET(F1,ROW(F1:F4)-ROW(F1),0,1,COLUMNS(F1:J1)))-SUBTOTAL(1,OFFSET(R1,ROW(R1:R4)-ROW(R1),0,1,COLUMNS(R1:V1)))>1)))
This time, unfortunately, I found I couldn't SUMPRODUCT it (I think this is because of the presence of the IF statement), so you have to enter it as an array formula using Ctrl+Shift+Enter.
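For anyone who wants to check the logic outside Excel, the guarded calculation can be sketched in a few lines of Python. This is only an illustration of the same idea (test completeness first, average second), not a translation of the formula; blank cells are represented as None and the data is copied from the example in the question.

```python
# Each row is (cells in F:J, cells in R:V); None marks a blank cell.
rows = [
    ([1, 0, 0, 1, 1], [None, None, None, None, None]),
    ([4, 4, 2, 3, 5], [1, 2, 3, 1, 2]),
    ([2, None, 2, 3, 1], [2, 0, 1, None, None]),
    ([2, 1, 0, 0, 4], [0, 0, 3, 0, 0]),
]

def is_complete(cells):
    # True only if no cell in the range is blank.
    return all(c is not None for c in cells)

def mean(cells):
    return sum(cells) / len(cells)

def count_above(rows, threshold):
    count = 0
    for left, right in rows:
        # Only take averages once both ranges are known to be fully populated,
        # which is what moving the condition into IF achieves in the formula.
        if is_complete(left) and is_complete(right):
            if mean(left) - mean(right) > threshold:
                count += 1
    return count

print(count_above(rows, 1))
```

Only row 2 is both complete and above the threshold, so the count is 1, in agreement with the expected answer stated in the question.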
Will this work for you?
=IF(NOT(OR(IFERROR(MATCH(TRUE,ISBLANK(F1:J1),0),FALSE),IFERROR(MATCH(TRUE,ISBLANK(R1:V1),0),FALSE))), SUBTOTAL(1,F1:J1)-SUBTOTAL(1,R1:V1), "Missing Value(s)")
My approach was a little different from what you tried to adapt from @TomSharp, in that I first validate that the cells have data (are not blank) and then perform the calculation, otherwise returning an error message. This is still an array formula, so when you enter it, press Ctrl+Shift+Enter.
The condition part of the opening IF() checks that no cell in either range is blank. MATCH(TRUE,ISBLANK(range),0) returns a position if some cell is blank (bad).
If there are no blank cells, MATCH returns an #N/A "error", which IFERROR turns into FALSE (good). So OR(...) is FALSE only when neither range contains a blank, and NOT(...) then lets the calculation go ahead.

Range minimum queries when array is dynamic

I have an array say A(0 indexed) of size 1.
I want to find minimum in array A between indexes k1 (k1>=0) and A.size()-1(i.e the last element).
Then I would insert the value (minimum element in the given range + some "random" constant) at the end of the array. Then I have another query to find the minimum between indexes k2 and A.size()-1. I find that, and insert the value (minimum in the given range + another "random" constant) at the end. I have to do many such queries.
Say I have N queries. The naive approach would take O(N^2).
I cannot use a segment tree directly because the array is not static. A workaround is to build a segment tree over an array of size N+1 beforehand and fill the not-yet-known values with infinity. This would give me O(N log N) complexity.
Is there any other method with O(N log N) or even O(N) complexity?
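For reference, the preallocated segment tree mentioned in the question can be sketched as follows. This is a minimal iterative version under my own naming (`MinSegmentTree`, `run_queries` and the sample queries are illustrative, not from the question): all N+1 slots start at infinity and each append is a point update.

```python
import math

class MinSegmentTree:
    """Iterative segment tree for range minimum over a fixed capacity."""

    def __init__(self, capacity):
        self.n = capacity
        # Unknown (not yet appended) positions hold infinity.
        self.tree = [math.inf] * (2 * capacity)

    def update(self, pos, value):
        i = pos + self.n
        self.tree[i] = value
        i //= 2
        while i >= 1:
            self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])
            i //= 2

    def query(self, lo, hi):
        """Minimum over the half-open index range [lo, hi)."""
        result = math.inf
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:
                result = min(result, self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                result = min(result, self.tree[hi])
            lo //= 2
            hi //= 2
        return result

def run_queries(first_value, queries):
    """queries is a list of (k, constant); returns the final array."""
    tree = MinSegmentTree(len(queries) + 1)
    tree.update(0, first_value)
    values = [first_value]
    for k, constant in queries:
        new_value = tree.query(k, len(values)) + constant
        tree.update(len(values), new_value)
        values.append(new_value)
    return values

print(run_queries(5, [(0, 3), (1, -4), (0, 1), (2, 10)]))
```

Each query and each append costs O(log N), which gives the O(N log N) total stated above.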
There is absolutely no need to use advanced data structures like a tree here. A simple local variable and a list will do it all:
Create an empty list (say minList).
Start from the end index and go to the start index of the initially given array, putting the minimum value seen so far (i.e. the minimum from that index to the end) at the front of the list (i.e. do push_front).
Let's say the provided array is:
70 10 50 40 60 90 20 30
So the resultant minList will be:
10 10 20 20 20 20 20 30
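The construction of minList above is a suffix-minimum scan. A minimal sketch (the function name is mine, and it fills a preallocated list from the back rather than calling push_front, which gives the same result):

```python
def suffix_minimums(values):
    """result[i] is the minimum of values[i:], built in one backward pass."""
    result = [None] * len(values)
    running_min = float("inf")
    for i in range(len(values) - 1, -1, -1):
        running_min = min(running_min, values[i])
        result[i] = running_min
    return result

print(suffix_minimums([70, 10, 50, 40, 60, 90, 20, 30]))
```

For the example array this prints [10, 10, 20, 20, 20, 20, 20, 30], matching the minList shown above.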
After doing that, you only need to keep track of the minimum among the newly appended elements of the continuously changing array (say, minElemAppended).
Let's say you get k = 5 and randomConstant = -10, then
minElemAppended = minimum(minList[k-1] + randomConstant, minElemAppended)
By adopting this approach:
You don't need to traverse the appended part of the array, or even the initially given part.
You have the option not to append the elements at all.
Time Complexity: O(N) to process N queries.
Space Complexity: O(N) to store the minList

How do I check to see if two (or more) elements of an array/vector are the same?

For one of my homework problems, we had to write a function that creates an array containing n random numbers between 1 and 365. (Done). Then, check if any of these n birthdays are identical. Is there a shorter way to do this than doing several loops or several logical expressions?
Thank you!
CODE SO FAR, NOT DONE YET!!
function = [prob] bdayprob(N,n)
N = input('Please enter the number of experiments performed: N = ');
n = input('Please enter the sample size: n = ');
count = 0;
for(i=1:n)
x(i) = randi(365);
if(x(i)== x)
count = count + 1
end
return
If I'm interpreting your question properly, you want to check to see if generating n integers or days results in n unique numbers. Given your current knowledge in MATLAB, it's as simple as doing:
n = 30; %// Define sample size
N = 10; %// Define number of trials
%// Define logical array where each location tells you whether
%// birthdays were repeated for a trial
check = false(1, N);
%// For each trial...
for idx = 1 : N
    %// Generate sample size random numbers
    days = randi(365, n, 1);
    %// Check to see if the total number of unique birthdays
    %// are equal to the sample size
    check(idx) = numel(unique(days)) == n;
end
Woah! Let's go through the code slowly, shall we? We first define the sample size and the number of trials. We then specify a logical array where each location tells you whether all of the birthdays generated for that trial were unique. Now, we start a loop where for each trial, we generate n (the sample size) random integers from 1 to 365. We then use unique to figure out all of the distinct integers that were generated. If all of the birthdays are unique, then the total number of unique birthdays generated should equal the sample size. If it doesn't, then we have repeats. For example, if we generated a sample of [1 1 1 2 2], the output of unique would be [1 2], and the total number of unique elements is 2. Since this doesn't equal 5, the sample size, we know that the birthdays generated weren't unique. However, if we had [1 3 4 6 7], unique would return the same five values, and since the output length is the same as the sample size, we know that all of the days are unique.
So, we check to see if this number is equal to the sample size for each iteration. If it is, then we output true. If not, we output false. When I run this code on my end, this is what I get for check. I set the sample size to 30 and the number of trials to be 10.
check =
0 0 1 1 0 0 0 0 1 0
Take note that if you increase the sample size, there is a higher probability that you will get duplicates, because randi can be considered as sampling with replacement. Therefore, the larger the sample size, the higher the chance of getting duplicate values. I made the sample size small on purpose so that we can see that it's possible to get unique days. However, if you set it to something like 100, or 200, you will most likely get check to be all false as there will most likely be duplicates per trial.
Here are some more approaches that avoid loops. Let
n = 20; %// define sample size
x = randi(365,n,1); %// generate n values between 1 and 365
Any of the following code snippets returns true (or 1) if there are two identical values in x, and false (or 0) otherwise:
Sort and then check if any two consecutive elements are the same:
result = any(diff(sort(x))==0);
Do all pairwise comparisons manually; remove self-pairs and duplicate pairs; and check if any of the remaining comparisons is true:
result = nnz(tril(bsxfun(@eq, x, x.'),-1))>0;
Compute the distance between distinct values, considering each pair just once, and then check if any distance is 0:
result = any(pdist(x(:))==0);
Find the number of occurrences of the most common value (mode):
[~, occurs] = mode(x);
result = occurs>1;
I don't know if I'm supposed to solve the problem for you, but perhaps a few hints may lead you in the right direction (besides I'm not a matlab expert so it will be in general terms):
Maybe not, but you have to ask yourself what they expect of you. The solution you propose requires you to loop through the array in two nested loops which will mean n*(n-1)/2 times through the loop (ie quadratic time complexity).
There are a number of ways you can improve the time complexity of the problem. The most straightforward would be to have a 365-element table where you keep track of whether a particular number has been seen yet, which would require only a single loop (i.e. linear time complexity), but perhaps that's not what they're looking for either. But maybe that solution is a little bit ad hoc? What we're basically looking for is a fast lookup of whether a particular number has been seen before. There exist more memory-efficient structures that allow lookup in O(1) time or O(log n) time (if you know these, you have an arsenal of tools to use).
Then of course you could use the pigeonhole principle to provide the answer much faster in some special cases (remember that you were only asked to determine whether two or more numbers are equal or not).
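Since this answer is in general terms rather than MATLAB, here is a minimal Python sketch of the two hints: the "seen" table with a single loop, and the pigeonhole shortcut. The function name and the table layout (index 0 unused so that days 1..365 map directly) are my own choices.

```python
def has_shared_birthday(days):
    """True if any value in days (integers from 1 to 365) occurs more than once."""
    # Pigeonhole principle: more than 365 birthdays must contain a repeat.
    if len(days) > 365:
        return True
    seen = [False] * 366  # index 0 is unused
    for day in days:
        if seen[day]:
            return True  # stop at the first repeat
        seen[day] = True
    return False

print(has_shared_birthday([1, 1, 1, 2, 2]))  # repeated days
print(has_shared_birthday([1, 3, 4, 6, 7]))  # all distinct
```

The loop visits each birthday at most once, so the check is linear in n instead of the n*(n-1)/2 comparisons of the nested-loop approach.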

constraining values to a range in an array

I can limit an array to values less than or greater than a single value, but how can I limit an array of values to a specific range?
Example snippet of code below:
arrayphase_sort=sortrows(arrayphase,4); %sort by phase in deg low to high
arrayphase_sort_limit_idx=arrayphase_sort(:,4)<45;% idx to limit array to phase angles under 45 degree
arrayphase_sort_limit=arrayphase_sort(arrayphase_sort_limit_idx,:); %limit array to phase angles under 45 degree
but I tried adding &>10 to see if I could get the array to show everything greater than 10 and less than 45 example below: (but I get an error)
arrayphase_sort_limit_idx=arrayphase_sort(:,4)<45**&>10**;
I know it's a syntax issue, but I'm not sure what the proper syntax is.
Any idea what the proper syntax is to accomplish what I'm trying to do?
Thanks
You do it like this:
A = round(180 * rand(10, 10))
A(A > 10 & A < 45)
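The first line generates a 10x10 matrix of random data; the second line extracts the numbers strictly between 10 and 45. Each comparison has to name the array explicitly, so in your case the index would be arrayphase_sort(:,4)>10 & arrayphase_sort(:,4)<45.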
