NumPy: change elements of an array with a given probability

I have an array of random numbers. I want to change only some of the elements, each with a probability of, say, 0.07. Currently I am doing this with a for loop that iterates over every element. Is there a better way of doing this?

If the array in question is called a, you can select an average proportion of 0.07 of its values by
a[numpy.random.rand(*a.shape) < 0.07]
I don't know how you want to change these values. To multiply them by two, just do
a[numpy.random.rand(*a.shape) < 0.07] *= 2.0

Sven's answer is elegant. However, it is much faster to pick the elements you want to change with
n = numpy.random.binomial(len(a), 0.07)
a[numpy.random.randint(0, len(a), size=n)] *= 2.0
The first expression determines how many elements you want to sample: n is an integer between 0 and len(a), with mean 0.07 * len(a). The second generates exactly that many random indices. (Note, however, that you might get the same index several times.)
The speed difference compared to
a[numpy.random.rand(len(a)) < p]
becomes small as p approaches 1, but for small p it can be a factor of 10 or more.
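For concreteness, here is a minimal side-by-side sketch of the two approaches; the array size and p are made up, and the exact speedup will vary with both:

import numpy as np

a = np.random.rand(1_000_000)
p = 0.07

# Mask approach: draws one uniform number per element of a
a[np.random.rand(len(a)) < p] *= 2.0

# Binomial approach: draws only about p*len(a) random numbers in total
n = np.random.binomial(len(a), p)           # how many elements to modify
idx = np.random.randint(0, len(a), size=n)  # which elements (indices may repeat)
a[idx] *= 2.0                               # a repeated index is only applied once here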

Related

How to count for 2 different arrays how many times the elements are repeated, in MATLAB?

I have arrays A (44x1) and B (41x1), and I want to count, for both arrays, how many times each element is repeated. And if a repeated value is present in both arrays, I want the counts to be divided (for instance: value 0.5 appears 500 times in A and 350 times in B, so divide 500 by 350).
I have to do this for bigger arrays as well, so I was thinking about using a loop (but I have no idea how to do that in MATLAB).
I got what I want in Python:
import pandas as pd
data1 = pd.read_excel('C:/Users/Desktop/Python/data1.xlsx')
data2 = pd.read_excel('C:/Users/Desktop/Python/data2.xlsx')
for i in data1['Mag'].value_counts() & data2['Mag'].value_counts():
    a = data1['Mag'].value_counts() / data2['Mag'].value_counts()
    print(a)
    break
Any idea of how to do the same on MATLAB? Thanks!
Since you can enumerate all valid earthquake magnitude values, you could use:
% Make up some data
A=randi([2 58],[100 1])/10;
B=randi([2 58],[20 1])/10;
% Round data to nearest tenth
%A=round(A,1); %uncomment if necessary
%B=round(B,1); %same
% Divide frequencies
validmags=0.2:0.1:5.8;
Afreqs=sum(double( abs(A-validmags)<1e-6 ),1); %relies on implicit expansion; A must be a column vector and validmags must be a row vector; dimension argument to sum() only to remind user; double() not really needed
Bfreqs=sum(double( abs(B-validmags)<1e-6 ),1); %same
Bfreqs./Afreqs, %for a fancier version: [{'Magnitude'} num2cell(validmags) ; {'Freq(B)/Freq(A)'} num2cell(Bfreqs./Afreqs)].'
The last line will produce NaN for 0/0, +Inf for n/0, and 0 for 0/n, where n is any nonzero count.
You could also use uniquetol, align the unique values of each vector, and divide the respective absolute frequencies. But I think the above approach is cleaner and easier to understand.
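For comparison with the asker's Python attempt, here is a hypothetical pandas version of the same enumerate-and-divide idea (the 'Mag' column name comes from the question; the data below is made up):

import numpy as np
import pandas as pd

# Made-up magnitudes on the same 0.2:0.1:5.8 grid as the MATLAB example
validmags = np.round(np.arange(0.2, 5.9, 0.1), 1)
A = pd.Series(np.random.choice(validmags, 100), name='Mag')
B = pd.Series(np.random.choice(validmags, 20), name='Mag')

# Count every valid magnitude (0 where absent), then divide the frequencies
Afreqs = A.value_counts().reindex(validmags, fill_value=0)
Bfreqs = B.value_counts().reindex(validmags, fill_value=0)
print(Bfreqs / Afreqs)  # NaN for 0/0, inf for n/0, 0.0 for 0/n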

Find all possible distances from two arrays

Given two sorted arrays A and B of length N, where each element is a natural number less than M, determine all possible distances for all combinations of elements from A and B. Here, if A[i] - B[j] < 0, the distance is M + (A[i] - B[j]).
Example :
A = {0,2,3}
B = {1,2}
M = 5
Distances = {0,1,2,3,4}
Note: I know an O(N^2) solution, but I need a solution faster than O(N^2) and O(N x M).
Edit: Array A, B, and Distances contain distinct elements.
You can get an O(M log M) complexity solution in the following way:
1. Prepare an array Ax of length M with Ax[i] = 1 if i belongs to A (and 0 otherwise).
2. Prepare an array Bx of length M with Bx[M-1-i] = 1 if i belongs to B (and 0 otherwise).
3. Use the Fast Fourier Transform to convolve these 2 sequences together.
4. Inspect the output array: non-zero values correspond to possible distances.
Note that the FFT is normally done with floating point numbers, so in step 4 you probably want to test if the output is greater than 0.5 to avoid potential rounding noise issues.
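A minimal NumPy sketch of these four steps; the function name and the 0.5 threshold are illustrative choices:

import numpy as np

def all_distances_fft(A, B, M):
    # Step 1: indicator vector of A
    Ax = np.zeros(M)
    Ax[list(A)] = 1
    # Step 2: indicator vector of B, reversed over [0, M-1]
    Bx = np.zeros(M)
    Bx[[M - 1 - b for b in B]] = 1
    # Step 3: full linear convolution via FFT (length 2M - 1)
    size = 2 * M - 1
    conv = np.fft.irfft(np.fft.rfft(Ax, size) * np.fft.rfft(Bx, size), size)
    # Step 4: entry k corresponds to the difference k - (M-1); reduce it mod M
    diffs = np.nonzero(conv > 0.5)[0] - (M - 1)
    return sorted({int(d) % M for d in diffs})

print(all_distances_fft([0, 2, 3], [1, 2], 5))  # -> [0, 1, 2, 3, 4]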
This is also possible with an optimized N*N approach.
Convert A into a 0/1 array with a 1 at every position that is present in A (over the range [0..M)).
Packing this array into 64-bit bitmasks shrinks it by a factor of 64, which lets you write results in blocks of 64 bits at a time.
The complexity is still N*N, but the running time drops dramatically. Given the limit of 50000 mentioned by the author for the sizes of A and B and for M, the expected operation count is N*N/64 ~= 4*10^7, which should run in about a second.
You can use bitvectors to accomplish this. Bitvector operations on large bitvectors are linear in the size of the bitvector, but they are fast, easy to implement, and may work well given your 50k size limit.
Initialize two bitvectors of length M. Call these vectA and vectAnswer. Set the bits of vectA that correspond to the elements in A. Leave vectAnswer with all zeroes.
Define a method to rotate a bitvector by k elements (rotate down). I'll call this rotate(vect,k).
Then, for every element b of B, vectAnswer = vectAnswer | rotate(vectA,b).
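A rough sketch of this idea, using Python's arbitrary-precision integers as the bitvectors (names are hypothetical):

def all_distances_bitvec(A, B, M):
    # vectA has bit a set for every a in A
    vectA = 0
    for a in A:
        vectA |= 1 << a
    mask = (1 << M) - 1

    def rotate(vect, k):
        # rotate an M-bit vector down by k: bit a moves to (a - k) mod M
        return ((vect >> k) | (vect << (M - k))) & mask

    vectAnswer = 0
    for b in B:
        vectAnswer |= rotate(vectA, b)
    return [d for d in range(M) if (vectAnswer >> d) & 1]

print(all_distances_bitvec([0, 2, 3], [1, 2], 5))  # -> [0, 1, 2, 3, 4]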

Minimum Complexity of two lists element summation comparison

I have a question about algorithm design with arrays, to be implemented in C.
Suppose we have an array of n elements, where for simplicity n is a power of 2 (1, 2, 4, 8, 16, etc.). I want to split it into two parts of n/2 elements each, where the condition for the split is the lowest absolute difference between the sums of the two parts. For example, the array (9,2,5,3,6,1,4,7) splits into (9,5,1,3) and (6,7,4,2): the sums are 18 and 19, the difference is 1, and these two arrays are the answer. Two arrays like (9,5,4,2) and (7,6,3,1) are not the answer, because their difference of sums is 3, and we have already found 1, so 3 isn't the minimum. How can this be solved?
Thank you.
This is the Partition Problem, which is unfortunately NP-Hard.
However, since your numbers are integers, if they are relatively low, there is a pseudo polynomial O(W*n^2) solution using Dynamic Programming (where W is sum of all elements).
The idea is to create the DP matrix of size (W/2+1)*(n+1)*(n/2+1), based on the following recursive formula:
D(0,i,0) = true
D(0,i,k) = false   (k != 0)
D(x,i,k) = false   (x < 0)
D(x,0,k) = false   (x > 0)
D(x,i,0) = false   (x > 0)
D(x,i,k) = D(x,i-1,k) OR D(x-arr[i], i-1, k-1)
The above gives a 3d matrix, where each entry D(x,i,k) says if there is a subset containing exactly k elements, that sums to x, and uses the first i elements as candidates.
Once you have this matrix, you just need to find the highest x (at most W/2) such that D(x,n,n/2) = true; the minimal difference is then W - 2x.
Later, you can get the relevant subset by going back on the table and "retracing" your choices at each step. This thread deals with how it is done on a very similar problem.
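For illustration, here is a compact Python sketch of this DP; it collapses the i dimension by iterating sums and counts downward so each element is used at most once (the function name is made up):

def balanced_partition(arr):
    # D[x][k] is True if some subset of exactly k elements sums to x
    n, total = len(arr), sum(arr)
    half = total // 2
    D = [[False] * (n // 2 + 1) for _ in range(half + 1)]
    D[0][0] = True
    for v in arr:
        for x in range(half, v - 1, -1):        # downward: use each element once
            for k in range(n // 2, 0, -1):
                if D[x - v][k - 1]:
                    D[x][k] = True
    # highest half-sum achievable with exactly n/2 elements
    best = max(x for x in range(half + 1) if D[x][n // 2])
    return best, total - 2 * best               # half-sum and minimal difference

print(balanced_partition([9, 2, 5, 3, 6, 1, 4, 7]))  # -> (18, 1)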
For small sets, there is also the alternative of a naive brute force solution, which basically tries all possible halves (n!/((n/2)!*(n/2)!) of them) and picks the best one out of them.
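And a hypothetical brute-force version for small n:

from itertools import combinations

def brute_force_partition(arr):
    # try every choice of n/2 indices for the first half: C(n, n/2) splits
    n, total = len(arr), sum(arr)
    best_diff, best_half = None, None
    for half in combinations(range(n), n // 2):
        s = sum(arr[i] for i in half)
        diff = abs(total - 2 * s)
        if best_diff is None or diff < best_diff:
            best_diff, best_half = diff, half
    return best_diff, [arr[i] for i in best_half]

print(brute_force_partition([9, 2, 5, 3, 6, 1, 4, 7]))  # -> (1, [9, 2, 5, 3])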

How do I check to see if two (or more) elements of an array/vector are the same?

For one of my homework problems, we had to write a function that creates an array containing n random numbers between 1 and 365. (Done). Then, check if any of these n birthdays are identical. Is there a shorter way to do this than doing several loops or several logical expressions?
Thank you!
CODE SO FAR, NOT DONE YET!!
function prob = bdayprob(N,n)
N = input('Please enter the number of experiments performed: N = ');
n = input('Please enter the sample size: n = ');
count = 0;
for i = 1:n
    x(i) = randi(365);
    if (x(i) == x)
        count = count + 1;
    end
end
return
If I'm interpreting your question properly, you want to check to see if generating n integers or days results in n unique numbers. Given your current knowledge in MATLAB, it's as simple as doing:
n = 30; %// Define sample size
N = 10; %// Define number of trials
%// Define logical array where each location tells you whether
%// birthdays were repeated for a trial
check = false(1, N);
%// For each trial...
for idx = 1 : N
    %// Generate sample size random numbers
    days = randi(365, n, 1);
    %// Check to see if the total number of unique birthdays
    %// are equal to the sample size
    check(idx) = numel(unique(days)) == n;
end
Whoa! Let's go through the code slowly, shall we? We first define the sample size and the number of trials. We then specify a logical array where each location tells you whether or not repeated birthdays were generated for that trial. Now, we start a loop where for each trial we generate n (the sample size) random numbers from 1 to 365. We then use unique to find all distinct integers among those generated. If all of the birthdays are unique, the total number of unique birthdays should equal the sample size; if it doesn't, we have repeats. For example, if we generated the sample [1 1 1 2 2], the output of unique would be [1 2], and the total number of unique elements is 2. Since this doesn't equal the sample size of 5, we know that the birthdays generated weren't all unique. However, if we had [1 3 4 6 7], unique would return the array unchanged, and since the output length matches the sample size, we know that all of the days are unique.
So, we check to see if this number is equal to the sample size for each iteration. If it is, then we output true. If not, we output false. When I run this code on my end, this is what I get for check. I set the sample size to 30 and the number of trials to be 10.
check =
0 0 1 1 0 0 0 0 1 0
Take note that if you increase the sample size, there is a higher probability that you will get duplicates, because randi can be considered as sampling with replacement. Therefore, the larger the sample size, the higher the chance of getting duplicate values. I made the sample size small on purpose so that we can see that it's possible to get unique days. However, if you set it to something like 100, or 200, you will most likely get check to be all false as there will most likely be duplicates per trial.
Here are some more approaches that avoid loops. Let
n = 20; %// define sample size
x = randi(365,n,1); %// generate n values between 1 and 365
Any of the following code snippets returns true (or 1) if there are two identical values in x, and false (or 0) otherwise:
Sort and then check if any two consecutive elements are the same:
result = any(diff(sort(x))==0);
Do all pairwise comparisons manually; remove self-pairs and duplicate pairs; and check if any of the remaining comparisons is true:
result = nnz(tril(bsxfun(@eq, x, x.'),-1))>0;
Compute the distance between distinct values, considering each pair just once, and then check if any distance is 0:
result = any(pdist(x(:))==0);
Find the number of occurrences of the most common value (mode):
[~, occurs] = mode(x);
result = occurs>1;
I don't know if I'm supposed to solve the problem for you, but perhaps a few hints may lead you in the right direction (besides, I'm not a MATLAB expert, so this will be in general terms):
Maybe not, but you have to ask yourself what they expect of you. The solution you propose requires looping through the array in two nested loops, which means n*(n-1)/2 passes through the inner loop (i.e. quadratic time complexity).
There are a number of ways you can improve the time complexity of the problem. The most straightforward would be a 365-element table where you keep track of whether a particular number has been seen yet, which requires only a single loop (i.e. linear time complexity), but perhaps that's not what they're looking for either, and maybe that solution is a little ad hoc. What we're basically looking for is a fast lookup of whether a particular number has been seen before; there exist more memory-efficient structures that allow lookup in O(1) or O(log n) time (if you know these, you have an arsenal of tools to use).
Then of course you could use the pigeonhole principle to provide the answer much faster in some special cases (remember that you were only asked to determine whether two or more numbers are equal or not).
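To make the table-based hint concrete, here is a minimal sketch (in Python rather than MATLAB, purely for illustration): a single pass over the sample with a 365-entry seen table, stopping at the first repeat:

import random

def has_duplicate_birthday(n):
    seen = [False] * 365          # lookup table: has day d been seen yet?
    for _ in range(n):
        d = random.randint(1, 365)
        if seen[d - 1]:
            return True           # first repeat found: one loop, linear time
        seen[d - 1] = True
    return False

print(has_duplicate_birthday(30))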

Randomize matrix elements between two values while keeping row and column sums fixed (MATLAB)

I have a bit of a technical issue, but I feel like it should be possible with MATLAB's powerful toolset.
What I have is a random n by n matrix of 0's and w's, say generated with
A=w*(rand(n,n)<p);
A typical value of w would be 3000, but that should not matter too much.
Now, this matrix has two important quantities, the vectors
c = sum(A,1);
r = sum(A,2)';
These are two row vectors, the first denotes the sum of each column and the second the sum of each row.
What I want to do next is randomize each value of w, for example between 0.5 and 2. This I would do as
rand_M = (2-0.5).*rand(n,n) + 0.5;
A_rand = rand_M.*A;
However, I don't want to just pick these random numbers: I want them to be such that for every column and row, the sums are still equal to the elements of c and r. So to clean up the notation a bit, say we define
A_rand_c = sum(A_rand,1);
A_rand_r = sum(A_rand,2)';
I want that for all j = 1:n, A_rand_c(j) = c(j) and A_rand_r(j) = r(j).
What I'm looking for is a way to redraw the elements of rand_M in a sort of algorithmic fashion I suppose, so that these demands are finally satisfied.
Now of course, unless I have infinite amounts of time this might not really happen. I therefore accept these quantities to fall into a specific range: A_rand_c(j) has to be an element of [(1-e)*c(j),(1+e)*c(j)] and A_rand_r(j) of [(1-e)*r(j),(1+e)*r(j)]. This e I define beforehand, say like 0.001 or something.
Would anyone be able to help me in the process of finding a way to do this? I've tried an approach where I just randomly repick the numbers, but this really isn't getting me anywhere. It does not have to be crazy efficient either, I just need it to work in finite time for networks of size, say, n = 50.
To be clear, the final output is the matrix A_rand that satisfies these constraints.
Edit:
Alright, so after thinking a bit, I suppose it might be doable with some while statement that goes through every element of the matrix. The difficult part is that there are four possibilities: if you are at a specific element A_rand(i,j), it could be that A_rand_c(j) and A_rand_r(i) are both too small, both too large, or one of each. The first two cases are good, because then you can just redraw the random number until it improves the situation. But the other two cases are problematic, as you will improve one constraint while worsening the other. I guess it would have to look at which criterion is less satisfied, so that it tries to fix the one that is worse. But this is not trivial, I would say.
You can take advantage of the fact that rows/columns with a single non-zero entry in A automatically give you results for that same entry in A_rand. If A(2,5) = w and it is the only non-zero entry in its column, then A_rand(2,5) = w as well. What else could it be?
You can alternate between finding these single-entry rows/cols, and assigning random numbers to entries where the value doesn't matter.
Here's a skeleton for the process:
A_rand=zeros(size(A)) is the matrix you are going to fill
entries_left = A>0 is a binary matrix showing which entries in A_rand you still need to fill
col_totals=sum(A,1) is the amount you still need to add in every column of A_rand
row_totals=sum(A,2) is the amount you still need to add in every row of A_rand
while sum( entries_left(:) ) > 0
    % STEP 1:
    % fill entries in A_rand wherever entries_left has a row/col with one nonzero entry;
    % keep looping over this step until nothing changes,
    % calling update() on A_rand, entries_left, row_totals, col_totals every time

    % STEP 2:
    % let (i,j) be the indices of the next non-zero entry in entries_left;
    % assign a random number to A_rand(i,j) that is <= col_totals(j) and <= row_totals(i),
    % then update() A_rand, entries_left, row_totals, col_totals
end

update()
    A_rand(i,j) = random_value;
    entries_left(i,j) = 0;
    col_totals(j) = col_totals(j) - random_value;
    row_totals(i) = row_totals(i) - random_value;
end
Picking the range for random_value might be a little tricky. The best I can think of is to draw it from a relatively narrow distribution centered around n*w*p, where p is the probability of an entry in A being nonzero (this is the average value of the row/column totals).
This doesn't scale well to large matrices as it will grow with n^2 complexity. I tested it for a 200 by 200 matrix and it worked in about 20 seconds.
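For illustration, here is a rough Python translation of the skeleton above; the helper names and the uniform draw in STEP 2 are my own choices, and the forced fills in STEP 1 are only approximately consistent, which is what the e-tolerance in the question is meant to absorb:

import numpy as np

def randomize_preserving_sums(A, rng=None):
    # alternate between forced fills (rows/cols with one entry left)
    # and free random assignments, as in the skeleton
    rng = np.random.default_rng() if rng is None else rng
    A_rand = np.zeros(A.shape)
    entries_left = A > 0
    col_totals = A.sum(axis=0).astype(float)
    row_totals = A.sum(axis=1).astype(float)

    def update(i, j, value):
        A_rand[i, j] = value
        entries_left[i, j] = False
        col_totals[j] -= value
        row_totals[i] -= value

    while entries_left.any():
        # STEP 1: fill entries forced by a row/column with one entry left
        changed = True
        while changed:
            changed = False
            for j in np.where(entries_left.sum(axis=0) == 1)[0]:
                i = np.where(entries_left[:, j])[0][0]
                update(i, j, col_totals[j])
                changed = True
            for i in np.where(entries_left.sum(axis=1) == 1)[0]:
                if entries_left[i, :].sum() == 1:   # may have changed above
                    j = np.where(entries_left[i, :])[0][0]
                    update(i, j, row_totals[i])
                    changed = True
        # STEP 2: assign a random admissible value to the next free entry
        remaining = np.argwhere(entries_left)
        if remaining.size:
            i, j = remaining[0]
            upper = max(min(row_totals[i], col_totals[j]), 0.0)
            update(i, j, rng.uniform(0, upper))
    return A_rand

A = 3000 * (np.random.rand(50, 50) < 0.1)  # example input as in the question
A_rand = randomize_preserving_sums(A)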
