How do I check to see if two (or more) elements of an array/vector are the same? - arrays

For one of my homework problems, we had to write a function that creates an array containing n random numbers between 1 and 365. (Done). Then, check if any of these n birthdays are identical. Is there a shorter way to do this than doing several loops or several logical expressions?
Thank you!
CODE SO FAR, NOT DONE YET!!
function = [prob] bdayprob(N,n)
N = input('Please enter the number of experiments performed: N = ');
n = input('Please enter the sample size: n = ');
count = 0;
for(i=1:n)
x(i) = randi(365);
if(x(i)== x)
count = count + 1
end
return

If I'm interpreting your question properly, you want to check to see if generating n integers or days results in n unique numbers. Given your current knowledge in MATLAB, it's as simple as doing:
n = 30; %// Define sample size
N = 10; %// Define number of trials
%// Define logical array where each location tells you whether
%// birthdays were repeated for a trial
check = false(1, N);
%// For each trial...
for idx = 1 : N
%// Generate sample size random numbers
days = randi(365, n, 1);
%// Check to see if the total number of unique birthdays
%// are equal to the sample size
check(idx) = numel(unique(days)) == n;
end
Woah! Let's go through the code slowly shall we? We first define the sample size and the number of trials. We then specify a logical array where each location tells you whether or not there were repeated birthdays generated for that trial. Now, we start with a loop where for each trial, we generate random numbers from 1 to 365 that is of n or sample size long. We then use unique and figure out all unique integers that were generated from this random generation. If all of the birthdays are unique, then the total number of unique birthdays generated should equal the sample size. If we don't, then we have repeats. For example, if we generated a sample of [1 1 1 2 2], the output of unique would be [1 2], and the total number of unique elements is 2. Since this doesn't equal 5 or the sample size, then we know that the birthdays generated weren't unique. However, if we had [1 3 4 6 7], unique would give the same output, and since the output length is the same as the sample size, we know that all of the days are unique.
So, we check to see if this number is equal to the sample size for each iteration. If it is, then we output true. If not, we output false. When I run this code on my end, this is what I get for check. I set the sample size to 30 and the number of trials to be 10.
check =
0 0 1 1 0 0 0 0 1 0
Take note that if you increase the sample size, there is a higher probability that you will get duplicates, because randi can be considered as sampling with replacement. Therefore, the larger the sample size, the higher the chance of getting duplicate values. I made the sample size small on purpose so that we can see that it's possible to get unique days. However, if you set it to something like 100, or 200, you will most likely get check to be all false as there will most likely be duplicates per trial.

Here are some more approaches that avoid loops. Let
n = 20; %// define sample size
x = randi(365,n,1); %// generate n values between 1 and 365
Any of the following code snippets returns true (or 1) if there are two identical values in x, and false (or 0) otherwise:
Sort and then check if any two consecutive elements are the same:
result = any(diff(sort(x))==0);
Do all pairwise comparisons manually; remove self-pairs and duplicate pairs; and check if any of the remaining comparisons is true:
result = nnz(tril(bsxfun(#eq, x, x.'),-1))>0;
Compute the distance between distinct values, considering each pair just once, and then check if any distance is 0:
result = any(pdist(x(:))==0);
Find the number of occurrences of the most common value (mode):
[~, occurs] = mode(x);
result = occurs>1;

I don't know if I'm supposed to solve the problem for you, but perhaps a few hints may lead you in the right direction (besides I'm not a matlab expert so it will be in general terms):
Maybe not, but you have to ask yourself what they expect of you. The solution you propose requires you to loop through the array in two nested loops which will mean n*(n-1)/2 times through the loop (ie quadratic time complexity).
There are a number of ways you can improve the time complexity of the problem. The most straightforward would be to have a 365 element table where you can keep track if a particular number has been seen yet - which would require only a single loop (ie linear time complexity), but perhaps that's not what they're looking for either. But maybe that solution is a little bit ad-hoc? What we're basically looking for is a fast lookup if a particular number has been seen before - there exists more memory efficient structures that allows look up in O(1) time and O(log n) time (if you know these you have an arsenal of tools to use).
Then of course you could use the pidgeonhole principle to provide the answer much faster in some special cases (remember that you only asked to determine whether two or more numbers are equal or not).

Related

Define a vector with random steps

I want to create an array that has incremental random steps, I've used this simple code.
t_inici=(0:10*rand:100);
The problem is that the random number keeps unchangable between steps. Is there any simple way to change the seed of the random number within each step?
If you have a set number of points, say nPts, then you could do the following
nPts = 10; % Could use 'randi' here for random number of points
lims = [0, 10] % Start and end points
x = rand(1, nPts); % Create random numbers
% Sort and scale x to fit your limits and be ordered
x = diff(lims) * ( sort(x) - min(x) ) / diff(minmax(x)) + lims(1)
This approach always includes your end point, which a 0:dx:10 approach would not necessarily.
If you had some maximum number of points, say nPtsMax, then you could do the following
nPtsMax = 1000; % Max number of points
lims = [0,10]; % Start and end points
% Could do 10* or any other multiplier as in your example in front of 'rand'
x = lims(1) + [0 cumsum(rand(1, nPtsMax))];
x(x > lims(2)) = []; % remove values above maximum limit
This approach may be slower, but is still fairly quick and better represents the behaviour in your question.
My first approach to this would be to generate N-2 samples, where N is the desired amount of samples randomly, sort them, and add the extrema:
N=50;
endpoint=100;
initpoint=0;
randsamples=sort(rand(1, N-2)*(endpoint-initpoint)+initpoint);
t_inici=[initpoint randsamples endpoint];
However not sure how "uniformly random" this is, as you are "faking" the last 2 data, to have the extrema included. This will somehow distort pure randomness (I think). If you are not necessarily interested on including the extrema, then just remove the last line and generate N points. That will make sure that they are indeed random (or as random as MATLAB can create them).
Here is an alternative solution with "uniformly random"
[initpoint,endpoint,coef]=deal(0,100,10);
t_inici(1)=initpoint;
while(t_inici(end)<endpoint)
t_inici(end+1)=t_inici(end)+rand()*coef;
end
t_inici(end)=[];
In my point of view, it fits your attempts well with unknown steps, start from 0, but not necessarily end at 100.
From your code it seems you want a uniformly random step that varies between each two entries. This implies that the number of entries that the vector will have is unknown in advance.
A way to do that is as follows. This is similar to Hunter Jiang's answer but adds entries in batches instead of one by one, in order to reduce the number of loop iterations.
Guess a number of required entries, n. Any value will do, but a large value will result in fewer iterations and will probably be more efficient.
Initiallize result to the first value.
Generate n entries and concatenate them to the (temporary) result.
See if the current entries are already too many.
If they are, cut as needed and output (final) result. Else go back to step 3.
Code:
lower_value = 0;
upper_value = 100;
step_scale = 10;
n = 5*(upper_value-lower_value)/step_scale*2; % STEP 1. The number 5 here is arbitrary.
% It's probably more efficient to err with too many than with too few
result = lower_value; % STEP 2
done = false;
while ~done
result = [result result(end)+cumsum(step_scale*rand(1,n))]; % STEP 3. Include
% n new entries
ind_final = find(result>upper_value,1)-1; % STEP 4. Index of first entry exceeding
% upper_value, if any
if ind_final % STEP 5. If non-empty, we're done
result = result(1:ind_final-1);
done = true;
end
end

Minimum Complexity of two lists element summation comparison

I have a question in algorithm design about arrays, which should be implement in C language.
Suppose that we have an array which has n elements. For simplicity n is power of '2' like 1, 2, 4, 8, 16 , etc. I want to separate this to 2 parts with (n/2) elements. Condition of separating is lowest absolute difference between sum of all elements in two arrays for example if I have this array (9,2,5,3,6,1,4,7) it will be separate to these arrays (9,5,1,3) and (6,7,4,2) . summation of first array's elements is 18 and the summation of second array's elements is 19 and the difference is 1 and these two arrays are the answer but two arrays like (9,5,4,2) and (7,6,3,1) isn't the answer because the difference of element summation is 4 and we have found 1 . so 4 isn't the minimum difference. How to solve this?
Thank you.
This is the Partition Problem, which is unfortunately NP-Hard.
However, since your numbers are integers, if they are relatively low, there is a pseudo polynomial O(W*n^2) solution using Dynamic Programming (where W is sum of all elements).
The idea is to create the DP matrix of size (W/2+1)*(n+1)*(n/2+1), based on the following recursive formula:
D(0,i,0) = true
D(0,i,k) = false k != 0
D(x,i,k) = false x < 0
D(x,0,k) = false x > 0
D(x,i,0) = false x > 0
D(x,i,k) = D(x,i-1,k) OR D(x-arr[i], i-1,k-1)
The above gives a 3d matrix, where each entry D(x,i,k) says if there is a subset containing exactly k elements, that sums to x, and uses the first i elements as candidates.
Once you have this matrix, you just need to find the highest x (that is smaller than SUM/2) such that D(x,n,n/2) = true
Later, you can get the relevant subset by going back on the table and "retracing" your choices at each step. This thread deals with how it is done on a very similar problem.
For small sets, there is also the alternative of a naive brute force solution, which basically splits the array to all possible halves ((2n)!/(n!*n!) of those), and picks the best one out of them.

Randomize matrix elements between two values while keeping row and column sums fixed (MATLAB)

I have a bit of a technical issue, but I feel like it should be possible with MATLAB's powerful toolset.
What I have is a random n by n matrix of 0's and w's, say generated with
A=w*(rand(n,n)<p);
A typical value of w would be 3000, but that should not matter too much.
Now, this matrix has two important quantities, the vectors
c = sum(A,1);
r = sum(A,2)';
These are two row vectors, the first denotes the sum of each column and the second the sum of each row.
What I want to do next is randomize each value of w, for example between 0.5 and 2. This I would do as
rand_M = (0.5-2).*rand(n,n) + 0.5
A_rand = rand_M.*A;
However, I don't want to just pick these random numbers: I want them to be such that for every column and row, the sums are still equal to the elements of c and r. So to clean up the notation a bit, say we define
A_rand_c = sum(A_rand,1);
A_rand_r = sum(A_rand,2)';
I want that for all j = 1:n, A_rand_c(j) = c(j) and A_rand_r(j) = r(j).
What I'm looking for is a way to redraw the elements of rand_M in a sort of algorithmic fashion I suppose, so that these demands are finally satisfied.
Now of course, unless I have infinite amounts of time this might not really happen. I therefore accept these quantities to fall into a specific range: A_rand_c(j) has to be an element of [(1-e)*c(j),(1+e)*c(j)] and A_rand_r(j) of [(1-e)*r(j),(1+e)*r(j)]. This e I define beforehand, say like 0.001 or something.
Would anyone be able to help me in the process of finding a way to do this? I've tried an approach where I just randomly repick the numbers, but this really isn't getting me anywhere. It does not have to be crazy efficient either, I just need it to work in finite time for networks of size, say, n = 50.
To be clear, the final output is the matrix A_rand that satisfies these constraints.
Edit:
Alright, so after thinking a bit I suppose it might be doable with some while statement, that goes through every element of the matrix. The difficult part is that there are four possibilities: if you are in a specific element A_rand(i,j), it could be that A_rand_c(j) and A_rand_r(i) are both too small, both too large, or opposite. The first two cases are good, because then you can just redraw the random number until it is smaller than the current value and improve the situation. But the other two cases are problematic, as you will improve one situation but not the other. I guess it would have to look at which criteria is less satisfied, so that it tries to fix the one that is worse. But this is not trivial I would say..
You can take advantage of the fact that rows/columns with a single non-zero entry in A automatically give you results for that same entry in A_rand. If A(2,5) = w and it is the only non-zero entry in its column, then A_rand(2,5) = w as well. What else could it be?
You can alternate between finding these single-entry rows/cols, and assigning random numbers to entries where the value doesn't matter.
Here's a skeleton for the process:
A_rand=zeros(size(A)) is the matrix you are going to fill
entries_left = A>0 is a binary matrix showing which entries in A_rand you still need to fill
col_totals=sum(A,1) is the amount you still need to add in every column of A_rand
row_totals=sum(A,2) is the amount you still need to add in every row of A_rand
while sum( entries_left(:) ) > 0
% STEP 1:
% function to fill entries in A_rand if entries_left has rows/cols with one nonzero entry
% you will need to keep looping over this function until nothing changes
% update() A_rand, entries_left, row_totals, col_totals every time you loop
% STEP 2:
% let (i,j) be the indeces of the next non-zero entry in entries_left
% assign a random number to A_rand(i,j) <= col_totals(j) and <= row_totals(i)
% update() A_rand, entries_left, row_totals, col_totals
end
update()
A_rand(i,j) = random_value;
entries_left(i,j) = 0;
col_totals(j) = col_totals(j) - random_value;
row_totals(i) = row_totals(i) - random_value;
end
Picking the range for random_value might be a little tricky. The best I can think of is to draw it from a relatively narrow distribution centered around N*w*p where p is the probability of an entry in A being nonzero (this would be the average value of row/column totals).
This doesn't scale well to large matrices as it will grow with n^2 complexity. I tested it for a 200 by 200 matrix and it worked in about 20 seconds.

Brute force implementation for 0-1 Knapsack

I'm struggling with the given task for almost a week without success of finding solution so this site is my last hope.
I have 0-1 Knapsack problem which has 20 items with different values and weights, maximum weight of sack is 524. Now i need to implement brute force to find optimal solution subset of 20 items so that total weights <= 524 and maximum values of chosen items.
Could you please point me out or better give detailed implementation to analyze how it work!!
Thank you very much
The brute-force idea is easy:
Generate all possible subsets of your 20 items, saving only those which satisfy your weight constraint. If you want to be fancy, you can even only consider subsets to which you cannot add anything else without violating the weight constraint, since only these can possibly be the right answer. O(2^n)
Find the subset with maximum weight. linear in terms of the number of candidates, and since we have O(2^n) candidates, this is O(2^n).
Please comment if you'd like some pseudocode.
EDIT: What the hey, here's the pseudocode just in case.
GetCandidateSubsets(items[1..N], buffer, maxw)
1. addedSomething = false
2. for i = 1 to N do
3. if not buffer.contains(item[i]) and
weight(buffer) + weight(items[i]) <= maxw then
4. add items[i] to buffer
5. GetCandidateSubsets(items[1..N], buffer)
6. remove items[i] from buffer
7. addedSomething = true
8. if not addedSomething then
9. emit & store buffer
Note that the GetCandidateSubsets function is not very efficient, even for a brute force implementation. Thanks to amit for pointing that out. You could rework this to only walk the combinations, rather than the permutations, of the item set, as a first-pass optimization.
GetMaximalCandidate(candidates[1..M])
1. if M = 0 then return Null
2. else then
3. maxel = candidates[1]
4. for i = 2 to M do
5. if weight(candidates[i]) > weight(maxel) then
6. maxel = candidates[i]
7. return maxel

Why is the average number of steps for finding an item in an array N/2?

Could somebody explain why the average number of steps for finding an item in an unsorted array data-structure is N/2?
This really depends what you know about the numbers in the array. If they're all drawn from a distribution where all the probability mass is on a single value, then on expectation it will take you exactly 1 step to find the value you're looking for, since every value is the same, for example.
Let's now make a pretty strong assumption, that the array is filled with a random permutation of distinct values. You can think of this as picking some arbitrary sorted list of distinct elements and then randomly permuting it. In this case, suppose you're searching for some element in the array that actually exists (this proof breaks down if the element is not present). Then the number of steps you need to take is given by X, where X is the position of the element in the array. The average number of steps is then E[X], which is given by
E[X] = 1 Pr[X = 1] + 2 Pr[X = 2] + ... + n Pr[X = n]
Since we're assuming all the elements are drawn from a random permutation,
Pr[X = 1] = Pr[X = 2] = ... = Pr[X = n] = 1/n
So this expression is given by
E[X] = sum (i = 1 to n) i / n = (1 / n) sum (i = 1 to n) i = (1 / n) (n)(n + 1) / 2
= (n + 1) / 2
Which, I think, is the answer you're looking for.
The question as stated is just wrong. Linear search may perform better.
Perhaps a simpler example that shows why the average is N/2 is this:
Assume you have an unsorted array of 10 items: [5, 0, 9, 8, 1, 2, 7, 3, 4, 6]. This is all the digits [0..9].
Since the array is unsorted (i.e. you know nothing about the order of the items), the only way you can find a particular item in the array is by doing a linear search: start at the first item and go until you find what you're looking for, or you reach the end.
So let's count how many operations it takes to find each item. Finding the first item (5) takes only one operation. Finding the second item (0) takes two. Finding the last item (6) takes 10 operations. The total number of operations required to find all 10 items is 1+2+3+4+5+6+7+8+9+10, or 55. The average is 55/10, or 5.5.
The "linear search takes, on average, N/2 steps" conventional wisdom makes a number of assumptions. The two biggest are:
The item you're looking for is in the array. If an item isn't in the array, then it takes N steps to determine that. So if you're often looking for items that aren't there, then your average number of steps per search is going to be much higher than N/2.
On average, each item is searched for approximately as often as any other item. That is, you search for "6" as often as you search for "0", etc. If some items are looked up significantly more often than others, then the average number of steps per search is going to be skewed in favor of the items that are searched for more frequently. The number will be higher or lower than N/2, depending on the positions of the most frequently looked-up items.
While I think templatetypedef has the most instructive answer, in this case there is a much simpler one.
Consider permutations of the set {x1, x2, ..., xn} where n = 2m. Now take some element xi you wish to locate. For each permutation where xi occurs at index m - k, there is a corresponding mirror image permutation where xi occurs at index m + k. The mean of these possible indices is just [(m - k) + (m + k)]/2 = m = n/2. Therefore the mean of all all possible permutations of the set is n/2.
Consider a simple reformulation of the question:
What would be the limit of
lim (i->inf) of (sum(from 1 to i of random(n)) /i)
Or in C:
int sum = 0, i;
for (i = 0; i < LARGE_NUM; i++) sum += random(n);
sum /= LARGE_NUM;
If we assume that our random have even distribution of values (each value from 1 to n is equally likely to be produced), then the expected result would be (1+n)/2.

Resources