Efficient algorithm for merging consistent rows of multiple 2d arrays - arrays

I have an arbitrary number of 2d arrays of equal width but possibly non-equal height. They can consist of 1s, 0s, or a wildcard * which can match either a 1 or a 0. Within each array, the wildcards are always in the same columns. I want to return all possible rows that are consistent with at least one row in every array simultaneously, and contain no wildcards.
For a concrete example, consider the three 2d arrays
a = 1 0 1 *     b = * 0 1 0     c = * 1 * 1
    1 1 1 *         * 1 1 0         * 0 * 0
    0 1 0 *         * 0 0 1
A possible row in the solution might be 1 0 1 0. It is consistent with the top row of a, the top row of b, and the second row of c. By contrast, a row that should not be in the solution is 0 1 0 1, since it is not consistent with any row of b despite being consistent with the bottom row of a and the top row of c.
Beyond doing an inefficient brute-force check I'm rather stuck. It seems like there should be a faster way. Are there any tricks that might help solve this problem efficiently?
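One possible direction, sketched in Python and not taken from the question itself: expand each array's rows into the set of concrete rows they match, then intersect those sets across arrays. The sketch assumes each array is a list of strings over '0', '1' and '*', and the names expand and consistent_rows are hypothetical. It avoids testing every one of the 2^width candidate rows, but it is still exponential in the number of wildcards per row.

```python
from itertools import product


def expand(row):
    """Return every concrete row (as a string) matched by a pattern like '101*'."""
    # A wildcard contributes both bit values; a fixed cell contributes only itself.
    choices = [("0", "1") if cell == "*" else (cell,) for cell in row]
    return {"".join(bits) for bits in product(*choices)}


def consistent_rows(arrays):
    """Rows with no wildcards that match at least one row in every array."""
    candidates = None
    for array in arrays:
        matched = set()
        for row in array:
            matched |= expand(row)
        # Keep only rows that every array seen so far can match.
        candidates = matched if candidates is None else candidates & matched
    return sorted(candidates or [])


# The three arrays from the question (assumed string encoding).
a = ["101*", "111*", "010*"]
b = ["*010", "*110", "*001"]
c = ["*1*1", "*0*0"]
print(consistent_rows([a, b, c]))
```

For the example arrays this yields only the row 1 0 1 0, which agrees with the row the question names as a valid solution.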

Related

How to delete rows from a matrix that contain more than 50% zeros MATLAB

I want to remove the rows in an array that contain more than 50% of null elements.
eg:
if the input is
1 0 0 0 5 0
2 3 5 4 3 1
3 0 0 4 3 0
2 0 9 8 2 1
0 0 4 0 1 0
I want to remove rows 1 and 5, but retain the rest. The output should look like:
2 3 5 4 3 1
3 0 0 4 3 0
2 0 9 8 2 1
I want to do this using matlab
Use logical indexing into the rows, based on the row-wise mean of A==0 (that is, the fraction of zeros in each row):
t = .5; % threshold
A(mean(A==0,2) > t, :) = [];
What this does:
Compare A with 0: turns zeros into true, and nonzeros into false.
Compute the mean of each row.
Compare that to the desired threshold.
Use the result as a logical index to delete unwanted rows.
Equivalently, you can keep the wanted rows instead of removing the unwanted ones. This may be faster depending on the proportion of rows being removed:
A = A(mean(A~=0,2) >= 1-t, :);
You can also use the standardizeMissing function and rmmissing function together to achieve this:
>> [~,tf] = rmmissing(standardizeMissing(A,0),'MinNumMissing',floor(0.5*size(A,2))+1);
>> A(~tf,:)
The call to standardizeMissing replaces the 0 values with NaN (the standard missing indicator for double), then the rmmissing call identifies in the logical vector tf the rows that have more than 50% of their entries as 0 (i.e., those rows that have at least floor(0.5*size(A,2))+1 0-valued entries). You can then negate the tf output and use it as an index. You can easily adapt the minimum number missing to satisfy whatever percentage criterion you want.
Also note that tf is a logical vector here that is only the size of the number of rows of A.
As I mentioned on Luis' answer, one downside to his approach is that it requires an intermediate logical array of the same size as A to be created, which can potentially incur a significant memory/performance penalty when working with large arrays.
An explicit looped approach with nnz (overly verbose, for clarity):
[nrows, ncols] = size(A);
maximum_ratio_of_zeros = 0.5;
minimum_ratio_of_nonzeros = 1 - maximum_ratio_of_zeros;
todelete = false(nrows, 1);
for ii = 1:nrows
    if nnz(A(ii,:))/ncols < minimum_ratio_of_nonzeros
        todelete(ii) = true;
    end
end
A(todelete,:) = [];
Which returns the desired answer.

Generating a matrix to describe a two-dimensional feature

Let's say I have a vector A = [-1,2];
Each element in A is described by the actual number and sign. So each element has a 2 dimensional feature-set.
I would like to generate a matrix, in this case 2x2 where the columns correspond to the element, and rows correspond to the presence of a feature. The presence of a feature is described by 1's and 0's. So, if an element is positive, it is 1, if the element is the number 1, then the result is 1 as well. In the case above I would get:
                              Element 1   Element 2
Is this a 1?                      1           0
Is this a positive number?        0           1
What is the smartest way to go about accomplishing this? Obviously if statements would work, but I feel that there should be a faster, much smarter way of going about this. I am coding this in matlab by the way, and I would appreciate any help.
@Benoit_11's solution is a fine one. Here's a similar but maybe simpler solution. You could try both and see which is faster if you care about speed.
features = [abs(A) == 1; A > 0];
This assumes A is a row vector in order to get the output in the format you specified.
Simple way using ismember for the first condition and a logical operation for the 2nd condition. ismember outputs a logical array which you can plug into the output you need (here called DescribeA), and likewise when you check for values greater than 0 using the > operator.
%// Test array
A = [-1,2,1,-10,5,-3,1]
%// Initialize output
DescribeA = zeros(2,numel(A));
%// 1st condition. Check if values are 1 or -1
DescribeA(1,:) = ismember(A,1)|ismember(A,-1);
%// Check if they are > 0
DescribeA(2,:) = A>0;
Output in Command Window:
A =
-1 2 1 -10 5 -3 1
DescribeA =
1 0 1 0 0 0 1
0 1 1 0 1 0 1
I feel there is a smarter way for the 1st condition but I can't seem to find it.

graph representing the randomization of each column in a binary matrix

Imagine the following binary image exemplified by the matrix below. This is a simplified version of the images I'll be working with:
0 1 0 1
0 1 1 1
0 0 0 1
0 1 1 1
I want to construct a graph that will represent the randomness of each column. My thought is to develop a random index = the total number of transitions between values in the column divided by the total possible transitions. In the matrix above, each column has 3 possible transitions.
For the example above:
Column 1 would have a random index of 0% (0/3)
Column 2 would have a random index of 66.7% (2/3)
Column 3 = 100% (3/3)
Column 4 = 0% (0/3) even though they are 1's and not 0's. Doesn't matter, I just want the transitions.
Can I draw a boundary around all the 1 values and then have MATLAB sum all of the boundaries?
To calculate what you are suggesting you can just do:
sum( diff(A) ~= 0 )
The diff(A) will take the forward difference down the columns and the sum will count the number of non-zero changes. So if you do this you will get:
ans =
0 2 3 0
Let your image be defined as
im = [ 0 1 0 1
0 1 1 1
0 0 0 1
0 1 1 1 ];
The random index you want can be computed as
result = sum(diff(im)~=0) / (size(im,1)-1);
Explanation: diff computes the difference between consecutive elements down each column. The result is compared against zero (~=0), and all nonzero values within each column are added (with sum). Finally, the result is divided by the maximum number of transitions, which is the number of rows minus 1 (size(im,1)-1).
Equivalently, you could use xor between consecutive rows:
result = sum(xor(im(1:end-1,:), im(2:end,:))) / (size(im,1)-1)

Finding row with maximum no. of 1s if each row is sorted using logicalOR approach

Question similar to this may have been discussed before but I want to discuss a different approach to this.
Given a boolean 2D array where each row is sorted, find the row with the maximum number of 1s.
Input Matrix :
0 1 1 1
0 0 1 1
1 1 1 1
0 0 0 0
Output : 2
How about this approach: take the logical OR of column 0 across all rows and, if the answer is 1, return the index of the row that has the 1 and stop. In this case (0 | 0 | 1 | 0) gives 1, so that row index is returned. If the input matrix is something like:
Input matrix:
0 1 1 1
0 0 1 1
0 0 0 1
0 0 0 0
Output : 0
When I take the logical OR of column 0 across the rows, the answer is zero, so I move on to column 1, and the procedure continues until the logical OR is 1. I know other approaches to solve this problem, but I would like views on this approach.
If it's:
0 ... 0 1
0 ... 0 0
0 ... 0 0
0 ... 0 0
0 ... 0 0
You'd have to search many columns.
The maximum amount of work involved would be linear in the number of cells (O(mn)), and the other approaches outperform this here.
Specifically, the approach where:
    You start at the top right and
    Repeatedly:
        Search left until you find a 0 and
        Search down until you find a 1
    And return the last row where you found a 1
is linear in the number of rows plus columns (O(m + n)).
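The staircase walk described above could look roughly like this in Python. This is a sketch, assuming the matrix is a list of equal-length lists of 0s and 1s with each row's 0s before its 1s; the function name row_with_most_ones and the -1 return value for "no 1s anywhere" are illustrative choices, not from the thread.

```python
def row_with_most_ones(matrix):
    """Return the index of the first row with the most 1s, or -1 if there are none."""
    if not matrix or not matrix[0]:
        return -1
    best_row = -1
    col = len(matrix[0]) - 1  # start at the top-right corner
    for r, row in enumerate(matrix):
        # Search left while this row still has a 1 at the current column.
        while col >= 0 and row[col] == 1:
            col -= 1
            best_row = r  # this row has more 1s than any row seen before it
        # Otherwise fall through to the next row (search down).
    return best_row


print(row_with_most_ones([[0, 1, 1, 1],
                          [0, 0, 1, 1],
                          [1, 1, 1, 1],
                          [0, 0, 0, 0]]))  # prints 2
```

The column index only ever decreases and the row index only ever increases, which is where the O(m + n) bound comes from.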
That would work since it's equivalent to finding the row for which the leftmost 1 is before (or at the same point as) any other row's leftmost 1. It would still be O(m * n) in the worst case:
Input Matrix :
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 1
Given that your rows are sorted, I would binary search for the position of the first 1 in each row, and return the row with the minimum position. This would be O(m log n), although you might be able to do better.
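A minimal sketch of that binary-search idea in Python, assuming each row is a list of 0s and 1s sorted in ascending order. It uses the standard-library bisect_left to locate the first 1; the function name and the -1 result for a matrix without any 1s are hypothetical.

```python
from bisect import bisect_left


def row_with_most_ones_bisect(matrix):
    """Return the index of the row whose first 1 is leftmost, or -1 if no row has a 1."""
    best_row = -1
    best_pos = None
    for r, row in enumerate(matrix):
        pos = bisect_left(row, 1)  # index of the first 1, or len(row) if there is none
        if pos < len(row) and (best_pos is None or pos < best_pos):
            best_row, best_pos = r, pos
    return best_row


print(row_with_most_ones_bisect([[0, 1, 1, 1],
                                 [0, 0, 1, 1],
                                 [1, 1, 1, 1],
                                 [0, 0, 0, 0]]))  # prints 2
```

Each row costs one binary search over n columns, giving the O(m log n) total mentioned above.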
Your approach is likely to be orders of magnitude slower than the naive "go through the rows, and count the zeros, and remember the row with the fewest zeros." The reason is that, assuming your bits are stored one-row-at-a-time, with the bools packed tightly, then memory for the row will be in cache all at once, and bit-counting will cache beautifully.
Contrast this to your proposed approach, where for each row, the cache line will be loaded, and a single bit will be read from it. By the time you've cycled through all the rows in your array, the memory for the first row will (probably, if you've got any reasonable number of rows), be out of the cache, and the row will have to be loaded again.
Approximately, assuming a 64B cache line, the first approach is going to need 1/(64*8) memory accesses per bit in the array, compared to 1 memory access per bit in the array for yours. Since counting the bits and remembering the max is just a few cycles, it's reasonable to think that the memory accesses are going to dominate the running cost, which means the first approach will run approximately 64 * 8 = 512 times faster. Of course, you'll get some of that time back because your approach can terminate early, but the 512 times speed hit is a large cost to overcome.
If your rows are super-long, you may find that a hybrid between these two approaches works excellently: count the number of bits in the first cache-line's worth of data in each row (being careful to cache-line-align each row of your data in memory), and if every row has no bits set in the first cache-line, go to the second and so forth. This combines the cache-efficiency of the first approach with the early termination of the second approach.
As with all optimisations, you should measure results, and be sure that it's important that the code is fast. The efficient solution is likely to impose annoying restrictions (like 64-byte memory alignment for rows), and the code will be harder to read than a straightforward solution.

Look at each row separately in a matrix (Matlab)

I have a matrix in Matlab(2012) with 3 columns and X number of rows, X is defined by the user, so varies each time. For this example though I will use a fixed 5x3 matrix.
So I would like to perform an iterative function on each row within the matrix, while the value in the third column is below a certain value. Then store the new values within the same matrix, so overwrite the original values.
The code below is a simplified version of the problem.
M=[-2 -5 -3 -2 -4]; %Vector containing random values
Vf_X=M+1; %Defining the first column of the matrix
Vf_Y=M+2; %Defining the second column of the matrix
Vf_Z=M; %Defining the third column of the matrix
Vf=[Vf_X',Vf_Y',Vf_Z']; %Creating the matrix
while Vf(:,3)<0
Vf=Vf+1;
end
disp(Vf)
The result I get is
1 2 0
-2 -1 -3
0 1 -1
1 2 0
-1 0 -2
Ideally I would like to get this result instead
1 2 0
1 2 0
1 2 0
1 2 0
1 2 0
The while will not start if any value is above zero to begin with and stops as soon as one value goes above zero.
I hope this makes sense and I have supplied enough information
Thank you for your time and help.
Your current problem is that you stop iterating the very moment any of the values in the third column breaks the condition. Correct me if I'm wrong, but what I think you want is to continue iterating on the remaining rows until every row's third-column value breaks the condition.
You could do that like this:
inds = Vf(:,3) < 0;                  % rows that still need updating
while any(inds)
    Vf(inds,:) = Vf(inds,:)+1;       % update only those rows
    inds = Vf(:,3) < 0;              % re-check the condition
end
Of course, for the simple addition you provide, there is a better and faster way:
inds = Vf(:,3)<0;
Vf(inds,:) = bsxfun(@minus, Vf(inds,:), Vf(inds,3));
But for general functions, the while above will do the trick.

Resources