How to select the values greater than the mean in an array? - arrays

I want to apply feature selection on a dataset (lung.mat)
After loading the data, I computed the mean of distances between each feature with others by Jaccard measure. Then I sorted the distances descendingly in B1. And then I selected for example 25 number of all the features and saved the matrix in databs1.
I want to select the features that have distance values greater than the mean of the array (B1).
close all;
clc
load lung.mat
data=lung;
[n,m]=size(data);
for i=1:m-1
for j=i+1:m
t1(i,j)=fjaccard(data(:,i),data(:,j));
b1=sum(t1)/(m-1);
end
end
[B1,indB1]=sort(b1,'descend');
databs1=data(:,indB1(1:25));
databs1=[databs1,data(:,m)]; %jaccard
save('databs1.mat');
I’ll be grateful to have your opinions about how to define this in B1, selecting values of B1 which are greater than the mean of the array B1, It means cutting the rest of smaller values than the mean of B1.
I used this line,
B1(B1>mean(B1(:)))
after running, B1 still has the full number of features(column) equal to the full dataset, for example, lung.mat has 57 features and B1 by this line still has 57 columns,
I considered that by this line B1 will be cut to the number of features that are greater than the mean of B1.

the general answer to your question is here (this seems clear to you based on your code):
a=randi(10,1,10) %example data
a>mean(a) %get binary matrix of which elements are larger than mean
a(a>mean(a)) %select elements from a that are larger than mean
a =
1 9 10 7 8 8 4 7 2 8
ans =
1×10 logical array
0 1 1 1 1 1 0 1 0 1
ans =
9 10 7 8 8 7 8

Related

How should I selectively sum multiple axes of an array?

What is the preferred approach in J for selectively summing multiple axes of an array?
For instance, suppose that a is the following rank 3 array:
]a =: i. 2 3 4
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
16 17 18 19
20 21 22 23
My goal is to define a dyad "sumAxes" to sum over multiple axes of my choosing:
0 1 sumAxes a NB. 0+4+8+12+16+20 ...
60 66 72 78
0 2 sumAxes a NB. 0+1+2+3+12+13+14+15 ...
60 92 124
1 2 sumAxes a NB. 0+1+2+3+4+5+6+7+8+9+10+11 ...
66 210
The way that I am currently trying to implement this verb is to use the dyad |: to first permute the axes of a, and then ravel the items of the necessary rank using ,"n (where n is the number axes I want to sum over) before summing the resulting items:
sumAxes =: dyad : '(+/ # ,"(#x)) x |: y'
This appears to work as I want, but as a beginner in J I am unsure if I am overlooking some aspect of rank or particular verbs that would enable a cleaner definition. More generally I wonder whether permuting axes, ravelling and summing is idiomatic or efficient in this language.
For context, most of my previous experience with array programming is with Python's NumPy library.
NumPy does not have J's concept of rank and instead expects the user to explicitly label the axes of an array to reduce over:
>>> import numpy
>>> a = numpy.arange(2*3*4).reshape(2, 3, 4) # a =: i. 2 3 4
>>> a.sum(axis=(0, 2)) # sum over specified axes
array([ 60, 92, 124])
As a footnote, my current implementation of sumAxes has the disadvantage of working "incorrectly" compared to NumPy when just a single axis is specified (as rank is not interchangeable with "axis").
Motivation
J has incredible facilities for handling arbitrarily-ranked arrays. But there's one facet of the language which is simultaneously almost universally useful as well as justified, but also somewhat antithetical to this dimensionality-agnostic nature.
The major axis (in fact, leading axes in general) are implicitly privileged. This is the concept that underlies, e.g. # being the count of items (i.e. the dimension of the first axis), the understated elegance and generality of +/ without further modification, and a host of other beautiful parts of the language.
But it's also what accounts for the obstacles you're meeting in trying to solve this problem.
Standard approach
So the general approach to solving the problem is just as you have it: transpose or otherwise rearrange the data so the axes that interest you become leading axes. Your approach is classic and unimpeachable. You can use it in good conscience.
Alternative approaches
But, like you, it niggles me a bit that we are forced to jump through such hoops in similar circumstances. One clue that we're kind of working against the grain of the language is the dynamic argument to the conjunction "(#x); usually arguments to conjunctions are fixed, and calculating them at runtime often forces us to use either explicit code (as in your example) or dramatically more complicated code. When the language makes something hard to do, it's usually a sign you're cutting against the grain.
Another is that ravel (,). It's not just that we want to transpose some axes; it's that we want to focus on one specific axis, and then run all the elements trailing it into a flat vector. Though I actually think this reflects more a constraint imposed by how we're framing the problem, rather than one in the notation. More on in the final section of this post.
With that, we might feel justified in our desire to address a non-leading axis directly. And, here and there, J provides primitives that allow us to do exactly that, which might be a hint that the language's designers also felt the need to include certain exceptions to the primacy of leading axes.
Introductory examples
For example, dyadic |. (rotate) has ranks 1 _, i.e. it takes a vector on the left.
This is sometimes surprising to people who have been using it for years, never having passed more than a scalar on the left. That, along with the unbound right rank, is another subtle consequence of J's leading-axis bias: we think of the right argument as a vector of items, and the left argument as a simple, scalar rotation value of that vector.
Thus:
3 |. 1 2 3 4 5 6
4 5 6 1 2 3
and
1 |. 1 2 , 3 4 ,: 5 6
3 4
5 6
1 2
But in this latter case, what if we didn't want to treat the table as a vector of rows, but as a vector of columns?
Of course, the classic approach is to use rank, to explicitly denote the the axis we're interested in (because leaving it implicit always selects the leading axis):
1 |."1 ] 1 2 , 3 4 ,: 5 6
2 1
4 3
6 5
Now, this is perfectly idiomatic, standard, and ubiquitous in J code: J encourages us to think in terms of rank. No one would blink an eye on reading this code.
But, as described at the outset, in another sense it can feel like a cop-out, or manual adjustment. Especially when we want to dynamically choose the rank at runtime. Notationally, we are now no longer addressing the array as a whole, but addressing each row.
And this is where the left rank of |. comes in: it's one of those few primitives which can address non-leading axes directly.
0 1 |. 1 2 , 3 4 ,: 5 6
2 1
4 3
6 5
Look ma, no rank! Of course, we now have to specify a rotation value for each axis independently, but that's not only ok, it's useful, because now that left argument smells much more like something which can be calculated from the input, in true J spirit.
Summing non-leading axes directly
So, now that we know J lets us address non-leading axes in certain cases, we simply have to survey those cases and identify one which seems fit for our purpose here.
The primitive I've found most generally useful for non-leading-axis work is ;. with a boxed left-hand argument. So my instinct is to reach for that first.
Let's start with your examples, slightly modified to see what we're summing.
]a =: i. 2 3 4
sumAxes =: dyad : '(< # ,"(#x)) x |: y'
0 1 sumAxes a
+--------------+--------------+---------------+---------------+
|0 4 8 12 16 20|1 5 9 13 17 21|2 6 10 14 18 22|3 7 11 15 19 23|
+--------------+--------------+---------------+---------------+
0 2 sumAxes a
+-------------------+-------------------+---------------------+
|0 1 2 3 12 13 14 15|4 5 6 7 16 17 18 19|8 9 10 11 20 21 22 23|
+-------------------+-------------------+---------------------+
1 2 sumAxes a
+-------------------------+-----------------------------------+
|0 1 2 3 4 5 6 7 8 9 10 11|12 13 14 15 16 17 18 19 20 21 22 23|
+-------------------------+-----------------------------------+
The relevant part of the definition of for dyads derived from ;.1 and friends is:
The frets in the dyadic cases 1, _1, 2 , and _2 are determined by the 1s in boolean vector x; an empty vector x and non-zero #y indicates the entire of y. If x is the atom 0 or 1 it is treated as (#y)#x. In general, boolean vector >j{x specifies how axis j is to be cut, with an atom treated as (j{$y)#>j{x.
What this means is: if we're just trying to slice an array along its dimensions with no internal segmentation, we can simply use dyad cut with a left argument consisting solely of 1s and a:s. The number of 1s in the vector (ie. the sum) determines the rank of the resulting array.
Thus, to reproduce the examples above:
('';'';1) <#:,;.1 a
+--------------+--------------+---------------+---------------+
|0 4 8 12 16 20|1 5 9 13 17 21|2 6 10 14 18 22|3 7 11 15 19 23|
+--------------+--------------+---------------+---------------+
('';1;'') <#:,;.1 a
+-------------------+-------------------+---------------------+
|0 1 2 3 12 13 14 15|4 5 6 7 16 17 18 19|8 9 10 11 20 21 22 23|
+-------------------+-------------------+---------------------+
(1;'';'') <#:,;.1 a
+-------------------------+-----------------------------------+
|0 1 2 3 4 5 6 7 8 9 10 11|12 13 14 15 16 17 18 19 20 21 22 23|
+-------------------------+-----------------------------------+
Et voila. Also, notice the pattern in the left hand argument? The two aces are exactly at the indices of your original calls to sumAxe. See what I mean by the fact that providing a value for each dimension smelling like a good thing, in the J spirit?
So, to use this approach to provide an analog to sumAxe with the same interface:
sax =: dyad : 'y +/#:,;.1~ (1;a:#~r-1) |.~ - {. x -.~ i. r=.#$y' NB. Explicit
sax =: ] +/#:,;.1~ ( (] (-#{.#] |. 1 ; a: #~ <:#[) (-.~ i.) ) ##$) NB. Tacit
Results elided for brevity, but they're identical to your sumAxe.
Final considerations
There's one more thing I'd like to point out. The interface to your sumAxe call, calqued from Python, names the two axes you'd like "run together". That's definitely one way of looking at it.
Another way of looking at it, which draws upon the J philosophies I've touched on here, is to name the axis you want to sum along. The fact that this is our actual focus is confirmed by the fact that we ravel each "slice", because we do not care about its shape, only its values.
This change in perspective to talk about the thing you're interested in, has the advantage that it is always a single thing, and this singularity permits certain simplifications in our code (again, especially in J, where we usually talk about the [new, i.e. post-transpose] leading axis)¹.
Let's look again at our ones-and-aces vector arguments to ;., to illustrate what I mean:
('';'';1) <#:,;.1 a
('';1;'') <#:,;.1 a
(1;'';'') <#:,;.1 a
Now consider the three parenthesized arguments as a single matrix of three rows. What stands out to you? To me, it's the ones along the anti-diagonal. They are less numerous, and have values; by contrast the aces form the "background" of the matrix (the zeros). The ones are the true content.
Which is in contrast to how our sumAxe interface stands now: it asks us to specify the aces (zeros). How about instead we specify the 1, i.e. the axis that actually interests us?
If we do that, we can rewrite our functions thus:
xas =: dyad : 'y +/#:,;.1~ (-x) |. 1 ; a: #~ _1 + #$y' NB. Explicit
xas =: ] +/#:,;.1~ -#[ |. 1 ; a: #~ <:###$#] NB. Tacit
And instead of calling 0 1 sax a, you'd call 2 xas a, instead of 0 2 sax a, you'd call 1 xas a, etc.
The relative simplicity of these two verbs suggests J agrees with this inversion of focus.
¹ In this code I'm assuming you always want to collapse all axes except 1. This assumption is encoded in the approach I use to generate the ones-and-aces vector, using |..
However, your footnote sumAxes has the disadvantage of working "incorrectly" compared to NumPy when just a single axis is specified suggests sometimes you want to only collapse one axis.
That's perfectly possible and the ;. approach can take arbitrary (orthotopic) slices; we'd only need to alter the method by which we instruct it (generate the 1s-and-aces vector). If you provide a couple examples of generalizations you'd like, I'll update the post here. Probably just a matter of using (<1) x} a: #~ #$y or ((1;'') {~ (e.~ i.###$)) instead of (-x) |. 1 ; a:#~<:#$y.

Merge multiple arrays of unique occurrences

I want to merge multiple arrays of unique occurrences to a single array. To get the arrays in the first place I use this code, where image series is a slice from a tiff image imported using imread:
a = unique(img_series);
occu = [a,histc(img_series(:),a)];
I do that multiple times, because the tiff image I'm using has multiple hundred images stacked, which my RAM will not support to import at once. So each 'occu' looks something like this (first number is the unique value, second number is the number of occurrences):
occu1 occu2 .....
0 1 1 2
12 1 10 1
14 1 12 1
15 1 14 2
.. .. .. .. .....
Now I want to merge them all together, or better merge them in each iteration, when I'm reading another stacked image.
The merged results should be a 2D matrix similar to the one above. The number of occurrences of the same values should be added to one another, as this is the whole point of counting them. So the result of the above example should be this:
occu_total
0 1
1 2
10 1
12 2
14 3
15 1
.. ..
I found the join command, but that one does not seem to work here. I guess I could do it the long way of searching the matching number and add the occurrences together and so on, but there must be a quicker way of doing it.
A = [0 1;12 1; 14 1;15 1];B = [1 2;10 1;12 1;14 2];
tmp = [A;B]; %// merge arrays into a single one
tmp(:,1) = tmp(:,1)+1;%// remove zero occurrences by adding 1 to everything
C = accumarray(tmp(:,1),tmp(:,2)); %// add occurrences all up
D = [1:numel(C)].'; %// create numbered array
E = [D C];
E((C==0),:)=[]; %// get output
E(:,1) = E(:,1)-1;%// subtract the 1 again
E =
0 1
1 2
10 1
12 2
14 3
15 1
Job for accumarray. This takes the first argument as your dictionary key, and adds the values of the each key together. The addition and subtraction of 1 is done because 0 cannot be an index in MATLAB. To circumvent this (assuming you have no negative numbers), you can simply add 1 and remove that afterwards, shifting all your indices to positive integers. If you hit negative numbers, subtract tmp(:,1) = min(tmp(:,1)+1 and add E(:,1) = min(tmp(:,1)-1

Given 2 positions (x1,y1) and (x2,y2) print the sum of all elements within the area of rectangle in O(1) time

A 2D matrix is given to you. Now user will give 2 positions (x1,y1) and (x2,y2),
which is basically the upper left and lower right coordinate of a rectangle formed within the matrix.
You have to print sum of all the elements within the area of rectangle in O(1) running time.
Now you can do any pre computation with the matrix.
But when it is done you should answer all your queries in constant time.
Example : consider this 2D matrix
1 3 5 1 8
8 3 5 3 7
6 3 9 6 0
Now consider 2 points given by user (0, 2) and (2, 4).
Your solution should print: 44.
i.e., the enclosed area is
5 1 8
5 3 7
9 6 0
Since your question seems to be related to homeworks, i am just posting a clue... It is a formula that may inspire you :
What does this sum of terms represent ?
Rewrite your problem using mathematical tools, indexes like i and j ...
maybe this can also help:
courtosy of: http://www.techiedelight.com/calculate-sum-elements-sub-matrix-constant-time/

Matlab : Matrix indexing Logic

i am doing very simple Matrix indexing examples . where code is as give below
>> A=[ 1 2 3 4 ; 5 6 7 8 ; 9 10 11 12 ]
A =
1 2 3 4
5 6 7 8
9 10 11 12
>> A(end, end-2)
ans =
10
>> A(2:end, end:-2:1)
ans =
8 6
12 10
here i am bit confused . when i use A(end, end-2) it takes difference of two till first column and when there is just one column left there is no further processing , but when i use A(2:end, end:-2:1) it takes 6 10 but how does it print 8 12 while there is just one column left and we have to take difference of two from right to left , Pleas someone explain this simple point
The selection A(end, end-2) reads: take elements in last row of A that appear in column 4(end)-2=2.
The selection A(2:end, end:-2:1) similarly reads: take elements in rows 2 to 4(end) and starting from last column going backwards in jumps of two, i.e. 4 then 2.
To check the indexing, simply substitute the end with size(A,1) or size(A,2) if respectively found in the row and col position.
First the general stuff: end is just a placeholder for an index, namely the last position in a given array dimension. For instance, for an arbitrary array A(end,1) will pick the last element in column 1, and A(1,end) will pick the last element in the first row.
In your example, A(end, end-2) picks an element in the last row two columns before the last one.
To interpret a statement such as
A(2:end, end:-2:1)
it might help to substitute end with the actual index of the last row/column elements, so this is equivalent to
A(2:3, 4:-2:1)
Furthermore 4:-2:1 is equivalent to the list 4,2 since we are instructing to make the list starting at 4, decreasing in steps of 2, up to (minimum) 1. So this is equivalent to
A([2 3],[4 2])
Finally, the following combination of indices is implied by A([2 3],[4 2]):
A(2,4) A(2,2)
A(3,4) A(3,2)

Need a formula for cumulative moving averages in Open Office

I have data in column A, and would like to put the averages in column B like this:
a b
1 10 10
2 7 8.5
3 8 8.333
4 19 11
5 13 11.5
where b1 =average(a1), b2 =average(a1:a2), b3 =average(a1:a3)....
Using average() is alright for small amounts of data, but I have over 1500 data entries. I would like to find a more efficient way of doing this.
Make your initial range reference absolute, while the other is relative, i.e.:
b4 = average($a$1:a4)
You can paste that 1500 times an it will always increment the end of the range while keeping the beginning pinned to A1 due to the dollar signs in that reference.

Resources