Extract "N" sized sequences from an array in R

Suppose I have the following array:
a <- sample(letters,100,replace=TRUE)
Suppose those letters form an ordered sequence. I want to extract all possible sequences of size n from that array. For example:
For n=2 I would do: paste0(a[1:99],"->",a[2:100])
for n=3 I would do: paste0(a[1:98],"->",a[2:99],"->",a[3:100])
You get the point. My goal is to create a function that takes n as input and returns the corresponding set of sequences of length n from array a.
I was able to do it using loops, but I was hoping for a high-performance one-liner.
I am a bit new to R so I'm not aware of all existing functions.

You can use embed. For embed(a, 3), this gives a matrix with columns
a[3:100]
a[2:99]
a[1:98]
in that order.
To reverse the column order use matrix syntax m[rows, cols]:
res = embed(a, 3)[, 3:1]
If you want arrows printed between the columns, then
do.call(paste, c(split(res, col(res)), sep = " -> "))
is one way. This is probably better than apply(res, 1, something), performance-wise, since this is vectorized while apply would loop over rows.
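For readers more at home in Python, the same sliding-window idea can be sketched without any libraries. This is only an illustrative analog of the embed approach, not R code; the function name ngrams and its sep argument are made up for this example.

```python
def ngrams(seq, n, sep=" -> "):
    """Join every run of n consecutive items of seq with sep."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # seq[i:] is the sequence shifted left by i positions. Zipping the n
    # shifted copies yields one tuple per window, much like the rows of
    # embed(a, n)[, n:1] in R.
    shifted = [seq[i:] for i in range(n)]
    return [sep.join(window) for window in zip(*shifted)]


letters = list("abcde")
print(ngrams(letters, 3))  # ['a -> b -> c', 'b -> c -> d', 'c -> d -> e']
```

As with embed, a sequence of length L yields L - n + 1 windows.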
As pointed out by @DavidArenburg, this can similarly be done with data.table:
library(data.table)
do.call(paste, c(shift(a, 2:0), sep = " -> "))[-(1:2)]
shift is like embed, except that it returns a list instead of a matrix (so we don't need to split by col before pasting) and pads with missing values to keep the full length (so we need to drop the first two elements with -(1:2)).
I was hoping to say something useful about how to find obscure functions in R, but came up mostly blank on how embed might be found. Maybe...
Go to any HTML help page
Click the "Index" hyperlink at the bottom
Read every single page
?

Related

Shorter way to index array

I have a 2D array called my_array. To select the 1st, 13th, 14th, 15th, and the 16th element from each row I use the following line
desired_elements = my_array[:,[0,12,13,14,15]]
This works, but I'm pretty sure that the [0,12,13,14,15] part can be written more compactly. I have tried to look for a way, but until now I have been unable to do so.
Question: Is there a shorter way to write
desired_elements = my_array[:,[0,12,13,14,15]]
This is not shorter but equal in length for your specific input, but perhaps it is what you're looking for. You could be using np.r_, which translates slice objects to concatenation along the first axis.
It's a simple way to build up arrays quickly when you have multiple slices to select.
Here's how you would do with your example:
desired_elements = my_array[:, np.r_[0, 12:16]]
Now, if you wanted to select more slices, you would probably end up with something shorter than the approach you take, for instance:
desired_elements = my_array[:, np.r_[0, 4:8, 11:14]]
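To make the expansion explicit, here is a plain-Python sketch of what an index expression such as np.r_[0, 12:16] boils down to. The helper expand_index is hypothetical and is not part of NumPy; it only handles integers and slices with an explicit stop.

```python
def expand_index(*parts):
    """Flatten integers and slices into one list of indices."""
    indices = []
    for part in parts:
        if isinstance(part, slice):
            # slice(12, 16) stands for 12, 13, 14, 15 (stop is exclusive)
            indices.extend(range(part.start or 0, part.stop, part.step or 1))
        else:
            indices.append(part)
    return indices


cols = expand_index(0, slice(12, 16))
row = list(range(100, 120))
print(cols)                    # [0, 12, 13, 14, 15]
print([row[c] for c in cols])  # [100, 112, 113, 114, 115]
```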
May I ask why it is so critical to shorten your input?

Excel: creating an array with n times a constant

I have been looking around for a while but unable to find an answer to my question.
In Excel, what compact formula can I use to create an array made up of a single element repeated n times, where n is an input (potentially hard-coded)?
For example, something that would look like this (the formula below does not work but gives an idea of what I am looking for):
{={"Constant"}*3}
Note: I am not looking for a VBA-based solution.
EDIT: After reading @AxelRichter's answer, I see I should also point out that the formulas below assume Constant is a number. If Constant is text, this solution will not work.
Volatile:
=ROW(INDIRECT("1:" & Repts))/ROW(INDIRECT("1" & ":" & Repts)) * Constant
non-Volatile:
=ROW(INDEX($1:$65535,1,1):INDEX($1:$65535,Repts,1))/ROW(INDEX($1:$65535,1,1):INDEX($1:$65535,Repts,1))*Constant
If
Constant = 14
Repts = 3
then
Result = {14;14;14}
The first part of each formula creates an array of 1's repeated Repts times. We then multiply that array by Constant to get the desired result.
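The arithmetic behind these formulas can be sketched in Python, purely as an illustration of the logic (it is not an Excel feature). The function name repeat_constant is made up for this example.

```python
def repeat_constant(constant, repts):
    """Mimic ROW(...)/ROW(...) * Constant for a numeric constant."""
    rows = range(1, repts + 1)      # like ROW(INDIRECT("1:" & Repts))
    ones = [r // r for r in rows]   # each row number divided by itself is 1
    return [one * constant for one in ones]


print(repeat_constant(14, 3))  # [14, 14, 14]
```

Because the trick relies on multiplication, it only makes sense for numbers, which matches the caveat about text constants.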
And after reading @MacroMarc's comment, the following non-volatile formula should also work for numbers:
=(ROW($A$1:INDEX($A:$A,Repts))>0)*Constant
One could concatenate 1:n empty cells to the "Constant" to create a string array having n items "Constant":
"Constant"&INDEX(XFD:XFD,1):INDEX(XFD:XFD,3)
Here, 3 is n.
Used in Formula
=INDEX("Constant"&INDEX(XFD:XFD,1):INDEX(XFD:XFD,3),0)
Evaluate Formula shows that it works.
Here column XFD is used because in most cases this column will be empty and a column which is guaranteed to be empty is needed for this solution.
If used
"Constant"&T(ROW($A$1:INDEX($A:$A,3)))
=INDEX("Constant"&T(ROW($A$1:INDEX($A:$A,3))),0)
the need of an empty column disappears. The function ROW returns numbers but the T returns an empty string if its parameter is not text. So empty strings will be concatenated for each 1:3 (n).
Thanks to @MacroMarc for the hint.
Try:
REPT("Constant", SEQUENCE(3,1,1,0))
Or, if the reference is to a dynamic array:
REPT("Constant", SEQUENCE(A1#,1,1,0))
The dynamic array spills, and has your constant repeated one time.
Using SEQUENCE with a step of 0 is a much cleaner way to make an array of constants. You can choose whether you want rows or columns (or both!) as well.
=SEQUENCE(Repts,1,Constant,0)
I will generally use a sequence (like Claire (above) said). But if you want to provide an output of text objects, I would do it this way:
=IF(SEQUENCE(A1,A2,1,0),A3)
Where:
A1 has the number of rows
A2 has the number of columns
A3 has the thing you want repeated into an array
SEQUENCE creates a matrix of 1's, and IF treats each 1 as TRUE, so every cell returns the TRUE expression (the contents of A3).
So, if you wanted a vertical list of 3 items that says "Constant", this would do it:
=IF(SEQUENCE(3,,1,0),"Constant")
If you would prefer it be arranged horizontally instead of vertically, just amend the SEQUENCE function:
=IF(SEQUENCE(,3,1,0),"Constant")
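The IF(SEQUENCE(...)) pattern can be modelled in a few lines of Python. This is a rough sketch of the logic only: sequence and fill are hypothetical helpers loosely modelled on the Excel functions, and nested lists stand in for the spilled range.

```python
def sequence(rows=1, cols=1, start=1, step=1):
    """Row-major grid of numbers, loosely modelled on Excel's SEQUENCE."""
    return [[start + step * (r * cols + c) for c in range(cols)]
            for r in range(rows)]


def fill(rows, cols, value):
    """Model IF(SEQUENCE(rows, cols, 1, 0), value)."""
    ones = sequence(rows, cols, 1, 0)  # step 0 makes every cell equal to 1
    # IF treats any non-zero number as TRUE, so every cell picks value
    return [[value if cell else False for cell in row] for row in ones]


print(fill(3, 1, "Constant"))  # [['Constant'], ['Constant'], ['Constant']]
print(fill(1, 3, "Constant"))  # [['Constant', 'Constant', 'Constant']]
```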

Returning multiple adjacent cell results from an min array which may include multiple duplicate values

I'm trying to set up a formula that will return the contents of a related cell (my related cell is on another sheet) for the smallest 2 results in an array. This is what I'm using right now.
=INDEX('Sheet1'!$A$40:'Sheet1'!$A$167,MATCH(SMALL(F1:F128,1),F1:F128,0),1)
And
=INDEX('Sheet1'!$A$40:'Sheet1'!$A$167,MATCH(SMALL(F1:F128,2),F1:F128,0),1)
The problem I've run into is twofold.
First, if there are multiple lowest results I get whichever one appears first in the array for both entries.
Second, if the second lowest result is duplicated but the first is not I get whichever one shows up on the list first, but any subsequent duplicates are ignored. I would like to be able to display the names associated with the duplicated scores.
You will have to adjust the k parameter of the SMALL function to raise the k according to duplicates. The COUNTIF function should be sufficient for this. Once all occurrences of the top two scores are retrieved, standard 'lookup multiple values' formulas can be applied. Retrieving successive row positions with the AGGREGATE¹ function and passing those into an INDEX of the names works well.
    
The formulas in H2:I2 are,
=IF(SMALL(F$40:F$167, ROW(1:1))<=SMALL(F$40:F$167, 1+COUNTIF(F$40:F$167, MIN(F$40:F$167))), SMALL(F$40:F$167, ROW(1:1)), "") '◄ H2
=IF(LEN(H40), INDEX(A$40:A$167, AGGREGATE(15, 6, ROW($1:$128)/(F$40:F$167=H40), COUNTIF(H$40:H40, H40))), "") '◄ I2
Fill down as necessary. The scores are designed to terminate after the last second place so it would be a good idea to fill down several rows more than is immediately necessary for future duplicates.
¹ The AGGREGATE function was introduced with Excel 2010². It is not available in earlier versions.
² Related article for pre-xl2010 functions - see Multiple Ranked Returns from INDEX().
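The selection rule behind these formulas (keep every score up to and including the second distinct lowest, then look up each matching name in order) can be sketched in Python. This illustrates the logic only and is not a translation of the formulas; the function name lowest_two_with_ties is made up.

```python
def lowest_two_with_ties(names, scores):
    """Return (name, score) pairs for the two lowest distinct scores."""
    if not scores:
        return []
    # The cutoff is the second distinct lowest score, or the lowest one
    # when every score is identical.
    cutoff = sorted(set(scores))[:2][-1]
    # Sort by score, then by original position, so duplicates come out
    # in list order, like successive AGGREGATE row positions.
    ranked = sorted((score, pos) for pos, score in enumerate(scores))
    return [(names[pos], score) for score, pos in ranked if score <= cutoff]


names = ["A", "B", "C", "D"]
scores = [3, 1, 1, 2]
print(lowest_two_with_ties(names, scores))  # [('B', 1), ('C', 1), ('D', 2)]
```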
The following formula will do what I think you want:
=IF(OR(ROW(1:1)=1,COUNTIF($E$1:$E1,INDEX(Sheet1!$A$40:$A$167,MATCH(SMALL($F$1:$F$128,ROW(1:1)),$F$1:$F$128,0)))>0,ROW(1:1)=2),INDEX(Sheet1!$A$40:$A$167,MATCH(1,INDEX(($F$1:$F$128=SMALL($F$1:$F$128,ROW(1:1)))*(COUNTIF($E$1:$E1,Sheet1!$A$40:$A$167)=0),),0)),"")
NOTE:
This is an array formula and must be confirmed with Ctrl-Shift-Enter.
There are two references $E$1:$E1. This formula assumes that it will be entered in E2 and copied down. If it is going in a different column, change these two references. It must go in the second row or it will throw a circular reference.
What it will do
If there is a tie for first place it will only list those teams that are tied for first.
If there is only one first place but multiple tied for second places it will list all those in second.
So make sure you copy the formula down far enough to cover all possible ties. It will put "" in any that do not fill, so err on the high side.
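The behaviour described above can be written out as a small Python sketch. It follows one reading of the description, namely that a single winner is listed first and is followed by everyone tied for second; the function name first_and_second_place is hypothetical.

```python
def first_and_second_place(names, scores):
    """List tied winners only, or the single winner plus all runners-up."""
    # Positions ordered by score, ties broken by original order
    order = sorted(range(len(scores)), key=lambda i: (scores[i], i))
    if not order:
        return []
    best = scores[order[0]]
    first = [names[i] for i in order if scores[i] == best]
    if len(first) > 1 or len(order) == 1:
        return first  # a tie for first place hides second place
    runner_up = scores[order[1]]
    return first + [names[i] for i in order if scores[i] == runner_up]


print(first_and_second_place(["A", "B", "C"], [1, 1, 2]))       # ['A', 'B']
print(first_and_second_place(["A", "B", "C", "D"], [1, 2, 2, 3]))  # ['A', 'B', 'C']
```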
To get the Scores use this simple formula, I put mine in Column F:
=IF(E2<>"",SMALL($F$1:$F$128,ROW(1:1)),"")
Again change the E reference to the column you use for the output.

How to change all elements of struct array, which have certain field value?

Imagine that we have an array of structures:
S=repmat(struct('a1',0,'a2', 0, 'a3', 0, ...), N, 1 );
I need to change all elements with specific field value (e.g. field a1 = k) to elements with another value of this field (e.g. field a1 = m). In other words, if S(i).a1 == k => S(i).a1 = m. And I need to do it really fast, so no loop suits me. I tried to find a solution and here is what I found. Command:
S([S.a1]==k)
returns an array containing all elements with field a1 equals k. However, if I change something in this array, of course, nothing will happen in initial array S. So I tried to do obvious move:
S([S.a1]==k).a1 = m
Unfortunately, MATLAB doesn't understand this:
Insufficient outputs from right hand side to satisfy comma separated list expansion on left hand side. Missing [] are the most likely cause.
(I have tried to put brackets everywhere - no help)
Is there any way to do this without loop (ideally, it should work as fast as possible)? With something like structfun maybe?
Thanks in advance.
The same way you wrapped [S.a1] with brackets to concatenate the multiple outputs into a vector, you need to wrap S([S.a1]==k).a1. Then, with help from the deal function, you can copy a single input m to multiple outputs. The final solution with the correct syntax looks like this:
[S([S(:).a1]==k).a1]=deal(m)
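The mask-then-assign logic can be illustrated in Python with a list of dictionaries standing in for the struct array. This is only an analog of the idea and says nothing about MATLAB performance; replace_field is a made-up name.

```python
def replace_field(records, field, k, m):
    """Set records[i][field] = m wherever it currently equals k."""
    # The mask plays the role of [S.a1]==k
    mask = [rec[field] == k for rec in records]
    for rec, hit in zip(records, mask):
        if hit:
            rec[field] = m  # the same value goes to every hit, like deal(m)
    return sum(mask)        # number of elements changed


S = [{"a1": 0, "a2": 0}, {"a1": 5, "a2": 1}, {"a1": 5, "a2": 2}]
changed = replace_field(S, "a1", 5, 9)
print(changed)                    # 2
print([rec["a1"] for rec in S])   # [0, 9, 9]
```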

Concatenate subcells through one dimension of a cell array without using loops in MATLAB

I have a cell array. Each cell contains a vector of variable length. For example:
example_cell_array=cellfun(@(x)x.*rand([length(x),1]),cellfun(@(x)ones(x,1), num2cell(ceil(10.*rand([7,4]))), 'UniformOutput', false), 'UniformOutput', false)
I need to concatenate the contents of the cells down through one dimension and then perform an operation on each concatenated vector, generating a scalar for each column in my cell array (like sum() for example; the actual operation is complex, time consuming, and not naturally vectorisable, especially for vectors of different lengths).
I can do this with loops easily (for my concatenated vector sum example) as follows:
[M N]=size(example_cell_array);
result=zeros(1,N);
cat_cell_array=cell(1,N);
for n=1:N
cat_cell_array{n}=[];
for m=1:M
cat_cell_array{n}=[cat_cell_array{n};example_cell_array{m,n}];
end
end
result=cell2mat(cellfun(@(x)sum(x), cat_cell_array, 'UniformOutput', false))
Unfortunately this is WAY too slow. (My cell array is 1Mx5 with vectors in each cell ranging in length from 100-200)
Is there a simple way to produce the concatenated cell array where the vectors contained in the cells have been concatenated down one dimension?
Something like:
dim=1;
cat_cell_array=?concatcells?(dim,example_cell_array);
Edit:
Since so many people have been testing the solutions: Just FYI, the function I'm applying to each concatenated vector is circ_kappa(x) available from Circular Statistics Toolbox
Some approaches unpack the numeric data from example_cell_array using {..}, concatenate it, and pack it back into bigger cells to form your cat_cell_array. You then need to unpack the numeric data from that concatenated cell array again to perform your operation on each cell.
In my view, this repeated unpacking and packing won't be efficient if example_cell_array isn't one of your intended outputs. With that in mind, let me suggest two approaches here.
Loopy approach
The first one is a for-loop code -
data1 = vertcat(example_cell_array{:}); %// extract all numeric data for once
starts = [1 sum(cellfun('length',example_cell_array),1)]; %// intervals lengths
idx = cumsum(starts); %// get indices to work on intervals basis
result = zeros(1,size(example_cell_array,2));
%// replace this with "result(size(example_cell_array,2))=0;" for performance
for k1 = 1:numel(idx)-1
result(k1) = sum(data1(idx(k1):idx(k1+1)-1));
end
So, you need to edit sum with your actual operation.
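The same flatten-once, slice-by-cumulative-length idea can be sketched in Python with nested lists in place of the cell array. It is an illustration of the indexing scheme, not a MATLAB replacement; per_column is a hypothetical name and op stands for whatever operation replaces sum. Indices are zero-based here, so the boundaries start at 0 rather than 1.

```python
from itertools import accumulate


def per_column(cells, op):
    """Apply op to the concatenation of each column of a 2-D grid of lists."""
    n_cols = len(cells[0])
    data, lens = [], []
    # Flatten column by column, the order vertcat(C{:}) produces
    for c in range(n_cols):
        before = len(data)
        for row in cells:
            data.extend(row[c])
        lens.append(len(data) - before)  # total length of this column
    # Cumulative sums give the start offset of every column's interval
    bounds = [0] + list(accumulate(lens))
    return [op(data[bounds[k]:bounds[k + 1]]) for k in range(n_cols)]


cells = [[[1, 2], [10]],
         [[3], [20, 30]]]
print(per_column(cells, sum))  # [6, 60]
```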
Almost-vectorized approach
If example_cell_array has a lot of columns, my second suggestion would be an almost vectorized approach, though it doesn't perform badly either with a small number of columns. Now this code uses cellfun at the first line to get the lengths for each cell in concatenated version. cellfun is basically a wrapper to a loop code, but this is not very expensive in terms of runtime and that's why I categorized this approach as an almost vectorized one.
The code would be -
lens = sum(cellfun('length',example_cell_array),1); %// intervals lengths
maxlens = max(lens);
numlens = numel(lens);
array1(maxlens,numlens)=0;
array1(bsxfun(@ge,lens,[1:maxlens]')) = vertcat(example_cell_array{:}); %//'
result = sum(array1,1);
The thing you need to do now, is to make your operation run on column basis with array1 using the mask created by the bsxfun implementation. Thus, if array1 is a M x 5 sized array, you need to select the valid elements from each column using the mask and then do the operation on those elements. Let me know if you need more info on the masking issue.
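As a rough illustration of the masking step, the Python sketch below builds zero-padded columns together with a validity mask and applies an operation only to the valid entries. Nested lists stand in for array1 (each inner list is one column), and the names padded_columns and masked_apply are made up for this example.

```python
def padded_columns(cells):
    """Return zero-padded columns plus a mask marking the real entries."""
    # Concatenate every column of the grid of lists
    columns = [[x for cell in col for x in cell] for col in zip(*cells)]
    max_len = max(len(col) for col in columns)
    padded = [col + [0] * (max_len - len(col)) for col in columns]
    # True where the entry holds data, False where it is padding
    mask = [[i < len(col) for i in range(max_len)] for col in columns]
    return padded, mask


def masked_apply(padded, mask, op):
    """Apply op to the valid entries of each padded column."""
    return [op([x for x, keep in zip(col, m) if keep])
            for col, m in zip(padded, mask)]


cells = [[[1, 2], [10]],
         [[3], [20]]]
padded, mask = padded_columns(cells)
print(padded)                           # [[1, 2, 3], [10, 20, 0]]
print(masked_apply(padded, mask, min))  # [1, 10]
```

For sum the padding zeros are harmless, but for an operation such as min the mask matters: without it the second column would report 0.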
Hope one of these approaches would work for you!
Quick Tests: Using a 250000x5 sized example_cell_array, quick tests show that both these approaches for the sum operation perform very well and give about 400x speedup over the code in the question at my end.
For the concatenation itself, it sounds like you might want the functional form of cat:
for n=1:N
cat_cell_array{n} = cat(1, example_cell_array{:,n});
end
This will concatenate all the arrays in the cells in each column in the original input array.
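For comparison, the per-column concatenation can be written in Python as a single comprehension over the transposed grid. This is a sketch with nested lists standing in for the cell array, and concat_columns is a hypothetical name.

```python
def concat_columns(cells):
    """Concatenate the lists in each column of a 2-D grid of lists."""
    # zip(*cells) walks the grid column by column
    return [[x for cell in column for x in cell] for column in zip(*cells)]


cells = [[[1, 2], [10]],
         [[3], [20, 30]]]
print(concat_columns(cells))  # [[1, 2, 3], [10, 20, 30]]
```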
You can define a function like this:
cellcat = @(C) arrayfun(@(k) cat(1, C{:, k}), 1:size(C,2), 'uni', 0);
And then just use
>> cellcat(example_cell_array)
ans =
[42x1 double] [53x1 double] [51x1 double] [47x1 double]
I think you are looking to generate cat_cell_array without using for loops. If so, you can do it as follows:
cat_cell_array=cellfun(@(x) cell2mat(x),num2cell(example_cell_array,1),'UniformOutput',false);
As far as I can tell, the line above can replace your entire for loop. You can then calculate your complex function over this cat_cell_array.
If only result is important to you and you do not want to store cat_cell_array, then you can do everything in a single line (not recommended for readability):
result=cell2mat(cellfun(@(x)sum(x), cellfun(@(x) cell2mat(x),num2cell(example_cell_array,1),'Uni',false), 'Uni', false));
