Divide data array in more efficient way Matlab - arrays

I have a big array of data (it can be of thousands and tens of thousands of values). This data is a results of experiments collected in one array:
data = [2.204000000000000
2.202000000000000
2.206000000000000
2.201000000000000
...
]
And I have time array t of the same size:
t = [1 2 3 ... 65 66 1 2 3 4 ... 72 73 1 2 3 ... 75]';
This t is a time when the data was collected. So t = 1:66 - is a first experiment, and then t values again begins from 1 - it is data of 2 experiment and etc.
What I want to do: divide data by the specific time intervals:
t<=1
1<t<=4
4<t<=6
t>6
I go this way
part1 = []; part2 = []; part3 = []; part4 = [];
for ii = 1: size(data,1)
if (t(ii)) <=1 % collect all data corresponds to t<=1
part1 = [part1; ii];
elseif (t(ii) >1 && t(ii) <=4 )
part2 = [part2; ii];
elseif (t(ii) >4 && t(ii) <=6 )
part3 = [part3; ii];
else
part4 = [part4; ii];
end
end
data1 = data(part1);
data2 = data(part2);
data3 = data(part3);
data4 = data(part4);
That works perfect but it's slow because of:
I can't preallocate part1 part2 part3 part4 - I don't know their sizes;
It use for loop.
Can we do it in more elegant and fast way?
Now I have an idea of using one cell array instead of 4 different. Now I use part{1} part{2} ... part{4}. So I can preallocate it as part = cell(4,1);

You can improve your code using logical indexing.
I strongly encourage you to read the following references:
Mathworks documentation page: Find Array Elements That Meet a Condition
Loren on the Art of MATLAB blog entry: Logical Indexing – Multiple
Conditions
The following code uses logical indexing to do what you want without any loop, and thus without the need to preallocate any arrays:
data1 = data(t <= 1);
data2 = data((t > 1) && (t <= 4));
data3 = data((t > 4) && (t <= 6));
data4 = data(t > 6);
Logical indexing is like a traffic light: It allows the elements of an array that have a value of 1 to continue while stopping those elements that have a value of 0.
Matlab is very powerful in this kind of tasks.

Related

Restarting a loop after a specific number of iterations?

I'm new to coding but is being asked to create a simulation of a 10-year experiment 1000 times. I have it at a number lower than 1000 to speed up the testing process. This is a partial copy of my code, I can get the other parameters of the task to work but instead of stopping at and restarting for every 10-years, it seems to accumulate the results of the previous years'.
For example, the code is supposed to compound money earned the year following a 'Success,' while I can get it to compound, my code seems to compound into year 11 and 12 instead of stopping at 10 and essentially restarting at year 1.
I tried .count() to keep track of how many elements I'm iterating through and also tried the while xyz function but I can't seem to get either to work.
for sim in range(5):
for yr in range(10):
experiment = "Success" if np.random.random() <= 0.1 else "Failure"
expense = 25000
margin = 0
results1.append(experiment)
expenses1.append(expense)
margins1.append(margin)
iter = 0
if iter < 10:
for i in range(len(results1)):
if i + 1 < len(results1) and i - 1 >= 0:
if results1[i] == 'Success':
expenses1[i + 1] = 0
margins1[i + 1] = 10000
if results1[i - 1] == 'Success':
expenses1[i] = 0
if margins1[i] != 10000:
margins1[i] = 10000
if expenses1[i - 1] == 0:
expenses1[i] = 0
expenses1[i + 1] = 0
if margins1[i] >= 10000:
margins1[i + 1] = margins1[i] * 1.2
iter += 1
else:
continue
iter = 0
all_data1 = zip(results1, expenses1, margins1)
df1 = pd.DataFrame(all_data1, columns=["Results", "R&D", "Margins"])
Let's look at the following simplified code snippet of your implementation.
import numpy as np
results1 = []
for sim in range(10):
for yr in range(10):
experiment = "Success" if np.random.random() <= 0.1 else "Failure"
results1.append(experiment)
What is the problem here? Well, what I assume you want is that for every year we add a result to a list for that simulation. However, what is the difference between what you currently have, and this following code?
import numpy as np
results1 = []
for yr in range(10 * 10):
experiment = "Success" if np.random.random() <= 0.1 else "Failure"
results1.append(experiment)
Well, unless you left something out of your included code, it looks like not much! You want the simulation to reset after n years (in this example 10), but you don't change anything in your code! Here is an example of how this reset could be represented:
import numpy as np
results1 = []
for sim in range(10):
sim_results1 = []
for yr in range(10):
experiment = "Success" if np.random.random() <= 0.1 else "Failure"
sim_results1.append(experiment)
results1.append(sim_results1)
Now, your results1 list will contain m lists (where m is the number of simulations) with each sublist showing whether the experiment was a success or failure over n years. In short: if you just add each experimental result to a big list that is the same across all simulation runs, it's not a surprise that it looks like you are simulating into year 11, 12, etc. What is actually happening is year 11 is really year 1 of simulation number 2, but you do not separate the simulations currently.

In MATLAB how can I write out a multidimensional array as a string that looks like a raw numpy array?

The Goal
(Forgive me for length of this, it's mostly background and detail.)
I'm contributing to a TOML encoder/decoder for MATLAB and I'm working with numerical arrays right now. I want to input (and then be able to write out) the numerical array in the same format. This format is the nested square-bracket format that is used by numpy.array. For example, to make multi-dimensional arrays in numpy:
The following is in python, just to be clear. It is a useful example though my work is in MATLAB.
2D arrays
>> x = np.array([1,2])
>> x
array([1, 2])
>> x = np.array([[1],[2]])
>> x
array([[1],
[2]])
3D array
>> x = np.array([[[1,2],[3,4]],[[5,6],[7,8]]])
>> x
array([[[1, 2],
[3, 4]],
[[5, 6],
[7, 8]]])
4D array
>> x = np.array([[[[1,2],[3,4]],[[5,6],[7,8]]],[[[9,10],[11,12]],[[13,14],[15,16]]]])
>> x
array([[[[ 1, 2],
[ 3, 4]],
[[ 5, 6],
[ 7, 8]]],
[[[ 9, 10],
[11, 12]],
[[13, 14],
[15, 16]]]])
The input is a logical construction of the dimensions by nested brackets. Turns out this works pretty well with the TOML array structure. I can already successfully parse and decode any size/any dimension numeric array with this format from TOML to MATLAB numerical array data type.
Now, I want to encode that MATLAB numerical array back into this char/string structure to write back out to TOML (or whatever string).
So I have the following 4D array in MATLAB (same 4D array as with numpy):
>> x = permute(reshape([1:16],2,2,2,2),[2,1,3,4])
x(:,:,1,1) =
1 2
3 4
x(:,:,2,1) =
5 6
7 8
x(:,:,1,2) =
9 10
11 12
x(:,:,2,2) =
13 14
15 16
And I want to turn that into a string that has the same format as the 4D numpy input (with some function named bracketarray or something):
>> str = bracketarray(x)
str =
'[[[[1,2],[3,4]],[[5,6],[7,8]]],[[[9,10],[11,12]],[[13,14],[15,16]]]]'
I can then write out the string to a file.
EDIT: I should add, that the function numpy.array2string() basically does exactly what I want, though it adds some other whitespace characters. But I can't use that as part of the solution, though it is basically the functionality I'm looking for.
The Problem
Here's my problem. I have successfully solved this problem for up to 3 dimensions using the following function, but I cannot for the life of me figure out how to extend it to N-dimensions. I feel like it's an issue of the right kind of counting for each dimension, making sure to not skip any and to nest the brackets correctly.
Current bracketarray.m that works up to 3D
function out = bracketarray(in, internal)
in_size = size(in);
in_dims = ndims(in);
% if array has only 2 dimensions, create the string
if in_dims == 2
storage = cell(in_size(1), 1);
for jj = 1:in_size(1)
storage{jj} = strcat('[', strjoin(split(num2str(in(jj, :)))', ','), ']');
end
if exist('internal', 'var') || in_size(1) > 1 || (in_size(1) == 1 && in_dims >= 3)
out = {strcat('[', strjoin(storage, ','), ']')};
else
out = storage;
end
return
% if array has more than 2 dimensions, recursively send planes of 2 dimensions for encoding
else
out = cell(in_size(end), 1);
for ii = 1:in_size(end) %<--- this doesn't track dimensions or counts of them
out(ii) = bracketarray(in(:,:,ii), 'internal'); %<--- this is limited to 3 dimensions atm. and out(indexing) need help
end
end
% bracket the final bit together
if in_size(1) > 1 || (in_size(1) == 1 && in_dims >= 3)
out = {strcat('[', strjoin(out, ','), ']')};
end
end
Help me Obi-wan Kenobis, y'all are my only hope!
EDIT 2: Added test suite below and modified current code a bit.
Test Suite
Here is a test suite to use to see if the output is what it should be. Basically just copy and paste it into the MATLAB command window. For my current posted code, they all return true except the ones more than 3D. My current code outputs as a cell. If your solution output differently (like a string), then you'll have to remove the curly brackets from the test suite.
isequal(bracketarray(ones(1,1)), {'[1]'})
isequal(bracketarray(ones(2,1)), {'[[1],[1]]'})
isequal(bracketarray(ones(1,2)), {'[1,1]'})
isequal(bracketarray(ones(2,2)), {'[[1,1],[1,1]]'})
isequal(bracketarray(ones(3,2)), {'[[1,1],[1,1],[1,1]]'})
isequal(bracketarray(ones(2,3)), {'[[1,1,1],[1,1,1]]'})
isequal(bracketarray(ones(1,1,2)), {'[[[1]],[[1]]]'})
isequal(bracketarray(ones(2,1,2)), {'[[[1],[1]],[[1],[1]]]'})
isequal(bracketarray(ones(1,2,2)), {'[[[1,1]],[[1,1]]]'})
isequal(bracketarray(ones(2,2,2)), {'[[[1,1],[1,1]],[[1,1],[1,1]]]'})
isequal(bracketarray(ones(1,1,1,2)), {'[[[[1]]],[[[1]]]]'})
isequal(bracketarray(ones(2,1,1,2)), {'[[[[1],[1]]],[[[1],[1]]]]'})
isequal(bracketarray(ones(1,2,1,2)), {'[[[[1,1]]],[[[1,1]]]]'})
isequal(bracketarray(ones(1,1,2,2)), {'[[[[1]],[[1]]],[[[1]],[[1]]]]'})
isequal(bracketarray(ones(2,1,2,2)), {'[[[[1],[1]],[[1],[1]]],[[[1],[1]],[[1],[1]]]]'})
isequal(bracketarray(ones(1,2,2,2)), {'[[[[1,1]],[[1,1]]],[[[1,1]],[[1,1]]]]'})
isequal(bracketarray(ones(2,2,2,2)), {'[[[[1,1],[1,1]],[[1,1],[1,1]]],[[[1,1],[1,1]],[[1,1],[1,1]]]]'})
isequal(bracketarray(permute(reshape([1:16],2,2,2,2),[2,1,3,4])), {'[[[[1,2],[3,4]],[[5,6],[7,8]]],[[[9,10],[11,12]],[[13,14],[15,16]]]]'})
isequal(bracketarray(ones(1,1,1,1,2)), {'[[[[[1]]]],[[[[1]]]]]'})
I think it would be easier to just loop and use join. Your test cases pass.
function out = bracketarray_matlabbit(in)
out = permute(in, [2 1 3:ndims(in)]);
out = string(out);
dimsToCat = ndims(out);
if iscolumn(out)
dimsToCat = dimsToCat-1;
end
for i = 1:dimsToCat
out = "[" + join(out, ",", i) + "]";
end
end
This also seems to be faster than the route you were pursing:
>> x = permute(reshape([1:16],2,2,2,2),[2,1,3,4]);
>> tic; for i = 1:1e4; bracketarray_matlabbit(x); end; toc
Elapsed time is 0.187955 seconds.
>> tic; for i = 1:1e4; bracketarray_cris_luengo(x); end; toc
Elapsed time is 5.859952 seconds.
The recursive function is almost complete. What is missing is a way to index the last dimension. There are several ways to do this, the neatest, I find, is as follows:
n = ndims(x);
index = cell(n-1, 1);
index(:) = {':'};
y = x(index{:}, ii);
It's a little tricky at first, but this is what happens: index is a set of n-1 strings ':'. index{:} is a comma-separated list of these strings. When we index x(index{:},ii) we actually do x(:,:,:,ii) (if n is 4).
The completed recursive function is:
function out = bracketarray(in)
n = ndims(in);
if n == 2
% Fill in your n==2 code here
else
% if array has more than 2 dimensions, recursively send planes of 2 dimensions for encoding
index = cell(n-1, 1);
index(:) = {':'};
storage = cell(size(in, n), 1);
for ii = 1:size(in, n)
storage(ii) = bracketarray(in(index{:}, ii)); % last dimension automatically removed
end
end
out = { strcat('[', strjoin(storage, ','), ']') };
Note that I have preallocated the storage cell array, to prevent it from being resized in every loop iteration. You should do the same in your 2D case code. Preallocating is important in MATLAB for performance reasons, and the MATLAB Editor should warm you about this too.

How many random requests do I need to make to a set of records to get 80% of the records?

Suppose I have an array of 100_000 records ( this is Ruby code, but any language will do)
ary = ['apple','orange','dog','tomato', 12, 17,'cat','tiger' .... ]
results = []
I can only make random calls to the array ( I cannot traverse it in any way)
results << ary.sample
# in ruby this will pull a random record from the array, and
# push into results array
How many random calls like that, do I need to make, to get least 80% of records from ary. Or expressed another way - what should be the size of results so that results.uniq will contain around 80_000 records from ary.
From my rusty memory of Stats class in college, I think it's needs to be 2*result set size = or around 160_000 requests ( assuming random function is random, and there is no some other underlying issue) . My testing seems to confirm this.
ary = [*1..100_000];
result = [];
160_000.times{result << ary.sample};
result.uniq.size # ~ 80k
This is stats, so we are talking about probabilities, not guaranteed results. I just need a reasonable guess.
So the question really, what's the formula to confirm this?
I would just perform a quick simulation study. In R,
N = 1e5
# Simulate 300 times
s = replicate(300, sample(x = 1:N, size = 1.7e5, replace = TRUE))
Now work out when you hit your target
f = function(i) which(i == unique(i)[80000])[1]
stats = apply(s, 2, f)
To get
summary(stats)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 159711 160726 161032 161037 161399 162242
So in 300 trials, the maximum number of simulations needed was 162242 with an average number of 161032.
With Fisher-Yates shuffle you could get 80K items from exactly 80K random calls
Have no knowledge of Ruby, but looking at https://gist.github.com/mindplace/3f3a08299651ebf4ab91de3d83254fbc and modifying it
def shuffle(array, counter)
#counter = array.length - 1
while counter > 0
# item selected from the unshuffled part of array
random_index = rand(counter)
# swap the items at those locations
array[counter], array[random_index] = array[random_index], array[counter]
# de-increment counter
counter -= 1
end
array
end
indices = [0, 1, 2, 3, ...] # up to 99999
counter = 80000
shuffle(indices, 80000)
i = 0
while counter > 0
res[i] = ary[indices[i]]
counter -= 1
i += 1
UPDATE
Packing sampled indices into custom RNG (bear with me, know nothing about Ruby)
class FYRandom
_indices = indices
_max = 80000
_idx = 0
def rand()
if _idx > _max
return -1.0
r = _indices[idx]
_idx += 1
return r.to_f / max.to_f
end
end
And code for sample would be
rng = FYRandom.new
results << ary.sample(random: rng)

Reverse lookup with non-unique values

What I'm trying to do
I have an array of numbers:
>> A = [2 2 2 2 1 3 4 4];
And I want to find the array indices where each number can be found:
>> B = arrayfun(#(x) {find(A==x)}, 1:4);
In other words, this B should tell me:
>> for ii=1:4, fprintf('Item %d in location %s\n',ii,num2str(B{ii})); end
Item 1 in location 5
Item 2 in location 1 2 3 4
Item 3 in location 6
Item 4 in location 7 8
It's like the 2nd output argument of unique, but instead of the first (or last) occurrence, I want all the occurrences. I think this is called a reverse lookup (where the original key is the array index), but please correct me if I'm wrong.
How can I do it faster?
What I have above gives the correct answer, but it scales terribly with the number of unique values. For a real problem (where A has 10M elements with 100k unique values), even this stupid for loop is 100x faster:
>> B = cell(max(A),1);
>> for ii=1:numel(A), B{A(ii)}(end+1)=ii; end
But I feel like this can't possibly be the best way to do it.
We can assume that A contains only integers from 1 to the max (because if it doesn't, I can always pass it through unique to make it so).
That's a simple task for accumarray:
out = accumarray(A(:),(1:numel(A)).',[],#(x) {x}) %'
out{1} = 5
out{2} = 3 4 2 1
out{3} = 6
out{4} = 8 7
However accumarray suffers from not being stable (in the sense of unique's feature), so you might want to have a look here for a stable version of accumarray, if that's a problem.
Above solution also assumes A to be filled with integers, preferably with no gaps in between. If that is not the case, there is no way around a call of unique in advance:
A = [2.1 2.1 2.1 2.1 1.1 3.1 4.1 4.1];
[~,~,subs] = unique(A)
out = accumarray(subs(:),(1:numel(A)).',[],#(x) {x})
To sum up, the most generic solution, working with floats and returning a sorted output could be:
[~,~,subs] = unique(A)
[subs(:,end:-1:1), I] = sortrows(subs(:,end:-1:1)); %// optional
vals = 1:numel(A);
vals = vals(I); %// optional
out = accumarray(subs, vals , [],#(x) {x});
out{1} = 5
out{2} = 1 2 3 4
out{3} = 6
out{4} = 7 8
Benchmark
function [t] = bench()
%// data
a = rand(100);
b = repmat(a,100);
A = b(randperm(10000));
%// functions to compare
fcns = {
#() thewaywewalk(A(:).');
#() cst(A(:).');
};
% timeit
t = zeros(2,1);
for ii = 1:100;
t = t + cellfun(#timeit, fcns);
end
format long
end
function out = thewaywewalk(A)
[~,~,subs] = unique(A);
[subs(:,end:-1:1), I] = sortrows(subs(:,end:-1:1));
idx = 1:numel(A);
out = accumarray(subs, idx(I), [],#(x) {x});
end
function out = cst(A)
[B, IX] = sort(A);
out = mat2cell(IX, 1, diff(find(diff([-Inf,B,Inf])~=0)));
end
0.444075509687511 %// thewaywewalk
0.221888202987325 %// CST-Link
Surprisingly the version with stable accumarray is faster than the unstable one, due to the fact that Matlab prefers sorted arrays to work on.
This solution should work in O(N*log(N)) due sorting, but is quite memory intensive (requires 3x the amount of input memory):
[U, X] = sort(A);
B = mat2cell(X, 1, diff(find(diff([Inf,U,-Inf])~=0)));
I am curious about the performance though.

Finding molar mass with MATLAB

Alrighty, so I have a pretty 'simple' problem on my hands. I am given two inputs for my function: a string that gives the formula of the equation and a structure that contains the information I need and looks like this:
Name
Symbol
AtomicNumber
AtomicWeight
To find the molecular weight, I have to take all of the elements in the formula, find their total mass and add them all together. For example, let's say that I have to find the molecular weight of oxygen. The formula would look like:
H2,O
The molecular weight will thus be
2*(Hydrogen's weight) + (Oxygen's weight), which evaluates to 18.015.
There will always be a comma separating the different elements in a formula. What I am having trouble with right now, is taking the number out of the string(the formula). I feel like I'm over-complicating how I am going about extracting it. If there's a number, I know it can be in positions 2 or 3 (depending on the element name). I tried to use isnumeric, I tried to do some really weird, coding stuff (which you'll see below), but I am having difficulties.
test case:
mass5 = molarMass('C,H2,Br,C,H2,Br', table)
mass5 => 187.862
table:
Name Symbol AtomicNumber AtomicWeight
'Carbon' 'C' 6 12.0110000000000
'Hydrogen' 'H' 1 1.00800000000000
'Nitrogen' 'N' 7 14.0070000000000
'Oxygen' 'O' 8 15.9990000000000
'Phosphorus''P' 15 30.9737619980000
'Sulfur' 'S' 16 32.0600000000000
'Chlorine' 'Cl' 17 35.4500000000000
'Bromine' 'Br' 35 79.9040000000000
'Sodium' 'Na' 11 22.9897692800000
'Magnesium' 'Mg' 12 24.3050000000000
My code so far is:
function[molar_mass] = molarMass(formula, information)
Names = []; %// Creates a Name array
[~,c] = size(information); %Finds the rows and columns of the table
for i = 1:c %Reads through the columns
Molecules = getfield(information(:,i), 'Name'); %Finds the numbers in the 'Name' area
Names = [Names {Molecules}];
end
Symbols = [];
[~, c2] = size(information);
for i = 1:c2 %Reads through the columns
Symbs = getfield(information(:,i), 'Symbol'); %Finds the numbers in the 'Symbol'
Symbols = [Symbols {Symbs}];
end
AN = [];
[~, c3] = size(information);
for i = 1:c3 %Reads through the columns
Atom = getfield(information(:,i), 'AtomicNumber'); %Finds the numbers in the 'AtomicWeight' area
AN = [AN {Atom}];
end
Wt = [information(:).AtomicWeight];
formula_parts = strsplit(formula, ','); % cell array of strings
total_mass = 0;
multi = [];
atoms = [];
Indices = [];
for ipart = 1:length(formula_parts)
part = formula_parts{ipart}; % Takes in the string
isdigit = (part >= '0') & (part <= '9'); % A boolean array
atom = part(~isdigit); % Select all chars that are not digits
Indixes = find(strcmp(Symbols, atom));
Indices = [Indices {Indixes}];
mole = atom;
atoms = [atoms {mole}];
natoms = part(isdigit); % Select all chars that are digits
% Convert natoms string to numbers, default to 1 if missing
if length(natoms) == 0
natoms = '1';
multi = [multi {natoms}];
else
natoms = num2str(natoms);
multi = [multi {natoms}];
end
end
multi = char(multi);
multi = str2num(multi); %Creates a number array with my multipliers
f=56;
Molecule_Wt = Wt{Indices};
duck = 62;
total_mass = total_mass + atom_weight * multi;
end
Thanks to Bas Swinckels I can now extract the numbers from the formulas, but what I'm struggling with now is how to pull out the weights associated with the symbols. I created my own weight_chart, but strcmp won't work there. Neither will strfind or strmatch. What I want to do is find the formulas in my input, in the chart. Then index it from that index, to the column (so add 1 I believe). How do I find the indices though? I'd prefer to find them in the order the strings appear in my input, since I can then apply my 'multi' array to it.
Any help/suggestions would be appreciated :)
Given the string, you can pull out the part that is a digit character with the isstrprop function. Then use that to address your string to get just those characters, then cast that as a double with str2double.
PartialString = 'H12';
Subscript = str2double (PartialString (isstrprop (PartialString, 'digit')));
This should get you started, there is still some parts that need to be filled in:
formula_parts = strsplit(formula, ','); % cell array of strings
total_mass = 0;
for ipart = 1:length(formula_parts)
part = formula_parts{ipart}; % string like 'H2'
isdigit = isstrprop(part, 'digit'); % boolean array
atom = part(~isdigit); % select all chars that are not digits
natoms = part(isdigit); % select all chars that are digits
% convert natoms string to int, default to 1 if missing
if length(natoms) == 0
natoms = 1;
else
natoms = num2str(natoms);
end
% calculate weight
atom_weight = lookup_weight(atom); % somehow look up value in table
total_mass = total_mass + atom_weight * natoms;
end
See this old question about how to extract letters or digits from a string.

Resources