The Goal
(Forgive me for length of this, it's mostly background and detail.)
I'm contributing to a TOML encoder/decoder for MATLAB and I'm working with numerical arrays right now. I want to input (and then be able to write out) the numerical array in the same format. This format is the nested square-bracket format that is used by numpy.array. For example, to make multi-dimensional arrays in numpy:
The following is in python, just to be clear. It is a useful example though my work is in MATLAB.
2D arrays
>> x = np.array([1,2])
>> x
array([1, 2])
>> x = np.array([[1],[2]])
>> x
array([[1],
[2]])
3D array
>> x = np.array([[[1,2],[3,4]],[[5,6],[7,8]]])
>> x
array([[[1, 2],
[3, 4]],
[[5, 6],
[7, 8]]])
4D array
>> x = np.array([[[[1,2],[3,4]],[[5,6],[7,8]]],[[[9,10],[11,12]],[[13,14],[15,16]]]])
>> x
array([[[[ 1, 2],
[ 3, 4]],
[[ 5, 6],
[ 7, 8]]],
[[[ 9, 10],
[11, 12]],
[[13, 14],
[15, 16]]]])
The input is a logical construction of the dimensions by nested brackets. Turns out this works pretty well with the TOML array structure. I can already successfully parse and decode any size/any dimension numeric array with this format from TOML to MATLAB numerical array data type.
Now, I want to encode that MATLAB numerical array back into this char/string structure to write back out to TOML (or whatever string).
So I have the following 4D array in MATLAB (same 4D array as with numpy):
>> x = permute(reshape([1:16],2,2,2,2),[2,1,3,4])
x(:,:,1,1) =
1 2
3 4
x(:,:,2,1) =
5 6
7 8
x(:,:,1,2) =
9 10
11 12
x(:,:,2,2) =
13 14
15 16
And I want to turn that into a string that has the same format as the 4D numpy input (with some function named bracketarray or something):
>> str = bracketarray(x)
str =
'[[[[1,2],[3,4]],[[5,6],[7,8]]],[[[9,10],[11,12]],[[13,14],[15,16]]]]'
I can then write out the string to a file.
EDIT: I should add, that the function numpy.array2string() basically does exactly what I want, though it adds some other whitespace characters. But I can't use that as part of the solution, though it is basically the functionality I'm looking for.
The Problem
Here's my problem. I have successfully solved this problem for up to 3 dimensions using the following function, but I cannot for the life of me figure out how to extend it to N-dimensions. I feel like it's an issue of the right kind of counting for each dimension, making sure to not skip any and to nest the brackets correctly.
Current bracketarray.m that works up to 3D
function out = bracketarray(in, internal)
in_size = size(in);
in_dims = ndims(in);
% if array has only 2 dimensions, create the string
if in_dims == 2
storage = cell(in_size(1), 1);
for jj = 1:in_size(1)
storage{jj} = strcat('[', strjoin(split(num2str(in(jj, :)))', ','), ']');
end
if exist('internal', 'var') || in_size(1) > 1 || (in_size(1) == 1 && in_dims >= 3)
out = {strcat('[', strjoin(storage, ','), ']')};
else
out = storage;
end
return
% if array has more than 2 dimensions, recursively send planes of 2 dimensions for encoding
else
out = cell(in_size(end), 1);
for ii = 1:in_size(end) %<--- this doesn't track dimensions or counts of them
out(ii) = bracketarray(in(:,:,ii), 'internal'); %<--- this is limited to 3 dimensions atm. and out(indexing) need help
end
end
% bracket the final bit together
if in_size(1) > 1 || (in_size(1) == 1 && in_dims >= 3)
out = {strcat('[', strjoin(out, ','), ']')};
end
end
Help me Obi-wan Kenobis, y'all are my only hope!
EDIT 2: Added test suite below and modified current code a bit.
Test Suite
Here is a test suite to use to see if the output is what it should be. Basically just copy and paste it into the MATLAB command window. For my current posted code, they all return true except the ones more than 3D. My current code outputs as a cell. If your solution output differently (like a string), then you'll have to remove the curly brackets from the test suite.
isequal(bracketarray(ones(1,1)), {'[1]'})
isequal(bracketarray(ones(2,1)), {'[[1],[1]]'})
isequal(bracketarray(ones(1,2)), {'[1,1]'})
isequal(bracketarray(ones(2,2)), {'[[1,1],[1,1]]'})
isequal(bracketarray(ones(3,2)), {'[[1,1],[1,1],[1,1]]'})
isequal(bracketarray(ones(2,3)), {'[[1,1,1],[1,1,1]]'})
isequal(bracketarray(ones(1,1,2)), {'[[[1]],[[1]]]'})
isequal(bracketarray(ones(2,1,2)), {'[[[1],[1]],[[1],[1]]]'})
isequal(bracketarray(ones(1,2,2)), {'[[[1,1]],[[1,1]]]'})
isequal(bracketarray(ones(2,2,2)), {'[[[1,1],[1,1]],[[1,1],[1,1]]]'})
isequal(bracketarray(ones(1,1,1,2)), {'[[[[1]]],[[[1]]]]'})
isequal(bracketarray(ones(2,1,1,2)), {'[[[[1],[1]]],[[[1],[1]]]]'})
isequal(bracketarray(ones(1,2,1,2)), {'[[[[1,1]]],[[[1,1]]]]'})
isequal(bracketarray(ones(1,1,2,2)), {'[[[[1]],[[1]]],[[[1]],[[1]]]]'})
isequal(bracketarray(ones(2,1,2,2)), {'[[[[1],[1]],[[1],[1]]],[[[1],[1]],[[1],[1]]]]'})
isequal(bracketarray(ones(1,2,2,2)), {'[[[[1,1]],[[1,1]]],[[[1,1]],[[1,1]]]]'})
isequal(bracketarray(ones(2,2,2,2)), {'[[[[1,1],[1,1]],[[1,1],[1,1]]],[[[1,1],[1,1]],[[1,1],[1,1]]]]'})
isequal(bracketarray(permute(reshape([1:16],2,2,2,2),[2,1,3,4])), {'[[[[1,2],[3,4]],[[5,6],[7,8]]],[[[9,10],[11,12]],[[13,14],[15,16]]]]'})
isequal(bracketarray(ones(1,1,1,1,2)), {'[[[[[1]]]],[[[[1]]]]]'})
I think it would be easier to just loop and use join. Your test cases pass.
function out = bracketarray_matlabbit(in)
out = permute(in, [2 1 3:ndims(in)]);
out = string(out);
dimsToCat = ndims(out);
if iscolumn(out)
dimsToCat = dimsToCat-1;
end
for i = 1:dimsToCat
out = "[" + join(out, ",", i) + "]";
end
end
This also seems to be faster than the route you were pursing:
>> x = permute(reshape([1:16],2,2,2,2),[2,1,3,4]);
>> tic; for i = 1:1e4; bracketarray_matlabbit(x); end; toc
Elapsed time is 0.187955 seconds.
>> tic; for i = 1:1e4; bracketarray_cris_luengo(x); end; toc
Elapsed time is 5.859952 seconds.
The recursive function is almost complete. What is missing is a way to index the last dimension. There are several ways to do this, the neatest, I find, is as follows:
n = ndims(x);
index = cell(n-1, 1);
index(:) = {':'};
y = x(index{:}, ii);
It's a little tricky at first, but this is what happens: index is a set of n-1 strings ':'. index{:} is a comma-separated list of these strings. When we index x(index{:},ii) we actually do x(:,:,:,ii) (if n is 4).
The completed recursive function is:
function out = bracketarray(in)
n = ndims(in);
if n == 2
% Fill in your n==2 code here
else
% if array has more than 2 dimensions, recursively send planes of 2 dimensions for encoding
index = cell(n-1, 1);
index(:) = {':'};
storage = cell(size(in, n), 1);
for ii = 1:size(in, n)
storage(ii) = bracketarray(in(index{:}, ii)); % last dimension automatically removed
end
end
out = { strcat('[', strjoin(storage, ','), ']') };
Note that I have preallocated the storage cell array, to prevent it from being resized in every loop iteration. You should do the same in your 2D case code. Preallocating is important in MATLAB for performance reasons, and the MATLAB Editor should warm you about this too.
Suppose I have an array of 100_000 records ( this is Ruby code, but any language will do)
ary = ['apple','orange','dog','tomato', 12, 17,'cat','tiger' .... ]
results = []
I can only make random calls to the array ( I cannot traverse it in any way)
results << ary.sample
# in ruby this will pull a random record from the array, and
# push into results array
How many random calls like that, do I need to make, to get least 80% of records from ary. Or expressed another way - what should be the size of results so that results.uniq will contain around 80_000 records from ary.
From my rusty memory of Stats class in college, I think it's needs to be 2*result set size = or around 160_000 requests ( assuming random function is random, and there is no some other underlying issue) . My testing seems to confirm this.
ary = [*1..100_000];
result = [];
160_000.times{result << ary.sample};
result.uniq.size # ~ 80k
This is stats, so we are talking about probabilities, not guaranteed results. I just need a reasonable guess.
So the question really, what's the formula to confirm this?
I would just perform a quick simulation study. In R,
N = 1e5
# Simulate 300 times
s = replicate(300, sample(x = 1:N, size = 1.7e5, replace = TRUE))
Now work out when you hit your target
f = function(i) which(i == unique(i)[80000])[1]
stats = apply(s, 2, f)
To get
summary(stats)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 159711 160726 161032 161037 161399 162242
So in 300 trials, the maximum number of simulations needed was 162242 with an average number of 161032.
With Fisher-Yates shuffle you could get 80K items from exactly 80K random calls
Have no knowledge of Ruby, but looking at https://gist.github.com/mindplace/3f3a08299651ebf4ab91de3d83254fbc and modifying it
def shuffle(array, counter)
#counter = array.length - 1
while counter > 0
# item selected from the unshuffled part of array
random_index = rand(counter)
# swap the items at those locations
array[counter], array[random_index] = array[random_index], array[counter]
# de-increment counter
counter -= 1
end
array
end
indices = [0, 1, 2, 3, ...] # up to 99999
counter = 80000
shuffle(indices, 80000)
i = 0
while counter > 0
res[i] = ary[indices[i]]
counter -= 1
i += 1
UPDATE
Packing sampled indices into custom RNG (bear with me, know nothing about Ruby)
class FYRandom
_indices = indices
_max = 80000
_idx = 0
def rand()
if _idx > _max
return -1.0
r = _indices[idx]
_idx += 1
return r.to_f / max.to_f
end
end
And code for sample would be
rng = FYRandom.new
results << ary.sample(random: rng)
What I'm trying to do
I have an array of numbers:
>> A = [2 2 2 2 1 3 4 4];
And I want to find the array indices where each number can be found:
>> B = arrayfun(#(x) {find(A==x)}, 1:4);
In other words, this B should tell me:
>> for ii=1:4, fprintf('Item %d in location %s\n',ii,num2str(B{ii})); end
Item 1 in location 5
Item 2 in location 1 2 3 4
Item 3 in location 6
Item 4 in location 7 8
It's like the 2nd output argument of unique, but instead of the first (or last) occurrence, I want all the occurrences. I think this is called a reverse lookup (where the original key is the array index), but please correct me if I'm wrong.
How can I do it faster?
What I have above gives the correct answer, but it scales terribly with the number of unique values. For a real problem (where A has 10M elements with 100k unique values), even this stupid for loop is 100x faster:
>> B = cell(max(A),1);
>> for ii=1:numel(A), B{A(ii)}(end+1)=ii; end
But I feel like this can't possibly be the best way to do it.
We can assume that A contains only integers from 1 to the max (because if it doesn't, I can always pass it through unique to make it so).
That's a simple task for accumarray:
out = accumarray(A(:),(1:numel(A)).',[],#(x) {x}) %'
out{1} = 5
out{2} = 3 4 2 1
out{3} = 6
out{4} = 8 7
However accumarray suffers from not being stable (in the sense of unique's feature), so you might want to have a look here for a stable version of accumarray, if that's a problem.
Above solution also assumes A to be filled with integers, preferably with no gaps in between. If that is not the case, there is no way around a call of unique in advance:
A = [2.1 2.1 2.1 2.1 1.1 3.1 4.1 4.1];
[~,~,subs] = unique(A)
out = accumarray(subs(:),(1:numel(A)).',[],#(x) {x})
To sum up, the most generic solution, working with floats and returning a sorted output could be:
[~,~,subs] = unique(A)
[subs(:,end:-1:1), I] = sortrows(subs(:,end:-1:1)); %// optional
vals = 1:numel(A);
vals = vals(I); %// optional
out = accumarray(subs, vals , [],#(x) {x});
out{1} = 5
out{2} = 1 2 3 4
out{3} = 6
out{4} = 7 8
Benchmark
function [t] = bench()
%// data
a = rand(100);
b = repmat(a,100);
A = b(randperm(10000));
%// functions to compare
fcns = {
#() thewaywewalk(A(:).');
#() cst(A(:).');
};
% timeit
t = zeros(2,1);
for ii = 1:100;
t = t + cellfun(#timeit, fcns);
end
format long
end
function out = thewaywewalk(A)
[~,~,subs] = unique(A);
[subs(:,end:-1:1), I] = sortrows(subs(:,end:-1:1));
idx = 1:numel(A);
out = accumarray(subs, idx(I), [],#(x) {x});
end
function out = cst(A)
[B, IX] = sort(A);
out = mat2cell(IX, 1, diff(find(diff([-Inf,B,Inf])~=0)));
end
0.444075509687511 %// thewaywewalk
0.221888202987325 %// CST-Link
Surprisingly the version with stable accumarray is faster than the unstable one, due to the fact that Matlab prefers sorted arrays to work on.
This solution should work in O(N*log(N)) due sorting, but is quite memory intensive (requires 3x the amount of input memory):
[U, X] = sort(A);
B = mat2cell(X, 1, diff(find(diff([Inf,U,-Inf])~=0)));
I am curious about the performance though.
Alrighty, so I have a pretty 'simple' problem on my hands. I am given two inputs for my function: a string that gives the formula of the equation and a structure that contains the information I need and looks like this:
Name
Symbol
AtomicNumber
AtomicWeight
To find the molecular weight, I have to take all of the elements in the formula, find their total mass and add them all together. For example, let's say that I have to find the molecular weight of oxygen. The formula would look like:
H2,O
The molecular weight will thus be
2*(Hydrogen's weight) + (Oxygen's weight), which evaluates to 18.015.
There will always be a comma separating the different elements in a formula. What I am having trouble with right now, is taking the number out of the string(the formula). I feel like I'm over-complicating how I am going about extracting it. If there's a number, I know it can be in positions 2 or 3 (depending on the element name). I tried to use isnumeric, I tried to do some really weird, coding stuff (which you'll see below), but I am having difficulties.
test case:
mass5 = molarMass('C,H2,Br,C,H2,Br', table)
mass5 => 187.862
table:
Name Symbol AtomicNumber AtomicWeight
'Carbon' 'C' 6 12.0110000000000
'Hydrogen' 'H' 1 1.00800000000000
'Nitrogen' 'N' 7 14.0070000000000
'Oxygen' 'O' 8 15.9990000000000
'Phosphorus''P' 15 30.9737619980000
'Sulfur' 'S' 16 32.0600000000000
'Chlorine' 'Cl' 17 35.4500000000000
'Bromine' 'Br' 35 79.9040000000000
'Sodium' 'Na' 11 22.9897692800000
'Magnesium' 'Mg' 12 24.3050000000000
My code so far is:
function[molar_mass] = molarMass(formula, information)
Names = []; %// Creates a Name array
[~,c] = size(information); %Finds the rows and columns of the table
for i = 1:c %Reads through the columns
Molecules = getfield(information(:,i), 'Name'); %Finds the numbers in the 'Name' area
Names = [Names {Molecules}];
end
Symbols = [];
[~, c2] = size(information);
for i = 1:c2 %Reads through the columns
Symbs = getfield(information(:,i), 'Symbol'); %Finds the numbers in the 'Symbol'
Symbols = [Symbols {Symbs}];
end
AN = [];
[~, c3] = size(information);
for i = 1:c3 %Reads through the columns
Atom = getfield(information(:,i), 'AtomicNumber'); %Finds the numbers in the 'AtomicWeight' area
AN = [AN {Atom}];
end
Wt = [information(:).AtomicWeight];
formula_parts = strsplit(formula, ','); % cell array of strings
total_mass = 0;
multi = [];
atoms = [];
Indices = [];
for ipart = 1:length(formula_parts)
part = formula_parts{ipart}; % Takes in the string
isdigit = (part >= '0') & (part <= '9'); % A boolean array
atom = part(~isdigit); % Select all chars that are not digits
Indixes = find(strcmp(Symbols, atom));
Indices = [Indices {Indixes}];
mole = atom;
atoms = [atoms {mole}];
natoms = part(isdigit); % Select all chars that are digits
% Convert natoms string to numbers, default to 1 if missing
if length(natoms) == 0
natoms = '1';
multi = [multi {natoms}];
else
natoms = num2str(natoms);
multi = [multi {natoms}];
end
end
multi = char(multi);
multi = str2num(multi); %Creates a number array with my multipliers
f=56;
Molecule_Wt = Wt{Indices};
duck = 62;
total_mass = total_mass + atom_weight * multi;
end
Thanks to Bas Swinckels I can now extract the numbers from the formulas, but what I'm struggling with now is how to pull out the weights associated with the symbols. I created my own weight_chart, but strcmp won't work there. Neither will strfind or strmatch. What I want to do is find the formulas in my input, in the chart. Then index it from that index, to the column (so add 1 I believe). How do I find the indices though? I'd prefer to find them in the order the strings appear in my input, since I can then apply my 'multi' array to it.
Any help/suggestions would be appreciated :)
Given the string, you can pull out the part that is a digit character with the isstrprop function. Then use that to address your string to get just those characters, then cast that as a double with str2double.
PartialString = 'H12';
Subscript = str2double (PartialString (isstrprop (PartialString, 'digit')));
This should get you started, there is still some parts that need to be filled in:
formula_parts = strsplit(formula, ','); % cell array of strings
total_mass = 0;
for ipart = 1:length(formula_parts)
part = formula_parts{ipart}; % string like 'H2'
isdigit = isstrprop(part, 'digit'); % boolean array
atom = part(~isdigit); % select all chars that are not digits
natoms = part(isdigit); % select all chars that are digits
% convert natoms string to int, default to 1 if missing
if length(natoms) == 0
natoms = 1;
else
natoms = num2str(natoms);
end
% calculate weight
atom_weight = lookup_weight(atom); % somehow look up value in table
total_mass = total_mass + atom_weight * natoms;
end
See this old question about how to extract letters or digits from a string.