Is there an easy way to match values of a list to array in R?

I have several (named) vectors in a list:
data = list(a=runif(n = 50, min = 1, max = 10), b=runif(n = 50, min = 1, max = 10), c=runif(n = 50, min = 1, max = 10), d=runif(n = 50, min = 1, max = 10))
I want to play around with different combinations of them depending on what an array tells me. For example, I want to sum across the different combinations in combs:
var <- letters[1:length(data)]
combs <- do.call(expand.grid, lapply(var, function(x) c("", x)))[-1,]
And get the sums for each row of these combinations. So the results for the first 8 rows would look like this:
res = rbind(a=sum(data[["a"]]), b=sum(data[["b"]]), ab = sum(c(data[["a"]], data[["b"]])), c = sum(data[["c"]]), ac = sum(c(data[["a"]], data[["c"]])), bc = sum(c(data[["b"]], data[["c"]])), abc = sum(c(data[["a"]], data[["b"]], data[["c"]])), d=sum(data[["d"]]))
I think it is possible by extracting the list of data and looping through each row and each column (I would have a variable number of columns, though), but this seems quite clunky and slow. Is there a better way that I am not seeing?
Thanks so much!
Fra
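One way to avoid looping over rows and columns is to note that the sum of several concatenated vectors equals the sum of their individual sums, so each vector only needs to be summed once. The sketch below illustrates that idea in Python rather than R; the function name subset_sums and the toy data are assumptions made for illustration, and the combinations come out grouped by size rather than in expand.grid order.

```python
from itertools import combinations


def subset_sums(data):
    """Return the total for every non-empty combination of named vectors."""
    # Sum each vector once; a combination's total is the sum of these.
    totals = {name: sum(values) for name, values in data.items()}
    names = list(data)
    result = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            result["".join(combo)] = sum(totals[name] for name in combo)
    return result


# Toy stand-in for the list of named vectors in the question.
example = {"a": [1, 2], "b": [3], "c": [4, 5]}
print(subset_sums(example))
```

With n vectors this still produces 2^n - 1 totals, but the per-vector sums are never recomputed.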

Related

Bootstrapping the uncertainty on an RMSE estimate of a location-scale generalized additive model

I have height data (numeric height data in cm; Height) of plants measured over time (numeric data expressed in days of the year; Doy). These data are grouped per genotype (factor data; Genotype) and individual plant (factor data; Individual). I've managed to calculate the RMSE of the location-scale GAM, but I can't figure out how to bootstrap the uncertainty estimate on the RMSE calculation, given that it is a hierarchical location-scale generalized additive model.
The code to extract the RMSE value looks something like this:
# The GAM
model <- gam(list(Height ~ s(Doy, bs = 'ps', by = Genotype) +
s(Doy, Individual, bs = "re") +
Genotype,
~ s(Doy, bs = 'ps', by = Genotype) +
s(Doy, Individual, bs = "re") +
Genotype),
family = gaulss(), # Gaussian location-scale
method = "REML",
data = data)
# Extract the model formula
form <- formula.gam(model)
# Cross-validation for the location
CV <- CVgam(form[[1]], data, nfold = 10, debug.level = 0, method = "GCV.Cp",
printit = TRUE, cvparts = NULL, gamma = 1, seed = 29)
# The root mean square error is given by taking the square root of the MSE
sqrt(CV$cvscale[1])
There is only one height measurement per Individual per day of the year. I figure this is problematic for maintaining the exact same formulation of the GAM. In this regard, I was thinking of making sure that the same few Individuals of each genotype (let's say n = 4) were randomly sampled over each day of the year. I can't figure out how to proceed though. Any ideas?
I've tried several methods, such as the boot package and for loops. An example of one of things I've tried is:
lm=list();counter=0
lm2=list()
loops = 3
RMSE <- numeric(loops) # storage for the RMSE of each resample
for (i in 1:loops){
datax <- data %>%
group_by(Doy, Genotype) %>%
slice_sample(prop = 0.6, replace = T)
datax
model <- gam(list(Height ~ s(Doy, bs = 'ps', by = Genotype) +
s(Doy, Individual, bs = "re") +
Genotype,
~ s(Doy, bs = 'ps', by = Genotype) +
s(Doy, Individual, bs = "re") +
Genotype),
family = gaulss(),
method = "REML",
data = datax)
# Extract the model formula
form <- formula.gam(model)
# Cross-validation for the location
CV <- CVgam(form[[1]], datax, nfold = 10, debug.level = 0, method = "GCV.Cp",
printit = TRUE, cvparts = NULL, gamma = 1, seed = 29)
RMSE[i] <- sqrt(CV$cvscale[c(1)])
}
RMSE
This loop runs very slowly and just returns the same RMSE value three times; surely there is an issue with the sampling.
Unfortunately, I can't share my data but maybe somebody has an idea on how to proceed?
Many thanks!

Generate a matrix of combinations (permutation) without repetition (array exceeds maximum array size preference)

I am trying to generate a matrix, that has all unique combinations of [0 0 1 1], I wrote this code for this:
v1 = [0 0 1 1];
M1 = unique(perms([0 0 1 1]),'rows');
• This isn't ideal, because perms() treats each vector element as unique and computes:
4! = 4 * 3 * 2 * 1 = 24 permutations.
• With unique() I tried to delete all the repeated entries, so I end up with the combination matrix M1 →
only 4!/(2! * (4-2)!) = 6 combinations!
Now, when I try to do something very simple like:
n = 15;
i = 1;
v1 = [zeros(1,n-i) ones(1,i)];
M = unique(perms(v1),'rows');
• Instead of getting 15!/(1! * (15-1)!) = 15 combinations, the perms() function is trying to do
15! = 1.3077e+12 permutations and it's interrupted.
• How would you go about doing this in a much better way? Thanks in advance!
You can use nchoosek to return the indices which should be 1. I think in your heart you knew this must be possible, because you were using the definition of nchoosek to determine the expected final number of permutations! So we can use:
idx = nchoosek( 1:N, k );
Where N is the number of elements in your array v1, and k is the number of elements which have the value 1. Then it's simply a case of creating the zeros array and populating the ones.
v1 = [0, 0, 1, 1];
N = numel(v1); % number of elements in array
k = nnz(v1); % number of non-zero elements in array
colidx = nchoosek( 1:N, k ); % column index for ones
rowidx = repmat( 1:size(colidx,1), k, 1 ).'; % row index for ones
M = zeros( size(colidx,1), N ); % create output
M( rowidx(:) + size(M,1) * (colidx(:)-1) ) = 1;
This works for both of your examples without the need for a huge intermediate matrix.
Aside: since you'd have the indices using this approach, you could instead create a sparse matrix, but whether that's a good idea or not would depend on what you're doing after this point.
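For readers outside MATLAB, the same idea (choose the positions of the ones, then fill them in) can be sketched in Python with itertools.combinations. The function name binary_rows is hypothetical, and the row order may differ from what nchoosek produces.

```python
from itertools import combinations


def binary_rows(n, k):
    """All distinct length-n rows containing exactly k ones."""
    rows = []
    # Each combination is one choice of column positions for the ones.
    for ones_at in combinations(range(n), k):
        row = [0] * n
        for col in ones_at:
            row[col] = 1
        rows.append(row)
    return rows


print(len(binary_rows(4, 2)))   # rows for [0 0 1 1]
print(len(binary_rows(15, 1)))  # rows for the n = 15, i = 1 case
```

Only the final rows are ever stored, so the 15! intermediate permutations are never generated.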

How many random requests do I need to make to a set of records to get 80% of the records?

Suppose I have an array of 100_000 records (this is Ruby code, but any language will do):
ary = ['apple','orange','dog','tomato', 12, 17,'cat','tiger' .... ]
results = []
I can only make random calls to the array ( I cannot traverse it in any way)
results << ary.sample
# in ruby this will pull a random record from the array, and
# push into results array
How many random calls like that do I need to make to get at least 80% of the records from ary? Or, expressed another way: what should the size of results be so that results.uniq contains around 80_000 records from ary?
From my rusty memory of a college stats class, I think it needs to be about 2 * the result set size, or around 160_000 requests (assuming the random function is truly random and there is no other underlying issue). My testing seems to confirm this.
ary = [*1..100_000];
result = [];
160_000.times{result << ary.sample};
result.uniq.size # ~ 80k
This is stats, so we are talking about probabilities, not guaranteed results. I just need a reasonable guess.
So the question really, what's the formula to confirm this?
I would just perform a quick simulation study. In R,
N = 1e5
# Simulate 300 times
s = replicate(300, sample(x = 1:N, size = 1.7e5, replace = TRUE))
Now work out when you hit your target
f = function(i) which(i == unique(i)[80000])[1]
stats = apply(s, 2, f)
To get
summary(stats)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 159711 160726 161032 161037 161399 162242
So in 300 trials, the maximum number of draws needed was 162242, with a mean of 161037 (median 161032).
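These simulated figures agree with a closed-form estimate. Each draw misses a given record with probability 1 - 1/N, so after n draws the expected fraction of distinct records seen is 1 - (1 - 1/N)^n, which is approximately 1 - e^(-n/N). Setting this to 0.8 gives n ≈ N * ln(5) ≈ 1.61 * N. A minimal Python sketch of that calculation follows; the function name draws_needed is made up for illustration, and the result is an expectation, not a guarantee.

```python
import math


def draws_needed(n_records, fraction):
    """Draws with replacement until the expected share of distinct
    records seen reaches `fraction` (solves 1 - (1 - 1/N)**n = fraction)."""
    return math.log(1 - fraction) / math.log(1 - 1 / n_records)


n = draws_needed(100_000, 0.8)
print(round(n))
```

For N = 100_000 this lands close to 161_000 draws, in line with the simulation summary.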
With a Fisher-Yates shuffle you could get 80K items from exactly 80K random calls.
I have no knowledge of Ruby, but here is a modified version of https://gist.github.com/mindplace/3f3a08299651ebf4ab91de3d83254fbc:
def partial_shuffle(array, count)
  n = array.length
  count.times do |i|
    # item selected from the unshuffled part of the array (positions i..n-1)
    random_index = i + rand(n - i)
    # swap the items at those locations
    array[i], array[random_index] = array[random_index], array[i]
  end
  array
end
indices = (0...ary.length).to_a # 0 up to 99999
counter = 80_000
partial_shuffle(indices, counter)
res = []
counter.times do |i|
  res[i] = ary[indices[i]]
end
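The partial Fisher-Yates idea can also be sketched in Python. This assumes the records can be addressed by index (which the question's "random calls only" constraint may not allow); the function name partial_fisher_yates and the rng parameter are illustrative choices, not part of the original answer.

```python
import random


def partial_fisher_yates(n, k, rng=random):
    """Return k distinct indices from range(n) using exactly k random calls."""
    indices = list(range(n))
    for i in range(k):
        # Pick uniformly from the positions that have not been fixed yet.
        j = rng.randrange(i, n)
        indices[i], indices[j] = indices[j], indices[i]
    return indices[:k]


picked = partial_fisher_yates(100_000, 80_000, random.Random(1))
print(len(picked), len(set(picked)))
```

Because every pick comes from the not-yet-chosen positions, no draw is wasted on a duplicate.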
UPDATE
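Packing the sampled indices into a custom RNG (bear with me, I know nothing about Ruby):
class FYRandom
  def initialize(indices, max)
    @indices = indices
    @max = max
    @idx = 0
  end
  # Assumption: Array#sample calls rand(limit) on a custom generator and
  # expects an Integer below limit, so the next pre-shuffled index is returned as-is.
  def rand(_limit = nil)
    raise "all #{@max} sampled indices have been used" if @idx >= @max
    r = @indices[@idx]
    @idx += 1
    r
  end
end
And the code for sampling would be:
rng = FYRandom.new(indices, 80_000)
results << ary.sample(random: rng)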

numpy binned mean, conserving extra axes

It seems I am stuck on the following problem with numpy.
I have an array X with shape: X.shape = (nexp, ntime, ndim, npart)
I need to compute binned statistics on this array along npart dimension, according to the values in binvals (and some bins), but keeping all the other dimensions there, because I have to use the binned statistic to remove some bias in the original array X. Binning values have shape binvals.shape = (nexp, ntime, npart).
Here is a complete, minimal example to explain what I am trying to do. Note that, in reality, I am working on large arrays and with several hundreds of bins (so this implementation takes forever):
import numpy as np
np.random.seed(12345)
X = np.random.randn(24).reshape(1,2,3,4)
binvals = np.random.randn(8).reshape(1,2,4)
bins = [-np.inf, 0, np.inf]
nexp, ntime, ndim, npart = X.shape
cleanX = np.zeros_like(X)
for ne in range(nexp):
    for nt in range(ntime):
        indices = np.digitize(binvals[ne, nt, :], bins)
        for nd in range(ndim):
            for nb in range(1, len(bins)):
                inds = indices==nb
                cleanX[ne, nt, nd, inds] = X[ne, nt, nd, inds] - \
                    np.mean(X[ne, nt, nd, inds], axis = -1)
Looking at the results of this may make it clearer?
In [8]: X
Out[8]:
array([[[[-0.20470766, 0.47894334, -0.51943872, -0.5557303 ],
[ 1.96578057, 1.39340583, 0.09290788, 0.28174615],
[ 0.76902257, 1.24643474, 1.00718936, -1.29622111]],
[[ 0.27499163, 0.22891288, 1.35291684, 0.88642934],
[-2.00163731, -0.37184254, 1.66902531, -0.43856974],
[-0.53974145, 0.47698501, 3.24894392, -1.02122752]]]])
In [10]: cleanX
Out[10]:
array([[[[ 0. , 0.67768523, -0.32069682, -0.35698841],
[ 0. , 0.80405255, -0.49644541, -0.30760713],
[ 0. , 0.92730041, 0.68805503, -1.61535544]],
[[ 0.02303938, -0.02303938, 0.23324375, -0.23324375],
[-0.81489739, 0.81489739, 1.05379752, -1.05379752],
[-0.50836323, 0.50836323, 2.13508572, -2.13508572]]]])
In [12]: binvals
Out[12]:
array([[[ -5.77087303e-01, 1.24121276e-01, 3.02613562e-01,
5.23772068e-01],
[ 9.40277775e-04, 1.34380979e+00, -7.13543985e-01,
-8.31153539e-01]]])
Is there a vectorized solution? I thought of using scipy.stats.binned_statistic, but I seem to be unable to understand how to use it for this aim. Thanks!
import numpy as np
np.random.seed(100)
nexp = 3
ntime = 4
ndim = 5
npart = 100
nbins = 4
binvals = np.random.rand(nexp, ntime, npart)
X = np.random.rand(nexp, ntime, ndim, npart)
bins = np.linspace(0, 1, nbins + 1)
d = np.digitize(binvals, bins)[:, :, np.newaxis, :]
r = np.arange(1, len(bins)).reshape((-1, 1, 1, 1, 1))
m = d[np.newaxis, ...] == r
counts = np.sum(m, axis=-1, keepdims=True).clip(min=1)
means = np.sum(X[np.newaxis, ...] * m, axis=-1, keepdims=True) / counts
cleanX = X - np.choose(d - 1, means)
OK, I think I got it, mainly based on the answer by @jdehesa.
clean2 = np.zeros_like(X)
d = np.digitize(binvals, bins)
for i in range(1, len(bins)):
    m = d == i
    minds = np.where(m)
    # use a tuple here: recent NumPy versions reject a list as a multidimensional index
    sl = (*minds[:2], slice(None), minds[2])
    msum = m.sum(axis=-1)
    clean2[sl] = (X - \
        (np.sum(X * m[...,np.newaxis,:], axis=-1) /
         msum[..., np.newaxis])[..., np.newaxis])[sl]
Which gives the same results as my original code.
On the small arrays I have in the example here, this solution is approximately three times as fast as the original code. I expect it to be way faster on larger arrays.
Update:
Indeed it's faster on larger arrays (I didn't do any formal test), but despite this, it only just reaches an acceptable level of performance... any further suggestions on extra vectorization would be very welcome.
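One possible further vectorization, offered as a sketch and not as a benchmarked solution, is to remove the loop over bins entirely by giving every (experiment, time, dim, bin) cell a flat group id and accumulating with np.bincount. The function name remove_binned_mean is hypothetical, and the sketch assumes every value in binvals falls inside the range covered by bins.

```python
import numpy as np


def remove_binned_mean(X, binvals, bins):
    """Subtract from X the mean over npart of the bin each particle falls in."""
    nexp, ntime, ndim, npart = X.shape
    nb = len(bins) - 1
    # 0-based bin index per particle, repeated along the ndim axis.
    d = np.digitize(binvals, bins) - 1
    d = np.broadcast_to(d[:, :, np.newaxis, :], X.shape)
    # One integer id per (experiment, time, dim, bin) cell.
    row = np.arange(nexp * ntime * ndim).reshape(nexp, ntime, ndim, 1)
    gid = (row * nb + d).ravel()
    ngroups = nexp * ntime * ndim * nb
    sums = np.bincount(gid, weights=X.ravel(), minlength=ngroups)
    counts = np.bincount(gid, minlength=ngroups)
    means = sums / np.maximum(counts, 1)  # empty cells are never looked up
    return X - means[gid].reshape(X.shape)


np.random.seed(12345)
X = np.random.randn(24).reshape(1, 2, 3, 4)
binvals = np.random.randn(8).reshape(1, 2, 4)
print(remove_binned_mean(X, binvals, [-np.inf, 0, np.inf]))
```

Memory use scales with the size of X plus the number of cells, so it does not build an array with an extra axis of length nbins.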

Finding molar mass with MATLAB

Alrighty, so I have a pretty 'simple' problem on my hands. I am given two inputs for my function: a string that gives the formula of the equation and a structure that contains the information I need and looks like this:
Name
Symbol
AtomicNumber
AtomicWeight
To find the molecular weight, I have to take all of the elements in the formula, find their total mass and add them all together. For example, let's say that I have to find the molecular weight of water. The formula would look like:
H2,O
The molecular weight will thus be
2*(Hydrogen's weight) + (Oxygen's weight), which evaluates to 18.015.
There will always be a comma separating the different elements in a formula. What I am having trouble with right now, is taking the number out of the string(the formula). I feel like I'm over-complicating how I am going about extracting it. If there's a number, I know it can be in positions 2 or 3 (depending on the element name). I tried to use isnumeric, I tried to do some really weird, coding stuff (which you'll see below), but I am having difficulties.
test case:
mass5 = molarMass('C,H2,Br,C,H2,Br', table)
mass5 => 187.862
table:
Name          Symbol  AtomicNumber  AtomicWeight
'Carbon'      'C'     6             12.0110000000000
'Hydrogen'    'H'     1             1.00800000000000
'Nitrogen'    'N'     7             14.0070000000000
'Oxygen'      'O'     8             15.9990000000000
'Phosphorus'  'P'     15            30.9737619980000
'Sulfur'      'S'     16            32.0600000000000
'Chlorine'    'Cl'    17            35.4500000000000
'Bromine'     'Br'    35            79.9040000000000
'Sodium'      'Na'    11            22.9897692800000
'Magnesium'   'Mg'    12            24.3050000000000
My code so far is:
function[molar_mass] = molarMass(formula, information)
Names = []; %// Creates a Name array
[~,c] = size(information); %Finds the rows and columns of the table
for i = 1:c %Reads through the columns
Molecules = getfield(information(:,i), 'Name'); %Finds the numbers in the 'Name' area
Names = [Names {Molecules}];
end
Symbols = [];
[~, c2] = size(information);
for i = 1:c2 %Reads through the columns
Symbs = getfield(information(:,i), 'Symbol'); %Finds the numbers in the 'Symbol'
Symbols = [Symbols {Symbs}];
end
AN = [];
[~, c3] = size(information);
for i = 1:c3 %Reads through the columns
Atom = getfield(information(:,i), 'AtomicNumber'); %Finds the numbers in the 'AtomicWeight' area
AN = [AN {Atom}];
end
Wt = [information(:).AtomicWeight];
formula_parts = strsplit(formula, ','); % cell array of strings
total_mass = 0;
multi = [];
atoms = [];
Indices = [];
for ipart = 1:length(formula_parts)
part = formula_parts{ipart}; % Takes in the string
isdigit = (part >= '0') & (part <= '9'); % A boolean array
atom = part(~isdigit); % Select all chars that are not digits
Indixes = find(strcmp(Symbols, atom));
Indices = [Indices {Indixes}];
mole = atom;
atoms = [atoms {mole}];
natoms = part(isdigit); % Select all chars that are digits
% Convert natoms string to numbers, default to 1 if missing
if length(natoms) == 0
natoms = '1';
multi = [multi {natoms}];
else
natoms = num2str(natoms);
multi = [multi {natoms}];
end
end
multi = char(multi);
multi = str2num(multi); %Creates a number array with my multipliers
f=56;
Molecule_Wt = Wt{Indices};
duck = 62;
total_mass = total_mass + atom_weight * multi;
end
Thanks to Bas Swinckels I can now extract the numbers from the formulas, but what I'm struggling with now is how to pull out the weights associated with the symbols. I created my own weight_chart, but strcmp won't work there. Neither will strfind or strmatch. What I want to do is find the formulas in my input, in the chart. Then index it from that index, to the column (so add 1 I believe). How do I find the indices though? I'd prefer to find them in the order the strings appear in my input, since I can then apply my 'multi' array to it.
Any help/suggestions would be appreciated :)
Given the string, you can pull out the part that is a digit character with the isstrprop function. Then use that to address your string to get just those characters, then cast that as a double with str2double.
PartialString = 'H12';
Subscript = str2double (PartialString (isstrprop (PartialString, 'digit')));
This should get you started; there are still some parts that need to be filled in:
formula_parts = strsplit(formula, ','); % cell array of strings
total_mass = 0;
for ipart = 1:length(formula_parts)
    part = formula_parts{ipart}; % string like 'H2'
    isdigit = isstrprop(part, 'digit'); % boolean array
    atom = part(~isdigit); % select all chars that are not digits
    natoms = part(isdigit); % select all chars that are digits
    % convert natoms string to a number, default to 1 if missing
    if isempty(natoms)
        natoms = 1;
    else
        natoms = str2double(natoms);
    end
    % calculate weight
    atom_weight = lookup_weight(atom); % somehow look up value in table
    total_mass = total_mass + atom_weight * natoms;
end
See this old question about how to extract letters or digits from a string.
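To show the whole calculation end to end, including the symbol lookup that is left open above, here is a minimal sketch in Python rather than MATLAB. The dictionary ATOMIC_WEIGHT copies the weights from the table in the question; the names molar_mass and ATOMIC_WEIGHT are assumptions for illustration, and the parser only handles the comma-separated "symbol plus optional count" format described in the question.

```python
import re

# Symbol -> atomic weight, copied from the table in the question.
ATOMIC_WEIGHT = {
    "C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "P": 30.973761998,
    "S": 32.06, "Cl": 35.45, "Br": 79.904, "Na": 22.98976928, "Mg": 24.305,
}


def molar_mass(formula, weights=ATOMIC_WEIGHT):
    """Sum weight * count over comma-separated parts such as 'H2' or 'Br'."""
    total = 0.0
    for part in formula.split(","):
        match = re.fullmatch(r"([A-Za-z]+)(\d*)", part.strip())
        if match is None:
            raise ValueError(f"cannot parse formula part {part!r}")
        symbol, count = match.group(1), match.group(2)
        # A missing count means a single atom.
        total += weights[symbol] * (int(count) if count else 1)
    return total


print(round(molar_mass("C,H2,Br,C,H2,Br"), 3))
print(round(molar_mass("H2,O"), 3))
```

Keying the lookup by symbol is what makes the order of the parts irrelevant: each part is resolved and multiplied by its own count as it is parsed.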
