Matrix-like operations for 3D arrays in R? - arrays

I'm currently estimating a model in R using optim, but it's really slow, on the order of 30 minutes if I initialize it with zeroes. When I profile the whole thing, I find that apply is taking the most time, which makes sense. So that leads me to my question:
x.arr <- array(1:9, c(3, 10, 3))
b <- 1:3
f <- function(x, b) {
  exp(x %*% b)
}
u.mat <- apply(x.arr, 2, f, b = b)
Is there a more efficient way to do this? x.arr is a 3D array, so it seems like there ought to be some way to use matrix operations to accomplish the same thing.
Additionally, I run Linux, so I assume I could also do something with mclapply, but every time I've attempted it, I've managed to hang my entire R session.
There's also the tensor package, but everything I've tried from it so far was so far removed from what I was actually looking for that I wasn't even sure what I was getting back.
My linear algebra isn't the best, but something tells me there ought to be some sort of good option without using apply.

As these things go, I found a solution that speeds it up considerably using the tensor package. (I spent 4 hours on this yesterday, but apparently today things just clicked.)
require(tensor)
x.arr <- array(1:9, c(3, 10, 3))
b <- 1:3
u.mat <- exp(tensor(x.arr, b, alongA = 3, alongB = 1))
This brings me from ~30 minutes down to around ~10 minutes.
I'm still interested if anyone has an idea of how to make it faster, of course, but maybe if someone else finds this question, this will at least be a satisfactory answer for them.

Related

Indexing Julia's DataArrays with included NA values

I am wondering why indexing Julia's DataArrays with NA values is not possible.
Executing the snippet below results in an error (NAException("cannot index an array with a DataArray containing NA values")):
dm = data([1 4 7; 2 5 8; 3 1 9])
dm[dm .== 5] = NA
dm[dm .< 3] = 1 #Error
dm[(!isna(dm)) & (dm .< 3)] = 1 #Working
There is a solution for ignoring NA's in a DataFrame with isna(), as answered here. At first glance it works as it should, and ignoring NA's in DataFrames is the same approach as for DataArrays, because each column of a DataFrame is a DataArray, as stated here. But in my opinion, ignoring missing values with !isna() in every condition is not the best solution.
It's not clear to me why the DataFrames module throws an error if NA's are included. If the boolean array used for indexing has NA values, those values should be converted to false, as MATLAB® or Python's Pandas do. In the DataArrays module's source code (shown below), in indexing.jl, there is an explicit function that throws the NAException:
# Indexing with NA throws an error
function Base.to_index(A::DataArray)
    any(A.na) && throw(NAException("cannot index an array with a DataArray containing NA values"))
    Base.to_index(A.data)
end
If you change the snippet by setting the NA's to false ...
# Indexing with NA throws an error
function Base.to_index(A::DataArray)
    A[A.na] = false
    any(A.na) && throw(NAException("cannot index an array with a DataArray containing NA values"))
    Base.to_index(A.data)
end
... dm[dm .< 3] = 1 works as it should (as in MATLAB® or Pandas).
To me it makes no sense to automatically throw an error when NA's are encountered during indexing. There should at least be a parameter when creating the DataArray that lets the user choose whether NA's are ignored. There are two significant reasons: on the one hand, it's not very pleasant to write and read code when you have formulas with a lot of indexing and NA values (e.g. calculating meteorological grid models), and on the other hand there is a noticeable loss of performance, as this timing test shows:
#timeit dm[(!isna(dm)) & (dm .< 3)] = 1 #14.55 µs per loop
#timeit dm[dm .< 3] = 1 #754.79 ns per loop
Why do the developers make use of this exception, and is there a simpler approach than !isna() for ignoring NA's in DataArrays?
Suppose you have three rabbits. You want to put the female rabbit(s) in a separate cage from the males. You look at the first rabbit, and it looks like a male, so you leave it where it is. You look at the second rabbit, and it looks like a female, so you move it to the separate cage. You can't really get a good look at the third rabbit. What should you do?
It depends. Maybe you're fine with leaving the rabbit of unknown sex behind. But if you're separating out the rabbits because you don't want them to make baby rabbits, then you might want your analysis software to tell you that it doesn't know the sex of the third rabbit.
Situations like this arise often when analyzing data. In the most pathological cases, data is missing systematically rather than at random. If you were to survey a bunch of people about how fluffy rabbits are and whether they should be eaten more, you could compare mean(fluffiness[should_be_eaten_more]) and mean(fluffiness[!should_be_eaten_more]). But, if people who really like rabbits are incensed that you're talking about eating them at all, they might leave that second question blank. If you ignore that, you will underestimate the mean fluffiness rating among people who don't think rabbits should be eaten more, which would be a grave mistake. This is why fluffiness[!should_be_eaten_more] will throw an error if there are missing values: It is a sign that whatever you are trying to do with your data may not give the right results. This situation is bad enough that people write entire papers about it, e.g. this one.
Enough about rabbits. It is possible that there should be (and may someday be) a more concise way to drop/keep all missing values when indexing, but it will always be explicit rather than implicit for the reason described above. As far as performance goes, while there is a slowdown for isna(x) & (x < 3) vs x < 3, the overhead of repeatedly indexing into an array is also high, and DataArrays adds additional overhead on top of that. The relative overhead decreases as the array gets larger. If this is a bottleneck in your code, your best bet is to write it differently.

What is the most efficient way to read a CSV file into an Accelerate (or Repa) Array?

I am interested in playing around with the Accelerate library, and I would like to perform some operations on data stored inside of a CSV file. I've read this excellent introduction to Accelerate, but I'm not sure how I can go about reading CSVs into Accelerate efficiently. I've thought about this, and the only thing I can think of is to parse the entire CSV file into one long list, and then feed the entire list into Accelerate.
My data sets will be quite large, and it doesn't seem efficient to read a 1 GB+ file into memory only to copy it somewhere else. I noticed there was a CSV Enumerator package on Hackage, but I'm not sure how to use it with Accelerate's generate function. Another constraint is that it seems the dimensions of the Array, or at least the number of elements, must be known before generating an array using Accelerate.
Has anyone dealt with this kind of problem before?
Thanks!
I am not sure if this is 100% applicable to accelerate or repa, but here is one way I've handled this for Vector in the past:
-- | A hopefully-efficient sink that incrementally grows a vector from the input stream
sinkVector :: (PrimMonad m, GV.Vector v a) => Int -> ConduitM a o m (Int, v a)
sinkVector by = do
    v <- lift $ GMV.new by
    go 0 v
  where
    -- i is the index of the next element to be written by go
    -- also exactly the number of elements in v so far
    go i v = do
        res <- await
        case res of
          Nothing -> do
            v' <- lift $ GV.freeze $ GMV.slice 0 i v
            return $! (i, v')
          Just x -> do
            v' <- case GMV.length v == i of
                    True  -> lift $ GMV.grow v by
                    False -> return v
            lift $ GMV.write v' i x
            go (i+1) v'
It basically allocates by empty slots and proceeds to fill them. Once it hits the ceiling, it grows the underlying vector once again. I haven't benchmarked anything, but it appears to perform OK in practice. I am curious to see if there will be other more efficient answers here.
Hope this helps in some way. I do see there's a fromVector function in repa and perhaps that's your golden ticket in combination with this method.
I haven't tried reading CSV files into repa but I recommend using cassava (http://hackage.haskell.org/package/cassava). Iirc I had a 1.5G file which I used to create my stats. With cassava, my program ran in a surprisingly small amount of memory. Here's an extended example of usage:
http://idontgetoutmuch.wordpress.com/2013/10/23/parking-in-westminster-an-analysis-in-haskell/
In the case of repa, if you add rows incrementally to an array (which it sounds like you want to do) then one would hope the space usage would also grow incrementally. It certainly is worth an experiment. And possibly also contacting the repa folks. Please report back on your results :-)

1D Number Array Clustering

So let's say I have an array like this:
[1,1,2,3,10,11,13,67,71]
Is there a convenient way to partition the array into something like this?
[[1,1,2,3],[10,11,13],[67,71]]
I looked through similar questions, yet most people suggested using k-means to cluster points, such as scipy, which is quite confusing for a beginner like me. Also, I think k-means is more suitable for clustering in two or more dimensions, right? Is there any way to partition an array of N numbers into partitions/clusters depending on the numbers?
Some people also suggest rigid range partitioning, but it doesn't always render the results as expected.
Don't use multidimensional clustering algorithms for a one-dimensional problem. A single dimension is much more special than you naively think, because you can actually sort it, which makes things a lot easier.
In fact, it is usually not even called clustering, but e.g. segmentation or natural breaks optimization.
You might want to look at Jenks Natural Breaks Optimization and similar statistical methods. Kernel Density Estimation is also a good method to look at, with a strong statistical background. Local minima in density are good places to split the data into clusters, with statistical reasons to do so. KDE is maybe the most sound method for clustering 1-dimensional data.
With KDE, it again becomes obvious that 1-dimensional data is much more well behaved. In 1D, you have local minima; but in 2D you may have saddle points and other such "maybe" splitting points. See this Wikipedia illustration of a saddle point for how such a point may or may not be appropriate for splitting clusters.
See this answer for an example of how to do this in Python (green markers are the cluster modes; red markers are points where the data is cut; the y axis is the log-likelihood of the density).
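As a rough sketch of that idea (my own illustration rather than the linked answer's code; SciPy's default bandwidth and this toy data are assumptions), you can estimate a density with gaussian_kde and cut the data at its local minima:
# Sketch: estimate the density, then split the data at local minima of the density.
# The bandwidth is SciPy's default, so the exact grouping may differ.
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import argrelextrema

data = np.array([1, 1, 2, 3, 10, 11, 13, 67, 71], dtype=float)

kde = gaussian_kde(data)
grid = np.linspace(data.min(), data.max(), 1000)
density = kde(grid)

# Local minima of the estimated density are candidate cut points between clusters.
cuts = grid[argrelextrema(density, np.less)[0]]

edges = np.concatenate(([-np.inf], cuts, [np.inf]))
clusters = [data[(data >= lo) & (data < hi)].tolist()
            for lo, hi in zip(edges[:-1], edges[1:])]
clusters = [c for c in clusters if c]   # drop empty intervals
print(clusters)
The grouping you get depends on the bandwidth; with only a handful of points, a narrower bandwidth separates the small gaps more readily.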
This simple algorithm works:
points = [0.1, 0.31, 0.32, 0.45, 0.35, 0.40, 0.5]
clusters = []
eps = 0.2
points_sorted = sorted(points)
curr_point = points_sorted[0]
curr_cluster = [curr_point]
for point in points_sorted[1:]:
    if point <= curr_point + eps:
        curr_cluster.append(point)
    else:
        clusters.append(curr_cluster)
        curr_cluster = [point]
    curr_point = point
clusters.append(curr_cluster)
print(clusters)
The above example clusters points into groups, such that each element in a group is at most eps away from another element in the group. This is like the clustering algorithm DBSCAN with eps=0.2, min_samples=1. As others noted, 1d data allows you to solve the problem directly, instead of using the bigger guns like DBSCAN.
The above algorithm was 10-100x faster than DBSCAN for the small datasets (<1000 elements) I tested.
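For comparison, here is a quick way to reproduce that grouping with scikit-learn's DBSCAN (assuming scikit-learn is available; this is just an illustration, not part of the original answer):
import numpy as np
from sklearn.cluster import DBSCAN

points = [0.1, 0.31, 0.32, 0.45, 0.35, 0.40, 0.5]
# DBSCAN expects a 2-D array of samples, hence the reshape to a single feature column.
labels = DBSCAN(eps=0.2, min_samples=1).fit_predict(np.array(points).reshape(-1, 1))
print(labels)   # one cluster label per input point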
You may look into discretization algorithms. The 1D discretization problem is a lot like what you are asking: they decide cut-off points according to frequency, binning strategy, etc.
Weka uses the following algorithms in its discretization process.
weka.filters.supervised.attribute.Discretize
uses either Fayyad & Irani's MDL method or Kononenko's MDL criterion
weka.filters.unsupervised.attribute.Discretize
uses simple binning
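As a minimal illustration of the unsupervised "simple binning" idea (a NumPy sketch with an arbitrary bin count, not Weka's code):
import numpy as np

values = np.array([1, 1, 2, 3, 10, 11, 13, 67, 71])
n_bins = 3

# Equal-width bin edges over the value range; the interior edges define the bins.
edges = np.linspace(values.min(), values.max(), n_bins + 1)
labels = np.digitize(values, edges[1:-1])
print(labels)   # bin index (0..n_bins-1) for each value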
CKwrap is a fast and straightforward k-means clustering function, though a bit light on documentation.
Example Usage
pip install ckwrap
import numpy as np
import ckwrap

nums = np.array([1, 1, 2, 3, 10, 11, 13, 67, 71])
km = ckwrap.ckmeans(nums, 3)
print(km.labels)
# [0 0 0 0 1 1 1 2 2]

buckets = [[], [], []]
for i in range(len(nums)):
    buckets[km.labels[i]].append(nums[i])
print(buckets)
# [[1, 1, 2, 3], [10, 11, 13], [67, 71]]
I expect the authors intended you to make use of the nd array functionality rather than create a list of lists.
other measures:
km.centers
km.k
km.sizes
km.totss
km.betweenss
km.withinss
The underlying algorithm is based on this article.
Late response and just for the record. You can partition a 1D array using Ckmeans.1d.dp.
This method guarantees optimality and runs in O(n^2) time, where n is the number of observations. The implementation is in C++, and there is a wrapper in R.
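To make the dynamic-programming idea concrete, here is a rough Python sketch of optimal 1-D k-means by within-cluster sum of squares (an illustration of the approach with my own naming, not the Ckmeans.1d.dp code):
import numpy as np

def optimal_1d_kmeans(values, k):
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    # Prefix sums give O(1) within-cluster sum-of-squares queries.
    s1 = np.concatenate(([0.0], np.cumsum(x)))
    s2 = np.concatenate(([0.0], np.cumsum(x * x)))

    def ssq(j, i):
        # Sum of squared deviations of x[j..i] (inclusive) from their mean.
        cnt = i - j + 1
        total = s1[i + 1] - s1[j]
        return (s2[i + 1] - s2[j]) - total * total / cnt

    INF = float("inf")
    dp = [[INF] * (n + 1) for _ in range(k + 1)]   # dp[m][i]: best cost for the first i points in m clusters
    cut = [[0] * (n + 1) for _ in range(k + 1)]    # cut[m][i]: start index of the last cluster
    dp[0][0] = 0.0
    for m in range(1, k + 1):
        for i in range(m, n + 1):
            for j in range(m - 1, i):              # last cluster is x[j..i-1]
                cand = dp[m - 1][j] + ssq(j, i - 1)
                if cand < dp[m][i]:
                    dp[m][i] = cand
                    cut[m][i] = j

    # Backtrack the cluster boundaries.
    clusters, i = [], n
    for m in range(k, 0, -1):
        j = cut[m][i]
        clusters.append(x[j:i].tolist())
        i = j
    return clusters[::-1]

print(optimal_1d_kmeans([1, 1, 2, 3, 10, 11, 13, 67, 71], 3))
# -> [[1.0, 1.0, 2.0, 3.0], [10.0, 11.0, 13.0], [67.0, 71.0]]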
The code for Has QUIT--Anony-Mousse's answer to Clustering values by their proximity in python (machine learning?)
When you have 1-dimensional data, sort it, and look for the largest gaps
I only added that gaps need to be relatively large
import numpy as np
from scipy.signal import argrelextrema

# lst = [1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]
lst = [1, 1, 2, 3, 10, 11, 13, 67, 71]
lst.sort()

diff = [lst[i] - lst[i-1] for i in range(1, len(lst))]
rel_diff = [diff[i] / lst[i] for i in range(len(diff))]
arg = argrelextrema(np.array(rel_diff), np.greater)[0]

last = 0
for x in arg:
    print(f'{last}:{x + 1} {lst[last:x + 1]}')
    last = x + 1
print(f'{last}: {lst[last:]}')
output:
0:2 [1, 1]
2:4 [2, 3]
4:7 [10, 11, 13]
7: [67, 71]

what to do with a flawed C++ skills test

In the following gcc.gnu.org post, Nathan Myers says that a C++ skills test at SANS Consulting Services contained three errors in nine questions:
Looking around, one of the first online C++ skills tests I ran across was:
http://www.geekinterview.com/question_details/13090
I looked at question 1...
find(int x,int y)
{ return ((x<y)?0:(x-y)):}
call find(a,find(a,b)) use to find
(a) maximum of a,b
(b) minimum of a,b
(c) positive difference of a,b
(d) sum of a,b
... immediately wondering why anyone would write anything so obtuse. Getting past the absurdity, I didn't really like any of the answers, immediately eliminating (a) and (b) because you can get back zero (which is neither a nor b) in a variety of circumstances. Sum or difference seemed more likely, except that you could also get zero regardless of the magnitudes of a and b. So... I put Matlab to work (code below) and found: when either a or b is negative you get zero; when b > a you get a; otherwise you get b. So the answer is (b), min(a,b), if a and b are positive, though strictly speaking the answer should be none of the above because there are no range restrictions on either variable. That forces test takers into a dilemma - choose the best available answer and be wrong in 3 of 4 quadrants, or don't answer, leaving the door open to the conclusion that the grader thinks you couldn't figure it out.
The solution for test givers is to fix the test, but in the interim, what's the right course of action for test takers? Complain about the questions?
function z = findfunc(x,y)
    for i = 1:length(x)
        if x(i) < y(i)
            z(i) = 0;
        else
            z(i) = x(i) - y(i);
        end
    end
end

function [b,d1,z] = plotstuff()
    k = 50;
    a = [-k:1:k];
    b = (2*k+1) * rand(length(a),1) - k;
    d1 = findfunc(a,b);
    z = findfunc(a,d1);
    plot( a, b, 'r.', a, d1, 'g-', a, z, 'b-');
end
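For anyone without Matlab, here is a quick equivalent check (a sketch in Python; the find function below just mirrors the C++ snippet from the test):
import random

def find(x, y):
    # Mirrors the test's snippet: 0 if x < y, otherwise x - y.
    return 0 if x < y else x - y

for _ in range(10_000):
    a = random.randint(-50, 50)
    b = random.randint(-50, 50)
    r = find(a, find(a, b))
    if a >= 0 and b >= 0:
        assert r == min(a, b)   # answer (b) holds, but only in this quadrant
    else:
        assert r == 0           # a negative a or b collapses the result to zero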
Why are you wasting your time taking tests such as the online one you linked to? That one is so bad that words are not enough to describe the horror.
What you're supposed to do in this case is wash your eyes with soap, get drunk and hope you won't remember anything in the morning...
I had the same issue on a test a few years ago.
The options were A, B, C, or D.
I wrote in option E with my answer and then clearly explained why the other four were wrong.
The test was taken remotely, and I got a call for an on-site interview the same day.
...you can take it for what it's worth.
I prefer to write notes on the test explaining where the test is invalid. I am also willing to discuss these items with interviewers.
I like to stand by my convictions against horrible code and especially code fragments on tests that are never used or very seldom used in the real world.

Minimize function in adjacent items of an array

I have an array (arr) of elements, and a function (f) that takes 2 elements and returns a number.
I need a permutation of the array such that f(arr[i], arr[i+1]) is as small as possible for each i in arr. (It should also wrap around, i.e. it should also minimize f(arr[arr.length - 1], arr[0]).)
Also, f works sort of like a distance, so f(a,b) == f(b,a)
I don't need the optimal solution if it's too inefficient, but one that works reasonably well and is fast, since I need to calculate them pretty much in real time (I don't know what the length of arr is, but I think it could be something around 30).
What does "such that f(arr[i], arr[i+1]) is as little as possible for each i in arr" mean? Do you want minimize the sum? Do you want to minimize the largest of those? Do you want to minimize f(arr[0],arr[1]) first, then among all solutions that minimize this, pick the one that minimizes f(arr[1],arr[2]), etc., and so on?
If you want to minimize the sum, this is exactly the Traveling Salesman Problem in its full generality (well, "metric TSP", maybe, if your f's indeed form a metric). There are clever optimizations to the naive solution that will give you the exact optimum and run in reasonable time for about n=30; you could use one of those, or one of the heuristics that give you approximations.
If you want to minimize the maximum, it is a simpler problem although still NP-hard: you can do binary search on the answer; for a particular value d, draw edges for pairs which have f(x,y) <= d and check whether the resulting graph has a Hamiltonian cycle.
If you want to minimize it lexicographically, it's trivial: pick the pair with the shortest distance and put it as arr[0],arr[1], then pick arr[2] that is closest to arr[1], and so on.
Depending on where your f(,)s are coming from, this might be a much easier problem than TSP; it would be useful for you to mention that as well.
You're not entirely clear what you're optimizing - the sum of the f(a[i],a[i+1]) values, the max of them, or something else?
In any event, with your speed limitations, greedy is probably your best bet - pick an element to make a[0] (it doesn't matter which due to the wraparound), then choose each successive element a[i+1] to be the one that minimizes f(a[i],a[i+1]).
That's going to be O(n^2), but with 30 items, unless this is in an inner loop or something, that will be fine. If your f() really is associative and commutative, then you might be able to do it in O(n log n); clearly no faster than that, by reduction from sorting.
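A minimal sketch of that greedy approach (my own illustration; the distance function passed in stands in for f):
def greedy_order(arr, f):
    # Greedy nearest-neighbour ordering: O(n^2) calls to f.
    remaining = list(arr)
    order = [remaining.pop(0)]   # the starting element is arbitrary since the tour wraps around
    while remaining:
        last = order[-1]
        nxt = min(remaining, key=lambda x: f(last, x))   # closest remaining element
        remaining.remove(nxt)
        order.append(nxt)
    return order

# Example with plain numbers and absolute difference as the "distance" f:
print(greedy_order([10, 1, 7, 3], lambda a, b: abs(a - b)))   # [10, 7, 3, 1]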
I don't think the problem is well-defined in this form:
Let's instead define n functions g_i : Perms -> Reals
g_i(p) = f(a^p[i], a^p[i+1]), wrapping around when i+1 > n
To say you want to minimize f over all permutations really implies you can pick a value of i and minimize g_i over all permutations, but for any p which minimizes g_i, a related but different permutation minimizes g_j (just conjugate the permutation). Therefore it makes no sense to speak of minimizing f over permutations for each i.
Unless we know something more about the structure of f(x,y), this is an NP-hard problem. Given a graph G and any vertices x,y, let f(x,y) be 1 if there is no edge and 0 if there is an edge. What the problem asks for is an ordering of the vertices so that the maximum f(arr[i],arr[i+1]) value is minimized. Since for this function the value can only be 0 or 1, getting 0 is equivalent to finding a Hamiltonian path in G, and getting 1 says that no such path exists.
The function would have to have some sort of structure that disallows this example for it to be tractable.

Resources