Cross between "dotimes" and "for" functionality? - loops

I frequently find myself wanting to efficiently run a Clojure function multiple times with an integer index (like "dotimes") but also get the results out as a ready-made sequence/list (like "for").
i.e. I'd like to do something like this:
(fortimes [i 10] (* i i))
=> (0 1 4 9 16 25 36 49 64 81)
Clearly it would be possible to do:
(for [i (range 10)] (* i i))
But I'd like to avoid creating and throwing away the temporary range list if at all possible.
What's the best way to achieve this in Clojure?

Generating a range in a for loop, as you show in your second example, is the idiomatic solution for solving this problem in Clojure.
Since Clojure is grounded in the functional paradigm, programming in Clojure, by default, will generate temporary data structures like this. However, since both the "range" and the "for" command operate with lazy sequences, writing this code does not force the entire temporary range data structure to exist in memory at once. If used properly, there is therefore a very low memory overhead for lazy seqs as used in this example. Also, the computational overhead for your example is modest and should only grow linearly with the size of the range. This is considered an acceptable overhead for typical Clojure code.
The appropriate way to completely avoid this overhead, if the temporary range list is absolutely, positively unacceptable for your situation, is to write your code using atoms or transients: http://clojure.org/transients. If you do this, however, you will give up many of the advantages of the Clojure programming model in exchange for slightly better performance.

I've written an iteration macro that can do this and other types of iteration very efficiently. The package is called clj-iterate, both on github and clojars. For example:
user> (iter {for i from 0 to 10} {collect (* i i)})
(0 1 4 9 16 25 36 49 64 81 100)
This will not create a temporary list.

I'm not sure why you're concerned with "creating and throwing away" the lazy sequence created by the range function. The bounded iteration done by dotimes is likely more efficient, it being an inline increment and compare with each step, but you may pay an additional cost to express your own list concatenation there.
The typical Lisp solution is to prepend new elements to a list that you build as you go, then reverse that built-up list destructively to yield the return value. Other techniques to allow appending to a list in constant time are well known, but they do not always prove to be more efficient than the prepend-then-reverse approach.
In Clojure, you can use transients to get there, relying on the destructive behavior of the conj! function:
(let [r (transient [])]
  (dotimes [i 10]
    (conj! r (* i i)))  ;; destructive
  (persistent! r))
That seems to work, but the documentation on transients warns that one should not use conj! to "bash values in place"—that is, to count on destructive behavior in lieu of catching the return value. Hence, that form needs to be rewritten.
In order to rebind r above to the new value yielded by each call to conj!, we'd need to use an atom to introduce one more level of indirection. At that point, though, we're just fighting against dotimes, and it would be better to write your own form using loop and recur.
It would be nice to be able to preallocate the vector to be of the same size as the iteration bound. I don't see a way to do so.
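For completeness, here is a hedged sketch of that loop/recur rewrite, rebinding the transient returned by conj! rather than bashing it in place (the helper name fortimes* is made up for illustration):
(defn fortimes* [n f]
  ;; build a vector of (f 0) ... (f (dec n)) without an intermediate range
  (loop [i 0
         acc (transient [])]
    (if (< i n)
      (recur (inc i) (conj! acc (f i)))
      (persistent! acc))))
;; (fortimes* 10 #(* % %)) => [0 1 4 9 16 25 36 49 64 81]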

(defmacro fortimes [[i end] & code]
  `(let [finish# ~end]
     (loop [~i 0 results# '()]
       (if (< ~i finish#)
         (recur (inc ~i) (cons (do ~@code) results#))
         (reverse results#)))))
example:
(fortimes [x 10] (* x x))
gives:
(0 1 4 9 16 25 36 49 64 81)

Hmm, can't seem to answer your comment because I wasn't registered. However, clj-iterate uses a PersistentQueue, which is part of the runtime library, but not exposed through the reader.
It's basically a list on which you can conj to the end.
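As a small hedged illustration (not taken from clj-iterate itself), the queue can be reached through its class, since there is no reader syntax for it:
(def q (into clojure.lang.PersistentQueue/EMPTY [1 2 3]))
(seq (conj q 4)) ;; => (1 2 3 4), conj adds at the rear
(peek q)         ;; => 1, the front of the queue
(seq (pop q))    ;; => (2 3)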

Related

How to get maximum value of matrix containing linear expressions?

I hope that someone can help me.
For the solution of an optimisation problem I have to get the maximum of a Matrix containing linear expressions to minimize this value in a second step.
For example I have the unbounded decision variables x and y
x.append(m.addVar(vtype=GRB.CONTINUOUS, lb=-GRB.INFINITY, ub=+GRB.INFINITY, name="x"))
y.append(m.addVar(vtype=GRB.CONTINUOUS, lb=-GRB.INFINITY, ub=+GRB.INFINITY, name="y"))
and the Matrix M = [0.25*x,0.25*x+y].
The maximum of the Matrix should be saved as M_max. Later the objective is to minimize M_max --> m.setObjective( M_max , GRB.MINIMIZE)
When I try it by typing M_max = amax(M), I always get back the first element, here 0.25*x. What operation returns the "real" maximum value? (Of course my model is more complicated, but I hope that you can understand my problem.)
Thanks a lot for your help!
The manual approach (see the sketch after this list) would be:
introduce aux-var z (-inf, inf, cont)
add constraints
0.25*x <= z
0.25*x+y <= z
minimize (z)
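A minimal gurobipy sketch of that reformulation, assuming the two variables and the matrix M from the question:
from gurobipy import Model, GRB

m = Model()
x = m.addVar(vtype=GRB.CONTINUOUS, lb=-GRB.INFINITY, ub=+GRB.INFINITY, name="x")
y = m.addVar(vtype=GRB.CONTINUOUS, lb=-GRB.INFINITY, ub=+GRB.INFINITY, name="y")
z = m.addVar(vtype=GRB.CONTINUOUS, lb=-GRB.INFINITY, ub=+GRB.INFINITY, name="z")  # aux-var for the maximum

# z is an upper bound on every entry of M = [0.25*x, 0.25*x + y] ...
m.addConstr(0.25 * x <= z)
m.addConstr(0.25 * x + y <= z)

# ... and minimizing z pushes it down onto the largest entry
m.setObjective(z, GRB.MINIMIZE)
m.optimize()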
Not sure if Gurobi nowadays provides some automatic way.
Edit: It seems newer Gurobi versions provide this functionality (automatic reformulation), as explained here (Python docs; you need to check whether those are also available for your interface, which may well be Python):
max_ ( variables )
Used to set a decision variable equal to the maximum of a list of decision variables (or constants). You can pass the arguments as a Python list or as a comma-separated list.
# example probably based on the assumption:
# import: from gurobipy import *
m.addConstr(z == max_(x, y, 3))
m.addConstr(z == max_([x, y, 3]))
You did not show what amax is. If it's numpy's amax or anything else from outside of Gurobi, you cannot use it! Gurobi variables don't behave like classic floating-point variables, and every operation on those variable objects needs to be backed by Gurobi (often hidden through operator overloading), or else Gurobi can't make sure it's building a valid mathematical model.

computing function of neighbors efficiently on lattice

I'm studying the Ising model, and I'm trying to efficiently compute a function H(σ), where σ is the current state of an L×L lattice (that is, σ_ij ∈ {+1, -1} for i,j ∈ {1,2,...,L}). To compute H for a particular σ, I need to perform the following calculation:
H(σ) = -J Σ_⟨i j⟩ σ_i σ_j
where ⟨i j⟩ indicates that sites σ_i and σ_j are nearest neighbors and (suppose) J is a constant.
A couple of questions:
Should I store my state σ as an L×L matrix or as a flat list of length L²? Is one better than the other for memory access in RAM (which I guess depends on the way I'm accessing elements...)?
In either case, how can I best compute H?
Really I think this boils down to how can I access (and manipulate) the neighbors of every state most efficiently.
Some thoughts:
I see that if I loop through each element in the list or matrix I'll be double counting, so is there a "best" way to return the unique neighbors?
Is there a better data structure that I'm not thinking of?
Your question is a bit broad and a bit confusing for me, so excuse me if my answer is not the one you are looking for, but I hope it will help (a bit).
An array is faster than a list when it comes to indexing. A matrix is a 2D array: an N×M grid of elements (where N and M are both L for you). That means that you first access a[i] and then a[i][j].
However, you can avoid this double access by emulating the 2D array with a 1D array. In that case, if you want to access element a[i][j] in your matrix, you would now do a[i * L + j].
That way you load only once, at the cost of a multiply and an add to compute the index, which may still be faster in some cases.
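A tiny hedged illustration of that flat-index emulation (the names are made up):
L = 4
flat = [0] * (L * L)          # one contiguous block instead of a list of rows

def get_site(i, j):
    return flat[i * L + j]    # row-major: row i, column j

def set_site(i, j, value):
    flat[i * L + j] = value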
Now as for the Nearest Neighbor question, it seems that you are using a square-lattice Ising model, which means that you are working in 2 dimensions.
A very efficient data structure for Nearest Neighbor Search in low dimensions is the kd-tree. The construction of that tree takes O(n log n), where n is the size of your dataset.
Now you should think if it's worth it to build such a data structure.
PS: There is a plethora of libraries implementing the kd-tree, such as CGAL.
I encountered this problem during one of my school assignments and I think the solution depends on which programming language you are using.
In terms of efficiency, there is no better way than to write a for loop summing the neighbours (which are the four points (i±1, j) and (i, j±1) for a given (i, j)). However, when SIMD (SSE etc.) instructions are available, you can re-express this as a convolution with a 2D kernel {0 1 0; 1 0 1; 0 1 0}. So if you use a numerical library which exploits SIMD you can obtain a significant performance increase. You can see an example implementation of this here: https://github.com/zawlin/cs5340/blob/master/a1_code/denoiseIsingGibbs.py
Note that in this case the performance improvement is huge, because evaluating it in pure Python would otherwise require an expensive for loop.
In terms of work there is in fact some waste, namely the unnecessary multiplications and additions with the zeros at the kernel's corners and center. So whether you see a performance improvement depends quite a bit on your programming environment (if you are already in C/C++, it can be difficult, and you may need to use MKL etc. to obtain a good improvement). See the sketch below.
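A hedged NumPy/SciPy sketch of that convolution trick, assuming periodic boundaries (switch boundary to 'fill' for free edges):
import numpy as np
from scipy.signal import convolve2d

def ising_energy(sigma, J=1.0):
    # kernel that sums the four nearest neighbours of each site
    kernel = np.array([[0, 1, 0],
                       [1, 0, 1],
                       [0, 1, 0]])
    neighbour_sum = convolve2d(sigma, kernel, mode='same', boundary='wrap')
    # each bond is counted twice when summing over all sites, hence the 0.5
    return -J * 0.5 * np.sum(sigma * neighbour_sum)

L = 8
sigma = np.random.choice([-1, 1], size=(L, L))
print(ising_energy(sigma))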

Indexing Julia's DataArrays with included NA values

I am wondering why indexing Julia's DataArrays with NA values is not possible.
Executing the snippet below results in an error (NAException("cannot index an array with a DataArray containing NA values")):
dm = data([1 4 7; 2 5 8; 3 1 9])
dm[dm .== 5] = NA
dm[dm .< 3] = 1 #Error
dm[(!isna(dm)) & (dm .< 3)] = 1 #Working
There is a solution for ignoring NAs in a DataFrame with isna(), as answered here. At first glance it works as it should, and ignoring NAs in DataFrames is the same approach as for DataArrays, because each column of a DataFrame is a DataArray, as stated here. But in my opinion, ignoring missing values with !isna() on every condition is not the best solution.
It's not clear to me why the DataArrays module throws an error if NAs are included. If the boolean array needed for indexing has NA values, those values should be converted to false, as MATLAB® or Python's Pandas do. In the DataArrays module's source code (shown below, from indexing.jl), there is an explicit function to throw the NAException:
# Indexing with NA throws an error
function Base.to_index(A::DataArray)
    any(A.na) && throw(NAException("cannot index an array with a DataArray containing NA values"))
    Base.to_index(A.data)
end
If you change the snippet by setting the NA's to false ...
# Indexing with NA throws an error
function Base.to_index(A::DataArray)
    A[A.na] = false
    any(A.na) && throw(NAException("cannot index an array with a DataArray containing NA values"))
    Base.to_index(A.data)
end
... dm[dm .< 3] = 1 works as it should (as in MATLAB® or Pandas).
For me it makes no sense to automatically throw an error if NAs are included in indexing. There should at least be a parameter when creating the DataArray that lets the user choose whether NAs are ignored. There are two significant reasons: on the one hand it's not very pleasant for writing and reading code when you have formulas with a lot of indexing and NA values (e.g. calculating meteorological grid models), and on the other hand there is a noticeable loss of performance, as this timing test shows:
@timeit dm[(!isna(dm)) & (dm .< 3)] = 1 # 14.55 µs per loop
@timeit dm[dm .< 3] = 1 # 754.79 ns per loop
What is the reason the developers make use of this exception, and is there a simpler approach than !isna() for ignoring NAs in DataArrays?
Suppose you have three rabbits. You want to put the female rabbit(s) in a separate cage from the males. You look at the first rabbit, and it looks like a male, so you leave it where it is. You look at the second rabbit, and it looks like a female, so you move it to the separate cage. You can't really get a good look at the third rabbit. What should you do?
It depends. Maybe you're fine with leaving the rabbit of unknown sex behind. But if you're separating out the rabbits because you don't want them to make baby rabbits, then you might want your analysis software to tell you that it doesn't know the sex of the third rabbit.
Situations like this arise often when analyzing data. In the most pathological cases, data is missing systematically rather than at random. If you were to survey a bunch of people about how fluffy rabbits are and whether they should be eaten more, you could compare mean(fluffiness[should_be_eaten_more]) and mean(fluffiness[!should_be_eaten_more]). But, if people who really like rabbits are incensed that you're talking about eating them at all, they might leave that second question blank. If you ignore that, you will underestimate the mean fluffiness rating among people who don't think rabbits should be eaten more, which would be a grave mistake. This is why fluffiness[!should_be_eaten_more] will throw an error if there are missing values: It is a sign that whatever you are trying to do with your data may not give the right results. This situation is bad enough that people write entire papers about it, e.g. this one.
Enough about rabbits. It is possible that there should be (and may someday be) a more concise way to drop/keep all missing values when indexing, but it will always be explicit rather than implicit for the reason described above. As far as performance goes, while there is a slowdown for isna(x) & (x < 3) vs x < 3, the overhead of repeatedly indexing into an array is also high, and DataArrays adds additional overhead on top of that. The relative overhead decreases as the array gets larger. If this is a bottleneck in your code, your best bet is to write it differently.
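As one hedged example of "writing it differently", an explicit loop over the same .data and .na fields used in the snippets above avoids building the boolean masks entirely:
for i in 1:length(dm.data)
    # skip missing entries before comparing, so NA never reaches the < test
    if !dm.na[i] && dm.data[i] < 3
        dm.data[i] = 1
    end
end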

Updating values in an array with logical indexing with a non-constant value

A common problem I encounter when I want to write concise/readable code:
I want to update all the values of a vector matching a logical expression with a value that depends on the previous value.
For example, double all even entries:
weights = [10 7 4 8 3];
weights(mod(weights,2)==0) = weights(mod(weights,2)==0) * 2;
% weights = [20 7 8 16 3]
Is it possible to write the second line in a more concise fashion (i.e. avoiding the double use of the logical expression, something like i += 3 for i = i + 3 in other languages)? I often use this kind of vector operation in different contexts/variables, and with long conditionals I feel that my code is less concise and readable than it could be.
Thanks!
How about
ind = mod(weights,2)==0;
weights(ind) = weights(ind)*2;
This way you avoid calculating the indices twice and it's easy to read.
Regarding your other comment to Wauzl: such powerful matrix-operation capabilities come from the Fortran side. This is purely MATLAB's design, which is quickly getting obsolete. Let's take this horribleness further:
for i=1:length(weights), if mod(weights(i),2)==0, weights(i)=weights(i)*2; end, end
It is even slightly faster than your original two-liner, because there the conditional indexing is done twice, once on each side of the assignment. In general, consider switching to Python 3.
Well, after more searching around, I found this link that deals with this issue (I used search before posting, I swear!), and there is interesting further discussion of this topic in the links in that thread. So apparently there are issues with ambiguity when introducing such an operator.
Looks like that is the price we have to pay in terms of syntactic limitations for having such powerful matrix operation capabilities.
Thanks a lot anyway, Wauzl!

What is the most efficient way to read a CSV file into an Accelerate (or Repa) Array?

I am interested in playing around with the Accelerate library, and I would like to perform some operations on data stored inside of a CSV file. I've read this excellent introduction to Accelerate, but I'm not sure how I can go about reading CSVs into Accelerate efficiently. I've thought about this, and the only thing I can think of is to parse the entire CSV file into one long list, and then feed the entire list into Accelerate.
My data sets will be quite large, and it doesn't seem efficient to read a 1 GB+ file into memory only to copy it somewhere else. I noticed there is a CSV Enumerator package on Hackage, but I'm not sure how to use it with Accelerate's generate function. Another constraint is that it seems the dimensions of the array, or at least the number of elements, must be known before generating an array with Accelerate.
Has anyone dealt with this kind of problem before?
Thanks!
I am not sure if this is 100% applicable to accelerate or repa, but here is one way I've handled this for Vector in the past:
import Control.Monad.Primitive (PrimMonad)
import Control.Monad.Trans.Class (lift)
import Data.Conduit (ConduitM, await)
import qualified Data.Vector.Generic as GV
import qualified Data.Vector.Generic.Mutable as GMV

-- | A hopefully-efficient sink that incrementally grows a vector from the input stream
sinkVector :: (PrimMonad m, GV.Vector v a) => Int -> ConduitM a o m (Int, v a)
sinkVector by = do
    v <- lift $ GMV.new by
    go 0 v
  where
    -- i is the index of the next element to be written by go
    -- also exactly the number of elements in v so far
    go i v = do
        res <- await
        case res of
            Nothing -> do
                v' <- lift $ GV.freeze $ GMV.slice 0 i v
                return $! (i, v')
            Just x -> do
                v' <- case GMV.length v == i of
                    True  -> lift $ GMV.grow v by
                    False -> return v
                lift $ GMV.write v' i x
                go (i+1) v'
It basically allocates by empty slots and proceeds to fill them. Once it hits the ceiling, it grows the underlying vector once again. I haven't benchmarked anything, but it appears to perform OK in practice. I am curious to see if there will be other more efficient answers here.
Hope this helps in some way. I do see there's a fromVector function in repa and perhaps that's your golden ticket in combination with this method.
I haven't tried reading CSV files into repa, but I recommend using cassava (http://hackage.haskell.org/package/cassava). IIRC I had a 1.5 GB file which I used to create my stats; with cassava, my program ran in a surprisingly small amount of memory. Here's an extended example of usage:
http://idontgetoutmuch.wordpress.com/2013/10/23/parking-in-westminster-an-analysis-in-haskell/
In the case of repa, if you add rows incrementally to an array (which it sounds like you want to do) then one would hope the space usage would also grow incrementally. It certainly is worth an experiment. And possibly also contacting the repa folks. Please report back on your results :-)
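As a hedged sketch of the cassava route (the headerless layout of Doubles and the loadCsv name are assumptions, and this decodes the whole file strictly rather than incrementally):
import qualified Data.ByteString.Lazy as BL
import qualified Data.Csv as Csv
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed as VU
import Data.Array.Repa (Array, DIM2, U, Z(..), (:.)(..), fromUnboxed)

-- Decode a headerless CSV of Doubles and wrap it in a rank-2 repa array.
loadCsv :: FilePath -> IO (Array U DIM2 Double)
loadCsv path = do
    bytes <- BL.readFile path
    case Csv.decode Csv.NoHeader bytes of
        Left err   -> error err
        Right rows -> do
            let nRows = V.length rows
                nCols = if nRows == 0 then 0 else V.length (V.head rows)
                flat  = VU.convert (V.concatMap id rows) :: VU.Vector Double
            return (fromUnboxed (Z :. nRows :. nCols) flat)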
