Indexing Julia's DataArrays with included NA values - arrays

I am wondering why indexing Julia's DataArrays with NA values is not possible.
Executing the snippet below results in an error (NAException("cannot index an array with a DataArray containing NA values")):
dm = data([1 4 7; 2 5 8; 3 1 9])
dm[dm .== 5] = NA
dm[dm .< 3] = 1 #Error
dm[(!isna(dm)) & (dm .< 3)] = 1 #Working
There is a solution for ignoring NAs in a DataFrame with isna(), as answered here. At first glance it works as it should, and ignoring NAs in a DataFrame is the same approach as for DataArrays, because each column of a DataFrame is a DataArray, as stated here. But in my opinion, masking out missing values with !isna() in every condition is not the best solution.
It's not clear to me why the DataFrames module throws an error if NAs are included. If the boolean array used for indexing contains NA values, those values should be converted to false, as MATLAB® or Python's Pandas do. In the DataArrays module's source code (shown below), in indexing.jl, there is an explicit function that throws the NAException:
# Indexing with NA throws an error
function Base.to_index(A::DataArray)
    any(A.na) && throw(NAException("cannot index an array with a DataArray containing NA values"))
    Base.to_index(A.data)
end
If you change the function so that it first sets the NAs to false ...
# Indexing with NA no longer throws: NAs are first overwritten with false
function Base.to_index(A::DataArray)
    A[A.na] = false
    any(A.na) && throw(NAException("cannot index an array with a DataArray containing NA values"))
    Base.to_index(A.data)
end
... dm[dm .< 3] = 1 works as it should (as in MATLAB® or Pandas).
To me it makes no sense to automatically throw an error when NAs appear in an index. There should at least be a parameter when creating the DataArray that lets the user choose whether NAs are ignored. There are two significant reasons: on the one hand, it's not very pleasant for writing and reading code when you have formulas with a lot of indexing and NA values (e.g. calculating meteorological grid models), and on the other hand there is a noticeable loss of performance, as this timing test shows:
@timeit dm[(!isna(dm)) & (dm .< 3)] = 1 # 14.55 µs per loop
@timeit dm[dm .< 3] = 1 # 754.79 ns per loop
What is the reason the developers make use of this exception, and is there a simpler approach than !isna() for ignoring NAs in DataArrays?

Suppose you have three rabbits. You want to put the female rabbit(s) in a separate cage from the males. You look at the first rabbit, and it looks like a male, so you leave it where it is. You look at the second rabbit, and it looks like a female, so you move it to the separate cage. You can't really get a good look at the third rabbit. What should you do?
It depends. Maybe you're fine with leaving the rabbit of unknown sex behind. But if you're separating out the rabbits because you don't want them to make baby rabbits, then you might want your analysis software to tell you that it doesn't know the sex of the third rabbit.
Situations like this arise often when analyzing data. In the most pathological cases, data is missing systematically rather than at random. If you were to survey a bunch of people about how fluffy rabbits are and whether they should be eaten more, you could compare mean(fluffiness[should_be_eaten_more]) and mean(fluffiness[!should_be_eaten_more]). But, if people who really like rabbits are incensed that you're talking about eating them at all, they might leave that second question blank. If you ignore that, you will underestimate the mean fluffiness rating among people who don't think rabbits should be eaten more, which would be a grave mistake. This is why fluffiness[!should_be_eaten_more] will throw an error if there are missing values: It is a sign that whatever you are trying to do with your data may not give the right results. This situation is bad enough that people write entire papers about it, e.g. this one.
Enough about rabbits. It is possible that there should be (and may someday be) a more concise way to drop/keep all missing values when indexing, but it will always be explicit rather than implicit for the reason described above. As far as performance goes, while there is a slowdown for isna(x) & (x < 3) vs x < 3, the overhead of repeatedly indexing into an array is also high, and DataArrays adds additional overhead on top of that. The relative overhead decreases as the array gets larger. If this is a bottleneck in your code, your best bet is to write it differently.
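If the explicit "drop the NAs" choice comes up constantly, one option is to centralize it in a small helper so each formula states the intent only once. A minimal sketch, assuming the .na/.data layout shown in the indexing.jl excerpt above; the name dropna_mask is mine, not part of DataArrays:
# Treat NA as false in a boolean mask, making "missing means do not select" explicit.
function dropna_mask(mask::DataArray{Bool})
    out = falses(size(mask))
    for i in 1:length(mask)
        out[i] = !mask.na[i] && mask.data[i]
    end
    out
end
dm[dropna_mask(dm .< 3)] = 1   # NAs count as "do not select"
This keeps the decision explicit, which is the point above, while avoiding a repeated !isna(...) & in every expression.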

Related

Code optimization - cycle

I want to run a large block of code until at least one of the elements of array 1 is equal to one of the elements of array 2.
I'm asking the community to share, if possible, the best (fastest to process) ways to do this "while".
Summed up:
while (none of the elements from arr1 is equal to any element of arr2)
{
    (code)
}
Reason: in my code, depending on some dimensions set by the user, my program may need to make this O(n^2) comparison a lot of times, so I'm looking for a way to make it as light as I can.
I'm sorry in advance, and please let me know, if this type of question is not suitable for StackOverflow.
Edit: my bad for not giving information about the arrays. As I said, their dimensions may vary based on what the user chooses, but each one's size should be between 3 and 1000. Both are arrays of integers.
Their values do change; the bigger the dimensions, the more often that can happen.
The comments mention a hash, and I agree it could work, but it's very size-dependent: for small arrays, the O(n^2) approach's overhead will often be negligible.
Otherwise, just add all the elements of arr1 into the hash set, then go through arr2's elements to see if they're in there. You'll get O(n) time. Honestly, though, unless you're working with element counts in the hundred thousands or even millions, I don't think the payoff will be that tangible, but it's machine-dependent and I haven't tested it myself.
C++ has std::unordered_set in the standard library. If you're using pure C, I'm sure there are implementations available online.
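A minimal sketch of that approach, assuming C++11 and the integer arrays from the question (the function name share_element is mine):
#include <unordered_set>
#include <vector>
// True if arr1 and arr2 share at least one element.
// Building the set is O(n) on average; probing arr2 is O(m) on average.
bool share_element(const std::vector<int>& arr1, const std::vector<int>& arr2) {
    std::unordered_set<int> seen(arr1.begin(), arr1.end());
    for (int v : arr2) {
        if (seen.count(v)) return true;
    }
    return false;
}
The loop then becomes while (!share_element(arr1, arr2)) { /* code */ }, and if arr1 doesn't change between iterations you can hoist the set construction out of the loop.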

Optimal Selection in Ruby

Given an array of values,
arr = [8,10,4,5,3,7,6,0,1,9,13,2]
X is an array of values chosen from arr at a time, where X.length != 0 and X.length < arr.length.
The chosen values are then fed into a function, score(), which returns a score based on the array of selected values.
Example 1:
X = [8]
score(X) = 71
Example 2:
X = [4]
score(X) = 36
Example 3:
X = [8,10,7]
score(X) = 51
Example 4:
X = [5,9,0]
score(X) = 4
The function score() here is a blackbox and we can't modify how the function works, we just provide an input and the function will return the score output.
My problem: How to get the lowest score for each set of numbers?
Meaning, if X is an array that holds only 1 value and I feed in each of the different values in arr, every value will return a different score, and I find which arr value produces the lowest score.
If X is an array of 3 values, I feed in every possible combination of 3 values from arr, with each set of 3 values returning a different score, and find the lowest score.
This is simple enough to do if arr is small. However, if I have an array of 50 or even 100 values, how can I create an algorithm that finds the lowest score for a given number of input values?
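For concreteness, this is the brute-force enumeration the question implies, sketched in Ruby on the assumption that score is the blackbox described above; the answer below analyzes exactly this approach's cost:
arr = [8, 10, 4, 5, 3, 7, 6, 0, 1, 9, 13, 2]
# For each allowed subset size k, try every combination and keep the lowest score.
best = {}
(1...arr.length).each do |k|
  best[k] = arr.combination(k).map { |x| score(x) }.min
end
Array#combination(k) yields all C(n, k) subsets, so the total number of score calls is just under 2^n, which is what makes this explode.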
tl;dr: If you don't know anything about score, then you can't speed it up.
In order to optimize score itself, you would have to know how it works. After all, "optimizing" simply means "doing the same thing more efficiently", but how can you know whether it really does "the same thing" if you don't know what "the same thing" is? Plus, speeding up score will not help you with the combinatorial explosion anyway. The number of combinations grows so fast that any speedup to score will quickly be eaten up by slightly larger inputs.
In order to optimize how you apply score, you would again need to know something about it. If you knew something about score, you could, for example, only generate combinations that you know will yield different values, or combinations that you know will only yield larger values. In other words, you could exploit some structure in the output of score in order to reduce the input size. However, we don't know the structure of the output of score, in fact, we don't even know if there is some structure at all! So we can't exploit it. Plus, there would have to be some extreme redundancy and regularity in the structure, in order for a significant reduction in input size.
In his comment, @ndn suggested applying some form of machine learning to discover structure in the output. How well this works depends on what kind of structure the output has. And of course, this again assumes that there even is some structure to discover, which we don't know. And again, even if there were some structure, it would have to be very redundant and regular to make up for the combinatorial explosion of the input space.
Really, brute force is the only way. Our last straw is going to be parallelization. Maybe, if we distribute the problem across enough CPU cores, we can tackle it? Unfortunately, the combinatorial explosion in the input space is still really going to hurt you:
If we assume that we have a 10THz CPU (i.e. a thousand times faster than the fastest currently available CPU), and we assume that we can compute score in a single clock cycle, and we assume that we have a computer with 10 million cores (again, that's a thousand times larger than the largest supercomputers), it's still going to take over 400 years to find the optimal selection for an input array as small as 100 numbers. And even if we make our CPU a billion times faster and the computer a billion times bigger, simply doubling the size of the array to 200 items will increase the runtime to 500 trillion years.
There is a reason why we call combinatorial explosion "combinatorial explosion", after all.
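To sanity-check that arithmetic, here is a quick Ruby back-of-the-envelope using the same assumed constants (10 THz per core, 10 million cores, one score call per clock cycle):
subsets     = 2**100                     # about 1.27e30 subsets to score
ops_per_sec = 10_000_000 * 10 * 10**12   # 10 million cores at 10 THz each
seconds     = subsets / ops_per_sec
puts seconds / (365 * 24 * 3600.0)       # => roughly 402 years
Doubling the array to 200 items squares the subset count to about 2^200, which is where the 500 trillion years on the billion-times-bigger machine comes from.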

Passing arrays from an Excel VBA function to a spreadsheet

I've passed arrays back and forth from spreadsheets to VBA functions many times. I recently "upgraded" to Excel 365, and now I can't get it to work. For example:
Public Function test(x As Range)
    Dim y()
    ReDim y(3)
    y(1) = 1
    y(2) = 2
    y(3) = 3
    test = y
End Function
Then I highlight three cells, for example B1:B3, and in the top cell I enter =test(a1:a2) and hit Ctrl+Shift+Enter. This fills the range with an array formula that is supposed to receive y() from the test function.
However, the cells that reference the function are all zeros. I put in debugging lines and I can tell the function is running as intended. It's just not passing the array to the spreadsheet.
What's up with that? Has anyone else had this experience?
@RDHS, @tim-williams and @garys-student - thank you for your spot-on answers. And Gary's Student - thanks for the incredibly quick response. I'd vote everyone up, but I can't 'cuz I'm a noob.
But... for completeness' sake, your answers raise another question (of a more theoretical type): I SHOULD BE able to coerce a one-dimensional array into a range column directly, and vice versa.
Obviously it's easy enough to check the shape of the range and transform it accordingly (well, it's easy now that you've shown me how!). But it's so sloppy:
using the above example, instead of just writing
test = y
I need to write:
If x.Rows.Count = 1 Then
    test = y
Else
    test = WorksheetFunction.Transpose(y)
End If
I don't know about you but I'd take Door # 1 (test=y). The other way is SOOOO sloppy.
But MS is holding out on us - Excel doesn't force you to do those gymnastics when using built-in spreadsheet array functions like INDEX, MATCH, etc. INDEX(C1:C10,3) and INDEX(A3:K3,3) both return the value in C3, which is the third ITEM in each ARRAY. INDEX is smart enough to figure out which is the third item. Surely if you can do it on a worksheet, there must be a way to do it in VBA?
My favorite Comp. Sci. professor - one of the founders of the field of computer science - used to say, "A programming language is low level when its programs require attention to the irrelevant."
He actually made a lot of insightful observations, which he distributed over the ARPANET, making him one of the world's first bloggers (Google Alan Perlis). For twenty years, every programmer had a list of Perlisisms taped above his VT100 -- like:
"In computing, turning the obvious into the useful is a living definition of the word 'frustration'";
"One man's constant is another man's variable";
"Fools ignore complexity. Pragmatists suffer it. Some can avoid it. Geniuses remove it."
I bring him up because the desire to produce "clean" code goes way back to the first coders on the first computers. And I was very fond of him.
Give this a try:
Public Function test(x As Range)
    Dim y()
    ReDim y(0 To 2)   ' three elements, matching the three selected cells
    y(0) = 1
    y(1) = 2
    y(2) = 3
    test = WorksheetFunction.Transpose(y)
End Function
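And for the theoretical question above, a sketch that folds in the earlier orientation check, assuming the output range's orientation matches that of the input range x (which may not hold in general):
Public Function test(x As Range)
    Dim y()
    ReDim y(0 To 2)
    y(0) = 1
    y(1) = 2
    y(2) = 3
    If x.Rows.Count = 1 Then
        test = y                                ' horizontal target: a 1-D array maps to a row
    Else
        test = WorksheetFunction.Transpose(y)   ' vertical target: transpose into a column
    End If
End Function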

matlab error, attempt to reference field of non structure array

All the references to this error I could find online were completely inapplicable to my situation: they were dealing with variables involving dots, like a.b (structures, in other words), whereas I am strictly using arrays. Nothing in my code involves a dot.
Ok, I have this GINORMOUS array called tier2comparatorconnectionpoints. It is a 4-D array of size 400×10×20×10. Consider tier2comparatorconnectionpoints(counter,counter2,counter3,counter4).
counter is a number from 1 to 400,
counter2 is a number from 1 to numchromosomes(counter), and numchromosomes(counter) is bounded by 10,
counter3 is a number from 1 to tier2numcomparators(counter,counter2), which is in turn bounded by 20,
counter4 is a number from 1 to tier2inputspercomparator(counter,counter2,counter3), which is bounded by 10.
Now, so that I don't run out of RAM, I have tier2comparatorconnectionpoints as type int8, and UNFORTUNATELY, at some point in my horrendous amount of code, I forgot to cast it to a double when doing math with it; a rounding error from multiplying it by a rand ends up with tier2comparatorconnectionpoints, for some combinations of its 4 indices, exceeding what it's allowed to be.
The values it's allowed to have fall into three bands: 1 through tier1numcomparators(counter,counter2), which is bounded by 40; 41 through 40+tier2numcomparators(counter,counter2), with tier2numcomparators(counter,counter2) bounded by 20; and 61 through 60+tier2numcomparators(counter,counter2). So it can never legitimately exceed 80, and the gaps between tier1numcomparators(counter,counter2) and 41, and between 40+tier2numcomparators(counter,counter2) and 61, are also off-limits. I became aware of the problem because it was being set to 81 somewhere.
This is an evolutionary simulation by the way, it's natural selection on simulated organisms. I need to hunt down the part of the code that is allowing the values of tier2comparatorconnectionpoints to exceed what it's allowed to be. But that is a separate problem.
A temporary fix of my data, just so that it at least conforms to its allowed values, is to set anything greater than tier1numcomparators(counter,counter2) but less than 40 down to tier1numcomparators(counter,counter2), anything greater than 40+tier2numcomparators(counter,counter2) but less than 60 down to 40+tier2numcomparators(counter,counter2), and anything greater than 60+tier2numcomparators(counter,counter2) down to 60+tier2numcomparators(counter,counter2). I first found this problem because a value was being set to 81, so it didn't just exceed 60+tier2numcomparators(counter,counter2); it exceeded 60+20, with tier2numcomparators bounded by 20.
I hope this isn't all too-much-information, but I felt it might be necessary to get you to understand just what sort of variables these are.
So in my attempts to at least turn the data into valid data, I did the following:
for counter = 1:size(tier2comparatorconnectionpoints,1)
    for counter2 = 1:size(tier2comparatorconnectionpoints,2)
        for counter3 = 1:size(tier2comparatorconnectionpoints,3)
            for counter4 = 1:size(tier2comparatorconnectionpoints,4)
                if tier2comparatorconnectionpoints(counter,counter2,counter3,counter4) > 60+tier2numcomparators(counter,counter2)
                    tier2comparatorconnectionpoints(counter,counter2,counter3,counter4) = 60+tier2numcomparators(counter,counter2);
                end
            end
        end
    end
end
And that worked just fine. And then:
for counter = 1:size(tier2comparatorconnectionpoints,1)
    for counter2 = 1:size(tier2comparatorconnectionpoints,2)
        for counter3 = 1:size(tier2comparatorconnectionpoints,3)
            for counter4 = 1:size(tier2comparatorconnectionpoints,4)
                if tier2comparatorconnectionpoints(counter,counter2,counter3,counter4) > 40+tier2numcomparators(counter,counter2)
                    if tier2comparatorconnectionpoints(counter,counter2,counter3,counter4) < 60
                        tier2comparatorconnectionpoints(counter,counter2,counter3,counter4) = 40+tier2numcomparators(counter,counter2);
                    end
                end
            end
        end
    end
end
And that's where it said "Attempt to reference field of non-structure array".
TBH it sounds like maybe you've made a typo and put a . in somewhere. Otherwise, please post the entire error, as maybe it's happening in a different function or something.
Either way, you don't need all those for loops; it's simpler and usually quicker to do this (and it should bypass your error):
First, replicate your tier2numcomparators matrix so that it has the same dimension sizes as tier2comparatorconnectionpoints:
T = repmat(tier2numcomparators + 40, 1, 1, size(tier2comparatorconnectionpoints, 3), size(tier2comparatorconnectionpoints, 4));
Now in one shot you can create a logical matrix of which elements meet your criteria:
ind = tier2comparatorconnectionpoints > T & tier2comparatorconnectionpoints < 60;
Finally employ logical indexing to set your desired elements:
tier2comparatorconnectionpoints(ind) = T(ind);
You can play around with bsxfun instead of repmat if this is slow or takes too much memory.
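If memory is the concern, here is a sketch of that bsxfun variant, assuming the same matrices as above; it uses min to apply the cap so the replicated T never has to exist, and the int8 cast matches the array's type from the question (bsxfun's min cannot mix an int8 array with a double matrix):
bound = int8(tier2numcomparators + 40);   % 400x10, one cap per (counter,counter2)
% bsxfun expands bound along dims 3 and 4 implicitly.
capped = bsxfun(@min, tier2comparatorconnectionpoints, bound);
% Apply the cap only below 60, matching the "< 60" test in the loop above.
mask = tier2comparatorconnectionpoints < 60;
tier2comparatorconnectionpoints(mask) = capped(mask);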

Cross between "dotimes" and "for" functionality?

I frequently find myself wanting to efficiently run a Clojure function multiple times with an integer index (like "dotimes") but also get the results out as a ready-made sequence/list (like "for").
i.e. I'd like to do something like this:
(fortimes [i 10] (* i i))
=> (0 1 4 9 16 25 36 49 64 81)
Clearly it would be possible to do:
(for [i (range 10)] (* i i))
But I'd like to avoid creating and throwing away the temporary range list if at all possible.
What's the best way to achieve this in Clojure?
Generating a range in a for loop, as you show in your second example, is the idiomatic solution for solving this problem in Clojure.
Since Clojure is grounded in the functional paradigm, programming in Clojure, by default, will generate temporary data structures like this. However, since both the "range" and the "for" command operate with lazy sequences, writing this code does not force the entire temporary range data structure to exist in memory at once. If used properly, there is therefore a very low memory overhead for lazy seqs as used in this example. Also, the computational overhead for your example is modest and should only grow linearly with the size of the range. This is considered an acceptable overhead for typical Clojure code.
The appropriate way to completely avoid this overhead, if the temporary range list is absolutely, positively unacceptable for your situation, is to write your code using atoms or transients: http://clojure.org/transients. If you do this, however, you will give up many of the advantages of the Clojure programming model in exchange for slightly better performance.
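A tiny REPL demonstration of that laziness (Clojure realizes lazy seqs in chunks, so only a small prefix of the million-element range is ever computed here):
(def squares (for [i (range 1000000)] (* i i)))
;; Nothing has been computed yet; both range and for are lazy.
(take 3 squares)
;; => (0 1 4), realizing only the first chunk of the seq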
I've written an iteration macro that can do this and other types of iteration very efficiently. The package is called clj-iterate, both on github and clojars. For example:
user> (iter {for i from 0 to 10} {collect (* i i)})
(0 1 4 9 16 25 36 49 64 81 100)
This will not create a temporary list.
I'm not sure why you're concerned with "creating and throwing away" the lazy sequence created by the range function. The bounded iteration done by dotimes is likely more efficient, it being an inline increment and compare with each step, but you may pay an additional cost to express your own list concatenation there.
The typical Lisp solution is to prepend new elements to a list that you build as you go, then reverse that built-up list destructively to yield the return value. Other techniques to allow appending to a list in constant time are well known, but they do not always prove to be more efficient than the prepend-then-reverse approach.
In Clojure, you can use transients to get there, relying on the destructive behavior of the conj! function:
(let [r (transient [])]
  (dotimes [i 10]
    (conj! r (* i i))) ;; destructive
  (persistent! r))
That seems to work, but the documentation on transients warns that one should not use conj! to "bash values in place"—that is, to count on destructive behavior in lieu of catching the return value. Hence, that form needs to be rewritten.
In order to rebind r above to the new value yielded by each call to conj!, we'd need to use an atom to introduce one more level of indirection. At that point, though, we're just fighting against dotimes, and it would be better to write your own form using loop and recur.
It would be nice to be able to preallocate the vector to be of the same size as the iteration bound. I don't see a way to do so.
(defmacro fortimes [[i end] & code]
  `(let [finish# ~end]
     (loop [~i 0 results# '()]
       (if (< ~i finish#)
         (recur (inc ~i) (cons ~@code results#))
         (reverse results#)))))
example:
(fortimes [x 10] (* x x))
gives:
(0 1 4 9 16 25 36 49 64 81)
Hmm, I can't seem to answer your comment because I wasn't registered. However, clj-iterate uses a PersistentQueue, which is part of the runtime library but not exposed through the reader.
It's basically a list onto which you can conj at the end.
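For reference, a small sketch of that queue-based approach, using clojure.lang.PersistentQueue/EMPTY directly since there is no reader literal for queues (the function name fortimes-q is mine):
(defn fortimes-q [n f]
  ;; conj on a queue appends at the end, so no final reverse is needed
  (loop [i 0, acc clojure.lang.PersistentQueue/EMPTY]
    (if (< i n)
      (recur (inc i) (conj acc (f i)))
      (seq acc))))
(fortimes-q 10 #(* % %))
;; => (0 1 4 9 16 25 36 49 64 81)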
