How to sample from a Scala array efficiently

I want to sample from a Scala array; the sample size can be much larger than the length of the array. How can I do this efficiently? With the following code the running time is linear in the sample size, which is slow when the sample size is very big and the sampling needs to be done many times:
def getSample(dataArray: Array[Double], sampleSize: Int, seed: Int): Array[Double] = {
  val arrLength = dataArray.length
  val r = new scala.util.Random(seed)
  Array.fill(sampleSize)(dataArray(r.nextInt(arrLength)))
}
val myArr= Array(1.0,5.0,9.0,4.0,7.0)
getSample(myArr, 100000, 28)

The probability that any given element of an array of length $n$ appears at least once in a sample of size $k$ is $1-(1-1/n)^k$. If this value is close to 1, which happens when $k$ is large compared to $n$ (for example, with $n = 5$ and $k = 100000$ it is indistinguishable from 1), then the following algorithm might be a good choice depending on your needs:
import org.apache.commons.math3.random.MersenneTwister
import org.apache.commons.math3.distribution.BinomialDistribution

def getSampleCounts[T](data: Array[T], k: Int, seed: Long): Array[Int] = {
  val rng = new MersenneTwister(seed)
  val n = data.length
  val counts = new Array[Int](n)
  var remaining = k                 // samples not yet assigned to an element
  var i = 0
  while (i < n - 1 && remaining > 0) {
    // Of the remaining samples, the number landing on data(i) is Binomial(remaining, 1/(n - i)).
    val j = new BinomialDistribution(rng, remaining, 1.0 / (n - i)).sample()
    counts(i) = j
    remaining -= j
    i += 1
  }
  counts(n - 1) = remaining         // everything left lands on the last element
  counts
}
Note that this algorithm does not return a sample. Instead it returns an Array[Int] whose $i$-th entry is equal to the number of times data(i) appears in the random sample. This may not be suitable for all applications, but for some use cases having the sample as an Iterable over (value, count) pairs (which can be obtained with data.view.zip(getSampleCounts(data, k, seed)), for example) is actually very convenient, since it often lets us do a computation once per group of equal samples. For example, suppose I had an expensive function f: T => Double and I wanted to compute the sample mean of f applied to a random sample of size $k$ drawn from data. Then we could do the following:
data.view.zip(getSampleCounts(data, k, seed)).map({case (x, count) => f(x)*count}).sum/k
This computation of the sample mean evaluates f only $n$ times instead of $k$ times (recall that we are assuming $k$ is large compared to $n$).
Note that getSampleCounts will loop at most $n$ times, where $n$ is data.length. Also, sampling from the binomial distribution in each iteration, assuming this is done in a reasonable fashion in the commons-math3 library, should have complexity no worse than $O(\log k)$ (inverse-CDF method and binary search). So the complexity of the above algorithm is $O(n \log k)$, where $n$ is data.length and $k$ is the number of samples you want to draw.
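As a quick sanity check (an illustrative usage, not part of the original answer), the counts returned for the example array in the question add up to the requested sample size:
// Illustrative usage of getSampleCounts; the per-element counts sum to k.
val myArr = Array(1.0, 5.0, 9.0, 4.0, 7.0)
val counts = getSampleCounts(myArr, 100000, 28L)
println(counts.sum)                    // 100000
println(myArr.view.zip(counts).toList) // the (value, count) pairs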

There is no way around it. If you need to take N elements, then even with constant-time element access the complexity will be O(N) (linear), no matter what.
You can defer/amortize the cost by making it lazy. For instance you can return a Stream or Iterator that evaluates each element as you access it. This will also help you save memory if you can fold that stream as you consume it. In other words, you can skip the copy part and work directly with the initial array; this is not always possible and depends on the task.
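For example, a minimal sketch of the lazy variant (the helper name is illustrative, not from this answer):
import scala.util.Random

// Sketch: a lazily evaluated sample; nothing is drawn until the iterator is consumed.
def sampleIterator(data: Array[Double], sampleSize: Int, seed: Int): Iterator[Double] = {
  val r = new Random(seed)
  Iterator.fill(sampleSize)(data(r.nextInt(data.length)))
}

// Folding (here: averaging) consumes the samples one by one without ever storing them all.
val mean = sampleIterator(Array(1.0, 5.0, 9.0, 4.0, 7.0), 100000, 28).sum / 100000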

To make this sampling program run faster, you could use the Akka actor framework to run the sampling jobs in parallel.
Create a master actor that distributes the sampling work to worker actors and concatenates the elements coming back from the different workers. Each worker actor prepares/collects a fixed number of sample elements and sends the resulting collection back to the master as an immutable array. Upon receiving the user-defined 'WorkDone' message from a worker, the master actor concatenates the elements into the final collection.
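For illustration only, here is a minimal sketch of the same divide-and-merge idea using plain Futures instead of Akka actors (the helper name and the per-worker seeding scheme are assumptions, not part of this answer):
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Random

// Each "worker" draws a fixed-size chunk of the sample; the chunks are
// concatenated once all of them have completed (remainder ignored for brevity).
def parallelSample(data: Array[Double], sampleSize: Int, workers: Int, seed: Int): Array[Double] = {
  val chunk = sampleSize / workers
  val jobs = (0 until workers).map { w =>
    Future {
      val r = new Random(seed + w) // independent seed per worker
      Array.fill(chunk)(data(r.nextInt(data.length)))
    }
  }
  Await.result(Future.sequence(jobs), 1.minute).flatten.toArray
}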

It is easy with a List. Use the following implicit class (note that Random.shuffle followed by take samples without replacement, so n cannot exceed the list length):
import scala.util.Random

object ListImplicits {
  implicit class SampledArray[T](in: List[T]) {
    def sample(n: Int, seed: Option[Long] = None): List[T] = {
      seed match {
        case Some(s) => Random.setSeed(s)
        case _       => // nothing
      }
      Random.shuffle(in).take(n)
    }
  }
}
And then import the object and use collection conversions to switch from Array to list (slight overhead):
import ListImplicits.SampledArray
import scala.util.Random
val n = 100000
val list = (0 to n).toList.map(i => Random.nextInt())
val array = list.toArray
val t0 = System.currentTimeMillis()
array.toList.sample(5).toArray
val t1 = System.currentTimeMillis()
list.sample(5)
val t2 = System.currentTimeMillis()
println( "Array (conversion) => delta = " + (t1-t0) + " ms") // 10 ms
println( "List => delta = " + (t2-t1) + " ms") // 8 ms

fastest method to multiply every element of array with a number in Scala

I have a large array and I want to multiply every element of the array by a given number N.
I can do this in the following way:
val arr = Array.fill(100000)(math.random)
val N = 5.0
val newArr = arr.map ( _ * N )
So this will return the new array I want. Another way could be:
def demo(arr: Array[Double], value: Double): Array[Double] = {
  var res: Array[Double] = Array()
  if (arr.length == 1)
    res = Array(arr.head + value)
  else
    res = demo(arr.slice(0, arr.length / 2), value) ++ demo(arr.slice(arr.length / 2, arr.length), value)
  res
}
In my case I have a larger array and I have to perform this operation for thousands of iterations. Is there any faster way to get the same output? Will tail recursion increase speed? Or any other technique?
Presumably you mean arr.head * value.
Neither of these is "faster" in big-O terms; they're both linear in the length of the array, which makes sense because it's right there in the description: you need to multiply "every" number by some constant.
The only difference is that in the second case you spend a bunch of time slicing and concatenating arrays. So the first one is likely going to be faster. Tail recursion isn't going to help you because this isn't a recursive problem. All you need to do is loop once through the array.
If you have a really huge number of numbers in your array, you could parallelize the multiplication across the available processors (if there is more than one) by using arr.par.map.
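A small sketch of that (note that on Scala 2.13+ the .par method needs the separate scala-parallel-collections module; on 2.12 and earlier it is built in):
// Parallel map over the array; the work is split across the available cores.
val arr = Array.fill(100000)(math.random)
val N = 5.0
val newArr = arr.par.map(_ * N).toArray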
Tail recursion will be better than regular recursion if you're writing your own function to recursively loop over the list, since tail recursion doesn't run into stack overflow issues the way regular recursion does.
arr.map(_ * N) should be fine for you though.
Whatever you do, try not to use var. Mutable variables are a code smell in Scala.
Also, when you're dealing with thousands of values it might be worth looking into different collection types like Vector over Array; different collections are efficient at different things. For more information on the performance of collections, check out the official Scala Collections Performance Characteristics page.
For multiplying the elements of an array by a number N, the time complexity will be linear in the length of the array, as you have to visit every element of the array.
arr.map ( _ * N )
IMHO the above code is the optimal solution, evaluating it in linear time. But if your array is very huge, I would recommend converting the array to a stream and then performing your transformation lazily.
For example
arr.toStream.map(_ * N)
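Along the same lines, a lazy view (just a sketch, using the arr and N from the question) defers the multiplication until elements are actually consumed, without Stream's memoization:
// Nothing is multiplied until the view is forced; only consumed elements pay the cost.
val lazyProducts = arr.view.map(_ * N)
println(lazyProducts.take(3).toList) // forces only the first three products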

Time Complexity of finding the length of an array

I am a little confused about what the time complexity of the len() function would be.
I have read in many different posts that finding the length of an array in Python is O(1) with the len() function, and similarly for other languages.
How is this possible? Do you not have to iterate through the whole array to count how many indices it's taking up?
Do you not have to iterate through the whole array to count how many indices it's taking up?
No, you do not.
You can generally always trade space for time when constructing algorithms.
For example, when creating a collection, allocate a separate variable holding the size. Then increment that when adding an item to the collection and decrement it when removing something.
Then, voilà, the size of the collection can be obtained in O(1) time just by accessing that variable.
And this appears to be what Python actually does, as per this page, which states (and checking the Python source code shows that this is the action when requesting the size of a great many object types):
Py_SIZE(o) - This macro is used to access the ob_size member of a Python object. It expands to (((PyVarObject*)(o))->ob_size).
If you compare the two approaches (iterating vs. a length variable), the properties of each can be seen in the following table:
Measurement    Iterate                                        Length variable
Space needed   No extra space beyond the collection itself.   Tiny additional length field (4 bytes
                                                              allows sizes of about four billion).
Time taken     Iteration over the collection; depends on      Extraction of the length, very quick.
               collection size, so could be significant.      Changes to the list size (addition or
                                                              deletion) incur the slight extra expense
                                                              of updating the length, but this is
                                                              also tiny.
In this case, the extra cost is minimal but the time saved for getting the length could be considerable, so it's probably worth it.
That's not always the case since, in some (rare) situations, the added space cost may outweigh the reduced time taken (or it may need more space than can be made available).
And, by way of example, this is what I'm talking about. Ignore the fact that it's totally unnecessary in Python; this is for a mythical Python-like language that has an O(n) cost for finding the length of a list:
import random

class FastQueue:
    """ FastQueue: demonstration of length variable usage.
    """
    def __init__(self):
        """ Init: Empty list and set length zero.
        """
        self._content = []
        self._length = 0

    def push(self, item):
        """ Push: Add to end, increase length.
        """
        self._content.append(item)
        self._length += 1

    def pull(self):
        """ Pull: Remove from front, decrease length, taking
            care to handle empty queue.
        """
        item = None
        if self._length > 0:
            item = self._content[0]
            self._content = self._content[1:]
            self._length -= 1
        return item

    def length(self):
        """ Length: Just return stored length. Obviously this
            has no advantage in Python since that's
            how it already does length. This is just
            an illustration of my answer.
        """
        return self._length

    def slow_length(self):
        """ Length: A slower version for comparison.
        """
        list_len = 0
        for _ in self._content:
            list_len += 1
        return list_len

""" Test harness to ensure I haven't released buggy code :-)
"""
queue = FastQueue()
for _ in range(10):
    val = random.randint(1, 50)
    queue.push(val)
    print(f'push {val}, length = {queue.length()}')
for _ in range(11):
    print(f'pull {queue.pull()}, length = {queue.length()}')

How to calculate efficiently the index of an array, where the cumulative sum exceeded a threshold in Scala - Spark?

Suppose we have an array of integers, of 100 elements.
val a = Array(312, 102, 95, 255, ...)
I want to find the index (say k) of the array where the cumulative sum of the first k+1 elements is greater than a certain threshold, but for the first k elements it is less.
Because of the high number of elements in the array, I estimated a lower and an upper index between which the k index should be:
k_Lower <= k <= k_upper
My question is, what is the best way to find this k index?
I tried it with a while loop, with k_lower = 30, k_upper = 47 and threshold = 20000:
var sum = 0
var k = 30
while (k <= 47 && sum <= 20000) {
  sum = a.take(k).sum
  k += 1
}
print(k - 2)
I obtained the right answer, but I'm pretty sure that there is a more efficient or more "Scala-ish" solution for this, and I'm really new to Scala. Also, I have to implement this in Spark.
Another question:
To optimize the method of finding the k index, I thought of using binary search, where the min and max values are k_lower, respective k_upper. But my attempt to implement this was unsuccessful. How should I do this?
I am using Scala 2.10.6 and Spark 1.6.0
Update!
I thought this approach was a good solution for my problem, but now I think that I approached it wrongly. My actual problem is the following:
I have a bunch of JSON-s, which are loaded into Spark as an RDD with
val eachJson = sc.textFile("JSON_Folder/*.json")
I want to split the data into several partitions based on their size; the concatenated JSONs' size should stay under a threshold. My idea was to go through the RDD one element at a time, calculate the size of each JSON and add it to an accumulator. When the accumulator exceeds the threshold, I remove the last JSON, obtain a new RDD with all the JSONs up to that point, and do it again with the remaining JSONs. I read about tail recursion, which could be a solution for this, but I wasn't able to implement it, so I tried to solve it differently. I mapped the sizes of the JSONs and obtained an RDD[Int], and I managed to get all the indexes of this array where the cumulative sum exceeds the threshold:
def calcRDDSize(rdd: RDD[String]): Long = {
  rdd.map(_.getBytes("UTF-8").length.toLong)
     .reduce(_ + _) // add the sizes together
}

val jsonSize = eachJson.map(s => s.getBytes("UTF-8").length)
val threshold = 20000
val totalSize = calcRDDSize(eachJson)
val numberOfPartitions = totalSize / threshold
val splitIndexes = scala.collection.mutable.ArrayBuffer.empty[Int]
var i = 0
while (i < numberOfPartitions) {
  splitIndexes += jsonSize.collect().toStream.scanLeft(0)(_ + _).takeWhile(_ < (i + 1) * threshold).length - 1
  i = i + 1
}
However, I don't like this solution, because in the while loop I go through the Stream several times, which is not really efficient. And now I have the indexes at which I have to split the RDD, but I don't know how to split it.
I would do this with scanLeft, and further optimize it by using a lazy collection:
a
  .toStream
  .scanLeft(0)(_ + _)
  .tail
  .zipWithIndex
  .find { case (cumsum, i) => cumsum > limit } // Option of (cumulative sum, index)
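For the binary-search part of the question, here is a hedged sketch (names are illustrative, not from this answer): precompute the prefix sums once, then binary-search between k_lower and k_upper:
// O(n) to build the prefix sums once, then O(log n) per query.
def firstIndexExceeding(a: Array[Int], threshold: Int, kLower: Int, kUpper: Int): Int = {
  val prefix = a.scanLeft(0)(_ + _).tail // prefix(i) = a(0) + ... + a(i)
  var lo = kLower
  var hi = kUpper
  while (lo < hi) {
    val mid = (lo + hi) / 2
    if (prefix(mid) > threshold) hi = mid else lo = mid + 1
  }
  lo // smallest k in [kLower, kUpper] with prefix(k) > threshold (assuming one exists)
}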

What is the fastest way to count elements in an array?

In my models, one of the most frequently repeated tasks is counting the number of each element within an array. The counting is from a closed set, so I know there are X types of elements, and all or some of them populate the array, along with zeros that represent 'empty' cells. The array is not sorted in any way, could be quite long (about 1M elements), and this task is done thousands of times during one simulation (which is itself part of hundreds of simulations). The result should be a vector r of size X, so r(k) is the amount of k in the array.
Example:
For X = 9, if I have the following input vector:
v = [0 7 8 3 0 4 4 5 3 4 4 8 3 0 6 8 5 5 0 3]
I would like to get this result:
r = [0 0 4 4 3 1 1 3 0]
Note that I don't want the count of zeros, and that elements that don't appear in the array (like 2) have a 0 in the corresponding position of the result vector (r(2) == 0).
What would be the fastest way to achieve this goal?
tl;dr: The fastest method depends on the size of the array. For arrays smaller than 2^14, method 3 below (accumarray) is faster. For arrays larger than that, method 2 below (histcounts) is better.
UPDATE: I tested this also with implicit broadcasting, which was introduced in R2016b, and the results are almost equal to the bsxfun approach, with no significant difference of this method relative to the others.
Let's see what methods are available to perform this task. For the following examples we will assume X has n elements, from 1 to n, and our array of interest is M, a column array that can vary in size. Our result vector will be spp[1] (see the footnote at the end for the name), such that spp(k) is the number of ks in M. Although I write here about X, there is no explicit implementation of it in the code below; I just define n = 500, and X is implicitly 1:500.
The naive for loop
The simplest and most straightforward way to accomplish this task is a for loop that iterates over the elements of X and counts the number of elements in M equal to each:
function spp = loop(M,n)
    spp = zeros(n,1);
    for k = 1:size(spp,1)
        spp(k) = sum(M==k);
    end
end
This is of course not so smart, especially if only a small group of elements from X populates M, so we had better look first for those that are already in M:
function spp = uloop(M,n)
    u = unique(M); % finds which elements to count
    spp = zeros(n,1);
    for k = u(u>0).'
        spp(k) = sum(M==k);
    end
end
Usually, in MATLAB, it is advisable to take advantage of the built-in functions as much as possible, since most of the time they are much faster. I thought of 5 options to do so:
1. The function tabulate
The function tabulate returns a very convenient frequency table that at first sight seems to be the perfect solution for this task:
function tab = tabi(M)
    tab = tabulate(M);
    if tab(1)==0
        tab(1,:) = [];
    end
end
The only fix to be done is to remove the first row of the table if it counts the 0 element (it could be that there are no zeros in M).
2. The function histcounts
Another option that can be tweaked quite easily to our needs is histcounts:
function spp = histci(M,n)
    spp = histcounts(M,1:n+1);
end
Here, in order to count all the different elements between 1 and n separately, we define the edges to be 1:n+1, so every element in X has its own bin. We could also write histcounts(M(M>0),'BinMethod','integers'), but I already tested it, and it takes more time (though it makes the function independent of n).
3. The function accumarray
The next option I'll bring here is the use of the function accumarray:
function spp = accumi(M)
    spp = accumarray(M(M>0),1);
end
here we give the function M(M>0) as input, to skip the zeros, and use 1 as the vals input to count all unique elements.
4. The function bsxfun
We can even use the binary operation @eq (i.e. ==) to look for all elements of each type:
function spp = bsxi(M,n)
    spp = bsxfun(@eq,M,1:n);
    spp = sum(spp,1);
end
If we keep the first input M and the second 1:n in different dimensions, so that one is a column vector and the other a row vector, then the function compares each element in M with each element in 1:n and creates a length(M)-by-n logical matrix that we can sum to get the desired result.
5. The function ndgrid
Another option, similar to the bsxfun, is to explicitly create the two matrices of all possibilities using the ndgrid function:
function spp = gridi(M,n)
    [Mx,nx] = ndgrid(M,1:n);
    spp = sum(Mx==nx);
end
then we compare them and sum over columns, to get the final result.
Benchmarking
I have done a little test to find the fastest of the methods mentioned above; I defined n = 500 for all trials. For some of them (especially the naive for) n has a great impact on the execution time, but this is not the issue here, since we want to test them for a given n.
Here are the results (a plot of the log execution time against the log2 of the array size for all seven methods):
We can notice several things:
Interestingly, there is a shift in the fastest method. For arrays smaller than 2^14, accumarray is the fastest. For arrays larger than 2^14, histcounts is the fastest.
As expected, the naive for loops, in both versions, are the slowest, but for arrays smaller than 2^8 the "unique & for" option is the slower of the two. ndgrid becomes the slowest for arrays bigger than 2^11, probably because of the need to store very large matrices in memory.
There is some irregularity in the way tabulate works on arrays smaller than 2^9. This result was consistent (with some variation in the pattern) across all the trials I conducted.
(The bsxfun and ndgrid curves are truncated because they make my computer get stuck at larger sizes, and the trend is quite clear already.)
Also, notice that the y-axis is log10, so a decrease of one unit (as for arrays of size 2^19, between accumarray and histcounts) means a 10-times faster operation.
I'll be glad to hear in the comments about improvements to this test, and if you have another, conceptually different method, you are most welcome to suggest it as an answer.
The code
Here are all the functions wrapped in a timing function:
function out = timing_hist(N,n)
    M = randi([0 n],N,1);
    func_times = {'for','unique & for','tabulate','histcounts','accumarray','bsxfun','ndgrid';
        timeit(@() loop(M,n)),...
        timeit(@() uloop(M,n)),...
        timeit(@() tabi(M)),...
        timeit(@() histci(M,n)),...
        timeit(@() accumi(M)),...
        timeit(@() bsxi(M,n)),...
        timeit(@() gridi(M,n))};
    out = cell2mat(func_times(2,:));
end
function spp = loop(M,n)
    spp = zeros(n,1);
    for k = 1:size(spp,1)
        spp(k) = sum(M==k);
    end
end
function spp = uloop(M,n)
    u = unique(M);
    spp = zeros(n,1);
    for k = u(u>0).'
        spp(k) = sum(M==k);
    end
end
function tab = tabi(M)
    tab = tabulate(M);
    if tab(1)==0
        tab(1,:) = [];
    end
end
function spp = histci(M,n)
    spp = histcounts(M,1:n+1);
end
function spp = accumi(M)
    spp = accumarray(M(M>0),1);
end
function spp = bsxi(M,n)
    spp = bsxfun(@eq,M,1:n);
    spp = sum(spp,1);
end
function spp = gridi(M,n)
    [Mx,nx] = ndgrid(M,1:n);
    spp = sum(Mx==nx);
end
And here is the script to run this code and produce the graph:
N = 25; % it is not recommended to run this with N>19 for the `bsxfun` and `ndgrid` functions.
func_times = zeros(N,7);
for n = 1:N
    func_times(n,:) = timing_hist(2^n,500);
end
% plotting:
hold on
mark = 'xo*^dsp';
for k = 1:size(func_times,2)
    plot(1:size(func_times,1),log10(func_times(:,k).*1000),['-' mark(k)],...
        'MarkerEdgeColor','k','LineWidth',1.5);
end
hold off
xlabel('Log_2(Array size)','FontSize',16)
ylabel('Log_{10}(Execution time) (ms)','FontSize',16)
legend({'for','unique & for','tabulate','histcounts','accumarray','bsxfun','ndgrid'},...
    'Location','NorthWest','FontSize',14)
grid on
[1] The reason for this weird name comes from my field, ecology. My models are cellular automata that typically simulate individual organisms in a virtual space (the M above). The individuals are of different species (hence spp), and together they form what is called an "ecological community". The "state" of the community is given by the number of individuals of each species, which is the spp vector in this answer. In these models, we first define a species pool (X above) from which individuals are drawn, and the community state takes into account all species in the species pool, not only those present in M.
We know that the input vector always contains integers, so why not use this to "squeeze" a bit more performance out of the algorithm?
I've been experimenting with some optimizations of the two best binning methods suggested by the OP, and this is what I came up with:
The number of unique values (X in the question, or n in the example) should be explicitly converted to an (unsigned) integer type.
It's faster to compute an extra bin and then discard it than to "only process" the valid values (see the accumi_new function below).
This function takes about 30 sec to run on my machine. I'm using MATLAB R2016a.
function q38941694
    datestr(now)
    N = 25;
    func_times = zeros(N,4);
    for n = 1:N
        func_times(n,:) = timing_hist(2^n,500);
    end
    % Plotting:
    figure('Position',[572 362 758 608]);
    hP = plot(1:n,log10(func_times.*1000),'-o','MarkerEdgeColor','k','LineWidth',2);
    xlabel('Log_2(Array size)'); ylabel('Log_{10}(Execution time) (ms)')
    legend({'histcounts (double)','histcounts (uint)','accumarray (old)',...
        'accumarray (new)'},'FontSize',12,'Location','NorthWest')
    grid on; grid minor;
    set(hP([2,4]),'Marker','s'); set(gca,'Fontsize',16);
    datestr(now)
end
function out = timing_hist(N,n)
    % Convert n into an appropriate integer class:
    if n < intmax('uint8')
        classname = 'uint8';
        n = uint8(n);
    elseif n < intmax('uint16')
        classname = 'uint16';
        n = uint16(n);
    elseif n < intmax('uint32')
        classname = 'uint32';
        n = uint32(n);
    else % n < intmax('uint64')
        classname = 'uint64';
        n = uint64(n);
    end
    % Generate an input:
    M = randi([0 n],N,1,classname);
    % Time different options:
    warning off 'MATLAB:timeit:HighOverhead'
    func_times = {'histcounts (double)','histcounts (uint)','accumarray (old)',...
        'accumarray (new)';
        timeit(@() histci(double(M),double(n))),...
        timeit(@() histci(M,n)),...
        timeit(@() accumi(M)),...
        timeit(@() accumi_new(M))
        };
    out = cell2mat(func_times(2,:));
end
function spp = histci(M,n)
    spp = histcounts(M,1:n+1);
end
function spp = accumi(M)
    spp = accumarray(M(M>0),1);
end
function spp = accumi_new(M)
    spp = accumarray(M+1,1);
    spp = spp(2:end);
end

Algorithm for a diversified sort

I'm looking for a way to implement a diversified sort. Each cell contains a weight value along with an enum type. I would like to sort it in a way that makes the weight value dynamic according to the types of the elements already chosen, giving priority to those 'less chosen' so far. I would like to control the diversity factor, so that setting it to a high value produces a fully diversified result array, while a low value gives an almost 'regular' sorted array.
This doesn't sound like a very specific use case, so if there are references to known algorithms, that would also be great.
Update:
Following Ophir's suggestion, this might be a basic wrapper:
// these will be the three arrays, one per type
$contentTypeA, $contentTypeB, $contentTypeC;

// sort each by value
sort($contentTypeA);
sort($contentTypeB);
sort($contentTypeC);

// while I haven't got the amount I want and there are still options to choose from
while ($amountChosen < 100 && (count($contentTypeA) + count($contentTypeB) + count($contentTypeC) > 0)) {
    $diversifiedContent[] = selectBest($bestA, $bestB, $bestC, $contentTypeA, $contentTypeB, $contentTypeC);
    $amountChosen++;
}
$diversifiedContent = array_slice($diversifiedContent, 0, 520);
return $diversifiedContent;
}

function selectBest($bestA, $bestB, $bestC, &$contentTypeA, &$contentTypeB, &$contentTypeC) {
    static $typeSelected;
    $diversifyFactor = 0.5;
    if (?) {
        $typeSelected['A']++;
        array_shift($contentTypeA);
        return $bestA;
    }
    else if (?) {
        $typeSelected['B']++;
        array_shift($contentTypeB);
        return $bestB;
    }
    else if (?) {
        $typeSelected['C']++;
        array_shift($contentTypeC);
        return $bestC;
    }
}
Your definition is in very general terms, not mathematical terms, so I doubt you can find a ready-made solution that matches exactly what you want.
I can suggest this simple approach:
Sort each type separately. Then merge the lists by iteratively taking the maximum value in the list of highest priority, where priority is the product of the value and a "starvation" factor for that type. The starvation factor will be a combination of how many steps ignored that type, and the diversity factor. The exact shape of this function depends on your application.
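A minimal sketch of that merge idea (in Scala for concreteness; the names and the exact starvation function are illustrative, not prescribed by this answer), assuming each per-type list is already sorted in descending value order:
// Repeatedly pick the best remaining head, where the priority of a type is
// its head value times a starvation factor that grows while the type is ignored.
def diversifiedMerge(byType: Map[Char, List[Double]], diversity: Double): List[(Char, Double)] = {
  val ignoredFor = scala.collection.mutable.Map(byType.keys.map(_ -> 0).toSeq: _*)
  var queues = byType
  val out = scala.collection.mutable.ListBuffer.empty[(Char, Double)]
  while (queues.values.exists(_.nonEmpty)) {
    val (t, v) = queues.collect { case (tp, head :: _) => (tp, head) }
      .maxBy { case (tp, head) => head * (1 + diversity * ignoredFor(tp)) }
    out += ((t, v))
    queues = queues.updated(t, queues(t).tail)
    byType.keys.foreach(k => ignoredFor(k) = if (k == t) 0 else ignoredFor(k) + 1)
  }
  out.toList
}
// e.g. diversifiedMerge(Map('a' -> List(9.0, 8.0), 'b' -> List(7.0, 6.0)), diversity = 2.0)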
Here's an idea:
class item(object):
    def __init__(self, enum_type, weight):
        self.enum_type = enum_type
        self.weight = weight
        self.dyn_weight = weight

    def __repr__(self):
        return unicode((self.enum_type, self.weight, self.dyn_weight))


def sort_diverse(lst, factor):
    # first sort
    by_type = sorted(lst, key=lambda obj: (obj.enum_type, obj.weight))
    cnt = 1
    for i in xrange(1, len(lst)):
        current = by_type[i]
        previous = by_type[i-1]
        if current.enum_type == previous.enum_type:
            current.dyn_weight += factor * cnt
            cnt += 1
        else:
            cnt = 1
    return sorted(by_type, key=lambda obj: (obj.dyn_weight, obj.enum_type))
Try this example:
lst = [item('a', 0) for x in xrange(10)] + [item('b', 1) for x in xrange(10)] + [item('c', 2) for x in xrange(10)]
print sort_diverse(lst, 0) # regular sort
print sort_diverse(lst, 1) # partially diversified
print sort_diverse(lst, 100) # completely diversified
Depending on your needs, you might want to use a more sophisticated weight update function.
This algorithm is basically O(n log n) in time and O(n) in space, as it requires two sorts and two copies of the list.
