What is the fastest way to make a uniq array?

I've got the following situation: I have a big array of random strings, and this array should be made unique as fast as possible.
Now, through some benchmarking, I found out that Ruby's uniq is quite slow:
require 'digest'
require 'benchmark'

# make a nice random array of strings
list = (1..100000).to_a.map(&:to_s).map { |e| Digest::SHA256.hexdigest(e) }
list += list
list.shuffle!  # shuffle in place; a bare `list.shuffle` would discard the shuffled copy

def hash_uniq(a)
  a_hash = {}
  a.each do |v|
    a_hash[v] = nil
  end
  a_hash.keys
end

Benchmark.bm do |x|
  x.report(:uniq)      { 100.times { list.uniq } }
  x.report(:hash_uniq) { 100.times { hash_uniq(list) } }
end
Gist -> https://gist.github.com/stillhart/20aa9a1b2eeb0cff4cf5
The results are quite interesting. Could it be that Ruby's uniq is quite slow?
user system total real
uniq 23.750000 0.040000 23.790000 ( 23.823770)
hash_uniq 18.560000 0.020000 18.580000 ( 18.591803)
Now my questions:
Are there any faster ways to make an array unique?
Am I doing something wrong?
Is there something wrong in the Array.uniq method?
I am using ruby 2.2.3p173 (2015-08-18 revision 51636) [x86_64-linux]

String parsing operations on large data sets are certainly not where Ruby shines. If this is business critical, you might want to write an extension in something like C or Go, or let another application handle this before passing it to your Ruby application.
That said, there seems to be something strange with your benchmark. Running the same on my MacBook Pro using Ruby 2.2.3 renders the following result:
user system total real
uniq 10.300000 0.110000 10.410000 ( 10.412513)
hash_uniq 11.660000 0.210000 11.870000 ( 11.901917)
This suggests that uniq is slightly faster.
If possible, you should always try to work with the right collection types. If your collection is truly unique, then use a Set. Sets have a better memory profile and the faster lookup speed of Hash, while retaining some of the Array intuition.
If your data is already in an Array, however, this might not be a good tradeoff, as insertion into a Set is rather slow as well, as you can see here:
user system total real
uniq 11.040000 0.060000 11.100000 ( 11.102644)
hash_uniq 12.070000 0.230000 12.300000 ( 12.319356)
set_insertion 12.090000 0.200000 12.290000 ( 12.294562)
Where I added the following benchmark:
x.report(:set_insertion) { 100.times { Set.new(list) } }
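If you do end up going the Set route from the start, a minimal sketch of the conversion looks like this (it reuses the list variable from the benchmark above; whether it beats uniq for data that already sits in an Array is exactly the tradeoff measured here):

require 'set'

# Build a Set from the Array and convert back. Lookups against the Set
# itself are constant time, which is where Set pays off if the collection
# is kept unique from the moment it is built rather than deduplicated later.
unique_list = Set.new(list).to_a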

Related

Fastest way to check duplicate columns in a line in perl

I have a file with 1 million lines like this
aaa,111
bbb,222
...
...
a3z,222 (line# 500,000)
...
...
bz1,444 (last line# 1 million)
What I need to check is whether the second value after the comma is unique or not. If not, then print out the line number. In the example above it should print out
Duplicate: line: 500000 value: a3z,222
For this I am using Perl and storing the value of the second column in an array. If I don't find the value in the array, I add it to it. If the value already exists, then I print it out as a duplicate.
The problem is that the logic I am using is super slow. It takes anywhere from 2-3 hours to complete. Is there a way I can speed this up? I don't want to create an array if I don't have to. I just want to check for duplicate values in column 2 of the file.
If there is a faster way to do it in a batch-file I am open to it.
Here's my working code.
# header
use warnings;
use DateTime;
use strict;
use POSIX qw(strftime);
use File::Find;
use File::Slurp;
use File::Spec;
use List::MoreUtils qw(uniq);

print "Perl Starting ... \n\n";

# Open the file for read access:
open my $filehandle, '<', 'my_huge_input_file.txt';

my $counter = 0;
my @uniqueArray;

# Loop through each line:
while (defined(my $recordLine = <$filehandle>))
{
    # Keep track of line numbers
    $counter++;

    # Strip the linebreak character at the end.
    chomp $recordLine;

    my @fields = split(/,/, $recordLine);
    my $val1 = $fields[0];
    my $val2 = $fields[1];

    if ( !($val2 ~~ @uniqueArray) && ($val2 ne "") )
    {
        push(@uniqueArray, $val2);
    }
    else
    {
        print ("DUP line: $counter - val1: $val1 - val2: $val2 \n");
    }
}
print "\nPerl End ... \n\n";
That's one of the things a hash is for:
use feature qw(say);
...
my %second_field_value;

while (defined(my $recordLine = <$filehandle>))
{
    chomp $recordLine;
    my @fields = split /,/, $recordLine;

    if (exists $second_field_value{$fields[1]}) {
        say "DUP line: $. -- @fields[0,1]";
    }
    ++$second_field_value{$fields[1]};
}
This will accumulate all possible values for this field, as it must. We can also store suitable info about dupes as they are found, depending on what needs to be reported about them.
The line number (of the last-read filehandle) is available in the $. variable; there is no need for $counter.
Note that the check and the flag/counter update can be done in one expression, for example
if ($second_field_value{$fields[1]}++) { say ... }  # already seen before
which is an idiom when checking for duplicates. Thanks to ikegami for bringing it up. This works by having the post-increment in the condition (so the check is done with the old value, and the count is up to date in the block).
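Put together as a self-contained filter, the idiom looks like this (a sketch that reads from standard input or a file named on the command line, with the field names taken from the question):

use strict;
use warnings;
use feature qw(say);

# The post-increment inside the condition means a line is reported only
# when its second field has been seen before; the count is updated either way.
my %seen;
while (my $line = <>) {
    chomp $line;
    my ($val1, $val2) = split /,/, $line;
    next unless defined $val2 && $val2 ne '';
    say "DUP line: $. - val1: $val1 - val2: $val2" if $seen{$val2}++;
}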
I have to comment on the smart-match operator (~~) as well. It is widely understood that it has great problems in its current form and it is practically certain that it will suffer major changes, or worse. Thus, simply put, I'd say: don't use it. The code with it has every chance of breaking at some unspecified point in the future, possibly without a warning, and perhaps quietly.
Note on performance and "computational complexity," raised in comments.
Searching through an array on every line has O(n m) complexity (n lines, m values), which is really O(n²) here, since the array gets a new value on each line (so m = n-1); further, the whole array gets searched for (practically) every line, as there normally aren't dupes. With the hash the complexity is O(n), as we have a constant-time lookup on each line.
The crucial thing is that all of this is about the size of the input. For a file of a few hundred lines we can't tell the difference. With a million lines, the reported run times are "anywhere from 2-3 hours" with the array and "under 5 seconds" with the hash.
The fact that the "complexity" assessment deals with input size has practical implications.
For one, code with carelessly built algorithms which "runs great" may break miserably for unexpectedly large inputs -- or, rather, for realistic data once it comes to production runs.
On the other hand, it is often quite satisfactory to run with code that is cleaner and simpler even as it has worse "complexity" -- when we understand its use cases.
Generally, the complexity tells us how the runtime depends on size, not what exactly it is. So an O(n²) algorithm may well run faster than an O(n log n) one for small enough input. This has great practical importance and is used widely in choosing algorithms.
Use a hash. Arrays are good for storing sequential data, and hashes are good for storing random-access data. Your search of @uniqueArray is O(n) on each search, which is done once per line, making your algorithm O(n^2). A hash solution would be O(1) (more or less) on each search, which is done once per line, making it O(n) overall.
Also, use $. for line numbers; Perl tracks it for you.
my %seen;

while (<$filehandle>)
{
    chomp;
    my ($val1, $val2) = split /,/;

    # track all values and their line numbers.
    push @{$seen{$val2}}, [$., $val1];
}

# now go through the full list, looking for anything that was seen
# more than once.
for my $val2 (grep { @{$seen{$_}} > 1 } keys %seen)
{
    print "DUP line: $val2 was seen on lines ", join ", ", map { "$_->[0] ($_->[1]) " } @{$seen{$val2}};
    print "\n";
}
This is all O(n). Much faster.
The hash answer you've accepted would be the standard approach here. But I wonder if using an array would be a little faster. (I've also switched to using $_ as I think it makes the code cleaner.)
use feature qw(say);
...
my @second_field_value;

while (<$filehandle>)
{
    chomp;
    my @fields = split /,/;

    if ($second_field_value[$fields[1]]) {
        say "DUP line: $. -- @fields";
    }
    ++$second_field_value[$fields[1]];
}
It would be a pretty sparse array, but it might still be faster than the hash version. (I'm afraid I don't have the time to benchmark it.)
Update: I ran some basic tests. The array version is faster. But not by enough that it's worth worrying about.
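For anyone who wants to repeat that comparison, here is a rough sketch using the core Benchmark module (synthetic data; the array variant assumes the second field is a non-negative integer, as the answer above requires):

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic lines with numeric second fields so both variants apply.
my @lines = map { "key$_," . int(rand(500_000)) } 1 .. 100_000;

cmpthese(-3, {
    hash => sub {
        my %seen;
        for my $line (@lines) {
            my (undef, $v) = split /,/, $line;
            $seen{$v}++;
        }
    },
    array => sub {
        my @seen;
        for my $line (@lines) {
            my (undef, $v) = split /,/, $line;
            $seen[$v]++;
        }
    },
});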

perl - searching a large /sorted/ array for index of a string

I have a large array of approx 100,000 items, and a small array of approx 1000 items. I need to search the large array for each of the strings in the small array, and I need the index of the string returned. (So I need to search the 100k array 1000 times)
The large array has been sorted so I guess some kind of binary chop type search would be a lot more efficient than using a foreach loop (using 'last' to break the loop when found) which is what I started with. (this first attempt results in some 30m comparisons!)
Is there a built in search method that would produce a more efficient result, or am I going to have to manually code a binary search? I also want to avoid using external modules.
For the purposes of the question, just assume that I need to find the index of a single string in the large sorted array. (I only mention the 1000 items to give an idea of the scale)
This sounds like a classic hash use case:
my %index_for = map { $large_array[$_] => $_ } 0 .. $#large_array;
print "index in large array:", $index_for{ $small_array[1000] };
Using a binary search is probably optimal here. Binary search only needs O(log n) comparisons (here ~17 comparisons per lookup).
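For reference, a hand-rolled binary search over the sorted array is only a few lines; a sketch (assuming the large array is sorted in ascending lexical order, and returning -1 when the string is absent):

# Binary search for $target in the sorted array referenced by $aref.
sub binary_search {
    my ($aref, $target) = @_;
    my ($lo, $hi) = (0, $#{$aref});
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = $aref->[$mid] cmp $target;
        if    ($cmp < 0) { $lo = $mid + 1; }
        elsif ($cmp > 0) { $hi = $mid - 1; }
        else             { return $mid; }
    }
    return -1;
}

# my $idx = binary_search(\@large_array, $small_array[0]);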
Alternatively, you can create a hash table that maps items to their indices:
my %positions;
$positions{ $large_array[$_] } = $_ for 0 .. $#large_array;

for my $item (@small_array) {
    say "$item has position $positions{$item}";
}
While now each lookup is possible in O(1) without any comparisons, you do have to create the hash table first. This may or may not be faster. Note that hashes can only use strings for keys. If your items are complex objects with their own concept of equality, you will have to derive a suitable key first.

Why are tuples not enumerable in Elixir?

I need an efficient structure for array of thousands of elements of the same type with ability to do random access.
While a list is most efficient for iteration and prepending, it is too slow for random access, so it does not fit my needs.
A map works better. However, it causes some overhead because it is intended for key-value pairs where the key may be anything, while I need an array with indexes from 0 to N. As a result my app ran too slowly with maps. I think this is not an acceptable overhead for such a simple task as handling ordered lists with random access.
I've found that a tuple is the most efficient structure in Elixir for my task. Compared to a map on my machine it is faster
on iteration - 1.02x for 1_000, 1.13x for 1_000_000 elements
on random access - 1.68x for 1_000, 2.48x for 1_000_000
and on copying - 2.82x for 1_000, 6.37x for 1_000_000.
As a result, my code on tuples is 5x faster than the same code on maps. It probably does not need an explanation why a tuple is more efficient than a map. The goal is achieved, but everybody says "don't use tuples for a list of similar elements", and nobody can explain this rule (an example of such a case: https://stackoverflow.com/a/31193180/5796559).
Btw, there are tuples in Python. They are also immutable, but still iterable.
So,
1. Why are tuples not enumerable in Elixir? Is there any technical or logical limitation?
2. And why should I not use them as lists of similar elements? Are there any downsides?
Please note: the question is "why", not "how". The explanation above is just an example where tuples work better than lists and maps.
1. The reason not to implement Enumerable for Tuple
From the retired Elixir talk mailing list:
If there is a
protocol implementation for tuple it would conflict with all records.
Given that custom instances for a protocol virtually always are
defined for records adding a tuple would make the whole Enumerable
protocol rather useless.
-- Peter Minten
I wanted tuples to be enumerable at first, and even
eventually implemented Enumerable on them, which did not work out.
-- Chris Keele
How does this break the protocol? I'll try to put things together and explain the problem from the technical point of view.
Tuples. What's interesting about tuples is that they are mostly used for a kind of duck typing via pattern matching. You are not required to create a new module with a new struct every time you want some new simple type. Instead you create a tuple, a kind of object of a virtual type. Atoms are often used as the first element to name the type, for example {:ok, result} and {:error, description}. This is how tuples are used almost everywhere in Elixir, because this is their purpose by design. They are also the basis for the "records" that come from Erlang. Elixir has structs for this purpose, but it also provides the Record module for compatibility with Erlang. So in most cases tuples represent single structures of heterogeneous data which are not meant to be enumerated. Tuples should be thought of as instances of various virtual types. There is even a @type directive that allows you to define custom types based on tuples. But remember they are virtual, and is_tuple/1 still returns true for all those tuples.
Protocols. On the other hand, protocols in Elixir are a kind of type class providing ad hoc polymorphism. For those who come from OOP this is something similar to superclasses and multiple inheritance. One important thing a protocol does for you is automatic type checking. When you pass some data to a protocol function, it checks that the data belongs to this class, i.e. that the protocol is implemented for this data type. If not, you'll get an error like this:
** (Protocol.UndefinedError) protocol Enumerable not implemented for {}
This way Elixir saves your code from stupid mistakes and complex errors, unless you make wrong architectural decisions.
Altogether. Now imagine we implement Enumerable for Tuple. That makes all tuples enumerable, while 99.9% of tuples in Elixir are not intended to be. All the checks are broken. The tragedy is the same as if all animals in the world began quacking. If a tuple is accidentally passed to the Enum or Stream module, you will not see a useful error message. Instead your code will produce unexpected results, unpredictable behaviour and possibly data corruption.
2. The reason not to use tuples as collections
Good, robust Elixir code should contain typespecs that help developers understand the code and give Dialyzer the ability to check it for you. Imagine you want a collection of similar elements. The typespecs for lists and maps may look like this:
@type list_of_type :: [type]
@type map_of_type :: %{optional(key_type) => value_type}
But you can't write the same typespec for a tuple, because {type} means "a tuple of a single element of type type". You can write a typespec for a tuple of predefined length like {type, type, type}, or for a tuple of any elements like tuple(), but by design there is no way to write a typespec for a tuple of similar elements. So choosing tuples to store your collection of elements means you lose this nice ability to make your code robust.
Conclusion
The rule not to use tuples as lists of similar elements is a rule of thumb that explains how to choose the right type in Elixir in most cases. Violating it may be considered a possible signal of a bad design choice. When people say "tuples are not intended for collections by design", this means not just "you are doing something unusual", but "you can break Elixir features by designing your application the wrong way".
If you really want to use a tuple as a collection for some reason, and you are sure you know what you are doing, then it is a good idea to wrap it in a struct. You can implement the Enumerable protocol for your struct without the risk of breaking everything around tuples (see the sketch after the example below). It is worth noting that Erlang itself uses tuples as collections for the internal representation of array, gb_trees, gb_sets, etc.
iex(1)> :array.from_list ['a', 'b', 'c']
{:array, 3, 10, :undefined,
{'a', 'b', 'c', :undefined, :undefined, :undefined, :undefined, :undefined,
:undefined, :undefined}}
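A minimal sketch of that struct-wrapping idea (the TupleArray name and its new/1 helper are invented for this example; member? and slice simply fall back to the protocol's default, reduce-based implementations):

defmodule TupleArray do
  # Hypothetical wrapper: keeps the data in a tuple but gives it its own type,
  # so implementing Enumerable does not touch bare tuples anywhere else.
  defstruct data: {}

  def new(list) when is_list(list), do: %__MODULE__{data: List.to_tuple(list)}
end

defimpl Enumerable, for: TupleArray do
  def count(%TupleArray{data: data}), do: {:ok, tuple_size(data)}
  def member?(_wrapper, _value), do: {:error, __MODULE__}
  def slice(_wrapper), do: {:error, __MODULE__}

  def reduce(%TupleArray{data: data}, acc, fun) do
    do_reduce(data, 0, tuple_size(data), acc, fun)
  end

  defp do_reduce(_data, _i, _size, {:halt, acc}, _fun), do: {:halted, acc}

  defp do_reduce(data, i, size, {:suspend, acc}, fun) do
    {:suspended, acc, &do_reduce(data, i, size, &1, fun)}
  end

  defp do_reduce(_data, size, size, {:cont, acc}, _fun), do: {:done, acc}

  defp do_reduce(data, i, size, {:cont, acc}, fun) do
    do_reduce(data, i + 1, size, fun.(elem(data, i), acc), fun)
  end
end

# TupleArray.new([:a, :b, :c]) |> Enum.map(&to_string/1)
# #⇒ ["a", "b", "c"]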
Not sure if there is any other technical reason not to use tuples as collections. If somebody can provide another good explanation for the conflict between the Record and the Enumerable protocol, he is welcome to improve this answer.
As you are sure you need to use tuples here, you might achieve the requested functionality at the cost of compilation time. The solution below takes a long time to compile (on the order of 100 s for @max_items 1000), but once compiled the execution time will please you. The same approach is used in Elixir core to build up-to-date UTF-8 string matchers.
defmodule Tuple.Enumerable do
  defimpl Enumerable, for: Tuple do
    @max_items 1000

    def count(tuple), do: tuple_size(tuple)
    def member?(_, _), do: false # for the sake of compiling time
    def reduce(tuple, acc, fun), do: do_reduce(tuple, acc, fun)

    defp do_reduce(_, {:halt, acc}, _fun), do: {:halted, acc}
    defp do_reduce(tuple, {:suspend, acc}, fun) do
      {:suspended, acc, &do_reduce(tuple, &1, fun)}
    end
    defp do_reduce({}, {:cont, acc}, _fun), do: {:done, acc}
    defp do_reduce({value}, {:cont, acc}, fun) do
      do_reduce({}, fun.(value, acc), fun)
    end

    Enum.each(1..@max_items-1, fn tot ->
      tail  = Enum.join(Enum.map(1..tot, & "e_★_#{&1}"), ",")
      match = Enum.join(["value"] ++ [tail], ",")
      Code.eval_string(
        "defp do_reduce({#{match}}, {:cont, acc}, fun) do
           do_reduce({#{tail}}, fun.(value, acc), fun)
         end", [], __ENV__
      )
    end)

    defp do_reduce(huge, {:cont, _}, _) do
      raise Protocol.UndefinedError,
            description: "too huge #{tuple_size(huge)} > #{@max_items}",
            protocol: Enumerable,
            value: Tuple
    end
  end
end

Enum.each({:a, :b, :c}, fn e -> IO.puts "Iterating: #{e}" end)
#⇒ Iterating: a
#  Iterating: b
#  Iterating: c
The code above explicitly avoids implementing member?, since that would take even more time to compile, and you asked for iteration only.

What is an efficient way to split an array into a training and testing set in Julia?

So I am running a machine learning algorithm in Julia with limited spare memory on my machine. Anyway, I have noticed a rather large bottleneck in the code I am using from the repository. It seems that splitting the array (randomly) takes even longer than reading the file from disk, which seems to highlight the code's inefficiencies. Any tricks to speed up this function would be greatly appreciated. The original function can be found here. Since it's a short function, I'll post it below as well.
# Split a list of ratings into a training and test set, with at most
# target_percentage * length(ratings) in the test set. The property we want to
# preserve is: any user in some rating in the original set of ratings is also
# in the training set and any item in some rating in the original set of ratings
# is also in the training set. We preserve this property by iterating through
# the ratings in random order, only adding an item to the test set if we
# haven't already hit target_percentage and we've already seen both the user
# and the item in some other ratings.
function split_ratings(ratings::Array{Rating,1},
                       target_percentage=0.10)
    seen_users = Set()
    seen_items = Set()
    training_set = (Rating)[]
    test_set = (Rating)[]
    shuffled = shuffle(ratings)
    for rating in shuffled
        if in(rating.user, seen_users) && in(rating.item, seen_items) && length(test_set) < target_percentage * length(shuffled)
            push!(test_set, rating)
        else
            push!(training_set, rating)
        end
        push!(seen_users, rating.user)
        push!(seen_items, rating.item)
    end
    return training_set, test_set
end
As previously stated, any way I can push the performance further would be greatly appreciated. I will also note that I do not really need to retain the ability to remove duplicates, but it would be a nice feature. Also, if this is already implemented in a Julia library I would be grateful to know about it. Bonus points for any solutions that leverage the parallelism abilities of Julia!
This is the most efficient code I could come up with in terms of memory.
function splitratings(ratings::Array{Rating,1}, target_percentage=0.10)
    N = length(ratings)
    splitindex = round(Integer, target_percentage * N)
    shuffle!(ratings) # This shuffles in place, which avoids allocating another array!
    return sub(ratings, splitindex+1:N), sub(ratings, 1:splitindex) # This makes subarrays instead of copying the original array!
end
However, Julia's incredibly slow file IO is now the bottleneck. This algorithm takes about 20 seconds to run on an array of 170 million elements, so I'd say it's rather performant.

Fast indexing of arrays

What is the most efficient way to access (and perhaps replace) an entry in a large multidimensional array? I am using something like this inside a loop:
tup = (16,45,6,40,3)
A[tup...] = 100
but I'm wondering if there is a more efficient way. In particular, is there a way I can avoid using ...?
There's not always a penalty involved with splatting, but determining where it is efficient isn't always obvious (or easy). Your trivial example is actually just as efficient as writing A[16,45,6,40,3] = 100. You can see this by comparing
function f(A)
    tup = (16,45,6,40,3)
    A[tup...] = 100
    A
end

function g(A)
    A[16,45,6,40,3] = 100
    A
end
julia> code_llvm(f, Tuple{Array{Int, 5}})
# Lots of output (bounds checks).
julia> code_llvm(g, Tuple{Array{Int, 5}})
# Identical to above
If there was a splatting penalty, you'd see it in the form of allocations. You can test for this with the @allocated macro or by simply inspecting code_llvm for a reference to @jl_pgcstack — that's the garbage collector, which is required any time there's an allocation. Note that there are very likely other things in a more complicated function that will also cause allocations, so its presence doesn't necessarily mean that there's a splatting pessimization. But if this is in a hot loop, you want to minimize all allocations, so it's a great target… even if your problem isn't due to splatting. You should also be using @code_warntype, as poorly typed code will definitely pessimize splats and many other operations. Here's what will happen if your tuple isn't well typed:
function h(A)
    tup = ntuple(x->x+1, 5) # type inference doesn't know the type or size of this tuple
    A[tup...] = 100
    A
end
julia> code_warntype(h, Tuple{Array{Int,5}})
# Lots of red flags
So optimizing this splat will be highly dependent upon how you construct or obtain tup.
To iterate over multidimensional arrays, it is recommended to do for index in eachindex(A); see e.g.
https://groups.google.com/forum/#!msg/julia-users/CF_Iphgt2Wo/V-b31-6oxSkJ
If A is a standard array, then this corresponds to just indexing using a single integer, which is the fastest way to access your array (your original question):
A = rand(3, 3)

for i in eachindex(A)
    println(i)
end
However, if A is a more complicated object, e.g. a subarray, then eachindex(A) will give you a different, efficient, access object:
julia> for i in eachindex(slice(A, 1:3, 2:3))
           println(i)
       end
gives
CartesianIndex{2}((1,1))
CartesianIndex{2}((2,1))
etc.
The fastest way to index a multidimensional array is to index it linearly.
Base.linearindexing is a related function from the Base module that reports the most efficient way to access the elements of an array.
julia> a = rand(1:10...);

julia> Base.linearindexing(a)
Base.LinearFast()
One could use the ii = sub2ind(size(A), tup...) syntax to convert a tuple of indices into one linear index, or for i in eachindex(A) to traverse the array.
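As a small illustration (assuming the same Julia 0.4-era API used throughout this question, where sub2ind is still available):

A = rand(4, 5, 6)
tup = (2, 3, 4)

# Convert the tuple of subscripts into a single linear index...
li = sub2ind(size(A), tup...)

# ...and use it to read/write the same element as A[2, 3, 4].
A[li] = 100.0
println(A[2, 3, 4] == A[li])  # true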
