What's the 'Ruby way' to identify duplicate values in a hash? - arrays

I'm on Ruby 2.7.x on macOS Catalina.
I have up to 1 million key value tuples.
The keys are strings and are guaranteed unique.
The values are strings and may contain duplicates (or triplicates or more).
Given the uniqueness of the keys, it seems that a hash is a natural data structure for them.
So if I start with original_hash, containing all the key value tuples,
I'd like to end up with uniques_hash, containing all and only unique key value tuples,
and duplicates_hash, containing all the keys with duplicated values.
I am more interested in optimising for clarity and Ruby idiom than memory efficiency or speed - I don't expect to be running this code frequently, and I have plenty of RAM.
If I convert to two arrays, I can find the uniques in the values array - but how do I guarantee re-pairing with the correct key? And is that the right way to approach this problem?
Many thanks for any assistance!

Suppose
original_hash = {:a=>1, :b=>2, :c=>2, :d=>2, :e=>3, :f=>4, :g=>4}
If you were only interested in returning a hash uniques_hash that contained unique values, you could write the following.
uniques_hash = original_hash.invert.invert
#=> {:a=>1, :d=>2, :e=>3, :g=>4}
the intermediate step being
original_hash.invert
#=> {1=>:a, 2=>:d, 3=>:e, 4=>:g}
See Hash#invert. Note that uniques_hash, as defined, is not itself unique. It could be any of the following.
{:a=>1, :b=>2, :e=>3, :f=>4}
{:a=>1, :b=>2, :e=>3, :g=>4}
{:a=>1, :c=>2, :e=>3, :f=>4}
{:a=>1, :c=>2, :e=>3, :g=>4}
{:a=>1, :d=>2, :e=>3, :f=>4}
{:a=>1, :d=>2, :e=>3, :g=>4}
Another way of doing this is to use Enumerable#uniq and Array#to_h.
uniques_hash = original_hash.uniq(&:last).to_h
#=> {:a=>1, :b=>2, :e=>3, :f=>4}
the intermediate calculation being
original_hash.uniq(&:last)
#=> [[:a, 1], [:b, 2], [:e, 3], [:f, 4]]
which is shorthand for
original_hash.uniq { |_k,v| v }
Presumably, each key of duplicates_hash is a value in original_hash and the value of that key is an array of those keys k in original_hash for which original_hash[k] == v.
One way to compute duplicates_hash is as follows.
duplicates_hash = original_hash.each_with_object({}) do |(k,v),h|
h[v] = (h[v] || []) << k
end
#=> {1=>[:a], 2=>[:b, :c, :d], 3=>[:e], 4=>[:f, :g]}
This can also be written
duplicates_hash = original_hash.
each_with_object(Hash.new { |h,k| h[k] = [] }) { |(k,v),h| h[v] << k }
#=> {1=>[:a], 2=>[:b, :c, :d], 3=>[:e], 4=>[:f, :g]}
See Hash::new. Both forms are equivalent to
duplicates_hash = original_hash.each_with_object({}) do |(k,v),h|
h[v] = [] unless h.key?(v)
h[v] << k
end
Writing the block variables as |(k,v),h| makes use of array decomposition.
We have an enumerator that will generate values and pass them to its block.
enum = original_hash.each_with_object({})
#=> #<Enumerator: {:a=>1, :b=>2, :c=>2, :d=>2, :e=>3, :f=>4, :g=>4}:
# each_with_object({})>
Enumerators are instances of the class Enumerator.
The first value of the enumerator is generated and the block variables are assigned values like so:
(k,v),h = enum.next
#=> [[:a, 1], {}]
Array decomposition is seen to split this array of two elements as follows:
k #=> :a
v #=> 1
h #=> {}
Notice how the parentheses on the left correspond to the inner brackets on the right. The block calculation is then performed using these variables.
h[v] = (h[v] || []) << k
#=> [:a]
Now,
h #=> {1=>[:a]}
The next value is then generated by the enumerator and the block calculation is performed.
(k,v),h = enum.next
#=> [[:b, 2], {1=>[:a]}]
k #=> :b
v #=> 2
h #=> {1=>[:a]}
h[v] = (h[v] || []) << k
so now
h #=> {1=>[:a], 2=>[:b]}
This continues until
enum.next
#=> StopIteration (exception)
causing Ruby to return the value of h.
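The stepping above can be reproduced in a short script. This is only a sketch: the names `steps` and `result` are introduced here for illustration and are not part of the answer. It relies on Kernel#loop rescuing the StopIteration raised by `enum.next`, so the loop ends cleanly once the enumerator is exhausted.

```ruby
original_hash = { a: 1, b: 2, c: 2, d: 2, e: 3, f: 4, g: 4 }

enum = original_hash.each_with_object({})
steps = 0     # illustrative counter, not in the original answer
result = nil  # will reference the memo hash the block keeps filling

loop do
  # Array decomposition: [[key, value], memo] is split into k, v and h.
  (k, v), h = enum.next
  h[v] = (h[v] || []) << k
  result = h
  steps += 1
end

p steps
p result
```

The same memo hash is handed out on every call to `enum.next`, which is why the changes made in one step are visible in the next.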
Note that by computing duplicates_hash first we could compute uniques_hash as follows.
keeper_keys = duplicates_hash.values.map(&:first)
#=> [:a, :b, :e, :f]
uniques_hash = original_hash.slice(*keeper_keys)
#=> {:a=>1, :b=>2, :e=>3, :f=>4}
or
uniques_hash = original_hash.slice(*duplicates_hash.values.map(&:first))
#=> {:a=>1, :b=>2, :e=>3, :f=>4}
See Hash#slice. If one feels guilty about favouring certain keys, one could instead write
uniques_hash = original_hash.slice(*duplicates_hash.values.map(&:sample))
#=> {:a=>1, :b=>2, :e=>3, :g=>4} (one possible result)
See Array#sample.

It might or might not be the best way to go about this, but I've used group_by and select to get a new hash that contains only the duplicated values:
hash.group_by{|k,v| v}.select{|k,v| v.count > 1}
In this case, the returned hash maps each duplicated value to an array of its [key, value] pairs, and will look a bit like:
{value => [[key, value], [key, value]]}
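To make the shape of that result concrete, here is a small sketch using the sample hash from the first answer. The variable names `grouped` and `keys_by_value` are made up for this example, and the second step (reducing each group to just its keys with Hash#transform_values) is an optional addition, not part of the answer above.

```ruby
original_hash = { a: 1, b: 2, c: 2, d: 2, e: 3, f: 4, g: 4 }

# group_by on a hash yields [key, value] pairs, so each group is an
# array of such pairs, keyed by the value they share.
grouped = original_hash.group_by { |_k, v| v }
                       .select { |_v, pairs| pairs.count > 1 }

# Optional: keep only the keys of each group.
keys_by_value = grouped.transform_values { |pairs| pairs.map(&:first) }

p grouped
p keys_by_value
```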

Count the values using group_by and store the results in a hash. Use that hash to partition the original hash like so:
h = {a: 1, b: 2, c: 2, d: 2, e: 3, f: 4, g: 4}
cnt = Hash[h.values.group_by{ |i| i }.map { |k, v| [k, v.count] }]
h_uniq, h_dups = h.partition{ |k, v| cnt[v] == 1 }.map(&:to_h)
puts cnt
# {1=>1, 2=>3, 3=>1, 4=>2}
puts h_uniq.inspect
# {:a=>1, :e=>3}
puts h_dups.inspect
# {:b=>2, :c=>2, :d=>2, :f=>4, :g=>4}
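Since the question mentions Ruby 2.7, the counting step could probably be shortened with Enumerable#tally, which was added in that version. A sketch of the same partition, assuming 2.7 or later:

```ruby
h = { a: 1, b: 2, c: 2, d: 2, e: 3, f: 4, g: 4 }

# tally counts how often each value occurs (Ruby 2.7+).
cnt = h.values.tally

# Pairs whose value occurs once go left, everything else goes right.
h_uniq, h_dups = h.partition { |_k, v| cnt[v] == 1 }.map(&:to_h)

p cnt
p h_uniq
p h_dups
```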

Thanks to all for the help on this! I thought it might help someone if I posted my draft code, and any comments or improvements are very welcome! (I'm in the process of refactoring Listing.)
#!/usr/bin/env ruby
# shebang to run the script from Terminal
# include shasum
require 'digest'
class Listing
# class of arrays of file listings
attr_reader :path
def initialize(path)
@path = path
Dir.chdir(@path)
@list_of_items = Dir['**/*']
@list_of_folders = []
@list_of_files = []
@list_of_extensions = []
@list_of_uniques = []
@list_of_duplicates = []
end
def to_s
"#{@path}"
end
def analyse_items
@list_of_items.each do |f|
if File.directory?(f)
@list_of_folders << f
else
@list_of_files << f
end
end
@list_of_folders.sort!
@list_of_files.sort!
@folder_count = @list_of_folders.count
@files_count = @list_of_files.count
@items_count = @list_of_items.count
@count_check = (@items_count - (@folder_count + @files_count))
# count_check should be zero
end
def identify_duplicates
# Given an array of filepaths, this method divides it into an array of unique files and an array of duplicated files.
source = {}
uniques = {}
@list_of_files.each do |f|
digest = Digest::SHA512.hexdigest File.read f
source.store(f, digest)
end
uniques = source.invert.invert
@list_of_uniques = uniques.keys
@list_of_duplicates = source.keys - uniques.keys
end
def tell_duplicates
puts "dupes = #{@list_of_duplicates}"
end
end
l = Listing.new("/Volumes/Things/Photos/")
l.analyse_items
l.identify_duplicates
l.tell_duplicates
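One thing the draft does not show is which files are copies of each other, because invert.invert keeps a single path per digest. A possible variation is to group the paths by digest instead. This is a sketch, not a drop-in replacement: the helper name `duplicate_groups` is hypothetical, and the temporary directory with three small files exists only to make the example runnable.

```ruby
require 'digest'
require 'tmpdir'

# Hypothetical helper: maps each digest shared by two or more files
# to the sorted list of paths that have it.
def duplicate_groups(paths)
  paths.group_by { |path| Digest::SHA512.file(path).hexdigest }
       .select { |_digest, group| group.size > 1 }
       .transform_values(&:sort)
end

groups = nil
Dir.mktmpdir do |dir|
  # Throwaway sample files, used only to demonstrate the helper.
  { 'a.txt' => 'same', 'b.txt' => 'same', 'c.txt' => 'different' }.each do |name, body|
    File.write(File.join(dir, name), body)
  end
  files = Dir.children(dir).map { |name| File.join(dir, name) }
  groups = duplicate_groups(files).values.map { |g| g.map { |f| File.basename(f) } }
end

p groups
```

Digest::SHA512.file reads the file in chunks, which may matter for large photo files; File.read as in the draft loads each file fully into memory.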

Related

I need some clarification on how this 2D array can be created in Ruby?

I'm currently learning Ruby through App Academy Open, and came across a problem that I solved differently than the course solution. I could use some clarification on how the course solution works.
We have to define a function "zip" that takes any number of arrays as arguments (but all arrays the same length). The function should return a 2D array where each subarray contains the elements at the same index from each argument.
zip(['a','b','c'],[1,2,3])
should return:
[['a',1],['b',2],['c',3]]
Here is my solution:
def zip(*arrs)
main_arr = Array.new(arrs[0].length) {Array.new}
arrs.each do |array|
array.each_with_index do |ele, ele_idx|
main_arr[ele_idx] << ele
end
end
main_arr
end
And here is the course solution:
def zip(*arrays)
length = arrays.first.length
(0...length).map do |i|
arrays.map { |array| array[i] }
end
end
Can someone explain how the 2D array is being built within the mapped range above? I'm a bit confused as a beginner and could use some clarification.
EDIT:
Thank you very much iGian. Explanation really helped.
Running the code below should be self explanatory, see the comments:
def zip_steps(*arrays)
# get the size of the array
length = arrays.first.length
# mapping the range up to use as indexing
p (0...length).map { |i| i } #=> [0, 1, 2]
# map the arrays
p arrays.map { |array| array } #=> [["a", "b", "c"], [1, 2, 3]]
# map the arrays returning a specific element at index 1, for example
p arrays.map { |array| array[1] } #=> ["b", 2]
# put arrays mapping inside the range mapping
# where instead of returning the element 1
# it returns the element i
(0...length).map do |i|
arrays.map { |array| array[i] }
end
end
ary1 = ['a','b','c']
ary2 = [1,2,3]
p zip_steps(ary1, ary2) #=> [["a", 1], ["b", 2], ["c", 3]]
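For comparison, the nested map in the course solution produces the same result as two built-ins, Array#transpose and Array#zip. A quick sketch that puts the three side by side (the method name `zip_course` is just a label for the course solution here):

```ruby
# The course solution, condensed to one line per map.
def zip_course(*arrays)
  length = arrays.first.length
  (0...length).map { |i| arrays.map { |array| array[i] } }
end

ary1 = ['a', 'b', 'c']
ary2 = [1, 2, 3]

by_course    = zip_course(ary1, ary2)
by_transpose = [ary1, ary2].transpose  # rows become columns
by_zip       = ary1.zip(ary2)          # built-in Array#zip

p by_course
p by_transpose
p by_zip
```

The outer map walks the indices (one per output row) and the inner map picks the element at that index from every input array, which is exactly what transposing a 2D array does.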

Can't get updated values in array after using .map method

I need to implement a method, which works that way:
# do_magic("abcd") # "Aaaa-Bbb-Cc-D"
# do_magic("a") # "A"
# do_magic("ab") # "Aa-B"
# do_magic("teSt") # "Tttt-Eee-Ss-T"
My decision was to convert a string into an array, iterate through this array and save the result. The code works properly inside the block, but I'm unable to get the array with updated values with this solution, it returns the same string divided by a dash (for example "t-e-S-t" when ".map" used or "3-2-1-0" when ".map!" used):
def do_magic(str)
letters = str.split ''
counter = letters.length
while counter > 0
letters.map! do |letter|
(letter * counter).capitalize
counter -= 1
end
end
puts letters.join('-')
end
Where is the mistake?
You're so close. When you have a block (letters.map!), the return of that block is the last evaluated statement. In this case, counter -= 1 is being mapped into letters.
Try
l = (letter * counter).capitalize
counter -= 1
l
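Put into a complete method, that fix might look like the sketch below. Two liberties are taken here that are not in the answer above: the while loop is dropped (a single pass of map already visits every letter), and the method returns the string instead of printing it. The local name `repeated` is made up.

```ruby
def do_magic(str)
  counter = str.length
  str.chars.map do |letter|
    repeated = (letter * counter).capitalize
    counter -= 1
    repeated # last expression in the block: this is what map collects
  end.join('-')
end

puts do_magic('abcd')
puts do_magic('teSt')
```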
You can try something like this using each_with_index
def do_magic(str)
letters = str.split("")
length = letters.length
new_letters = []
letters.each_with_index do |letter, i|
new_letters << (letter * (length - i)).capitalize
end
new_letters.join("-")
end
OR
using each_with_index.map, which gives you the equivalent of a map with an index
def do_magic(str)
letters = str.split("")
length = letters.length
letters.each_with_index.map { |letter, i|
(letter * (length - i)).capitalize
}.join("-")
end
I suggest the following.
def do_magic(letters)
length = letters.size
letters.downcase.each_char.with_index.with_object([]) { |(letter, i), new_letters|
new_letters << (letter * (length - i)).capitalize }.join
end
do_magic 'teSt'
# => "TtttEeeSsT"
Let's go through the steps.
letters = 'teSt'
length = letters.size
#=> 4
str = letters.downcase
#=> "test"
enum0 = str.each_char
#=> #<Enumerator: "test":each_char>
enum1 = enum0.with_index
#=> #<Enumerator: #<Enumerator: "test":each_char>:with_index>
enum2 = enum1.with_object([])
#=> #<Enumerator: #<Enumerator: #<Enumerator: "test":each_char>:
# with_index>:with_object([])>
Carefully examine the return values from the creation of the enumerators enum0, enum1 and enum2. The latter two may be thought of as compound enumerators.
The first element is generated by enum2 (the value of enum2.next) and the block variables are assigned values using array decomposition.
(letter, i), new_letters = enum2.next
#=> [["t", 0], []]
letter
#=> "t"
i #=> 0
new_letters
#=> []
The block calculation is then performed.
m = letter * (length - i)
#=> "tttt"
n = m.capitalize
#=> "Tttt"
new_letters << n
#=> ["Tttt"]
The next element is generated by enum2, passed to the block and the block calculations are performed.
(letter, i), new_letters = enum2.next
#=> [["e", 1], ["Tttt"]]
letter
#=> "e"
i #=> 1
new_letters
#=> ["Tttt"]
Notice how new_letters has been updated. The block calculation is as follows.
m = letter * (length - i)
#=> "eee"
n = m.capitalize
#=> "Eee"
new_letters << n
#=> ["Tttt", "Eee"]
After the last two elements of enum2 are generated we have
new_letters
#=> ["Tttt", "Eee", "Ss", "T"]
The last step is to combine the elements of new_letters to form a single string.
new_letters.join
#=> "TtttEeeSsT"

When converting an array of arrays to a hash in Ruby, the hash does not include all the pairs but keeps only the last one for each key

This is an array which I want to convert into a hash
a = [[1, 3], [3, 2], [1, 2]]
but the hash I am getting is
2.2.0 :004 > a.to_h
=> {1=>2, 3=>2}
Why is that?
Hashes have unique keys. Array#to_h is effectively doing the following:
h = {}.merge(1=>3).merge(3=>2).merge(1=>2)
#=> { 1=>3 }.merge(3=>2).merge(1=>2)
#=> { 1=>3, 3=>2 }.merge(1=>2)
#=> { 1=>2, 3=>2 }
In the last merge the value of the key 1 (3) is replaced with 2.
Note that
h.merge(k=>v)
is (permitted) shorthand for
h.merge({ k=>v })
The keys of a Hash are basically a Set, so no duplicate keys are allowed.
If two pairs are present in your Array with the same first element, only the last pair will be kept in the Hash.
If you want to keep all of the information, you could use arrays as values:
a = [[1, 3], [3, 2], [1, 2]]
hash = Hash.new{|h,k| h[k] = []}
p a.each_with_object(hash) { |(k, v), h| h[k] << v }
#=> {1=>[3, 2], 3=>[2]}
Here's a shorter but less common way to define it:
hash = a.each_with_object(Hash.new{[]}) { |(k, v), h| h[k] <<= v }
Calling hash[1] returns [3,2], which are all the second elements from the pairs of your array having 1 as first element.
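The reason the shorter form needs <<= rather than << is that the block given to Hash.new{[]} returns a new array but never stores it. A small sketch contrasting the three combinations (the variable names `storing`, `non_storing` and `assigned` are made up for this example):

```ruby
a = [[1, 3], [3, 2], [1, 2]]

# Default block that stores the new array under the missing key.
storing = Hash.new { |h, k| h[k] = [] }
a.each { |k, v| storing[k] << v }

# Default block that only returns a new array. With a plain <<, the
# array that receives v is thrown away and the hash stays empty.
non_storing = Hash.new { [] }
a.each { |k, v| non_storing[k] << v }

# Same default block, but <<= expands to h[k] = h[k] << v, so the
# result is assigned back into the hash.
assigned = Hash.new { [] }
a.each { |k, v| assigned[k] <<= v }

p storing
p non_storing
p assigned
```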

Multiple return values with arrays and hashes

I believe arrays are mostly used for returning multiple values from methods:
def some_method
return [1, 2]
end
[a, b] = some_method # should yield a = 1 and b = 2
I presume this is a kind of syntactic sugar that Ruby provides. Can we get a similar result with hashes, for instance
def some_method
return { "a" => 1, "b" => 2 }
end
{"c", "d"} = some_method() # "c" => 1, "d" => 2
I'm looking for the result { "c" => 1, "d" => 2 }, which obviously does not happen. Is there any other way this can be done? I know that we can return a hash from the method and store it and use it like so
def some_method
return {"a" => 1, "b" => 2}
end
hash = some_method()
Just curious if there is another way similar to the one with arrays but using hashes....
I think a simpler way to put the question would be...
# If we have a hash
hash = {"a" => 1, "b" => 2}
# Is the following possible
hash = {2, 3} # directly assigning values to the hash.
OR
# another example
{"c", "d"} = {2, 3} # c and d would be treated as keys and {2, 3} as respective values.
First of all, you have a syntax error. Instead of this:
[a, b] = [1, 2]
you should use:
a, b = [1, 2]
And if you want to use similar syntax with hashes, you can do:
a, b = { "c" => 1, "d" => 2 }.values # a => 1, b => 2
This is actually the same thing as the array version, because Hash#values returns an array of the hash's values in the order they were inserted (Ruby hashes preserve insertion order).
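If relying on insertion order feels fragile, Hash#values_at lets you name the keys you want, in the order you want them. A minimal sketch, reusing the method name some_method from the question (the variable names `first_b` and `then_a` are made up):

```ruby
def some_method
  { "a" => 1, "b" => 2 }
end

# values_at returns the values for the given keys, in the given order.
a, b = some_method.values_at("a", "b")

# Asking for the keys in a different order changes the result order,
# regardless of how the hash was built.
first_b, then_a = some_method.values_at("b", "a")

p [a, b]
p [first_b, then_a]
```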
What you are asking is syntactically not possible.
What you want to accomplish is possible, but you will have to code it.
One possible way to do that is shown below
hash = {"a" => 1, "b" => 2}
# Assign new keys
new_keys = ["c", "d"]
p [new_keys, hash.values].transpose.to_h
#=> {"c"=>1, "d"=>2}
# Assign new values
new_values = [3, 4]
p [hash.keys, new_values].transpose.to_h
#=> {"a"=>3, "b"=>4}
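Two other ways to get the same hashes, offered as a sketch rather than a recommendation: Array#zip pairs the arrays up directly, and Hash#transform_keys (Ruby 2.5+) renames keys through an explicit mapping. The names `mapping`, `renamed`, `rekeyed` and `revalued` are introduced only for this example.

```ruby
hash = { "a" => 1, "b" => 2 }

# Pair new keys with the existing values, position by position.
rekeyed = ["c", "d"].zip(hash.values).to_h

# Pair the existing keys with new values.
revalued = hash.keys.zip([3, 4]).to_h

# Rename keys through an explicit old => new mapping; keys that are
# missing from the mapping are left unchanged.
mapping = { "a" => "c", "b" => "d" }
renamed = hash.transform_keys { |k| mapping.fetch(k, k) }

p rekeyed
p revalued
p renamed
```

The mapping version does not depend on the order of the hash, which may be preferable when the keys come from somewhere else.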
If you really want an easier-looking way of doing this, you could monkey-patch the Hash class and define new methods that assign to the hash's keys and values. Be warned that it may not be worthwhile to mess with core classes. Anyway, a possible implementation is shown below. Use at your own RISK.
class Hash
def values= new_values
new_values.each_with_index do |v, i|
self[keys[i]] = v if i < self.size
end
end
def keys= new_keys
orig_keys = keys.dup
new_keys.each_with_index do |k, i|
if i < orig_keys.size
v = delete(orig_keys[i])
self[k] = v
rehash
end
end
end
end
hash = {"a" => 1, "b" => 2}
hash.values = [2,3]
p hash
#=> {"a"=>2, "b"=>3}
hash.keys = ["c", "d"]
p hash
#=> {"c"=>2, "d"=>3}
hash.keys, hash.values = ["x","y"], [9, 10]
p hash
#=> {"x"=>9, "y"=>10}
# *** results can be unpredictable at times ***
hash.keys, hash.values = ["a"], [20, 10]
p hash
#=> {"y"=>20, "a"=>10}

Compare two array of hashes with same keys

I have 2 arrays of hashes with same keys but different values.
A = [{:a=>1, :b=>4, :c=>2},{:a=>2, :b=>1, :c=>3}]
B = [{:a=>1, :b=>1, :c=>2},{:a=>1, :b=>3, :c=>3}]
I'm trying to compare the 1st hash in A with the 1st hash in B, and so on, key by key, and to identify which key and which value differ when they do not match. Please help.
A.each_key do |key|
if A[key] == B[key]
puts "#{key} match"
else
puts "#{key} dont match"
I am not certain which comparisons you want to make, so I will show ways of answering different questions. You want to make pairwise comparisons of two arrays of hashes, but that's really no more difficult than just comparing two hashes, as I will show later. For now, suppose you merely want to compare two hashes:
h1 = {:a=>1, :b=>4, :c=>2, :d=>3 }
h2 = {:a=>1, :b=>1, :c=>2, :e=>5 }
What keys are in h1 or h2 (or both)?
h1.keys | h2.keys
#=> [:a, :b, :c, :d, :e]
See Array#|.
What keys are in both hashes?
h1.keys & h2.keys
#=> [:a, :b, :c]
See Array#&.
What keys are in h1 but not h2?
h1.keys - h2.keys
#=> [:d]
See Array#-.
What keys are in h2 but not h1?
h2.keys - h1.keys #=> [:e]
What keys are in one hash only?
(h1.keys - h2.keys) | (h2.keys - h1.keys)
#=> [:d, :e]
or
(h1.keys | h2.keys) - (h1.keys & h2.keys)
What keys are in both hashes and have the same values in both hashes?
(h1.keys & h2.keys).select { |k| h1[k] == h2[k] }
#=> [:a, :c]
See Array#select.
What keys are in both hashes and have different values in the two hashes?
(h1.keys & h2.keys).reject { |k| h1[k] == h2[k] }
#=> [:b]
Suppose now we had two arrays of hashes:
a1 = [{:a=>1, :b=>4, :c=>2, :d=>3 }, {:a=>2, :b=>1, :c=>3, :d=>4}]
a2 = [{:a=>1, :b=>1, :c=>2, :e=>5 }, {:a=>1, :b=>3, :c=>3, :e=> 6}]
and wished to compare the hashes pairwise. To do that first take the computation of interest above and wrap it in a method. For example:
def keys_in_both_with_different_values(h1, h2)
(h1.keys & h2.keys).reject { |k| h1[k] == h2[k] }
end
Then write:
a1.zip(a2).map { |h1,h2| keys_in_both_with_different_values(h1, h2) }
#=> [[:b], [:a, :b]]
See Array#zip.
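The question also asks which values fail to match, so the same pattern can be extended to report both values for each differing key. A sketch using the arrays from the question (held in lowercase variables here); the helper name `value_differences` is hypothetical:

```ruby
a = [{ a: 1, b: 4, c: 2 }, { a: 2, b: 1, c: 3 }]
b = [{ a: 1, b: 1, c: 2 }, { a: 1, b: 3, c: 3 }]

# Hypothetical helper: for keys present in both hashes whose values
# differ, returns key => [value_in_h1, value_in_h2].
def value_differences(h1, h2)
  (h1.keys & h2.keys)
    .reject { |k| h1[k] == h2[k] }
    .map { |k| [k, [h1[k], h2[k]]] }
    .to_h
end

diffs = a.zip(b).map { |h1, h2| value_differences(h1, h2) }
p diffs
```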
Since you're comparing elements of arrays...
A.each_with_index do |hasha, index|
hashb = B[index]
hasha.each_key do |key|
if hasha[key] == hashb[key]
puts "in array #{index} the key #{key} matches"
else
puts "in array #{index} the key #{key} doesn't match"
end
end
end
edit - added a missing end!
When you are dealing with an array, you should reference an element with open-close bracket '[]' as in
A[index at which lies the element you are looking for]
If you want to access an element in a hash, you want to use open-close bracket with the corresponding key in it, as in
A[:a]
(referencing the value that corresponds to the key ':a', which is of a type symbol.)
In this case, the arrays in question are such that hashes are nested within an array. So for example, the expression B[0][:c] will give 2.
To compare the 1st hash in A with the 1st hash in B, the 2nd hash in A with the 2nd hash in B, and so forth, you can use the each_with_index method on an Array object, like so:
A = [{:a=>1, :b=>4, :c=>2},{:a=>2, :b=>1, :c=>3}]
B = [{:a=>1, :b=>1, :c=>2},{:a=>1, :b=>3, :c=>3}]
syms = [:a, :b, :c]
A.each_with_index do |hash_a, idx_a|
syms.each do |sym|
if A[idx_a][sym] == B[idx_a][sym]
puts "Match found! (key -- :#{sym}, value -- #{A[idx_a][sym]})"
else
puts "No match here."
end
end
end
This checks the values key by key (the keys are symbols) in the following order: :a -> :b -> :c -> :a -> :b -> :c
This will print out:
Match found! (key -- :a, value -- 1)
No match here.
Match found! (key -- :c, value -- 2)
No match here.
No match here.
Match found! (key -- :c, value -- 3)
The method each_with_index may look a little bit cryptic if you are not familiar with it.
If you are uncomfortable with it, you might want to check:
http://apidock.com/ruby/Enumerable/each_with_index
Last but not least, don't forget to add 'end'(s) at the end of a block (i.e. the code between do/end) and if statement in your code.
I hope it helps.

Resources