Group strings with similar pattern in Ruby - arrays

I have an array of filenames. A subset of these may share a similar pattern like this (alphabetic strings with a number at the end):
arr = %w[
WordWord1.html
WordWord3.html
WordWord10.html
WordWord11.html
AnotherWord1.html
AnotherWord2.html
FileFile.html
]
How can I identify the similar ones (they share an identical substring and differ only in their numbers) and move each group into its own array?
['WordWord1.html', 'WordWord3.html', 'WordWord10.html', 'WordWord11.html']
['AnotherWord1.html', 'AnotherWord2.html']
['FileFile.html']

arr.group_by { |x| x[/[a-zA-Z]+/] }.values
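For readers unfamiliar with the idiom, String#[] with a regex returns the first match, so the block keys each filename by its first run of letters. A minimal sketch using the question's data (the variable name groups is only illustrative):

```ruby
arr = %w[
  WordWord1.html WordWord3.html WordWord10.html WordWord11.html
  AnotherWord1.html AnotherWord2.html FileFile.html
]

# String#[] with a regex returns the first matching substring (or nil),
# so each filename is keyed by its first run of ASCII letters.
groups = arr.group_by { |name| name[/[a-zA-Z]+/] }

p groups.keys          # ["WordWord", "AnotherWord", "FileFile"]
p groups.values.last   # ["FileFile.html"]
```

Calling .values on the result drops the keys and leaves just the grouped arrays, in the order the prefixes were first seen.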

filenames = ["WordWord1.html", "WordWord3.html", "WordWord10.html", "WordWord11.html", "AnotherWord1.html", "AnotherWord2.html", "FileFile.html"]
filenames.inject({}){|h,f|k = f.split(/[^a-zA-Z]/, 2).first;h[k] ||= [];h[k] << f; h}

arr = %w[
WordWord1.html
WordWord3.html
WordWord10.html
WordWord11.html
AnotherWord1.html
AnotherWord2.html
FileFile.html
]
result = {}
arr.each do |a|
  prefix = a.match(/[A-Za-z]+/).to_s
  if result[prefix]
    result[prefix] << a
  else
    result[prefix] = [a]
  end
end
p result
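The if/else in the loop above can be avoided by giving the hash a default block. This is one possible sketch, not necessarily what the answerer had in mind, and it uses a shortened version of the sample data:

```ruby
arr = %w[WordWord1.html WordWord3.html AnotherWord1.html FileFile.html]

# The default block creates and stores an empty array the first time
# a prefix is looked up, so no explicit existence check is needed.
result = Hash.new { |hash, key| hash[key] = [] }

arr.each do |name|
  prefix = name[/[A-Za-z]+/]
  result[prefix] << name
end

p result["WordWord"]   # ["WordWord1.html", "WordWord3.html"]
p result.keys          # ["WordWord", "AnotherWord", "FileFile"]
```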

Related

Get the average of numbers in an array which is the values of an hash

In a Ruby program I have a hash whose keys are plain strings and whose values are arrays of numbers:
hash_1 = {"Luke"=> [2,3,4], "Mark"=>[3,5], "Jack"=>[2]}
What I'm looking for is the same hash, but with each array replaced by the average of its numbers:
{"Luke"=> 3, "Mark"=>4, "Jack"=>2}
One way to make this work is to create a new empty hash_2, loop over hash_1, and within the block assign each key to hash_2 with the average of its numbers as the value.
hash_2 = {}
hash_1.each do |key, value|
hash_2[key] = value.sum / value.count
end
hash_2 #=> {"Luke"=> 3, "Mark"=>4, "Jack"=>2}
Is there a better way I could do this, for instance without having to create a new hash?
hash_1 = {"Luke"=> [2,3,4], "Mark"=>[3,5], "Jack"=>[2]}
You don't need another hash; the code below modifies hash_1 in place.
p hash_1.transform_values!{|x| x.sum/x.count}
Result
{"Luke"=>3, "Mark"=>4, "Jack"=>2}
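Two details of the one-liner above are easy to miss: dividing one Integer by another rounds down, and the bang method replaces the arrays inside the original hash. A small sketch with made-up sample data (scores is a hypothetical name) shows the difference:

```ruby
scores = { "Luke" => [2, 3, 4], "Mark" => [3, 6] }

# Integer#/ rounds down, so 9 / 2 gives 4.
rounded = scores.transform_values { |v| v.sum / v.size }

# Integer#fdiv always performs floating-point division.
exact = scores.transform_values { |v| v.sum.fdiv(v.size) }

p rounded["Mark"]   # 4
p exact["Mark"]     # 4.5

# transform_values (no bang) returns a new hash, so the arrays are still there.
p scores["Mark"]    # [3, 6]
```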
def avg(arr)
return nil if arr.empty?
arr.sum.fdiv(arr.size)
end
h = { "Matthew"=>[2], "Mark"=>[3,6], "Luke"=>[2,3,4], "Jack"=>[] }
h.transform_values { |v| avg(v) }
#=> {"Matthew"=>2.0, "Mark"=>4.5, "Luke"=>3.0, "Jack"=>nil}
@Виктор: OK. How about this:
hash_1 = {"Luke"=> [2,3,4], "Mark"=>[3,5], "Jack"=>[2], "Bobby"=>[]}
hash_2 = hash_1.reduce(Hash.new(0)) do |acc, (k, v)|
  acc[k] = v.empty? ? 0 : v.sum / v.size
  acc
end
p hash_2
This solution differs from the one that uses transform_values! because it returns a new Hash object instead of modifying hash_1.
hash_1.map { |k,v| [k, v.sum / v.size] }.to_h
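Note that the one-liner above raises ZeroDivisionError when a value is an empty array, as with "Bobby"=>[] in the previous answer. A hedged variant follows; returning nil for an empty array is an assumption on my part, and Hash#to_h with a block requires Ruby 2.6 or later:

```ruby
hash_1 = { "Luke" => [2, 3, 4], "Mark" => [3, 5], "Bobby" => [] }

# Hash#to_h with a block builds the new hash directly from the
# [key, value] pairs the block returns (no intermediate array from map).
averages = hash_1.to_h do |name, numbers|
  # Guard against empty arrays: 0 / 0 would raise ZeroDivisionError.
  [name, numbers.empty? ? nil : numbers.sum / numbers.size]
end

p averages["Luke"]    # 3
p averages["Mark"]    # 4
p averages["Bobby"]   # nil
```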

Ruby - Print only the values of duplicate hash keys in an array of hashes

I have created an array of hashes from data I have to pull in from an XML file. The problem is that some of the hash keys in the array are duplicates, and I'd like to pull out just the values. For example, the code below outputs the following:
{"server_host"=>"hostone", "server_type"=>"redhat", "server_name"=>"RedhatOne"}
{"server_host"=>"hostone", "server_type"=>"windows", "server_name"=>"WinOne"}
and I'd like to be able to print out this:
{"server_host"=>"hostone", "server_type"=>"redhat", "server_name"=>"RedhatOne"}
{"server_type"=>"windows", "server_name"=>"WinOne"}
I think I need to create another array based on duplicate keys, but what I am trying below is not working:
def parse_xml_file(filename)
require 'nokogiri'
xmlSource = File.read(filename)
parsedXml = Nokogiri::XML(xmlSource)
hostArray = Array.new
parsedXml.xpath("/New/Server").each do |srvNode|
hostNode = srvNode.at_xpath("Host")
hostArray << {"server_name"=>srvNode["Name"],
"server_type"=>srvNode["Type"], "server_host"=>hostNode["Address"] }
grouped = hostArray.group_by{|row| [row[:server_host]]}
filtered = grouped.values.select { |a| a.size > 1 }.flatten
end
Assuming you have a variable hash_arr which contains your duplicated hashes, here is some code that should get you pretty close to where you want to be. It's not optimized, but it's simple enough to understand:
hash_arr.group_by { |h| h["server_host"] }.each do |host_name, values|
puts "Server Host: #{host_name}"
values.each do |val|
val.delete("server_host")
puts val
end
end
prints out:
Server Host: hostone
{"server_type"=>"redhat", "server_name"=>"RedhatOne"}
{"server_type"=>"windows", "server_name"=>"WinOne"}
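One caveat with the snippet above is that val.delete("server_host") removes the key from the hashes stored in hash_arr itself. If the original data is needed afterwards, a non-destructive sketch along the same lines could look like this (by_host is a hypothetical name):

```ruby
hash_arr = [
  { "server_host" => "hostone", "server_type" => "redhat",  "server_name" => "RedhatOne" },
  { "server_host" => "hostone", "server_type" => "windows", "server_name" => "WinOne" }
]

# Group by host, then strip the "server_host" key from copies of the rows.
# Hash#reject returns a new hash, so hash_arr keeps its original contents.
by_host = hash_arr.group_by { |h| h["server_host"] }
by_host = by_host.transform_values do |rows|
  rows.map { |row| row.reject { |key, _value| key == "server_host" } }
end

by_host.each do |host, rows|
  puts "Server Host: #{host}"
  rows.each { |row| puts row }
end
```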
Or if you just want the values per key, without associating them across hashes:
hash_arr =[{"server_host"=>"hostone", "server_type"=>"redhat", "server_name"=>"RedhatOne"}, {"server_host"=>"hostone", "server_type"=>"windows", "server_name"=>"WinOne"}]
merged_hash = {}
hash_arr.each do |hash|
hash.each do |k, v|
merged_hash[k] ||= []
merged_hash[k] << v
end
end
merged_hash.values.each(&:uniq!)
And then the output:
[9] pry(main)> merged_hash
=> {"server_host"=>["hostone"], "server_type"=>["redhat", "windows"], "server_name"=>["RedhatOne", "WinOne"]}
This will get you the shared values:
shared = hash1.keep_if { |k, v| hash2.key? k }
And then you could print that however you like. I don't know if you want to print the keys, the values, or both, so adapt as needed:
shared.each_pair { |k, v| print k, v }
You could obviously merge these two snippets into one command, but for the sake of clarity, they are 2.
EDIT:
Just noticed you wanted as an array. If you wanted just values:
array = hash1.keep_if { |k, v| hash2.key? k }.values
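Keep in mind that keep_if deletes the non-shared entries from hash1 itself. If hash1 should stay intact, Hash#select is a non-destructive alternative. The sample hashes below are hypothetical, since the answer does not define hash1 and hash2:

```ruby
# Hypothetical sample data standing in for hash1 and hash2.
hash1 = { "server_host" => "hostone", "server_type" => "redhat" }
hash2 = { "server_host" => "hostone", "server_name" => "WinOne" }

# Hash#select returns a new hash with the entries whose key also
# exists in hash2; hash1 is left unchanged.
shared = hash1.select { |key, _value| hash2.key?(key) }

p shared.keys     # ["server_host"]
p shared.values   # ["hostone"]
p hash1.size      # 2
```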
Thanks for the advice. I've tried this:
shared = Hash.new
grouped = hostArray.group_by{|row| [row[:server_host]]}
filtered = grouped.values.select { |a| a.size > 1 }.flatten
filtered.each do |element|
element.each do |key, value|
shared = element.keep_if { |k, v| element.key? k }
end
shared.each_pair { |k, v| print k," ", v, "\n" }
end
but the output is still incorrect. I think I've referenced 'hash2' wrongly; is that correct?

Find difference between two arrays considering duplicates [duplicate]

[1,2,3,3] - [1,2,3] produces the empty array []. Is it possible to retain duplicates so it returns [3]?
I am so glad you asked. I would like to see such a method added to the class Array in some future version of Ruby, as I have found many uses for it:
class Array
def difference(other)
h = other.each_with_object(Hash.new(0)) { |e,h| h[e] += 1 }
reject { |e| h[e] > 0 && h[e] -= 1 }
end
end
A description of the method and links to some of its applications are given here.
By way of example:
a = [1,2,3,4,3,2,4,2]
b = [2,3,4,4,4]
a - b #=> [1]
a.difference b #=> [1,3,2,2]
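Ruby v2.7 gave us the method Enumerable#tally, allowing us to replace the first line of the method with the following (tally returns a hash without a default value, so the default of 0 has to be set explicitly for elements that do not occur in other):
h = other.tally.tap { |counts| counts.default = 0 }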
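Since Ruby 2.6, Array already ships a built-in #difference that behaves like -, so reopening the class would override it. The sketch below therefore uses a standalone method with a hypothetical name, multiset_difference, and relies on Enumerable#tally (Ruby 2.7+):

```ruby
# Hypothetical helper: removes from `a` one occurrence for each
# occurrence in `b`, keeping the remaining elements in order.
def multiset_difference(a, b)
  remaining = b.tally     # counts per element of b
  remaining.default = 0   # elements absent from b count as 0
  a.reject do |e|
    if remaining[e] > 0
      remaining[e] -= 1   # consume one occurrence from b
      true                # and drop this element of a
    else
      false
    end
  end
end

p multiset_difference([1, 2, 3, 3], [1, 2, 3])                     # [3]
p multiset_difference([1, 2, 3, 4, 3, 2, 4, 2], [2, 3, 4, 4, 4])   # [1, 3, 2, 2]
```

Neither argument is modified, and the earliest matching occurrences in the first array are the ones removed.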
As far as I know, you can't do this with a built-in operation. Can't see anything in the ruby docs either. Simplest way to do this would be to extend the array class like this:
class Array
  def difference(array2)
    remaining = array2.dup # work on a copy so the caller's array is not modified
    final_array = []
    each do |item|
      if (index = remaining.index(item))
        remaining.delete_at(index)
      else
        final_array << item
      end
    end
    final_array # return the leftover items rather than self
  end
end
There may well be a more efficient way to do this.
EDIT:
As suggested by user2864740 in question comments, using Array#slice! is a much more elegant solution
def arr_sub(a,b)
a = a.dup #if you want to preserve the original array
b.each {|del| a.slice!(a.index(del)) if a.include?(del) }
return a
end
Credit:
My original answer
def arr_sub(a,b)
b = b.each_with_object(Hash.new(0)){ |v,h| h[v] += 1 }
a = a.each_with_object([]) do |v, arr|
arr << v if b[v] < 1
b[v] -= 1
end
end
arr_sub([1,2,3,3],[1,2,3]) # => [3]
arr_sub([1,2,3,3,4,4,4],[1,2,3,4,4]) # => [3, 4]
arr_sub([4,4,4,5,5,5,5],[4,4,5,5,5,5,6,6]) # => [4]

Ruby merging items in array conditionally

I have an array containing capital and small letters. I am trying to concatenate capital letters with the following small letters in a new array. For example, I have the following array
first_array = ["A","b","C","d","e"]
and I want to obtain the following array
["Ab","Cde"] #new array
I am trying to iterate through the first array with code that looks like this:
new_array = []
first_array.each_with_index do |a,index|
if (a!~/^[a-z].*$/)
new_array = new_array.push "#{a}"
else
new_array[-1] = first_array[index-1] + "#{a}" #the idea is to concatenate the small letter with the previous capital letter and replace the last item in the new array
end
but it does not work. I am not sure I am tackling this issue efficiently which is why I can't resolve it. Could somebody suggest some options?
If you join as a string you can then scan to get all the matches:
first_array.join.scan(/[A-Z][a-z]*/)
=> ["Ab", "Cde"]
While I prefer @Paul's answer, you could do the following.
first_array.slice_before { |s| s.upcase == s }.map(&:join)
#=> ["Ab", "Cde"]
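The two approaches above agree on the question's input, but they can differ when the array does not start with a capital letter. A small sketch with a hypothetical input (letters) illustrates this:

```ruby
letters = ["a", "A", "b", "C", "d", "e"]

# scan only returns text that matches the pattern, so the leading "a"
# (which has no capital in front of it) is dropped.
scanned = letters.join.scan(/[A-Z][a-z]*/)
p scanned   # ["Ab", "Cde"]

# slice_before starts a new chunk before each capital and keeps every
# element, so the leading "a" ends up in a chunk of its own.
sliced = letters.slice_before { |s| s.upcase == s }.map(&:join)
p sliced    # ["a", "Ab", "Cde"]
```

Which behavior is preferable depends on whether stray leading lowercase letters should be kept or discarded.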
So, you want to split your original array when next char is uppercase, and then make strings of those subarrays? There's a method in standard lib that can help you here:
first_array = ["A","b","C","d","e"]
result = first_array.slice_when do |a, b|
a_lower = a.downcase == a
b_upper = b.upcase == b
a_lower && b_upper
end.map(&:join)
result # => ["Ab", "Cde"]
I like Sergio's answer a lot, here's what I brewed up while trying it out:
def append_if_present(lowercase, letters)
lowercase << letters.join if letters.size > 0
end
first_array = ["A","b","C","d","e"]
capitals = []
lowercase = []
letters = []
first_array.each_with_index do |l, i|
if l =~ /[A-Z]/
capitals << l
append_if_present(lowercase, letters)
letters = []
else
letters << l
end
end
append_if_present(lowercase, letters)
p capitals.zip(lowercase).map(&:join)
Here's a method that uses Enumerable#slice_when:
first_array.slice_when{ |a, b| b.upcase == b && a.downcase == a }.map(&:join)

Dynamically deleting elements from an array while enumerating through it

I am going through my system dictionary and looking for words that are, according to a strict definition, neither subsets nor supersets of any other word.
The implementation below does not work, but if it did, it would be pretty efficient, I think. How do I iterate through the array and also remove items from that same array during iteration?
def collect_dead_words
result = @file # the words in my system dictionary, as an array
wg = WordGame.new # the class that "knows" the find_subset_words &
# find_superset_words methods
result.each do |value|
wg.word = value
supersets = wg.find_superset_words.values.flatten
subsets = wg.find_subset_words.values.flatten
result.delete(value) unless (matches.empty? && subsets.empty?)
result.reject! { |cand| supersets.include? cand }
result.reject! { |cand| subsets.include? cand }
end
result
end
Note: find_superset_words and find_subset_words both return hashes, hence the values.flatten bit
It is inadvisable to modify a collection while iterating over it. Instead, either iterate over a copy of the collection, or create a separate array of things to remove later.
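Both suggestions can be sketched on a toy example. The word list and the start_with?("c") condition below are hypothetical stand-ins for the dictionary and the subset/superset test:

```ruby
words = %w[cat catch dog craft bird]

# Option 1: iterate over a snapshot (dup) while deleting from the
# array that is actually being modified.
list = words.dup
list.dup.each { |w| list.delete(w) if w.start_with?("c") }
p list        # ["dog", "bird"]

# Option 2: collect the items to remove first, then remove them in
# one step after the iteration has finished.
to_remove = words.select { |w| w.start_with?("c") }
remaining = words - to_remove
p remaining   # ["dog", "bird"]
```

In both variants the array being iterated is never the one being modified, which avoids skipped elements.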
One way to accomplish this is with Array#delete_if. Here's my run at it so you get the idea:
supersets_and_subsets = []
result.delete_if do |el|
wg.word = el
superset_and_subset = wg.find_superset_words.values.flatten + wg.find_subset_words.values.flatten
supersets_and_subsets << superset_and_subset
!superset_and_subset.empty?
end
result -= supersets_and_subsets.flatten.uniq
Here's what I came up with based on your feedback (plus a further optimization by starting with the shortest words):
def collect_dead_words
  result = []
  collection = @file
  num = @file.max_by(&:length).length
  1.upto(num) do |index|
    subset_by_length = collection.select { |word| word.length == index }
    until subset_by_length.empty?
      word = subset_by_length[0]
      wg = WordGame.new(word)
      supermatches = wg.find_superset_words.values.flatten
      submatches = wg.find_subset_words.values.flatten
      collection.reject! { |cand| supermatches.include? cand }
      collection.reject! { |cand| submatches.include? cand }
      result << wg.word if supermatches.empty? && submatches.empty?
      # Drop the processed word from both arrays so the loop advances.
      subset_by_length.delete(word)
      collection.delete(word)
    end
  end
  result
end
Further optimizations are welcome!
The problem
As I understand it, string s1 is a subset of string s2 if s1 == s2 after zero or more characters are removed from s2; that is, if there exists a mapping m of the indices of s1 such that [1]:
for each index i of s1, s1[i] = s2[m(i)]; and
if i < j then m(i) < m(j).
Further s2 is a superset of s1 if and only if s1 is a subset of s2.
Note that for s1 to be a subset of s2, s1.size <= s2.size must be true.
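The definition above amounts to a subsequence test. As an illustration only (this is not the method used in the code further down), it can be checked with a single pass over s2; the method name subsequence? is hypothetical:

```ruby
# Returns true if s1 can be obtained from s2 by deleting zero or more
# characters, i.e. the characters of s1 appear in s2 in the same order.
def subsequence?(s1, s2)
  i = 0 # index of the next character of s1 still to be matched
  s2.each_char do |ch|
    i += 1 if i < s1.size && s1[i] == ch
  end
  i == s1.size
end

p subsequence?("cat", "craft")   # true
p subsequence?("cat", "cutie")   # false
p subsequence?("cat", "enact")   # false
p subsequence?("at", "cat")      # true
```

Matching each character of s1 at its earliest possible position in s2 is enough, because taking an earlier match never rules out a later one.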
For example:
"cat" is a subset of "craft" because the latter becomes "cat" if the "r" and "f" are removed.
"cat" is not a subset of "cutie" because "cutie" has no "a".
"cat" is not a superset of "at" because "cat".include?("at") #=> true.
"cat" is not a subset of "enact" because m(0) = 3 and m(1) = 2, but m(0) < m(1) is false.
Algorithm
Subset (and hence superset) is a transitive relation, which permits significant algorithmic efficiencies. By this I mean that if s1 is a subset of s2 and s2 is a subset of s3, then s1 is a subset of s3.
I will proceed as follows:
Create empty sets neither_sub_nor_sup and longest_sups and an empty array subs_and_sups.
Sort the words in the dictionary by length, longest first.
Add w to neither_sub_nor_sup, where w is the longest word in the dictionary.
For each subsequent word w in the dictionary (longest to shortest), perform the following operations:
for each element u of neither_sub_nor_sup, determine if w is a subset of u. If it is, move u from neither_sub_nor_sup to longest_sups and append u to subs_and_sups.
if one or more elements were moved from neither_sub_nor_sup to longest_sups, append w to subs_and_sups; else add w to neither_sub_nor_sup.
Return subs_and_sups.
Code
require 'set'
def identify_subs_and_sups(dict)
neither_sub_nor_sup, longest_sups = Set.new, Set.new
dict.sort_by(&:size).reverse.each_with_object([]) do |w,subs_and_sups|
switchers = neither_sub_nor_sup.each_with_object([]) { |u,arr|
arr << u if w.subset(u) }
if switchers.any?
subs_and_sups << w
switchers.each do |u|
neither_sub_nor_sup.delete(u)
longest_sups << u
subs_and_sups << u
end
else
neither_sub_nor_sup << w
end
end
end
class String
def subset(w)
w =~ Regexp.new(self.gsub(/./) { |m| "#{m}\\w*" })
end
end
Example
dict = %w| cat catch craft cutie enact trivial rivert river |
#=> ["cat", "catch", "craft", "cutie", "enact", "trivial", "rivert", "river"]
identify_subs_and_sups(dict)
#=> ["river", "rivert", "cat", "catch", "craft"]
Variant
Rather than processing the words in the dictionary from longest to shortest, we could instead order them shortest to longest:
def identify_subs_and_sups1(dict)
neither_sub_nor_sup, shortest_sups = Set.new, Set.new
dict.sort_by(&:size).each_with_object([]) do |w,subs_and_sups|
switchers = neither_sub_nor_sup.each_with_object([]) { |u,arr|
arr << u if u.subset(w) }
if switchers.any?
subs_and_sups << w
switchers.each do |u|
neither_sub_nor_sup.delete(u)
shortest_sups << u
subs_and_sups << u
end
else
neither_sub_nor_sup << w
end
end
end
identify_subs_and_sups1(dict)
#=> ["craft", "cat", "rivert", "river"]
Benchmarks
(to be continued...)
[1] The OP stated (in a later comment) that s1 is not a subset of s2 if s2.include?(s1) #=> true. I am going to pretend I never saw that, as it throws a spanner into the works. Unfortunately, subset is no longer a transitive relation with that additional requirement. I haven't investigated the implications of that, but I suspect it means a rather brutish algorithm would be required, possibly requiring pairwise comparisons of all the words in the dictionary.
