I'm parsing multiple websites and trying to build a hash that looks something like:
"word" => [["01.html", 2], ["02.html", 7], ["03.html", 4]]
where word is a given word in the index, the first value in each sublist is the file it was found in, and the second value is the number of occurrences in that given file.
I'm running into an issue where, rather than appending ["02.html", 7] inside the values list, it creates a whole new entry for "word" and puts ["02.html", 7] at the end of the hash. This results in basically giving me individual indexes for all of my websites appended after each other rather than giving me one master index.
Here is my code:
for token in tokens
if !invindex.include?(token)
invindex[token] = [[doc_name, 1]] #adds the word to the hash with the doc name and occurrence of 1
else
for list in invindex[token]
if list[0] == doc_name
list[1] += 1 #adds one to the occurrence with the same doc_name
else
invindex[token].insert([doc_name, 1]) #this SHOULD append the doc name and initial occurrence inside the word's value list since the word is already in the hash
end
end
end
end
end
Hopefully it's something simple and I just missed something when I traced it on paper.
I'm running into an issue where, rather than appending ["02.html", 7]
inside the values list, it creates a whole new entry for "word" and
puts ["02.html", 7] at the end of the hash.
I'm not seeing that:
invindex = {
word1: [
['01.html', 2],
]
}
tokens = %i[
word1
word2
word3
]
doc_name = '02.html'
tokens.each do |token|
if !invindex.include?(token)
invindex[token] = [[doc_name, 1]] #adds the word to the hash with the doc name and occurrence of 1
else
invindex[token].each do |list|
if list[0] == doc_name
list[1] += 1 #adds one to the occurrence with the same doc_name
else
invindex[token].insert([doc_name, 1]) #this SHOULD append the doc name and initial occurrence inside the word's value list since the word is already in the hash
end
end
end
end
p invindex
--output:--
{:word1=>[["01.html", 2]], :word2=>[["02.html", 1]], :word3=>[["02.html", 1]]}
invindex[token].insert([doc_name, 1]) #this SHOULD append the doc name
Nope:
invindex = {
word: [
['01.html', 2],
]
}
token = :word
doc_name = '02.html'
invindex[token].insert([doc_name, 7])
p invindex
invindex[token].insert(-1, ["02.html", 7])
p invindex
--output:--
{:word=>[["01.html", 2]]}
{:word=>[["01.html", 2], ["02.html", 7]]}
Array#insert() requires that you specify an index as the first argument. Typically when you want to append something to the end, you use <<:
invindex = {
word: [
['01.html', 2],
]
}
token = :word
doc_name = '02.html'
invindex[token] << [doc_name, 7]
p invindex
--output:--
{:word=>[["01.html", 2], ["02.html", 7]]}
for token in tokens
Rubyists don't use for-in loops because for-in loops call each(), so rubyists call each() directly:
tokens.each do |token|
...
end
Finally, indenting in ruby is 2 spaces--not 3 spaces, not 1 space, not 4 spaces. It's 2 spaces.
Applying all that to your code:
invindex = {
  word1: [
    ['01.html', 2],
  ]
}
tokens = %i[
  word1
  word2
  word3
]
doc_name = '01.html'
tokens.each do |token|
  if !invindex.include?(token)
    invindex[token] = [[doc_name, 1]] #adds the word to the hash with the doc name and occurrence of 1
  else
    invindex[token].each do |list|
      if list[0] == doc_name
        list[1] += 1 #adds one to the occurrence with the same doc_name
      else
        invindex[token] << [doc_name, 1] #this SHOULD append the doc name and initial occurrence inside the word's value list since the word is already in the hash
      end
    end
  end
end
p invindex
--output:--
{:word1=>[["01.html", 3]], :word2=>[["01.html", 1]], :word3=>[["01.html", 1]]}
However, there is still a problem, which is due to the fact that you are changing an Array that you are stepping through--a big no-no in computer programming:
invindex[token].each do |list|
if list[0] == doc_name
list[1] += 1 #adds one to the occurrence with the same doc_name
else
invindex[token] << [doc_name, 1] #***PROBLEM***
Look what happens:
invindex = {
word1: [
['01.html', 2],
]
}
tokens = %i[
word1
word2
word3
]
%w[ 01.html 02.html].each do |doc_name|
tokens.each do |token|
if !invindex.include?(token)
invindex[token] = [[doc_name, 1]] #adds the word to the hash with the doc name and occurrence of 1
else
invindex[token].each do |list|
if list[0] == doc_name
list[1] += 1 #adds one to the occurrence with the same doc_name
else
invindex[token] << [doc_name, 1] #this SHOULD append the doc name and initial occurrence inside the word's value list since the word is already in the hash
end
end
end
end
end
p invindex
--output:--
{:word1=>[["01.html", 3], ["02.html", 2]], :word2=>[["01.html", 1], ["02.html", 2]], :word3=>[["01.html", 1], ["02.html", 2]]}
Problem 1: You don't want to insert [doc_name, 1] every time the sub Array you are examining doesn't contain the doc_name--you only want to insert [doc_name, 1] after ALL the sub Arrays have been examined, and the doc_name wasn't found. If you run the example above with the starting hash:
invindex = {
word1: [
['01.html', 2],
['02.html', 7],
]
}
...you will see that the output is even worse.
Problem 2: Appending [doc_name, 1] to the Array while you are stepping through the Array means that [doc_name, 1] will be examined, too, when the loop gets to the end of the Array--and then your loop will increment its count to 2. The rule is: don't change an Array you are stepping through because bad things will happen.
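One way to avoid both problems, as a minimal sketch: search the word's list for an existing [doc_name, count] pair with find, and only append once the search has come up empty. The helper name add_tokens is hypothetical, not something from the question.

```ruby
# Hypothetical helper: records every token of one document in the index.
def add_tokens(invindex, tokens, doc_name)
  tokens.each do |token|
    entries = (invindex[token] ||= [])
    # Scan the whole list first; nothing is appended while it is being scanned.
    entry = entries.find { |name, _count| name == doc_name }
    if entry
      entry[1] += 1            # doc already listed for this word
    else
      entries << [doc_name, 1] # doc not found anywhere in the list
    end
  end
  invindex
end

invindex = { word1: [['01.html', 2]] }
%w[01.html 02.html].each do |doc_name|
  add_tokens(invindex, %i[word1 word2 word3], doc_name)
end
p invindex
```

With this version each word ends up with exactly one pair per document, and the count for '02.html' is 1 rather than 2.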
Do you actually need to have a hash that contains an array of arrays?
This can be much better described with a nested hash
invindex = {
"word" => { '01.html' => 2, '02.html' => 7, '03.html' => 4 },
"other" => { '01.html' => 1, '02.html' => 17, '04.html' => 4 }
}
which can be easily populated by using a Hash factory like
invindex = Hash.new { |h,k| h[k] = Hash.new {|hh,kk| hh[kk] = 0} }
tokens.each do |token|
invindex[token][doc_name] += 1
end
Now, if you absolutely need the format you mention, you can get it from the invindex described above with a simple iteration:
result = {}
invindex.each {|k,v| result[k] = v.to_a }
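As a rough end-to-end sketch of this approach, assume a hypothetical docs hash that maps each file name to its list of tokens (the sample data is made up). The inner hash here uses Hash.new(0) rather than the block form, which behaves the same for counting.

```ruby
# Hypothetical sample data: file name => tokens found in that file.
docs = {
  '01.html' => %w[word other word],
  '02.html' => %w[word]
}

# The outer hash creates an inner counting hash the first time a word is seen.
invindex = Hash.new { |h, k| h[k] = Hash.new(0) }

docs.each do |doc_name, tokens|
  tokens.each { |token| invindex[token][doc_name] += 1 }
end

# Convert to the array-of-pairs layout from the question.
result = invindex.transform_values(&:to_a)
#=> {"word"=>[["01.html", 2], ["02.html", 1]], "other"=>[["01.html", 1]]}
```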
Suppose:
arr = %w| 01.html 02.html 03.html 02.html 03.html 03.html |
#=> ["01.html", "02.html", "03.html", "02.html", "03.html", "03.html"]
is an array of your files for a given word in the index. Then the value of that word in the hash is given by constructing the counting hash:
h = arr.each_with_object(Hash.new(0)) { |s,h| h[s] += 1 }
#=> {"01.html"=>1, "02.html"=>2, "03.html"=>3}
and then converting it to an array:
h.to_a
#=> [["01.html", 1], ["02.html", 2], ["03.html", 3]]
so you could write:
arr.each_with_object(Hash.new(0)) { |s,h| h[s] += 1 }.to_a
Hash::new is given a default value of zero. That means that if the hash being constructed, h, does not have a key s, h[s] returns zero. In that case:
h[s] += 1
#=> h[s] = h[s] + 1
# = 0 + 1 = 1
and when the same value of s in arr is passed to the block:
h[s] += 1
#=> h[s] = h[s] + 1
# = 1 + 1 = 2
You may consider whether it would be better to make the value of each word in the index the hash h.
I am learning Ruby and just solved this pyramid problem. For whatever reason, I tried to change twoD[0] to the variable twoDidx (see third line).
However, when I try replacing while twoD[0].length != 1 with while twoDidx.length != 1, I get "undefined." What am I not understanding about how variables work? Thanks.
def pyramid_sum(base)
twoD = [base]
twoDidx = twoD[0]
while twoD[0].length != 1
arr = twoD[0].map.with_index do |num, idx|
if idx != twoD[0].length - 1
num + twoD[0][idx + 1]
end
end
arr = arr.compact
twoD.unshift(arr)
end
return twoD
end
print pyramid_sum([1, 4, 6]) #=> [[15], [5, 10], [1, 4, 6]]
There's a big difference between twoDidx and twoD[0]. twoDidx refers to whatever object was the first element of twoD at the time of the assignment, while twoD[0] looks up the first element of the twoD array each time it is evaluated.
To make it more obvious:
array = [1]
first = array[0] # Here you just assign 1 to the variable
array = [100]
first #=> 1
array[0] #=> 100
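The same effect shows up in pyramid_sum without reassigning the array at all: unshift changes which object sits at index 0, while a variable captured earlier keeps pointing at the old row. A small sketch (the variable names are illustrative):

```ruby
two_d = [[1, 4, 6]]
first_row = two_d[0]      # first_row now refers to the array [1, 4, 6]

two_d.unshift([5, 10])    # a new row is placed at index 0

first_row                 #=> [1, 4, 6]
two_d[0]                  #=> [5, 10]
first_row.equal?(two_d[1])  #=> true (same object, now at index 1)
```

A loop condition written against first_row.length would therefore keep seeing the length of the original base row.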
There is an array with some numbers. All numbers are equal except for one. I'm trying to get this type of thing:
find_uniq([ 1, 1, 1, 2, 1, 1 ]) == 2
find_uniq([ 0, 0, 0.55, 0, 0 ]) == 0.55
I tried this:
def find_uniq(arr)
arr.uniq.each{|e| arr.count(e)}
end
It gives me the two different values in the array, but I'm not sure how to pick the one that's unique. Can I use some sort of count or not? Thanks!
This worked:
def find_uniq(arr)
return nil if arr.size < 3
if arr[0] != arr[1]
return arr[1] == arr[2] ? arr[0] : arr[1]
end
arr.each_cons(2) { |x, y| return y if y != x }
end
Thanks pjs and Cary Swoveland.
I would do this:
[ 1, 1, 1, 2, 1, 1 ]
.tally # { 1=>5, 2=>1 }
.find { |_, v| v == 1 } # [2, 1]
.first # 2
Or as 3limin4t0r suggested:
[ 1, 1, 1, 2, 1, 1 ]
.tally # { 1=>5, 2=>1 }
.invert[1] # { 5=>1, 1=>2 } => 2
The following doesn't use tallies and will short circuit the search when a unique item is found. First, it returns nil if the array has fewer than 3 elements, since there's no way to answer the question in that case. If that check is passed, it works by comparing adjacent values. It performs an up-front check that the first two elements are equal—if not, it checks against the third element to see which one is different. Otherwise, it iterates through the array and returns the first value it finds which is unequal. It returns nil if there is not a distinguished element in the array.
def find_uniq(arr)
return nil if arr.size < 3
if arr[0] == arr[1]
arr.each.with_index do |x, i|
i += 1
return arr[i] if arr[i] != x
end
elsif arr[1] == arr[2]
arr[0]
else
arr[1]
end
end
This also works with non-numeric arrays such as find_uniq(%w(c c c d c c c c)).
Thanks to Cary Swoveland for reminding me about each_cons. That can tighten up the solution considerably:
def find_uniq(arr)
  return nil if arr.size < 3
  if arr[0] != arr[1]
    return arr[1] == arr[2] ? arr[0] : arr[1]
  end
  arr.each_cons(2) { |x, y| return y if y != x }
  nil # no distinguished element; each_cons itself returns the receiver on Ruby 3.1+
end
For all but tiny arrays this method effectively has the speed of Enumerable#find.
def find_uniq(arr)
multi = arr[0,3].partition { |e| e == arr.first }
.sort_by { |e| -e.size }.first.first
arr.find { |e| e != multi }
end
find_uniq [1, 1, 1, 2, 1, 1] #=> 2
find_uniq [0, 0, 0.55, 0, 0] #=> 0.55
find_uniq [:pig, :pig, :cow, :pig] #=> :cow
The wording of the question implies the array contains at least three elements. It certainly cannot be empty or have two elements. (If it could contain one element add the guard clause return arr.first if arr.size == 1.)
I examine the first three elements to determine the object that has duplicates, which I assign to the variable multi. I then am able to use find. find is quite fast, in part because it short-circuits (stops enumerating the array when it achieves a match).
If
arr = [1, 1, 1, 2, 1, 1]
then
a = arr[0,3].partition { |e| e == arr.first }.sort_by { |e| -e.size }
#=> [[1, 1, 1], []]
multi = a.first.first
#=> 1
If any of these:
arr = [2, 1, 1, 1, 1, 1]
arr = [1, 2, 1, 1, 1, 1]
arr = [1, 1, 2, 1, 1, 1]
apply then
a = arr[0,3].partition { |e| e == arr.first }.sort_by { |e| -e.size }
#=> [[1, 1], [2]]
multi = a.first.first
#=> 1
Let's compare the computational performance of the solutions that have been offered.
def spickermann1(arr)
arr.tally.find { |_, v| v == 1 }.first
end
def spickermann2(arr)
arr.tally.invert[1]
end
def spickermann3(arr)
arr.tally.min_by(&:last).first
end
def pjs(arr)
if arr[0] == arr[1]
arr.each.with_index do |x, i|
i += 1
return arr[i] if arr[i] != x
end
elsif arr[1] == arr[2]
arr[0]
else
arr[1]
end
end
I did not include @3limin4t0r's solution because of the author's admission that it is relatively inefficient. I did, however, include two variants of @spickermann's answer, one ("spickermann2") having been proposed by @3limin4t0r in a comment.
require 'benchmark'
def test(n)
puts "\nArray size = #{n}"
arr = Array.new(n-1,0) << 1
Benchmark.bm do |x|
x.report("Cary") { find_uniq(arr) }
x.report("spickermann1") { spickermann1(arr) }
x.report("spickermann2") { spickermann2(arr) }
x.report("spickermann3") { spickermann3(arr) }
x.report("PJS") { pjs(arr) }
end
end
test 100
Array size = 100
                   user     system      total        real
Cary           0.000032   0.000009   0.000041 (  0.000029)
spickermann1   0.000022   0.000015   0.000037 (  0.000019)
spickermann2   0.000017   0.000002   0.000019 (  0.000016)
spickermann3   0.000019   0.000002   0.000021 (  0.000018)
PJS            0.000042   0.000025   0.000067 (  0.000034)
test 10_000
Array size = 10_000
                   user     system      total        real
Cary           0.001101   0.000091   0.001192 (  0.001119)
spickermann1   0.000699   0.000096   0.000795 (  0.000716)
spickermann2   0.000794   0.000071   0.000865 (  0.000896)
spickermann3   0.000776   0.000081   0.000857 (  0.000781)
PJS            0.001140   0.000113   0.001253 (  0.001300)
test 1_000_000
Array size = 1_000_000
                   user     system      total        real
Cary           0.061148   0.000787   0.061935 (  0.063022)
spickermann1   0.043598   0.000474   0.044072 (  0.044590)
spickermann2   0.044909   0.000663   0.045572 (  0.046371)
spickermann3   0.042907   0.000210   0.043117 (  0.043162)
PJS            0.072766   0.000226   0.072992 (  0.073168)
I attribute the apparent superiority of @spickermann's answer to the fact that Enumerable#tally has no block to evaluate (unlike, for example, Enumerable#find in my answer).
Your code can be fixed by using find instead of each:
def find_uniq(arr)
arr.uniq.find { |e| arr.count(e) == 1 }
end
However this is quite inefficient since uniq needs to iterate the full collection. After finding the unique values the arr collection is iterated 1 or 2 more times by count (assuming there are only two unique values), depending on the position of the values in the uniq result.
For a simple solution, I suggest looking at spickermann's answer, which only iterates the full collection once.
For your specific scenario you could technically increase performance by short-circuiting the tally. This is done by manually tallying and breaking the loop if the tally contains 2 distinct values and at least 3 items are tallied.
def find_uniq(arr)
tally = Hash.new(0)
arr.each_with_index do |item, index|
break if tally.size == 2 && index >= 3
tally[item] += 1
end
tally.invert[1]
end
In this code, if the user types 2 twice and 1 twice, there are two maximum elements, and both Kinder and Twix should be printed. But how? I could probably do this with more if statements, but that would make my code even longer. Is there a neater version? Can I do this with just one if?
a = [0, 0, 0,]
b = ["Kinder", "Twix", "Mars"]
while true
input = gets.chomp.to_i
if input == 1
a[0] += 1
elsif input == 2
a[1] += 1
elsif input == 3
a[2] += 1
elsif input == 0
break
end
end
index = a.index(a.max)
chocolate = b[index] if index
print a.max,chocolate
The question really has nothing to do with how the array a is constructed.
def select_all_max(a, b)
mx = a.max
b.values_at(*a.each_index.select { |i| a[i] == mx })
end
b = ["Kinder", "Twix", "Mars"]
p select_all_max [0, 2, 1], b
["Twix"]
p select_all_max [2, 2, 1], b
["Kinder", "Twix"]
See Array#values_at.
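For readers unfamiliar with it, values_at accepts any number of indices as separate arguments, which is why the array of selected indices is splatted with * above. A tiny sketch (the indices array is just sample data):

```ruby
b = ["Kinder", "Twix", "Mars"]
indices = [0, 1]          # e.g. the positions where a[i] equals the maximum

b.values_at(*indices)     #=> ["Kinder", "Twix"]
b.values_at(0, 1)         #=> ["Kinder", "Twix"] (the same call, written out)
```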
This could alternatively be done in a single pass.
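def select_all_max(a, b)
  b.values_at(
    *(1..a.size-1).each_with_object([0]) do |i,arr|
      # compare against the current maximum value, not against its index
      case a[i] <=> a[arr.last]
      when 0
        arr << i
      when 1
        # new maximum: reset the collected indices in place
        arr.replace([i])
      end
    end
  )
end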
p select_all_max [0, 2, 1], b
["Twix"]
p select_all_max [2, 2, 1], b
["Kinder", "Twix"]
p select_all_max [1, 1, 1], b
["Kinder", "Twix", "Mars"]
One way would be as follows:
First, just separate the input-gathering from the counting, so we'll just gather input in this step:
inputs = []
loop do
input = gets.chomp.to_i
break if input.zero?
inputs << input
end
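Now we can tally up the inputs. If you have Ruby 2.7 or later you can simply do counts_by_input = inputs.tally to get { "Twix" => 2, "Kinder" => 2 } (the examples here show bar names for readability; with the to_i loop above, the keys are the integer codes 2 and 1). Otherwise, my preferred approach is to use group_by with transform_values: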
counts_by_input = inputs.group_by(&:itself).transform_values(&:count)
# => { "Twix" => 2, "Kinder" => 2 }
Now, since we're going to be extracting values based on their count, we want to have the counts as keys. Normally we might invert the hash, but that won't work in this case because it will only give us one value per key, and we need multiple:
inputs_by_count = counts_by_input.invert
# => { 2 => "Kinder" }
# This doesn't work, it removed one of the values
Instead, we can use another group_by and transform_values (the reason I like these methods is because they're very versatile ...):
inputs_by_count = counts_by_input.
group_by { |input, count| count }.
transform_values { |keyvals| keyvals.map(&:first) }
# => { 2 => ["Twix", "Kinder"] }
The transform_values code here is probably a bit confusing, but one important thing to understand is that often times, calling Enumerable methods on hashes converts them to [[key1, val1], [key2, val2]] arrays:
counts_by_input.group_by { |input, count| count }
# => { 2 => [["Twix", 2], ["Kinder", 2]] }
Which is why we call transform_values { |keyvals| keyvals.map(&:first) } afterwards to get our desired format { 2 => ["Twix", "Kinder"] }
Anyway, at this point getting our result is very easy:
inputs_by_count[inputs_by_count.keys.max]
# => ["Twix", "Kinder"]
I know this probably all seems a little insane, but when you get familiar with Enumerable methods you will be able to do this kind of data transformation pretty fluently.
Tl;dr, give me the codez
inputs = []
loop do
input = gets.chomp.to_i
break if input.zero?
inputs << input
end
inputs_by_count = inputs.
group_by(&:itself).
transform_values(&:count).
group_by { |input, count| count }.
transform_values { |keyvals| keyvals.map(&:first) }
top_count = inputs_by_count.keys.max
inputs_by_count[top_count]
# => ["Twix", "Kinder"]
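To end up with bar names rather than the numeric codes read by gets, the codes can be mapped to names before counting. A minimal sketch, assuming Ruby 2.7+ for tally and using a hard-coded inputs array as a stand-in for the input loop (the sample inputs and the bars variable are illustrative):

```ruby
bars = ['Kinder', 'Twix', 'Mars']
inputs = [2, 1, 2, 1, 3]   # stand-in for the values collected from gets

# Translate each code to its bar name, then count the names.
counts = inputs.map { |code| bars[code - 1] }.tally
top_count = counts.values.max
top_bars = counts.select { |_name, count| count == top_count }.keys
#=> ["Twix", "Kinder"]
```

This skips the second group_by by selecting the entries whose count equals the maximum, which may be easier to read when only the top group is needed.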
How about something like this:
maximum = a.max # => 2
top_selling_bars = a.map.with_index { |e, i| b[i] if e == maximum }.compact # => ['Kinder', 'Twix']
p top_selling_bars # => ['Kinder', 'Twix']
If you have
a = [2, 2, 0,]
b = ['Kinder', 'Twix', 'Mars']
You can calculate the maximum value in a via:
max = a.max #=> 2
and find all elements corresponding to that value via:
b.select.with_index { |_, i| a[i] == max }
#=> ["Kinder", "Twix"]
I need to implement a method, which works that way:
# do_magic("abcd") # "Aaaa-Bbb-Cc-D"
# do_magic("a") # "A"
# do_magic("ab") # "Aa-B"
# do_magic("teSt") # "Tttt-Eee-Ss-T"
My decision was to convert a string into an array, iterate through this array and save the result. The code works properly inside the block, but I'm unable to get the array with updated values with this solution, it returns the same string divided by a dash (for example "t-e-S-t" when ".map" used or "3-2-1-0" when ".map!" used):
def do_magic(str)
letters = str.split ''
counter = letters.length
while counter > 0
letters.map! do |letter|
(letter * counter).capitalize
counter -= 1
end
end
puts letters.join('-')
end
Where is the mistake?
You're so close. When you have a block (letters.map!), the return of that block is the last evaluated statement. In this case, counter -= 1 is being mapped into letters.
Try
l = (letter * counter).capitalize
counter -= 1
l
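Put back into the original method, the fix might look like the sketch below. It also drops the outer while loop, which is not needed once map! produces the right values, and returns the string instead of printing it; both of those are changes made here for illustration, not part of the question.

```ruby
def do_magic(str)
  letters = str.split('')
  counter = letters.length
  letters.map! do |letter|
    repeated = (letter * counter).capitalize
    counter -= 1
    repeated   # the last expression in the block is what map! stores
  end
  letters.join('-')
end

do_magic('abcd')  #=> "Aaaa-Bbb-Cc-D"
do_magic('teSt')  #=> "Tttt-Eee-Ss-T"
```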
You can try something like this using each_with_index
def do_magic(str)
letters = str.split("")
length = letters.length
new_letters = []
letters.each_with_index do |letter, i|
new_letters << (letter * (length - i)).capitalize
end
new_letters.join("-")
end
OR
using each_with_index.map, Ruby's equivalent of a map_with_index:
def do_magic(str)
letters = str.split("")
length = letters.length
letters.each_with_index.map { |letter, i|
(letter * (length - i)).capitalize
}.join("-")
end
I suggest the following.
def do_magic(letters)
length = letters.size
letters.downcase.each_char.with_index.with_object([]) { |(letter, i), new_letters|
new_letters << (letter * (length - i)).capitalize }.join
end
do_magic 'teSt'
# => "TtttEeeSsT"
Let's go through the steps.
letters = 'teSt'
length = letters.size
#=> 4
str = letters.downcase
#=> "test"
enum0 = str.each_char
#=> #<Enumerator: "test":each_char>
enum1 = enum0.with_index
#=> #<Enumerator: #<Enumerator: "test":each_char>:with_index>
enum2 = enum1.with_object([])
#=> #<Enumerator: #<Enumerator: #<Enumerator: "test":each_char>:
# with_index>:with_object([])>
Carefully examine the return values from the creation of the enumerators enum0, enum1 and enum2. The latter two may be thought of as compound enumerators.
The first element is generated by enum2 (the value of enum2.next) and the block variables are assigned values using disambiguation (aka decomposition).
(letter, i), new_letters = enum2.next
#=> [["t", 0], []]
letter
#=> "t"
i #=> 0
new_letters
#=> []
The block calculation is then performed.
m = letter * (length - i)
#=> "tttt"
n = m.capitalize
#=> "Tttt"
new_letters << n
#=> ["Tttt"]
The next element is generated by enum2, passed to the block and the block calculations are performed.
(letter, i), new_letters = enum2.next
#=> [["e", 1], ["Tttt"]]
letter
#=> "e"
i #=> 1
new_letters
#=> ["Tttt"]
Notice how new_letters has been updated. The block calculation is as follows.
m = letter * (length - i)
#=> "eee"
n = m.capitalize
#=> "Eee"
new_letters << n
#=> ["Tttt", "Eee"]
After the last two elements of enum2 are generated we have
new_letters
#=> ["Tttt", "Eee", "Ss", "T"]
The last step is to combine the elements of new_letters to form a single string.
new_letters.join
#=> "TtttEeeSsT"
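The question's expected output separates the groups with dashes; passing a separator to join produces that. A compact sketch of the same idea using map instead of with_object (the method name do_magic_dashed is hypothetical):

```ruby
def do_magic_dashed(str)
  length = str.size
  str.downcase.each_char.with_index.map { |letter, i|
    (letter * (length - i)).capitalize
  }.join('-')
end

do_magic_dashed('teSt')  #=> "Tttt-Eee-Ss-T"
do_magic_dashed('a')     #=> "A"
```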
I have an array like this ['n','n','n','s','n','s','n','s','n','s'] and I want to check if there are equal counts of characters or not. In the above one I have 6 ns and 4 ss and so they are not equal and I tried, but nothing went correct. How can I do this using Ruby?
Given array:
a = ['n','n','n','s','n','s','n','s','n','s']
Group the array by its elements and take only the values of the groups:
(f,s) = a.group_by{|e| e}.values
Compare sizes:
f.size == s.size
Result: false
Or you can try this:
x = ['n','n','n','s','n','s','n','s','n','s']
x.group_by {|c| c}.values.map(&:size).inject(:==)
You can go for something like this:
def eq_num? arr
return false if arr.size == 1
arr.uniq.map {|i| arr.count(i)}.uniq.size == 1
end
arr = ['n','n','n','s','n','s','n','s','n','s']
eq_num? arr #=> false
arr = ['n','n','n','s','n','s','s','s']
eq_num? arr #=> true
Works for more than two kinds of letters too:
arr = ['n','n','t','s','n','t','s','s','t']
eq_num? arr #=> true
Using Array#count is relatively inefficient as it requires a full pass through the array for each element whose instances are being counted. Instead use Enumerable#group_by, as others have done, or use a counting hash, as below (see Hash::new):
Code
def equal_counts?(arr)
arr.each_with_object(Hash.new(0)) { |s,h| h[s] += 1 }.values.uniq.size == 1
end
Examples
equal_counts? ['n','n','n','s','n','s','n','s','n','s']
#=> false
equal_counts? ['n','r','r','n','s','s','n','s','r']
#=> true
Explanation
For
arr = ['n','n','n','s','n','s','n','s','n','s']
the steps are as follows.
h = arr.each_with_object(Hash.new(0)) { |s,h| h[s] += 1 }
#=> {"n"=>6, "s"=>4}
a = h.values
#=> [6, 4]
b = a.uniq
#=> [6, 4]
b.size == 1
#=> false
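On Ruby 2.7 or later, Enumerable#tally builds the same counting hash in one call. A sketch of that variant (the method name equal_counts_tally? is made up here):

```ruby
def equal_counts_tally?(arr)
  # tally returns e.g. {"n"=>6, "s"=>4}; equal counts leave one distinct value
  arr.tally.values.uniq.size == 1
end

equal_counts_tally? ['n','n','n','s','n','s','n','s','n','s']  #=> false
equal_counts_tally? ['n','r','r','n','s','s','n','s','r']      #=> true
```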