How to get all the differing parts of two arrays? - arrays

Suppose I have a sentence:
When Grazia Deledda submitted a short story to a fashion magazine at the age of 13
This sentence is split in two different ways, giving two lists:
# list 1
[When] [Grazia Deledda] [submitted a short story] [to] [a] [fashion magazine] [at] [the age of] [13]
# list 2
[When] [Grazia Deledda] [submitted] [a short story] [to] [a fashion] [magazine at] [the age of] [13]
Now I want to get the parts where the two arrays differ. For this example the result should be:
[
([[submitted a short story]],[[submitted] [a short story]]),
([[a] [fashion magazine] [at]], [[a fashion] [magazine at]])
]
So the result should meet these requirements:
Both halves of every pair should have the same content. For example, [[submitted a short story]] can be joined into 'submitted a short story', and [[submitted] [a short story]] can also be joined into 'submitted a short story'.
Both halves of every pair should have the same start and end positions. For example, [[submitted a short story]] starts at 3 and ends at 6, and so does [[submitted] [a short story]].
Most importantly, every pair should be as short as possible. For example, [[submitted a short story] [to]] and [[submitted] [a short story] [to]] also meet the first two requirements, but they are not the shortest.
Is there any way to avoid O(n^2) complexity?

I may have started in the wrong direction. The problem turns out to be easy, and I have come up with a good idea:
#!/usr/bin/env python
# encoding: utf-8
# list 1
llist = [["When"], ["Grazia", "Deledda"], ["submitted", "a", "short", "story"], ["to"], ["a"], ["fashion", "magazine"], ["at"], ["the", "age", "of",], ["13"],]
# list 2
rlist = [["When"], ["Grazia", "Deledda"], ["submitted"], ["a", "short", "story"], ["to"], ["a", "fashion"], ["magazine", "at"], ["the", "age", "of",], ["13"],]
# number of words consumed so far on each side; both start at 0
loffset = 0
roffset = 0
rindex = 0
lstart = -1
rstart = -1
for lindex, litem in enumerate(llist):
    # both sides are aligned here; a differing item starts a new pair
    if loffset == roffset and litem != rlist[rindex]:
        lstart = lindex
        rstart = rindex
    loffset += len(litem)
    # let the right side catch up with the left side
    while roffset < loffset:
        roffset += len(rlist[rindex])
        rindex += 1
    # aligned again: emit the pending pair, if any
    if loffset == roffset and lstart >= 0:
        print(llist[lstart:lindex+1], rlist[rstart:rindex])
        lstart = -1
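The same two-pointer idea can also be packaged as a function that returns the pairs instead of printing them. This is only a sketch, assuming both lists cover exactly the same words in the same order; diff_segments and its variable names are illustrative and not part of the original answer. Each segment is visited once, so the running time is linear in the number of segments.

```python
def diff_segments(left, right):
    """Return the shortest (left_run, right_run) pairs where the splits differ.

    Assumption: left and right are lists of word lists covering the same words.
    """
    pairs = []
    i = j = 0          # next unread segment on each side
    loff = roff = 0    # number of words consumed on each side
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            # Identical segments: consume both and stay aligned.
            loff += len(left[i])
            roff += len(right[j])
            i += 1
            j += 1
            continue
        # Segments differ: remember where the run starts, then extend
        # whichever side is behind until both end on the same word boundary.
        li, rj = i, j
        loff += len(left[i])
        i += 1
        roff += len(right[j])
        j += 1
        while loff != roff:
            if loff < roff:
                loff += len(left[i])
                i += 1
            else:
                roff += len(right[j])
                j += 1
        pairs.append((left[li:i], right[rj:j]))
    return pairs


llist = [["When"], ["Grazia", "Deledda"], ["submitted", "a", "short", "story"],
         ["to"], ["a"], ["fashion", "magazine"], ["at"], ["the", "age", "of"], ["13"]]
rlist = [["When"], ["Grazia", "Deledda"], ["submitted"], ["a", "short", "story"],
         ["to"], ["a", "fashion"], ["magazine", "at"], ["the", "age", "of"], ["13"]]
result = diff_segments(llist, rlist)
for left_run, right_run in result:
    print(left_run, right_run)
```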

I tokenize all the words and store the resulting sequences as a list of lists. Then I compare the first list against the second, building string buffers, and record a match when the strings are equal but the segment counts differ. At the end I remove the duplicate index values from out1 and out2.
from keras.preprocessing.text import Tokenizer
tokenizer=Tokenizer()
# list 1
list1 = [["When"], ["Grazia Deledda"], ["submitted a short story"], ["to"],
         ["a"], ["fashion magazine"], ["at"], ["the age of"], ["13"],["EOS"]]
# list 2
list2 = [["When"], ["Grazia Deledda"], ["submitted"], ["a short story"], ["to"],
         ["a fashion"], ["magazine at"], ["the age of"], ["13"],["EOS"]]
tokenizer.fit_on_texts([" ".join(item) for item in list1])
tokenizer.fit_on_texts([" ".join(item) for item in list2])
seq1=[]
seq2=[]
for item1,item2 in zip(list1,list2):
    seq1.append(tokenizer.texts_to_sequences(item1))
    seq2.append(tokenizer.texts_to_sequences(item2))
out1=[]
out2=[]
out1_buffer=[]
out2_buffer=[]
current_index=0
string1=""
for seq1_index in range(len(seq1)-1):
    string1=""
    index=0
    out1_buffer=[]
    found=False
    # accumulate seq1 strings until a match is found or the end of the queue is detected; 16 maps to EOS
    while seq1[seq1_index+index][0] != [16] and found==False:
        out1_buffer.append(seq1_index+index)
        seq_string=" ".join([str(token) for token in seq1[seq1_index+index][0]])
        if string1=="":
            string1=seq_string
        else:
            string1+=" "+seq_string
        string2=""
        out2_buffer=[]
        for seq2_index in range(current_index,len(seq2)-1):
            seq_string=" ".join([str(token) for token in seq2[seq2_index][0]])
            if string2=="":
                string2=seq_string
            else:
                string2+=" "+seq_string
            out2_buffer.append(seq2_index)
            count_seq1=len(out1_buffer)
            count_seq2=len(out2_buffer)
            if string1==string2 and count_seq1!=count_seq2:
                print("string_a", [list1[int(index)] for index in out1_buffer])
                print("string_b",[list2[int(index)] for index in out2_buffer])
                current_index=seq2_index+1
                print("match",count_seq1,count_seq2)
                for index1 in out1_buffer:
                    out1.append(index1)
                for index2 in out2_buffer:
                    out2.append(index2)
                out1_buffer=[]
                out2_buffer=[]
                found=True
                break
        index+=1
tuple1=[]
tuple2=[]
result1=[]
for item1 in out1:
    found=False
    for item2 in out2:
        if list1[item1]==list2[item2]:
            found=True
            break
    if found==True:
        out2 = list(filter(lambda item2: list1[item1]!=list2[item2],out2))
    if found==False:
        result1.append(item1)
for item1 in result1:
    tuple1.append(list1[item1])
for item2 in out2:
    tuple2.append(list2[item2])
tuple1=tuple(tuple1)
tuple2=tuple(tuple2)
print("{}\n{}\n".format(tuple1,tuple2))
output
(['submitted a short story'], ['a'], ['fashion magazine'], ['at'])
(['submitted'], ['a short story'], ['a fashion'], ['magazine at'])

Related

ruby - How to make an array of arrays of letters (a-z) of varying lengths with maximum length five

So I'm trying to make an array of all possible permutations of the alphabet letters (all lowercase), in which the letters can repeat and vary in length from 1 to 5. So for example these are some possibilities that would be in the array:
['this','is','some','examp','le']
I tried this, and it gets all the variations of words 5 letters long, but I don't know how to get the shorter lengths as well.
("a".."z").to_a.repeated_permutation(5).map(&:join)
EDIT:
I'm trying to do this in order to crack a SHA1-hashed string:
require 'digest'
def decrypt_string(hash)
  ("a".."z").to_a.repeated_permutation(5).map(&:join).find {|elem| Digest::SHA1.hexdigest(elem) == hash}
end
Hash being the SHA1 digest of the word, such as 'e6fb06210fafc02fd7479ddbed2d042cc3a5155e'
You can modify your method slightly.
require 'digest'
def decrypt_string(hash)
  arr = ("a".."z").to_a
  (1..5).each do |n|
    arr.repeated_permutation(n) do |a|
      s = a.join
      return s if Digest::SHA1.hexdigest(s) == hash
    end
  end
end
word = "cat"
hash = Digest::SHA1.hexdigest(word)
#=> "9d989e8d27dc9e0ec3389fc855f142c3d40f0c50"
decrypt_string(hash)
#=> "cat"
word = "zebra"
hash = Digest::SHA1.hexdigest(word)
#=> "38aa53de31c04bcfae9163cc23b7963ed9cf90f7"
decrypt_string(hash)
#=> "zebra"
Calculations for "cat" took well under one second on my 2020 Macbook Pro; those for "zebra" took about 15 seconds.
Note that join should be applied within repeated_permutation's block, as repeated_permutation(n).map(&:join) would create a temporary array having as many as 26**5 #=> 11,881,376 elements (for n = 5).
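For comparison, the same lazy search can be sketched in Python. This is an illustration only, not part of the original answer: find_preimage and max_len are hypothetical names, and the sketch assumes the target word consists solely of lowercase ASCII letters. itertools.product yields candidates one at a time, so no large temporary list is built.

```python
import hashlib
from itertools import product
from string import ascii_lowercase


def find_preimage(target_hex, max_len=5):
    """Return the first lowercase word of length 1..max_len whose SHA-1 matches."""
    for n in range(1, max_len + 1):
        # product() is lazy: candidates are generated on demand, never stored.
        for letters in product(ascii_lowercase, repeat=n):
            word = "".join(letters)
            if hashlib.sha1(word.encode("ascii")).hexdigest() == target_hex:
                return word
    return None  # nothing of length <= max_len matched


target = hashlib.sha1(b"cat").hexdigest()
print(find_preimage(target))
```

Shorter words are tried first, so short targets return quickly while five-letter targets may require scanning most of the search space.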
If you do not mind the possibility of repeating strings then
e = Enumerator.new do |y|
  r = ('a'..'z').to_a * 5
  loop do
    y << r.shuffle.take(rand(1..5)).join
  end
end
Should work. Then you can call as
e.take(10)
#=> ["bz", "tnld", "jv", "s", "ngrm", "phiy", "ar", "zq", "ajjn", "cn"]
This:
Creates an Array of a through z repeated 5 times
Continually shuffles said Array
Then takes the first 1 to 5 ("random number") elements from the shuffled Array and joins them together

How to compress a string in-place

I have been working on this problem on leetcode https://leetcode.com/problems/string-compression/
Given an array of characters, compress it in-place.
The length after compression must always be smaller than or equal to the original array.
Every element of the array should be a character (not int) of length 1.
After you are done modifying the input array in-place, return the new length of the array.
I almost have a solution, but I can't seem to count the last character in the string, and I am also not sure how to omit the count when a character occurs only once (so that no 1 appears in the array).
I feel like I'm pretty close and I'd like to try and keep the solution that I have without altering it too much if possible.
This is what I have so far. chars is a list of characters
def compress(chars):
    char = 0
    curr = 0
    count = 0
    while curr < len(chars):
        if chars[char] == chars[curr]:
            count += 1
        else:
            # if count == 1:
            #     break
            # else:
            chars[char-1] = count
            char = curr
            count = 0
        curr += 1
    chars[char-1] += 1
    return chars
print(compress(["a", "a", "b", "b", "c", "c", "c"]))
I wasn't quite able to adapt your code to get the answer you were seeking. Based on your question, I put together code and an explanation that may help you:
def compress(chars):
    count = 1
    current_position = 0
    # if it's a single character, just return
    # a basic array with a count of 1
    if len(chars) == 1:
        chars.append("1")
        return chars
    # loop until the second-to-last character has been analyzed
    while current_position < len(chars) - 1:
        # while we haven't reached the last character and the next
        # character is the same as the current one, delete
        # the current one and increase our count
        while current_position < len(chars) - 1 and \
                chars[current_position] == chars[current_position + 1]:
            del chars[current_position]
            count += 1
        # the next character isn't the same, so it's time to add the count
        # to the list. Split the number into a character list
        # (e.g. 12 becomes ["1", "2"]), insert those digits
        # after the character and increment the position
        for x in str(count):
            chars.insert(current_position + 1, str(x))
            current_position += 1
        # great, on to the next character
        current_position += 1
        # if we are now at the last character, it's a lonely character:
        # give it a count of 1 and exit the loop
        if current_position == len(chars) - 1:
            chars.append("1")
            break
        count = 1
    return chars
mylist = ["a","b","b","b","b","b","b","b","b","b","b","b","b"]
print(compress(mylist))
Results
mylist = ["a","b","b","b","b","b","b","b","b","b","b","b","b"]
['a', '1', 'b', '1', '2']
mylist = ["a","a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","b","b","b","b","b","b"]
['a', '1', '0', 'b', '1', '2']
mylist = ["a"]
['a', '1']
mylist = ["a","b"]
['a', '1', 'b', '1']
mylist = ["a","a","b","b","c","c","c"]
['a', '2', 'b', '2', 'c', '3']
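The answer above always writes a count, including 1, and returns the list. The question asks for the count to be omitted for single characters and for the new length to be returned. A possible sketch of that variant uses separate read and write indices instead of del/insert; this is a different technique from the answer above, and compress_in_place is an illustrative name.

```python
def compress_in_place(chars):
    """Compress runs of equal characters in place and return the new length.

    A run of length 1 is written as the character alone; longer runs are
    written as the character followed by the digits of the run length.
    """
    write = 0   # next index to write compressed output to
    read = 0    # start index of the current run
    n = len(chars)
    while read < n:
        # Find the end of the run that starts at `read`.
        run_end = read
        while run_end < n and chars[run_end] == chars[read]:
            run_end += 1
        run_len = run_end - read
        chars[write] = chars[read]
        write += 1
        if run_len > 1:
            for digit in str(run_len):
                chars[write] = digit
                write += 1
        read = run_end
    return write


chars = ["a", "a", "b", "b", "c", "c", "c"]
length = compress_in_place(chars)
print(length, chars[:length])
```

The compressed form of a run is never longer than the run itself, so the write index cannot overtake the read index and no unread character is overwritten.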

Append strings to array if found in paragraph using `.match` in Ruby

I'm attempting to search a paragraph for each word in an array, and then output a new array with only the words that could be found.
But I've been unable to get the desired output format so far.
paragraph = "Japan is a stratovolcanic archipelago of 6,852 islands.
The four largest are Honshu, Hokkaido, Kyushu and Shikoku, which make up about ninety-seven percent of Japan's land area.
The country is divided into 47 prefectures in eight regions."
words_to_find = %w[ Japan archipelago fishing country ]
words_found = []
words_to_find.each do |w|
  paragraph.match(/#{w}/) ? words_found << w : nil
end
puts words_found
Currently the output I'm getting is a vertical list of printed words.
Japan
archipelago
country
But I would like something like, ['Japan', 'archipelago', 'country'].
I don't have much experience matching text in a paragraph and am not sure what I'm doing wrong here. Could anyone give some guidance?
This is because you are using puts to print the array: puts prints each element on its own line, appending "\n" to every word. Use print or p to print the array itself:
#!/usr/bin/env ruby
def run_me
  paragraph = "Japan is a stratovolcanic archipelago of 6,852 islands.
the four largest are Honshu, Hokkaido, Kyushu and Shikoku, which make up about ninety-seven percent of Japan's land area.
the country is divided into 47 prefectures in eight regions."
  words_to_find = %w[ Japan archipelago fishing country ]
  find_words_from_a_text_file paragraph, words_to_find
end
def find_words_from_a_text_file(paragraph, words_to_find)
  words_found = []
  words_to_find.each do |w|
    paragraph.match(/#{w}/) ? words_found << w : nil
  end
  # print each element with each and puts
  words_found.each { |x| puts "with enum and puts : : #{x}" }
  # or just use print, which does not add a new line
  print "with print :"; print words_found, "\n"
  # or with p
  p words_found
end
run_me
outputs :
za:ruby_dir za$ ./fooscript.rb
with enum and puts : : Japan
with enum and puts : : archipelago
with enum and puts : : country
with print :["Japan", "archipelago", "country"]
["Japan", "archipelago", "country"]
Here are a couple of ways to do that. Both are case-indifferent.
Use a regular expression
r = /
    \b                                # Match a word break
    #{ Regexp.union(words_to_find) }  # Match any word in words_to_find
    \b                                # Match a word break
    /xi                               # Free-spacing regex definition mode (x)
                                      # and case-indifferent (i)
#=> /
#   \b                                         # Match a word break
#   (?-mix:Japan|archipelago|fishing|country)  # Match any word in words_to_find
#   \b                                         # Match a word break
#   /ix                                        # Free-spacing regex definition mode (x)
#                                              # and case-indifferent (i)
paragraph.scan(r).uniq(&:itself)
#=> ["Japan", "archipelago", "country"]
Intersect two arrays
words_to_find_hash = words_to_find.each_with_object({}) { |w,h| h[w.downcase] = w }
  #=> {"japan"=>"Japan", "archipelago"=>"archipelago", "fishing"=>"fishing",
  #    "country"=>"country"}
words_to_find_hash.values_at(*paragraph.delete(".;:,?'").
                                        downcase.
                                        split.
                                        uniq & words_to_find_hash.keys)
  #=> ["Japan", "archipelago", "country"]
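The whole-word, case-insensitive matching idea can also be sketched in Python. This is an illustration under assumptions, not a translation of the answers above: words_found is a hypothetical helper, and it keeps the order and spelling of words_to_find rather than of the paragraph.

```python
import re


def words_found(paragraph, words_to_find):
    """Return the words from words_to_find that occur as whole words in paragraph."""
    found = []
    for word in words_to_find:
        # re.escape guards against regex metacharacters in the word;
        # \b anchors the match to word boundaries on both sides.
        pattern = r"\b" + re.escape(word) + r"\b"
        if re.search(pattern, paragraph, flags=re.IGNORECASE):
            found.append(word)
    return found


paragraph = (
    "Japan is a stratovolcanic archipelago of 6,852 islands. "
    "The four largest are Honshu, Hokkaido, Kyushu and Shikoku, which make up "
    "about ninety-seven percent of Japan's land area. "
    "The country is divided into 47 prefectures in eight regions."
)
matches = words_found(paragraph, ["Japan", "archipelago", "fishing", "country"])
print(matches)
```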

Dynamically deleting elements from an array while enumerating through it

I am going through my system dictionary and looking for words that are, according to a strict definition, neither subsets nor supersets of any other word.
The implementation below does not work, but if it did, it would be pretty efficient, I think. How do I iterate through the array and also remove items from that same array during iteration?
def collect_dead_words
  result = @file # the words in my system dictionary, as an array
  wg = WordGame.new # the class that "knows" the find_subset_words &
                    # find_superset_words methods
  result.each do |value|
    wg.word = value
    supersets = wg.find_superset_words.values.flatten
    subsets = wg.find_subset_words.values.flatten
    result.delete(value) unless (supersets.empty? && subsets.empty?)
    result.reject! { |cand| supersets.include? cand }
    result.reject! { |cand| subsets.include? cand }
  end
  result
end
Note: find_superset_words and find_subset_words both return hashes, hence the values.flatten bit
It is inadvisable to modify a collection while iterating over it. Instead, either iterate over a copy of the collection, or create a separate array of things to remove later.
One way to accomplish this is with Array#delete_if. Here's my run at it so you get the idea:
supersets_and_subsets = []
result.delete_if do |el|
  wg.word = el
  superset_and_subset = wg.find_superset_words.values.flatten + wg.find_subset_words.values.flatten
  supersets_and_subsets << superset_and_subset
  !superset_and_subset.empty?
end
result -= supersets_and_subsets.flatten.uniq
Here's what I came up with based on your feedback (plus a further optimization by starting with the shortest words):
def collect_dead_words
  result = []
  collection = @file
  num = @file.max_by(&:length).length
  1.upto(num) do |index|
    subset_by_length = collection.select {|word| word.length == index }
    while !subset_by_length.empty? do
      wg = WordGame.new(subset_by_length[0])
      supermatches = wg.find_superset_words.values.flatten
      submatches = wg.find_subset_words.values.flatten
      collection.reject! { |cand| supermatches.include? cand }
      collection.reject! { |cand| submatches.include? cand }
      result << wg.word if (supermatches.empty? && submatches.empty?)
      # remove the current word from the collection first, then from the work list
      collection.delete(subset_by_length[0])
      subset_by_length.delete(subset_by_length[0])
    end
  end
  result
end
Further optimizations are welcome!
The problem
As I understand, string s1 is a subset of string s2 if s1 == s2 after zero or more characters are removed from s2; that is, if there exists a mapping m of the indices of s1 to the indices of s2 such that (see footnote 1):
for each index i of s1, s1[i] = s2[m(i)]; and
if i < j then m(i) < m(j).
Further s2 is a superset of s1 if and only if s1 is a subset of s2.
Note that for s1 to be a subset of s2, s1.size <= s2.size must be true.
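The definition above amounts to a subsequence test. A minimal Python sketch of it follows, assuming plain strings and ignoring the extra substring restriction mentioned in footnote 1; the name is_subset is illustrative and not taken from the answer.

```python
def is_subset(s1, s2):
    """True if s1 can be obtained from s2 by deleting zero or more characters."""
    remaining = iter(s2)
    # `ch in remaining` consumes the iterator up to and including the first
    # match, so each character of s1 must be found after the previous one.
    # That enforces m(i) < m(j) whenever i < j.
    return all(ch in remaining for ch in s1)


print(is_subset("cat", "craft"))   # c, a, t appear in order in "craft"
print(is_subset("cat", "cutie"))   # "cutie" has no "a"
print(is_subset("cat", "enact"))   # "a" comes before "c" in "enact"
```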
For example:
"cat" is a subset of "craft" because the latter becomes "cat" if the "r" and "f" are removed.
"cat" is not a subset of "cutie" because "cutie" has no "a".
"cat" is not a superset of "at" because "cat".include?("at") #=> true (by the OP's additional rule; see footnote 1).
"cat" is not a subset of "enact" because m(0) = 3 and m(1) = 2, but m(0) < m(1) is false.
Algorithm
Subset (and hence superset) is a transitive relation, which permits significant algorithmic efficiencies. By this I mean that if s1 is a subset of s2 and s2 is a subset of s3, then s1 is a subset of s3.
I will proceed as follows:
Create empty sets neither_sub_nor_sup and longest_sups and an empty array subs_and_sups.
Sort the words in the dictionary by length, longest first.
Add w to neither_sub_nor_sup, where w is the longest word in the dictionary.
For each subsequent word w in the dictionary (longest to shortest), perform the following operations:
for each element u of neither_sub_nor_sup, determine if w is a subset of u. If it is, move u from neither_sub_nor_sup to longest_sups and append u to subs_and_sups.
if one or more elements were moved from neither_sub_nor_sup to longest_sups, append w to subs_and_sups; else add w to neither_sub_nor_sup.
Return subs_and_sups.
Code
require 'set'
def identify_subs_and_sups(dict)
  neither_sub_nor_sup, longest_sups = Set.new, Set.new
  dict.sort_by(&:size).reverse.each_with_object([]) do |w,subs_and_sups|
    switchers = neither_sub_nor_sup.each_with_object([]) { |u,arr|
      arr << u if w.subset(u) }
    if switchers.any?
      subs_and_sups << w
      switchers.each do |u|
        neither_sub_nor_sup.delete(u)
        longest_sups << u
        subs_and_sups << u
      end
    else
      neither_sub_nor_sup << w
    end
  end
end
class String
  def subset(w)
    w =~ Regexp.new(self.gsub(/./) { |m| "#{m}\\w*" })
  end
end
Example
dict = %w| cat catch craft cutie enact trivial rivert river |
#=> ["cat", "catch", "craft", "cutie", "enact", "trivial", "rivert", "river"]
identify_subs_and_sups(dict)
#=> ["river", "rivert", "cat", "catch", "craft"]
Variant
Rather than processing the words in the dictionary from longest to shortest, we could instead order them shortest to longest:
def identify_subs_and_sups1(dict)
  neither_sub_nor_sup, shortest_sups = Set.new, Set.new
  dict.sort_by(&:size).each_with_object([]) do |w,subs_and_sups|
    switchers = neither_sub_nor_sup.each_with_object([]) { |u,arr|
      arr << u if u.subset(w) }
    if switchers.any?
      subs_and_sups << w
      switchers.each do |u|
        neither_sub_nor_sup.delete(u)
        shortest_sups << u
        subs_and_sups << u
      end
    else
      neither_sub_nor_sup << w
    end
  end
end
identify_subs_and_sups1(dict)
#=> ["craft", "cat", "rivert", "river"]
Benchmarks
(to be continued...)
1 The OP stated (in a later comment) that s1 is not a subset of s2 if s2.include?(s1) #=> true. I am going to pretend I never saw that, as it throws a spanner into the works. Unfortunately, subset is no longer a transitive relation with that additional requirement. I haven't investigated the implications of that, but I suspect it means a rather brutish algorithm would be required, possibly requiring pairwise comparisons of all the words in the dictionary.

Counting matching elements in an array

Given two arrays of equal size, how can I find the number of matching elements disregarding the position?
For example:
[0,0,5] and [0,5,5] would return a match of 2 since there is one 0 and one 5 in common;
[1,0,0,3] and [0,0,1,4] would return a match of 3 since there are two matches of 0 and one match of 1;
[1,2,2,3] and [1,2,3,4] would return a match of 3.
I tried a number of ideas, but they all tend to get rather gnarly and convoluted. I'm guessing there is some nice Ruby idiom, or perhaps a regex, that would be an elegant solution to this problem.
You can accomplish it with count:
a.count{|e| index = b.index(e) and b.delete_at index }
or with inject:
a.inject(0){|count, e| count + ((index = b.index(e) and b.delete_at index) ? 1 : 0)}
or with select and length (or its alias, size):
a.select{|e| (index = b.index(e) and b.delete_at index)}.size
Results:
a, b = [0,0,5], [0,5,5] output: => 2;
a, b = [1,2,2,3], [1,2,3,4] output: => 3;
a, b = [1,0,0,3], [0,0,1,4] output: => 3.
(arr1 & arr2).map { |i| [arr1.count(i), arr2.count(i)].min }.inject(0, &:+)
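Here (arr1 & arr2) returns the list of unique values that both arrays contain, and arr.count(i) counts the number of occurrences of i in the array.
Another use for the mighty (and much needed) Array#difference, which I defined in my answer here. This method is similar to Array#-, except that it removes only one occurrence for each element of its argument rather than all of them. (It is not the built-in Array#difference added in Ruby 2.6, which behaves like Array#-.) The difference between the two methods is illustrated in the following example: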
a = [1,2,3,4,3,2,4,2]
b = [2,3,4,4,4]
a - b #=> [1]
a.difference b #=> [1, 3, 2, 2]
For the present application:
def number_matches(a,b)
  left_in_b = b
  a.reduce(0) do |t,e|
    if left_in_b.include?(e)
      left_in_b = left_in_b.difference [e]
      t+1
    else
      t
    end
  end
end
number_matches [0,0,5], [0,5,5] #=> 2
number_matches [1,0,0,3], [0,0,1,4] #=> 3
number_matches [1,2,2,3], [1,2,3,4] #=> 3
Using the multiset gem:
(Multiset.new(a) & Multiset.new(b)).size
Multiset is like Set, but allows duplicate values. & is the "set intersection" operator (return all things that are in both sets).
I don't think this is an ideal answer, because it's a bit complex, but...
def count(arr)
  arr.each_with_object(Hash.new(0)) { |e,h| h[e] += 1 }
end
def matches(a1, a2)
  m = 0
  a1_counts = count(a1)
  a2_counts = count(a2)
  a1_counts.each do |e, c|
    # smallest number of times e appears in either array (0 if absent from a2)
    m += [c, a2_counts[e]].min
  end
  m
end
Basically, first write a method that creates a hash from an array of the number of times each element appears. Then, use those to sum up the smallest number of times each element appears in both arrays.
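The counting approach described above can be sketched compactly in Python, where collections.Counter plays the role of the count hash and its & operator keeps the per-element minimum. This is an illustration only; count_matches is a hypothetical name, not part of any answer.

```python
from collections import Counter


def count_matches(a, b):
    """Number of elements a and b have in common, counting duplicates
    and ignoring position."""
    # Counter & Counter keeps, for each element, the smaller of its two counts.
    common = Counter(a) & Counter(b)
    return sum(common.values())


print(count_matches([0, 0, 5], [0, 5, 5]))
print(count_matches([1, 0, 0, 3], [0, 0, 1, 4]))
print(count_matches([1, 2, 2, 3], [1, 2, 3, 4]))
```

Unlike the index/delete_at variants, this does not modify either input and runs in time linear in the combined length of the two arrays.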
