Suppose I have a sentence:
When Grazia Deledda submitted a short story to a fashion magazine at the age of 13
This sentence is split into two lists:
# list 1
[When] [Grazia Deledda] [submitted a short story] [to] [a] [fashion magazine] [at] [the age of] [13]
# list 2
[When] [Grazia Deledda] [submitted] [a short story] [to] [a fashion] [magazine at] [the age of] [13]
Now I want to find the parts where the two lists differ. For this example, the result should be:
[
([[submitted a short story]],[[submitted] [a short story]]),
([[a] [fashion magazine] [at]], [[a fashion] [magazine at]])
]
So the result should meet these requirements:
Every pair should have the same content. For example, [[submitted a short story]] can be joined into 'submitted a short story', and [[submitted] [a short story]] can also be joined into 'submitted a short story'.
Every pair should have the same start and end positions. For example, [[submitted a short story]] starts at 3 and ends at 6, and so does [[submitted] [a short story]].
Most importantly, every pair should be as short as possible. For example, [[submitted a short story] [to]] and [[submitted] [a short story] [to]] also meet the first two requirements, but they are not the shortest.
Is there any way to avoid O(n^2) complexity?
I may have started in the wrong direction; the problem turns out to be easy. I have come up with a good approach:
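#!/usr/bin/env python
# encoding: utf-8
# list 1
llist = [["When"], ["Grazia", "Deledda"], ["submitted", "a", "short", "story"], ["to"], ["a"], ["fashion", "magazine"], ["at"], ["the", "age", "of",], ["13"],]
# list 2
rlist = [["When"], ["Grazia", "Deledda"], ["submitted"], ["a", "short", "story"], ["to"], ["a", "fashion"], ["magazine", "at"], ["the", "age", "of",], ["13"],]
loffset = 0   # number of words consumed from llist so far
roffset = 0   # number of words consumed from rlist so far
rindex = 0
lstart = -1   # start of the open differing run in llist (-1: none open)
rstart = -1
for lindex, litem in enumerate(llist):
    # Both lists sit on a shared boundary and the next segments differ:
    # a new differing run starts here.
    if loffset == roffset and litem != rlist[rindex]:
        lstart = lindex
        rstart = rindex
    loffset += len(litem)
    # Advance rlist until it has consumed at least as many words as llist.
    while roffset < loffset:
        roffset += len(rlist[rindex])
        rindex += 1
    # Back on a shared boundary: close the open run, if any.
    if loffset == roffset and lstart >= 0:
        print(llist[lstart:lindex+1], rlist[rstart:rindex])
        lstart = -1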
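The same linear scan can also be phrased in terms of segment end offsets: a differing pair is exactly the stretch between two consecutive boundaries that both segmentations share. The following is a minimal sketch of that idea, not the original poster's code; the function name `diff_segmentations` is hypothetical, and it assumes both lists cover the same word sequence.

```python
from itertools import accumulate

def diff_segmentations(left, right):
    """Return the shortest (left_chunk, right_chunk) pairs that are grouped differently.

    Assumes both inputs are lists of word lists covering the same words.
    """
    # End offset (in words) of every segment, e.g. [1, 3, 7, ...].
    left_ends = list(accumulate(len(seg) for seg in left))
    right_ends = list(accumulate(len(seg) for seg in right))

    pairs = []
    i = j = 0            # current segment in each list
    lstart = rstart = 0  # first segment after the last shared boundary
    while i < len(left) and j < len(right):
        if left_ends[i] == right_ends[j]:
            # Shared boundary: the region since the previous one is minimal.
            lchunk = left[lstart:i + 1]
            rchunk = right[rstart:j + 1]
            if lchunk != rchunk:
                pairs.append((lchunk, rchunk))
            i += 1
            j += 1
            lstart, rstart = i, j
        elif left_ends[i] < right_ends[j]:
            i += 1   # left is behind, consume another left segment
        else:
            j += 1   # right is behind, consume another right segment
    return pairs
```

Each segment is visited once, so the running time is linear in the total number of segments.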
I tokenize all the words and pad them into sequences as a list of lists. Then I compare the first list against the second, building string buffers, and record a match when the strings are equal but the segment counts differ. At the end I remove the duplicate index values from out1 and out2.
from keras.preprocessing.text import Tokenizer
tokenizer=Tokenizer()
# list 1
list1 = [["When"], ["Grazia Deledda"], ["submitted a short story"], ["to"],
         ["a"], ["fashion magazine"], ["at"], ["the age of"], ["13"],["EOS"]]
# list 2
list2 = [["When"], ["Grazia Deledda"], ["submitted"], ["a short story"], ["to"],
         ["a fashion"], ["magazine at"], ["the age of"], ["13"],["EOS"]]
tokenizer.fit_on_texts([" ".join(item) for item in list1])
tokenizer.fit_on_texts([" ".join(item) for item in list2])
seq1=[]
seq2=[]
for item1,item2 in zip(list1,list2):
    seq1.append(tokenizer.texts_to_sequences(item1))
    seq2.append(tokenizer.texts_to_sequences(item2))
out1=[]
out2=[]
out1_buffer=[]
out2_buffer=[]
current_index=0
string1=""
for seq1_index in range(len(seq1)-1):
    string1=""
    index=0
    out1_buffer=[]
    found=False
    # accumulate seq1 strings until a match is found or the end of the queue is detected (16 maps to EOS)
    while seq1[seq1_index+index][0] != [16] and found==False:
        out1_buffer.append(seq1_index+index)
        seq_string=" ".join([str(token) for token in seq1[seq1_index+index][0]])
        if string1=="":
            string1=seq_string
        else:
            string1+=" "+seq_string
        string2=""
        out2_buffer=[]
        for seq2_index in range(current_index,len(seq2)-1):
            seq_string=" ".join([str(token) for token in seq2[seq2_index][0]])
            if string2=="":
                string2=seq_string
            else:
                string2+=" "+seq_string
            out2_buffer.append(seq2_index)
            count_seq1=len(out1_buffer)
            count_seq2=len(out2_buffer)
            if string1==string2 and count_seq1!=count_seq2:
                print("string_a", [list1[int(index)] for index in out1_buffer])
                print("string_b",[list2[int(index)] for index in out2_buffer])
                current_index=seq2_index+1
                print("match",count_seq1,count_seq2)
                for index1 in out1_buffer:
                    out1.append(index1)
                for index2 in out2_buffer:
                    out2.append(index2)
                out1_buffer=[]
                out2_buffer=[]
                found=True
                break
        index+=1
tuple1=[]
tuple2=[]
result1=[]
for item1 in out1:
    found=False
    for item2 in out2:
        if list1[item1]==list2[item2]:
            found=True
            break
    if found==True:
        out2 = list(filter(lambda item2: list1[item1]!=list2[item2],out2))
    if found==False:
        result1.append(item1)
for item1 in result1:
    tuple1.append(list1[item1])
for item2 in out2:
    tuple2.append(list2[item2])
tuple1=tuple(tuple1)
tuple2=tuple(tuple2)
print("{}\n{}\n".format(tuple1,tuple2))
Output:
(['submitted a short story'], ['a'], ['fashion magazine'], ['at'])
(['submitted'], ['a short story'], ['a fashion'], ['magazine at'])
So I'm trying to make an array of all possible permutations of the alphabet letters (all lowercase), in which the letters can repeat and vary in length from 1 to 5. So for example these are some possibilities that would be in the array:
['this','is','some','examp','le']
I tried this, and it gets all the variations of words 5 letters long, but I don't know how to find varying length.
("a".."z").to_a.repeated_permutation(5).map(&:join)
EDIT:
I'm trying to do this in order to crack a SHA1-hashed string:
require 'digest'
def decrypt_string(hash)
  ("a".."z").to_a.repeated_permutation(5).map(&:join).find {|elem| Digest::SHA1.hexdigest(elem) == hash}
end
hash being the SHA1 digest of the word, such as 'e6fb06210fafc02fd7479ddbed2d042cc3a5155e'
You can modify your method slightly.
require 'digest'
def decrypt_string(hash)
  arr = ("a".."z").to_a
  (1..5).each do |n|
    arr.repeated_permutation(n) do |a|
      s = a.join
      return s if Digest::SHA1.hexdigest(s) == hash
    end
  end
end
word = "cat"
hash = Digest::SHA1.hexdigest(word)
  #=> "9d989e8d27dc9e0ec3389fc855f142c3d40f0c50"
decrypt_string(hash)
  #=> "cat"
word = "zebra"
hash = Digest::SHA1.hexdigest(word)
  #=> "38aa53de31c04bcfae9163cc23b7963ed9cf90f7"
decrypt_string(hash)
  #=> "zebra"
Calculations for "cat" took well under one second on my 2020 Macbook Pro; those for "zebra" took about 15 seconds.
Note that join should be applied within repeated_permutation's block, as repeated_permutation(n).map(&:join) would create a temporary array having as many as 26**5 #=> 11,881,376 elements (for n = 5).
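For readers following along in Python rather than Ruby, the same lazy, length-by-length search can be sketched with `itertools.product`, which yields one candidate at a time instead of building the full list. The function name `crack_sha1` and the `max_len` parameter are illustrative assumptions, not part of the answer above.

```python
import hashlib
from itertools import product
from string import ascii_lowercase

def crack_sha1(target_hex, max_len=5):
    """Return the lowercase word of length 1..max_len whose SHA1 hex digest matches, or None."""
    for n in range(1, max_len + 1):
        # product(...) is a generator: candidates are produced one at a time,
        # so the 26**n strings of length n are never held in memory together.
        for letters in product(ascii_lowercase, repeat=n):
            candidate = "".join(letters)
            if hashlib.sha1(candidate.encode("ascii")).hexdigest() == target_hex:
                return candidate
    return None
```

Called as `crack_sha1(hashlib.sha1(b"cat").hexdigest())`, this stops as soon as the three-letter candidates reach "cat".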
If you do not mind the possibility of repeated strings, then
e = Enumerator.new do |y|
  r = ('a'..'z').to_a * 5
  loop do
    # rand(5)+1 is a random length from 1 to 5 (rand(4)+1 would stop at 4)
    y << r.shuffle.take(rand(5)+1).join
  end
end
should work. You can then call it as
e.take(10)
#=> ["bz", "tnld", "jv", "s", "ngrm", "phiy", "ar", "zq", "ajjn", "cn"]
This:
Creates an Array of a through z repeated 5 times
Continually shuffles said Array
Then takes the first 1 to 5 ("random number") elements from the shuffled Array and joins them together
I have been working on this problem on leetcode https://leetcode.com/problems/string-compression/
Given an array of characters, compress it in-place.
The length after compression must always be smaller than or equal to the original array.
Every element of the array should be a character (not int) of length 1.
After you are done modifying the input array in-place, return the new length of the array.
I almost have a solution, but I can't seem to count the last character in the string, and I'm also not sure how to avoid writing a 1 into the array when a character occurs only once.
I feel like I'm pretty close and I'd like to try and keep the solution that I have without altering it too much if possible.
This is what I have so far. chars is a list of characters
def compress(chars):
    char = 0
    curr = 0
    count = 0
    while curr < len(chars):
        if chars[char] == chars[curr]:
            count += 1
        else:
            # if count == 1:
            #     break
            # else:
            chars[char-1] = count
            char = curr
            count = 0
        curr += 1
    chars[char-1] += 1
    return chars
print(compress(["a", "a", "b", "b", "c", "c", "c"]))
I wasn't quite able to adapt your code to get the answer you were seeking. Based on your question, I put together code and an explanation that could help you:
def compress(chars):
    count = 1
    current_position = 0
    # if it's a single character, just return
    # a basic array with count
    if len(chars) == 1:
        chars.append("1")
        return chars
    # loop till the 2nd last character is analyzed
    while current_position < len(chars) - 1:
        # assume that we haven't reached the 2nd last character
        # if next character is the same as the current one, delete
        # the current one and increase our count
        while current_position < len(chars) - 1 and \
                chars[current_position] == chars[current_position + 1]:
            del chars[current_position]
            count += 1
        # if next character isn't the same, time to add the count to
        # the list. Split the number into a
        # character list (e.g. 12 will become ["1", "2"]),
        # insert those digits behind the character and increment position
        for x in str(count):
            chars.insert(current_position + 1, str(x))
            current_position += 1
        # great, on to the next character
        current_position += 1
        # if we are now at the last character, it's a lonely character
        # give it a counter of 1 and exit the looping
        if current_position == len(chars) - 1:
            chars.append("1")
            break
        count = 1
    return chars
mylist = ["a","b","b","b","b","b","b","b","b","b","b","b","b"]
print(compress(mylist))
Results
mylist = ["a","b","b","b","b","b","b","b","b","b","b","b","b"]
['a', '1', 'b', '1', '2']
mylist = ["a","a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","b","b","b","b","b","b"]
['a', '1', '0', 'b', '1', '2']
mylist = ["a"]
['a', '1']
mylist = ["a","b"]
['a', '1', 'b', '1']
mylist = ["a","a","b","b","c","c","c"]
['a', '2', 'b', '2', 'c', '3']
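Note that the answer above always writes a count, including "1", and returns the list. For comparison, here is a minimal sketch that follows the quoted problem statement more literally: it overwrites the array in place with a read pointer and a write pointer, skips the count for runs of length one, and returns the new length. The name `compress_in_place` is hypothetical, and this is one possible approach rather than a reference solution.

```python
def compress_in_place(chars):
    """Compress runs in `chars` in place and return the compressed length."""
    write = 0   # next position to write compressed output
    read = 0    # next position to read from the original data
    n = len(chars)
    while read < n:
        ch = chars[read]
        run_start = read
        # Advance read to the end of the current run of identical characters.
        while read < n and chars[read] == ch:
            read += 1
        run_len = read - run_start
        chars[write] = ch
        write += 1
        # Only runs longer than one get a count, written digit by digit.
        if run_len > 1:
            for digit in str(run_len):
                chars[write] = digit
                write += 1
    return write
```

The write pointer never overtakes the read pointer, because a run of length k is replaced by at most k characters; the compressed result is `chars[:returned_length]`.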
I'm attempting to search a paragraph for each word in an array, and then output a new array with only the words that could be found.
But I've been unable to get the desired output format so far.
paragraph = "Japan is a stratovolcanic archipelago of 6,852 islands.
The four largest are Honshu, Hokkaido, Kyushu and Shikoku, which make up about ninety-seven percent of Japan's land area.
The country is divided into 47 prefectures in eight regions."
words_to_find = %w[ Japan archipelago fishing country ]
words_found = []
words_to_find.each do |w|
  paragraph.match(/#{w}/) ? words_found << w : nil
end
puts words_found
Currently the output I'm getting is a vertical list of printed words.
Japan
archipelago
country
But I would like something like, ['Japan', 'archipelago', 'country'].
I don't have much experience matching text in a paragraph and am not sure what I'm doing wrong here. Could anyone give some guidance?
This is because you are using puts to print the array: puts writes each element on its own line, appending "\n" after every word. Use print or p to show the array itself:
#!/usr/bin/env ruby
def run_me
  paragraph = "Japan is a stratovolcanic archipelago of 6,852 islands.
the four largest are Honshu, Hokkaido, Kyushu and Shikoku, which make up about ninety-seven percent of Japan's land area.
the country is divided into 47 prefectures in eight regions."
  words_to_find = %w[ Japan archipelago fishing country ]
  find_words_from_a_text_file paragraph, words_to_find
end
# no splat on words_to_find: the caller already passes an array
def find_words_from_a_text_file(paragraph, words_to_find)
  words_found = []
  words_to_find.each do |w|
    paragraph.match(/#{w}/) ? words_found << w : nil
  end
  # print each element with each and puts
  words_found.each { |x| puts "with enum and puts : : #{x}" }
  # or just use print, which does not add a new line
  print "with print :"; print words_found, "\n"
  # or with p
  p words_found
end
run_me
outputs :
za:ruby_dir za$ ./fooscript.rb
with enum and puts : : Japan
with enum and puts : : archipelago
with enum and puts : : country
with print :["Japan", "archipelago", "country"]
["Japan", "archipelago", "country"]
Here are a couple of ways to do that. Both are case-insensitive.
Use a regular expression
r = /
    \b                                            # Match a word break
    (?:#{ Regexp.union(words_to_find).source })   # Match any word in words_to_find; .source drops the
                                                  # (?-mix:...) wrapper that would switch (i) back off
    \b                                            # Match a word break
    /xi                                           # Free-spacing regex definition mode (x)
                                                  # and case-insensitive (i)
  #=> /
  #   \b                                            # Match a word break
  #   (?:Japan|archipelago|fishing|country)         # Match any word in words_to_find; ...
  #   \b                                            # Match a word break
  #   /ix                                           # Free-spacing regex definition mode (x)
  #                                                 # and case-insensitive (i)
paragraph.scan(r).uniq
  #=> ["Japan", "archipelago", "country"]
Intersect two arrays
words_to_find_hash = words_to_find.each_with_object({}) { |w,h| h[w.downcase] = w }
  #=> {"japan"=>"Japan", "archipelago"=>"archipelago", "fishing"=>"fishing",
  #    "country"=>"country"}
words_to_find_hash.values_at(*paragraph.delete(".;:,?'").
                                        downcase.
                                        split.
                                        uniq & words_to_find_hash.keys)
#=> ["Japan", "archipelago", "country"]
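For comparison, the whole-word, case-insensitive lookup can be sketched in Python (rather than Ruby) with the standard `re` module. The function name `find_words` is an assumption made for this illustration; it returns the search words in their original order and spelling.

```python
import re

def find_words(paragraph, words_to_find):
    """Return the words from words_to_find that occur as whole words in paragraph."""
    found = []
    for word in words_to_find:
        # \b anchors the match at word boundaries; re.escape guards against
        # regex metacharacters inside the search word.
        pattern = r"\b" + re.escape(word) + r"\b"
        if re.search(pattern, paragraph, flags=re.IGNORECASE):
            found.append(word)
    return found
```

Printing the returned list shows it in bracketed form, which is the display format the question asks for.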
I am going through my system dictionary and looking for words that are, according to a strict definition, neither subsets nor supersets of any other word.
The implementation below does not work, but if it did, it would be pretty efficient, I think. How do I iterate through the array and also remove items from that same array during iteration?
def collect_dead_words
  result = #file  #the words in my system dictionary, as an array
  wg = WordGame.new # the class that "knows" the find_subset_words &
                    # find_superset_words methods
  result.each do |value|
    wg.word = value
    supersets = wg.find_superset_words.values.flatten
    subsets = wg.find_subset_words.values.flatten
    result.delete(value) unless (matches.empty? && subsets.empty?)
    result.reject! { |cand| supersets.include? cand }
    result.reject! { |cand| subsets.include? cand }
  end
  result
end
Note: find_superset_words and find_subset_words both return hashes, hence the values.flatten bit
It is inadvisable to modify a collection while iterating over it. Instead, either iterate over a copy of the collection, or create a separate array of things to remove later.
One way to accomplish this is with Array#delete_if. Here's my run at it so you get the idea:
supersets_and_subsets = []
result.delete_if do |el|
  wg.word = el
  superset_and_subset = wg.find_superset_words.values.flatten + wg.find_subset_words.values.flatten
  supersets_and_subsets << superset_and_subset
  !superset_and_subset.empty?
end
result -= supersets_and_subsets.flatten.uniq
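The "collect first, remove later" pattern is language-independent. Here is a minimal Python sketch of it; `prune` and the `is_related` predicate are hypothetical stand-ins for the WordGame lookups, which are not shown in the thread.

```python
def prune(words, is_related):
    """Return the words that are unrelated to every other word.

    is_related(a, b) is a caller-supplied predicate (hypothetical here).
    """
    to_remove = set()
    # The list is only read inside the loops; removals are recorded separately.
    for word in words:
        for other in words:
            if other != word and is_related(word, other):
                to_remove.update((word, other))
    # Build the result in one pass once iteration is finished.
    return [w for w in words if w not in to_remove]
```

This version compares every pair, so it shows only the safe-removal pattern, not an efficient search.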
Here's what I came up with based on your feedback (plus a further optimization by starting with the shortest words):
def collect_dead_words
  result = []
  collection = #file
  num = collection.max_by(&:length).length
  1.upto(num) do |index|
    subset_by_length = collection.select {|word| word.length == index }
    while !subset_by_length.empty? do
      word = subset_by_length.shift   # take the next word off the queue
      wg = WordGame.new(word)
      supermatches = wg.find_superset_words.values.flatten
      submatches = wg.find_subset_words.values.flatten
      collection.reject! { |cand| supermatches.include? cand }
      collection.reject! { |cand| submatches.include? cand }
      result << wg.word if (supermatches.empty? && submatches.empty?)
      collection.delete(word)
    end
  end
  result
end
Further optimizations are welcome!
The problem
As I understand it, string s1 is a subset of string s2 if s1 == s2 after zero or more characters are removed from s2; that is, if there exists a mapping m of the indices of s1 to indices of s2 such that (see note 1):
for each index i of s1, s1[i] = s2[m(i)]; and
if i < j then m(i) < m(j).
Further s2 is a superset of s1 if and only if s1 is a subset of s2.
Note that for s1 to be a subset of s2, s1.size <= s2.size must be true.
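The definition above is the usual "subsequence" test. A minimal Python sketch of it follows (the answer's own implementation is in Ruby; the name `is_subset` is chosen here to mirror the text):

```python
def is_subset(s1, s2):
    """True if s1 can be obtained from s2 by deleting zero or more characters."""
    remaining = iter(s2)
    # `ch in remaining` advances the iterator until it finds ch, so each
    # character of s1 must be found after the previous match in s2.
    return all(ch in remaining for ch in s1)
```

Because the iterator only moves forward, the order condition m(i) < m(j) for i < j holds automatically.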
For example:
"cat" is a subset of "craft" because the latter becomes "cat" if the "r" and "f" are removed.
"cat" is not a subset of "cutie" because "cutie" has no "a".
"cat" is not a superset of "at" because "cat".include?("at") #=> true.
"cat" is not a subset of "enact" because m(0) = 3 and m(1) = 2, but m(0) < m(1) is false.
Algorithm
Subset (and hence superset) is a transitive relation, which permits significant algorithmic efficiencies. By this I mean that if s1 is a subset of s2 and s2 is a subset of s3, then s1 is a subset of s3.
I will proceed as follows:
Create empty sets neither_sub_nor_sup and longest_sups and an empty array subs_and_sups.
Sort the words in the dictionary by length, longest first.
Add w to neither_sub_nor_sup, where w is longest word in the dictionary.
For each subsequent word w in the dictionary (longest to shortest), perform the following operations:
for each element u of neither_sub_nor_sup determine if w is a subset of u. If it is, move u from neither_sub_nor_sup to longest_sups and append u to subs_and_sups.
if one or more elements were moved from neither_sub_nor_sup to longest_sups, append w to subs_and_sups; else add w to neither_sub_nor_sup.
Return subs_and_sups.
Code
require 'set'
def identify_subs_and_sups(dict)
  neither_sub_nor_sup, longest_sups = Set.new, Set.new
  dict.sort_by(&:size).reverse.each_with_object([]) do |w,subs_and_sups|
    switchers = neither_sub_nor_sup.each_with_object([]) { |u,arr|
      arr << u if w.subset(u) }
    if switchers.any?
      subs_and_sups << w
      switchers.each do |u|
        neither_sub_nor_sup.delete(u)
        longest_sups << u
        subs_and_sups << u
      end
    else
      neither_sub_nor_sup << w
    end
  end
end
class String
  def subset(w)
    w =~ Regexp.new(self.gsub(/./) { |m| "#{m}\\w*" })
  end
end
Example
dict = %w| cat catch craft cutie enact trivial rivert river |
#=> ["cat", "catch", "craft", "cutie", "enact", "trivial", "rivert", "river"]
identify_subs_and_sups(dict)
#=> ["river", "rivert", "cat", "catch", "craft"]
Variant
Rather than processing the words in the dictionary from longest to shortest, we could instead order them shortest to longest:
def identify_subs_and_sups1(dict)
  neither_sub_nor_sup, shortest_sups = Set.new, Set.new
  dict.sort_by(&:size).each_with_object([]) do |w,subs_and_sups|
    switchers = neither_sub_nor_sup.each_with_object([]) { |u,arr|
      arr << u if u.subset(w) }
    if switchers.any?
      subs_and_sups << w
      switchers.each do |u|
        neither_sub_nor_sup.delete(u)
        shortest_sups << u
        subs_and_sups << u
      end
    else
      neither_sub_nor_sup << w
    end
  end
end
identify_subs_and_sups1(dict)
#=> ["craft", "cat", "rivert", "river"]
Benchmarks
(to be continued...)
1 The OP stated (in a later comment) that s1 is not a subset of s2 if s2.include?(s1) #=> true. I am going to pretend I never saw that, as it throws a spanner into the works. Unfortunately, subset is no longer a transitive relation with that additional requirement. I haven't investigated the implications of that, but I suspect it means a rather brutish algorithm would be required, possibly requiring pairwise comparisons of all the words in the dictionary.
Given two arrays of equal size, how can I find the number of matching elements disregarding the position?
For example:
[0,0,5] and [0,5,5] would return a match of 2 since there is one 0 and one 5 in common;
[1,0,0,3] and [0,0,1,4] would return a match of 3 since there are two matches of 0 and one match of 1;
[1,2,2,3] and [1,2,3,4] would return a match of 3.
I tried a number of ideas, but they all tend to get rather gnarly and convoluted. I'm guessing there is some nice Ruby idiom, or perhaps a regex, that would be an elegant solution to this problem.
You can accomplish it with count:
a.count{|e| index = b.index(e) and b.delete_at index }
or with inject:
a.inject(0){|count, e| count + ((index = b.index(e) and b.delete_at index) ? 1 : 0)}
or with select and length (or its alias, size):
a.select{|e| (index = b.index(e) and b.delete_at index)}.size
Results:
a, b = [0,0,5], [0,5,5] output: => 2;
a, b = [1,2,2,3], [1,2,3,4] output: => 3;
a, b = [1,0,0,3], [0,0,1,4] output: => 3.
(arr1 & arr2).map { |i| [arr1.count(i), arr2.count(i)].min }.inject(0, &:+)
Here (arr1 & arr2) returns the list of unique values that both arrays contain, and arr.count(i) counts the occurrences of i in the array.
Another use for the mighty (and much needed) Array#difference, which I defined in my answer here. This method is similar to Array#-. The difference between the two methods is illustrated in the following example:
a = [1,2,3,4,3,2,4,2]
b = [2,3,4,4,4]
a - b #=> [1]
a.difference b #=> [1, 3, 2, 2]
For the present application:
def number_matches(a,b)
  left_in_b = b
  a.reduce(0) do |t,e|
    if left_in_b.include?(e)
      left_in_b = left_in_b.difference [e]
      t+1
    else
      t
    end
  end
end
number_matches [0,0,5], [0,5,5] #=> 2
number_matches [1,0,0,3], [0,0,1,4] #=> 3
Using the multiset gem:
(Multiset.new(a) & Multiset.new(b)).size
Multiset is like Set, but allows duplicate values. & is the "set intersection" operator (return all things that are in both sets).
I don't think this is an ideal answer, because it's a bit complex, but...
def count(arr)
  arr.each_with_object(Hash.new(0)) { |e,h| h[e] += 1 }
end
def matches(a1, a2)
  m = 0
  a1_counts = count(a1)
  a2_counts = count(a2)
  a1_counts.each do |e, c|
    # a2_counts[e] is 0 when e does not occur in a2 (Hash.new(0) default)
    m += [c, a2_counts[e]].min
  end
  m
end
Basically, first write a method that creates a hash from an array of the number of times each element appears. Then, use those to sum up the smallest number of times each element appears in both arrays.
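The same count-and-take-the-minimum idea can be sketched in Python (rather than Ruby) with `collections.Counter`, whose `&` operator keeps the minimum count of each shared element. The function name `number_of_matches` is an assumption for this illustration.

```python
from collections import Counter

def number_of_matches(a, b):
    """Count elements common to a and b, ignoring position but respecting multiplicity."""
    # Counter & Counter keeps, for each key, the smaller of the two counts.
    common = Counter(a) & Counter(b)
    return sum(common.values())
```

Building the two counters takes time proportional to the combined length of the inputs.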