R: Array-based replacement of string matches in a data frame

I have a data frame column containing sentences.
Within these sentences, there is a whole host of words that I want to remove.
These words could appear more than once in a single sentence, and wherever they are found I want to remove them entirely.
e.g.
Sample list of words for removal: ("the", "and", "a") (the real list will have hundreds of words)
String Before: "the quick brown fox jumps over the lazy dog and cat"
String After: "quick brown fox jumps over lazy dog cat"
sentences <- as.data.frame(c("it's a new sentence","another sentence i've constructed","and a third sentence"))
colnames(sentences) <- c("sentence")
stop_words <- list( "i" = '', "a" = "", "me" = '' , "my" = "", "myself" = "", "we" = "", "it's" = "", "a" = "", "i've" = "")
stop_pattern <- paste0("\\b", "(", paste0(stop_words, collapse = "|"),")","\\b")
trimws(gsub("\\s{2}", " ", gsub(stop_pattern, "", sentences$sentence)))
The output should have words such as "i've" removed from the above sentences, but the code fails to do so.
The output is as follows:
[1] "it's a new sentence" "another sentence i've constructed" "and a third sentence"

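Try building the pattern from the names of stop_words. The values are all empty strings, so pasting the list itself produces a pattern of empty alternatives that removes nothing. Sorting the words longest first and using perl = TRUE also keeps "i" from matching inside "i've":
words <- unique(names(stop_words))
words <- words[order(nchar(words), decreasing = TRUE)]
stop_pattern <- paste0("\\b(", paste0(words, collapse = "|"), ")\\b")
trimws(gsub("\\s+", " ", gsub(stop_pattern, "", sentences$sentence, perl = TRUE)))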

Related

Strings: How to combine first and last names in a long text string with other words?

Problem:
I have a given string = "Hello #User Name and hello again #Full Name and this works"
Desired output: ["#User Name", "#Full Name"]
Code I have in Swift:
let commentString = "Hello #User Name and hello again #Full Name and this works"
let words = commentString.components(separatedBy: " ")
let mentionQuery = "#"
for word in words.filter({ $0.hasPrefix(mentionQuery) }) {
    print(word) // prints each single tagged word: "#User" and "#Full"
}
Trying this:
if !words.filter({ $0.hasPrefix(mentionQuery) }).isEmpty {
    print(words) // ["Hello", "#User", "Name", ... etc.]
}
I'm stuck on how to get an array of strings with the full name = ["#User Name", "#Full Name"]
Would you know how?
First of all, filter checks each value in the array against the condition you give and keeps the value if the condition is true, which does not fit here because it can only return single words.
The problem can be divided into two tasks: split the string into substrings by " " (which you have done), and then combine each substring that starts with the prefix "#" with the substring that follows it.
The code will look like this:
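import Foundation // needed for components(separatedBy:)

let commentString = "Hello #User Name and hello again #Full Name"
let words = commentString.components(separatedBy: " ")
let mentionQuery = "#"
var result: [String] = []
var i = 0
while i < words.count {
    // Pair a tagged word with the word that follows it, if there is one.
    // The bounds check avoids a crash when the string ends with a tagged word.
    if words[i].hasPrefix(mentionQuery), i + 1 < words.count {
        result.append(words[i] + " " + words[i + 1])
        i += 2
        continue
    }
    i += 1
}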
The result
print("result: ", result) // ["#User Name", "#Full Name"]
You can also use filter, but note that it returns only the tagged words, not the full names:
let str = "Hello #User Name and hello again #Full Name"
let res = str.components(separatedBy: " ").filter({$0.hasPrefix("#")})
print(res)
// ["#User", "#Full"]

How to match a string in array, regardless of the string size in Ruby

I am trying to figure out the following:
When I run this in the terminal using Ruby, a string is removed from the saxophone_section array only when I type exactly that string, and the loop continues until the array is empty. But I also want to be able to remove the string from the array when I type in "alto saxophone 1", because the words of "alto 1" are found in the input string.
How can I do this when a string in an array matches, regardless of the size of an input string?
saxophone_section = ["alto 1", "alto 2", "tenor 1", "tenor 2", "bari sax"]
until saxophone_section == []
  puts "Think of sax parts in a jazz big band."
  print ">>"
  sax_part = gets.chomp.downcase
  # This is the part that is confusing me. I am trying to figure out how a string
  # in the above array can match an input, whether "alto 1", "alto saxophone 1",
  # or "Eb alto saxophone 1" is typed in ("alto 1" is found in all of them).
  # How can I make it true in all three (or more) cases?
  saxophone_section.any?(sax_part)
  # I am thinking that the line below could be used. Or not?
  parts = saxophone_section.map { |sax| sax.gsub(/\s+/m, ' ').strip.split(" ") }
  # This is the branch that deletes the item from the array:
  if saxophone_section.include?(sax_part) == true
    p saxophone_section.delete_if { |s| s == sax_part }
    puts "Damn, you're lucky"
  else
    puts "Woops! Try again."
  end
end
puts "You got all parts."
Converting the strings into arrays and taking their intersection should be an option. I know this is not the best solution, but it might save your day.
[17] pry(main)> x = "alto saxophone 1"
=> "alto saxophone 1"
[18] pry(main)> y = "i am not an anglophone"
=> "i am not an anglophone"
[19] pry(main)> z = "alto 1"
=> "alto 1"
[20] pry(main)> x.split & z.split == z.split # & is array intersection
=> true
[21] pry(main)> x.split & y.split == y.split
=> false
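The intersection check above could be folded into the original loop roughly as follows. This is only a minimal sketch: the constant SAX_SECTION and the helper name matching_part are hypothetical, and it uses array difference rather than & so that the word order of the input does not matter.

```ruby
SAX_SECTION = ["alto 1", "alto 2", "tenor 1", "tenor 2", "bari sax"].freeze

# Hypothetical helper: returns the first part whose words all occur in the
# input (in any order), or nil when no part matches.
def matching_part(section, input)
  input_words = input.downcase.split
  section.find { |part| (part.split - input_words).empty? }
end

remaining = SAX_SECTION.dup
hit = matching_part(remaining, "Eb alto saxophone 1")
remaining.delete(hit) if hit # delete by value, only when something matched
p remaining # => ["alto 2", "tenor 1", "tenor 2", "bari sax"]
```

Inside the until loop, the if branch would then test the return value of matching_part instead of include?.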
You should use regular expressions to match the input. Instead of an array of strings, try an array of regular expressions, like so:
saxophone_section = [/alto\s(?:.*\s)?1/, /alto\s(?:.*\s)?2/, /tenor\s(?:.*\s)?1/, /tenor\s(?:.*\s)?2/, /bari\s(?:.*\s)?sax/]
Then match each element of the array against the input to find whether any of them matches:
sax_part = gets.chomp.downcase
index = saxophone_section.find_index { |regex| sax_part.match(regex) }
Later you can use this index to remove the element from the array if it is not nil. Note that this needs Array#delete_at, which removes by position; Array#delete would remove by value:
saxophone_section.delete_at(index) if index
Or you can use the Array#delete_if method to delete the element from the array directly:
saxophone_section.delete_if { |regex| sax_part.match(regex) }
Note: You can use https://www.rubular.com to test your regular expressions.
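If writing each regular expression by hand becomes tedious, the patterns could probably be generated from the original strings. The sketch below assumes a hypothetical helper named loose_pattern; it is a variation on the idea above, not the exact patterns shown, and like them it will also accept inputs where unrelated words sit between the two parts of a name.

```ruby
# Hypothetical helper: turns "alto 1" into /\balto\b.*\b1\b/, so extra words
# may appear between the words of the part name.
def loose_pattern(part)
  words = part.split.map { |word| Regexp.escape(word) }
  Regexp.new('\b' + words.join('\b.*\b') + '\b')
end

section = ["alto 1", "alto 2", "tenor 1", "tenor 2", "bari sax"]
input = "eb alto saxophone 1"

# Remove every part whose generated pattern matches the input.
section.delete_if { |part| input.match?(loose_pattern(part)) }
p section # => ["alto 2", "tenor 1", "tenor 2", "bari sax"]
```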
Here's where I'd start with this sort of task. These are great building blocks for human interfaces on the web or in applications:
require 'regexp_trie'
saxophone_section = ["alto 1", "alto 2", "tenor 1", "tenor 2", "bari sax"]
RegexpTrie.union saxophone_section # => /(?:alto\ [12]|tenor\ [12]|bari\ sax)/
The output of RegexpTrie.union is a pattern that will match all of the strings in saxophone_section. The pattern is concise and efficient, and best of all, doesn't have to be generated by hand.
Applying that pattern to the string as it is being typed tells you when there is a hit, but only once enough of the string has been entered to match.
That's where a regular Trie is very useful. When you're trying to find what possible hits you could have, prior to having a full match, a Trie can find all the possibilities:
require 'trie'
trie = Trie.new
saxophone_section = ["alto 1", "alto 2", "tenor 1", "tenor 2", "bari sax"]
saxophone_section.each { |w| trie.add(w) }
trie.children('a') # => ["alto 1", "alto 2"]
trie.children('alto') # => ["alto 1", "alto 2"]
trie.children('alto 2') # => ["alto 2"]
trie.children('bari') # => ["bari sax"]
Blend those together and see what you come up with.
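For experimenting without installing either gem, a rough standard-library stand-in might look like this. It is a sketch under stated assumptions: Regexp.union builds a plain alternation rather than the compact trie-shaped pattern, and the helpers full_match_pattern and candidates (both hypothetical names) use a linear scan in place of a real trie.

```ruby
SAX_PARTS = ["alto 1", "alto 2", "tenor 1", "tenor 2", "bari sax"].freeze

# Regexp.union escapes each string and joins the results with "|".
def full_match_pattern(parts)
  Regexp.union(parts)
end

# Linear-scan stand-in for a trie lookup: every part starting with the prefix.
def candidates(parts, prefix)
  parts.select { |part| part.start_with?(prefix) }
end

p candidates(SAX_PARTS, "alto")                              # => ["alto 1", "alto 2"]
p "play alto 2 now".match?(full_match_pattern(SAX_PARTS))    # => true
```

For a handful of strings the linear scan is likely fine; a trie starts to pay off as the list grows.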

How to batch enumerables in ruby

In my quest to understand ruby's enumerable, I have something similar to the following
FileReader.read(very_big_file)
.lazy
.flat_map {|line| get_array_of_similar_words } # array.size is ~10
.each_slice(100) # wait for 100 items
.map{|array| process_100_items}
Since each flat_map call emits an array of ~10 items, I was expecting the each_slice call to batch the items in 100s, i.e. to wait until there are 100 items before passing them to the final .map call, but that is not the case.
How do I achieve functionality similar to the buffer function in reactive programming?
To see how lazy affects the calculations, let's look at an example. First construct a file:
str =<<~_
Now is the
time for all
good Ruby coders
to come to
the aid of
their bowling
team
_
fname = 't'
File.write(fname, str)
#=> 82
and specify the slice size:
slice_size = 4
Now I will read lines, one-by-one, split the lines into words, remove duplicate words and then append those words to an array. As soon as the array contains at least 4 words I will take the first four and map them into the longest word of the 4. The code to do that follows. To show how the calculations progress I will salt the code with puts statements. Note that IO::foreach without a block returns an enumerator.
IO.foreach(fname).
lazy.
tap { |o| puts "o1 = #{o}" }.
flat_map { |line|
puts "line = #{line}"
puts "line.split.uniq = #{line.split.uniq} "
line.split.uniq }.
tap { |o| puts "o2 = #{o}" }.
each_slice(slice_size).
tap { |o| puts "o3 = #{o}" }.
map { |arr|
puts "arr = #{arr}, arr.max = #{arr.max_by(&:size)}"
arr.max_by(&:size) }.
tap { |o| puts "o3 = #{o}" }.
to_a
#=> ["time", "good", "coders", "bowling", "team"]
The following is displayed:
o1 = #<Enumerator::Lazy:0x00005992b1ab6970>
o2 = #<Enumerator::Lazy:0x00005992b1ab6880>
o3 = #<Enumerator::Lazy:0x00005992b1ab6678>
o3 = #<Enumerator::Lazy:0x00005992b1ab6420>
line = Now is the
line.split.uniq = ["Now", "is", "the"]
line = time for all
line.split.uniq = ["time", "for", "all"]
arr = ["Now", "is", "the", "time"], arr.max = time
line = good Ruby coders
line.split.uniq = ["good", "Ruby", "coders"]
arr = ["for", "all", "good", "Ruby"], arr.max = good
line = to come to
line.split.uniq = ["to", "come"]
line = the aid of
line.split.uniq = ["the", "aid", "of"]
arr = ["coders", "to", "come", "the"], arr.max = coders
line = their bowling
line.split.uniq = ["their", "bowling"]
arr = ["aid", "of", "their", "bowling"], arr.max = bowling
line = team
line.split.uniq = ["team"]
arr = ["team"], arr.max = team
If the line lazy. is removed the return value is the same but the following is displayed (.to_a at the end now being superfluous):
o1 = #<Enumerator:0x00005992b1a438f8>
line = Now is the
line.split.uniq = ["Now", "is", "the"]
line = time for all
line.split.uniq = ["time", "for", "all"]
line = good Ruby coders
line.split.uniq = ["good", "Ruby", "coders"]
line = to come to
line.split.uniq = ["to", "come"]
line = the aid of
line.split.uniq = ["the", "aid", "of"]
line = their bowling
line.split.uniq = ["their", "bowling"]
line = team
line.split.uniq = ["team"]
o2 = ["Now", "is", "the", "time", "for", "all", "good", "Ruby",
"coders", "to", "come", "the", "aid", "of", "their",
"bowling", "team"]
o3 = #<Enumerator:0x00005992b1a41a08>
arr = ["Now", "is", "the", "time"], arr.max = time
arr = ["for", "all", "good", "Ruby"], arr.max = good
arr = ["coders", "to", "come", "the"], arr.max = coders
arr = ["aid", "of", "their", "bowling"], arr.max = bowling
arr = ["team"], arr.max = team
o3 = ["time", "good", "coders", "bowling", "team"]
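Scaled back up to the sizes in the question, the same chain can be checked with generated data. This is a minimal sketch: batch_sizes is a hypothetical helper, and the fake "words" stand in for whatever get_array_of_similar_words returns.

```ruby
# Hypothetical helper: each input line expands to words_per_line items, the
# flattened stream is cut into slices, and the size of each slice is returned.
def batch_sizes(lines, words_per_line, slice_size)
  lines.lazy
       .flat_map { |line| Array.new(words_per_line) { |i| "#{line}-#{i}" } }
       .each_slice(slice_size)
       .map(&:size)
       .to_a
end

p batch_sizes(1..25, 10, 100) # => [100, 100, 50]
```

Each slice holds 100 items except possibly the last, so each_slice after a lazy flat_map does buffer across the arrays that flat_map emits.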

Join list into one string

How to remove the first word of a string?
0.91% ABC DEF
0.922% ABC DEF GHI
OUTPUT: ABC DEF / ABC DEF GHI
I tried
let test = str.split(separator: " ")[1...]
print(test)
print(test.joined(separator: " "))
Which gives me:
["ABC", "DEF"]
JoinedSequence<ArraySlice<Substring>>(_base: ArraySlice(["ABC", "DEF"]), _separator: ContiguousArray([" "]))
How can I print the JoinedSequence as a string?
Try this:
import Foundation // needed for components(separatedBy:)
let str = "0.91% ABC DEF"
let parts = str.components(separatedBy: " ").dropFirst()
print(parts.joined(separator: " "))
Which prints:
ABC DEF
Given a string
let text = "0.91% ABC DEF"
you can search for the index after the first space
if let index = text.range(of: " ")?.upperBound {
    // Slice from the index to the end; substring(from:) is deprecated.
    let result = String(text[index...])
    print(result)
}

Read and display CSV in text widget using loop

I have a program that reads in a csv file and outputs a number of fields into a text widget. My initial hack of allocating variables works fine, but it doesn't give me any flexibility to display more than three lines from the csv file, so I need to go down the route of using a loop. Unfortunately I'm unsure how to attack this with what I currently have. My hack code follows below.
def checkcsv():
    with open("lesspreadsheettest.csv") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            result=(row['Shop Order'])
            if sonumber.get() == result:
                descQty=(row['Quantity'])
                descInfo=(row['Description'])
                descPN=(row['Part Number'])
                descDwg1=(row['Drawings1'])
                descIss1=(row['Issue1'])
                descDwg2=(row['Drawings2'])
                descIss2=(row['Issue2'])
                descDwg3=(row['Drawings3'])
                descIss3=(row['Issue3'])
                self.outputQty.insert(1.0, descQty)
                self.outputDesc.insert(1.0, descPN, "", ": ", "", descInfo)
                self.dwgoutputbox.insert(1.0, descDwg3, "dwg", " Issue: ", "", descIss3, "", "\n")
                self.dwgoutputbox.insert(1.0, descDwg2, "dwg", " Issue: ", "", descIss2, "", "\n")
                self.dwgoutputbox.insert(1.0, descDwg1, "dwg", " Issue: ", "", descIss1, "", "\n")
                self.outputQty.configure(state="disabled")
                self.outputDesc.configure(state="disabled")
                self.dwgoutputbox.configure(state="disabled")
Hmm, to be honest I feel like one of us is missing something quite easy and obvious. I've tweaked your code a little bit, but all I've done is add one for loop. Maybe this is what you are looking for. If not, hopefully we'll be able to specify the problem better:
import csv

def checkcsv():
    with open("lesspreadsheettest.csv") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            result=(row['Shop Order'])
            if sonumber.get() == result:
                descQty=(row['Quantity'])
                descInfo=(row['Description'])
                descPN=(row['Part Number'])
                # I've added the following loop and commented out some lines.
                # Hopefully it is clear how to add more descDwg, descIss:
                # simply change the range, assuming your csv columns are
                # named that way (DrawingsX, IssueX, ...).
                # range replaces the Python 2 xrange used originally.
                for i in range(1, 4):
                    descDwg=(row['Drawings'+ str(i)])
                    descIss=(row['Issue'+ str(i)])
                    # check whether the issue is empty; if so, skip to the next one
                    if descIss == '':
                        continue
                    # insert at "end" so drawings appear in order 1, 2, 3;
                    # inserting at 1.0 inside the loop would reverse them
                    self.dwgoutputbox.insert("end", descDwg, "dwg", " Issue: ", "", descIss, "", "\n")
                # descDwg1=(row['Drawings1'])
                # descIss1=(row['Issue1'])
                # descDwg2=(row['Drawings2'])
                # descIss2=(row['Issue2'])
                # descDwg3=(row['Drawings3'])
                # descIss3=(row['Issue3'])
                self.outputQty.insert(1.0, descQty)
                self.outputDesc.insert(1.0, descPN, "", ": ", "", descInfo)
                # self.dwgoutputbox.insert(1.0, descDwg3, "dwg", " Issue: ", "", descIss3, "", "\n")
                # self.dwgoutputbox.insert(1.0, descDwg2, "dwg", " Issue: ", "", descIss2, "", "\n")
                # self.dwgoutputbox.insert(1.0, descDwg1, "dwg", " Issue: ", "", descIss1, "", "\n")
                self.outputQty.configure(state="disabled")
                self.outputDesc.configure(state="disabled")
                self.dwgoutputbox.configure(state="disabled")
