I am looking for a way to go through a sentence to see if an apostrophe is a quote or a contraction so I can remove punctuation from the string, and then normalize all words.
My test sentence is: don't frazzel the horses. 'she said wow'.
In my attempts I have split the sentence into parts, tokenizing on words and non-words, like so:
contractionEndings = ["d", "l", "ll", "m", "re", "s", "t", "ve"]
sentence = "don't frazzel the horses. 'she said wow'.".split(/(\w+)|(\W+)/i).reject! { |word| word.empty? }
This returns ["don", "'", "t", " ", "frazzel", " ", "the", " ", "horses", ". '", "she", " ", "said", " ", "wow", "'."]
Next I want to be able to iterate sentence looking for apostrophes ' and when one is found, compare the next element to see if it is included in the contractionEndings array. If it is included I want to join the prefix, the apostrophe ', and the suffix into one index, else remove the apostrophes.
In this example, don, ', and t would be joined into don't as a single index, but . ' and '. would be removed.
Afterwards I can run a regex to remove other punctuation from the sentence so that I can pass it into my stemmer to normalize the input.
The final output I am after is don't frazzel the horses she said wow in which all punctuation will be removed besides apostrophes for contractions.
If anyone has suggestions for making this work, or a better idea for solving the problem, I would like to know.
Overall I want to remove all punctuation from the sentence except for contractions.
Thanks
How about this?
irb:0> s = "don't frazzel the horses. 'she said wow'."
irb:0> contractionEndings = ["d", "l", "ll", "m", "re", "s", "t", "ve"]
irb:0> s.scan(/\w+(?:'(?:#{contractionEndings.join('|')}))?/)
=> ["don't", "frazzel", "the", "horses", "she", "said", "wow"]
The regex scans for a run of "word" characters, followed optionally (the ?) by an apostrophe plus a contraction ending. You can interpolate Ruby expressions into a regex literal just as you can in a double-quoted string, so we can build the list of endings by joining them with the regex alternation operator |. The last thing is to mark the groups (the sections in parentheses) as non-capturing with ?: so that scan returns the whole match on each iteration rather than a bunch of nils.
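One caveat: because alternatives are tried left to right and "l" is listed before "ll", a word such as "we'll" would come back as "we'l" followed by "l". Here is a minimal sketch of one way around that, assuming a word boundary (\b) after the ending is acceptable. The method name and default argument are illustrative, not from the question.

```ruby
# Hypothetical helper: scan for words, keeping an apostrophe only when it is
# followed by one of the listed contraction endings.
def words_with_contractions(text, endings = %w[d l ll m re s t ve])
  # The \b after the ending forces the whole ending to be consumed,
  # so "we'll" is not cut short at "we'l".
  pattern = /\w+(?:'(?:#{Regexp.union(endings).source})\b)?/
  text.scan(pattern)
end

p words_with_contractions("don't frazzel the horses. 'she said wow'.")
p words_with_contractions("we'll see")
```

Sorting the endings longest-first would be another option; the boundary check just avoids depending on the order of the list.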
Or maybe you don't need the explicit list of contraction endings at all. The version below also handles some other problematic constructions, thanks to Cary.
irb:0> "don't -frazzel's the jack-o'-lantern's handle, ma'am- 'she said hey-ho'.".scan(/\w[-'\w]*\w(?:'\w+)?/)
=> ["don't", "frazzel's", "the", "jack-o'-lantern's", "handle", "ma'am", "she", "said", "hey-ho"]
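Note that this pattern needs at least two word characters per match, so one-letter words such as "a" or "I" are skipped. If those matter, one possible tweak (a sketch, not the answerer's code) is to make everything after the first character optional. The method name is made up for illustration.

```ruby
# Hypothetical variant: a token is one word character, optionally followed by
# more word characters, hyphens or apostrophes, as long as it ends on a word
# character.
def loose_tokens(text)
  text.scan(/\w(?:[-'\w]*\w)?/)
end

p loose_tokens("I said don't -touch- a jack-o'-lantern's 'handle'.")
```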
As I mentioned in a comment, I think trying to list all possible contraction endings is fruitless. In fact, some contractions, such as "couldn’t’ve", contain more than one apostrophe.
The other option is to match single quotes. My first thought was to remove the character "'" if it is at the start of the sentence or after a space, or if it is followed by a space or is at the end of the sentence. Unfortunately, that approach is frustrated by possessive words that end in "s": "Chris' cat has fleas". Even worse, how are we to interpret "Where are 'Chris' cars'?" or "'Twas the 'night before Christmas'."?
Here is a way to remove single quotes when there are no apostrophes at the beginning or ends of words (which, admittedly, is of questionable value).
r = /
    (?<=\A|\s)  # match the beginning of the string or a whitespace char in a
                # positive lookbehind
    \'          # match a single quote
    |           # or
    \'          # match a single quote
    (?=\s|\z)   # match a whitespace char or the end of the string in a
                # positive lookahead
    /x          # free-spacing regex definition mode
"don't frazzel the horses. 'she said wow'".gsub(r,'')
#=> "don't frazzel the horses. she said wow"
I think the best solution is for the English language to use different symbols for apostrophes and single quotes.
Usually the apostrophe will stay with the contraction after tokenization.
Try a normal NLP tokenizer, e.g. in python nltk:
>>> from nltk import word_tokenize
>>> word_tokenize("don't frazzel the horses")
['do', "n't", 'frazzel', 'the', 'horses']
For multiple sentences:
>>> from string import punctuation
>>> from nltk import sent_tokenize, word_tokenize
>>> text = "don't frazzel the horses. 'she said wow'."
>>> sents = sent_tokenize(text)
>>> sents
["don't frazzel the horses.", "'she said wow'."]
>>> [word for word in word_tokenize(sents[0]) if word not in punctuation]
['do', "n't", 'frazzel', 'the', 'horses']
>>> [word for word in word_tokenize(sents[1]) if word not in punctuation]
["'she", 'said', 'wow']
Flattening the sentences before word_tokenize:
>>> from itertools import chain
>>> sents
["don't frazzel the horses.", "'she said wow'."]
>>> [word_tokenize(sent) for sent in sents]
[['do', "n't", 'frazzel', 'the', 'horses', '.'], ["'she", 'said', 'wow', "'", '.']]
>>> list(chain(*[word_tokenize(sent) for sent in sents]))
['do', "n't", 'frazzel', 'the', 'horses', '.', "'she", 'said', 'wow', "'", '.']
>>> [word for word in list(chain(*[word_tokenize(sent) for sent in sents])) if word not in punctuation]
['do', "n't", 'frazzel', 'the', 'horses', "'she", 'said', 'wow']
Note that the single quote stays attached to 'she. Sadly, even the simple task of tokenization still has its weaknesses, despite all the hype around sophisticated (deep) machine learning methods today =(
It makes mistakes even with formal grammatical text:
>>> text = "Don't frazzel the horses. 'She said wow'."
>>> sents = sent_tokenize(text)
>>> sents
["Don't frazzel the horses.", "'She said wow'."]
>>> [word_tokenize(sent) for sent in sents]
[['Do', "n't", 'frazzel', 'the', 'horses', '.'], ["'She", 'said', 'wow', "'", '.']]
You could use the Pragmatic Tokenizer gem. It can detect English contractions.
require 'pragmatic_tokenizer'
s = "don't frazzel the horses. 'she said wow'."
PragmaticTokenizer::Tokenizer.new(punctuation: :none).tokenize(s)
=> ["don't", "frazzel", "the", "horses", "she", "said", "wow"]
s = "'Twas the 'night before Christmas'."
PragmaticTokenizer::Tokenizer.new(punctuation: :none).tokenize(s)
=> ["'twas", "the", "night", "before", "christmas"]
s = "He couldn’t’ve been right."
PragmaticTokenizer::Tokenizer.new(punctuation: :none).tokenize(s)
=> ["he", "couldn’t’ve", "been", "right"]
Say I have an array of sentences:
sentences = ["Tom is a good person", "Jack spent some time", "Kat did something wrong"]
and I have a blacklist of names:
blacklist = ["Jack", "Kat"]
Now I need to filter sentences into an array that removes all the sentences that contain blacklisted names, so:
["Tom is a good person"]
How would I do it in Ruby?
Thanks!
sentences = ["Tom is a good person", "Jack spent some time", "Kat did something wrong",
"Kathy knows her stuff"]
blacklist = ["Jack", "Kat"]
r = /\b#{Regexp.union(blacklist)}\b/
#=> /\b(?-mix:Jack|Kat)\b/
sentences.reject { |s| s.match?(r) }
#=> ["Tom is a good person", "Kathy knows her stuff"]
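Word breaks (\b) are needed in the regular expression so that "Kat" does not match the first three letters of "Kathy". One could instead write the following, keeping a non-capturing group so that \b applies to every alternative rather than only to the first and the last:
r = /\b(?:#{blacklist.join('|')})\b/
#=> /\b(?:Jack|Kat)\b/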
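To illustrate why the alternation should be grouped before \b is added, here is a short sketch. The sample sentences, including "Jackson left early", are made up for the demonstration.

```ruby
blacklist = ["Jack", "Kat"]
sample = ["Tom is a good person", "Jackson left early", "Kat did something wrong"]

# Without a group this is parsed as (\bJack) | (Kat\b): each \b binds to one name only.
ungrouped = /\b#{blacklist.join('|')}\b/
# Regexp.union interpolates as a group, so both \b anchors apply to every name.
grouped = /\b#{Regexp.union(blacklist)}\b/

kept_ungrouped = sample.reject { |s| s.match?(ungrouped) }
kept_grouped   = sample.reject { |s| s.match?(grouped) }

p kept_ungrouped  # "Jackson left early" is wrongly rejected here
p kept_grouped
```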
You just need to reject the records
sentences.reject!{|sentence| sentence.match(blacklist.join('|'))}
You will get the required output -
["Tom is a good person"]
Docs for reject! - https://ruby-doc.org/core-2.2.0/Array.html#method-i-reject-21
reject! modifies the array in place. If you don't want that, use reject and store the result in a new array.
Go back to =~ :)
sentences.reject!{|sentence| !((Regexp.new(blacklist.join('|')) =~ sentence).nil?) }
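This rejects the sentence if =~ finds a match, that is, if its result is not nil (hence the !...nil? in the code).
It does essentially the same thing, although match? (Ruby 2.4+) is generally faster than =~ because it does not set $~ or build a MatchData object.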
sentences = ["Tom is a good person", "Jack spent some time", "Kat did something wrong"]
blacklist = ["Jack", "Kat"]
Program
p sentences.filter{|string|!(blacklist.map{|x|string.match?(x)}.any?)}
If you are using a Ruby older than 2.6 (filter was added as an alias of select in 2.6), then
p sentences.select{|string|!(blacklist.map{|x|string.include?(x)}.any?)}
Result
["Tom is a good person"]
I have an array of strings
["123", "a", "cc", "dddd", "mi hello", "33"]
I want to join by a space consecutive elements that begin with a letter, have at least two characters, and do not contain a space. Applying that logic to the above would yield
["123", "a", "cc dddd", "mi hello", "33"]
Similarly if my array were
["mmm", "3ss", "foo", "bar", "foo", "55"]
I would want the result to be
["mmm", "3ss", "foo bar foo", "55"]
How do I do this operation?
There are many ways to solve this; ruby is a highly expressive language. It would be most beneficial for you to show what you have tried so far, so that we can help debug/fix/improve your attempt.
For example, here is one possible implementation that I came up with:
def combine_words(array)
  array
    .chunk { |string| string.match?(/\A[a-z][a-z0-9]+\z/i) }
    .flat_map { |concat, strings| concat ? strings.join(' ') : strings }
end
combine_words(["aa", "b", "cde", "f1g", "hi", "2j", "l3m", "op", "q r"])
# => ["aa", "b", "cde f1g hi", "2j", "l3m op", "q r"]
Note that I was a little unclear exactly how to interpret your requirement:
begin with a letter, have at least two characters, and do not contain a space
Can strings contain punctuation? Underscores? UTF-8 characters? I took it to mean "only a-z, A-Z or 0-9", but you may want to tweak this.
A literal interpretation of your requirement could be: /\A[[:alpha:]][^ ]+\z/, but I suspect that's not what you meant.
Explanation:
Enumerable#chunk will iterate through the array and collect terms by the block's response value. In this case, it will find sequential elements that match/don't match the required regex.
String#match? checks whether the string matches the pattern, and returns a boolean response. Note that if you were using ruby v2.3 or below, you'd have needed some workaround such as !!string.match, to force a boolean response.
Enumerable#flat_map then loops through each "result", joining the strings if necessary, and flattens the result to avoid returning any nested arrays.
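To make the intermediate steps described above concrete, here is a small sketch that prints what chunk produces before flat_map flattens it. The input array and variable names are chosen purely for illustration.

```ruby
words = ["aa", "b", "cde", "f1g", "q r"]
word_like = /\A[a-z][a-z0-9]+\z/i

# chunk yields [block_result, consecutive_elements] pairs.
pairs = words.chunk { |s| s.match?(word_like) }.to_a
p pairs
# [[true, ["aa"]], [false, ["b"]], [true, ["cde", "f1g"]], [false, ["q r"]]]

# flat_map joins the "true" groups and splices the "false" groups back in.
combined = pairs.flat_map { |is_word, group| is_word ? group.join(' ') : group }
p combined
# ["aa", "b", "cde f1g", "q r"]
```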
Here is another, similar, solution:
def word?(string)
  string.match?(/\A[a-z][a-z0-9]+\z/i)
end
def combine_words(array)
  array
    .chunk_while { |x, y| word?(x) && word?(y) }
    .map { |group| group.join(' ') }
end
Or, here's a more "low-tech" solution - which only uses more basic language features. (I'm re-using the same word? method here):
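def combine_words(array)
  previous_was_word = false
  result = []
  array.each do |string|
    if previous_was_word && word?(string)
      # Build a new string rather than appending with <<, which would
      # mutate the caller's original element in place.
      result[-1] = "#{result.last} #{string}"
    else
      result << string
    end
    previous_was_word = word?(string)
  end
  result
end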
You can use Enumerable#chunk.
def chunk_it(arr)
  arr.chunk { |s|
    (s.size > 1) && (s[0].match?(/\p{Alpha}/)) && !s.include?(' ') }.
    flat_map { |tf, a| tf ? a.join(' ') : a }
end
chunk_it(["123", "a", "cc", "dddd", "mi hello", "33"])
#=> ["123", "a", "cc dddd", "mi hello", "33"]
chunk_it ["mmm", "3ss", "foo", "bar", "foo", "55"]
#=> ["mmm", "3ss", "foo bar foo", "55"]
I want to case-insensitively match a string from my array, TOKENS, at the beginning of another string followed by a space or the end of the line.
My tokens array looks like:
2.4.0 :013 > TOKENS = ["m", "o"]
=> ["m", "o"]
When I try to match each element from my array, it is picking out the wrong results:
2.4.0 :009 > data_col = ["M", "b", "Mabc", "abc m b"]
=> ["M", "b", "Mabc", "abc m b"]
...
2.4.0 :015 > data_col.select{|string| string =~ /^[#{Regexp.union(TOKENS)}]([[:space:]]|$)/i }
=> ["M", "b"]
This is matching both the "M" and the "b" entries even though "b" does not appear in my list of TOKENS. How do I modify my regular expression so that only the proper value, "M" will be matched?
I'm using Ruby 2.4.
I'd use:
TOKENS = ["m", "o"]
DATA_COL = ["M", "b", "Mabc", "abc m b"]
RE = /^(?:#{Regexp.union(TOKENS).source})(?: |$)/i
DATA_COL.select{ |string| string[RE] }
# => ["M"]
Breaking it down a bit:
Regexp.union(TOKENS).source # => "m|o"
/^(?:#{Regexp.union(TOKENS).source})(?: |$)/i # => /^(?:m|o)(?: |$)/i
/^[#{Regexp.union(TOKENS)}]([[:space:]]|$)/i isn't a good idea inside a loop. Each time through, you force Ruby to rebuild the pattern. Efficiency is important inside loops, especially big ones, so create the pattern once outside the loop and then refer to it inside.
The next problem is that Regexp.union has a concept of the correct case it should match:
Regexp.union(TOKENS).to_s # => "(?-mix:m|o)"
The (?-mix: part is how the regular expression engine records the flags for the pattern. When one pattern is embedded inside another, the embedded pattern keeps its own flags, causing us to gnash our teeth and weep:
/#{Regexp.union(TOKENS)}/i # => /(?-mix:m|o)/i
The trailing i tells the outer pattern to ignore case, but the embedded pattern has i switched off (the -mix part), so it still honors case. And that's what is breaking your pattern.
The fix is to use source when embedding like I did above.
See the Regex "options" section for more information.
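As a side-by-side sketch of the points above (variable names are illustrative): besides the flag issue, the square brackets in the question's pattern turn the interpolated text into a character class, where ?-m reads as a range from "?" to "m", and that range is what lets "b" through.

```ruby
tokens = ["m", "o"]
data_col = ["M", "b", "Mabc", "abc m b"]

union = Regexp.union(tokens)
p union.to_s    # "(?-mix:m|o)" : the flags travel with the pattern
p union.source  # "m|o"         : just the pattern text

# The question's pattern: the brackets make a character class out of "(?-mix:m|o)".
bracketed = /^[#{union}]([[:space:]]|$)/i
# The answer's pattern: a plain group around the source text.
fixed = /^(?:#{union.source})(?: |$)/i

bracketed_hits = data_col.select { |s| s.match?(bracketed) }
fixed_hits     = data_col.select { |s| s.match?(fixed) }
p bracketed_hits
p fixed_hits
```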
Using Ruby 2.4. I have an array of strings. I want to strip breaking and non-breaking spaces from the end of each item in the array, and also replace multiple consecutive occurrences of whitespace with a single space. I thought the code below was the way to do it, but I get an error:
> words = ["1", "HUMPHRIES \t\t\t\t\t\t\t\t\t\t\t\t\t\t, \t\t\t\t\t\t\t\t\t\t\t\t\tJASON", "328", "FAIRVIEW, OR (US)", "US", "M", " 27 ", "00:27:30.00 \t\t\t\t\t\t\t\t\t\t\t \n"]
> words.map{|word| word ? word.gsub!(/\A\p{Space}+|\p{Space}+\z/, '').gsub!(/[[:space:]]+/, ' ') : nil }
NoMethodError: undefined method `gsub!' for nil:NilClass
from (irb):4:in `block in irb_binding'
from (irb):4:in `map'
from (irb):4
from /Users/nataliab/.rvm/gems/ruby-2.4.0/gems/railties-5.0.2/lib/rails/commands/console.rb:65:in `start'
from /Users/nataliab/.rvm/gems/ruby-2.4.0/gems/railties-5.0.2/lib/rails/commands/console_helper.rb:9:in `start'
from /Users/nataliab/.rvm/gems/ruby-2.4.0/gems/railties-5.0.2/lib/rails/commands/commands_tasks.rb:78:in `console'
from /Users/nataliab/.rvm/gems/ruby-2.4.0/gems/railties-5.0.2/lib/rails/commands/commands_tasks.rb:49:in `run_command!'
from /Users/nataliab/.rvm/gems/ruby-2.4.0/gems/railties-5.0.2/lib/rails/commands.rb:18:in `<top (required)>'
from bin/rails:4:in `require'
from bin/rails:4:in `<main>'
How can I properly replace consecutive occurrences of white space as well as strip it off from each word in the array?
Use plain gsub, not gsub!:
words.map do |w|
  # guard with respond_to?(:gsub) if the array might contain non-strings
  w.gsub(/(?<=[^\,\.])\s+|\A\s+/, '') if w.respond_to?(:gsub)
end
That's because gsub! returns nil when it makes no substitutions, so chaining a second gsub! onto it calls gsub! on nil. That is why you get the undefined method `gsub!' for nil:NilClass error.
From gsub! explanation in ruby doc:
Performs the substitutions of String#gsub in place, returning str, or
nil if no substitutions were performed. If no block and no replacement
is given, an enumerator is returned instead.
As @CarySwoveland mentioned in the comments, \s doesn't match non-breaking spaces. To handle them, use [[:space:]] instead of \s.
You can use the following:
words.map { |w| w.gsub(/(?<=[^\,\.])\s+/,'') }
#=> ["1", "HUMPHRIES, JASON", "328", "FAIRVIEW, OR(US)",
#    "US", "M", " 27", "00:27:30.00"]
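Putting the two points together (non-bang gsub, and [[:space:]] rather than \s), here is a minimal sketch of what the question seems to ask for. It assumes that both leading and trailing whitespace should go, as in the question's own regex. The method name normalize_space and the shortened sample data are made up for this example.

```ruby
# Hypothetical helper: strip whitespace (including non-breaking spaces) from
# both ends, then collapse each remaining run of whitespace into one space.
def normalize_space(str)
  str.gsub(/\A[[:space:]]+|[[:space:]]+\z/, '').gsub(/[[:space:]]+/, ' ')
end

# gsub! returns nil when nothing was substituted, which is what broke the chain.
no_change = "abc".dup.gsub!(/[[:space:]]+/, ' ')
p no_change  # nil

words = ["1", "HUMPHRIES \t\t, \t\tJASON", " M\u00A0 \u00A0", " 27 ", nil]
cleaned = words.map { |w| w && normalize_space(w) }
p cleaned
```

The `w &&` guard keeps nil entries as nil, mirroring the `word ? ... : nil` in the question.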
I assume all whitespace and non-breaking spaces at the end of each string are to be removed and, of what's left, all substrings of whitespace characters and non-breaking spaces are to be replaced by one space. (Natalia, if that's not correct please let me know in a comment.)
words =
  ["1",
   "HUMPHRIES \t\t\t\, \t\t\t\t\t\t\t\t\t\t\t\t\tJASON",
   " M\u00A0 \u00A0",
   " 27 ",
   "00:27:30.00 \t\t\t\t\t\t\t\t\t\t\t \n"]
R = /
    [[:space:]]      # match one whitespace character (POSIX bracket expression)
    (?=[[:space:]])  # that is followed by another one (positive lookahead)
    |                # or
    [[:space:]]+     # match one or more whitespace characters
    \z               # at the end of the string
    /x               # free-spacing regex definition mode
words.map { |w| w.gsub(R, '').gsub(/[[:space:]]/, ' ') }
#=> ["1", "HUMPHRIES , JASON", " M", " 27", "00:27:30.00"]
Note that the POSIX [[:space:]] includes ASCII whitespace and Unicode's non-breaking space character, \u00A0.
To see why the second gsub is needed, note that
words.map { |w| w.gsub(R, '') }
#=> ["1", "HUMPHRIES\t,\tJASON", " M", " 27", "00:27:30.00"]
There is a list of words and list of banned words. I want to go through the word list and redact all the banned words. This is what I ended up doing (notice the catched boolean):
puts "Give input text:"
text = gets.chomp
puts "Give redacted word:"
redacted = gets.chomp
words = text.split(" ")
redacted = redacted.split(" ")
catched = false
words.each do |word|
  redacted.each do |redacted_word|
    if word == redacted_word
      catched = true
      print "REDACTED "
      break
    end
  end
  if catched == true
    catched = false
  else
    print word + " "
  end
end
Is there any proper/efficient way?
This also works:
words - redacted
The array operators +, -, and & are simple and useful:
irb(main):016:0> words = ["a", "b", "a", "c"]
=> ["a", "b", "a", "c"]
irb(main):017:0> redacted = ["a", "b"]
=> ["a", "b"]
irb(main):018:0> words - redacted
=> ["c"]
irb(main):019:0> words + redacted
=> ["a", "b", "a", "c", "a", "b"]
irb(main):020:0> words & redacted
=> ["a", "b"]
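Note that the array operators above drop the banned words altogether. If a placeholder should be printed instead, as in the question, a map over the words is one option. This is a sketch: the method name is illustrative, and the Set is only there to keep lookups cheap for long banned lists.

```ruby
require 'set'

# Hypothetical helper: replace each banned word with a placeholder.
def redact_words(words, banned)
  banned_set = banned.to_set
  words.map { |word| banned_set.include?(word) ? 'REDACTED' : word }
end

words = "A fnord is a fnord.".split
puts redact_words(words, %w[ford fnord foo]).join(' ')
# prints: A REDACTED is a fnord.
```

Like the question's loop, this compares whole whitespace-separated tokens, so "fnord." with its trailing period is not treated as "fnord".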
You can use .reject to exclude all banned words that are present in the redacted array:
words.reject {|w| redacted.include? w}
If you want to get the list of banned words that are present in the words array, you can use .select:
words.select {|w| redacted.include? w}
This might be a bit more 'elegant'. Whether it's more or less efficient than your solution, I don't know.
puts "Give input text:"
original_text = gets.chomp
puts "Give redacted word:"
redacted = gets.chomp
redacted_words = redacted.split
print(
  redacted_words.inject(original_text) do |text, redacted_word|
    text.gsub(/\b#{redacted_word}\b/, 'REDACTED')
  end
)
So what's going on here?
I'm using String#split without an argument, because ' ' is the default, anyway.
With Array#inject, the following block (starting at do and ending at end) is executed for each element in the array, in this case our list of forbidden words.
In each round, the second argument to the block will be the respective element from the array
The first argument to the block will be the block's return value from the previous round. For the first round, the argument to the inject function (in our case original_text) will be used.
The block's return value from the last round will be used as return value of the inject function.
In the block, I replace all occurrences of the currently handled redacted word in the text.
String#gsub performs a global substitution
As the pattern to be substituted, I use a regexp literal (/.../). Except, it's not really a literal as I'm performing a string substitution (#{...}) on it to get the currently handled redacted word into it.
In the regexp, I'm surrounding the word to be redacted with \b word boundary matchers. They match the boundary between alphanumeric and non-alphanumeric characters (or vice versa), without matching any of the characters themselves. (They match the zero-length 'position' between the characters.) If a string starts or ends with alphanumeric characters, \b will also match the start or end of the string, respectively, so that we can use it to match whole words.
The result of inject (which is the result of the last execution of the block, i.e., the text when all the substitutions have taken place) is passed as an argument to print, which will output the now redacted text.
Note that, unlike your solution, mine will not treat punctuation as part of adjacent words.
Also note that my solution will be vulnerable to regex injection.
Example 1:
Give input text:
A fnord is a fnord.
Give redacted word:
ford fnord foo
My output:
A REDACTED is a REDACTED.
Your output:
A REDACTED is a fnord.
Example 2:
Give input text:
A fnord is a fnord.
Give redacted word:
fnord.
My output:
A REDACTEDis a fnord.
(Note how the . was interpreted to match any character.)
Your output:
A fnord is a REDACTED.
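One way to sidestep the regex injection noted above is to escape the user's words before building the pattern. The sketch below uses Regexp.union, which escapes plain strings, and swaps \b for lookarounds so that a banned word ending in punctuation can still match. That swap is a substitution of mine, not part of the answer, and the method name is illustrative.

```ruby
# Hypothetical helper: redact banned words without letting them act as regex syntax.
def redact_text(text, banned_words)
  # Regexp.union escapes each string, so "." in a banned word is literal.
  banned = Regexp.union(banned_words)
  # Lookarounds stand in for \b so that a banned word ending in punctuation
  # (such as "fnord.") can still match at the end of the text.
  text.gsub(/(?<!\w)#{banned}(?!\w)/, 'REDACTED')
end

puts redact_text("A fnord is a fnord.", %w[ford fnord foo])  # A REDACTED is a REDACTED.
puts redact_text("A fnord is a fnord.", ["fnord."])          # A fnord is a REDACTED
```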