Sorry, I'm new to Ruby (I'm a Java programmer). I have two string arrays:
Array with file paths.
Array with patterns (can be a path or a file)
I need to check each pattern against each file path. I do it this way:
@flag = false
["aa/bb/cc/file1.txt","aa/bb/cc/file2.txt","aa/bb/dd/file3.txt"].each do |source|
  ["bb/cc/","zz/xx/ee"].each do |to_check|
    if source.include?(to_check)
      @flag = true
    end
  end
end
puts @flag
This code is ok, prints "true" because "bb/cc" is in source.
I have seen several posts but can not find a better way. I'm sure there should be functions that allow me to do this in fewer lines.
Is this possible?
As mentioned by @dodecaphonic, use Enumerable#any?. Something like this:
paths.any? { |s| patterns.any? { |p| s[p] } }
where paths and patterns are arrays as defined by the OP.
While that will work, it scales poorly: it has to do N*M tests for a list of N files versus M patterns. You can optimize this a little:
files = ["aa/bb/cc/file1.txt","aa/bb/cc/file2.txt","aa/bb/dd/file3.txt"]
# Create a pattern that matches all desired substrings
pattern = Regexp.union(["bb/cc/","zz/xx/ee"])
# Test until one of them hits, returns true if any matches, false otherwise
files.any? do |file|
file.match(pattern)
end
You can wrap that up in a method if you want. Keep in mind that if the pattern list doesn't change you might want to create that once and keep it around instead of constantly re-generating it.
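As a rough sketch of that idea, the pattern can be built once as a constant and reused by a small helper. The names PATTERN and any_path_matches? are made up for illustration and are not from the question:

```ruby
# Build the combined pattern once; Regexp.union escapes plain strings for us.
# PATTERN and any_path_matches? are hypothetical names used for illustration.
PATTERN = Regexp.union(["bb/cc/", "zz/xx/ee"])

# Returns true as soon as one path matches the pattern, false otherwise.
def any_path_matches?(paths, pattern = PATTERN)
  paths.any? { |path| path.match?(pattern) }
end

files = ["aa/bb/cc/file1.txt", "aa/bb/cc/file2.txt", "aa/bb/dd/file3.txt"]
puts any_path_matches?(files)          # prints "true"
puts any_path_matches?(["aa/qq/file"]) # prints "false"
```

Passing the pattern as an optional argument keeps the helper usable with other pattern lists without rebuilding the default one.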
When I assign to variables such as service_names and name_array they are nil, and nothing goes into the class variable @@product_names.
I used Pry to try the code without storing it into a variable and it works. It has the values I need.
I had this split up in more variables before to make cleaner code, for example:
require 'pry'
require 'rubygems'
require 'open-uri'
require 'nokogiri'
class KefotoScraper::CLI
@@product_names = []
PAGE_URL = "https://kefotos.mx/"
def call
binding.pry
puts "These are the services that Kefoto offers:"
@list_products
puts "which service would you like to select?"
@selection = gets.chomp
view_price_range
puts "Would you like to go back to the service menu? y/n"
answer = gets.chomp
if answer == "y"
call
end
end
private
def home_html
# @home_html ||=
# HTTParty.get root_path
Nokogiri::HTML(open(PAGE_URL))
end
#
# # TODO: read about ruby memoization
# def home_node
#
# @home_node ||=
# Nokogiri::HTML(PAGE_URL)
# end
def service_names
@service_names = home_html.css(".nav-link").map do
|link| link['href'].to_s.gsub(/.php/, "")
end
@service_names.each do |pr|
@@product_names << pr
end
end
def list_products
i = 1
n = 0
while @@product_names.length < n
@@product_names.each do |list_item|
puts "#{i} #{list_item[n]}"
i += 1
n += 1
end
end
end
def view_price_range
price_range = []
@service_links.each do |link|
if @service = link
link.css(".row").map {|price| price["p"].value}
price_range << p
end
price_range
end
def service_links
@service_links ||=
home_html.css(".nav-item").map { |link| link['href'] }
end
end
end
@@product_names should contain the output of
home_html.css(".nav-link").map { |link| link['href'] }.to_s.gsub(/.php/, "")
which later I turn back to an array.
This is what it looks like in Pry:
[9] pry(#<KefotoScraper::CLI>)> home_html.css(".nav-link").map { |link| link['href'] }.to_s.gsub(/.php/, "").split(",")
=> ["[\"foto-enmarcada\"", " \"impresion-fotografica\"", " \"photobooks\"", " \"impresion-directa-canvas\"", " \"impresion-acrilico\"", " \"fotoregalos\"]"]
[10] pry(#<KefotoScraper::CLI>)> home_html.css(".nav-link").map { |link| link['href'] }.to_s.gsub(/.php/, "").split(",")[0]
=> "[\"foto-enmarcada\""
Nokogiri's command-line IRB is your friend. Use nokogiri "https://kefotos.mx/" at the shell to start it up:
irb(main):006:0> @doc.css('.nav-link[href]').map { |l| l['href'].sub(/\.php$/, '') }
=> ["foto-enmarcada", "impresion-fotografica", "photobooks", "impresion-directa-canvas", "impresion-acrilico", "fotoregalos"]
That tells us it's not dynamic HTML and shows how I'd retrieve those values. Since an a tag doesn't have to contain href parameters I guarded against retrieving any such tags by accident.
You've got bugs, potential bugs and bad practices. Here are some untested but likely to work ways to fix them:
Running the code results in:
uninitialized constant KefotoScraper (NameError)
In your code you have @service and @service_links which are never initialized so...?
Don't do this because it's cruel:
def home_html
Nokogiri::HTML(open(PAGE_URL))
end
Every time you call home_html you (re)open and (re)read the page from the remote site, wasting your and their CPU and network time. Instead, cache the parsed document in a variable, kind of like you did in your commented-out line using HTTParty. It's much more friendly to not hit sites repeatedly, and it helps avoid getting banned.
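Here is a minimal sketch of that caching idea using ||= memoization. PageCache and the fetcher block are hypothetical stand-ins so the example runs without a network connection; in the scraper the block would wrap something like Nokogiri::HTML(URI.open(PAGE_URL)).

```ruby
# A minimal memoization sketch. PageCache and the fetcher block are
# hypothetical stand-ins for the real page download and parse step.
class PageCache
  attr_reader :fetch_count

  def initialize(&fetcher)
    @fetcher = fetcher
    @fetch_count = 0
  end

  # Fetches on the first call only; later calls return the cached value.
  def home_html
    @home_html ||= begin
      @fetch_count += 1
      @fetcher.call
    end
  end
end

cache = PageCache.new { "<html>stand-in for the parsed page</html>" }
3.times { cache.home_html }
puts cache.fetch_count # prints 1
```

The counter is only there to show that the expensive step runs once; the real method would not need it.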
Moving on:
def service_names
@service_names = home_html.css(".nav-link").map do
|link| link['href'].to_s.gsub(/.php/, "")
end
@service_names.each do |pr|
@@product_names << pr
end
end
I'd use something like get_product_names and return the array like I did in Nokogiri above:
def get_product_names
get_html.css('.nav-link[href]').map { |l|
l['href'].sub(/\.php$/, '')
}
end
:
:
@@product_names = get_product_names()
Here's why I'd do it another way. You used:
link['href'].to_s.gsub(/.php/, "")
to_s is redundant because link['href'] is already returning a string. Stringizing a string wastes brain cycles when rereading/debugging the code. Be kind to yourself and don't do that.
require 'nokogiri'
html = '<a href="foo">'
doc = Nokogiri::HTML(html)
doc.at('a')['href'] # => "foo"
doc.at('a')['href'].class # => String
gsub: Ew. How many occurrences of the target string do you expect to find and replace? If only one, which is extremely likely in a URL "href", use sub instead because it's more efficient; it only runs once and moves on, whereas gsub looks through the string at least one additional time to see if it needs to run again.
/.php/ doesn't mean what you think it does, and it's a very subtle bug in waiting. /.php/ means "some character followed by 'php'", but you most likely meant "a period followed by 'php'". This was something I used to see all the time because other programmers I worked with didn't bother to figure out what they were doing, and being the senior guy it was my job to pick their code apart and find bugs. Instead you should use /\.php/ which removes the special meaning of ., resulting in your desired pattern which is not going to trigger if it encounters "aphp" or something similar. See "Metacharacters and Escapes" and the following section on that page for more information.
On top of the above, the pattern needs to be anchored to avoid wasting more CPU. /\.php/ will cause the regular expression engine to start at the beginning of the string and walk through it until it reaches the end. As strings get longer that process gets slower, and in production code that is processing GB of data it can slow down a system markedly. Instead, using an anchor like /\.php$/ or /\.php\z/ gives the engine a hint where it should start looking and can result in big speedups. I've got some answers on SO that go into this, and the included benchmarks show how they help. See "Anchors" for more information.
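To make the difference concrete, here is a small sketch comparing the two patterns. The href value is made up purely to trigger the unescaped-dot problem:

```ruby
href = "aphp-gallery.php" # made-up href chosen to trigger the bug

# Unescaped dot: "aphp" matches too, and gsub removes every match.
puts href.gsub(/.php/, "")   # prints "-gallery"

# Escaped dot, anchored at the end of the string, replaced once.
puts href.sub(/\.php\z/, "") # prints "aphp-gallery"
```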
That should help you but I didn't try modifying your code to see if it did. When asking questions about bugs in your code we need the minimum code necessary to reproduce the problem. That lets us help you more quickly and efficiently. Please see "ask" and the linked pages and "mcve".
I have an array made up of several strings that I am searching for in another array, like so:
strings_array = ["string1", "string2", "string3"]
main_array = [ ## this is populated with string values outside of my script ## ]
main_array.each { |item|
if strings_array.any? { |x| main_array.include?(x) }
main_array.delete(item)
end
}
This is a simplified version of what my script is actually doing, but that's the gist. It works as is, but I'm wondering how I can make it so that the strings_array can include strings made out of regex. So let's say I have a string in the main_array called "string-4385", and I want to delete any string that is composed of string- + a series of integers (without manually adding in the numerical suffix). I tried this:
strings_array = ["string1", "string2", "string3", /string-\d+/.to_s]
This doesn't work, but that's the logic I'm aiming for. Basically, is there a way to include a string with regex like this within an array? Or is there perhaps a better way than this .any? and include? combo that does the job (without needing to type out the complete string value)?
Thank you for any help!
You can use methods like keep_if and delete_if, so if you want to delete strings that match a regex you could do something like this:
array = ['string-123', 'test']
array.delete_if{|n| n[/string-\d+/] }
That will delete the strings in the array that match your regex. keep_if does the opposite: it keeps only the elements for which the block returns a truthy value.
Hope it helps!
A good way to do this is with Regexp.union, which combines multiple regular expressions into a single regex handy for matching.
patterns = [/pattern1/, /pattern2/, /string-\d+/]
regex = Regexp.union(patterns)
main_array.delete_if{|string| string.match(regex)}
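Regexp.union also accepts a mix of plain strings and regular expressions, which is close to what the question was reaching for. A small sketch, with a made-up main_array since the real one is populated elsewhere:

```ruby
# Plain strings are escaped and matched literally; regexes are used as-is.
strings_array = ["string1", "string2", "string3", /string-\d+/]
regex = Regexp.union(strings_array)

# Sample data invented for illustration.
main_array = ["string1", "keep-me", "string-4385", "another"]

main_array.delete_if { |item| item.match?(regex) }
p main_array # prints ["keep-me", "another"]
```

Note that this matches substrings, so an entry such as "string10" would also be removed; anchor the patterns if you need whole-string matches.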
So I am trying to read a rather large XML file into a String. Currently joining a list of .readLines() like this:
def is = zipFile.getInputStream(entry)
def content = is.getText('UTF-8')
def xmlBodyList = content.readLines()
return xmlBodyList[1..xmlBodyList.size].join("")
However I am getting this output in console:
java.lang.IndexOutOfBoundsException: toIndex = 21859
I don't need any explanation on IndexOutOfBoundsExceptions, but I am having a hard time figuring out how to program around this issue.
How can I implement this differently, so it allows for a large enough file size?
About a good way to avoid java.lang.IndexOutOfBoundsException
The error is here:
return xmlBodyList[1..xmlBodyList.size].join("")
Groovy ranges are inclusive, so the upper bound xmlBodyList.size is one past the last valid index (size - 1). Check the list before accessing it, and use a relative range:
assert xmlBodyList.size() > 1 // check that there is more than one line
return xmlBodyList[1..-1].join("") // relative index: -1 is the last element
About processing large files
If you need to iterate through all the lines and execute some operation here is an example:
def stream = zipFile.getInputStream(entry)
stream.eachLine("UTF-8"){line, index->
if(index>1){ //skip first line
//do something here with each line from file
println "$line $index"
}
}
There are many additional Groovy methods on java.io.InputStream that can help you process a large file without loading it all into memory:
http://docs.groovy-lang.org/latest/html/groovy-jdk/java/io/InputStream.html
My app passes to different methods a json_element for which the keys are different, and sometimes empty.
To handle it, I have been hard-coding the extraction with the following sample code:
def act_on_ruby_tag(json_element)
begin
# logger.progname = __method__
logger.debug json_element
code = json_element['CODE']['$'] unless json_element['CODE'].nil?
predicate = json_element['PREDICATE']['$'] unless json_element['PREDICATE'].nil?
replace = json_element['REPLACE-KEY']['$'] unless json_element['REPLACE-KEY'].nil?
hash = json_element['HASH']['$'] unless json_element['HASH'].nil?
I would like to eliminate hardcoding the values, and not quite sure how.
I started to think through it as follows:
keys = json_element.keys
keys.each do |k|
set_key = k.downcase
instance_variable_set("@" + set_key, json_element[k]['$']) unless json_element[k].nil?
end
And then use @code, for example, in the rest of the method.
I was going to try to turn into a method and then replace all this hardcoded code.
But I wasn't entirely sure if this is a good path.
It's almost always better to return a hash structure from a method where you have things like { code: ... } rather than setting arbitrary instance variables. If you return them in a consistent container, it's easier for callers to deal with delivering that to the right location, storing it for later, or picking out what they want and discarding the rest.
It's also a good idea to try and break up one big, clunky step with a series of smaller, lighter operations. This makes the code a lot easier to follow:
def extract(json)
json.reject do |k, v|
v.nil?
end.map do |k, v|
[ k.downcase, v['$'] ]
end.to_h
end
Then you get this:
extract(
'TEST' => { '$' => 'value' },
'CODE' => { '$' => 'code' },
'NULL' => nil
)
# => {"test"=>"value", "code"=>"code"}
If you want to persist this whole thing as an instance variable, that's a fairly typical pattern, but it will have a predictable name that's not at the mercy of whatever arbitrary JSON document you're consuming.
An alternative is to hard-code the keys in a constant like:
KEYS = %w[ CODE PREDICATE ... ]
Then use that instead, or one step further, define that in a YAML or JSON file you can read-in for configuration purposes. It really depends on how often these will change, and what sort of expectations you have about the irregularity of the input.
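A minimal sketch of the constant-based variant, assuming the four keys from the question are the full list. extract_known is a hypothetical name, and the sample input is invented:

```ruby
# Keys taken from the question; assumed here to be the complete list.
KEYS = %w[CODE PREDICATE REPLACE-KEY HASH].freeze

# Hypothetical helper: picks only the known keys, skips missing or nil ones,
# and returns a hash keyed by the lowercased key name.
def extract_known(json)
  KEYS.each_with_object({}) do |key, result|
    value = json[key]
    result[key.downcase] = value['$'] unless value.nil?
  end
end

result = extract_known(
  'CODE' => { '$' => 'code' },
  'HASH' => nil,
  'EXTRA' => { '$' => 'ignored' }
)
p result # only the "code" entry survives
```

Unlike the reject/map version, unknown keys in the input are ignored, so the shape of the result does not depend on whatever the JSON document happens to contain.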
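This is a slightly more terse way to do what your original code does. Note the parentheses: map must be called on the result of values_at, not on the array of key names.
code, predicate, replace, hash = json_element.values_at(*%w[
  CODE PREDICATE REPLACE-KEY HASH
]).map { |x| x.fetch("$", nil) if x }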
I have a lot of data in several hundred .mat-files where I want to extract specific data from. All the names of my .mat-files have specific numbers to identify the content like Number1_Number2_Number3_Number4.mat:
01_33_06_121.mat
01_24_12_124.mat
02_45_15_118.mat
02_33_11_190.mat
01_33_34_142.mat
Now I want to extract for example all the data from files with Number1=01 or Number1=02 and Number2=33.
Before I start to write a program from scratch, I would like to know, if there is a simple way to do this with Matlab. Does anybody know how I can solve this problem in a fast way?
Thanks a lot!
There are multiple ways you can do this; off the top of my head, the following should work:
Obtain all the file names into an array
allFiles = dir( 'folder' );
allNames = { allFiles.name };
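Loop through your file names and compare each one against the condition using a regex
% Example pattern (an assumption): Number1 is 01 or 02, and Number2 is 33
pattern = '^(01|02)_33_\d+_\d+\.mat$';
for i = 1:numel(allNames)
    if ~isempty(regexp(allNames{i}, pattern, 'once'))
        disp(allNames{i})
    end
end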