How to parse data and store it into variables using Nokogiri and Ruby


When I assign variable names such as service_names and name_array they are nil and nothing goes to the class variable @@product_names.
I used Pry to try the code without storing it into a variable and it works. It has the values I need.
I had this split up in more variables before to make cleaner code, for example:
require 'pry'
require 'rubygems'
require 'open-uri'
require 'nokogiri'

class KefotoScraper::CLI
  @@product_names = []
  PAGE_URL = "https://kefotos.mx/"

  def call
    binding.pry
    puts "These are the services that Kefoto offers:"
    #list_products
    puts "which service would you like to select?"
    #selection = gets.chomp
    view_price_range
    puts "Would you like to go back to the service menu? y/n"
    answer = gets.chomp
    if answer == "y"
      call
    end
  end

  private

  def home_html
    # @home_html ||=
    # HTTParty.get root_path
    Nokogiri::HTML(open(PAGE_URL))
  end
  #
  # # TODO: read about ruby memoization
  # def home_node
  #
  # @home_node ||=
  # Nokogiri::HTML(PAGE_URL)
  # end

  def service_names
    @service_names = home_html.css(".nav-link").map do
      |link| link['href'].to_s.gsub(/.php/, "")
    end
    @service_names.each do |pr|
      @@product_names << pr
    end
  end

  def list_products
    i = 1
    n = 0
    while @@product_names.length < n
      @@product_names.each do |list_item|
        puts "#{i} #{list_item[n]}"
        i += 1
        n += 1
      end
    end
  end

  def view_price_range
    price_range = []
    @service_links.each do |link|
      if @service = link
        link.css(".row").map {|price| price["p"].value}
        price_range << p
      end
      price_range
    end
    def service_links
      @service_links ||=
        home_html.css(".nav-item").map { |link| link['href'] }
    end
  end
end
@@product_names should contain the values that come out of
home_html.css(".nav-link").map { |link| link['href'] }.to_s.gsub(/.php/, "")
which later I turn back to an array.
This is what it looks like in Pry:
[9] pry(#<KefotoScraper::CLI>)> home_html.css(".nav-link").map { |link| link['href'] }.to_s.gsub(/.php/, "").split(",")
=> ["[\"foto-enmarcada\"", " \"impresion-fotografica\"", " \"photobooks\"", " \"impresion-directa-canvas\"", " \"impresion-acrilico\"", " \"fotoregalos\"]"]
[10] pry(#<KefotoScraper::CLI>)> home_html.css(".nav-link").map { |link| link['href'] }.to_s.gsub(/.php/, "").split(",")[0]
=> "[\"foto-enmarcada\""

Nokogiri's command-line IRB is your friend. Use nokogiri "https://kefotos.mx/" at the shell to start it up:
irb(main):006:0> @doc.css('.nav-link[href]').map { |l| l['href'].sub(/\.php$/, '') }
=> ["foto-enmarcada", "impresion-fotografica", "photobooks", "impresion-directa-canvas", "impresion-acrilico", "fotoregalos"]
That tells us it's not dynamic HTML and shows how I'd retrieve those values. Since an a tag doesn't have to have an href attribute, I guarded against retrieving any such tags by accident.
You've got bugs, potential bugs and bad practices. Here are some untested but likely to work ways to fix them:
Running the code results in:
uninitialized constant KefotoScraper (NameError)
In your code you have @service and @service_links which are never initialized so...?
Don't do this because it's cruel:
def home_html
  Nokogiri::HTML(open(PAGE_URL))
end
Every time you call home_html you (re)open and (re)read the page from the remote site, wasting your and their CPU and network time. Instead, cache the parsed document in a variable, much as you did in your commented-out line using HTTParty. It's much friendlier not to hit sites repeatedly, and it helps avoid getting banned.
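A minimal sketch of that caching idea, using Ruby's ||= memoization idiom. The PageFetcher class and its injected fetcher block are hypothetical stand-ins so the example runs without a network connection; in the scraper, the block body would be the download-and-parse step.

```ruby
# Hypothetical helper: wraps an expensive fetch and remembers the result.
class PageFetcher
  attr_reader :fetch_count

  def initialize(&fetcher)
    @fetcher = fetcher   # stand-in for the real download-and-parse step
    @fetch_count = 0
  end

  # The first call runs the fetcher; later calls reuse the cached value.
  def home_html
    @home_html ||= begin
      @fetch_count += 1
      @fetcher.call
    end
  end
end

page = PageFetcher.new { "<html><body>pretend this came from the site</body></html>" }
3.times { page.home_html }
puts page.fetch_count  # 1
```

In the scraper itself the same idiom would read something like @home_html ||= Nokogiri::HTML(URI.open(PAGE_URL)). On Ruby 3.0 and later URI.open is needed, because plain open no longer accepts URLs.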
Moving on:
def service_names
  @service_names = home_html.css(".nav-link").map do
    |link| link['href'].to_s.gsub(/.php/, "")
  end
  @service_names.each do |pr|
    @@product_names << pr
  end
end
I'd use something like get_product_names and return the array like I did in Nokogiri above:
def get_product_names
  get_html.css('.nav-link[href]').map { |l|
    l['href'].sub(/\.php$/, '')
  }
end
:
:
@@product_names = get_product_names()
Here's why I'd do it another way. You used:
link['href'].to_s.gsub(/.php/, "")
to_s is redundant because link['href'] is already returning a string. Stringizing a string wastes brain cycles when rereading/debugging the code. Be kind to yourself and don't do that.
require 'nokogiri'
html = '<a href="foo">'
doc = Nokogiri::HTML(html)
doc.at('a')['href'] # => "foo"
doc.at('a')['href'].class # => String
gsub: Ew. How many occurrences of the target string do you expect to find and replace? If only one, which is extremely likely in a URL "href", use sub instead because it's more efficient: it runs once and moves on, whereas gsub looks through the string at least one additional time to see if it needs to run again.
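To make the difference concrete, here is a small sketch. The "a.php/b.php" string is made up purely to show where the two methods diverge; for a normal href with a single extension they give the same result.

```ruby
href = "photobooks.php"

# One occurrence: both give the same answer, sub just stops after the first match.
first_only  = href.sub(".php", "")
all_matches = href.gsub(".php", "")

# Hypothetical string with two occurrences, to show where they differ.
doubled     = "a.php/b.php"
sub_result  = doubled.sub(".php", "")
gsub_result = doubled.gsub(".php", "")

puts first_only   # photobooks
puts all_matches  # photobooks
puts sub_result   # a/b.php
puts gsub_result  # a/b
```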
/.php/ doesn't mean what you think it does, and it's a very subtle bug in waiting. /.php/ means "some character followed by 'php'", but you most likely meant "a period followed by 'php'". This was something I used to see all the time because other programmers I worked with didn't bother to figure out what they were doing, and being the senior guy it was my job to pick their code apart and find bugs. Instead you should use /\.php/ which removes the special meaning of ., resulting in your desired pattern which is not going to trigger if it encounters "aphp" or something similar. See "Metacharacters and Escapes" and the following section on that page for more information.
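A quick sketch of how the unescaped dot can bite. The name "graphpaper.php" is invented for illustration; it happens to contain "aphp", which the unescaped pattern also matches.

```ruby
# Made-up link name that contains "php" preceded by another character.
name = "graphpaper.php"

unescaped = name.gsub(/.php/, "")   # "." matches any character, so "aphp" is removed too
escaped   = name.gsub(/\.php/, "")  # only a literal ".php" is removed

puts unescaped  # graper
puts escaped    # graphpaper
```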
On top of the above, the pattern needs to be anchored to avoid wasting more CPU. /\.php/ will cause the regular expression engine to start at the beginning of the string and walk through it until it reaches the end. As strings get longer that process gets slower, and in production code that is processing GB of data it can slow down a system markedly. Instead, using an anchor like /\.php$/ or /\.php\z/ gives the engine a hint where it should start looking and can result in big speedups. I've got some answers on SO that go into this, and the included benchmarks show how they help. See "Anchors" for more information.
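Apart from any speed difference, anchoring also changes which text gets matched. The sketch below uses invented strings to show that, and to show how $ (end of a line) differs from \z (end of the whole string) in Ruby.

```ruby
# Made-up href where ".php" also appears in the middle of the path.
href = "old.php.backup/photobooks.php"

unanchored = href.sub(/\.php/, "")    # removes the first ".php" it meets
anchored   = href.sub(/\.php\z/, "")  # removes ".php" only at the very end

# "$" matches at the end of any line, "\z" only at the end of the string.
multiline = "photobooks.php\nextra"
dollar    = multiline.sub(/\.php$/, "")
slash_z   = multiline.sub(/\.php\z/, "")

puts unanchored       # old.backup/photobooks.php
puts anchored         # old.php.backup/photobooks
puts dollar.inspect   # "photobooks\nextra"
puts slash_z.inspect  # "photobooks.php\nextra"
```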
That should help you but I didn't try modifying your code to see if it did. When asking questions about bugs in your code we need the minimum code necessary to reproduce the problem. That lets us help you more quickly and efficiently. Please see "ask" and the linked pages and "mcve".

Related

How can I allow my try block to work multiple times?

So I've been trying to do a loop when working with classes, but I find that my error handling only works once. Here's my code:
class team:
    def __init__(self,budget):
        self.budget = budget

def test_func():
    team.budget = input("How much money should your baseball team own? ")
    try:
        team.budget = int(team.budget)
    except ValueError:
        print("Either you added a dollar sign, put in text or tried something weird. Either way, don't do that.")
        team.budget = input("How much money should your baseball team own? ")

test_func()
This try block should block anything that's not an integer, but here's what happens when I cause an error twice:
Is there something you'd recommend to allow the input to happen until the user enters in something acceptable?
Much thanks!

You need to use a loop to repeat the prompt until a valid integer is entered.
def test_func():
    while True:
        team.budget = input("How much money should your baseball team own? ")
        try:
            team.budget = int(team.budget)
            break  # Stop the loop since the input is valid.
        except ValueError:
            print("Either you added a dollar sign, put in text or tried something weird. Either way, don't do that.")

How to create and loop over an ArrayList of strings in Jenkins Groovy Pipeline

As stated in the title, I'm attempting to loop over an ArrayList of strings in a Jenkins Groovy Pipeline script (using scripted Pipeline syntax). Let me lay out the entire "problem" for you.
I start with a string of filesystem locations separated by spaces: "/var/x /var/y /var/z ... " like so. I loop over this string adding each character to a temp string. And then when I reach a space, I add that temp string to the array and restart. Here's some code showing how I do this:
def full_string = "/var/x /var/y /var/z"
def temp = ""
def arr = [] as ArrayList
full_string.each {
    if ( "$it" == " " ) {
        arr.add("$temp")  // have also tried ( arr << "$temp" )
        temp = ""
    } else {
        temp = "$temp" + "$it"
    }
}
// if statement to catch last element
See, the problem with this is that if I later go to loop over the array it decides to loop over every individual char instead of the entire /var/x string like I want it to.
I'm new to Groovy so I've been learning as I build this pipeline. Using Jenkins version 2.190.1 if that helps at all. I've looked around on SO and Groovy docs, as well as the pipeline syntax docs on Jenkins. Can't seem to find what I've been looking for. I'm sure that my solution is not the most elegant or efficient, but I will settle for understanding how it works first before trying to squeeze the most performance out of it.
I found this question but this was similarly unhelpful: Dynamically adding elements to ArrayList in Groovy.
Edit: I'm trying to translate old company c-shell build scripts into Jenkins Pipelines. My initial string is an environment variable available on all our nodes that I also need to have available inside the Pipeline.
TL;DR - I need to be able to create an array from space separated values in a string, and then be able to loop over said array and each "element" be a complete string instead of a single char so that I can run pipeline steps properly.

Try running this in your Jenkins script console (your.jenkins.url.yourcompany.com/script):
def full_string = "/var/x /var/y /var/z"
def arr = full_string.split(" ")
for (i in arr) {
    println "now got ${i}"
}
Result:
now got /var/x
now got /var/y
now got /var/z

using ruby to extract the values in a hash in a DRY way

My app passes to different methods a json_element for which the keys are different, and sometimes empty.
To handle it, I have been hard-coding the extraction with the following sample code:
def act_on_ruby_tag(json_element)
  begin
    # logger.progname = __method__
    logger.debug json_element
    code = json_element['CODE']['$'] unless json_element['CODE'].nil?
    predicate = json_element['PREDICATE']['$'] unless json_element['PREDICATE'].nil?
    replace = json_element['REPLACE-KEY']['$'] unless json_element['REPLACE-KEY'].nil?
    hash = json_element['HASH']['$'] unless json_element['HASH'].nil?
I would like to eliminate hardcoding the values, and not quite sure how.
I started to think through it as follows:
keys = json_element.keys
keys.each do |k|
  set_key = k.downcase
  instance_variable_set("@" + set_key, json_element[k]['$']) unless json_element[k].nil?
end
And then use @code for example in the rest of the method.
I was going to try to turn into a method and then replace all this hardcoded code.
But I wasn't entirely sure if this is a good path.

It's almost always better to return a hash structure from a method where you have things like { code: ... } rather than setting arbitrary instance variables. If you return them in a consistent container, it's easier for callers to deal with delivering that to the right location, storing it for later, or picking out what they want and discarding the rest.
It's also a good idea to try and break up one big, clunky step with a series of smaller, lighter operations. This makes the code a lot easier to follow:
def extract(json)
  json.reject do |k, v|
    v.nil?
  end.map do |k, v|
    [ k.downcase, v['$'] ]
  end.to_h
end
Then you get this:
extract(
  'TEST' => { '$' => 'value' },
  'CODE' => { '$' => 'code' },
  'NULL' => nil
)
# => {"test"=>"value", "code"=>"code"}
If you want to persist this whole thing as an instance variable, that's a fairly typical pattern, but it will have a predictable name that's not at the mercy of whatever arbitrary JSON document you're consuming.
An alternative is to hard-code the keys in a constant like:
KEYS = %w[ CODE PREDICATE ... ]
Then use that instead, or one step further, define that in a YAML or JSON file you can read-in for configuration purposes. It really depends on how often these will change, and what sort of expectations you have about the irregularity of the input.
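A minimal sketch of the constant-driven variant. The key list is filled in from the question's sample code, and the method name extract_known is hypothetical. Hash#dig returns nil when a key is missing or its value is nil, so no explicit nil checks are needed.

```ruby
# Key names taken from the question's sample code; adjust to the real input.
KEYS = %w[CODE PREDICATE REPLACE-KEY HASH].freeze

# Hypothetical helper: returns a hash with one lowercase entry per known key.
def extract_known(json_element)
  KEYS.map { |key| [key.downcase, json_element.dig(key, '$')] }.to_h
end

sample = {
  'CODE'      => { '$' => 'puts 1' },
  'PREDICATE' => nil,
  'HASH'      => { '$' => 'abc' }
}

result = extract_known(sample)
puts result['code']                 # puts 1
puts result['predicate'].inspect    # nil
puts result['replace-key'].inspect  # nil
```

Unlike the reject-based version, this always returns the same set of keys, which can make the calling code simpler at the cost of ignoring any key not listed in KEYS.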

This is a slightly more terse way to do what your original code does.
code, predicate, replace, hash = json_element.values_at(*%w{
  CODE PREDICATE REPLACE-KEY HASH
}).map { |x| x.fetch("$", nil) if x }

Check if each string from array is contained by another string array

Sorry, I'm new to Ruby (just a Java programmer). I have two string arrays:
Array with file paths.
Array with patterns (can be a path or a file)
I need to check each pattern against each file path. I do it this way:
@flag = false
["aa/bb/cc/file1.txt","aa/bb/cc/file2.txt","aa/bb/dd/file3.txt"].each do |source|
  ["bb/cc/","zz/xx/ee"].each do |to_check|
    if source.include?(to_check)
      @flag = true
    end
  end
end
puts @flag
This code is ok, prints "true" because "bb/cc" is in source.
I have seen several posts but can not find a better way. I'm sure there should be functions that allow me to do this in fewer lines.
Is this possible?

As mentioned by @dodecaphonic use Enumerable#any?. Something like this:
paths.any? { |s| patterns.any? { |p| s[p] } }
where paths and patterns are arrays as defined by the OP.

While that will work, that's going to have geometric scaling problems, that is it has to do N*M tests for a list of N files versus M patterns. You can optimize this a little:
files = ["aa/bb/cc/file1.txt","aa/bb/cc/file2.txt","aa/bb/dd/file3.txt"]
# Create a pattern that matches all desired substrings
pattern = Regexp.union(["bb/cc/","zz/xx/ee"])
# Test until one of them hits, returns true if any matches, false otherwise
files.any? do |file|
  file.match(pattern)
end
You can wrap that up in a method if you want. Keep in mind that if the pattern list doesn't change you might want to create that once and keep it around instead of constantly re-generating it.
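One way that wrapping might look, as a sketch: the PathMatcher class and its any_match? method are hypothetical names. The pattern is built once in the constructor and reused on every call.

```ruby
# Hypothetical wrapper: builds the combined pattern once and reuses it.
class PathMatcher
  def initialize(patterns)
    # Regexp.union escapes plain strings, so "/" and "." are matched literally.
    @pattern = Regexp.union(patterns)
  end

  # True as soon as any path contains any of the substrings.
  def any_match?(paths)
    paths.any? { |path| path.match?(@pattern) }
  end
end

matcher = PathMatcher.new(["bb/cc/", "zz/xx/ee"])
files = ["aa/bb/cc/file1.txt", "aa/bb/cc/file2.txt", "aa/bb/dd/file3.txt"]

puts matcher.any_match?(files)                   # true
puts matcher.any_match?(["aa/bb/dd/file3.txt"])  # false
```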

Locating a dynamic string in a text file

Problem:
Hello, I have been struggling recently in my programming endeavours. I have managed to receive the output below from Google Speech to Text, but I cannot figure out how to draw data from this block.
Excerpt 1:
[VoiceMain]: Successfully initialized
{"result":[]}
{"result":[{"alternative":[{"transcript":"hello","confidence":0.46152416},{"transcript":"how low"},{"transcript":"how lo"},{"transcript":"how long"},{"transcript":"Polo"}],"final":true}],"result_index":0}
[VoiceMain]: Successfully initialized
{"result":[]}
{"result":[{"alternative":[{"transcript":"hello"},{"transcript":"how long"},{"transcript":"how low"},{"transcript":"howlong"}],"final":true}],"result_index":0}
Objective:
My goal is to extract the string "hello" (without the quotation marks) from the first transcript of each block and set it equal to a variable. The problem arises when I do not know what the phrase will be. Instead of "hello", the phrase may be a string of any length. Even if it is a different string, I would still like to set it to the same variable to which the phrase "hello" would have been set to.
Furthermore, I would like to extract the number after the word "confidence". In this case, it is 0.46152416. Data type does not matter for the confidence variable. The confidence variable appears to be more difficult to extract from the blocks because it may or may not be present. If it is not present, it must be ignored. If it is present however, it must be detected and stored as a variable.
Also please note that this text block is stored within a file named "CurlOutput.txt".
All help or advice related to solving this problem is greatly appreciated.

You could do this with regex, but then I am assuming you will want to use this as a dict later in your code. So here is a python approach to building this result as a dictionary.
import json

with open('CurlOutput.txt') as f:
    lines = f.read().splitlines()

flag = '{"result":[]} '
for line in lines:  # Loop through each line in file
    if flag in line:  # check if this is a line with data on it
        results = json.loads(line.replace(flag, ''))['result']  # Load data as a dict
        # If you just want to change first index of alternative
        # results[0]['alternative'][0]['transcript'] = 'myNewString'
        # If you want to check all alternative for confidence and transcript
        for result in results[0]['alternative']:  # Loop over each alternative
            transcript = result['transcript']
            confidence = None
            if 'confidence' in result:
                confidence = result['confidence']
            # now do whatever you want with confidence and transcript.
