RSpec: how to set an ENV variable before driver configuration - selenium-webdriver

I am writing UI tests for a new feature. The feature is iframed in, and it appears in all browser configurations except incognito with third-party cookies blocked. I have created a new browser settings option that I can use to run the test suite locally ("BROWSER=iframe_present rspec --tag service:test_tag") or a single test file ("BROWSER=iframe_present rspec spec/service_folder/example_test_spec.rb").
My goal is to set it up so that only this test file, or the service tag ("test_tag"), runs with that specific browser configuration when the suite runs automatically through the Travis configuration (which I can test locally by running "BROWSER=headless rspec spec/service_folder/example_test_spec.rb").
I've tried to call the 'iframe_present' browser configuration from the test file in a few different ways, but each one hits the byebug I have in the final 'else' browser condition. Perhaps I need to use the ci.travis.sh file? It seems to deal with picking the browser config.
Edit: added the spec_helper.rb file below.
example_test_spec.rb
describe "Validating HQ Extensions menu item", type: :feature, service: "sales_channels1" do
context "Squadlocker" do
# different attempts
# let(:browser_config) { SeleniumTest.browser == iframe_present }
# BROWSER=iframe_present
# before(:all) { SeleniumTest.ENV["BROWSER"] == "iframe_present" }
# before { SeleniumTest.stub(ENV["BROWSER"] => "iframe_present") }
# before { SeleniumTest.stub(ENV["BROWSER"] == "iframe_present") }
before do
allow(ENV).to receive(:[])
.with("BROWSER").and_return("iframe_present")
end
let(:hq_home) { HqHomePage.new }
let(:marketplace) { MarketplacePage.new }
it "some condition check" do
# stuff
end
end
end
env.rb
require 'uri'
require 'capybara/poltergeist'
require 'selenium-webdriver'
require 'webdrivers'
require_relative '../../config/api_config_base'

module SeleniumTest
  module_function

  # url stuff, unrelated

  browser = ENV['BROWSER'] ? ENV['BROWSER'].downcase : ''
  puts "browser type: #{browser}"

  if browser == 'firefox'
    # options
    RSpec.configure do |config|
      config.before :each do
        page.driver.browser.manage.window.maximize
      end
    end
  elsif browser == 'headless'
    Capybara.default_driver = :selenium_chrome_headless
    Capybara.register_driver :selenium_chrome_headless do |app|
      options = Selenium::WebDriver::Chrome::Options.new
      options.add_argument('--headless')
      options.add_argument('--no-sandbox')
      options.add_argument('--disable-gpu')
      options.add_argument('--incognito')
      Capybara::Selenium::Driver.new(
        app,
        browser: :chrome,
        capabilities: [options]
      )
    end
  elsif browser == 'iframe_present'
    byebug
    # currently matching chrome settings for testing minus incognito setting, will switch to headless
    Capybara.default_driver = :selenium_chrome
    Capybara.register_driver :selenium_chrome do |app|
      options = Selenium::WebDriver::Chrome::Options.new
      options.add_argument('--no-sandbox')
      options.add_argument('--disable-gpu')
      Capybara::Selenium::Driver.new(
        app,
        browser: :chrome,
        capabilities: [options]
      )
    end
  else
    byebug
    Capybara.default_driver = :selenium_chrome
    Capybara.register_driver :selenium_chrome do |app|
      options = Selenium::WebDriver::Chrome::Options.new
      options.add_argument('--no-sandbox')
      options.add_argument('--disable-gpu')
      Capybara::Selenium::Driver.new(
        app,
        browser: :chrome,
        capabilities: [options]
      )
    end
    RSpec.configure do |config|
      config.before :each do
        page.driver.browser.manage.window.maximize
      end
    end
    Capybara.javascript_driver = :chrome
  end
end
ci.travis.sh
source ./script/env.travis.sh

echo $TEST_TAG
echo $RSPEC_TAG
echo $BROWSER
echo $TRAVIS_BRANCH

if [[ $TRAVIS_BRANCH == "main" ]]; then
  BROWSER=headless ./run_spec_tests.sh "production" "Rspec UI Tests" \
    "$TRAVIS_BUILD_NUMBER" "$TEST_AUTOMATION_NOTIFY_CHANNEL" \
    "$TEST_AUTOMATION_NOTIFY_CHANNEL" "$AUTHOR_NAME" "" "$RSPEC_TAG" "$BROWSER"
elif [[ $TRAVIS_BRANCH != "testdrive" ]]; then
  BROWSER=headless ./run_spec_tests.sh "staging" "Rspec UI Tests" \
    "$TRAVIS_BUILD_NUMBER" "$TEST_AUTOMATION_NOTIFY_CHANNEL" \
    "$TEST_AUTOMATION_NOTIFY_CHANNEL" "$AUTHOR_NAME" "" "$RSPEC_TAG" "$BROWSER"
fi
spec_helper.rb
require "capybara/rspec"
require "etc"

Dir[File.dirname(__FILE__) + "/support/*.rb"].each { |f| require f }
Dir[File.dirname(__FILE__) + "/helpers/*.rb"].each { |f| require f }
Dir[File.dirname(__FILE__) + "/page_models/*.rb"].each { |f| require f }

RSpec.configure do |config|
  config.include FactoryBot::Syntax::Methods
  config.include AffiliateRosteringHelper, type: :feature
  config.include AffiliationsHelper, type: :feature
  config.include AllureAttachmentHelper
  config.include ApiRequestHelper
  config.include CSVHelper
  config.include DateTimeHelper, type: :feature
  config.include UserHelper
  config.include Capybara::DSL

  # Use color in STDOUT
  config.color = true
  # Use color not only in STDOUT but also in pagers and files
  config.tty = true

  config.before(:suite) do
    FactoryBot.find_definitions
  end

  config.formatter = :documentation

  config.expect_with :rspec do |expectations|
    expectations.include_chain_clauses_in_custom_matcher_descriptions = true
  end

  config.mock_with :rspec do |mocks|
    mocks.verify_partial_doubles = true
  end

  config.shared_context_metadata_behavior = :apply_to_host_groups
  config.filter_run_when_matching :focus

  config.before(:each, type: :feature) do
    Capybara.current_session.driver.browser.manage.window.resize_to(1500, 1600)
  end

  if ApiConfigBase.env === 'production'
    Capybara.default_max_wait_time = 30
  else
    Capybara.default_max_wait_time = 45
  end

  Capybara.raise_server_errors = true

  config.verbose_retry = true
  config.display_try_failure_messages = true

  config.around :each do |ex|
    ex.run_with_retry retry: 0 # set back to 3 b4 code review
  end

  config.formatter = AllureRspecFormatter if ENV["TRAVIS"]

  if !ENV["TRAVIS"]
    # Instructions on getting your secrets.json file here:
    # https://sportngin.atlassian.net/wiki/spaces/DEV/pages/2913271865/Credentials+in+Parameter+Store
    JSON.parse(File.read("secrets.json")).each do |key, value|
      ENV[key.upcase] = value
    end
  end

  config.after(:each, type: :feature) do |example|
    # restart browser sessions for every test
    attach_screenshot(example) if ENV["TRAVIS"]
    driver = Capybara.current_session.driver
    if driver.is_a?(Capybara::Selenium::Driver)
      driver.quit
    elsif driver.is_a?(Capybara::Poltergeist::Driver)
      driver.browser.restart
    end
  end
end

If I'm understanding correctly, I believe you need the following in your test:
before do
  allow(ENV).to receive(:[])
    .with("BROWSER").and_return("iframe_present")
end
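If you want that stub to apply to every example in the tagged group instead of repeating the before block per file, one option is a metadata-filtered hook in spec_helper.rb. This is only a sketch: the service: "sales_channels1" tag is taken from your spec, and it assumes the code that reads ENV["BROWSER"] runs after the hook fires (env.rb as posted picks the driver at require time, so a read made at load time would not see the stub).
# Sketch only: stub BROWSER for every example tagged service: "sales_channels1".
# and_call_original keeps every other ENV lookup working normally.
RSpec.configure do |config|
  config.before(:each, service: "sales_channels1") do
    allow(ENV).to receive(:[]).and_call_original
    allow(ENV).to receive(:[]).with("BROWSER").and_return("iframe_present")
  end
end
Without the and_call_original fallback, stubbing ENV.[] makes every other ENV[...] lookup return nil, which tends to break unrelated helpers.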

Related

Using Transaction Execution Approval Language (TEAL), can I make my own coin?

Just like ERC-20 tokens let you make a new currency on the ETH network, is there a way to use TEAL (https://github.com/algorand/go-algorand/blob/master/data/transactions/logic/README.md) and the Algorand network for the same purpose?
I.e., you can do lots of things in a "bytecode based stack language that executes inside Algorand transactions", but one simple use case must be plain coin logic with a fixed total supply, rules for adding new coins, etc. -- i.e., something like this GUI (image not shown).
Found answer at: https://developer.algorand.org/tutorials/create-dogcoin/
via https://github.com/algorand/go-algorand/issues/2142#issuecomment-835243098
I understand if this question needs to be closed.
Yes, you can use TEAL to create a new token on the Algorand network. Tokens are just a form of smart contract, and TEAL allows developers to build smart contracts.
However, Algorand also has Algorand Standard Assets (ASAs), which let you create a token without writing a new smart contract. You can read more about ASAs here.
If you want to do this in TEAL and create a more bespoke token, it is recommended to use PyTEAL. PyTEAL is a Python library for writing TEAL programs, which is much easier than writing TEAL by hand.
This is an example of an asset in PyTEAL:
from pyteal import *


def approval_program():
    on_creation = Seq(
        [
            Assert(Txn.application_args.length() == Int(1)),
            App.globalPut(Bytes("total supply"), Btoi(Txn.application_args[0])),
            App.globalPut(Bytes("reserve"), Btoi(Txn.application_args[0])),
            App.localPut(Int(0), Bytes("admin"), Int(1)),
            App.localPut(Int(0), Bytes("balance"), Int(0)),
            Return(Int(1)),
        ]
    )

    is_admin = App.localGet(Int(0), Bytes("admin"))

    on_closeout = Seq(
        [
            App.globalPut(
                Bytes("reserve"),
                App.globalGet(Bytes("reserve"))
                + App.localGet(Int(0), Bytes("balance")),
            ),
            Return(Int(1)),
        ]
    )

    register = Seq([App.localPut(Int(0), Bytes("balance"), Int(0)), Return(Int(1))])

    # configure the admin status of the account Txn.accounts[1]
    # sender must be admin
    new_admin_status = Btoi(Txn.application_args[1])
    set_admin = Seq(
        [
            Assert(And(is_admin, Txn.application_args.length() == Int(2))),
            App.localPut(Int(1), Bytes("admin"), new_admin_status),
            Return(Int(1)),
        ]
    )
    # NOTE: The above set_admin code is carefully constructed. If instead we used the following code:
    # Seq([
    #     Assert(Txn.application_args.length() == Int(2)),
    #     App.localPut(Int(1), Bytes("admin"), new_admin_status),
    #     Return(is_admin)
    # ])
    # It would be vulnerable to the following attack: a sender passes in their own address as
    # Txn.accounts[1], so then the line App.localPut(Int(1), Bytes("admin"), new_admin_status)
    # changes the sender's admin status, meaning the final Return(is_admin) can return anything the
    # sender wants. This allows anyone to become an admin!

    # move assets from the reserve to Txn.accounts[1]
    # sender must be admin
    mint_amount = Btoi(Txn.application_args[1])
    mint = Seq(
        [
            Assert(Txn.application_args.length() == Int(2)),
            Assert(mint_amount <= App.globalGet(Bytes("reserve"))),
            App.globalPut(
                Bytes("reserve"), App.globalGet(Bytes("reserve")) - mint_amount
            ),
            App.localPut(
                Int(1),
                Bytes("balance"),
                App.localGet(Int(1), Bytes("balance")) + mint_amount,
            ),
            Return(is_admin),
        ]
    )

    # transfer assets from the sender to Txn.accounts[1]
    transfer_amount = Btoi(Txn.application_args[1])
    transfer = Seq(
        [
            Assert(Txn.application_args.length() == Int(2)),
            Assert(transfer_amount <= App.localGet(Int(0), Bytes("balance"))),
            App.localPut(
                Int(0),
                Bytes("balance"),
                App.localGet(Int(0), Bytes("balance")) - transfer_amount,
            ),
            App.localPut(
                Int(1),
                Bytes("balance"),
                App.localGet(Int(1), Bytes("balance")) + transfer_amount,
            ),
            Return(Int(1)),
        ]
    )

    program = Cond(
        [Txn.application_id() == Int(0), on_creation],
        [Txn.on_completion() == OnComplete.DeleteApplication, Return(is_admin)],
        [Txn.on_completion() == OnComplete.UpdateApplication, Return(is_admin)],
        [Txn.on_completion() == OnComplete.CloseOut, on_closeout],
        [Txn.on_completion() == OnComplete.OptIn, register],
        [Txn.application_args[0] == Bytes("set admin"), set_admin],
        [Txn.application_args[0] == Bytes("mint"), mint],
        [Txn.application_args[0] == Bytes("transfer"), transfer],
    )

    return program


def clear_state_program():
    program = Seq(
        [
            App.globalPut(
                Bytes("reserve"),
                App.globalGet(Bytes("reserve"))
                + App.localGet(Int(0), Bytes("balance")),
            ),
            Return(Int(1)),
        ]
    )

    return program


if __name__ == "__main__":
    with open("asset_approval.teal", "w") as f:
        compiled = compileTeal(approval_program(), mode=Mode.Application, version=2)
        f.write(compiled)

    with open("asset_clear_state.teal", "w") as f:
        compiled = compileTeal(clear_state_program(), mode=Mode.Application, version=2)
        f.write(compiled)

How to run a spider from bat file for multiple urls?

I wanted to prepare a multiple.bat file to run several spiders, so I first tried to prepare a multiple.bat file for one spider. I got stuck here, with this error:
G:\myVE\vacancies>multiple.bat

G:\myVE\vacancies>scrapy crawl jobs -o urls.csv -t csv --logfile log.txt G:\myVE\vacancies\vacancies\spiders\job_spider.py
G:\myVE\vacancies\vacancies\spiders\job_spider.py:12: ScrapyDeprecationWarning: `Settings.overrides` attribute is deprecated and won't be supported in Scrapy 0.26, use `Settings.set(name, value, priority='cmdline')` instead
  settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl',}
['http://1nadan.si']
Usage
=====
  scrapy crawl [options] <spider>

crawl: error: running 'scrapy crawl' with more than one spider is no longer supported
From this question,
How to give URL to scrapy for crawling?
it looks like the problem would be that the spider is reading several urls into start_urls, but that is not the case. There is only one url in the spider, and it works normally when started from the command line. Why is this error happening? Maybe because I have several spiders in adjacent directories, but that does not make sense. My final goal is to split a list of 1300 urls into 130 chunks of 10 urls and launch 130 spiders from the multiple.bat file. The aim is to reduce the scraping time so that I can have results in two hours instead of two days, because right now I split the 1300 urls into 13 chunks of 100 urls, launch 13 spiders, and it takes two days to scrape everything.
Here is my multiple.bat code
scrapy crawl jobs -o urls.csv -t csv --logfile log.txt G:\myVE\vacancies\vacancies\spiders\job_spider.py
and here is the code for my spider:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# encoding=UTF-8

import scrapy, urlparse, time, sys

from scrapy.http import Request
from scrapy.utils.response import get_base_url
from urlparse import urlparse, urljoin
from vacancies.items import JobItem

#We need that in order to force Slovenian pages instead of English pages. It happened at "http://www.g-gmi.si/gmiweb/" that only English pages were found and no Slovenian.
from scrapy.conf import settings
settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl',}
#Settings.set(name, value, priority='cmdline')
#settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl','en':q=0.8,}

#start_time = time.time()
# We run the programme in the command line with this command:
# scrapy crawl jobs -o urls.csv -t csv --logfile log.txt
# We get two output files
# 1) urls.csv
# 2) log.txt

# Url whitelist.
with open("Q:/SIIT/JV_Marko_Boro/Detector/kljucne_besede/url_whitelist.txt", "r+") as kw:
    url_whitelist = kw.read().replace('\n', '').split(",")
    url_whitelist = map(str.strip, url_whitelist)

# Tab whitelist.
# We need to replace character the same way as in detector.
with open("Q:/SIIT/JV_Marko_Boro/Detector/kljucne_besede/tab_whitelist.txt", "r+") as kw:
    tab_whitelist = kw.read().decode(sys.stdin.encoding).encode('utf-8')
    tab_whitelist = tab_whitelist.replace('Ŕ', 'č')
    tab_whitelist = tab_whitelist.replace('L', 'č')
    tab_whitelist = tab_whitelist.replace('Ő', 'š')
    tab_whitelist = tab_whitelist.replace('Ü', 'š')
    tab_whitelist = tab_whitelist.replace('Ä', 'ž')
    tab_whitelist = tab_whitelist.replace('×', 'ž')
    tab_whitelist = tab_whitelist.replace('\n', '').split(",")
    tab_whitelist = map(str.strip, tab_whitelist)

#File to write unique links
#unique = open("G:/myVE/vacancies/unique_urls.txt", "wb")
#izloceni = open("G:/myVE/vacancies/izloceni.txt", "wb")


class JobSpider(scrapy.Spider):

    name = "jobs"

    #Test sample of SLO companies
    start_urls = [
        "http://1nadan.si"
    ]
    print start_urls

    #Result of the programme is this list of job vacancies webpages.
    jobs_urls = []

    #I would like to see how many unique links we check on every page.
    #unique_urls = []

    def parse(self, response):

        response.selector.remove_namespaces()

        #Take url of response, because we would like to stay on the same domain.
        net1 = urlparse(response.url).netloc
        #print "Net 1 " + str(net1)

        #Base url.
        base_url = get_base_url(response)
        #print "Base url " + str(base_url)

        #We take all urls, they are marked by "href". These are either webpages on our website either new websites.
        urls = response.xpath('//@href').extract()
        #print urls

        #Loop through all urls on the webpage.
        for url in urls:

            url = url.strip()

            #Counting unique links.
            #if url not in self.unique_urls:
            #    self.unique_urls.append(url)
            #    unique.write(str(url.encode('utf-8')) + "\n")

            #Ignore ftp and sftp.
            if url.startswith("ftp") or url.startswith("sftp"):
                continue

            #If url doesn't start with "http", it is relative url, and we add base url to get absolute url.
            # -- It is true, that we may get some strange urls, but it is fine for now.
            if not (url.startswith("http")):
                url = urljoin(base_url,url)

            #This is very strict condition. If seed website loses or gets www., then it will be ignored, as the condition very strictly checks the link.
            #o = urlparse(url)
            #test = o.scheme + "://" + o.netloc
            #print "Url : " + url
            #print "Test: " + test
            #if test in self.start_urls:
            #    print "Test OK"
            #if test not in self.start_urls:
            #    print "Test NOT OK - continue"
            #    izloceni.write(str(url) + "\n")
            #    continue

            #Compare each url on the webpage with original url, so that spider doesn't wander away on the net.
            net2 = urlparse(url).netloc
            if net2 != net1:
                continue

            #If url includes characters like ?, %, &, # ... it is LIKELY NOT to be the one we are looking for and we ignore it.
            #However in this case we exclude good urls like http://www.mdm.si/company#employment
            if any(x in url for x in ['%', '~',
                    #images
                    '.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico', '.svg', '.tif', '.tiff',
                    '.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO', '.SVG', '.TIF', '.TIFF',
                    #documents
                    '.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf', '.pd',
                    '.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF', '.PD',
                    #music and video
                    '.mp3', '.mp4', '.mpg', '.ai', '.avi', '.swf',
                    '.MP3', '.MP4', '.MPG', '.AI', '.AVI', '.SWF',
                    #compressions and other
                    '.zip', '.rar', '.css', '.flv', '.php',
                    '.ZIP', '.RAR', '.CSS', '.FLV', '.PHP',
                    #Twitter, Facebook
                    '://twitter.com', '://mobile.twitter.com', 'www.facebook.com', 'www.twitter.com'
                    ]):
                continue

            #We need to save original url for xpath, in case we change it later (join it with base_url)
            url_xpath = url

            #We don't want to go to other websites. We want to stay on our website, so we keep only urls with domain (netloc) of the company we are investigating.
            if (urlparse(url).netloc == urlparse(base_url).netloc):

                #The main part. We look for webpages, whose urls include one of the employment words as strings.
                #We will check the tab of the url as well. This is additional filter, suggested by Dan Wu, to improve accuracy.
                tabs = response.xpath('//a[@href="%s"]/text()' % url_xpath).extract()

                # Sometimes tabs can be just empty spaces like '\t' and '\n' so in this case we replace it with [].
                # That was the case when the spider didn't find this employment url: http://www.terme-krka.com/si/sl/o-termah-krka/o-podjetju-in-skupini-krka/zaposlitev/
                tabs = [tab.encode('utf-8') for tab in tabs]
                tabs = [tab.replace('\t', '') for tab in tabs]
                tabs = [tab.replace('\n', '') for tab in tabs]
                tab_empty = True
                for tab in tabs:
                    if tab != '':
                        tab_empty = False
                if tab_empty == True:
                    tabs = []

                # -- Instruction.
                # -- Users in other languages, please insert employment words in your own language, like jobs, vacancies, career, employment ... --
                # Starting keyword_url is zero, then we add keywords as we find them in url. This is for tracking purposes.
                keyword_url = ''
                #if any(x in url for x in keywords):
                for keyword in url_whitelist:
                    if keyword in url:
                        keyword_url = keyword_url + keyword + ' '

                # If we find at least one keyword in url, we continue.
                if keyword_url != '':

                    #1. Tabs are empty.
                    if tabs == []:
                        #print "No text for url: " + str(url)

                        #We found url that includes one of the magic words and also the text includes a magic word.
                        #We check url, if we have found it before. If it is new, we add it to the list "jobs_urls".
                        if url not in self.jobs_urls:
                            self.jobs_urls.append(url)
                            item = JobItem()
                            item["url"] = url
                            #item["keyword_url"] = keyword_url
                            #item["keyword_url_tab"] = ' '
                            #item["keyword_tab"] = ' '
                            print url
                            #We return the item.
                            yield item

                    #2. There are texts, one or more.
                    else:
                        #For the same partial url several texts are possible.
                        for tab in tabs:
                            keyword_url_tab = ''
                            for key in tab_whitelist:
                                if key in tab:
                                    keyword_url_tab = keyword_url_tab + key + ' '
                            if keyword_url_tab != '':
                                # keyword_url_tab starts with keyword_url from before, because we want to remember keywords from both url and tab.
                                keyword_url_tab = 'URL ' + keyword_url + ' TAB ' + keyword_url_tab
                                #if any(x in text for x in keywords):

                                #We found url that includes one of the magic words and also the tab includes a magic word.
                                #We check url, if we have found it before. If it is new, we add it to the list "jobs_urls".
                                if url not in self.jobs_urls:
                                    self.jobs_urls.append(url)
                                    item = JobItem()
                                    item["url"] = url
                                    #item["keyword_url"] = ' '
                                    #item["keyword_url_tab"] = keyword_url_tab
                                    #item["keyword_tab"] = ' '
                                    print url
                                    #We return the item.
                                    yield item

                else:
                    for tab in tabs:
                        #print "TABS " + str(tabs)
                        #print "TAB " + str(type(tab))
                        keyword_tab = ''
                        for key in tab_whitelist:
                            #print "KEY " + str(type(key))
                            if key in tab:
                                keyword_tab = keyword_tab + key + ' '
                        if keyword_tab != '':
                            if url not in self.jobs_urls:
                                self.jobs_urls.append(url)
                                item = JobItem()
                                item["url"] = url
                                #item["keyword_url"] = ' '
                                #item["keyword_url_tab"] = ' '
                                #item["keyword_tab"] = keyword_tab
                                print url
                                #We return the item.
                                yield item

                #We don't put "else" sentence because we want to further explore the employment webpage to find possible new employment webpages.
                #We keep looking for employment webpages, until we reach the DEPTH, that we have set in settings.py.
                yield Request(url, callback = self.parse)

        response.selector.remove_namespaces()

        #We take all urls, they are marked by "href". These are either webpages on our website either new websites.
        urls = response.xpath('//@href').extract()

        #Base url.
        base_url = get_base_url(response)

        #Loop through all urls on the webpage.
        for url in urls:

            url = url.strip()
            url = url.encode('utf-8')

            #Ignore ftp.
            if url.startswith("ftp"):
                continue

            #If url doesn't start with "http", it is relative url, and we add base url to get absolute url.
            # -- It is true, that we may get some strange urls, but it is fine for now.
            if not (url.startswith("http")):
                url = urljoin(base_url,url)

            #If url includes characters like ?, %, &, # ... it is LIKELY NOT to be the one we are looking for and we ignore it.
            #However in this case we exclude good urls like http://www.mdm.si/company#employment
            if any(x in url for x in ['%', '~',
                    #images
                    '.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico', '.svg', '.tif', '.tiff',
                    '.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO', '.SVG', '.TIF', '.TIFF',
                    #documents
                    '.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf', '.pd',
                    '.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF', '.PD',
                    #music and video
                    '.mp3', '.mp4', '.mpg', '.ai', '.avi', '.swf',
                    '.MP3', '.MP4', '.MPG', '.AI', '.AVI', '.SWF',
                    #compressions and other
                    '.zip', '.rar', '.css', '.flv', '.php',
                    '.ZIP', '.RAR', '.CSS', '.FLV', '.PHP',
                    ]):
                continue

            #We need to save original url for xpath, in case we change it later (join it with base_url)
            url_xpath = url

            #If url doesn't start with "http", it is relative url, and we add base url to get absolute url.
            # -- It is true, that we may get some strange urls, but it is fine for now.
            if not (url.startswith("http")):
                url = urljoin(base_url,url)

            #Counting unique links.
            #if url not in self.unique_urls:
            #    self.unique_urls.append(url)
            #    unique.write(str(url) + "\n")

            #We don't want to go to other websites. We want to stay on our website, so we keep only urls with domain (netloc) of the company we are investigating.
            if (urlparse(url).netloc == urlparse(base_url).netloc):

                #The main part. We look for webpages, whose urls include one of the employment words as strings.
                #We will check the tab of the url as well. This is additional filter, suggested by Dan Wu, to improve accuracy.
                tabs = response.xpath('//a[@href="%s"]/text()' % url_xpath).extract()

                # Sometimes tabs can be just empty spaces like '\t' and '\n' so in this case we replace it with [].
                # That was the case when the spider didn't find this employment url: http://www.terme-krka.com/si/sl/o-termah-krka/o-podjetju-in-skupini-krka/zaposlitev/
                tabs = [tab.encode('utf-8') for tab in tabs]
                tabs = [tab.replace('\t', '') for tab in tabs]
                tabs = [tab.replace('\n', '') for tab in tabs]
                tab_empty = True
                for tab in tabs:
                    if tab != '':
                        tab_empty = False
                if tab_empty == True:
                    tabs = []

                # -- Instruction.
                # -- Users in other languages, please insert employment words in your own language, like jobs, vacancies, career, employment ... --
                # Starting keyword_url is zero, then we add keywords as we find them in url. This is for tracking purposes.
                keyword_url = ''
                #if any(x in url for x in keywords):
                for keyword in url_whitelist:
                    if keyword in url:
                        keyword_url = keyword_url + keyword + ' '

                # If we find at least one keyword in url, we continue.
                if keyword_url != '':

                    #1. Tabs are empty.
                    if tabs == []:
                        #print "No text for url: " + str(url)

                        #We found url that includes one of the magic words and also the text includes a magic word.
                        #We check url, if we have found it before. If it is new, we add it to the list "jobs_urls".
                        if url not in self.jobs_urls:
                            self.jobs_urls.append(url)
                            item = JobItem()
                            item["url"] = url
                            #item["keyword_url"] = keyword_url
                            #item["keyword_url_tab"] = ' '
                            #item["keyword_tab"] = ' '
                            print url
                            #We return the item.
                            yield item

                    #2. There are texts, one or more.
                    else:
                        #For the same partial url several texts are possible.
                        for tab in tabs:
                            keyword_url_tab = ''
                            for key in tab_whitelist:
                                if key in tab:
                                    keyword_url_tab = keyword_url_tab + key + ' '
                            if keyword_url_tab != '':
                                # keyword_url_tab starts with keyword_url from before, because we want to remember keywords from both url and tab.
                                keyword_url_tab = 'URL ' + keyword_url + ' TAB ' + keyword_url_tab
                                #if any(x in text for x in keywords):

                                #We found url that includes one of the magic words and also the tab includes a magic word.
                                #We check url, if we have found it before. If it is new, we add it to the list "jobs_urls".
                                if url not in self.jobs_urls:
                                    self.jobs_urls.append(url)
                                    item = JobItem()
                                    item["url"] = url
                                    #item["keyword_url"] = ' '
                                    #item["keyword_url_tab"] = keyword_url_tab
                                    #item["keyword_tab"] = ' '
                                    print url
                                    #We return the item.
                                    yield item

                else:
                    for tab in tabs:
                        #print "TABS " + str(tabs)
                        #print "TAB " + str(type(tab))
                        keyword_tab = ''
                        for key in tab_whitelist:
                            #print "KEY " + str(type(key))
                            if key in tab:
                                keyword_tab = keyword_tab + key + ' '
                        if keyword_tab != '':
                            if url not in self.jobs_urls:
                                self.jobs_urls.append(url)
                                item = JobItem()
                                item["url"] = url
                                #item["keyword_url"] = ' '
                                #item["keyword_url_tab"] = ' '
                                #item["keyword_tab"] = keyword_tab
                                print url
                                #We return the item.
                                yield item

                #We don't put "else" sentence because we want to further explore the employment webpage to find possible new employment webpages.
                #We keep looking for employment webpages, until we reach the DEPTH, that we have set in settings.py.
                yield Request(url, callback = self.parse)
Your help is greatly appreciated!
DONE
I found the solution by writing a programme that creates lots of spiders, 122 in my case, by copying and modifying the initial spider. Modification means that each spider reads the next ten urls from the list, so that together the spiders cover the whole list, 10 urls each, and work in parallel. This way 123 spiders are released to the network at the same time.
At the same time the programme creates a .bat file with 123 commands that launch the spiders, so that I don't have to open 123 command lines.
#Programme that generates spiders

#Initial parameter to determine number of spiders. There are 1226 urls, so we set it to 122 spiders, so that the last piece will be 1220 to 1230. There is also initial spider, that crawls webpages 0 to 10, so there will be 123 spiders.
j = 122

#Prepare bat file with commands, that will throw all spiders at the same time to the network.
bat = open("G:/myVE/vacancies_januar/commands.bat", "w")
bat.write("cd \"G:\\myVE\\vacancies_januar\"\n")
bat.write("start scrapy crawl jobs_0_10 -o podjetja_0_10_url.csv -t csv --logfile podjetja_0_10_log.txt\n")

#Loop that grows spiders from initial spider_0_10.
for i in range(0,j):
    with open("G:/myVE/vacancies_januar/vacancies/spiders/job_spider_0_10.py", "r+") as prgm:
        program = prgm.read()

    #Just replace 0_10 with 10_20 and so on.
    program = program.replace("0_10", str((i+1)*10)+"_"+str((i+1)*10+10))
    program = program.replace("0:10", str((i+1)*10)+":"+str((i+1)*10+10))

    #Generate new spider.
    dest = open("G:/myVE/vacancies_januar/vacancies/spiders/job_spider_"+str((i+1)*10)+"_"+str((i+1)*10+10)+".py", "w")
    dest.write(program)

    #At the same time write the command into bat file.
    bat.write("start scrapy crawl jobs_"+str((i+1)*10)+"_"+str((i+1)*10+10)+" -o podjetja_"+str((i+1)*10)+"_"+str((i+1)*10+10)+"_url.csv -t csv --logfile podjetja_"+str((i+1)*10)+"_"+str((i+1)*10+10)+"_log.txt\n")
Why are you specifying the path to the Python spider? Isn't specifying the spider name (jobs) enough?
I would expect this would work just as well:
scrapy crawl jobs -o urls.csv -t csv --logfile log.txt
As for splitting the job, why not write a Python wrapper that takes the number of concurrent spiders, divides the URL list into that many pieces, and launches the spider(s)?
EDIT
Caveat: I am not well versed in scrapy use.
Here's a sample program that, given a (big?) file of urls, splits them into smaller chunks and creates a new process for each one, attempting to invoke scrapy in each child process. This would effectively run <n> scrapy processes at once on different sets of URLs.
#!python2
import multiprocessing,subprocess,sys,tempfile,math,os

def run_chunk( spider, proj_dir, urllist ):
    os.chdir(proj_dir)
    with tempfile.NamedTemporaryFile(mode='wt',prefix='urllist') as urlfile:
        urlfile.write("\n".join(urllist))
        urlfile.flush()
        command = [
            'scrapy',
            'crawl',
            '-a','urls='+urlfile.name,
            spider,
        ]
        subprocess.check_call(command)
    print("Child Finished!")

# http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python/312464#312464
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

if __name__ == '__main__':
    # ... or use argparse or some other run-time configuration tool
    spider = 'jobs'
    input_urls = 'biglist.urls.txt'   # one URL per line
    project_dir = 'jobs_scrapy_dir'
    num_children = 10

    # Split the URLs into chunks; assign chunks to workers.
    urls = open(input_urls,'rt').readlines()
    per_chunk = int(math.ceil(len(urls)//num_children))
    workers = [ multiprocessing.Process( target=run_chunk,
                                         args=(spider,project_dir,chunk) )
                for chunk in chunks(urls,per_chunk) ]

    # Start all the workers.
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    print("Finished!")
Based on a very cursory reading, scrapy has some concept of parallelization, so I cannot say that this is the best way to make use of scrapy. The code sample here works insofar as it splits the file into pieces and launches child processes. The command given to the subprocess.check_call() invocation will probably need to be tweaked in order to pass a file full of urls to a spider instance.
Limitations
The entire file of URLs is read into memory at once, then split into pieces. This means that 2x the space of the URL file is used. There are smarter ways of doing this job. My implementation is just a quick demo of one possibility.
The last chunk may be significantly smaller than all the others. The process will likely take as long as a full chunk, so this probably doesn't matter much, but balancing the load more evenly may be advantageous.
The scrapy syntax may not be correct, and the spider may have to be updated to accept a file parameter.
I did not test the scrapy invocation. The OP didn't post any project details, and the script itself didn't work out of the box, so I had no way to really test that part. Fixing the project/invocation is left as an exercise for the reader.

Ruby, Tk lib. Output from getOpenFile

I am trying to load multiple files via the ruby/tk lib and put them into an array:
def openFiles
  return Tk.getOpenFile( 'title' => 'Select Files',
                         'multiple' => true,
                         'defaultextension' => 'csv',
                         'filetypes' => "{{Comma Seperated Values} {.csv}} {TXT {.txt}} {All files {.*}}")
end
and then in code
filess = TkVariable.new()
button1 = TkButton.new(root){
  text 'Open Files'
  command (proc {filess.value = openFiles; puts filess; puts filess.class; puts filess.inspect})
}.grid(:column => 1, :row => 1, :sticky => 'we')
The problem is that I cannot manage to get the output as an array, and I do not know whether that is possible or whether I will have to parse the output somehow. Please help. Thank you.
This is the output when I click the button:
C:\file1
C:\file2
TkVariable
#<TkVariable: v00000>
I think it should be: (for the array part)
['C:\file1','C:\file2']
TkVariable implements #to_a, which you can use to convert its value into the Array you want.
button1 = TkButton.new(root) {
  text 'Open Files'
  command (proc do
    filess.value = openFiles
    puts filess.to_a.class
    puts filess.to_a.inspect
  end)
}.grid(:column => 1, :row => 1, :sticky => 'we')
Array
["C:\file1", "C:\file2"]
This worked for me using Ruby 2.2.5 (with Tk 8.5.12) on Windows 7:
require 'tk'

def extract_filenames_as_ruby_array(file_list_string)
  ::TkVariable.new(file_list_string).list
end

def files_open
  descriptions = %w[
    Comma\ Separated\ Values
    Text\ Files
    All\ Files
  ]
  extensions = %w[ {.csv} {.txt} * ]
  types = descriptions.zip(extensions).map {|d,e| "{#{d}} #{e}" }
  file_list_string = ::Tk.getOpenFile \
    filetypes: types,
    multiple: true,
    title: 'Select Files'
  extract_filenames_as_ruby_array file_list_string
end

def lambda_files_open
  @lambda_files_open ||= ::Kernel.lambda do
    files = files_open
    puts files
  end
end

def main
  b_button_1
  ::Tk.mainloop
end

# Tk objects:

def b_button_1
  @b_button_1 ||= begin
    b = ::Tk::Tile::Button.new root
    b.command lambda_files_open
    b.text 'Open Files'
    b.grid column: 1, row: 1, sticky: :we
  end
end

def root
  @root ||= ::TkRoot.new
end

main
For reference, Tk.getOpenFile is explained in the Tk Commands and Ruby documentation.

kernel_require.rb:54:in 'require': Cannot load such file

When I execute the following, I am getting an error: 64x/lib/ruby/site_ruby/2.0.0/rubygems/core_ext/kernel_require.rb:54:in `require': cannot load such file
require "net/http"
require "uri"
require "nokogiri"
uri = URI.parse("http://www.google.com")
response = Net::HTTP.get_response(uri)
puts parse_body(response.body)
def parse_body(response)
begin
return Nokogiri::XML(response) { |config| config.strict }
rescue Nokogiri::XML::SyntaxError => e
return "caught exception: #{e}"
end
end
Try using require_relative whenever you get this issue (not the best way, though!).
Try this
$:.unshift File.join(File.dirname(__FILE__), ".")
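For illustration, here is how the two suggestions look side by side, assuming the file that fails to load is a local one; my_parser.rb is a hypothetical file sitting next to the script (neither approach helps if the missing file is an uninstalled gem such as nokogiri).
# Hypothetical layout: this script and my_parser.rb live in the same directory.

# Option 1: add the script's directory to the load path, then require by name.
$:.unshift File.join(File.dirname(__FILE__), ".")
require "my_parser"

# Option 2: skip the load-path change and require relative to this file.
require_relative "my_parser"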

waf - build works, custom build targets fail

The waf command waf build shows compiler errors (if there are any), while waf debug or waf release does not and always fails, using the following wscript file (or maybe the wscript file has some other shortcomings I am currently not aware of):
APPNAME = 'waftest'
VERSION = '0.0.1'

def configure(ctx):
    ctx.load('compiler_c')
    ctx.define('VERSION', VERSION)
    ctx.define('GETTEXT_PACKAGE', APPNAME)
    ctx.check_cfg(atleast_pkgconfig_version='0.1.1')
    ctx.check_cfg(package='glib-2.0', uselib_store='GLIB', args=['--cflags', '--libs'], mandatory=True)
    ctx.check_cfg(package='gobject-2.0', uselib_store='GOBJECT', args=['--cflags', '--libs'], mandatory=True)
    ctx.check_cfg(package='gtk+-3.0', uselib_store='GTK3', args=['--cflags', '--libs'], mandatory=True)
    ctx.check_cfg(package='libxml-2.0', uselib_store='XML', args=['--cflags', '--libs'], mandatory=True)
    ctx.check_large_file(mandatory=False)
    ctx.check_endianness(mandatory=False)
    ctx.check_inline(mandatory=False)

    ctx.setenv('debug')
    ctx.env.CFLAGS = ['-g', '-Wall']
    ctx.define('DEBUG', 1)

    ctx.setenv('release')
    ctx.env.CFLAGS = ['-O2', '-Wall']
    ctx.define('RELEASE', 1)

def pre(ctx):
    print ('Building [[[' + ctx.variant + ']]] ...')

def post(ctx):
    print ('Building is complete.')

def build(ctx):
    ctx.add_pre_fun(pre)
    ctx.add_post_fun(post)
    # if not ctx.variant:
    #     ctx.fatal('Do "waf debug" or "waf release"')
    exe = ctx.program(
        features = ['c', 'cprogram'],
        target = APPNAME + '.bin',
        source = ctx.path.ant_glob(['src/*.c']),
        includes = ['src/'],
        export_includes = ['src/'],
        uselib = 'GOBJECT GLIB GTK3 XML'
    )
    # for item in exe.includes:
    #     print(item)

from waflib.Build import BuildContext

class release(BuildContext):
    cmd = 'release'
    variant = 'release'

class debug(BuildContext):
    cmd = 'debug'
    variant = 'debug'
Error resulting from waf debug:
Build failed
-> task in 'waftest.bin' failed (exit status -1):
{task 46697488: c qqq.c -> qqq.c.1.o}
[useless filepaths]
I had a look at the waf demos and read section 6.2.2 of the waf book, but those did not supply me with the information needed to fix this issue.
What's wrong, and how do I fix it?
You need to do at least the following:
def configure(ctx):
    ...
    ctx.setenv('debug')
    ctx.load('compiler_c')
    ...
This is because cfg.setenv resets the whole previous environment. If you want to keep the previous environment, you can do cfg.setenv('debug', env=cfg.env.derive()).
Also, you don't need to explicitly specify features = ['c', 'cprogram'], since it's redundant when you call bld.program(...).
P.S. Don't forget to reconfigure after modifying the wscript file.
