Just like ERC-20 tokens let you create a new currency on the Ethereum network, is there a way to use TEAL https://github.com/algorand/go-algorand/blob/master/data/transactions/logic/README.md and the Algorand network for the same purpose?
That is, you can do lots of things in a "bytecode based stack language that executes inside Algorand transactions", but one simple application must surely be plain coin logic: a fixed total supply, rules for minting new coins, and so on.
Found answer at: https://developer.algorand.org/tutorials/create-dogcoin/
via https://github.com/algorand/go-algorand/issues/2142#issuecomment-835243098
I understand if this question needs to be closed.
Yes, you can use TEAL to create a new token on the Algorand network. Tokens are just a form of smart contract, and TEAL lets developers build smart contracts.
However, Algorand also offers Algorand Standard Assets (ASAs), which let you create a token without writing a new smart contract at all. You can read more about ASAs here.
If you want to do this in TEAL and create a more bespoke token, it is recommended to use PyTEAL, a Python library that generates TEAL programs and is much easier to work with.
This is an example of an asset in PyTEAL:
from pyteal import *


def approval_program():
    on_creation = Seq(
        [
            Assert(Txn.application_args.length() == Int(1)),
            App.globalPut(Bytes("total supply"), Btoi(Txn.application_args[0])),
            App.globalPut(Bytes("reserve"), Btoi(Txn.application_args[0])),
            App.localPut(Int(0), Bytes("admin"), Int(1)),
            App.localPut(Int(0), Bytes("balance"), Int(0)),
            Return(Int(1)),
        ]
    )

    is_admin = App.localGet(Int(0), Bytes("admin"))

    on_closeout = Seq(
        [
            App.globalPut(
                Bytes("reserve"),
                App.globalGet(Bytes("reserve"))
                + App.localGet(Int(0), Bytes("balance")),
            ),
            Return(Int(1)),
        ]
    )

    register = Seq([App.localPut(Int(0), Bytes("balance"), Int(0)), Return(Int(1))])

    # configure the admin status of the account Txn.accounts[1]
    # sender must be admin
    new_admin_status = Btoi(Txn.application_args[1])
    set_admin = Seq(
        [
            Assert(And(is_admin, Txn.application_args.length() == Int(2))),
            App.localPut(Int(1), Bytes("admin"), new_admin_status),
            Return(Int(1)),
        ]
    )
    # NOTE: The above set_admin code is carefully constructed. If instead we used the following code:
    # Seq([
    #     Assert(Txn.application_args.length() == Int(2)),
    #     App.localPut(Int(1), Bytes("admin"), new_admin_status),
    #     Return(is_admin)
    # ])
    # It would be vulnerable to the following attack: a sender passes in their own address as
    # Txn.accounts[1], so then the line App.localPut(Int(1), Bytes("admin"), new_admin_status)
    # changes the sender's admin status, meaning the final Return(is_admin) can return anything the
    # sender wants. This allows anyone to become an admin!

    # move assets from the reserve to Txn.accounts[1]
    # sender must be admin
    mint_amount = Btoi(Txn.application_args[1])
    mint = Seq(
        [
            Assert(Txn.application_args.length() == Int(2)),
            Assert(mint_amount <= App.globalGet(Bytes("reserve"))),
            App.globalPut(
                Bytes("reserve"), App.globalGet(Bytes("reserve")) - mint_amount
            ),
            App.localPut(
                Int(1),
                Bytes("balance"),
                App.localGet(Int(1), Bytes("balance")) + mint_amount,
            ),
            Return(is_admin),
        ]
    )

    # transfer assets from the sender to Txn.accounts[1]
    transfer_amount = Btoi(Txn.application_args[1])
    transfer = Seq(
        [
            Assert(Txn.application_args.length() == Int(2)),
            Assert(transfer_amount <= App.localGet(Int(0), Bytes("balance"))),
            App.localPut(
                Int(0),
                Bytes("balance"),
                App.localGet(Int(0), Bytes("balance")) - transfer_amount,
            ),
            App.localPut(
                Int(1),
                Bytes("balance"),
                App.localGet(Int(1), Bytes("balance")) + transfer_amount,
            ),
            Return(Int(1)),
        ]
    )

    program = Cond(
        [Txn.application_id() == Int(0), on_creation],
        [Txn.on_completion() == OnComplete.DeleteApplication, Return(is_admin)],
        [Txn.on_completion() == OnComplete.UpdateApplication, Return(is_admin)],
        [Txn.on_completion() == OnComplete.CloseOut, on_closeout],
        [Txn.on_completion() == OnComplete.OptIn, register],
        [Txn.application_args[0] == Bytes("set admin"), set_admin],
        [Txn.application_args[0] == Bytes("mint"), mint],
        [Txn.application_args[0] == Bytes("transfer"), transfer],
    )

    return program


def clear_state_program():
    program = Seq(
        [
            App.globalPut(
                Bytes("reserve"),
                App.globalGet(Bytes("reserve"))
                + App.localGet(Int(0), Bytes("balance")),
            ),
            Return(Int(1)),
        ]
    )

    return program


if __name__ == "__main__":
    with open("asset_approval.teal", "w") as f:
        compiled = compileTeal(approval_program(), mode=Mode.Application, version=2)
        f.write(compiled)

    with open("asset_clear_state.teal", "w") as f:
        compiled = compileTeal(clear_state_program(), mode=Mode.Application, version=2)
        f.write(compiled)
I wanted to prepare a multiple.bat file to run several spiders, so I first tried to prepare a multiple.bat file for one spider. I got stuck here, with this error:
G:\myVE\vacancies>multiple.bat

G:\myVE\vacancies>scrapy crawl jobs -o urls.csv -t csv --logfile log.txt G:\myVE\vacancies\vacancies\spiders\job_spider.py
G:\myVE\vacancies\vacancies\spiders\job_spider.py:12: ScrapyDeprecationWarning: `Settings.overrides` attribute is deprecated and won't be supported in Scrapy 0.26, use `Settings.set(name, value, priority='cmdline')` instead
  settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl',}
['http://1nadan.si']

Usage
=====
  scrapy crawl [options] <spider>

crawl: error: running 'scrapy crawl' with more than one spider is no longer supported
From this question,
How to give URL to scrapy for crawling?
it looks like the problem would be that the spider is reading several urls into start_urls, but that is not the case: there is only one url in the spider, and it works normally when started from the command line. Why is this error happening? Maybe because I have several spiders in adjacent directories, but that does not make sense.
My final goal is to split a list of 1300 urls into 130 chunks of 10 urls each and launch 130 spiders from the multiple.bat file. The aim is to reduce the scraping time so that I can have results in two hours instead of two days. Currently I split the 1300 urls into 13 chunks of 100 urls, launch 13 spiders, and it takes me two days to scrape everything.
Here is my multiple.bat code
scrapy crawl jobs -o urls.csv -t csv --logfile log.txt G:\myVE\vacancies\vacancies\spiders\job_spider.py
and here is the code for my spider:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# encoding=UTF-8

import scrapy, urlparse, time, sys
from scrapy.http import Request
from scrapy.utils.response import get_base_url
from urlparse import urlparse, urljoin
from vacancies.items import JobItem

#We need this in order to force Slovenian pages instead of English pages. It happened at "http://www.g-gmi.si/gmiweb/" that only English pages were found and no Slovenian ones.
from scrapy.conf import settings
settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl',}
#Settings.set(name, value, priority='cmdline')
#settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl','en':q=0.8,}

#start_time = time.time()

# We run the programme in the command line with this command:
#     scrapy crawl jobs -o urls.csv -t csv --logfile log.txt
# We get two output files:
#     1) urls.csv
#     2) log.txt

# Url whitelist.
with open("Q:/SIIT/JV_Marko_Boro/Detector/kljucne_besede/url_whitelist.txt", "r+") as kw:
    url_whitelist = kw.read().replace('\n', '').split(",")
    url_whitelist = map(str.strip, url_whitelist)

# Tab whitelist.
# We need to replace characters the same way as in the detector.
with open("Q:/SIIT/JV_Marko_Boro/Detector/kljucne_besede/tab_whitelist.txt", "r+") as kw:
    tab_whitelist = kw.read().decode(sys.stdin.encoding).encode('utf-8')
    tab_whitelist = tab_whitelist.replace('Ŕ', 'č')
    tab_whitelist = tab_whitelist.replace('L', 'č')
    tab_whitelist = tab_whitelist.replace('Ő', 'š')
    tab_whitelist = tab_whitelist.replace('Ü', 'š')
    tab_whitelist = tab_whitelist.replace('Ä', 'ž')
    tab_whitelist = tab_whitelist.replace('×', 'ž')
    tab_whitelist = tab_whitelist.replace('\n', '').split(",")
    tab_whitelist = map(str.strip, tab_whitelist)

#File to write unique links
#unique = open("G:/myVE/vacancies/unique_urls.txt", "wb")
#izloceni = open("G:/myVE/vacancies/izloceni.txt", "wb")
class JobSpider(scrapy.Spider):
    name = "jobs"

    #Test sample of SLO companies
    start_urls = [
        "http://1nadan.si"
    ]
    print start_urls

    #Result of the programme is this list of job vacancy webpages.
    jobs_urls = []

    #I would like to see how many unique links we check on every page.
    #unique_urls = []

    def parse(self, response):
        response.selector.remove_namespaces()

        #Take the netloc of the response url, because we would like to stay on the same domain.
        net1 = urlparse(response.url).netloc
        #print "Net 1 " + str(net1)

        #Base url.
        base_url = get_base_url(response)
        #print "Base url " + str(base_url)

        #We take all urls, marked by "href". These are either webpages on our website or new websites.
        urls = response.xpath('//@href').extract()
        #print urls

        #Loop through all urls on the webpage.
        for url in urls:
            url = url.strip()

            #Counting unique links.
            #if url not in self.unique_urls:
            #    self.unique_urls.append(url)
            #    unique.write(str(url.encode('utf-8')) + "\n")

            #Ignore ftp and sftp.
            if url.startswith("ftp") or url.startswith("sftp"):
                continue

            #If url doesn't start with "http", it is a relative url, and we add the base url to get an absolute url.
            # -- It is true that we may get some strange urls, but it is fine for now.
            if not url.startswith("http"):
                url = urljoin(base_url, url)

            #This is a very strict condition. If the seed website loses or gains www., it will be ignored, as the condition checks the link very strictly.
            #o = urlparse(url)
            #test = o.scheme + "://" + o.netloc
            #print "Url : " + url
            #print "Test: " + test
            #if test in self.start_urls:
            #    print "Test OK"
            #if test not in self.start_urls:
            #    print "Test NOT OK - continue"
            #    izloceni.write(str(url) + "\n")
            #    continue

            #Compare each url on the webpage with the original url, so that the spider doesn't wander away on the net.
            net2 = urlparse(url).netloc
            if net2 != net1:
                continue

            #If url includes characters like ?, %, &, # ... it is LIKELY NOT the one we are looking for and we ignore it.
            #However, in this case we also exclude good urls like http://www.mdm.si/company#employment
            if any(x in url for x in ['%', '~',
                #images
                '.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico', '.svg', '.tif', '.tiff',
                '.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO', '.SVG', '.TIF', '.TIFF',
                #documents
                '.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf', '.pd',
                '.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF', '.PD',
                #music and video
                '.mp3', '.mp4', '.mpg', '.ai', '.avi', '.swf',
                '.MP3', '.MP4', '.MPG', '.AI', '.AVI', '.SWF',
                #compressions and other
                '.zip', '.rar', '.css', '.flv', '.php',
                '.ZIP', '.RAR', '.CSS', '.FLV', '.PHP',
                #Twitter, Facebook
                '://twitter.com', '://mobile.twitter.com', 'www.facebook.com', 'www.twitter.com'
            ]):
                continue

            #We need to save the original url for xpath, in case we change it later (join it with base_url).
            url_xpath = url

            #We don't want to go to other websites. We want to stay on our website, so we keep only urls with the domain (netloc) of the company we are investigating.
            if urlparse(url).netloc == urlparse(base_url).netloc:

                #The main part. We look for webpages whose urls include one of the employment words as strings.
                #We check the tab (anchor text) of the url as well. This is an additional filter, suggested by Dan Wu, to improve accuracy.
                tabs = response.xpath('//a[@href="%s"]/text()' % url_xpath).extract()

                #Sometimes tabs can be just whitespace like '\t' and '\n', in which case we replace them with [].
                #That was the case when the spider didn't find this employment url: http://www.terme-krka.com/si/sl/o-termah-krka/o-podjetju-in-skupini-krka/zaposlitev/
                tabs = [tab.encode('utf-8') for tab in tabs]
                tabs = [tab.replace('\t', '') for tab in tabs]
                tabs = [tab.replace('\n', '') for tab in tabs]
                tab_empty = True
                for tab in tabs:
                    if tab != '':
                        tab_empty = False
                if tab_empty:
                    tabs = []

                # -- Instruction.
                # -- Users in other languages, please insert employment words in your own language, like jobs, vacancies, career, employment ...
                #keyword_url starts empty, then we add keywords as we find them in the url. This is for tracking purposes.
                keyword_url = ''
                #if any(x in url for x in keywords):
                for keyword in url_whitelist:
                    if keyword in url:
                        keyword_url = keyword_url + keyword + ' '

                #If we find at least one keyword in the url, we continue.
                if keyword_url != '':

                    #1. Tabs are empty.
                    if tabs == []:
                        #print "No text for url: " + str(url)

                        #We found a url that includes one of the magic words.
                        #We check whether we have found this url before. If it is new, we add it to the list "jobs_urls".
                        if url not in self.jobs_urls:
                            self.jobs_urls.append(url)
                            item = JobItem()
                            item["url"] = url
                            #item["keyword_url"] = keyword_url
                            #item["keyword_url_tab"] = ' '
                            #item["keyword_tab"] = ' '
                            print url

                            #We return the item.
                            yield item

                    #2. There are texts, one or more.
                    else:
                        #For the same partial url several texts are possible.
                        for tab in tabs:
                            keyword_url_tab = ''
                            for key in tab_whitelist:
                                if key in tab:
                                    keyword_url_tab = keyword_url_tab + key + ' '
                            if keyword_url_tab != '':
                                #keyword_url_tab starts with keyword_url from before, because we want to remember keywords from both url and tab.
                                keyword_url_tab = 'URL ' + keyword_url + ' TAB ' + keyword_url_tab
                                #if any(x in text for x in keywords):

                                #We found a url that includes one of the magic words, and the tab also includes a magic word.
                                #We check whether we have found this url before. If it is new, we add it to the list "jobs_urls".
                                if url not in self.jobs_urls:
                                    self.jobs_urls.append(url)
                                    item = JobItem()
                                    item["url"] = url
                                    #item["keyword_url"] = ' '
                                    #item["keyword_url_tab"] = keyword_url_tab
                                    #item["keyword_tab"] = ' '
                                    print url

                                    #We return the item.
                                    yield item
                else:
                    for tab in tabs:
                        #print "TABS " + str(tabs)
                        #print "TAB " + str(type(tab))
                        keyword_tab = ''
                        for key in tab_whitelist:
                            #print "KEY " + str(type(key))
                            if key in tab:
                                keyword_tab = keyword_tab + key + ' '
                        if keyword_tab != '':
                            if url not in self.jobs_urls:
                                self.jobs_urls.append(url)
                                item = JobItem()
                                item["url"] = url
                                #item["keyword_url"] = ' '
                                #item["keyword_url_tab"] = ' '
                                #item["keyword_tab"] = keyword_tab
                                print url

                                #We return the item.
                                yield item

            #We don't put an "else" branch, because we want to further explore the employment webpage to find possible new employment webpages.
            #We keep looking for employment webpages until we reach the DEPTH that we have set in settings.py.
            yield Request(url, callback=self.parse)

        response.selector.remove_namespaces()

        #We take all urls, marked by "href". These are either webpages on our website or new websites.
        urls = response.xpath('//@href').extract()

        #Base url.
        base_url = get_base_url(response)

        #Loop through all urls on the webpage.
        for url in urls:
            url = url.strip()
            url = url.encode('utf-8')

            #Ignore ftp.
            if url.startswith("ftp"):
                continue

            #If url doesn't start with "http", it is a relative url, and we add the base url to get an absolute url.
            # -- It is true that we may get some strange urls, but it is fine for now.
            if not url.startswith("http"):
                url = urljoin(base_url, url)

            #If url includes characters like ?, %, &, # ... it is LIKELY NOT the one we are looking for and we ignore it.
            #However, in this case we also exclude good urls like http://www.mdm.si/company#employment
            if any(x in url for x in ['%', '~',
                #images
                '.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico', '.svg', '.tif', '.tiff',
                '.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO', '.SVG', '.TIF', '.TIFF',
                #documents
                '.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf', '.pd',
                '.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF', '.PD',
                #music and video
                '.mp3', '.mp4', '.mpg', '.ai', '.avi', '.swf',
                '.MP3', '.MP4', '.MPG', '.AI', '.AVI', '.SWF',
                #compressions and other
                '.zip', '.rar', '.css', '.flv', '.php',
                '.ZIP', '.RAR', '.CSS', '.FLV', '.PHP',
            ]):
                continue

            #We need to save the original url for xpath, in case we change it later (join it with base_url).
            url_xpath = url

            #If url doesn't start with "http", it is a relative url, and we add the base url to get an absolute url.
            # -- It is true that we may get some strange urls, but it is fine for now.
            if not url.startswith("http"):
                url = urljoin(base_url, url)

            #Counting unique links.
            #if url not in self.unique_urls:
            #    self.unique_urls.append(url)
            #    unique.write(str(url) + "\n")

            #We don't want to go to other websites. We want to stay on our website, so we keep only urls with the domain (netloc) of the company we are investigating.
            if urlparse(url).netloc == urlparse(base_url).netloc:

                #The main part. We look for webpages whose urls include one of the employment words as strings.
                #We check the tab (anchor text) of the url as well. This is an additional filter, suggested by Dan Wu, to improve accuracy.
                tabs = response.xpath('//a[@href="%s"]/text()' % url_xpath).extract()

                #Sometimes tabs can be just whitespace like '\t' and '\n', in which case we replace them with [].
                #That was the case when the spider didn't find this employment url: http://www.terme-krka.com/si/sl/o-termah-krka/o-podjetju-in-skupini-krka/zaposlitev/
                tabs = [tab.encode('utf-8') for tab in tabs]
                tabs = [tab.replace('\t', '') for tab in tabs]
                tabs = [tab.replace('\n', '') for tab in tabs]
                tab_empty = True
                for tab in tabs:
                    if tab != '':
                        tab_empty = False
                if tab_empty:
                    tabs = []

                # -- Instruction.
                # -- Users in other languages, please insert employment words in your own language, like jobs, vacancies, career, employment ...
                #keyword_url starts empty, then we add keywords as we find them in the url. This is for tracking purposes.
                keyword_url = ''
                #if any(x in url for x in keywords):
                for keyword in url_whitelist:
                    if keyword in url:
                        keyword_url = keyword_url + keyword + ' '

                #If we find at least one keyword in the url, we continue.
                if keyword_url != '':

                    #1. Tabs are empty.
                    if tabs == []:
                        #print "No text for url: " + str(url)

                        #We found a url that includes one of the magic words.
                        #We check whether we have found this url before. If it is new, we add it to the list "jobs_urls".
                        if url not in self.jobs_urls:
                            self.jobs_urls.append(url)
                            item = JobItem()
                            item["url"] = url
                            #item["keyword_url"] = keyword_url
                            #item["keyword_url_tab"] = ' '
                            #item["keyword_tab"] = ' '
                            print url

                            #We return the item.
                            yield item

                    #2. There are texts, one or more.
                    else:
                        #For the same partial url several texts are possible.
                        for tab in tabs:
                            keyword_url_tab = ''
                            for key in tab_whitelist:
                                if key in tab:
                                    keyword_url_tab = keyword_url_tab + key + ' '
                            if keyword_url_tab != '':
                                #keyword_url_tab starts with keyword_url from before, because we want to remember keywords from both url and tab.
                                keyword_url_tab = 'URL ' + keyword_url + ' TAB ' + keyword_url_tab
                                #if any(x in text for x in keywords):

                                #We found a url that includes one of the magic words, and the tab also includes a magic word.
                                #We check whether we have found this url before. If it is new, we add it to the list "jobs_urls".
                                if url not in self.jobs_urls:
                                    self.jobs_urls.append(url)
                                    item = JobItem()
                                    item["url"] = url
                                    #item["keyword_url"] = ' '
                                    #item["keyword_url_tab"] = keyword_url_tab
                                    #item["keyword_tab"] = ' '
                                    print url

                                    #We return the item.
                                    yield item
                else:
                    for tab in tabs:
                        #print "TABS " + str(tabs)
                        #print "TAB " + str(type(tab))
                        keyword_tab = ''
                        for key in tab_whitelist:
                            #print "KEY " + str(type(key))
                            if key in tab:
                                keyword_tab = keyword_tab + key + ' '
                        if keyword_tab != '':
                            if url not in self.jobs_urls:
                                self.jobs_urls.append(url)
                                item = JobItem()
                                item["url"] = url
                                #item["keyword_url"] = ' '
                                #item["keyword_url_tab"] = ' '
                                #item["keyword_tab"] = keyword_tab
                                print url

                                #We return the item.
                                yield item

            #We don't put an "else" branch, because we want to further explore the employment webpage to find possible new employment webpages.
            #We keep looking for employment webpages until we reach the DEPTH that we have set in settings.py.
            yield Request(url, callback=self.parse)
Your help is greatly appreciated!
DONE
I found the solution by writing a programme that creates lots of spiders, 122 in my case, by copying and modifying the initial spider. The modification is that each spider reads the next ten urls from the list, so that together the spiders cover the whole list, 10 urls each, and work in parallel. This way 123 spiders are released to the network at the same time.
At the same time, the programme creates a .bat file with 123 commands that launch the spiders, so that I don't have to open 123 command lines.
#Programme that generates spiders

#Initial parameter to determine the number of spiders. There are 1226 urls, so we set it to 122 spiders, so that the last piece will be 1220 to 1230. There is also the initial spider that crawls webpages 0 to 10, so there will be 123 spiders.
j = 122

#Prepare the bat file with commands that will throw all spiders at the network at the same time.
bat = open("G:/myVE/vacancies_januar/commands.bat", "w")
bat.write("cd \"G:\\myVE\\vacancies_januar\"\n")
bat.write("start scrapy crawl jobs_0_10 -o podjetja_0_10_url.csv -t csv --logfile podjetja_0_10_log.txt\n")

#Loop that grows spiders from the initial spider_0_10.
for i in range(0, j):
    with open("G:/myVE/vacancies_januar/vacancies/spiders/job_spider_0_10.py", "r+") as prgm:
        program = prgm.read()

        #Just replace 0_10 with 10_20 and so on.
        program = program.replace("0_10", str((i+1)*10)+"_"+str((i+1)*10+10))
        program = program.replace("0:10", str((i+1)*10)+":"+str((i+1)*10+10))

        #Generate the new spider.
        dest = open("G:/myVE/vacancies_januar/vacancies/spiders/job_spider_"+str((i+1)*10)+"_"+str((i+1)*10+10)+".py", "w")
        dest.write(program)

        #At the same time, write the command into the bat file.
        bat.write("start scrapy crawl jobs_"+str((i+1)*10)+"_"+str((i+1)*10+10)+" -o podjetja_"+str((i+1)*10)+"_"+str((i+1)*10+10)+"_url.csv -t csv --logfile podjetja_"+str((i+1)*10)+"_"+str((i+1)*10+10)+"_log.txt\n")
Why are you specifying the path to the Python spider? Isn't specifying the spider name (jobs) enough?
I would expect this to work just as well:
scrapy crawl jobs -o urls.csv -t csv --logfile log.txt
As for splitting the job, why not write a Python wrapper that takes the number of concurrent spiders, divides the URL list into that many pieces, and launches the spider(s)?
EDIT
Caveat: I am not well versed in scrapy use.
Here's a sample program that, given a (big?) file of urls, splits them into smaller chunks and creates a new process for each one, attempting to invoke scrapy in each child process. This would effectively run <n> scrapy processes at once on different sets of URLs.
#!python2
import multiprocessing, subprocess, sys, tempfile, math, os

def run_chunk(spider, proj_dir, urllist):
    os.chdir(proj_dir)
    with tempfile.NamedTemporaryFile(mode='wt', prefix='urllist') as urlfile:
        urlfile.write("\n".join(urllist))
        urlfile.flush()
        command = [
            'scrapy',
            'crawl',
            '-a', 'urls=' + urlfile.name,
            spider,
        ]
        subprocess.check_call(command)
    print("Child Finished!")

# http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python/312464#312464
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

if __name__ == '__main__':
    # ... or use argparse or some other run-time configuration tool
    spider = 'jobs'
    input_urls = 'biglist.urls.txt'   # one URL per line
    project_dir = 'jobs_scrapy_dir'
    num_children = 10

    # Split the URLs into chunks; assign chunks to workers.
    urls = open(input_urls, 'rt').read().splitlines()
    # Ceiling division, so no URLs are dropped. Note that int(math.ceil(a//b))
    # would floor first and the ceil would do nothing.
    per_chunk = int(math.ceil(len(urls) / float(num_children)))
    workers = [multiprocessing.Process(target=run_chunk,
                                       args=(spider, project_dir, chunk))
               for chunk in chunks(urls, per_chunk)]

    # Start all the workers.
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    print("Finished!")
Based on a very cursory reading, scrapy has some concept of parallelization itself, so I cannot say that this is the best way to make use of scrapy. The code sample here works insofar as it splits the file into pieces and launches child processes. The command given to the subprocess.check_call() invocation will probably need to be tweaked in order to pass a file full of urls to a spider instance.
Limitations
The entire file of URLs is read into memory at once, then split into pieces. This means that 2x the space of the URL file is used. There are smarter ways of doing this job. My implementation is just a quick demo of one possibility.
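For instance, the chunking could be done lazily so that only one chunk is ever held in memory. A minimal sketch using only the standard library (the file path and chunk size are whatever the caller passes in):

import itertools

def iter_chunks(path, n):
    """Lazily read a file of URLs, yielding lists of at most n stripped, non-empty lines."""
    with open(path, 'rt') as f:
        urls = (line.strip() for line in f)
        urls = (u for u in urls if u)
        while True:
            chunk = list(itertools.islice(urls, n))
            if not chunk:
                break
            yield chunk

Each yielded chunk can then be handed straight to a worker process instead of pre-splitting the whole list.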
The last chunk may be significantly smaller than all the others. The process will likely take as long as a full chunk, so this probably doesn't matter much, but balancing the load more evenly may be advantageous.
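If more even balancing is wanted, one option is to spread the remainder over the first few chunks so that chunk sizes differ by at most one. A small sketch, independent of scrapy (items and num_chunks are placeholders for the URL list and worker count):

def balanced_chunks(items, num_chunks):
    """Split items into num_chunks consecutive pieces whose sizes differ by at most one."""
    base, extra = divmod(len(items), num_chunks)
    result, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)  # the first `extra` chunks take one extra item
        result.append(items[start:start + size])
        start += size
    return result

For 13 items and 5 workers this gives pieces of sizes 3, 3, 3, 2, 2 rather than 3, 3, 3, 3, 1.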
The scrapy syntax may not be correct, and the spider may have to be updated to accept a file parameter.
I did not test the scrapy invocation. The OP didn't post any project details, and the script itself didn't work out of the box, so I had no way to really test that part. Fixing the project/invocation is left as an exercise for the reader.
The waf command waf build shows compiler errors (if there are any), while waf debug or waf release does not, and always fails, using the following wscript file (or maybe the wscript file has some other shortcomings I am currently not aware of):
APPNAME = 'waftest'
VERSION = '0.0.1'

def configure(ctx):
    ctx.load('compiler_c')
    ctx.define('VERSION', VERSION)
    ctx.define('GETTEXT_PACKAGE', APPNAME)
    ctx.check_cfg(atleast_pkgconfig_version='0.1.1')
    ctx.check_cfg(package='glib-2.0', uselib_store='GLIB', args=['--cflags', '--libs'], mandatory=True)
    ctx.check_cfg(package='gobject-2.0', uselib_store='GOBJECT', args=['--cflags', '--libs'], mandatory=True)
    ctx.check_cfg(package='gtk+-3.0', uselib_store='GTK3', args=['--cflags', '--libs'], mandatory=True)
    ctx.check_cfg(package='libxml-2.0', uselib_store='XML', args=['--cflags', '--libs'], mandatory=True)
    ctx.check_large_file(mandatory=False)
    ctx.check_endianness(mandatory=False)
    ctx.check_inline(mandatory=False)

    ctx.setenv('debug')
    ctx.env.CFLAGS = ['-g', '-Wall']
    ctx.define('DEBUG', 1)

    ctx.setenv('release')
    ctx.env.CFLAGS = ['-O2', '-Wall']
    ctx.define('RELEASE', 1)

def pre(ctx):
    print('Building [[[' + ctx.variant + ']]] ...')

def post(ctx):
    print('Building is complete.')

def build(ctx):
    ctx.add_pre_fun(pre)
    ctx.add_post_fun(post)
    # if not ctx.variant:
    #     ctx.fatal('Do "waf debug" or "waf release"')
    exe = ctx.program(
        features = ['c', 'cprogram'],
        target = APPNAME + '.bin',
        source = ctx.path.ant_glob(['src/*.c']),
        includes = ['src/'],
        export_includes = ['src/'],
        uselib = 'GOBJECT GLIB GTK3 XML'
    )
    # for item in exe.includes:
    #     print(item)

from waflib.Build import BuildContext

class release(BuildContext):
    cmd = 'release'
    variant = 'release'

class debug(BuildContext):
    cmd = 'debug'
    variant = 'debug'
Error resulting from waf debug :
Build failed
-> task in 'waftest.bin' failed (exit status -1):
{task 46697488: c qqq.c -> qqq.c.1.o}
[useless filepaths]
I had a look at the waf demos and read section 6.2.2 of the waf book, but those did not supply me with the information I needed to fix this issue.
What's wrong, and how do I fix it?
You need to do at least the following:
def configure(ctx):
    ...
    ctx.setenv('debug')
    ctx.load('compiler_c')
    ...
This is because the cfg.setenv function resets the whole previous environment. If you want to keep the previous environment, you can do cfg.setenv('debug', env=cfg.env.derive()).
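Concretely, configure could be laid out so that each variant derives from the default environment built by the common checks, instead of each setenv starting from scratch. This is only a sketch and has not been run against a particular waf version; treat the layout, not the exact API surface, as the point:

def configure(ctx):
    ctx.load('compiler_c')
    # ... common check_cfg / define calls go here, into the default environment ...

    base_env = ctx.env  # keep a reference to the configured default environment

    # Each variant starts from a copy of the default, so the toolchain and
    # pkg-config results detected above are not lost.
    ctx.setenv('debug', env=base_env.derive())
    ctx.env.CFLAGS = ['-g', '-Wall']
    ctx.define('DEBUG', 1)

    ctx.setenv('release', env=base_env.derive())
    ctx.env.CFLAGS = ['-O2', '-Wall']
    ctx.define('RELEASE', 1)

Deriving both variants from base_env (rather than letting 'release' derive from the already-modified 'debug' environment) keeps the two variants independent.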
Also, you don't need to explicitly specify features = ['c', 'cprogram']; it is redundant when you call bld.program(...).
P.S. Don't forget to reconfigure after modifying the wscript file.