I wanted to prepare a multiple.bat file to run several spiders so I first tried to prepare multiple.bat file for one spider. I got stopped here. I got this error
G:\myVE\vacancies>multiple.bat
G:\myVE\vacancies>scrapy crawl jobs -o urls.csv -t csv --logfile log.txt G:\myVE
\vacancies\vacancies\spiders\job_spider.py
G:\myVE\vacancies\vacancies\spiders\job_spider.py:12: ScrapyDeprecationWarning:
`Settings.overrides` attribute is deprecated and won't be supported in Scrapy 0.
26, use `Settings.set(name, value, priority='cmdline')` instead
settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,applicati
on/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl',}
['http://1nadan.si']
Usage
=====
scrapy crawl [options] <spider>
crawl: error: running 'scrapy crawl' with more than one spider is no longer supp
orted
From this question
How to give URL to scrapy for crawling?
it looks like the problem would be that the spider is reading several urls into start_urls but this is not the case. There is only one url in the spider. And it works normally when started from command line. Why is this error happening? Maybe because I have several spiders in adjacent directories but that does not make sense. My final goal is to split a list of 1300 urls into 130 chunks of 10 urls and launch 130 spiders from multiple.bat file. The aim is to reduce the time of scraping so that I can have results in two hours instead of two days, because now I split 1300 urls into 13 chunks of 100 urls and launch 13 spiders and it takes me two days to scrap everything.
Here is my multiple.bat code
scrapy crawl jobs -o urls.csv -t csv --logfile log.txt G:\myVE\vacancies\vacancies\spiders\job_spider.py
and here is the code for my spider:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# encoding=UTF-8
import scrapy, urlparse, time, sys
from scrapy.http import Request
from scrapy.utils.response import get_base_url
from urlparse import urlparse, urljoin
from vacancies.items import JobItem
#We need that in order to force Slovenian pages instead of English pages. It happened at "http://www.g-gmi.si/gmiweb/" that only English pages were found and no Slovenian.
from scrapy.conf import settings
settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl',}
#Settings.set(name, value, priority='cmdline')
#settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl','en':q=0.8,}
#start_time = time.time()
# We run the programme in the command line with this command:
# scrapy crawl jobs -o urls.csv -t csv --logfile log.txt
# We get two output files
# 1) urls.csv
# 2) log.txt
# Url whitelist.
with open("Q:/SIIT/JV_Marko_Boro/Detector/kljucne_besede/url_whitelist.txt", "r+") as kw:
url_whitelist = kw.read().replace('\n', '').split(",")
url_whitelist = map(str.strip, url_whitelist)
# Tab whitelist.
# We need to replace character the same way as in detector.
with open("Q:/SIIT/JV_Marko_Boro/Detector/kljucne_besede/tab_whitelist.txt", "r+") as kw:
tab_whitelist = kw.read().decode(sys.stdin.encoding).encode('utf-8')
tab_whitelist = tab_whitelist.replace('Ŕ', 'č')
tab_whitelist = tab_whitelist.replace('L', 'č')
tab_whitelist = tab_whitelist.replace('Ő', 'š')
tab_whitelist = tab_whitelist.replace('Ü', 'š')
tab_whitelist = tab_whitelist.replace('Ä', 'ž')
tab_whitelist = tab_whitelist.replace('×', 'ž')
tab_whitelist = tab_whitelist.replace('\n', '').split(",")
tab_whitelist = map(str.strip, tab_whitelist)
#File to write unique links
#unique = open("G:/myVE/vacancies/unique_urls.txt", "wb")
#izloceni = open("G:/myVE/vacancies/izloceni.txt", "wb")
class JobSpider(scrapy.Spider):
name = "jobs"
#Test sample of SLO companies
start_urls = [
"http://1nadan.si"
]
print start_urls
#Result of the programme is this list of job vacancies webpages.
jobs_urls = []
#I would like to see how many unique links we check on every page.
#unique_urls = []
def parse(self, response):
response.selector.remove_namespaces()
#Take url of response, because we would like to stay on the same domain.
net1 = urlparse(response.url).netloc
#print "Net 1 " + str(net1)
#Base url.
base_url = get_base_url(response)
#print "Base url " + str(base_url)
#We take all urls, they are marked by "href". These are either webpages on our website either new websites.
urls = response.xpath('//#href').extract()
#print urls
#Loop through all urls on the webpage.
for url in urls:
url = url.strip()
#Counting unique links.
#if url not in self.unique_urls:
# self.unique_urls.append(url)
# unique.write(str(url.encode('utf-8')) + "\n")
#Ignore ftp and sftp.
if url.startswith("ftp") or url.startswith("sftp"):
continue
#If url doesn't start with "http", it is relative url, and we add base url to get absolute url.
# -- It is true, that we may get some strange urls, but it is fine for now.
if not (url.startswith("http")):
url = urljoin(base_url,url)
#This is very strict condition. If seed website loses or gets www., then it will be ignored, as the condition very strictly checks the link.
#o = urlparse(url)
#test = o.scheme + "://" + o.netloc
#print "Url : " + url
#print "Test: " + test
#if test in self.start_urls:
# print "Test OK"
#if test not in self.start_urls:
#print "Test NOT OK - continue"
#izloceni.write(str(url) + "\n")
#continue
#Compare each url on the webpage with original url, so that spider doesn't wander away on the net.
net2 = urlparse(url).netloc
if net2 != net1:
continue
#If url includes characters like ?, %, &, # ... it is LIKELY NOT to be the one we are looking for and we ignore it.
#However in this case we exclude good urls like http://www.mdm.si/company#employment
if any(x in url for x in ['%', '~',
#images
'.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico', '.svg', '.tif', '.tiff',
'.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO', '.SVG', '.TIF', '.TIFF',
#documents
'.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf', '.pd',
'.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF', '.PD',
#music and video
'.mp3', '.mp4', '.mpg', '.ai', '.avi', '.swf',
'.MP3', '.MP4', '.MPG', '.AI', '.AVI', '.SWF',
#compressions and other
'.zip', '.rar', '.css', '.flv', '.php',
'.ZIP', '.RAR', '.CSS', '.FLV', '.PHP',
#Twitter, Facebook
'://twitter.com', '://mobile.twitter.com', 'www.facebook.com', 'www.twitter.com'
]):
continue
#We need to save original url for xpath, in case we change it later (join it with base_url)
url_xpath = url
#We don't want to go to other websites. We want to stay on our website, so we keep only urls with domain (netloc) of the company we are investigating.
if (urlparse(url).netloc == urlparse(base_url).netloc):
#The main part. We look for webpages, whose urls include one of the employment words as strings.
#We will check the tab of the url as well. This is additional filter, suggested by Dan Wu, to improve accuracy.
tabs = response.xpath('//a[#href="%s"]/text()' % url_xpath).extract()
# Sometimes tabs can be just empty spaces like '\t' and '\n' so in this case we replace it with [].
# That was the case when the spider didn't find this employment url: http://www.terme-krka.com/si/sl/o-termah-krka/o-podjetju-in-skupini-krka/zaposlitev/
tabs = [tab.encode('utf-8') for tab in tabs]
tabs = [tab.replace('\t', '') for tab in tabs]
tabs = [tab.replace('\n', '') for tab in tabs]
tab_empty = True
for tab in tabs:
if tab != '':
tab_empty = False
if tab_empty == True:
tabs = []
# -- Instruction.
# -- Users in other languages, please insert employment words in your own language, like jobs, vacancies, career, employment ... --
# Starting keyword_url is zero, then we add keywords as we find them in url. This is for tracking purposes.
keyword_url = ''
#if any(x in url for x in keywords):
for keyword in url_whitelist:
if keyword in url:
keyword_url = keyword_url + keyword + ' '
# If we find at least one keyword in url, we continue.
if keyword_url != '':
#1. Tabs are empty.
if tabs == []:
#print "No text for url: " + str(url)
#We found url that includes one of the magic words and also the text includes a magic word.
#We check url, if we have found it before. If it is new, we add it to the list "jobs_urls".
if url not in self.jobs_urls:
self.jobs_urls.append(url)
item = JobItem()
item["url"] = url
#item["keyword_url"] = keyword_url
#item["keyword_url_tab"] = ' '
#item["keyword_tab"] = ' '
print url
#We return the item.
yield item
#2. There are texts, one or more.
else:
#For the same partial url several texts are possible.
for tab in tabs:
keyword_url_tab = ''
for key in tab_whitelist:
if key in tab:
keyword_url_tab = keyword_url_tab + key + ' '
if keyword_url_tab != '':
# keyword_url_tab starts with keyword_url from before, because we want to remember keywords from both url and tab.
keyword_url_tab = 'URL ' + keyword_url + ' TAB ' + keyword_url_tab
#if any(x in text for x in keywords):
#We found url that includes one of the magic words and also the tab includes a magic word.
#We check url, if we have found it before. If it is new, we add it to the list "jobs_urls".
if url not in self.jobs_urls:
self.jobs_urls.append(url)
item = JobItem()
item["url"] = url
#item["keyword_url"] = ' '
#item["keyword_url_tab"] = keyword_url_tab
#item["keyword_tab"] = ' '
print url
#We return the item.
yield item
else:
for tab in tabs:
#print "TABS " + str(tabs)
#print "TAB " + str(type(tab))
keyword_tab = ''
for key in tab_whitelist:
#print "KEY " + str(type(key))
if key in tab:
keyword_tab = keyword_tab + key + ' '
if keyword_tab != '':
if url not in self.jobs_urls:
self.jobs_urls.append(url)
item = JobItem()
item["url"] = url
#item["keyword_url"] = ' '
#item["keyword_url_tab"] = ' '
#item["keyword_tab"] = keyword_tab
print url
#We return the item.
yield item
#We don't put "else" sentence because we want to further explore the employment webpage to find possible new employment webpages.
#We keep looking for employment webpages, until we reach the DEPTH, that we have set in settings.py.
yield Request(url, callback = self.parse)
response.selector.remove_namespaces()
#We take all urls, they are marked by "href". These are either webpages on our website either new websites.
urls = response.xpath('//#href').extract()
#Base url.
base_url = get_base_url(response)
#Loop through all urls on the webpage.
for url in urls:
url = url.strip()
url = url.encode('utf-8')
#Ignore ftp.
if url.startswith("ftp"):
continue
#If url doesn't start with "http", it is relative url, and we add base url to get absolute url.
# -- It is true, that we may get some strange urls, but it is fine for now.
if not (url.startswith("http")):
url = urljoin(base_url,url)
#If url includes characters like ?, %, &, # ... it is LIKELY NOT to be the one we are looking for and we ignore it.
#However in this case we exclude good urls like http://www.mdm.si/company#employment
if any(x in url for x in ['%', '~',
#images
'.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico', '.svg', '.tif', '.tiff',
'.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO', '.SVG', '.TIF', '.TIFF',
#documents
'.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf', '.pd',
'.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF', '.PD',
#music and video
'.mp3', '.mp4', '.mpg', '.ai', '.avi', '.swf',
'.MP3', '.MP4', '.MPG', '.AI', '.AVI', '.SWF',
#compressions and other
'.zip', '.rar', '.css', '.flv', '.php',
'.ZIP', '.RAR', '.CSS', '.FLV', '.PHP',
]):
continue
#We need to save original url for xpath, in case we change it later (join it with base_url)
url_xpath = url
#If url doesn't start with "http", it is relative url, and we add base url to get absolute url.
# -- It is true, that we may get some strange urls, but it is fine for now.
if not (url.startswith("http")):
url = urljoin(base_url,url)
#Counting unique links.
#if url not in self.unique_urls:
# self.unique_urls.append(url)
# unique.write(str(url) + "\n")
#We don't want to go to other websites. We want to stay on our website, so we keep only urls with domain (netloc) of the company we are investigating.
if (urlparse(url).netloc == urlparse(base_url).netloc):
#The main part. We look for webpages, whose urls include one of the employment words as strings.
#We will check the tab of the url as well. This is additional filter, suggested by Dan Wu, to improve accuracy.
tabs = response.xpath('//a[#href="%s"]/text()' % url_xpath).extract()
# Sometimes tabs can be just empty spaces like '\t' and '\n' so in this case we replace it with [].
# That was the case when the spider didn't find this employment url: http://www.terme-krka.com/si/sl/o-termah-krka/o-podjetju-in-skupini-krka/zaposlitev/
tabs = [tab.encode('utf-8') for tab in tabs]
tabs = [tab.replace('\t', '') for tab in tabs]
tabs = [tab.replace('\n', '') for tab in tabs]
tab_empty = True
for tab in tabs:
if tab != '':
tab_empty = False
if tab_empty == True:
tabs = []
# -- Instruction.
# -- Users in other languages, please insert employment words in your own language, like jobs, vacancies, career, employment ... --
# Starting keyword_url is zero, then we add keywords as we find them in url. This is for tracking purposes.
keyword_url = ''
#if any(x in url for x in keywords):
for keyword in url_whitelist:
if keyword in url:
keyword_url = keyword_url + keyword + ' '
# If we find at least one keyword in url, we continue.
if keyword_url != '':
#1. Tabs are empty.
if tabs == []:
#print "No text for url: " + str(url)
#We found url that includes one of the magic words and also the text includes a magic word.
#We check url, if we have found it before. If it is new, we add it to the list "jobs_urls".
if url not in self.jobs_urls:
self.jobs_urls.append(url)
item = JobItem()
item["url"] = url
#item["keyword_url"] = keyword_url
#item["keyword_url_tab"] = ' '
#item["keyword_tab"] = ' '
print url
#We return the item.
yield item
#2. There are texts, one or more.
else:
#For the same partial url several texts are possible.
for tab in tabs:
keyword_url_tab = ''
for key in tab_whitelist:
if key in tab:
keyword_url_tab = keyword_url_tab + key + ' '
if keyword_url_tab != '':
# keyword_url_tab starts with keyword_url from before, because we want to remember keywords from both url and tab.
keyword_url_tab = 'URL ' + keyword_url + ' TAB ' + keyword_url_tab
#if any(x in text for x in keywords):
#We found url that includes one of the magic words and also the tab includes a magic word.
#We check url, if we have found it before. If it is new, we add it to the list "jobs_urls".
if url not in self.jobs_urls:
self.jobs_urls.append(url)
item = JobItem()
item["url"] = url
#item["keyword_url"] = ' '
#item["keyword_url_tab"] = keyword_url_tab
#item["keyword_tab"] = ' '
print url
#We return the item.
yield item
else:
for tab in tabs:
#print "TABS " + str(tabs)
#print "TAB " + str(type(tab))
keyword_tab = ''
for key in tab_whitelist:
#print "KEY " + str(type(key))
if key in tab:
keyword_tab = keyword_tab + key + ' '
if keyword_tab != '':
if url not in self.jobs_urls:
self.jobs_urls.append(url)
item = JobItem()
item["url"] = url
#item["keyword_url"] = ' '
#item["keyword_url_tab"] = ' '
#item["keyword_tab"] = keyword_tab
print url
#We return the item.
yield item
#We don't put "else" sentence because we want to further explore the employment webpage to find possible new employment webpages.
#We keep looking for employment webpages, until we reach the DEPTH, that we have set in settings.py.
yield Request(url, callback = self.parse)
Your help is greatly appreciated !
DONE
I have found the solution by writing the programme, that creates lots of spiders, 122 in my case, by copying and modifying initial spider. Modification means that each spider reads next ten urls from the list, so that spiders consecutively read all the list, 10 urls each, and start working in parallel. This way 123 spiders are released at the same time to go fetching to the network.
At the same time the programme creates .bat file with 123 commands, that release the spiders, so that I don't have to open 123 command lines.
#Programme that generates spiders
#Inital parameter to determine number of spiders. There are 1226 urls, so we set it to 122 spiders, so that the last piece will be 1220 to 1230. There is also initial spider, that crawls webpages 0 to 10, so there will be 123 spiders.
j = 122
#Prepare bat file with commands, that will throw all spiders at the same time to the network.
bat = open("G:/myVE/vacancies_januar/commands.bat", "w")
bat.write("cd \"G:\\myVE\\vacancies_januar\"\n")
bat.write("start scrapy crawl jobs_0_10 -o podjetja_0_10_url.csv -t csv --logfile podjetja_0_10_log.txt\n")
#Loop that grows spiders from initial spider_0_10.
for i in range(0,j):
with open("G:/myVE/vacancies_januar/vacancies/spiders/job_spider_0_10.py", "r+") as prgm:
program = prgm.read()
#Just replace 0_10 with 10_20 and so on.
program = program.replace("0_10", str((i+1)*10)+"_"+str((i+1)*10+10))
program = program.replace("0:10", str((i+1)*10)+":"+str((i+1)*10+10))
#Generate new spider.
dest = open("G:/myVE/vacancies_januar/vacancies/spiders/job_spider_"+str((i+1)*10)+"_"+str((i+1)*10+10)+".py", "w")
dest.write(program)
#At the same time write the command into bat file.
bat.write("start scrapy crawl jobs_"+str((i+1)*10)+"_"+str((i+1)*10+10)+" -o podjetja_"+str((i+1)*10)+"_"+str((i+1)*10+10)+"_url.csv -t csv --logfile podjetja_"+str((i+1)*10)+"_"+str((i+1)*10+10)+"_log.txt\n")
Why are you specifying the path to the Python spider? Isn't specifying the spider name (jobs) enough?
I would expect this would work just as well:
scrapy crawl jobs -o urls.csv -t csv --logfile log.txt
As for splitting the job, why not write a Python wrapper that takes the number of concurrent spiders, divides the URL list into that many pieces, and launches the spider(s)?
EDIT
Caveat: I am not well versed in scrapy use.
Here's a sample program that, given a (big?) file of urls, splits them into smaller chunks and creates a new process for each one, attempting to invoke scrapy in each child process. This would effectively run <n> scrapy processes at once on different sets of URLs.
#!python2
import multiprocessing,subprocess,sys,tempfile,math,os
def run_chunk( spider, proj_dir, urllist ):
os.chdir(proj_dir)
with tempfile.NamedTemporaryFile(mode='wt',prefix='urllist') as urlfile:
urlfile.write("\n".join(urllist))
urlfile.flush()
command = [
'scrapy',
'crawl',
'-a','urls='+urlfile.name,
spider,
]
subprocess.check_call(command)
print("Child Finished!")
# http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python/312464#312464
def chunks(l, n):
"""Yield successive n-sized chunks from l."""
for i in xrange(0, len(l), n):
yield l[i:i+n]
if __name__ == '__main__':
# ... or use argparse or some other run-time configuration tool
spider = 'jobs'
input_urls = 'biglist.urls.txt' # one URL per line
project_dir = 'jobs_scrapy_dir'
num_children = 10
# Split the URLs into chunks; assign chunks to workers.
urls = open(input_urls,'rt').readlines()
per_chunk = int(math.ceil(len(urls)//num_children))
workers = [ multiprocessing.Process( target=run_chunk,
args=(spider,project_dir,chunk) )
for chunk in chunks(urls,per_chunk) ]
# Start all the workers.
for w in workers:
w.start()
for w in workers:
w.join()
print("Finished!")
Based on a very cursory reading, scrapy has some concept of parallelization, so I cannot say that this is the best way to make use of scrapy. The code sample here works insofar as it splits the file into pieces and launches child processes. The command given to subprocess.check_call() invocation will probably need to be tweaked in order to pass a file full of urls to a spider instance.
Limitations
The entire file of URL's is read into memory at once, then split into pieces. This means that 2x the space of the URL file is used. There are smarter ways of doing this job. My implementation is just a quick demo of one possibility.
The last chunk may be significantly smaller than all the others. The process will likely take as long as a full chunk, so this probably doesn't matter much, but balancing the load more evenly may be advantageous.
The scrapy syntax may not be correct, and the spider may have to be updated to accept a file parameter.
I did not test the scrapy invocation. The OP didn't post any project details, and the script itself didn't work out of the box, so I had no way to really test that part. Fixing the project/invocation is left as an exercise for the reader.