TICKscript never resets Level to OK

I'm writing a TICKscript that acts on a series of points that can have exactly two outcomes:
either the result is pass, or "not pass" (usually some variant of exit NUM).
The script I have looks roughly like this:
// RP: autogen
// Monitor the result of updates
// WARNING if the result is anything other than pass
batch
|query('''SELECT * FROM "mydb"."autogen"."measurement"''')
.period(25h)
.every(24h)
.groupBy('host')
|alert()
.id('kapacitor/{{ .TaskName }}/{{ .Group }}')
.infoReset(lambda: TRUE)
.warn(lambda: "result" != 'pass')
.message(
'{{ index .Tags "host" }}' +
'{{ if eq .Level "OK" }} are updating again.' +
'{{ else }}' +
'are failing to update.' +
'{{ end }}'
)
.idField('id')
.levelField('level')
.messageField('description')
.stateChangesOnly()
@alertFilterAdapter()
@alertFilter()
The script does more or less do its thing, but it has a critical issue: it never sets the Level back to OK.
If I feed influx these 4 points:
time host name result
---- ---- ---- ------
1544079584447374994 fakeS176 /usr/bin/yum update -y pass
1544079584447374994 fakeS177 /usr/bin/yum update -y exit 1
1544129084447375177 fakeS176 /usr/bin/yum update -y exit 1
1544129084447375177 fakeS177 /usr/bin/yum update -y pass
I would expect 1 warning and 1 OK (all of the timestamps listed above fall within the 25-hour period).
However, what actually happens is that I get 2 WARNs and no OKs.
Could someone give me some advice on how to move forward?

Update - a coworker told me about a node I had no idea about. Adding a last() node with an as() property, then removing the infoReset() node, seemed to do it.
// RP: autogen
// Monitor the result of updates
// WARNING if the result is anything other than pass
batch
|query('''SELECT * FROM "mydb"."autogen"."measurement"''')
.period(25h)
.every(24h)
.groupBy('host')
|last('result')
.as('result')
|alert()
.id('kapacitor/{{ .TaskName }}/{{ .Group }}')
.warn(lambda: "result" != 'pass')
.message(
'{{ index .Tags "host" }}' +
'{{ if eq .Level "OK" }} are updating again.' +
'{{ else }}' +
'are failing to update.' +
'{{ end }}'
)
.idField('id')
.levelField('level')
.messageField('description')
.stateChangesOnly()
@alertFilterAdapter()
@alertFilter()
Screw this blasted language.


How to add options to ntpd

I'd like to add a new option to ntpd, but I couldn't find out how to regenerate ntpd/ntpd-opts{.c, .h} after adding some lines to ntpd/ntpdbase-opts.def, e.g.:
$ git diff ntpd/ntpdbase-opts.def
diff --git a/ntpd/ntpdbase-opts.def b/ntpd/ntpdbase-opts.def
index 66b953528..a790cbd51 100644
--- a/ntpd/ntpdbase-opts.def
+++ b/ntpd/ntpdbase-opts.def
@@ -479,3 +479,13 @@ flag = {
the server to be discovered via mDNS client lookup.
_EndOfDoc_;
};
+
+flag = {
+ name = foo;
+ value = F;
+ arg-type = number;
+ descrip = "Some new option";
+ doc = <<- _EndOfDoc_
+ For testing purpose only.
+ _EndOfDoc_;
+};
Do you have any ideas?
how to generate ntpd/ntpd-opts{.c, .h} after adding some lines to ntpd/ntpdbase-opts.def
It is handled by the build scripts. Just compile it normally per the INSTALL instructions (https://github.com/ntp-project/ntp/blob/master-no-authorname/INSTALL#L30) and make will pick up the .def change.
https://github.com/ntp-project/ntp/blob/master-no-authorname/ntpd/Makefile.am#L304
https://github.com/ntp-project/ntp/blob/master-no-authorname/ntpd/Makefile.am#L183
In addition to @KamilCuk's answer, we need to do the following to add custom options:
Edit *.def file
Run bootstrap script
Run configure script with --disable-local-libopts option
Run make
For example,
$ git diff ntpd/ntpdbase-opts.def
diff --git a/ntpd/ntpdbase-opts.def b/ntpd/ntpdbase-opts.def
index 66b953528..a790cbd51 100644
--- a/ntpd/ntpdbase-opts.def
+++ b/ntpd/ntpdbase-opts.def
@@ -479,3 +479,13 @@ flag = {
the server to be discovered via mDNS client lookup.
_EndOfDoc_;
};
+
+flag = {
+ name = foo;
+ value = F;
+ arg-type = number;
+ descrip = "Some new option";
+ doc = <<- _EndOfDoc_
+ For testing purpose only.
+ _EndOfDoc_;
+};
This change yields:
$ ./ntpd --help
ntpd - NTP daemon program - Ver. 4.2.8p15
Usage: ntpd [ -<flag> [<val>] | --<name>[{=| }<val>] ]... \
[ <server1> ... <serverN> ]
Flg Arg Option-Name Description
-4 no ipv4 Force IPv4 DNS name resolution
- prohibits the option 'ipv6'
...
-F Num foo Some new option
opt version output version information and exit
-? no help display extended usage information and exit
-! no more-help extended usage information passed thru pager
Options are specified by doubled hyphens and their name or by a single
hyphen and the flag character.
...

Get iterator index in go template (consul-template)

I'm trying to get a simple index that I can append to the output of a Go template snippet using consul-template. I looked around a bit and couldn't figure out a simple solution. Basically, given this input:
backend web_back
balance roundrobin
{{range service "web-busybox" "passing"}}
server {{ .Name }} {{ .Address }}:80 check
{{ end }}
I would like to see web-busybox-n 10.1.1.1:80 check, where n is the current index in the range loop.
Is this possible with range and maps?
There is no iteration number when ranging over maps (only a value and an optional key). You can achieve what you want with custom functions.
One possible solution uses an inc() function to increment an index variable in each iteration:
package main

import (
	"fmt"
	"os"
	"text/template"
)

func main() {
	t := template.Must(template.New("").Funcs(template.FuncMap{
		"inc": func(i int) int { return i + 1 },
	}).Parse(src))
	m := map[string]string{
		"one":   "first",
		"two":   "second",
		"three": "third",
	}
	fmt.Println(t.Execute(os.Stdout, m))
}
const src = `{{$idx := 0}}
{{range $key, $value := .}}
index: {{$idx}} key: {{ $key }} value: {{ $value }}
{{$idx = (inc $idx)}}
{{end}}`
This outputs (try it on the Go Playground; output compacted):
index: 0 key: one value: first
index: 1 key: three value: third
index: 2 key: two value: second
See similar / related questions:
Go template remove the last comma in range loop
Join range block in go template
Golang code to repeat an html code n times
The example below looks for all servers providing the pmm service, but will only create the command to register with the first pmm server found (when $index == 0):
{{- range $index, $service := service "pmm" -}}
{{- if eq $index 0 -}}
sudo pmm-admin config --server {{ $service.Address }}
{{- end -}}
{{- end -}}

How can I get the logs of past months in a Postgres DB?

Issue:
Someone has added a junk column to one of my tables. I want to figure out from the logs when and from where this was done.
Please help with this issue.
Make sure logging is enabled in postgresql.conf:
log_destination = 'stderr'            # or 'stderr,csvlog,syslog'
logging_collector = on                # needs a restart
log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_rotation_age = 1d
log_rotation_size = 10MB
log_min_error_statement = error
log_min_duration_statement = 5000     # -1 = disable; 0 = all; 5000 = 5 sec
log_line_prefix = '|%m|%r|%d|%u|%e|'
log_statement = 'ddl'                 # 'none' | 'ddl' | 'mod' | 'all'
# prefer 'ddl': the log then captures DDL plus any query over the min duration
If you haven't enabled it yet, make sure to enable it now.
If you don't have logs, the last resort is to run pg_xlogdump on your xlog files under pg_xlog and look for the DDL.
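As a hedged illustration of what finding the culprit looks like once logging is on (the file name and log line below are fabricated for demonstration; real names follow your log_filename pattern): with log_statement = 'ddl' in effect, the rogue ALTER TABLE ends up in the log files and a simple grep finds it.

```shell
# Demo only: fabricate one log file in the pg_log layout described above,
# then search it the way you would search the real logs.
mkdir -p pg_log
echo "|2018-12-06 10:22:31 CET|10.0.0.7(51234)|mydb|admin|00000| LOG:  statement: ALTER TABLE mytable ADD COLUMN junk integer" \
  > pg_log/postgresql-2018-12-06_102200.log

# The %m (time), %r (remote host/port) and %u (user) fields from
# log_line_prefix tell you when, from where, and by whom the DDL was run:
grep -i 'alter table' pg_log/postgresql-*.log
```

The matched line carries the whole prefix, so the timestamp, client address, and user come out in one shot.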

How to run a spider from a bat file for multiple urls?

I wanted to prepare a multiple.bat file to run several spiders, so I first tried to prepare a multiple.bat file for one spider. I got stopped here, with this error:
G:\myVE\vacancies>multiple.bat
G:\myVE\vacancies>scrapy crawl jobs -o urls.csv -t csv --logfile log.txt G:\myVE\vacancies\vacancies\spiders\job_spider.py
G:\myVE\vacancies\vacancies\spiders\job_spider.py:12: ScrapyDeprecationWarning: `Settings.overrides` attribute is deprecated and won't be supported in Scrapy 0.26, use `Settings.set(name, value, priority='cmdline')` instead
settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl',}
['http://1nadan.si']
Usage
=====
scrapy crawl [options] <spider>
crawl: error: running 'scrapy crawl' with more than one spider is no longer supported
From this question:
How to give URL to scrapy for crawling?
it looks like the problem would be that the spider is reading several urls into start_urls, but this is not the case. There is only one url in the spider, and it works normally when started from the command line. So why is this error happening? Maybe because I have several spiders in adjacent directories, but that does not make sense.
My final goal is to split a list of 1300 urls into 130 chunks of 10 urls and launch 130 spiders from the multiple.bat file. The aim is to reduce the scraping time so that I can have results in two hours instead of two days: right now I split the 1300 urls into 13 chunks of 100 urls and launch 13 spiders, and it takes two days to scrape everything.
Here is my multiple.bat code
scrapy crawl jobs -o urls.csv -t csv --logfile log.txt G:\myVE\vacancies\vacancies\spiders\job_spider.py
and here is the code for my spider:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# encoding=UTF-8
import scrapy, urlparse, time, sys
from scrapy.http import Request
from scrapy.utils.response import get_base_url
from urlparse import urlparse, urljoin
from vacancies.items import JobItem
#We need that in order to force Slovenian pages instead of English pages. It happened at "http://www.g-gmi.si/gmiweb/" that only English pages were found and no Slovenian.
from scrapy.conf import settings
settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl',}
#Settings.set(name, value, priority='cmdline')
#settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl','en':q=0.8,}
#start_time = time.time()
# We run the programme in the command line with this command:
# scrapy crawl jobs -o urls.csv -t csv --logfile log.txt
# We get two output files
# 1) urls.csv
# 2) log.txt
# Url whitelist.
with open("Q:/SIIT/JV_Marko_Boro/Detector/kljucne_besede/url_whitelist.txt", "r+") as kw:
url_whitelist = kw.read().replace('\n', '').split(",")
url_whitelist = map(str.strip, url_whitelist)
# Tab whitelist.
# We need to replace character the same way as in detector.
with open("Q:/SIIT/JV_Marko_Boro/Detector/kljucne_besede/tab_whitelist.txt", "r+") as kw:
tab_whitelist = kw.read().decode(sys.stdin.encoding).encode('utf-8')
tab_whitelist = tab_whitelist.replace('Ŕ', 'č')
tab_whitelist = tab_whitelist.replace('L', 'č')
tab_whitelist = tab_whitelist.replace('Ő', 'š')
tab_whitelist = tab_whitelist.replace('Ü', 'š')
tab_whitelist = tab_whitelist.replace('Ä', 'ž')
tab_whitelist = tab_whitelist.replace('×', 'ž')
tab_whitelist = tab_whitelist.replace('\n', '').split(",")
tab_whitelist = map(str.strip, tab_whitelist)
#File to write unique links
#unique = open("G:/myVE/vacancies/unique_urls.txt", "wb")
#izloceni = open("G:/myVE/vacancies/izloceni.txt", "wb")
class JobSpider(scrapy.Spider):
name = "jobs"
#Test sample of SLO companies
start_urls = [
"http://1nadan.si"
]
print start_urls
#Result of the programme is this list of job vacancies webpages.
jobs_urls = []
#I would like to see how many unique links we check on every page.
#unique_urls = []
def parse(self, response):
response.selector.remove_namespaces()
#Take url of response, because we would like to stay on the same domain.
net1 = urlparse(response.url).netloc
#print "Net 1 " + str(net1)
#Base url.
base_url = get_base_url(response)
#print "Base url " + str(base_url)
#We take all urls, they are marked by "href". These are either webpages on our website either new websites.
urls = response.xpath('//@href').extract()
#print urls
#Loop through all urls on the webpage.
for url in urls:
url = url.strip()
#Counting unique links.
#if url not in self.unique_urls:
# self.unique_urls.append(url)
# unique.write(str(url.encode('utf-8')) + "\n")
#Ignore ftp and sftp.
if url.startswith("ftp") or url.startswith("sftp"):
continue
#If url doesn't start with "http", it is relative url, and we add base url to get absolute url.
# -- It is true, that we may get some strange urls, but it is fine for now.
if not (url.startswith("http")):
url = urljoin(base_url,url)
#This is very strict condition. If seed website loses or gets www., then it will be ignored, as the condition very strictly checks the link.
#o = urlparse(url)
#test = o.scheme + "://" + o.netloc
#print "Url : " + url
#print "Test: " + test
#if test in self.start_urls:
# print "Test OK"
#if test not in self.start_urls:
#print "Test NOT OK - continue"
#izloceni.write(str(url) + "\n")
#continue
#Compare each url on the webpage with original url, so that spider doesn't wander away on the net.
net2 = urlparse(url).netloc
if net2 != net1:
continue
#If url includes characters like ?, %, &, # ... it is LIKELY NOT to be the one we are looking for and we ignore it.
#However in this case we exclude good urls like http://www.mdm.si/company#employment
if any(x in url for x in ['%', '~',
#images
'.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico', '.svg', '.tif', '.tiff',
'.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO', '.SVG', '.TIF', '.TIFF',
#documents
'.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf', '.pd',
'.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF', '.PD',
#music and video
'.mp3', '.mp4', '.mpg', '.ai', '.avi', '.swf',
'.MP3', '.MP4', '.MPG', '.AI', '.AVI', '.SWF',
#compressions and other
'.zip', '.rar', '.css', '.flv', '.php',
'.ZIP', '.RAR', '.CSS', '.FLV', '.PHP',
#Twitter, Facebook
'://twitter.com', '://mobile.twitter.com', 'www.facebook.com', 'www.twitter.com'
]):
continue
#We need to save original url for xpath, in case we change it later (join it with base_url)
url_xpath = url
#We don't want to go to other websites. We want to stay on our website, so we keep only urls with domain (netloc) of the company we are investigating.
if (urlparse(url).netloc == urlparse(base_url).netloc):
#The main part. We look for webpages, whose urls include one of the employment words as strings.
#We will check the tab of the url as well. This is additional filter, suggested by Dan Wu, to improve accuracy.
tabs = response.xpath('//a[@href="%s"]/text()' % url_xpath).extract()
# Sometimes tabs can be just empty spaces like '\t' and '\n' so in this case we replace it with [].
# That was the case when the spider didn't find this employment url: http://www.terme-krka.com/si/sl/o-termah-krka/o-podjetju-in-skupini-krka/zaposlitev/
tabs = [tab.encode('utf-8') for tab in tabs]
tabs = [tab.replace('\t', '') for tab in tabs]
tabs = [tab.replace('\n', '') for tab in tabs]
tab_empty = True
for tab in tabs:
if tab != '':
tab_empty = False
if tab_empty == True:
tabs = []
# -- Instruction.
# -- Users in other languages, please insert employment words in your own language, like jobs, vacancies, career, employment ... --
# Starting keyword_url is zero, then we add keywords as we find them in url. This is for tracking purposes.
keyword_url = ''
#if any(x in url for x in keywords):
for keyword in url_whitelist:
if keyword in url:
keyword_url = keyword_url + keyword + ' '
# If we find at least one keyword in url, we continue.
if keyword_url != '':
#1. Tabs are empty.
if tabs == []:
#print "No text for url: " + str(url)
#We found url that includes one of the magic words and also the text includes a magic word.
#We check url, if we have found it before. If it is new, we add it to the list "jobs_urls".
if url not in self.jobs_urls:
self.jobs_urls.append(url)
item = JobItem()
item["url"] = url
#item["keyword_url"] = keyword_url
#item["keyword_url_tab"] = ' '
#item["keyword_tab"] = ' '
print url
#We return the item.
yield item
#2. There are texts, one or more.
else:
#For the same partial url several texts are possible.
for tab in tabs:
keyword_url_tab = ''
for key in tab_whitelist:
if key in tab:
keyword_url_tab = keyword_url_tab + key + ' '
if keyword_url_tab != '':
# keyword_url_tab starts with keyword_url from before, because we want to remember keywords from both url and tab.
keyword_url_tab = 'URL ' + keyword_url + ' TAB ' + keyword_url_tab
#if any(x in text for x in keywords):
#We found url that includes one of the magic words and also the tab includes a magic word.
#We check url, if we have found it before. If it is new, we add it to the list "jobs_urls".
if url not in self.jobs_urls:
self.jobs_urls.append(url)
item = JobItem()
item["url"] = url
#item["keyword_url"] = ' '
#item["keyword_url_tab"] = keyword_url_tab
#item["keyword_tab"] = ' '
print url
#We return the item.
yield item
else:
for tab in tabs:
#print "TABS " + str(tabs)
#print "TAB " + str(type(tab))
keyword_tab = ''
for key in tab_whitelist:
#print "KEY " + str(type(key))
if key in tab:
keyword_tab = keyword_tab + key + ' '
if keyword_tab != '':
if url not in self.jobs_urls:
self.jobs_urls.append(url)
item = JobItem()
item["url"] = url
#item["keyword_url"] = ' '
#item["keyword_url_tab"] = ' '
#item["keyword_tab"] = keyword_tab
print url
#We return the item.
yield item
#We don't put "else" sentence because we want to further explore the employment webpage to find possible new employment webpages.
#We keep looking for employment webpages, until we reach the DEPTH, that we have set in settings.py.
yield Request(url, callback = self.parse)
response.selector.remove_namespaces()
#We take all urls, they are marked by "href". These are either webpages on our website either new websites.
urls = response.xpath('//@href').extract()
#Base url.
base_url = get_base_url(response)
#Loop through all urls on the webpage.
for url in urls:
url = url.strip()
url = url.encode('utf-8')
#Ignore ftp.
if url.startswith("ftp"):
continue
#If url doesn't start with "http", it is relative url, and we add base url to get absolute url.
# -- It is true, that we may get some strange urls, but it is fine for now.
if not (url.startswith("http")):
url = urljoin(base_url,url)
#If url includes characters like ?, %, &, # ... it is LIKELY NOT to be the one we are looking for and we ignore it.
#However in this case we exclude good urls like http://www.mdm.si/company#employment
if any(x in url for x in ['%', '~',
#images
'.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico', '.svg', '.tif', '.tiff',
'.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO', '.SVG', '.TIF', '.TIFF',
#documents
'.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf', '.pd',
'.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF', '.PD',
#music and video
'.mp3', '.mp4', '.mpg', '.ai', '.avi', '.swf',
'.MP3', '.MP4', '.MPG', '.AI', '.AVI', '.SWF',
#compressions and other
'.zip', '.rar', '.css', '.flv', '.php',
'.ZIP', '.RAR', '.CSS', '.FLV', '.PHP',
]):
continue
#We need to save original url for xpath, in case we change it later (join it with base_url)
url_xpath = url
#If url doesn't start with "http", it is relative url, and we add base url to get absolute url.
# -- It is true, that we may get some strange urls, but it is fine for now.
if not (url.startswith("http")):
url = urljoin(base_url,url)
#Counting unique links.
#if url not in self.unique_urls:
# self.unique_urls.append(url)
# unique.write(str(url) + "\n")
#We don't want to go to other websites. We want to stay on our website, so we keep only urls with domain (netloc) of the company we are investigating.
if (urlparse(url).netloc == urlparse(base_url).netloc):
#The main part. We look for webpages, whose urls include one of the employment words as strings.
#We will check the tab of the url as well. This is additional filter, suggested by Dan Wu, to improve accuracy.
tabs = response.xpath('//a[@href="%s"]/text()' % url_xpath).extract()
# Sometimes tabs can be just empty spaces like '\t' and '\n' so in this case we replace it with [].
# That was the case when the spider didn't find this employment url: http://www.terme-krka.com/si/sl/o-termah-krka/o-podjetju-in-skupini-krka/zaposlitev/
tabs = [tab.encode('utf-8') for tab in tabs]
tabs = [tab.replace('\t', '') for tab in tabs]
tabs = [tab.replace('\n', '') for tab in tabs]
tab_empty = True
for tab in tabs:
if tab != '':
tab_empty = False
if tab_empty == True:
tabs = []
# -- Instruction.
# -- Users in other languages, please insert employment words in your own language, like jobs, vacancies, career, employment ... --
# Starting keyword_url is zero, then we add keywords as we find them in url. This is for tracking purposes.
keyword_url = ''
#if any(x in url for x in keywords):
for keyword in url_whitelist:
if keyword in url:
keyword_url = keyword_url + keyword + ' '
# If we find at least one keyword in url, we continue.
if keyword_url != '':
#1. Tabs are empty.
if tabs == []:
#print "No text for url: " + str(url)
#We found url that includes one of the magic words and also the text includes a magic word.
#We check url, if we have found it before. If it is new, we add it to the list "jobs_urls".
if url not in self.jobs_urls:
self.jobs_urls.append(url)
item = JobItem()
item["url"] = url
#item["keyword_url"] = keyword_url
#item["keyword_url_tab"] = ' '
#item["keyword_tab"] = ' '
print url
#We return the item.
yield item
#2. There are texts, one or more.
else:
#For the same partial url several texts are possible.
for tab in tabs:
keyword_url_tab = ''
for key in tab_whitelist:
if key in tab:
keyword_url_tab = keyword_url_tab + key + ' '
if keyword_url_tab != '':
# keyword_url_tab starts with keyword_url from before, because we want to remember keywords from both url and tab.
keyword_url_tab = 'URL ' + keyword_url + ' TAB ' + keyword_url_tab
#if any(x in text for x in keywords):
#We found url that includes one of the magic words and also the tab includes a magic word.
#We check url, if we have found it before. If it is new, we add it to the list "jobs_urls".
if url not in self.jobs_urls:
self.jobs_urls.append(url)
item = JobItem()
item["url"] = url
#item["keyword_url"] = ' '
#item["keyword_url_tab"] = keyword_url_tab
#item["keyword_tab"] = ' '
print url
#We return the item.
yield item
else:
for tab in tabs:
#print "TABS " + str(tabs)
#print "TAB " + str(type(tab))
keyword_tab = ''
for key in tab_whitelist:
#print "KEY " + str(type(key))
if key in tab:
keyword_tab = keyword_tab + key + ' '
if keyword_tab != '':
if url not in self.jobs_urls:
self.jobs_urls.append(url)
item = JobItem()
item["url"] = url
#item["keyword_url"] = ' '
#item["keyword_url_tab"] = ' '
#item["keyword_tab"] = keyword_tab
print url
#We return the item.
yield item
#We don't put "else" sentence because we want to further explore the employment webpage to find possible new employment webpages.
#We keep looking for employment webpages, until we reach the DEPTH, that we have set in settings.py.
yield Request(url, callback = self.parse)
Your help is greatly appreciated!
DONE
I have found a solution: I wrote a programme that creates lots of spiders, 122 in my case, by copying and modifying the initial spider. The modification is that each spider reads the next ten urls from the list, so that together the spiders cover the whole list, 10 urls each, and work in parallel. This way 123 spiders are released to the network at the same time.
At the same time, the programme creates a .bat file with 123 commands that launch the spiders, so that I don't have to open 123 command lines.
#Programme that generates spiders
#Inital parameter to determine number of spiders. There are 1226 urls, so we set it to 122 spiders, so that the last piece will be 1220 to 1230. There is also initial spider, that crawls webpages 0 to 10, so there will be 123 spiders.
j = 122
#Prepare bat file with commands, that will throw all spiders at the same time to the network.
bat = open("G:/myVE/vacancies_januar/commands.bat", "w")
bat.write("cd \"G:\\myVE\\vacancies_januar\"\n")
bat.write("start scrapy crawl jobs_0_10 -o podjetja_0_10_url.csv -t csv --logfile podjetja_0_10_log.txt\n")
#Loop that grows spiders from initial spider_0_10.
for i in range(0,j):
with open("G:/myVE/vacancies_januar/vacancies/spiders/job_spider_0_10.py", "r+") as prgm:
program = prgm.read()
#Just replace 0_10 with 10_20 and so on.
program = program.replace("0_10", str((i+1)*10)+"_"+str((i+1)*10+10))
program = program.replace("0:10", str((i+1)*10)+":"+str((i+1)*10+10))
#Generate new spider.
dest = open("G:/myVE/vacancies_januar/vacancies/spiders/job_spider_"+str((i+1)*10)+"_"+str((i+1)*10+10)+".py", "w")
dest.write(program)
#At the same time write the command into bat file.
bat.write("start scrapy crawl jobs_"+str((i+1)*10)+"_"+str((i+1)*10+10)+" -o podjetja_"+str((i+1)*10)+"_"+str((i+1)*10+10)+"_url.csv -t csv --logfile podjetja_"+str((i+1)*10)+"_"+str((i+1)*10+10)+"_log.txt\n")
Why are you specifying the path to the Python spider? Isn't specifying the spider name (jobs) enough?
I would expect this to work just as well:
scrapy crawl jobs -o urls.csv -t csv --logfile log.txt
As for splitting the job, why not write a Python wrapper that takes the number of concurrent spiders, divides the URL list into that many pieces, and launches the spider(s)?
EDIT
Caveat: I am not well versed in scrapy use.
Here's a sample program that, given a (big?) file of urls, splits them into smaller chunks and creates a new process for each one, attempting to invoke scrapy in each child process. This would effectively run <n> scrapy processes at once on different sets of URLs.
#!python2
import multiprocessing, subprocess, sys, tempfile, math, os

def run_chunk(spider, proj_dir, urllist):
    os.chdir(proj_dir)
    with tempfile.NamedTemporaryFile(mode='wt', prefix='urllist') as urlfile:
        urlfile.write("\n".join(urllist))
        urlfile.flush()
        command = [
            'scrapy',
            'crawl',
            '-a', 'urls=' + urlfile.name,
            spider,
        ]
        subprocess.check_call(command)
    print("Child Finished!")

# http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python/312464#312464
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

if __name__ == '__main__':
    # ... or use argparse or some other run-time configuration tool
    spider = 'jobs'
    input_urls = 'biglist.urls.txt'  # one URL per line
    project_dir = 'jobs_scrapy_dir'
    num_children = 10

    # Split the URLs into chunks; assign chunks to workers.
    urls = open(input_urls, 'rt').readlines()
    per_chunk = int(math.ceil(len(urls) / float(num_children)))
    workers = [multiprocessing.Process(target=run_chunk,
                                       args=(spider, project_dir, chunk))
               for chunk in chunks(urls, per_chunk)]

    # Start all the workers.
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("Finished!")
Based on a very cursory reading, scrapy has some concept of parallelization, so I cannot say that this is the best way to make use of scrapy. The code sample here works insofar as it splits the file into pieces and launches child processes. The command given to subprocess.check_call() invocation will probably need to be tweaked in order to pass a file full of urls to a spider instance.
Limitations
The entire file of URL's is read into memory at once, then split into pieces. This means that 2x the space of the URL file is used. There are smarter ways of doing this job. My implementation is just a quick demo of one possibility.
The last chunk may be significantly smaller than all the others. The process will likely take as long as a full chunk, so this probably doesn't matter much, but balancing the load more evenly may be advantageous.
The scrapy syntax may not be correct, and the spider may have to be updated to accept a file parameter.
I did not test the scrapy invocation. The OP didn't post any project details, and the script itself didn't work out of the box, so I had no way to really test that part. Fixing the project/invocation is left as an exercise for the reader.
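For what it's worth, the splitting step on its own can also be done outside Python. This is a sketch assuming a Unix-like shell is available (on Windows that would mean something like Git Bash); the file and chunk-prefix names are made up. Coreutils split cuts the URL list into fixed-size pieces that a wrapper could then feed to separate crawl invocations:

```shell
# Demo: fabricate a 25-line URL list, then cut it into 10-line chunks.
printf 'http://example%d.si\n' $(seq 1 25) > biglist.urls.txt

# -l 10: 10 lines per chunk; -d: numeric suffixes (urlchunk_00, urlchunk_01, ...)
split -l 10 -d biglist.urls.txt urlchunk_

ls urlchunk_*
```

Unlike reading the whole file into memory, split streams it, and the last chunk simply comes out shorter (here 5 lines).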

Multiple outputs with a for loop in praat

I have a script where I have multiple folders, each with three audio files in them: ID#_1, ID#_2, and ID#_3. The user can input a string of different ID#s, one after the other, and the script then recognizes the different IDs and runs the code for each of them.
I have a for loop set up for this:
form Settings
comment Enter the IDs of the different subjects
sentence subjectIDs
endform
numOfSubjects = length(subjectIDs$)/4
for i from 0 to (numOfSubjects - 1)
subjectID$ = mid$(subjectIDs$, 1 + 4*i, 4 + 4*i)
outFile$ = subjectID$ + "/SubjectResponseOnsets" + subjectID$ + ".txt"
path$ = subjectID$ + "/" + subjectID$
@firstOutput
@secondOutput
@thirdOutput
Each of these procedures is defined earlier in the code; they basically write certain ranges from the audio files out to a text file.
The code seems to work fine and generates the output file correctly when one ID is given, but when I try to run it with more than one ID at a time, only the text file for the first ID is produced.
The for loop does not seem to be working well, even though the code works fine on the first iteration.
I would greatly appreciate any help!
I don't know if I understood well what your script was trying to do, since the snippet you pasted was incomplete. It's best if you provide code that is executable as is. In this case, you were missing the closing endfor, and you were calling some procedures that were not defined in your snippet (not even as placeholders). I had to write some dummy procedures just to make it run.
Since you also didn't say how your script was failing, it was unclear what needed to be fixed. So I took a stab at making it work.
It sounded as if your ID splitting code was giving you some problems. I took the split procedure from the utils plugin available through CPrAN, which makes inputting the IDs easier (full disclosure: I wrote that plugin).
form Settings
comment Enter the IDs of the different subjects
sentence subjectIDs 01 02 03
endform
@split: " ", subjectIDs$
numOfSubjects = split.length
for i to numOfSubjects
subjectID$ = split.return$[i]
path$ = subjectID$
outFile$ = path$ + "/SubjectResponseOnsets" + subjectID$ + ".txt"
# Make sure output directory exists
createDirectory: path$
@firstOutput
@secondOutput
@thirdOutput
endfor
procedure firstOutput ()
appendFileLine: outFile$, "First"
endproc
procedure secondOutput ()
appendFileLine: outFile$, "Second"
endproc
procedure thirdOutput ()
appendFileLine: outFile$, "Third"
endproc
# split procedure from the utils CPrAN plugin
# http://cpran.net/plugins/utils
procedure split (.sep$, .str$)
.seplen = length(.sep$)
.length = 0
repeat
.strlen = length(.str$)
.sep = index(.str$, .sep$)
if .sep > 0
.part$ = left$(.str$, .sep-1)
.str$ = mid$(.str$, .sep+.seplen, .strlen)
else
.part$ = .str$
endif
.length = .length+1
.return$[.length] = .part$
until .sep = 0
endproc
If this is not what you are having trouble with, you'll have to be more specific.
