How to run a spider from bat file for multiple urls? - batch-file

I wanted to prepare a multiple.bat file to run several spiders, so I first tried to prepare a multiple.bat file for a single spider. I got stuck here, with this error:
G:\myVE\vacancies>multiple.bat

G:\myVE\vacancies>scrapy crawl jobs -o urls.csv -t csv --logfile log.txt G:\myVE\vacancies\vacancies\spiders\job_spider.py
G:\myVE\vacancies\vacancies\spiders\job_spider.py:12: ScrapyDeprecationWarning: `Settings.overrides` attribute is deprecated and won't be supported in Scrapy 0.26, use `Settings.set(name, value, priority='cmdline')` instead
  settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl',}
['http://1nadan.si']

Usage
=====
  scrapy crawl [options] <spider>

crawl: error: running 'scrapy crawl' with more than one spider is no longer supported
From this question
How to give URL to scrapy for crawling?
it looks like the problem would be that the spider is reading several urls into start_urls, but that is not the case: there is only one url in the spider, and it works normally when started from the command line. So why is this error happening? Maybe because I have several spiders in adjacent directories, but that does not make sense. My final goal is to split a list of 1300 urls into 130 chunks of 10 urls and launch 130 spiders from the multiple.bat file. The aim is to reduce the scraping time, so that I can have results in two hours instead of two days; right now I split the 1300 urls into 13 chunks of 100 urls, launch 13 spiders, and it takes two days to scrape everything.
Here is my multiple.bat code
scrapy crawl jobs -o urls.csv -t csv --logfile log.txt G:\myVE\vacancies\vacancies\spiders\job_spider.py
and here is the code for my spider:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# encoding=UTF-8

import scrapy, urlparse, time, sys
from scrapy.http import Request
from scrapy.utils.response import get_base_url
from urlparse import urlparse, urljoin
from vacancies.items import JobItem

#We need that in order to force Slovenian pages instead of English pages. It happened at "http://www.g-gmi.si/gmiweb/" that only English pages were found and no Slovenian.
from scrapy.conf import settings
settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl',}
#Settings.set(name, value, priority='cmdline')
#settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl','en':q=0.8,}

#start_time = time.time()

# We run the programme in the command line with this command:
# scrapy crawl jobs -o urls.csv -t csv --logfile log.txt
# We get two output files
# 1) urls.csv
# 2) log.txt

# Url whitelist.
with open("Q:/SIIT/JV_Marko_Boro/Detector/kljucne_besede/url_whitelist.txt", "r+") as kw:
    url_whitelist = kw.read().replace('\n', '').split(",")
    url_whitelist = map(str.strip, url_whitelist)

# Tab whitelist.
# We need to replace character the same way as in detector.
with open("Q:/SIIT/JV_Marko_Boro/Detector/kljucne_besede/tab_whitelist.txt", "r+") as kw:
    tab_whitelist = kw.read().decode(sys.stdin.encoding).encode('utf-8')
    tab_whitelist = tab_whitelist.replace('Ŕ', 'č')
    tab_whitelist = tab_whitelist.replace('L', 'č')
    tab_whitelist = tab_whitelist.replace('Ő', 'š')
    tab_whitelist = tab_whitelist.replace('Ü', 'š')
    tab_whitelist = tab_whitelist.replace('Ä', 'ž')
    tab_whitelist = tab_whitelist.replace('×', 'ž')
    tab_whitelist = tab_whitelist.replace('\n', '').split(",")
    tab_whitelist = map(str.strip, tab_whitelist)

#File to write unique links
#unique = open("G:/myVE/vacancies/unique_urls.txt", "wb")
#izloceni = open("G:/myVE/vacancies/izloceni.txt", "wb")
class JobSpider(scrapy.Spider):
    name = "jobs"

    #Test sample of SLO companies
    start_urls = [
        "http://1nadan.si"
    ]
    print start_urls

    #Result of the programme is this list of job vacancies webpages.
    jobs_urls = []

    #I would like to see how many unique links we check on every page.
    #unique_urls = []

    def parse(self, response):
        response.selector.remove_namespaces()

        #Take url of response, because we would like to stay on the same domain.
        net1 = urlparse(response.url).netloc
        #print "Net 1 " + str(net1)

        #Base url.
        base_url = get_base_url(response)
        #print "Base url " + str(base_url)

        #We take all urls, they are marked by "href". These are either webpages on our website either new websites.
        urls = response.xpath('//@href').extract()
        #print urls

        #Loop through all urls on the webpage.
        for url in urls:
            url = url.strip()

            #Counting unique links.
            #if url not in self.unique_urls:
            #    self.unique_urls.append(url)
            #    unique.write(str(url.encode('utf-8')) + "\n")

            #Ignore ftp and sftp.
            if url.startswith("ftp") or url.startswith("sftp"):
                continue

            #If url doesn't start with "http", it is relative url, and we add base url to get absolute url.
            # -- It is true, that we may get some strange urls, but it is fine for now.
            if not (url.startswith("http")):
                url = urljoin(base_url, url)

            #This is very strict condition. If seed website loses or gets www., then it will be ignored, as the condition very strictly checks the link.
            #o = urlparse(url)
            #test = o.scheme + "://" + o.netloc
            #print "Url : " + url
            #print "Test: " + test
            #if test in self.start_urls:
            #    print "Test OK"
            #if test not in self.start_urls:
            #    print "Test NOT OK - continue"
            #    izloceni.write(str(url) + "\n")
            #    continue

            #Compare each url on the webpage with original url, so that spider doesn't wander away on the net.
            net2 = urlparse(url).netloc
            if net2 != net1:
                continue

            #If url includes characters like ?, %, &, # ... it is LIKELY NOT to be the one we are looking for and we ignore it.
            #However in this case we exclude good urls like http://www.mdm.si/company#employment
            if any(x in url for x in ['%', '~',
                    #images
                    '.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico', '.svg', '.tif', '.tiff',
                    '.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO', '.SVG', '.TIF', '.TIFF',
                    #documents
                    '.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf', '.pd',
                    '.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF', '.PD',
                    #music and video
                    '.mp3', '.mp4', '.mpg', '.ai', '.avi', '.swf',
                    '.MP3', '.MP4', '.MPG', '.AI', '.AVI', '.SWF',
                    #compressions and other
                    '.zip', '.rar', '.css', '.flv', '.php',
                    '.ZIP', '.RAR', '.CSS', '.FLV', '.PHP',
                    #Twitter, Facebook
                    '://twitter.com', '://mobile.twitter.com', 'www.facebook.com', 'www.twitter.com'
                    ]):
                continue

            #We need to save original url for xpath, in case we change it later (join it with base_url)
            url_xpath = url

            #We don't want to go to other websites. We want to stay on our website, so we keep only urls with domain (netloc) of the company we are investigating.
            if (urlparse(url).netloc == urlparse(base_url).netloc):

                #The main part. We look for webpages, whose urls include one of the employment words as strings.
                #We will check the tab of the url as well. This is additional filter, suggested by Dan Wu, to improve accuracy.
                tabs = response.xpath('//a[@href="%s"]/text()' % url_xpath).extract()

                # Sometimes tabs can be just empty spaces like '\t' and '\n' so in this case we replace it with [].
                # That was the case when the spider didn't find this employment url: http://www.terme-krka.com/si/sl/o-termah-krka/o-podjetju-in-skupini-krka/zaposlitev/
                tabs = [tab.encode('utf-8') for tab in tabs]
                tabs = [tab.replace('\t', '') for tab in tabs]
                tabs = [tab.replace('\n', '') for tab in tabs]
                tab_empty = True
                for tab in tabs:
                    if tab != '':
                        tab_empty = False
                if tab_empty == True:
                    tabs = []

                # -- Instruction.
                # -- Users in other languages, please insert employment words in your own language, like jobs, vacancies, career, employment ... --
                # Starting keyword_url is zero, then we add keywords as we find them in url. This is for tracking purposes.
                keyword_url = ''
                #if any(x in url for x in keywords):
                for keyword in url_whitelist:
                    if keyword in url:
                        keyword_url = keyword_url + keyword + ' '

                # If we find at least one keyword in url, we continue.
                if keyword_url != '':

                    #1. Tabs are empty.
                    if tabs == []:
                        #print "No text for url: " + str(url)

                        #We found url that includes one of the magic words and also the text includes a magic word.
                        #We check url, if we have found it before. If it is new, we add it to the list "jobs_urls".
                        if url not in self.jobs_urls:
                            self.jobs_urls.append(url)
                            item = JobItem()
                            item["url"] = url
                            #item["keyword_url"] = keyword_url
                            #item["keyword_url_tab"] = ' '
                            #item["keyword_tab"] = ' '
                            print url

                            #We return the item.
                            yield item

                    #2. There are texts, one or more.
                    else:
                        #For the same partial url several texts are possible.
                        for tab in tabs:
                            keyword_url_tab = ''
                            for key in tab_whitelist:
                                if key in tab:
                                    keyword_url_tab = keyword_url_tab + key + ' '
                            if keyword_url_tab != '':
                                # keyword_url_tab starts with keyword_url from before, because we want to remember keywords from both url and tab.
                                keyword_url_tab = 'URL ' + keyword_url + ' TAB ' + keyword_url_tab
                                #if any(x in text for x in keywords):

                                #We found url that includes one of the magic words and also the tab includes a magic word.
                                #We check url, if we have found it before. If it is new, we add it to the list "jobs_urls".
                                if url not in self.jobs_urls:
                                    self.jobs_urls.append(url)
                                    item = JobItem()
                                    item["url"] = url
                                    #item["keyword_url"] = ' '
                                    #item["keyword_url_tab"] = keyword_url_tab
                                    #item["keyword_tab"] = ' '
                                    print url

                                    #We return the item.
                                    yield item

                else:
                    for tab in tabs:
                        #print "TABS " + str(tabs)
                        #print "TAB " + str(type(tab))
                        keyword_tab = ''
                        for key in tab_whitelist:
                            #print "KEY " + str(type(key))
                            if key in tab:
                                keyword_tab = keyword_tab + key + ' '
                        if keyword_tab != '':
                            if url not in self.jobs_urls:
                                self.jobs_urls.append(url)
                                item = JobItem()
                                item["url"] = url
                                #item["keyword_url"] = ' '
                                #item["keyword_url_tab"] = ' '
                                #item["keyword_tab"] = keyword_tab
                                print url

                                #We return the item.
                                yield item

                #We don't put "else" sentence because we want to further explore the employment webpage to find possible new employment webpages.
                #We keep looking for employment webpages, until we reach the DEPTH, that we have set in settings.py.
                yield Request(url, callback = self.parse)
        response.selector.remove_namespaces()

        #We take all urls, they are marked by "href". These are either webpages on our website either new websites.
        urls = response.xpath('//@href').extract()

        #Base url.
        base_url = get_base_url(response)

        #Loop through all urls on the webpage.
        for url in urls:
            url = url.strip()
            url = url.encode('utf-8')

            #Ignore ftp.
            if url.startswith("ftp"):
                continue

            #If url doesn't start with "http", it is relative url, and we add base url to get absolute url.
            # -- It is true, that we may get some strange urls, but it is fine for now.
            if not (url.startswith("http")):
                url = urljoin(base_url, url)

            #If url includes characters like ?, %, &, # ... it is LIKELY NOT to be the one we are looking for and we ignore it.
            #However in this case we exclude good urls like http://www.mdm.si/company#employment
            if any(x in url for x in ['%', '~',
                    #images
                    '.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico', '.svg', '.tif', '.tiff',
                    '.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO', '.SVG', '.TIF', '.TIFF',
                    #documents
                    '.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf', '.pd',
                    '.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF', '.PD',
                    #music and video
                    '.mp3', '.mp4', '.mpg', '.ai', '.avi', '.swf',
                    '.MP3', '.MP4', '.MPG', '.AI', '.AVI', '.SWF',
                    #compressions and other
                    '.zip', '.rar', '.css', '.flv', '.php',
                    '.ZIP', '.RAR', '.CSS', '.FLV', '.PHP',
                    ]):
                continue

            #We need to save original url for xpath, in case we change it later (join it with base_url)
            url_xpath = url

            #If url doesn't start with "http", it is relative url, and we add base url to get absolute url.
            # -- It is true, that we may get some strange urls, but it is fine for now.
            if not (url.startswith("http")):
                url = urljoin(base_url, url)

            #Counting unique links.
            #if url not in self.unique_urls:
            #    self.unique_urls.append(url)
            #    unique.write(str(url) + "\n")

            #We don't want to go to other websites. We want to stay on our website, so we keep only urls with domain (netloc) of the company we are investigating.
            if (urlparse(url).netloc == urlparse(base_url).netloc):

                #The main part. We look for webpages, whose urls include one of the employment words as strings.
                #We will check the tab of the url as well. This is additional filter, suggested by Dan Wu, to improve accuracy.
                tabs = response.xpath('//a[@href="%s"]/text()' % url_xpath).extract()

                # Sometimes tabs can be just empty spaces like '\t' and '\n' so in this case we replace it with [].
                # That was the case when the spider didn't find this employment url: http://www.terme-krka.com/si/sl/o-termah-krka/o-podjetju-in-skupini-krka/zaposlitev/
                tabs = [tab.encode('utf-8') for tab in tabs]
                tabs = [tab.replace('\t', '') for tab in tabs]
                tabs = [tab.replace('\n', '') for tab in tabs]
                tab_empty = True
                for tab in tabs:
                    if tab != '':
                        tab_empty = False
                if tab_empty == True:
                    tabs = []

                # -- Instruction.
                # -- Users in other languages, please insert employment words in your own language, like jobs, vacancies, career, employment ... --
                # Starting keyword_url is zero, then we add keywords as we find them in url. This is for tracking purposes.
                keyword_url = ''
                #if any(x in url for x in keywords):
                for keyword in url_whitelist:
                    if keyword in url:
                        keyword_url = keyword_url + keyword + ' '

                # If we find at least one keyword in url, we continue.
                if keyword_url != '':

                    #1. Tabs are empty.
                    if tabs == []:
                        #print "No text for url: " + str(url)

                        #We found url that includes one of the magic words and also the text includes a magic word.
                        #We check url, if we have found it before. If it is new, we add it to the list "jobs_urls".
                        if url not in self.jobs_urls:
                            self.jobs_urls.append(url)
                            item = JobItem()
                            item["url"] = url
                            #item["keyword_url"] = keyword_url
                            #item["keyword_url_tab"] = ' '
                            #item["keyword_tab"] = ' '
                            print url

                            #We return the item.
                            yield item

                    #2. There are texts, one or more.
                    else:
                        #For the same partial url several texts are possible.
                        for tab in tabs:
                            keyword_url_tab = ''
                            for key in tab_whitelist:
                                if key in tab:
                                    keyword_url_tab = keyword_url_tab + key + ' '
                            if keyword_url_tab != '':
                                # keyword_url_tab starts with keyword_url from before, because we want to remember keywords from both url and tab.
                                keyword_url_tab = 'URL ' + keyword_url + ' TAB ' + keyword_url_tab
                                #if any(x in text for x in keywords):

                                #We found url that includes one of the magic words and also the tab includes a magic word.
                                #We check url, if we have found it before. If it is new, we add it to the list "jobs_urls".
                                if url not in self.jobs_urls:
                                    self.jobs_urls.append(url)
                                    item = JobItem()
                                    item["url"] = url
                                    #item["keyword_url"] = ' '
                                    #item["keyword_url_tab"] = keyword_url_tab
                                    #item["keyword_tab"] = ' '
                                    print url

                                    #We return the item.
                                    yield item

                else:
                    for tab in tabs:
                        #print "TABS " + str(tabs)
                        #print "TAB " + str(type(tab))
                        keyword_tab = ''
                        for key in tab_whitelist:
                            #print "KEY " + str(type(key))
                            if key in tab:
                                keyword_tab = keyword_tab + key + ' '
                        if keyword_tab != '':
                            if url not in self.jobs_urls:
                                self.jobs_urls.append(url)
                                item = JobItem()
                                item["url"] = url
                                #item["keyword_url"] = ' '
                                #item["keyword_url_tab"] = ' '
                                #item["keyword_tab"] = keyword_tab
                                print url

                                #We return the item.
                                yield item

                #We don't put "else" sentence because we want to further explore the employment webpage to find possible new employment webpages.
                #We keep looking for employment webpages, until we reach the DEPTH, that we have set in settings.py.
                yield Request(url, callback = self.parse)
Your help is greatly appreciated!
DONE
I have found a solution by writing a programme that creates lots of spiders, 122 in my case, by copying and modifying the initial spider. The modification is that each spider reads the next ten urls from the list, so that together the spiders cover the whole list, 10 urls each, and work in parallel. This way 123 spiders (the 122 generated ones plus the initial one) are released to the network at the same time.
At the same time the programme creates a .bat file with 123 commands that launch the spiders, so that I don't have to open 123 command lines.
#Programme that generates spiders

#Initial parameter to determine number of spiders. There are 1226 urls, so we set it to 122 spiders, so that the last piece will be 1220 to 1230. There is also the initial spider, that crawls webpages 0 to 10, so there will be 123 spiders.
j = 122

#Prepare bat file with commands, that will throw all spiders at the same time to the network.
bat = open("G:/myVE/vacancies_januar/commands.bat", "w")
bat.write("cd \"G:\\myVE\\vacancies_januar\"\n")
bat.write("start scrapy crawl jobs_0_10 -o podjetja_0_10_url.csv -t csv --logfile podjetja_0_10_log.txt\n")

#Loop that grows spiders from initial spider_0_10.
for i in range(0, j):
    with open("G:/myVE/vacancies_januar/vacancies/spiders/job_spider_0_10.py", "r+") as prgm:
        program = prgm.read()

    #Just replace 0_10 with 10_20 and so on.
    program = program.replace("0_10", str((i+1)*10) + "_" + str((i+1)*10+10))
    program = program.replace("0:10", str((i+1)*10) + ":" + str((i+1)*10+10))

    #Generate new spider.
    dest = open("G:/myVE/vacancies_januar/vacancies/spiders/job_spider_" + str((i+1)*10) + "_" + str((i+1)*10+10) + ".py", "w")
    dest.write(program)

    #At the same time write the command into bat file.
    bat.write("start scrapy crawl jobs_" + str((i+1)*10) + "_" + str((i+1)*10+10) + " -o podjetja_" + str((i+1)*10) + "_" + str((i+1)*10+10) + "_url.csv -t csv --logfile podjetja_" + str((i+1)*10) + "_" + str((i+1)*10+10) + "_log.txt\n")

Why are you specifying the path to the Python spider? Isn't specifying the spider name (jobs) enough?
I would expect this would work just as well:
scrapy crawl jobs -o urls.csv -t csv --logfile log.txt
As for splitting the job, why not write a Python wrapper that takes the number of concurrent spiders, divides the URL list into that many pieces, and launches the spider(s)?
EDIT
Caveat: I am not well versed in scrapy use.
Here's a sample program that, given a (big?) file of urls, splits them into smaller chunks and creates a new process for each one, attempting to invoke scrapy in each child process. This would effectively run <n> scrapy processes at once on different sets of URLs.
#!python2
import multiprocessing, subprocess, sys, tempfile, math, os

def run_chunk(spider, proj_dir, urllist):
    os.chdir(proj_dir)
    with tempfile.NamedTemporaryFile(mode='wt', prefix='urllist') as urlfile:
        # Write one URL per line for the spider to read.
        urlfile.write("\n".join(urllist))
        urlfile.flush()
        command = [
            'scrapy',
            'crawl',
            '-a', 'urls=' + urlfile.name,
            spider,
        ]
        subprocess.check_call(command)
    print("Child Finished!")

# http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python/312464#312464
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

if __name__ == '__main__':
    # ... or use argparse or some other run-time configuration tool
    spider = 'jobs'
    input_urls = 'biglist.urls.txt'   # one URL per line
    project_dir = 'jobs_scrapy_dir'
    num_children = 10

    # Split the URLs into chunks; assign chunks to workers.
    urls = open(input_urls, 'rt').read().splitlines()
    per_chunk = int(math.ceil(len(urls) / float(num_children)))
    workers = [multiprocessing.Process(target=run_chunk,
                                       args=(spider, project_dir, chunk))
               for chunk in chunks(urls, per_chunk)]

    # Start all the workers.
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    print("Finished!")
Based on a very cursory reading, scrapy has some concept of parallelization of its own, so I cannot say this is the best way to make use of scrapy. The code sample here works insofar as it splits the file into pieces and launches child processes. The command passed to subprocess.check_call() will probably need to be tweaked in order to pass a file full of urls to a spider instance.
Limitations
The entire file of URLs is read into memory at once, then split into pieces, so roughly twice the space of the URL file is used. There are smarter ways of doing this job; my implementation is just a quick demo of one possibility.
The last chunk may be significantly smaller than all the others. Each process will likely take about as long as a full chunk, so this probably doesn't matter much, but balancing the load more evenly may be advantageous.
The scrapy syntax may not be correct, and the spider may have to be updated to accept a file parameter (a sketch of one way to do that is shown below).
I did not test the scrapy invocation. The OP didn't post any project details, and the script itself didn't work out of the box, so I had no way to really test that part. Fixing the project/invocation is left as an exercise for the reader.
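For reference, here is a minimal, untested sketch (not taken from the OP's project) of what such a file-reading spider could look like. Scrapy passes -a key=value arguments to the spider's __init__ as keyword arguments, so the urls argument used in the command above could be consumed like this:

import scrapy

class JobSpider(scrapy.Spider):
    name = "jobs"

    def __init__(self, urls=None, *args, **kwargs):
        # "urls" arrives from the command line via: scrapy crawl jobs -a urls=<file>
        super(JobSpider, self).__init__(*args, **kwargs)
        if urls:
            # One URL per line; blank lines are ignored.
            with open(urls) as f:
                self.start_urls = [line.strip() for line in f if line.strip()]

    def parse(self, response):
        # ... the OP's existing parsing logic would go here ...
        pass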

Related

Change ID in multiple FASTA files

I need to rename multiple sequences in multiple fasta files and I found this script in order to do so for a single ID:
from Bio import SeqIO

original_file = "./original.fasta"
corrected_file = "./corrected.fasta"

with open(original_file) as original, open(corrected_file, 'w') as corrected:
    records = SeqIO.parse(original_file, 'fasta')
    for record in records:
        print record.id
        if record.id == 'foo':
            record.id = 'bar'
            record.description = 'bar'  # <- Add this line
            print record.id
        SeqIO.write(record, corrected, 'fasta')
Each fasta file corresponds to a single organism, but the organism is not specified in the IDs. I have the original fasta files (the ones these were translated from) with the same filenames but in different directories, and their IDs include the name of each organism.
I want to figure out how to loop through all these fasta files and rename each ID in each file with the corresponding organism name.
OK, here is my effort. I had to use my own input folders/files since they were not specified in the question.
/old folder contains files :
MW628877.1.fasta :
>MW628877.1 Streptococcus agalactiae strain RYG82 DNA gyrase subunit A (gyrA) gene, complete cds
ATGCAAGATAAAAATTTAGTAGATGTTAATCTAACTAGTGAAATGAAAACGAGTTTTATCGATTACGCCA
TGAGTGTCATTGTTGCTCGTGCACTTCCAGATGTTAGAGATGGTTTAAAACCTGTTCATCGTCGTATTTT
>KY347969.1 Neisseria gonorrhoeae strain 1448 DNA gyrase subunit A (gyrA) gene, partial cds
CGGCGCGTACCGTACGCGATGCACGAGCTGAAAAATAACTGGAATGCCGCCTACAAAAAATCGGCGCGCA
TCGTCGGCGACGTCATCGGTAAATACCACCCCCACGGCGATTTCGCAGTTTACGGCACCATCGTCCGTAT
MG995190.1.fasta :
>MG995190.1 Mycobacterium tuberculosis strain UKR100 GyrA (gyrA) gene, complete cds
ATGACAGACACGACGTTGCCGCCTGACGACTCGCTCGACCGGATCGAACCGGTTGACATCCAGCAGGAGA
TGCAGCGCAGCTACATCGACTATGCGATGAGCGTGATCGTCGGCCGCGCGCTGCCGGAGGTGCGCGACGG
and an /empty folder.
/new folder contains files :
MW628877.1.fasta :
>MW628877.1
MQDKNLVDVNLTSEMKTSFIDYAMSVIVARALPDVRDGLKPVHRRI
>KY347969.1
RRVPYAMHELKNNWNAAYKKSARIVGDVIGKYHPHGDFAVYGTIVR
MG995190.1.fasta :
>MG995190.1
MTDTTLPPDDSLDRIEPVDIQQEMQRSYIDYAMSVIVGRALPEVRD
My code is:
from Bio import SeqIO
from os import scandir

old = './old'
new = './new'

old_ids_dict = {}

for filename in scandir(old):
    if filename.is_file():
        print(filename)
        for seq_record in SeqIO.parse(filename, "fasta"):
            old_ids_dict[seq_record.id] = ' '.join(seq_record.description.split(' ')[1:3])

print('_____________________')
print('old ids ---> ', old_ids_dict)
print('_____________________')

for filename in scandir(new):
    if filename.is_file():
        sequences = []
        for seq_record in SeqIO.parse(filename, "fasta"):
            if seq_record.id in old_ids_dict.keys():
                print('### ', seq_record.id, ' ', old_ids_dict[seq_record.id])
                seq_record.id += '.' + old_ids_dict[seq_record.id]
                seq_record.description = ''
                print('-->', seq_record.id)
                print(seq_record)
            sequences.append(seq_record)
        SeqIO.write(sequences, filename, 'fasta')
Check how it works; it actually overwrites both files in the /new folder.
As pointed out by @Vovin in his comment, it needs to be adapted to your files' from-to template.
I am sure there is more than one way to do this, probably better and more pythonic than my way; I am learning too. Let us know.

shiny selecting specific columns from uploaded data frame

I have merged different sources of code to make an app that allows one to upload a file (data frame).
Beyond this, however, I would also like to make it possible to select specific columns from the data frame and analyse them. This is difficult because one must predefine the given data frame in order to be able to refer to it in the ui.R script.
So when a previously undefined data frame is uploaded to the site, one cannot refer to it in ui.R, as it is only defined in the server.
predefined variables
vchoices <- 1:ncol(mtcars)
names(vchoices) <- names(mtcars)
ui.R
runApp(
  ui = basicPage(
    h2('The uploaded file data'),
    dataTableOutput('mytable'),
    fileInput('file', 'Choose info-file to upload',
              accept = c(
                'text/csv',
                'text/comma-separated-values',
                'text/tab-separated-values',
                'text/plain',
                '.csv',
                '.tsv'
              )
    ),
    actionButton("choice", "incorporate external information"),
    selectInput("columns", "Select Columns", choices = vchoices, inline = T),
    # notice that the 'choices' in selectInput are set to the predefined
    # variables above whereas I would like to set them equal to the
    # not yet defined uploaded file below in server.R
    tableOutput("table_display")
  ))
Notice that the 'choices' in selectInput are set to the predefined variables above whereas I would like to set them equal to the not yet defined uploaded file below in server.R
server.R
server = function(input, output) {
  info <- eventReactive(input$choice, {
    inFile <- input$file
    if (is.null(inFile))
      return(NULL)
    isolate(f <- read.table(inFile$datapath, header = T,
                            sep = "\t"))
    f
  })
  output$table_display <- renderTable({
    f <- info()
    f <- subset(f, select = input$columns)  # subsetting takes place here
    head(f)
  })
}
Does anyone know of a way to refer to a variable that is defined in the server, from the ui, and thus allow for interactive manipulation?
You can use the family of functions update*Input - in this case updateSelectInput. Its first argument has to be session, and you also have to add session to server <- function(input, output) to be able to update your widget.
You can make the widget update immediately after clicking on the actionButton - to do that, use updateSelectInput within eventReactive.
Let's take a look how we can do that:
First, you can save the names of columns of the new uploaded dataset in a variable, say, vars and then pass it to the function updateSelectInput.
(The choices of the selectInput are initially set to NULL - we don't need to specify them before because they are going to be updated anyway)
info <- eventReactive(input$choice, {
  inFile <- input$file
  # Instead of "if (is.null(inFile)) ..." use "req"
  req(inFile)
  # Changes in read.table
  f <- read.table(inFile$datapath, header = input$header, sep = input$sep, quote = input$quote)
  vars <- names(f)
  # Update select input immediately after clicking on the action button.
  updateSelectInput(session, "columns", "Select Columns", choices = vars)
  f
})
I've added a small upload interface to your code.
The other way would be to define widgets on the server side and then to pass them to the client side via renderUI function. You can find here an example.
Full example:
library(shiny)

ui <- fluidPage(
  h2('The uploaded file data'),
  dataTableOutput('mytable'),
  fileInput('file', 'Choose info-file to upload',
            accept = c(
              'text/csv',
              'text/comma-separated-values',
              'text/tab-separated-values',
              'text/plain',
              '.csv',
              '.tsv'
            )
  ),
  # Taken from: http://shiny.rstudio.com/gallery/file-upload.html
  tags$hr(),
  checkboxInput('header', 'Header', TRUE),
  radioButtons('sep', 'Separator',
               c(Comma = ',',
                 Semicolon = ';',
                 Tab = '\t'),
               ','),
  radioButtons('quote', 'Quote',
               c(None = '',
                 'Double Quote' = '"',
                 'Single Quote' = "'"),
               '"'),
  ################################################################
  actionButton("choice", "incorporate external information"),
  selectInput("columns", "Select Columns", choices = NULL),  # no choices before uploading
  tableOutput("table_display")
)

server <- function(input, output, session) {  # added session for updateSelectInput
  info <- eventReactive(input$choice, {
    inFile <- input$file
    # Instead of "if (is.null(inFile)) ..." use "req"
    req(inFile)
    # Changes in read.table
    f <- read.table(inFile$datapath, header = input$header, sep = input$sep, quote = input$quote)
    vars <- names(f)
    # Update select input immediately after clicking on the action button.
    updateSelectInput(session, "columns", "Select Columns", choices = vars)
    f
  })

  output$table_display <- renderTable({
    f <- info()
    f <- subset(f, select = input$columns)  # subsetting takes place here
    head(f)
  })
}

shinyApp(ui, server)

Handling Hebrew files and folders with Python 3.4

I used Python 3.4 to create a program that goes through e-mails and saves specific attachments to a file server.
Each file is saved to a specific destination depending on the sender's e-mail address.
My problem is that the destination folders and the attachments are both in Hebrew, and for a few attachments I get an error that the path does not exist.
That shouldn't be possible, because it fails for one attachment but not for the others in the same mail (and the destination folder is decided by the sender's address).
I want to debug the issue, but I cannot get Python to display the file path it is trying to save correctly (it's mixed Hebrew and English and always comes out as a big mess, although the save to the file server works correctly about 95% of the time).
So my questions are:
What should I add to this code so that it will process Hebrew correctly?
Should I encode or decode something?
Are there characters I should avoid when processing the files?
here's the main piece of code that fails:
try:
    found_attachments = False
    for att in msg.Attachments:
        _, extension = split_filename(str(att))
        # check if attachment is not inline
        if str(att) not in msg.HTMLBody:
            if extension in database[sender][TYPES]:
                file = create_file(str(att), database[sender][PATH], database[sender][FORMAT], time_stamp)
                # This is where the program fails:
                att.SaveAsFile(file)
                print("Created:", file)
                found_attachments = True
    if found_attachments:
        items_processed.append(msg)
    else:
        items_no_att.append(msg)
except:
    print("Error with attachment: " + str(att) + " , in: " + str(msg))
and the create file function:
def create_file(att, location, format, timestamp):
    """
    process an attachment to make it a file
    :param att: the name of the attachment
    :param location: the path to the file
    :param format: the format of the file
    :param timestamp: the time and date the attachment was created
    :return: return the file created
    """
    # create the file by the given format
    if format == "":
        output_file = location + "\\" + att
    else:
        # split file to name and type
        filename, extension = split_filename(att)
        # extract and format the time sent on
        time = str(timestamp.time()).replace(":", ".")[:-3]
        # extract and format the date sent on
        day = str(timestamp.date())
        day = day[-2:] + day[4:-2] + day[:4]
        # initiate the output file
        output_file = format
        # add the original file name where needed
        output_file = output_file.replace(FILENAME, filename)
        # add the sent date where needed
        output_file = output_file.replace(DATE, day)
        # add the time sent where needed
        output_file = output_file.replace(TIME, time)
        # add the path and type
        output_file = location + "\\" + output_file + "." + extension
    print(output_file)
    # add an index to the file if necessary and return it
    index = get_file_index(output_file)
    if index:
        filename, extension = split_filename(output_file)
        return filename + "(" + str(index) + ")." + extension
    else:
        return output_file
Thanks in advance, I would be happy to explain more or supply more code if needed.
I found out that the problem was not the Hebrew at all. There is a limit on the number of characters that the path + filename can hold (about 255 characters).
The files that failed exceeded that limit, and that is what caused the problem.
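For anyone hitting the same wall, here is a minimal sketch of a workaround (the 260-character MAX_PATH value and the helper name shorten_path are my own choices, not part of the original program): shorten the file name before calling att.SaveAsFile(file) so the full path stays under the limit.

import os

MAX_PATH = 260  # classic Windows limit for the full path, including drive and separators

def shorten_path(path, limit=MAX_PATH):
    """Truncate the file name (not the directories) so the whole path fits the limit."""
    if len(path) <= limit:
        return path
    directory, filename = os.path.split(path)
    name, ext = os.path.splitext(filename)
    allowed = limit - len(directory) - len(ext) - 1  # -1 for the separator
    if allowed <= 0:
        raise ValueError("Directory path alone is too long: " + directory)
    return os.path.join(directory, name[:allowed] + ext)

# Usage before saving:
# att.SaveAsFile(shorten_path(file))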

Multiple outputs with a for loop in praat

I have a script where I have multiple folders each with three audio files in them ID#_1, ID#_2, and ID#_3. The user can input a string of different ID#s, one after the other, and then the script recognizes the different IDs and runs the code for each of them.
I have a for loop set up for this -
form Settings
    comment Enter the IDs of the different subjects
    sentence subjectIDs
endform

numOfSubjects = length(subjectIDs$)/4

for i from 0 to (numOfSubjects - 1)
    subjectID$ = mid$(subjectIDs$, 1 + 4*i, 4 + 4*i)
    outFile$ = subjectID$ + "/SubjectResponseOnsets" + subjectID$ + ".txt"
    path$ = subjectID$ + "/" + subjectID$
    @firstOutput
    @secondOutput
    @thirdOutput
Each of these procedures is defined previously in the code, and they basically output certain ranges from the audio files out to a text file.
The code seems to work fine and generate the output file correctly when one ID is given, but when I try to run it with more than one ID at a time, only the text file for the first ID is outputted.
The for loop does not seem to be working well, but the code does work fine in the first run.
I would greatly appreciate any help!
I don't know if I understood well what your script was trying to do, since the snippet you pasted was incomplete. It's best if you provide code that is executable as is. In this case, you were missing the closing endfor, and you were calling some procedures that were not defined in your snippet (not even as placeholders). I had to write some dummy procedures just to make it run.
Since you also didn't say how your script was failing, it was unclear what needed to be fixed. So I took a stab at making it work.
It sounded as if your ID splitting code was giving you some problems. I took the split procedure from the utils plugin available through CPrAN, which makes inputting the IDs easier (full disclosure: I wrote that plugin).
form Settings
    comment Enter the IDs of the different subjects
    sentence subjectIDs 01 02 03
endform

@split: " ", subjectIDs$
numOfSubjects = split.length

for i to numOfSubjects
    subjectID$ = split.return$[i]
    path$ = subjectID$
    outFile$ = path$ + "/SubjectResponseOnsets" + subjectID$ + ".txt"

    # Make sure output directory exists
    createDirectory: path$

    @firstOutput
    @secondOutput
    @thirdOutput
endfor

procedure firstOutput ()
    appendFileLine: outFile$, "First"
endproc

procedure secondOutput ()
    appendFileLine: outFile$, "Second"
endproc

procedure thirdOutput ()
    appendFileLine: outFile$, "Third"
endproc

# split procedure from the utils CPrAN plugin
# http://cpran.net/plugins/utils
procedure split (.sep$, .str$)
    .seplen = length(.sep$)
    .length = 0
    repeat
        .strlen = length(.str$)
        .sep = index(.str$, .sep$)
        if .sep > 0
            .part$ = left$(.str$, .sep-1)
            .str$ = mid$(.str$, .sep+.seplen, .strlen)
        else
            .part$ = .str$
        endif
        .length = .length+1
        .return$[.length] = .part$
    until .sep = 0
endproc
If this is not what you are having trouble with, you'll have to be more specific.

Read from text file and assign data to new variable

A Python 3 program allows people to choose from a list of employee names.
The data held in the text file looks like this: ('larry', 3, 100)
(being the person's name, weeks worked and payment).
I need a way to assign each part of the text file to a new variable,
so that the user can enter a new number of weeks and the program calculates the new payment.
Below is my code and my attempt at figuring it out.
import os
choices = [f for f in os.listdir(os.curdir) if f.endswith(".txt")]
print (choices)
emp_choice = input("choose an employee:")
file = open(emp_choice + ".txt")
data = file.readlines()
name = data[0]
weeks_worked = data[1]
weekly_payment= data[2]
new_weeks = int(input ("Enter new number of weeks"))
new_payment = new_weeks * weekly_payment
print (name + "will now be paid" + str(new_payment))
Currently you are assigning the first three lines from the file to name, weeks_worked and weekly_payment, but what you want (I think) is to separate a single line, formatted as ('larry', 3, 100) (does each file have only one line?).
So you probably want code like:
from re import compile

# your code to choose file

line_format = compile(r"\s*\(\s*'([^']*)'\s*,\s*(\d+)\s*,\s*(\d+)\s*\)")
file = open(emp_choice + ".txt")
line = file.readline()  # read the first line only
match = line_format.match(line)
if match:
    name, weeks_worked, weekly_payment = match.groups()
else:
    raise Exception('Could not match %s' % line)

# your code to update information
the regular expression looks complicated, but is really quite simple:
\(...\) matches the parentheses in the line
\s* matches optional spaces (it's not clear to me if you have spaces or not in various places between words, so this matches just in case)
\d+ matches a number (1 or more digits)
[^']* matches anything except a quote (so matches the name)
(...) (without the \ backslashes) indicates a group that you want to read afterwards by calling .groups()
and these are built from simpler parts (like * and + and \d) which are described at http://docs.python.org/2/library/re.html
if you want to repeat this for many lines, you probably want something like:
name, weeks_worked, weekly_payment = [], [], []
for line in file.readlines():
    match = line_format.match(line)
    if match:
        name.append(match.group(1))
        weeks_worked.append(match.group(2))
        weekly_payment.append(match.group(3))
    else:
        raise ...
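A side note, not part of the answer above: since each line already looks like a Python tuple literal such as ('larry', 3, 100), ast.literal_eval is another way to parse it without a regular expression. A minimal sketch (the file name employees.txt is just an example):

import ast

with open("employees.txt") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        # literal_eval safely parses the tuple literal without executing code
        name, weeks_worked, weekly_payment = ast.literal_eval(line)
        print(name, weeks_worked, weekly_payment)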
