Scrapy: Why should I use yield for multiple requests?

Put simply, I need three things:
1) Log in
2) Multiple requests
3) Synchronous requests (sequential, like in C)
I realized that 'yield' should be used for multiple requests, but I think 'yield' works differently from C and is not sequential. So I want to issue the requests without 'yield', as shown below, but then the crawl method is never called.
How can I call the crawl method sequentially, like in C?
import scrapy
from scrapy.http import FormRequest, Request

class HotdaySpider(scrapy.Spider):
    name = "hotday"
    allowed_domains = ["test.com"]
    login_page = "http://www.test.com"
    start_urls = ["http://www.test.com"]
    maxnum = 27982
    runcnt = 10

    def parse(self, response):
        return [FormRequest.from_response(response, formname='login_form',
                formdata={'id': 'id', 'password': 'password'},
                callback=self.after_login)]

    def after_login(self, response):
        global maxnum
        global runcnt
        i = 0
        while i < runcnt:
            # these requests are created but never yielded/returned
            Request(url="http://www.test.com/view.php?idx=" + str(maxnum) + "/", callback=self.crawl)
            i = i + 1

    def crawl(self, response):
        global maxnum
        filename = 'hotday.html'
        with open(filename, 'wb') as f:
            f.write(unicode(response.body.decode(response.encoding)).encode('utf-8'))
        maxnum = maxnum + 1

When you return a list of requests (which is effectively what you do when you yield several of them), Scrapy schedules all of them and you cannot control the order in which the responses come back.
If you want to process one response at a time and in order, you have to return only one request from your after_login method and construct the next request inside your crawl method.
import re

def after_login(self, response):
    return Request(url="http://www.test.com/view.php?idx=0/", callback=self.crawl)

def crawl(self, response):
    global maxnum
    global runcnt
    filename = 'hotday.html'
    with open(filename, 'wb') as f:
        f.write(unicode(response.body.decode(response.encoding)).encode('utf-8'))
    maxnum = maxnum + 1
    # extract the idx we just processed from the URL and request the next one
    next_page = int(re.search(r'\?idx=(\d*)', response.request.url).group(1)) + 1
    if next_page < runcnt:
        return Request(url="http://www.test.com/view.php?idx=" + str(next_page) + "/", callback=self.crawl)
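As a variation on the answer above (not part of the original answer), the counters can also travel with each request in request.meta instead of module-level globals. A minimal sketch of the same chain, written as drop-in methods for the HotdaySpider above; the per-index filename is an assumption, everything else reuses names from the question:

def after_login(self, response):
    return Request(url="http://www.test.com/view.php?idx=%d/" % self.maxnum,
                   meta={'idx': self.maxnum, 'remaining': self.runcnt},
                   callback=self.crawl)

def crawl(self, response):
    idx = response.meta['idx']                    # index carried over from the request
    remaining = response.meta['remaining'] - 1
    with open('hotday_%d.html' % idx, 'wb') as f:  # one file per page (assumption)
        f.write(response.body)
    if remaining > 0:
        # the next request is only scheduled after this response is handled,
        # so the pages are processed strictly in order
        return Request(url="http://www.test.com/view.php?idx=%d/" % (idx + 1),
                       meta={'idx': idx + 1, 'remaining': remaining},
                       callback=self.crawl)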

Related

How to use cursors in Google App Engine for fetching records from a db database in batches?

I am trying to fetch data from a db database in batches and copy it to an ndb database, using a cursor. My code does this successfully for the first batch, but does not fetch any further records. I did not find much information on cursors; please help me out here.
Here is my code snippet:

def post(self):
    a = 0
    chunk_size = 2
    next_cursor = self.request.get("cursor")
    query = db.GqlQuery("select * from BooksPost")
    while a == 0:
        if next_cursor:
            query.with_cursor(start_cursor=next_cursor)
        else:
            a = 1
        results = query.fetch(chunk_size)
        for result in results:
            nbook1 = result.bookname
            nauthor1 = result.authorname
            nbook1 = nBooksPost(nbookname=nbook1, nauthorname=nauthor1)
            nbook1.put()
        next_cursor = self.request.get("cursor")
Basically, how do I set the next cursor to iterate over?
def post(self):
    chunk_size = 10
    has_more_results = True
    query = db.GqlQuery("select * from Post")
    cursor = self.request.get('cursor', None)
    #cursor = query.cursor()
    if cursor:
        query.with_cursor(cursor)
    while has_more_results == True:
        results = query.fetch(chunk_size)
        new_cursor = query.cursor()
        print("count: %d, results %d" % (query.count(), len(results)))
        if query.count(1) == 1:
            has_more_results = True
        else:
            has_more_results = False
        for result in results:
            # do this: process each result
            pass
        query.with_cursor(new_cursor)
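The essential pattern: fetch a chunk, then call query.cursor() to get a cursor for where that fetch stopped, and feed it back through with_cursor() before the next fetch. A minimal sketch of that loop, assuming the BooksPost and nBooksPost models from the question:

from google.appengine.ext import db

def copy_books(chunk_size=100):
    query = db.GqlQuery("select * from BooksPost")
    cursor = None
    while True:
        if cursor:
            # resume exactly where the previous fetch stopped
            query.with_cursor(start_cursor=cursor)
        batch = query.fetch(chunk_size)
        if not batch:
            break
        for row in batch:
            # copy each old db entity into the new ndb model
            nBooksPost(nbookname=row.bookname, nauthorname=row.authorname).put()
        cursor = query.cursor()  # cursor positioned after the last fetched entity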

Stream of items, each producing n{ http request } > merge{response} > wrapInHeaderAndFooter{data} > http-request

I am trying to solve a classic ETL problem using streaming. I have a batch of segments; each segment holds the information needed to issue an HTTP request to collect its data (number of records, URL to retrieve, etc.). I need to extract the records from the source with a page size of 100 records, merge the pages of records for each segment, and wrap them in an XML header and footer. Each XML payload is then sent per segment to a target.
                     {http}
                    page 1
                   /       \
         seg 1 >  page 2     -> merge -> wrapHeaderAndFooter -> http target
        /          \        /
       /            page n
batch -- seg 2      "                                        -> http target
        \ seg n     "                                        -> http target
val loadSegment: Flow[Segment, Response, NotUsed] = {
  Flow[Segment].mapAsync(parallelism = 5) { segment =>
    val pages: Source[ByteString, NotUsed] = pagedPayload(segment).map(page => page.payload)
    // Using source concatenation to prepend and append
    val wrappedInXML: Source[ByteString, NotUsed] = xmlRootStartTag ++ pages ++ xmlRootEndTag
    val httpEntity: HttpEntity = HttpEntity(MediaTypes.`application/octet-stream`, pages)
    invokeTargetLoad(httpEntity, request, segment)
  }
}

def pagedPayload(segment: Segment): Source[Payload, NotUsed] = {
  val totalPages: Int = calculateTotalPages(segment.instanceCount)
  Source(0 until totalPages).mapAsyncUnordered(parallelism = 5)(i => {
    sendPayloadRequest(request, segment, i).mapTo[Try[Payload]].map(_.get)
  })
}

val batch: Batch = someBatch

Source(batch.segments)
  .via(loadSegment)
  .runWith(Sink.ignore)
  .andThen {
    case Success(value) => log("success")
    case Failure(error) => report(error)
  }
Is there a better approach? I am trying to use HttpEntity.Chunked encoding to stream the pages. Sometimes the first request from the source takes longer due to warm-up, and the target truncates the stream because no data has arrived yet. Is there a way to delay the actual connection to the target until the first page is available in the stream?
I would rather do something like the following; if that is possible, how would I implement the wrapXMLHeader and toHttpEntity methods?
val splitPages: Flow[BuildSequenceSegment, Seq[PageRequest], NotUsed] = ???
val requestPayload: Flow[Seq[PageRequest], Seq[PageResponse], NotUsed] = ???
val wrapXMLHeader: Flow[Seq[PageResponse], Seq[PageResponse], NotUsed] = ???
val toHttpEntity: Flow[Seq[PageResponse], HttpEntity.Chunked, NotUsed] = ???
val invokeTargetLoad: Flow[HttpEntity.Chunked, RestResponse, NotUsed] = ???
Source(batch.segments)
  .via(splitPages)
  .via(requestPayload)
  .via(wrapXMLHeader)
  .via(toHttpEntity)
  .via(invokeTargetLoad)
  .runWith(Sink.ignore)

Batch Request for get in gmail API

I have a list of around 2500 mail IDs, and I'm stuck with only being able to use the requests library. So far I fetch the mail headers one at a time like this:
mail_ids = ['']
for mail_id in mail_ids:
    res = requests.get(
        'https://www.googleapis.com/gmail/v1/users/me/messages/{}?format=metadata'.format(mail_id),
        headers=headers).json()
    mail_headers = res['payload']['headers']
    ...
But it's very inefficient, and I would rather POST a list of IDs instead; however, in the documentation https://developers.google.com/gmail/api/v1/reference/users/messages/get I don't see a batchGet. Is there any workaround? I'm using the Flask framework. Thanks a lot.
This is a bit late, but in case it helps anyone, here's the code I used to do a batch get of emails.
First, get a list of relevant emails. Change the request according to your needs; I'm getting only sent emails for a certain time period:
query = "https://www.googleapis.com/gmail/v1/users/me/messages?labelIds=SENT&q=after:2020-07-25 before:2020-07-31"
response = requests.get(query, headers=header)
events = json.loads(response.content)
email_tokens = events['messages']
while 'nextPageToken' in events:
    response = requests.get(query + f"&pageToken={events['nextPageToken']}",
                            headers=header)
    events = json.loads(response.content)
    email_tokens += events['messages']
Then I batch a get request to fetch 100 emails at a time, parse only the JSON part of each email, and put it into a list called emails. Note that there's some repeated code here, so you may want to refactor it into a method (there is a sketch of such a helper after this block). You'll have to set your access token here:
emails = []
access_token = '1234'
header = {'Authorization': 'Bearer ' + access_token}

batch_header = header.copy()
batch_header['Content-Type'] = 'multipart/mixed; boundary="email_id"'

data = ''
ctr = 0
for token_dict in email_tokens:
    data += f'--email_id\nContent-Type: application/http\n\nGET /gmail/v1/users/me/messages/{token_dict["id"]}?format=full\n\n'
    if ctr == 99:
        data += '--email_id--'
        print(data)
        r = requests.post(f"https://www.googleapis.com/batch/gmail/v1",
                          headers=batch_header, data=data)
        bodies = r.content.decode().split('\r\n')
        for body in bodies:
            if body.startswith('{'):
                parsed_body = json.loads(body)
                emails.append(parsed_body)
        ctr = 0
        data = ''
        continue
    ctr += 1

data += '--email_id--'
r = requests.post(f"https://www.googleapis.com/batch/gmail/v1",
                  headers=batch_header, data=data)
bodies = r.content.decode().split('\r\n')
for body in bodies:
    if body.startswith('{'):
        parsed_body = json.loads(body)
        emails.append(parsed_body)
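As mentioned above, the POST-and-parse block appears twice; a small helper along these lines could replace both occurrences (a sketch that reuses the batch_header and the email_id boundary defined above):

def post_batch(batch_data):
    # send one multipart/mixed batch and return the parsed JSON message bodies
    r = requests.post("https://www.googleapis.com/batch/gmail/v1",
                      headers=batch_header, data=batch_data + '--email_id--')
    parsed = []
    for body in r.content.decode().split('\r\n'):
        if body.startswith('{'):
            parsed.append(json.loads(body))
    return parsed

With that in place, the loop only accumulates the part headers and calls emails.extend(post_batch(data)) every 100 messages and once more for the remainder.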
[Optional] Finally, I decode the text in each email and store only the last sent message instead of the whole thread. The regex used here splits on strings that I found usually mark the end of emails, for instance "On Tue, Jun 23, 2020, x@gmail.com said...":
import re
import base64

gmail_split_regex = r'On [a-zA-Z]{3}, ([a-zA-Z]{3}|\d{2}) ([a-zA-Z]{3}|\d{2}),? \d{4}'

for email in emails:
    if 'parts' not in email['payload']:
        continue
    for part in email['payload']['parts']:
        if part['mimeType'] == 'text/plain':
            if 'uniqueBody' not in email:
                plainText = str(base64.urlsafe_b64decode(bytes(str(part['body']['data']), encoding='utf-8')))
                email['uniqueBody'] = {'content': re.split(gmail_split_regex, plainText)[0]}
        elif 'parts' in part:
            for sub_part in part['parts']:
                if sub_part['mimeType'] == 'text/plain':
                    if 'uniqueBody' not in email:
                        plainText = str(base64.urlsafe_b64decode(bytes(str(sub_part['body']['data']), encoding='utf-8')))
                        email['uniqueBody'] = {'content': re.split(gmail_split_regex, plainText)[0]}

Run another DAG with TriggerDagRunOperator multiple times

I have a DAG (DAG1) where I copy a bunch of files. I would then like to kick off another DAG (DAG2) for each file that was copied. As the number of files copied will vary per DAG1 run, I would like to essentially loop over the files and call DAG2 with the appropriate parameters.
eg:
with DAG('DAG1',
         description="copy files over",
         schedule_interval="* * * * *",
         max_active_runs=1
         ) as dag:

    t_rsync = RsyncOperator(task_id='rsync_data',
                            source='/source/',
                            target='/destination/')

    t_trigger_preprocessing = TriggerDagRunOperator(task_id='trigger_preprocessing',
                                                    trigger_dag_id='DAG2',
                                                    python_callable=trigger
                                                    )

    t_rsync >> t_trigger_preprocessing
I was hoping to use the python_callable of the trigger to pull the relevant XCom data from t_rsync and then trigger DAG2, but it's not clear to me how to do this.
I would prefer to put the logic of calling DAG2 here to simplify the contents of DAG2 (and also provide stacking semantics with max_active_runs).
I ended up writing my own operator:
from airflow import settings
from airflow.exceptions import AirflowSkipException
from airflow.models import DagBag
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from airflow.utils.db import create_session
from airflow.utils.state import State

class TriggerMultipleDagRunOperator(TriggerDagRunOperator):
    def execute(self, context):
        count = 0
        for dro in self.python_callable(context):
            if dro:
                with create_session() as session:
                    dbag = DagBag(settings.DAGS_FOLDER)
                    trigger_dag = dbag.get_dag(self.trigger_dag_id)
                    dr = trigger_dag.create_dagrun(
                        run_id=dro.run_id,
                        state=State.RUNNING,
                        conf=dro.payload,
                        external_trigger=True)
                    session.add(dr)
                    session.commit()
                count = count + 1
            else:
                self.log.info("Criteria not met, moving on")
        if count == 0:
            raise AirflowSkipException('No external dags triggered')
with a python_callable like:

def trigger_preprocessing(context):
    for base_filename, _ in found.items():
        exp = context['ti'].xcom_pull(task_ids='parse_config', key='experiment')
        run_id = '%s__%s' % (exp['microscope'], datetime.utcnow().replace(microsecond=0).isoformat())
        dro = DagRunOrder(run_id=run_id)
        d = {
            'directory': context['ti'].xcom_pull(task_ids='parse_config', key='experiment_directory'),
            'base': base_filename,
            'experiment': exp['name'],
        }
        LOG.info('triggering dag %s with %s' % (run_id, d))
        dro.payload = d
        yield dro
    return
and then tie it all together with:

t_trigger_preprocessing = TriggerMultipleDagRunOperator(task_id='trigger_preprocessing',
                                                        trigger_dag_id='preprocessing',
                                                        python_callable=trigger_preprocessing
                                                        )
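Since each run is created with conf=dro.payload, the triggered 'preprocessing' DAG can read those parameters from dag_run.conf. A minimal sketch of what a task in DAG2 might look like (the task name, keys, and operator wiring here are only illustrative, not part of the original answer):

from airflow.operators.python_operator import PythonOperator

def preprocess(**context):
    conf = context['dag_run'].conf or {}
    directory = conf.get('directory')    # set by trigger_preprocessing in DAG1
    base = conf.get('base')
    experiment = conf.get('experiment')
    # ... process the file identified by directory/base for this experiment ...

t_preprocess = PythonOperator(task_id='preprocess',
                              python_callable=preprocess,
                              provide_context=True,
                              dag=dag)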

cron job throwing DeadlineExceededError

I am currently working on a Google Cloud project in free-trial mode. I have a cron job that fetches data from a data vendor and stores it in the Datastore. I wrote the code to fetch the data a couple of weeks ago and it was all working fine, but all of a sudden I started receiving the error "DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded" over the last two days. I believe a cron job is supposed to time out only after 60 minutes; any idea why I am getting this error?
cron task
def run():
    try:
        config = cron.config
        actual_data_source = config['xxx']['xxxx']
        original_data_source = actual_data_source

        company_list = cron.rest_client.load(config, "companies", '')
        if not company_list:
            logging.info("Company list is empty")
            return "Ok"

        for row in company_list:
            company_repository.save(row, original_data_source, actual_data_source)
        return "OK"
    except Exception:
        logging.exception("Error occurred in the companies cron job")
        raise
Repository code
def save(dto, org_ds, act_dp):
    try:
        key = 'FIN/%s' % (dto['ticker'])
        company = CompanyInfo(id=key)
        company.stock_code = key
        company.ticker = dto['ticker']
        company.name = dto['name']
        company.original_data_source = org_ds
        company.actual_data_provider = act_dp
        company.put()
        return company
    except Exception:
        logging.exception("company_repository: error occurred saving the company record")
        raise
RestClient
def load(config, resource, filter):
    try:
        username = config['xxxx']['xxxx']
        password = config['xxxx']['xxxx']
        headers = {"Authorization": "Basic %s" % base64.b64encode(username + ":" + password)}

        if filter:
            from_date = filter['from']
            to_date = filter['to']
            ticker = filter['ticker']
            start_date = datetime.strptime(from_date, '%Y%m%d').strftime("%Y-%m-%d")
            end_date = datetime.strptime(to_date, '%Y%m%d').strftime("%Y-%m-%d")

        current_page = 1
        data = []
        while True:
            if filter:
                url = config['xxxx']["endpoints"][resource] % (ticker, current_page, start_date, end_date)
            else:
                url = config['xxxx']["endpoints"][resource] % (current_page)

            response = urlfetch.fetch(
                url=url,
                deadline=60,
                method=urlfetch.GET,
                headers=headers,
                follow_redirects=False,
            )
            if response.status_code != 200:
                logging.error("xxxx GET received status code %d!" % (response.status_code))
                logging.error("error happened for url: %s with headers %s", url, headers)
                return 'Sorry, xxxx API request failed', 500

            db = json.loads(response.content)
            if not db['data']:
                break
            data.extend(db['data'])
            if db['total_pages'] == current_page:
                break
            current_page += 1
        return data
    except Exception:
        logging.exception("Error occurred with xxxx API request")
        raise
I'm guessing this is the same question as this, but now with more code:
DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded
I modified your code to write to the database after each urlfetch. If there are more pages, then it relaunches itself in a deferred task, which should be well before the 10 minute timeout.
Uncaught exceptions in a deferred task cause it to retry, so be mindful of that.
It was unclear to me how actual_data_source & original_data_source worked, but I think you should be able to modify that part.
cron task
def run(current_page=0):
    try:
        config = cron.config
        actual_data_source = config['xxx']['xxxx']
        original_data_source = actual_data_source

        data, more = cron.rest_client.load(config, "companies", '', current_page)
        for row in data:
            company_repository.save(row, original_data_source, actual_data_source)

        # fetch the rest
        if more:
            deferred.defer(run, current_page + 1)
    except Exception as e:
        logging.exception("run() experienced an error: %s" % e)
RestClient
def load(config, resource, filter, current_page):
    try:
        username = config['xxxx']['xxxx']
        password = config['xxxx']['xxxx']
        headers = {"Authorization": "Basic %s" % base64.b64encode(username + ":" + password)}

        if filter:
            from_date = filter['from']
            to_date = filter['to']
            ticker = filter['ticker']
            start_date = datetime.strptime(from_date, '%Y%m%d').strftime("%Y-%m-%d")
            end_date = datetime.strptime(to_date, '%Y%m%d').strftime("%Y-%m-%d")
            url = config['xxxx']["endpoints"][resource] % (ticker, current_page, start_date, end_date)
        else:
            url = config['xxxx']["endpoints"][resource] % (current_page)

        response = urlfetch.fetch(
            url=url,
            deadline=60,
            method=urlfetch.GET,
            headers=headers,
            follow_redirects=False,
        )
        if response.status_code != 200:
            logging.error("xxxx GET received status code %d!" % (response.status_code))
            logging.error("error happened for url: %s with headers %s", url, headers)
            return [], False

        db = json.loads(response.content)
        return db['data'], (db['total_pages'] != current_page)
    except Exception as e:
        logging.exception("Error occurred with xxxx API request: %s" % e)
        return [], False
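One way to wire this up, sketched under the assumption that the cron URL is served by a webapp2 handler (the handler class and route below are illustrative): the cron request only enqueues the first run() call and returns immediately, and each page then re-defers the next one as shown above.

from google.appengine.ext import deferred
import webapp2

class FetchCompaniesCron(webapp2.RequestHandler):
    def get(self):
        # the real work happens in the task queue, one page per deferred task
        deferred.defer(run, 0)
        self.response.write("queued")

app = webapp2.WSGIApplication([('/cron/fetch-companies', FetchCompaniesCron)])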
I would prefer to write this as a comment, but I need more reputation to do that.
What happens when you run the actual data fetch directly instead of through the cron job?
Have you tried measuring a time delta from the start to the end of the job? (There is a small timing sketch below.)
Has the number of companies being retrieved increased dramatically?
You appear to be doing some form of stock quote aggregation - is it possible that the provider has started blocking you?
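For the timing question above, a minimal sketch of measuring the job end to end (assuming the run() function from the question):

import logging
import time

start = time.time()
run()
logging.info("cron fetch took %.1f seconds", time.time() - start)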
