cron job throwing DeadlineExceededError - google-app-engine

I am currently working on a Google Cloud project in free trial mode. I have a cron job that fetches data from a data vendor and stores it in the datastore. I wrote the code a couple of weeks ago and it was working fine, but for the last two days I have been receiving the error "DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded". I believed a cron job is only supposed to time out after 60 minutes. Any idea why I am getting this error?
cron task
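def run():
    try:
        config = cron.config
        actual_data_source = config['xxx']['xxxx']
        original_data_source = actual_data_source
        company_list = cron.rest_client.load(config, "companies", '')
        if not company_list:
            logging.info("Company list is empty")
            return "Ok"
        for row in company_list:
            company_repository.save(row, original_data_source, actual_data_source)
        return "OK"
    except Exception:
        # Assumed handler: the post shows no except clause, so log and re-raise
        # in the same way as the other functions.
        logging.exception("cron task: error occurred loading companies")
        raise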
Repository code
def save(dto, org_ds, act_dp):
    try:
        key = 'FIN/%s' % (dto['ticker'])
        company = CompanyInfo(id=key)
        company.stock_code = key
        company.ticker = dto['ticker']
        company.name = dto['name']
        company.original_data_source = org_ds
        company.actual_data_provider = act_dp
        company.put()
        return company
    except Exception:
        logging.exception("company_repository: error occurred saving the company record")
        raise
RestClient
def load(config, resource, filter):
    try:
        username = config['xxxx']['xxxx']
        password = config['xxxx']['xxxx']
        headers = {"Authorization": "Basic %s" % base64.b64encode(username + ":" + password)}
        if filter:
            from_date = filter['from']
            to_date = filter['to']
            ticker = filter['ticker']
            start_date = datetime.strptime(from_date, '%Y%m%d').strftime("%Y-%m-%d")
            end_date = datetime.strptime(to_date, '%Y%m%d').strftime("%Y-%m-%d")
        current_page = 1
        data = []
        while True:
            if (filter):
                url = config['xxxx']["endpoints"][resource] % (ticker, current_page, start_date, end_date)
            else:
                url = config['xxxx']["endpoints"][resource] % (current_page)
            response = urlfetch.fetch(
                url=url,
                deadline=60,
                method=urlfetch.GET,
                headers=headers,
                follow_redirects=False,
            )
            if response.status_code != 200:
                logging.error("xxxx GET received status code %d!" % (response.status_code))
                logging.error("error happend for url: %s with headers %s", url, headers)
                return 'Sorry, xxxx API request failed', 500
            db = json.loads(response.content)
            if not db['data']:
                break
            data.extend(db['data'])
            if db['total_pages'] == current_page:
                break
            current_page += 1
        return data
    except Exception:
        logging.exception("Error occured with xxxx API request")
        raise

I'm guessing this is the same question as this, but now with more code:
DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded
I modified your code to write to the database after each urlfetch. If there are more pages, it relaunches itself in a deferred task, so each run should finish well before the 10 minute timeout.
Uncaught exceptions in a deferred task cause it to retry, so be mindful of that.
It was unclear to me how actual_data_source and original_data_source work, but I think you should be able to adapt that part.
crontask
from google.appengine.ext import deferred

def run(current_page=1):  # the question's loop starts at page 1
    try:
        config = cron.config
        actual_data_source = config['xxx']['xxxx']
        original_data_source = actual_data_source
        data, more = cron.rest_client.load(config, "companies", '', current_page)
        for row in data:
            company_repository.save(row, original_data_source, actual_data_source)
        # fetch the rest
        if more:
            deferred.defer(run, current_page + 1)
    except Exception as e:
        logging.exception("run() experienced an error: %s" % e)
RestClient
def load(config, resource, filter, current_page):
    try:
        username = config['xxxx']['xxxx']
        password = config['xxxx']['xxxx']
        headers = {"Authorization": "Basic %s" % base64.b64encode(username + ":" + password)}
        if filter:
            from_date = filter['from']
            to_date = filter['to']
            ticker = filter['ticker']
            start_date = datetime.strptime(from_date, '%Y%m%d').strftime("%Y-%m-%d")
            end_date = datetime.strptime(to_date, '%Y%m%d').strftime("%Y-%m-%d")
            url = config['xxxx']["endpoints"][resource] % (ticker, current_page, start_date, end_date)
        else:
            url = config['xxxx']["endpoints"][resource] % (current_page)
        response = urlfetch.fetch(
            url=url,
            deadline=60,
            method=urlfetch.GET,
            headers=headers,
            follow_redirects=False,
        )
        if response.status_code != 200:
            logging.error("xxxx GET received status code %d!" % (response.status_code))
            logging.error("error happend for url: %s with headers %s", url, headers)
            return [], False
        db = json.loads(response.content)
        return db['data'], (db['total_pages'] != current_page)
    except Exception as e:
        logging.exception("Error occured with xxxx API request: %s" % e)
        return [], False

I would prefer to write this as a comment, but I need more reputation to do that.
What happens when you run the actual data fetch directly instead of through the cron job?
Have you tried measuring a time delta from the start to the end of the job?
Has the number of companies being retrieved increased dramatically?
You appear to be doing some form of stock quote aggregation - is it possible that the provider has started blocking you?
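To measure the time delta mentioned above, one option is a small timing helper around the fetch and save steps. This is only a sketch: log_elapsed and timed_call are hypothetical names, not part of App Engine or the question's code.

```python
import logging
import time
from contextlib import contextmanager


@contextmanager
def log_elapsed(label):
    """Log how long the wrapped block took, even if it raises."""
    start = time.time()
    try:
        yield
    finally:
        logging.info("%s took %.2f seconds", label, time.time() - start)


def timed_call(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.time()
    result = func(*args, **kwargs)
    return result, time.time() - start
```

Wrapping the call to cron.rest_client.load(...) and the save loop in separate log_elapsed blocks should show whether the vendor API or the datastore writes account for most of the request time.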

Related

Snowflake Python Connector: Copy Command Status and Error Handling

According to the Snowflake docs, when a user executes a copy command it will return 1 of 3 status values:
loaded
load failed
partially loaded
My question is: if I use the Snowflake Python Connector (see example code below) to execute a copy command, is an exception raised when the returned status is load failed or partially loaded?
Thank you!
copy_dml = 'copy into database.schema.table ' \
           'from @fully_qualified_stage pattern = \'.*'+ table_name +'.*[.]json\' ' \
           'file_format = (format_name = fully_qualified_json_format) ' \
           'force = true;'
try:
    import snowflake.connector
    #-------------------------------------------------------------------------------------------------------------------------------
    #snowflake variables
    snowflake_warehouse = credentials.iloc[0]['snowflake_warehouse']
    snowflake_account = credentials.iloc[0]['snowflake_account']
    snowflake_role = credentials.iloc[0]['snowflake_role']
    snowflake_username = credentials.iloc[0]['Username']
    snowflake_password = credentials.iloc[0]['Password']
    snowflake_connection = ''
    cs = ''#snowflake connection cursor
    exec_copy_dml = ''
    copy_result_field_metadata = ''
    copy_result = ''
    snowflake_copy_result_df = ''
    #-------------------------------------------------------------------------------------------------------------------------------
    # load JSON file(s) into Snowflake
    snowflake_connection = snowflake.connector.connect(
        user = snowflake_username,
        password = snowflake_password,
        account = snowflake_account,
        warehouse = snowflake_warehouse,
        role = snowflake_role)
    cs = snowflake_connection.cursor()
    exec_copy_dml = cs.execute(copy_dml)
    copy_result = exec_copy_dml.fetchall()
    copy_result_field_metadata = cs.description
    snowflake_copy_result_df = snowflake_results_df(copy_result_field_metadata,copy_result)
except snowflake.connector.errors.ProgrammingError as copy_error:
    copy_exception_message = "There was a problem loading JSON files to Snowflake," + \
                             "a snowflake.connector.errors.ProgrammingError exception was raised."
    print(copy_exception_message)
    raise
except Exception as error_message:
    raise
finally:
    snowflake_connection.close()
I believe it won't raise an exception for the load status; you have to check the load status yourself and take action if required.
After you issue your COPY INTO statement, you can run the following query:
SELECT * FROM TABLE(VALIDATE(TABLE_NAME, job_id => '_last'))
This gives you details on the files you were trying to load. It normally returns an empty result unless you encountered issues during the load.
You can save these results in an object and adjust your control flow accordingly.
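As a sketch of checking the load status yourself, the rows returned by the COPY statement can be inspected directly. This assumes the result set has "file" and "status" columns and that a successful file reports LOADED; find_failed_loads is a hypothetical helper, not part of the connector.

```python
def find_failed_loads(description, rows):
    """Return one dict per file whose COPY status is not LOADED.

    description: cursor.description (each item starts with the column name)
    rows: the cursor.fetchall() result of the COPY INTO statement
    """
    columns = [col[0].lower() for col in description]
    if "file" not in columns or "status" not in columns:
        # No per-file rows to inspect (for example, nothing was processed).
        return []
    records = [dict(zip(columns, row)) for row in rows]
    return [r for r in records if str(r["status"]).upper() != "LOADED"]
```

In the question's code this could be called as find_failed_loads(cs.description, copy_result), raising your own exception when the returned list is not empty.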

Unable to extract/list all event logs on a Watson Assistant workspace

Please help. I was trying to call the Watson Assistant endpoint
https://gateway.watsonplatform.net/assistant/api/v1/workspaces/myworkspace/logs?version=2018-09-20 to get the list of all events
and filter by date range using these params:
var param =
{ workspace_id: '{myworkspace}',
  page_limit: 100000,
  filter: 'response_timestamp%3C2018-17-12,response_timestamp%3E2019-01-01'}
but I got an empty response, shown below.
{
"logs": [],
"pagination": {}
}
A couple of things to check.
1. You have 2018-17-12, which is in year-day-month order. This translates to "12th day of the 17th month of 2018".
2. Assuming the date is corrected to a valid one, your search says "documents that are before 17th Dec 2018 and after 1st Jan 2019", which would return no documents.
3. Logs are only generated when you call the message() method through the API. So check your logging page in the tooling to see if you even have logs.
4. If you have a lite account logs are only stored for 7 days and then deleted. To keep logs longer you need to upgrade to a standard account.
Although not directly related to your issue, be aware that page_limit has an upper hard coded limit (IIRC 200-300?). So you may ask for 100,000 records, but it won't give it to you.
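As a sketch of points 1 and 2, the filter string can be built from real date objects, so that an invalid date fails early and the range cannot be inverted. build_log_filter is a hypothetical helper, and the >= and < operators are assumed to be accepted by the logs filter syntax in the same way as the < and > in the question.

```python
from datetime import date


def build_log_filter(start, end):
    """Build a filter matching logs with start <= response_timestamp < end."""
    if start >= end:
        raise ValueError("start must be before end, otherwise no log can match")
    return "response_timestamp>={},response_timestamp<{}".format(
        start.isoformat(), end.isoformat()
    )


log_filter = build_log_filter(date(2018, 12, 17), date(2019, 1, 1))
print(log_filter)  # response_timestamp>=2018-12-17,response_timestamp<2019-01-01
```

Constructing date(2018, 17, 12) raises a ValueError, which would have flagged the swapped month and day before the request was sent.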
This is sample python code (unsupported) that is using pagination to read the logs:
from urllib.parse import urlparse, parse_qs

from watson_developer_cloud import AssistantV1

username = '...'
password = '...'
workspace_id = '....'
url = '...'
version = '2018-09-20'
c = AssistantV1(url=url, version=version, username=username, password=password)
totalpages = 999
pagelimit = 200
logs = []
page_count = 1
cursor = None
count = 0
x = { 'pagination': 'DUMMY' }
while x['pagination']:
    if page_count > totalpages:
        break
    print('Reading page {}. '.format(page_count), end='')
    x = c.list_logs(workspace_id=workspace_id,cursor=cursor,page_limit=pagelimit)
    if x is None: break
    print('Status: {}'.format(x.get_status_code()))
    x = x.get_result()
    logs.append(x['logs'])
    count = count + len(x['logs'])
    page_count = page_count + 1
    # the cursor for the next page is carried in the query string of next_url
    if 'pagination' in x and 'next_url' in x['pagination']:
        p = x['pagination']['next_url']
        u = urlparse(p)
        query = parse_qs(u.query)
        cursor = query['cursor'][0]
Your logs object should contain the logs.
I believe the limit is 500, and then we return a pagination URL so you can get the next 500. I don't think this is the issue, but once you start getting logs back it's good to know.

Batch Request for get in gmail API

I have a list of around 2500 mail IDs and I'm restricted to using only the requests library. So far I get the mail headers this way:
mail_ids = ['']
for mail_id in mail_ids:
    res = requests.get(
        'https://www.googleapis.com/gmail/v1/users/me/messages/{}?format=metadata'.format(mail_id),
        headers=headers).json()
    mail_headers = res['payload']['headers']
    ...
But it's very inefficient and I would rather POST a list of IDs instead. In the documentation at https://developers.google.com/gmail/api/v1/reference/users/messages/get I don't see a BatchGet. Is there any workaround? I'm using the Flask framework. Thanks a lot.
This is a bit late, but in case it helps anyone, here's the code I used to do a batch get of emails:
First I get a list of relevant emails. Change the request according to your needs, I'm getting only sent emails for a certain time period:
query = "https://www.googleapis.com/gmail/v1/users/me/messages?labelIds=SENT&q=after:2020-07-25 before:2020-07-31"
response = requests.get(query, headers=header)
events = json.loads(response.content)
email_tokens = events['messages']
while 'nextPageToken' in events:
    response = requests.get(query+f"&pageToken={events['nextPageToken']}",
                            headers=header)
    events = json.loads(response.content)
    email_tokens += events['messages']
Then I'm batching a get request to get 100 emails at a time, and parsing only the json part of the email and putting it into a list called emails. Note that there's some repeated code here, so you may want to refactor it into a method. You'll have to set your access token here:
emails = []
access_token = '1234'
header = {'Authorization': 'Bearer ' + access_token}
batch_header = header.copy()
batch_header['Content-Type'] = 'multipart/mixed; boundary="email_id"'
data = ''
ctr = 0
for token_dict in email_tokens:
    data += f'--email_id\nContent-Type: application/http\n\nGET /gmail/v1/users/me/messages/{token_dict["id"]}?format=full\n\n'
    if ctr == 99:
        data += '--email_id--'
        print(data)
        r = requests.post(f"https://www.googleapis.com/batch/gmail/v1",
                          headers=batch_header, data=data)
        bodies = r.content.decode().split('\r\n')
        for body in bodies:
            if body.startswith('{'):
                parsed_body = json.loads(body)
                emails.append(parsed_body)
        ctr = 0
        data = ''
        continue
    ctr+=1
data += '--email_id--'
r = requests.post(f"https://www.googleapis.com/batch/gmail/v1",
                  headers=batch_header, data=data)
bodies = r.content.decode().split('\r\n')
for body in bodies:
    if body.startswith('{'):
        parsed_body = json.loads(body)
        emails.append(parsed_body)
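One possible way to refactor the repeated code is sketched below. The helper names (chunked, build_batch_body, parse_batch_response) are hypothetical; the body layout and the line-based parsing simply mirror the code above and are not taken from the Gmail documentation.

```python
import json


def chunked(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_body(message_ids, boundary="email_id"):
    """Build the multipart/mixed body for one batch of message GETs."""
    parts = [
        "--{b}\nContent-Type: application/http\n\n"
        "GET /gmail/v1/users/me/messages/{i}?format=full\n\n".format(b=boundary, i=message_id)
        for message_id in message_ids
    ]
    return "".join(parts) + "--{}--".format(boundary)


def parse_batch_response(text):
    """Collect every line of the batch response that holds a JSON object."""
    return [json.loads(line) for line in text.split("\r\n") if line.startswith("{")]
```

With these helpers, the loop reduces to iterating over chunked([t["id"] for t in email_tokens]), posting build_batch_body(ids) with batch_header, and extending emails with parse_batch_response(r.content.decode()).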
[Optional] Finally, I'm decoding the text in the email and storing only the last sent email instead of the whole thread. The regex used here splits on strings that I found were usually at the end of emails. For instance, On Tue, Jun 23, 2020, x@gmail.com said...:
import re
import base64
# [a-zA-Z] rather than [a-zA-z]: the latter also matches punctuation such as [ and _
gmail_split_regex = r'On [a-zA-Z]{3}, ([a-zA-Z]{3}|\d{2}) ([a-zA-Z]{3}|\d{2}),? \d{4}'
for email in emails:
    if 'parts' not in email['payload']:
        continue
    for part in email['payload']['parts']:
        if part['mimeType'] == 'text/plain':
            if 'uniqueBody' not in email:
                plainText = str(base64.urlsafe_b64decode(bytes(str(part['body']['data']), encoding='utf-8')))
                email['uniqueBody'] = {'content': re.split(gmail_split_regex, plainText)[0]}
        elif 'parts' in part:
            for sub_part in part['parts']:
                if sub_part['mimeType'] == 'text/plain':
                    if 'uniqueBody' not in email:
                        plainText = str(base64.urlsafe_b64decode(bytes(str(sub_part['body']['data']), encoding='utf-8')))
                        email['uniqueBody'] = {'content': re.split(gmail_split_regex, plainText)[0]}

Inconsistent slow performance of Google App Engine

I have an API endpoint which only performs read requests. Usually, the entire operation completes in under 100 ms:
[03/Aug/2015:19:35:53 -0700] "GET /query?email=xxx%40aol.com&hash=xxx HTTP/1.1" 200 186 - "myapp-1.0.6o" "xxx.appspot.com" ms=74 cpu_ms=27 cpm_usd=0.000021 instance=00c61b117c55f2b00cdd73904665675ced040765 app_engine_release=1.9.24
However, it is not uncommon to see some requests surge to nearly 20 seconds!
[03/Aug/2015:18:59:50 -0700] "GET /query?email=yyy%40gmail.com&hash=yyy HTTP/1.1" 200 193 - "myapp-1.0.6o" "xxx.appspot.com" ms=18288 cpu_ms=61 cpm_usd=0.000022 instance=00c61b117c55f2b00cdd73904665675ced040765 app_engine_release=1.9.24
The server load is still pretty light: it gets 1 - 5 requests per minute, and we still fall under the free quota.
The API's Python code is pretty straightforward. I don't think it is the culprit for such slow operations.
class User(ndb.Model):
    email = ndb.StringProperty(required = True)
    timestamp = ndb.DateTimeProperty(required = True)

class QueryHandler(webapp2.RequestHandler):
    def get(self):
        email = self.request.get('email')
        hash = self.request.get('hash')
        expected_hash = Utils.hash(email)
        result = {
            'email' : email,
            'user_timestamp' : 0,
            'server_timestamp' : 0,
            'free_trial_duration' : _FREE_TRIAL_DURATION
        }
        user_timestamp = 0
        if hash == expected_hash:
            user_timestamp = memcache.get(email)
            if user_timestamp is None:
                user = User.get_by_id(email)
                if user is not None:
                    user_timestamp = int(time.mktime(user.timestamp.timetuple()))
                    result['user_timestamp'] = user_timestamp
                    memcache.add(email, user_timestamp, _MEMCACHE_DURATION)
            else:
                result['user_timestamp'] = user_timestamp
        else:
            logging.debug('QueryHandler, email = ' + email + ', hash = ' + hash + ', expected_hash = ' + expected_hash)
        server_timestamp = max(int(time.time()), user_timestamp)
        result['server_timestamp'] = server_timestamp
        self.response.headers['Content-Type'] = 'application/json'
        json_result = json.encode(result)
        self.response.out.write(json_result)
        logging.debug('QueryHandler, result = ' + json_result)

app = webapp2.WSGIApplication([
    ('/query', QueryHandler),
], debug=False)
It seems to be more a matter of Google App Engine server quality. Is there anything we can do to avoid this unreasonable slowness when consuming the API?

Scrapy : Why should I use yield for multiple request?

Simply put, I need three things.
1) Log-in
2) Multiple requests
3) Synchronous requests (sequential, like in C)
I realized 'yield' should be used for multiple requests.
But I think 'yield' works differently from C and is not sequential.
So I want to issue requests without 'yield', like below.
But the crawl method isn't called as expected.
How can I call the crawl method sequentially, like in C?
class HotdaySpider(scrapy.Spider):
    name = "hotday"
    allowed_domains = ["test.com"]
    login_page = "http://www.test.com"
    start_urls = ["http://www.test.com"]
    maxnum = 27982
    runcnt = 10

    def parse(self, response):
        return [FormRequest.from_response(response,formname='login_form',formdata={'id': 'id', 'password': 'password'}, callback=self.after_login)]

    def after_login(self, response):
        global maxnum
        global runcnt
        i = 0
        while i < runcnt :
            Request(url="http://www.test.com/view.php?idx=" + str(maxnum) + "/",callback=self.crawl)
            i = i + 1

    def crawl(self, response):
        global maxnum
        filename = 'hotday.html'
        with open(filename, 'wb') as f:
            f.write(unicode(response.body.decode(response.encoding)).encode('utf-8'))
        maxnum = maxnum + 1
When you return a list of requests (that's what you do when you yield many of them) Scrapy will schedule them and you can't control the order in which the responses will come.
If you want to process one response at a time and in order, you would have to return only one request in your after_login method and construct the next request in your crawl method.
def after_login(self, response):
    return Request(url="http://www.test.com/view.php?idx=0/", callback=self.crawl)

def crawl(self, response):
    global maxnum
    global runcnt
    filename = 'hotday.html'
    with open(filename, 'wb') as f:
        f.write(unicode(response.body.decode(response.encoding)).encode('utf-8'))
    maxnum = maxnum + 1
    # read the current idx from the URL and request the next one, if any
    next_page = int(re.search(r'\?idx=(\d*)', response.request.url).group(1)) + 1
    if next_page < runcnt:
        return Request(url="http://www.test.com/view.php?idx=" + str(next_page) + "/", callback=self.crawl)
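The URL-advancing step in crawl can be isolated and tried without Scrapy. The following is a sketch only: next_view_url is a hypothetical helper, and the URL pattern is the one from the question.

```python
import re


def next_view_url(current_url, limit):
    """Return the URL for the next idx, or None once `limit` is reached."""
    match = re.search(r"\?idx=(\d+)", current_url)
    if match is None:
        return None
    next_page = int(match.group(1)) + 1
    if next_page >= limit:
        return None
    return "http://www.test.com/view.php?idx={}/".format(next_page)


print(next_view_url("http://www.test.com/view.php?idx=0/", 10))  # http://www.test.com/view.php?idx=1/
print(next_view_url("http://www.test.com/view.php?idx=9/", 10))  # None
```

Inside crawl, a non-None result would be wrapped in a single Request with callback=self.crawl, so each page is only requested after the previous response has been handled.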
