I'm trying to schedule a PySpark job. I followed the GCP documentation and ended up deploying a little Python script to App Engine which does the following:
authenticate using a service account
submit a job to a cluster
The problem is, I need the cluster to be up and running otherwise the job won't be sent (duh!) but I don't want the cluster to always be up and running, especially since my job needs to run once a month.
I wanted to add the creation of a cluster in my Python script, but the call is asynchronous (it makes an HTTP request) and thus my job is submitted after the cluster creation call but before the cluster is really up and running.
What could I do?
I'd like something cleaner than just waiting for a few minutes in my script!
Thanks
EDIT: Here's what my code looks like so far:
To launch the job:
import webapp2
from google.appengine.api import taskqueue


class EnqueueTaskHandler(webapp2.RequestHandler):
    def get(self):
        task = taskqueue.add(
            url='/run',
            target='worker')
        self.response.write(
            'Task {} enqueued, ETA {}.'.format(task.name, task.eta))


app = webapp2.WSGIApplication([('/launch', EnqueueTaskHandler)], debug=True)
The job:
import time

import googleapiclient.discovery
import webapp2


class CronEventHandler(webapp2.RequestHandler):

    def create_cluster(self, dataproc, project, zone, region, cluster_name):
        zone_uri = 'https://www.googleapis.com/compute/v1/projects/{}/zones/{}'.format(project, zone)
        cluster_data = {...}
        dataproc.projects().regions().clusters().create(
            projectId=project,
            region=region,
            body=cluster_data).execute()

    def wait_for_cluster(self, dataproc, project, region, clustername):
        print('Waiting for cluster to run...')
        while True:
            result = dataproc.projects().regions().clusters().get(
                projectId=project,
                region=region,
                clusterName=clustername).execute()
            # Handle exceptions
            if result['status']['state'] != 'RUNNING':
                time.sleep(60)
            else:
                return result

    def wait_for_job(self, dataproc, project, region, job_id):
        print('Waiting for job to finish...')
        while True:
            result = dataproc.projects().regions().jobs().get(
                projectId=project,
                region=region,
                jobId=job_id).execute()
            # Handle exceptions
            print(result['status']['state'])
            if result['status']['state'] == 'ERROR' or result['status']['state'] == 'DONE':
                return result
            else:
                time.sleep(60)

    def submit_job(self, dataproc, project, region, clusterName):
        job = {...}
        result = dataproc.projects().regions().jobs().submit(
            projectId=project, region=region, body=job).execute()
        return result['reference']['jobId']

    def post(self):
        dataproc = googleapiclient.discovery.build('dataproc', 'v1')
        project = '...'
        region = "..."
        zone = "..."
        clusterName = '...'
        self.create_cluster(dataproc, project, zone, region, clusterName)
        self.wait_for_cluster(dataproc, project, region, clusterName)
        job_id = self.submit_job(dataproc, project, region, clusterName)
        self.wait_for_job(dataproc, project, region, job_id)
        dataproc.projects().regions().clusters().delete(
            projectId=project, region=region, clusterName=clusterName).execute()
        self.response.write("JOB SENT")


app = webapp2.WSGIApplication([('/run', CronEventHandler)], debug=True)
Everything works until the deletion of the cluster. At this point I get a "DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded." Any ideas?
In addition to general polling (either through list or get requests on the Cluster, or on the Operation returned by the CreateCluster request), for single-use clusters like this you can also consider the Dataproc Workflows API, and in particular its InstantiateInline interface if you don't want to use full-fledged workflow templates. With this API you use a single request to specify the cluster settings along with the jobs to submit; the jobs run automatically as soon as the cluster is ready to take them, after which the cluster is deleted automatically.
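For illustration, a rough sketch of an inline workflow using the same googleapiclient client the question already builds; the template id, cluster name, step id and GCS path below are placeholders, not values from the question:

# Rough sketch of an inline workflow (placeholders throughout).
template = {
    'id': 'monthly-pyspark-workflow',               # placeholder template id
    'placement': {
        'managedCluster': {
            'clusterName': 'monthly-job-cluster',   # placeholder; cluster is ephemeral
            'config': {}                            # same shape as cluster_data['config']
        }
    },
    'jobs': [{
        'stepId': 'run-monthly-job',                # placeholder step id
        'pysparkJob': {'mainPythonFileUri': 'gs://my-bucket/job.py'}  # placeholder URI
    }]
}

operation = dataproc.projects().regions().workflowTemplates().instantiateInline(
    parent='projects/{}/regions/{}'.format(project, region),
    body=template).execute()
# The call returns a long-running Operation you can poll; the managed cluster
# is created, the job runs, and the cluster is deleted when the job finishes.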
You can use the Google Cloud Dataproc API to create, delete and list clusters.
The list operation can be (repeatedly) performed after create and delete operations to confirm that they completed successfully, since it provides the ClusterStatus of the clusters in the results with the relevant State information:
UNKNOWN The cluster state is unknown.
CREATING The cluster is being created and set up. It is not ready for use.
RUNNING The cluster is currently running and healthy. It is ready for use.
ERROR The cluster encountered an error. It is not ready for use.
DELETING The cluster is being deleted. It cannot be used.
UPDATING The cluster is being updated. It continues to accept and process jobs.
To avoid plain waiting between the (repeated) list invocations (in general not a good thing to do on GAE) you can enqueue delayed tasks in a push task queue (with the relevant context information), allowing you to perform such list operations at a later time. For example, in Python, see taskqueue.add():
countdown -- Time in seconds into the future that this task should run or be leased. Defaults to zero. Do not specify this argument if you specified an eta.
eta -- A datetime.datetime that specifies the absolute earliest time at which the task should run. You cannot specify this argument if the countdown argument is specified. This argument can be time zone-aware or time zone-naive, or set to a time in the past. If the argument is set to None, the default value is now. For pull tasks, no worker can lease the task before the time indicated by the eta argument.
If at task-execution time the result indicates that the operation of interest is still in progress, simply enqueue another such delayed task - effectively polling, but without an actual wait/sleep.
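A minimal sketch of that pattern for the cluster-creation case (the /poll-cluster and /submit-job URLs and the worker target are hypothetical, not part of the question's code):

import logging

import googleapiclient.discovery
import webapp2
from google.appengine.api import taskqueue


class PollClusterHandler(webapp2.RequestHandler):
    """Checks cluster state and either hands off the job or re-enqueues itself."""

    def post(self):
        project = self.request.get('project')
        region = self.request.get('region')
        cluster_name = self.request.get('cluster_name')

        dataproc = googleapiclient.discovery.build('dataproc', 'v1')
        cluster = dataproc.projects().regions().clusters().get(
            projectId=project, region=region, clusterName=cluster_name).execute()
        state = cluster['status']['state']

        params = {'project': project, 'region': region, 'cluster_name': cluster_name}
        if state == 'RUNNING':
            # Cluster is ready: enqueue the task that submits the PySpark job.
            taskqueue.add(url='/submit-job', target='worker', params=params)
        elif state == 'ERROR':
            # Give up: returning normally (200) acknowledges the task so it is not retried.
            logging.error('Cluster %s entered ERROR state', cluster_name)
        else:
            # Still CREATING: check again in 60 seconds, without sleeping in-request.
            taskqueue.add(url='/poll-cluster', target='worker',
                          countdown=60, params=params)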
We are using Pub/Sub and a Cloud Function to process a stream of incoming data. I am setting up a dead-letter topic to handle cases where a message cannot be processed, as described at Cloud Pub/Sub > Guides > Handling message failures.
I've configured a subscription on the dead-letter topic to retain messages for 7 days. We're doing this using Terraform:
resource "google_pubsub_subscription" "dead_letter_monitoring" {
project = var.project_id
name = "var.dead_letter_sub_name
topic = google_pubsub_topic.dead_letter.name
expiration_policy { ttl = "" }
message_retention_duration = "604800s" # 7 days
retain_acked_messages = true
ack_deadline_seconds = 600
}
We've tested our cloud function robustly and consequently our expectation is that messages will appear on this dead-letter topic very very rarely, perhaps never. Nevertheless we're putting it in place just to make sure that we catch any anomalies.
Given the rarity with which we expect messages to appear on the dead-letter topic, we need to set up an alert that sends an email when such a message appears. Is it possible to do this? I've taken a look through the alerts one can create at https://console.cloud.google.com/monitoring/alerting/policies/create, however I didn't see anything that could accomplish this.
I know that I could write a cloud function to consume a message from the subscription and act upon it accordingly, however I'd rather not have to do that; a monitoring alert feels like a much more elegant way of achieving this.
Is this possible?
Yes, you can use Cloud Monitoring for that. Create a new alerting policy with the following configuration:
Select the Pub/Sub Topic resource and the "Published messages" metric. Observe the value every minute and count it (the aligner, in the advanced options). Then, in the condition configuration, raise the alert when the most recent value is above 0.
To restrict the alert to your topic only, add a filter on topic_id with your topic name.
Then configure your alert to send an email. It should work!
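If you'd rather create the policy programmatically than through the console, a rough equivalent with the google-cloud-monitoring client could look like the sketch below; the project, topic and notification channel are placeholders, and the exact metric/filter should be verified against your project:

from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

client = monitoring_v3.AlertPolicyServiceClient()

policy = monitoring_v3.AlertPolicy(
    display_name="Message published on dead-letter topic",   # placeholder name
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="Published messages > 0",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                # Published-message count, filtered to the dead-letter topic.
                filter=(
                    'metric.type="pubsub.googleapis.com/topic/send_message_operation_count" '
                    'AND resource.type="pubsub_topic" '
                    'AND resource.label.topic_id="my-dead-letter-topic"'   # placeholder
                ),
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=0,
                duration=duration_pb2.Duration(seconds=0),
                aggregations=[
                    # Count per one-minute periods, mirroring "observe every minute".
                    monitoring_v3.Aggregation(
                        alignment_period=duration_pb2.Duration(seconds=60),
                        per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
                    )
                ],
            ),
        )
    ],
    # Attach an email notification channel created beforehand (placeholder id).
    notification_channels=["projects/my-project/notificationChannels/123"],
)

created = client.create_alert_policy(
    name="projects/my-project", alert_policy=policy)   # placeholder project
print(created.name)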
We are on the Google App Engine standard environment, F2 instance (generation 1, Python 2.7). We have a reporting module that follows this flow.
A worker task is initiated in a queue:
task = taskqueue.add(
    url='/backendreport',
    target='worker',
    queue_name='generate-reports',
    params={
        "task_data": task_data
    })
In the worker class we query Google Datastore and write the data to a Google Sheet. We paginate through the records to find additional report elements. When we find an additional page, we call the same task again to spawn another write, so it can fetch the next set of report elements and write them to the Google Sheet.
In backendreport.py we have the following code:
class BackendReport():
    # Query Google Datastore to find the records (paginated)
    result = self.service.spreadsheets().values().update(
        spreadsheetId=spreadsheet_Id,
        range=range_name,
        valueInputOption=value_input_option,
        body=resource_body).execute()

    # If pagination finds additional records
    task = taskqueue.add(
        url='/backendreport',
        target='worker',
        queue_name='generate-reports',
        params={
            "task_data": task_data
        })
We run the same BackendReport (with pagination) as a front-end job (not as a task). The pagination works without any error, meaning we fetch each page of records and display them to the front end. But when we execute the tasks iteratively, it fails with the soft memory limit issue. We were under the impression that every time a task is called (for each page) it should act independently and there shouldn't be any memory constraints. What are we doing wrong here?
Why doesn't GCP spin up a different instance automatically when the soft memory limit is reached (our instance class is F2)?
The error message says "soft memory limit of 512 MB reached after servicing 3 requests total" - does this mean that the backendreport module served 3 requests, i.e. there were 3 task calls (/backendreport)?
Why doesn't GCP spin up a different instance when the soft memory limit is reached
One of the primary mechanisms App Engine uses to decide when to spin up a new instance is max_concurrent_requests. You can check out all of the automatic_scaling params you can configure here:
https://cloud.google.com/appengine/docs/standard/python/config/appref#scaling_elements
does this mean that the backendreport module served 3 requests - does it mean there were 3 task calls (/backendreport)?
I think so. To be sure, you can open up the Logs Viewer, find the log where this was printed and filter your logs by that instance-id to see all the requests it handled that led to that point.
You're creating multiple tasks in the queue, but there's no limit on how that queue dispatches them, and as the queue tries to dispatch multiple tasks at the same time, it reaches the memory limit. So the limit you want to set is indeed max_concurrent_requests, but not for the instances in app.yaml; it should be set for queue dispatching in queue.yaml, so that only one task at a time is dispatched:
queue:
- name: generate-reports
  rate: 1/s
  max_concurrent_requests: 1
I am using the deferred task queues library with GAE. Every day I need to send a piece of text to all users connected to a certain page in my app. My app has multiple pages connected, so for each page, I want to go over all users, and send them a daily message. I am using a cursor to iterate over the table of Users in batches of 800. If there are more than 800 users, I want to remember where the cursor left off, and start another task with the other users.
I just want to make sure that with my algorithm I am going to send each user exactly one message. I want to make sure I won't miss any users, and that no user will receive the same message twice.
Does this look like the proper algorithm to handle my situation?
from google.appengine.ext import deferred


def send_news(page_cursor=None, page_batch_size=1,
              user_cursor=None, user_batch_size=800):
    p_query = PageProfile.query(PageProfile.subscribed == True)
    all_pages, next_page_cursor, page_more = p_query.fetch_page(page_batch_size,
                                                                start_cursor=page_cursor)
    for page in all_pages:
        if page.page_news_url and page.subscribed:
            query = User.query(User.subscribed == True, User.page_id == page.page_id)
            all_users, next_user_cursor, user_more = query.fetch_page(user_batch_size,
                                                                      start_cursor=user_cursor)
            for user in all_users:
                user.sendNews()

            # If there are more users on this page, remember the cursor
            # and get the next 800 users on this same page
            if user_more:
                deferred.defer(send_news, page_cursor=page_cursor, user_cursor=next_user_cursor)

    # If there are more pages left, use another deferred task to
    # send the daily news to users on those pages
    if page_more:
        deferred.defer(send_news, page_cursor=next_page_cursor)

    return "OK"
You could wrap your user.sendNews() call in another deferred task with a specific name, which ensures it is created only once.
import time

from google.appengine.api import taskqueue
from google.appengine.ext import deferred

interval_num = int(time.time()) / (60 * 60 * 24)
args = ('positional_arguments_for_object',)
kwargs = {'param': 'value'}
task_name = '_'.join([
    'user_name',
    'page_name',
    str(interval_num),
])
# With the interval present in the name we are sure that the task name
# for the same page and same user will stay the same for 24 hours.

try:
    deferred.defer(obj, _name=task_name, _queue='my-queue',
                   _url='/_ah/queue/deferred', *args, **kwargs)
except taskqueue.TaskAlreadyExistsError:
    # A task with this name already exists; it likely wasn't executed yet.
    pass
except taskqueue.TombstonedTaskError:
    # A task with this name was created not long ago and the name isn't
    # available to reuse; this should reset after about a week.
    pass
Note that, as far as I remember, App Engine does not guarantee that a task will be executed exactly once; in some edge cases it can be executed twice or more, so ideally tasks should be idempotent. If such edge cases are important to you, you could transactionally read/write a flag in the Datastore for each task, and before executing the task check whether that entity is there in order to cancel the execution.
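A minimal sketch of that check, assuming a hypothetical ndb marker entity whose key name encodes the user, the page and the daily interval used in the task name above:

from google.appengine.ext import ndb


class NewsSent(ndb.Model):
    """Marker entity; the key name encodes user, page and daily interval."""
    sent_at = ndb.DateTimeProperty(auto_now_add=True)


@ndb.transactional
def _claim(marker_id):
    """Returns True only for the first caller that claims this marker."""
    key = ndb.Key(NewsSent, marker_id)
    if key.get() is not None:
        return False            # already sent (or being sent) - skip
    NewsSent(key=key).put()     # record the claim inside the transaction
    return True


def send_news_once(user, page, interval_num):
    marker_id = '%s_%s_%s' % (user.key.id(), page.page_id, interval_num)
    if _claim(marker_id):
        user.sendNews()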
I have a backend that is normally invoked by a cron job to run a few times every day. Yesterday, I noticed it was restarting without stopping. I don't see a place in my code where that invocation is happening. Rather, the task queue seems to indicate it is running due to retries caused by errors. One error is that status is saved to BigQuery, and that is failing because a quota is exceeded. But this seems to generate an infinite loop. Is this a bug in App Engine or am I doing something wrong? Is there a way to indicate not to retry a task if it fails? My other App Engine tasks that terminate without a 200 status don't do that...
Here is a trace of the queue from which the restarts keep happening:
Here is the logging showing continuous running
And here is the HTTP header inside the logging
UPDATE 1
Here is the cron:
<?xml version="1.0" encoding="UTF-8"?>
<cronentries>
  <cron>
    <url>/uploadToBigQueryStatus</url>
    <description>Check fileNameSaved Status</description>
    <schedule>every 15 minutes from 02:30 to 03:30</schedule>
    <timezone>US/Pacific</timezone>
    <target>checkuploadstatus-backend</target>
  </cron>
</cronentries>
UPDATE 2
As for the comment about catching the error: the error, I believe, is that the BigQuery job fails because a quota has been hit. The strange thing is that it happened yesterday, and the quota should have been reset, so the error should have gone away for at least a while. I don't understand why the task retries; I never selected that option as far as I am aware.
I killed the servlet and emptied the task queue, so at least it has stopped, but I don't know the root cause. If the BQ table quota was the reason, that shouldn't cause an infinite retry!
UPDATE 3
I have not trapped the servlet call that produced the error that led to the infinite retry. But I checked this cron-activated servlet today and found I had another non-200 result. The return value this time was 500 and it is caused by a Datastore timeout exception.
Here is the screenshot of the response that shows the 500 return code.
Here is the exception info, page 1
And the following data
The offending line is the for loop iterating over the Datastore query:
if (keys[0] != null) {
    /* Define the query */
    q = new Query(bucket).setAncestor(keys[0]);
    pq = datastore.prepare(q);
    gotResult = false;
    // First system time stamp
    Date date = new Timestamp(new Date().getTime());
    Timestamp timeStampNow = new Timestamp(date.getTime());
    for (Entity result : pq.asIterable()) {
I will add a try-catch on this for loop as it is crashing in this iteration.
if (keys[0] != null) {
    /* Define the query */
    q = new Query(bucket).setAncestor(keys[0]);
    pq = datastore.prepare(q);
    gotResult = false;
    // First system time stamp
    Date date = new Timestamp(new Date().getTime());
    Timestamp timeStampNow = new Timestamp(date.getTime());
    try {
        for (Entity result : pq.asIterable()) {
Hopefully, the Datastore read will no longer crash the servlet but will just register a failure. At least the cron will run again and pick up other unhandled results.
By the way, is this a Java error or an App Engine one? I see a lot of these Datastore timeouts and I will add a try-catch around all the result loops. Still, it should not cause the infinite retry that I experienced. I will see if I can find the actual crash... the problem is that it overloaded my logging. More later.
UPDATE 4
I went back to the logs to see when the infinite loop began. In the logs below, I opened the run that is at the head of the continuous running. You can see that it fails with a 500 every 5th time. It is not the cron that invoked it; it was me calling the servlet to check the BigQuery upload status (I write the job info to the Datastore, then read it back in the servlet and write the job status to BigQuery, and if done, erase the Datastore entry). I cannot explain the steady 500 errors every 5th call, but it is always the Datastore timeout exception.
UPDATE 5
Can the infinite retries be happening because of the queue configuration?
CheckUploadStatus: 20/s, 10, 100, 10, 200, 2
I just noticed another task queue had a 500 return code and it was continuously retrying. I did some searching and found that some people have tried to configure the queues for no retries; they said that didn't work.
See this link:
Google App Engine: task_retry_limit doesn't work?
But one retry is possible? That would be far better than infinite.
It is contradictory that Google enforces quotas but seems to prefer infinite retries. I would much prefer to block retries by default on non-200 return codes, and then have NO QUOTAS!!!
According to Retrying cron jobs that fail:
If a cron job's request handler returns a status code that is not in the range 200–299 (inclusive), App Engine considers the job to have failed. By default, failed jobs are not retried.
To set failed jobs to be retried:
Include a retry-parameters block in your cron.xml file.
Choose and set the retry parameters in the retry-parameters block.
Your cron config doesn't specify the necessary retry parameters, so the jobs returning the 500 code should, indeed, not be retried, as you expect.
So this looks like a bug. Possibly a variant of the (older) known issue 10075 - the 503 code mentioned there might have changed in the meantime - but it is also a quota-related failure.
The suggestion from GAEfan's comment is likely a good workaround:
You will need to catch the error, and send a 200 response to stop the task queue from retrying. – GAEfan
The previous developers created some problems in our Google App Engine app. Currently, the app is saving entities with NULL values, and it would be better if we could clean all these values up.
Here is the ndb.Model:
from google.appengine.ext import ndb


class Day(ndb.Model):
    date = ndb.DateProperty(required=True, indexed=True)
    items = ndb.StringProperty(repeated=True, indexed=False)
    reason = ndb.StringProperty(name="cancelled", indexed=False)
    is_hole = ndb.ComputedProperty(lambda s: not bool(s.items or s.reason))
Somehow, we need to delete all Days where is_hole is true.
There are around 4,000,000 entities, of which around 2,000,000 should be deleted on the server.
Code so far
I thought it would be good to first count how many entities we should delete using this code:
count = Day.query(Day.is_hole != False).count(10000)
This (with the limit of 10,000) takes around 5 seconds to run. Without the limit, it would cause a DeadlineExceededError.
For deleting, I've tried this code:
ndb.delete_multi([key for key in Day.query(Day.is_hole != False).fetch(10000, keys_only=True)])
This (with the limit) takes around 30 seconds.
Question
How can I delete all Day entities where is_hole != False faster?
(We are using Python)
No, there is no faster way to delete entities - the request deadline is fixed.
But there are some tricks.
You can effectively get a longer deadline by using the task queue (https://cloud.google.com/appengine/docs/python/taskqueue/): put a task in a queue and have each task enqueue the next one after it finishes (recurrence).
Another option, similar to the task queue, is to have the handler delete some of the bad records and then redirect to itself, repeating until the last record has been deleted. The browser needs to stay open until the end.
if at_least_one_bad_record:
    delete_some_records (not longer than 30s)
    spawn this task again, or redirect to this handler (the next call gets another 30s)
Remember that it has an exit point when there are no more matching records, so it will delete all matching records without you having to click again.
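As a rough sketch of the task-queue approach (assuming the Day model from the question is importable; the equality filter on the boolean is equivalent to the != False query, and the batch size and kick-off call are arbitrary choices):

from google.appengine.ext import deferred
from google.appengine.ext import ndb


def delete_holes(cursor=None, batch_size=500):
    """Deletes one batch of 'hole' Days, then re-enqueues itself until done."""
    keys, next_cursor, more = Day.query(Day.is_hole == True).fetch_page(
        batch_size, start_cursor=cursor, keys_only=True)
    if keys:
        ndb.delete_multi(keys)
    if more and next_cursor:
        # Each deferred run gets its own request deadline, so the overall
        # job is no longer bound by a single request's time limit.
        deferred.defer(delete_holes, cursor=next_cursor, batch_size=batch_size)


# Kick it off once, e.g. from a handler or the remote API shell:
# deferred.defer(delete_holes)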
The best way is to use MapReduce, which will run in the task queue, and you can also use sharding to parallelise the work. Here is the Python code. Let me know if you need any clarification.
main.py
import webapp2

from google.appengine.api import app_identity

from mapreduce import base_handler
from mapreduce import mapreduce_pipeline
from mapreduce import operation as op


def deleteEntity(entity):
    yield op.db.Delete(entity)


class DeleteEntitiesPipeline(base_handler.PipelineBase):
    def run(self):
        bucket_name = app_identity.get_default_gcs_bucket_name()
        yield mapreduce_pipeline.MapPipeline(
            "job_name",
            "main.deleteEntity",
            "mapreduce.input_readers.DatastoreInputReader",
            params={
                "entity_kind": 'models.Day',
                "filters": [("is_hole", "=", True)],
                "bucket_name": bucket_name
            },
            shards=5)


class StartDelete(webapp2.RequestHandler):
    def get(self):
        pipeline = DeleteEntitiesPipeline()
        pipeline.start()


application = webapp2.WSGIApplication([
    ('/deleteentities', StartDelete),
], debug=True)