How do I create push queue tasks for multiple queues - google-app-engine

I have defined two Google App Engine push queues called "default" and "fast". How do I create a task in the "fast" queue?
Here is the queue.yaml:
queue:
- name: default
  rate: 20/s
  bucket_size: 10
- name: fast
  rate: 50/s
  bucket_size: 10
I have tried multiple things such as modifying the url parameter, but everything lands in the default queue.
Does anybody have code that shows how to send tasks to multiple queues within the same module?

taskqueue.add() takes an argument called queue_name.
from google.appengine.api import taskqueue

task = taskqueue.add(
    url='/your_task_handler_url',
    params={'param1': 'paramval'},
    queue_name='fast')
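To send tasks to either queue from the same module, the queue name can be chosen per call. The following is a minimal sketch, not code from the answer: enqueue_task and its urgent flag are hypothetical names, and the add argument is assumed to be taskqueue.add, passed in so the routing can be tried outside App Engine.

```python
def enqueue_task(add, url, params, urgent=False):
    """Enqueue a push task on 'fast' when urgent, otherwise on 'default'.

    `add` is assumed to be google.appengine.api.taskqueue.add.
    """
    queue_name = 'fast' if urgent else 'default'
    return add(url=url, params=params, queue_name=queue_name)


# Stand-in for taskqueue.add that only records its keyword arguments.
calls = []

def fake_add(**kwargs):
    calls.append(kwargs)
    return kwargs

enqueue_task(fake_add, '/your_task_handler_url', {'param1': 'paramval'}, urgent=True)
enqueue_task(fake_add, '/your_task_handler_url', {'param1': 'paramval'})
```

Inside the app you would pass taskqueue.add instead of fake_add; both queue names must exist in queue.yaml.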

Related

How can I multithread dasha.ai queue for calling some users simultaneously?

Is it possible to call several users simultaneously in dasha.ai? What is the maximum number of simultaneous calls, and how can I implement this?
Simultaneous calls are already implemented via the Conversation Queue.
Simultaneous calls limits
Instance limit
The instance limit is set by you in the SDK application.
It is set through the concurrency parameter of the application.start method (default value: 1):
await application.start({ concurrency: 10 });
Group limit
All users have a Default group, which has no limit (theoretically infinite).
However, if you are using a custom group, you should set max-concurrency to the number of simultaneous calls allowed in it.
You can set and update max-concurrency via the dasha CLI.
Example of creating a group:
dasha group create group_name --max-concurrency=50
and updating group
dasha group update group_name --max-concurrency=50
Which group will be used by your application (instance) is defined by deploy method:
dasha.deploy("app", { groupName: "Default" });
Customer limit
This limit can be changed only on request (you can't change it manually, at least for now).
Application
You need to specify how calls are handled via the Conversation Queue. For calls that can be started, handle the ready event:
application.queue.on("ready", async (key, conversation) => {
  conversation.input = getInput(key); // getInput must return an object that consists of the input variables for a call
  const result = await conversation.execute(); // start the conversation/call
});
This event handler is invoked asynchronously for each call.
To make simultaneous calls, push as many calls as you need into the queue. The example below adds one call that must start within one hour or it will time out:
application.queue.push("some_unique_key", {
  after: new Date(Date.now()),
  before: new Date(Date.now() + 60 * 60 * 1000)
});
As long as the limits are not exhausted and the queue holds calls that are ready to start, they will be processed as soon as possible.

Exceeded soft memory limit of 512 MB with 532 MB after servicing 3 requests total. Consider setting a larger instance class in app.yaml

We are on Google App engine standard environment, F2 instance (generation 1 - python 2.7). We have a reporting module that follows this flow.
Worker Task is initiated in a queue.
task = taskqueue.add(
    url='/backendreport',
    target='worker',
    queue_name='generate-reports',
    params={
        "task_data": task_data
    })
In the worker class, we query Google Datastore and write the data to a Google Sheet. We paginate through the records to find additional report elements. When we find an additional page, we call the same task again to spawn another write, so it can fetch the next set of report elements and write them to the Google Sheet.
In backendreport.py we have the following code.
class BackendReport():
    # Query google datastore to find the records (paginated)
    result = self.service.spreadsheets().values().update(
        spreadsheetId=spreadsheet_Id,
        range=range_name,
        valueInputOption=value_input_option,
        body=resource_body).execute()
    # If pagination finds additional records
    task = taskqueue.add(
        url='/backendreport',
        target='worker',
        queue_name='generate-reports',
        params={
            "task_data": task_data
        })
We run the same BackendReport (with pagination) as a front end job (not as a task). The pagination works without any error - meaning we fetch each page of records and display to the front end. But when we execute the tasks iteratively it fails with the soft memory limit issue. We were under the impression that every time a task is called (for each pagination) it should act independently and there shouldn't be any memory constraints. What are we doing wrong here?
Why doesn't GCP automatically spin up a different instance when the soft memory limit is reached (our instance class is F2)?
The error message says the soft memory limit of 512 MB was exceeded after servicing 3 requests total. Does this mean that the backendreport module spun up 3 requests, that is, that there were 3 task calls (/backendreport)?
Why doesn't GCP spin a different instance when the soft memory limit is reached
One of the primary mechanisms App Engine uses to decide when to spin up a new instance is max_concurrent_requests. You can check out all of the automatic_scaling parameters you can configure here:
https://cloud.google.com/appengine/docs/standard/python/config/appref#scaling_elements
does this mean that the backendreport module spun up 3 requests - does it mean there were 3 tasks calls (/backendreport)?
I think so. To be sure, open the Logs Viewer, find the log entry where this was printed, and filter your logs by that instance ID to see all the requests it handled that led up to that point.
You're creating multiple tasks in Cloud Tasks, but the queue that dispatches them has no concurrency limit, so it tries to dispatch several tasks at the same time and the instance reaches its memory limit. The setting you want is max_concurrent_requests, but set for the queue in queue.yaml rather than for the instances in app.yaml, so that only one task is dispatched at a time:
- name: generate-reports
  rate: 1/s
  max_concurrent_requests: 1
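Alongside the queue setting, it may help to make sure each task handles exactly one page and passes only a cursor to the next task. The sketch below illustrates that chaining pattern under stated assumptions: handle_report_page and its three callables are hypothetical stand-ins (for the Datastore page fetch, the Sheets update, and taskqueue.add), not the poster's code.

```python
def handle_report_page(fetch_page, write_rows, enqueue_next, cursor=None):
    """Process exactly one page, then hand the next cursor to a new task.

    Hypothetical stand-ins:
      fetch_page(cursor) -> (rows, next_cursor, more)
      write_rows(rows)       e.g. wraps the Sheets values().update call
      enqueue_next(cursor)   e.g. wraps taskqueue.add with the cursor in params
    """
    rows, next_cursor, more = fetch_page(cursor)
    write_rows(rows)
    if more:
        # Pass only the small cursor along, never the rows themselves.
        enqueue_next(next_cursor)
    return len(rows)


# Tiny in-memory simulation: 5 records, 2 per page.
data = list(range(5))
written, pending = [], []

def fetch_page(cursor, size=2):
    start = cursor or 0
    rows = data[start:start + size]
    end = start + len(rows)
    return rows, end, end < len(data)

pending.append(None)  # the first task has no cursor
while pending:        # stands in for the queue dispatching one task at a time
    handle_report_page(fetch_page, written.extend, pending.append, pending.pop(0))
```

Because each task only ever holds one page of rows, memory use per request stays bounded regardless of how many pages the report has.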

Creating a cluster before sending a job to dataproc programmatically

I'm trying to schedule a PySpark Job. I followed the GCP documentation and ended up deploying a little python script to App Engine which does the following :
authenticate using a service account
submit a job to a cluster
The problem is, I need the cluster to be up and running otherwise the job won't be sent (duh !) but I don't want the cluster to always be up and running, especially since my job needs to run once a month.
I wanted to add the creation of a cluster in my python script but the call is asynchronous (it makes an HTTP request) and thus my job is submitted after the cluster creation call but before the cluster is really up and running.
How can I do this?
I'd like something cleaner than just waiting for a few minutes in my script !
Thanks
EDIT : Here's what my code looks like so far :
To launch the job
import webapp2
from google.appengine.api import taskqueue


class EnqueueTaskHandler(webapp2.RequestHandler):
    def get(self):
        # Hand the long-running work to the 'worker' target via a push task.
        task = taskqueue.add(
            url='/run',
            target='worker')
        self.response.write(
            'Task {} enqueued, ETA {}.'.format(task.name, task.eta))


app = webapp2.WSGIApplication([('/launch', EnqueueTaskHandler)], debug=True)
The job
import time

import googleapiclient.discovery
import webapp2


class CronEventHandler(webapp2.RequestHandler):

    def create_cluster(self, dataproc, project, zone, region, cluster_name):
        zone_uri = 'https://www.googleapis.com/compute/v1/projects/{}/zones/{}'.format(project, zone)
        cluster_data = {...}
        dataproc.projects().regions().clusters().create(
            projectId=project,
            region=region,
            body=cluster_data).execute()

    def wait_for_cluster(self, dataproc, project, region, clustername):
        print('Waiting for cluster to run...')
        while True:
            result = dataproc.projects().regions().clusters().get(
                projectId=project,
                region=region,
                clusterName=clustername).execute()
            # Handle exceptions
            if result['status']['state'] != 'RUNNING':
                time.sleep(60)
            else:
                return result

    def wait_for_job(self, dataproc, project, region, job_id):
        print('Waiting for job to finish...')
        while True:
            result = dataproc.projects().regions().jobs().get(
                projectId=project,
                region=region,
                jobId=job_id).execute()
            # Handle exceptions
            print(result['status']['state'])
            if result['status']['state'] == 'ERROR' or result['status']['state'] == 'DONE':
                return result
            else:
                time.sleep(60)

    def submit_job(self, dataproc, project, region, clusterName):
        job = {...}
        result = dataproc.projects().regions().jobs().submit(
            projectId=project, region=region, body=job).execute()
        return result['reference']['jobId']

    def post(self):
        dataproc = googleapiclient.discovery.build('dataproc', 'v1')
        project = '...'
        region = "..."
        zone = "..."
        clusterName = '...'
        self.create_cluster(dataproc, project, zone, region, clusterName)
        self.wait_for_cluster(dataproc, project, region, clusterName)
        job_id = self.submit_job(dataproc, project, region, clusterName)
        self.wait_for_job(dataproc, project, region, job_id)
        dataproc.projects().regions().clusters().delete(
            projectId=project, region=region, clusterName=clusterName).execute()
        self.response.write("JOB SENT")


app = webapp2.WSGIApplication([('/run', CronEventHandler)], debug=True)
Everything works until the deletion of the cluster. At this point I get a "DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded." Any idea ?
Besides polling with list or get requests on the Cluster, or on the Operation returned by the CreateCluster request, for single-use clusters like this you can also consider the Dataproc Workflows API, possibly through its InstantiateInline interface if you don't want full-fledged workflow templates. With this API, a single request specifies the cluster settings along with the jobs to submit; the jobs run automatically as soon as the cluster is ready to take them, and the cluster is deleted automatically afterwards.
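As a rough illustration of the inline workflow idea, the sketch below builds a request body with one managed cluster and one PySpark step. The helper name, the step ID, the cluster name, the empty cluster config and the gs:// path are all placeholders, and the call shape in the trailing comment is an assumption based on the discovery client used in the question; check both against the Dataproc reference before relying on them.

```python
def build_inline_workflow(cluster_name, cluster_config, main_python_file_uri):
    """Return a workflow template body: one managed cluster plus one PySpark step."""
    return {
        'placement': {
            'managedCluster': {
                'clusterName': cluster_name,
                'config': cluster_config,  # placeholder: your cluster settings
            }
        },
        'jobs': [
            {
                'stepId': 'monthly-pyspark',  # hypothetical step name
                'pysparkJob': {'mainPythonFileUri': main_python_file_uri},
            }
        ],
    }


template = build_inline_workflow('monthly-cluster', {}, 'gs://your-bucket/job.py')

# Assumed call shape with the discovery client from the question:
# dataproc.projects().regions().workflowTemplates().instantiateInline(
#     parent='projects/{}/regions/{}'.format(project, region),
#     body=template).execute()
```

With this approach the App Engine handler only sends one request and does not need to wait for the cluster or the job.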
You can use the Google Cloud Dataproc API to create, delete and list clusters.
The list operation can be (repeatedly) performed after create and delete operations to confirm that they completed successfully, since it provides the ClusterStatus of the clusters in the results with the relevant State information:
UNKNOWN The cluster state is unknown.
CREATING The cluster is being created and set up. It is not ready for use.
RUNNING The cluster is currently running and healthy. It is ready for use.
ERROR The cluster encountered an error. It is not ready for use.
DELETING The cluster is being deleted. It cannot be used.
UPDATING The cluster is being updated. It continues to accept and process jobs.
To prevent plain waiting between the (repeated) list invocations (in general not a good thing to do on GAE) you can enqueue delayed tasks in a push task queue (with the relevant context information) allowing you to perform such list operations at a later time. For example, in python, see taskqueue.add():
countdown -- Time in seconds into the future that this task should run or be leased. Defaults to zero. Do not specify this argument if you specified an eta.
eta -- A datetime.datetime that specifies the absolute earliest time at which the task should run. You cannot specify this argument if the countdown argument is specified. This argument can be time zone-aware or time zone-naive, or set to a time in the past. If the argument is set to None, the default value is now. For pull tasks, no worker can lease the task before the time indicated by the eta argument.
If, at task execution time, the result indicates that the operation of interest is still in progress, simply enqueue another such delayed task. This is effectively polling, but without an actual wait/sleep.
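The delayed-task polling described here can be sketched as a single step that either acts or reschedules itself. This is an illustration under assumptions, not the answerer's code: check_cluster and its three callables are hypothetical, standing in for a clusters().get or list call, a taskqueue.add(..., countdown=...) call, and the job submission.

```python
def check_cluster(get_state, enqueue_check, on_running, countdown=60):
    """One polling step: act if RUNNING, stop on ERROR, otherwise reschedule.

    Hypothetical stand-ins:
      get_state()              -> cluster state string, e.g. 'CREATING'
      enqueue_check(countdown) -> e.g. taskqueue.add(url=..., countdown=countdown)
      on_running()             -> e.g. submit the job
    """
    state = get_state()
    if state == 'RUNNING':
        on_running()
    elif state != 'ERROR':
        # Still in progress: schedule another check instead of sleeping.
        enqueue_check(countdown)
    # On ERROR nothing is rescheduled, so the polling chain stops.
    return state


# Simulation: the cluster reports CREATING twice, then RUNNING.
states = iter(['CREATING', 'CREATING', 'RUNNING'])
scheduled, submitted = [], []
state = None
while state != 'RUNNING':
    state = check_cluster(lambda: next(states), scheduled.append,
                          lambda: submitted.append('job'))
```

Each step returns quickly, so no single request comes close to the request deadline; the same pattern can be reused to wait for the job to finish and for the cluster deletion.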

Deleting a massive number of entities from Google App Engine NDB

The previous developers left some problems in our Google App Engine app. Currently, the app is saving entities with NULL values, and it would be better if we could clean up all these values.
Here is the ndb.Model:
class Day(ndb.Model):
    date = ndb.DateProperty(required=True, indexed=True)
    items = ndb.StringProperty(repeated=True, indexed=False)
    reason = ndb.StringProperty(name="cancelled", indexed=False)
    is_hole = ndb.ComputedProperty(lambda s: not bool(s.items or s.reason))
Somehow, we need to delete all Days where is_hole is true.
It's around 4 000 000 entities where around 2 000 000 should be deleted on the server.
Code so far
I thought it would be good to first count how many entities we should delete using this code:
count = Day.query(Day.is_hole != False).count(10000)
This (with the limit of 10 000) takes around 5 seconds to run. Without the limit, it would cause a DeadlineExceededError.
For deleting, I've tried this code:
ndb.delete_multi([key for key in Day.query(Day.is_hole != False).fetch(10000, keys_only=True)])
This (with the limit) takes around 30 seconds.
Question
How can I delete all Day entities where is_hole != False faster?
(We are using Python)
No, there is no faster way to delete entities; the request deadline is fixed.
But there are some tricks.
You can get a longer deadline by using the task queue (https://cloud.google.com/appengine/docs/python/taskqueue/): put a task in the queue and have each task enqueue the next one when it finishes (recurrence).
Another option, similar to the task queue, is to delete some of the bad records and then redirect to the same handler, repeating until the last record has been deleted. This requires keeping the browser open until the end.
if at_least_one_bad_record:
    delete_some_records (not longer than 30s)
    spawn this task again or redirect to this handler (next call will have next 30s)
Remember that it needs an exit point for when no matching records remain. That way it deletes all matching records without you clicking again.
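The recurrence above can be sketched as one batch step that re-enqueues itself until nothing is left. This is a minimal illustration under assumptions: delete_batch and its callables are hypothetical, standing in for a keys-only query fetch, ndb.delete_multi, and taskqueue.add on the same handler.

```python
def delete_batch(fetch_keys, delete_multi, enqueue_next, batch_size=1000):
    """Delete one batch of matching keys and schedule the next run.

    Hypothetical stand-ins:
      fetch_keys(n)      -> up to n keys, e.g. query.fetch(n, keys_only=True)
      delete_multi(keys) -> e.g. ndb.delete_multi
      enqueue_next()     -> e.g. taskqueue.add(url=<this handler>)
    """
    keys = fetch_keys(batch_size)
    if not keys:
        return 0  # exit point: nothing left to delete, so do not re-enqueue
    delete_multi(keys)
    enqueue_next()
    return len(keys)


# Simulation with 2500 fake keys held in a set.
store = set(range(2500))
pending = [True]
runs = 0

def fake_delete(keys):
    store.difference_update(keys)

while pending:  # stands in for the task queue running each task in turn
    pending.pop()
    delete_batch(lambda n: sorted(store)[:n], fake_delete,
                 lambda: pending.append(True))
    runs += 1
```

Pick the batch size so that one batch finishes well inside the task deadline; the final run finds no keys and ends the chain.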
The best way is to use MapReduce, which runs in the task queue and can also shard the work so it runs in parallel. Here is the Python code. Let me know if you need any clarification.
main.py
import webapp2

from mapreduce import base_handler
from mapreduce import mapreduce_pipeline
from mapreduce import operation as op
from mapreduce.input_readers import InputReader
from google.appengine.api import app_identity


def deleteEntity(entity):
    # Mapper: called once per matching entity; queues it for deletion.
    yield op.db.Delete(entity)


class DeleteEntitiesPipeline(base_handler.PipelineBase):
    def run(self):
        bucket_name = (app_identity.get_default_gcs_bucket_name())
        yield mapreduce_pipeline.MapPipeline(
            "job_name",
            "main.deleteEntity",
            "mapreduce.input_readers.DatastoreInputReader",
            params={
                "entity_kind": 'models.Day',
                "filters": [("is_hole", "=", True)],
                "bucket_name": bucket_name
            },
            shards=5)


class StartDelete(webapp2.RequestHandler):
    def get(self):
        pipeline = DeleteEntitiesPipeline()
        pipeline.start()


application = webapp2.WSGIApplication([
    ('/deleteentities', StartDelete),
], debug=True)

GAE Golang - How to properly schedule a Task Queue to a Backend?

There is little information on how to schedule a Task Queue to a Backend in Google App Engine in Go. In TQ's Reference we can read:
// Additional HTTP headers to pass at the task's execution time.
// To schedule the task to be run with an alternate app version
// or backend, set the "Host" header.
Header http.Header
But there is no explanation on what to really set the "Host" to. In Backends' Overview we can similarly read:
Private backends can be accessed by application administrators, instances of the application, and by App Engine APIs and services (such as Task Queue tasks and Cron jobs) without any special configuration.
But again, no explanation is given.
I tried setting the "Host" value to the name of the backend, but the tasks are executed by the normal application.
t := taskqueue.NewPOSTTask("/", map[string][]string{"key": {key}})
t.Header.Add("Host", "backend")
if _, err := taskqueue.Add(c, t, ""); err != nil {
    return
}
What is the correct way to schedule a Backend call in GAE Go?
It's easiest to target a backend by using a named queue. e.g.:
_, err = taskqueue.Add(c, &taskqueue.Task{
    Path:    "/myProcessorPath",
    Payload: myPayload,
}, "myQueueName")
Your queue definition specifies the backend. e.g. for myQueueName, you might have a queue.yaml entry that looks like this:
- name: myQueueName
  target: myBackendName
  rate: 400/s
  max_concurrent_requests: 64
  bucket_size: 25
  retry_parameters:
    task_age_limit: 7d
Use the appengine.BackendHostname function to get the hostname for a backend. That should be usable as the Host header for a task.
