/mapreduce/workerCallback producing http 429 response - google-app-engine

I'm working in Java and was able to kick off a mapreduce job. The job made it through the ShardedJob stage, but is now stuck on the ExamineStatusAndReturnResult stage. In the task queue I see a number of jobs like /mapreduce/workerCallback/map-hex-string. These tasks are all getting re-queued because the return code is 429 Too Many Requests (https://www.rfc-editor.org/rfc/rfc6585#section-4). I feel as though I'm hitting some sort of quota limit, but I cannot figure out where or why.
How can I tell why these tasks are receiving a 429 response code?

The mapreduce library tries to avoid running out of memory (OOM) by doing its own bookkeeping of estimated memory consumption. This can be tuned by overriding the Worker/InputReader/OutputWriter estimateMemoryRequirement methods, and it works best when MR jobs run on their own instances (module, backend, or version). Upon receiving an MR request from the task queue, the mapreduce library checks the request's estimated memory requirement; if it exceeds what is currently available, the request is rejected with HTTP error code 429. To minimize such cases you should either increase the available resources (instance class, number of instances) and/or decrease the parallel load (fewer concurrent jobs, fewer shards per job), and avoid any other type of load on the same instances.
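For illustration, a minimal Java sketch of such an override (the class name, the 8 MB figure, and the map body are hypothetical; the estimateMemoryRequirement hook itself is what the library consults for this admission check):

    import com.google.appengine.tools.mapreduce.Mapper;

    // Hypothetical mapper: estimateMemoryRequirement() tells the library how much
    // heap one slice of this worker is expected to need, which feeds the check
    // that rejects task-queue callbacks with 429 when instance memory is short.
    public class LogLineMapper extends Mapper<String, String, Long> {

      @Override
      public long estimateMemoryRequirement() {
        // Assumption: one slice of this mapper buffers at most ~8 MB of data.
        return 8 * 1024 * 1024;
      }

      @Override
      public void map(String line) {
        // ... parse the line and emit(key, value) as usual ...
      }
    }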

Related

What is 'capacity' parameter when talking about Flink async IO?

When using Flink AsyncDataStream#unorderedWait, there's a parameter called 'capacity'. Quoting from the official Flink docs:
Capacity: This parameter defines how many asynchronous requests may be in progress at the same time. Even though the async I/O approach leads typically to much better throughput, the operator can still be the bottleneck in the streaming application. Limiting the number of concurrent requests ensures that the operator will not accumulate an ever-growing backlog of pending requests, but that it will trigger backpressure once the capacity is exhausted.
I don't quite get it: is it for the whole job, or for each subtask?
Let's say my toy Flink app consumes a Kafka topic, and for each Kafka message it makes an HTTP request; when it receives the HTTP response, it sinks the result to another Kafka topic.
In this example, if the parallelism of the Kafka source is 50 and I set the 'capacity' to 10, what does that mean? Does it mean that the whole app will make at most 10 HTTP requests at the same time? Or 10 HTTP requests for each subtask (which would result in at most 500 HTTP requests at the same time)?
Another question: what is the best practice for setting the 'capacity' in this scenario?
Many thanks!
The capacity is per parallel instance (subtask) of the async I/O operator. So in your example, there would be at most 500 concurrent HTTP requests.
You may have to do some benchmarking experiments to see where it makes sense to balance the tradeoffs for your use case. If the capacity is too small then under load you're likely to create backpressure prematurely; if capacity is too large, then under load you're likely to overwhelm the external service, leading to timeouts or other errors.
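For concreteness, here is a small, self-contained Java sketch of the wiring being discussed (the HttpEnricher stand-in, the element values, and the timeout are illustrative; only the final capacity argument is the parameter in question, and it applies per parallel subtask):

    import java.util.Collections;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeUnit;
    import org.apache.flink.streaming.api.datastream.AsyncDataStream;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.async.AsyncFunction;
    import org.apache.flink.streaming.api.functions.async.ResultFuture;

    public class AsyncCapacityExample {

      // Stand-in for the real HTTP call: completes asynchronously on another thread.
      static class HttpEnricher implements AsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String input, ResultFuture<String> resultFuture) {
          CompletableFuture.supplyAsync(() -> "enriched:" + input)
              .thenAccept(result -> resultFuture.complete(Collections.singleton(result)));
        }
      }

      public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> source = env.fromElements("a", "b", "c");

        // capacity = 10 limits in-flight requests *per parallel subtask* of this
        // operator; with parallelism 50 that allows up to 500 requests in flight overall.
        DataStream<String> enriched = AsyncDataStream.unorderedWait(
            source, new HttpEnricher(), 30, TimeUnit.SECONDS, 10);

        enriched.print();
        env.execute("async capacity example");
      }
    }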

App Engine - Pull queues max_concurrent_requests limit?

I'm using Google App Engine pull queues to send massive push notifications to APNS, GCM and OneSignal mostly following this architecture: https://cloudplatform.googleblog.com/2013/07/google-app-engine-takes-pain-out-of-sending-ios-push-notifications.html
The problem is that I'm hitting some kind of limit on how many tasks can be leased at the same time: my notification workers lease 3 notifications at a time, but when there are more than about 30 workers running, leaseTasks() returns an empty array, even when there are hundreds or thousands of pending tasks. As far as I know, there is no limit on how many tasks can be leased at the same time, so this behaviour is unexpected.
Have you seen this limit of pull queues in the docs:
If you generate more than 10 LeaseTasks requests per queue per second,
only the first 10 requests will return results. The others will return
no results.
If you have 30 workers, it seems that you could easily hit this limit. Could you lease more tasks at a time and use fewer workers?
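As a rough Java sketch of that suggestion (the queue name, lease duration, and batch size are illustrative), a single worker can lease a much larger batch per leaseTasks() call and stay well under the 10-requests-per-queue-per-second limit:

    import com.google.appengine.api.taskqueue.Queue;
    import com.google.appengine.api.taskqueue.QueueFactory;
    import com.google.appengine.api.taskqueue.TaskHandle;
    import java.util.List;
    import java.util.concurrent.TimeUnit;

    public class NotificationWorker {
      public void drainBatch() {
        Queue pullQueue = QueueFactory.getQueue("notification-pull-queue");
        // Lease up to 100 tasks for 5 minutes in a single leaseTasks() call,
        // instead of leasing 3 at a time from dozens of workers.
        List<TaskHandle> batch = pullQueue.leaseTasks(300, TimeUnit.SECONDS, 100);
        for (TaskHandle task : batch) {
          // ... send the notification encoded in task.getPayload() ...
          pullQueue.deleteTask(task);  // delete only after successful delivery
        }
      }
    }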

Tasks, Cron jobs or Backends for an app

I'm trying to construct a non-trivial GAE app and I'm not sure if a cron job, tasks, backends or a mix of all is what I need to use based on the request time-out limit that GAE has for HTTP requests.
The distinct steps I need to do are:
1) I have upwards of 15,000 sites I need to pull data from on a regular schedule and without any user interaction. The total number of sites isn't going to be static, but they're all saved in the datastore [Table0] alongside the interval at which they're read. The interval may vary from as frequently as every day to every 30 days.
2) For each site from step #1 that fits the "pull" schedule criteria, I need to fetch data from it via HTTP GET (again, it might be all of them or as few as 2 or 3 sites). Once I get the response back from the site, I parse the result and save this data into the datastore as [Table1].
3) For all of the data that was recently put into the datastore in [Table1] (they'll have a special flag), I need to issue an additional HTTP request to a 3rd-party site to do some additional processing. As soon as I receive data from this site, I store all of the relevant info into another table [Table2] in the datastore.
4) As soon as data is available and ready from step #3, I need to take all of it and perform some additional transformation and update the original table [Table1] in the datastore.
I'm not certain which of the different components I need to use to ensure that I can complete each piece of the work without exceeding the response deadline that's placed on web requests in GAE. For requests initiated by cron jobs and tasks, I believe you're allowed 10 minutes to complete them, whereas typical user-driven requests are allowed 30 seconds.
Task queues are the best way to do this in general, but you might want to check out the App Engine Pipeline API, which is designed for exactly the sort of workflow you're talking about.
GAE is a tough platform for your use-case. But, out of extreme masochism, I am attempting something similar. So here are my two cents, based on my experience so far:
Backends -- Use them for any long-running, I/O intensive tasks you may have (Web-Crawling is a good example, assuming you can defer compute-intensive processing for later).
Mapreduce API -- excellent for compute-intensive/parallel jobs such as stats collection, indexing, etc. Until recently, this library only had a mapper implementation, but Google has since released an in-memory Shuffler that works well for jobs whose data fits in about 100MB.
Task Queues -- For when everything else fails :-).
Cron -- mostly to kick off periodic tasks; which context you execute them in is up to you.
It might be a good idea to design your backend tasks so that they can be scheduled (manually, or perhaps by querying your current quota usage) in the "Frontend" context using task queues, if you have spare Frontend CPU cycles.
I abandoned GAE before Backends came out, so can't comment on that. But, what I did a few times was:
Cron scheduled to kick off process
Cron handler invokes a task URL
task grabs the first item (URL) from the datastore, executes the HTTP request, operates on the data, updates the URL record as having been worked on, and then invokes the task URL again.
So cron is basically waking up taskqueue periodically and taskqueue runs recursively until it reaches some stopping point.
You can see it in action in one of my public GAE apps - https://github.com/mavenn/watchbots-gae-python.
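For illustration, here is a rough Java sketch of that cron-driven, self-chaining task pattern (the servlet, queue name, task URL, and the two helper methods are hypothetical placeholders; the linked app above implements the same idea in Python):

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import com.google.appengine.api.taskqueue.Queue;
    import com.google.appengine.api.taskqueue.QueueFactory;
    import com.google.appengine.api.taskqueue.TaskOptions;

    // Hypothetical task handler mapped to /tasks/crawl; the cron handler enqueues
    // the first of these tasks, and each task re-enqueues the next one.
    public class CrawlTaskServlet extends HttpServlet {
      @Override
      protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String siteUrl = fetchNextPendingSite();  // placeholder: query Table0 for the next due site
        if (siteUrl == null) {
          return;  // stopping point reached: nothing left to work on
        }
        crawlAndStore(siteUrl);                   // placeholder: HTTP GET + write results to Table1
        // Re-enqueue ourselves so the chain keeps running until the work is drained.
        Queue queue = QueueFactory.getQueue("crawl-queue");
        queue.add(TaskOptions.Builder.withUrl("/tasks/crawl"));
      }

      private String fetchNextPendingSite() { return null; }  // stub for the sketch
      private void crawlAndStore(String url) { }              // stub for the sketch
    }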

Task Queue VS. URLFetch

I need to run a script (Python) in App Engine many times.
One possibility is just to run a loop and use urlfetch with a link to the script.
The other one is to open a task with the script URL.
What is the difference between both ways? It seems like Tasks have a quota (100,000 daily free tasks) so why should I use them?
Thanks,
Joel
Briefly:
Bulk adding tasks to the queue will probably be easier, and possibly quicker, than using URLFetch. Although using async url-fetches might help with this.
When a task fails, it will automatically retry. With URLFetch, assuming you check the status of your call, the request might just hang for a while before you get some type of error.
You can control the rate at which tasks are executed. So if you add 1,000 tasks fast you can let them slowly run at 10 / minute (or whatever you want), helping you not blow through your other quotas.
If you enable billing, the free quota is 20,000,000 tasks per day.
Depending on what you are doing, tasks can be transactionally enqueued, which gives you some really powerful abilities.
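To illustrate that last point, here is a rough Java sketch of a transactional enqueue (your script is Python, which has an equivalent transactional add; the entity kind, handler URL, and parameter name here are illustrative):

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.Key;
    import com.google.appengine.api.datastore.Transaction;
    import com.google.appengine.api.taskqueue.Queue;
    import com.google.appengine.api.taskqueue.QueueFactory;
    import com.google.appengine.api.taskqueue.TaskOptions;

    public class TransactionalEnqueue {
      public void saveAndEnqueue() {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        Transaction txn = ds.beginTransaction();
        try {
          Key key = ds.put(txn, new Entity("Order"));   // illustrative entity kind
          // The task is only added if the surrounding transaction commits, so the
          // datastore write and the follow-up work succeed or fail together.
          Queue queue = QueueFactory.getDefaultQueue();
          queue.add(txn, TaskOptions.Builder
              .withUrl("/tasks/process-order")          // illustrative handler URL
              .param("orderKey", String.valueOf(key.getId())));
          txn.commit();
        } finally {
          if (txn.isActive()) {
            txn.rollback();
          }
        }
      }
    }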

Google App Engine: DeadlineExceededError

I have a GAE app that does some heavy processing up front, then is able to do very little processing on subsequent user requests. However, when I deploy my app to the Google's servers, and try to do the heavy processing, I get a DeadlineExceededError. Is there any way around this?
UPDATE: What if I do something through /remote_api? That tolerated the 10 minutes it took to upload the data, so perhaps it's immune to the time limit on requests?
Each script execution has a deadline of 30 seconds. /remote_api is no exception.
You may have a script running locally that takes 10 minutes to complete, but /remote_api is invoked once for every datastore RPC, so all this means is that each individual get, put, query, etc. finishes before the deadline.
The bulk loader, task queues, and query cursors are all designed to make it easier to do heavy processing in small chunks. If you need assistance refactoring your processing code to take advantage of these, please post some specific details about what you're trying to do.
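To make the "small chunks" approach concrete, here is a rough Java sketch combining a datastore query cursor with task-queue chaining (the entity kind, batch size, handler URL, and class name are all illustrative; your app is Python, where the equivalent cursor and taskqueue APIs exist):

    import com.google.appengine.api.datastore.Cursor;
    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.FetchOptions;
    import com.google.appengine.api.datastore.Query;
    import com.google.appengine.api.datastore.QueryResultList;
    import com.google.appengine.api.taskqueue.QueueFactory;
    import com.google.appengine.api.taskqueue.TaskOptions;

    public class ChunkedProcessor {
      // Process one batch of entities, then chain the next batch as a new task,
      // so no single request ever approaches the deadline.
      public void processBatch(String webSafeCursor) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        FetchOptions fetch = FetchOptions.Builder.withLimit(100);
        if (webSafeCursor != null) {
          fetch.startCursor(Cursor.fromWebSafeString(webSafeCursor));
        }
        QueryResultList<Entity> batch =
            ds.prepare(new Query("Record")).asQueryResultList(fetch);
        for (Entity entity : batch) {
          // ... the heavy per-entity processing goes here ...
        }
        if (!batch.isEmpty()) {
          QueueFactory.getDefaultQueue().add(TaskOptions.Builder
              .withUrl("/tasks/process")
              .param("cursor", batch.getCursor().toWebSafeString()));
        }
      }
    }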
