What is the benefit / usage of an App Engine remote procedure call?

I'm trying to grasp the concept of the RPC on App Engine. When or why would I need to use one, and what are the benefits?
Do they help with staying within your quota?
Are they more efficient?

When you use the datastore, memcache, URL Fetch, or many of the other services, you are implicitly creating and using an RPC.
Some methods take an optional RPC argument. You can create an RPC with custom settings, such as a deadline, to gain more control over the call. For example, setting a lower deadline on a datastore write ensures you have enough time left in the request to try again, or to defer the write to the task queue, when the operation fails with a timeout.
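For concreteness, here's a minimal sketch of that pattern using the Python db API; the entities list and the fallback to deferred are illustrative assumptions, not the only way to do it:

from google.appengine.ext import db
from google.appengine.ext import deferred

# Give this put only 2 seconds instead of the default deadline.
rpc = db.create_rpc(deadline=2)
try:
    db.put(entities, rpc=rpc)  # 'entities' is a list of model instances
except db.Timeout:
    # The short deadline left enough of the request budget to fall
    # back to a task queue write instead.
    deferred.defer(db.put, entities)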

An RPC on App Engine is useful when you want to do a URL fetch and do other things while you're waiting for the response.
Let's say your URL fetch takes 1 second to complete, and you have 'other' processing to do for 1 second which you can do while you're waiting. You can launch an RPC call, do the 'other' processing, and when the RPC fetch is finished you can continue the request. The request takes a total of 1 second (plus overhead) with the RPC, as opposed to the conventional approach, which would take 2 seconds.
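A rough sketch of that pattern with the Python urlfetch API (do_other_work and handle_result are placeholders for your own code):

from google.appengine.api import urlfetch

rpc = urlfetch.create_rpc()
urlfetch.make_fetch_call(rpc, 'http://example.com/slow-endpoint')

do_other_work()  # runs while the fetch is in flight

result = rpc.get_result()  # blocks until the fetch completes
if result.status_code == 200:
    handle_result(result.content)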

Concurrency & Parallelism in AppEngine

I am learning App Engine and have created a Spring-based application which has a controller for accepting all incoming requests. There is just one method in the controller, which is used to populate 5 tables in BigQuery. So I have 5 separate methods to insert data into BigQuery, and I currently call each of these methods sequentially in my controller method. But I want to execute these 5 BQ methods in parallel, not in sequence. How can I achieve such parallelism in an App Engine app?
There are two different strategies you can use on GAE - concurrency and deferred approaches. Both have a few flavours.
Concurrency
There are two basic flavours of this, relying on async APIs or creating background threads.
Most of the GAE platform APIs are asynchronous (or can be), and you can invoke several of them at once and then block until they've all resolved. In this case, you could make 5 asynchronous calls to BigQuery using the UrlFetchService, as sketched below.
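A rough sketch of the fan-out (in Python for consistency with the rest of this page; the Java SDK's URLFetchService has equivalent async calls). The insert_urls list of (url, payload) pairs and the authorization details are placeholders:

from google.appengine.api import urlfetch

rpcs = []
for url, payload in insert_urls:
    rpc = urlfetch.create_rpc(deadline=30)
    urlfetch.make_fetch_call(rpc, url, payload=payload,
                             method=urlfetch.POST)
    rpcs.append(rpc)

# Block until all five inserts have resolved.
for rpc in rpcs:
    result = rpc.get_result()
    if result.status_code != 200:
        raise RuntimeError('insert failed: %d' % result.status_code)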
GAE also allows the creation of background threads for the duration of a request. All threads must complete before the result is returned to the client. This is generally the least idiomatic approach for GAE.
Deferred processing
GAE offers two flavours of task queue, push and pull.
Push queues are basically queued tasks executed by a handler at a specified URL, at a rate you control. They can participate in transactions, have retry rules, etc. They can be used to ensure a workload is executed, but independently of the initiating request. This is the most idiomatic solution for the general problem of 'background work' on GAE.
Pull queues are queues that wait for an initiating request to slurp some data out for processing, usually in bulk. They're typically triggered by cron jobs.
In your case, your best bet is to use async HTTP requests, unless you're using an SDK/API wrapper that doesn't expose this. If not, look to task queues. Almost any app you build will end up using them anyway, and they're very graceful and simple to comprehend.
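For reference, enqueuing a push task is a one-liner; the /tasks/insert-bq handler and the 'bq-inserts' queue here are hypothetical:

from google.appengine.api import taskqueue

# A handler mapped to this URL performs one BigQuery insert,
# independently of the request that enqueued it.
taskqueue.add(url='/tasks/insert-bq',
              params={'table': 'table1'},
              queue_name='bq-inserts')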

Best practices to limit the number of calls to Mirror API

I, like everyone else I imagine, have a courtesy limit of 1000 Mirror API calls per day.
I see there's a batching facility that looks promising, but it appears to be able to batch only requests for a single credential. So even one customer pushing to the API every 60 seconds will be 1440 requests/day. Ideally, 30 seconds is where I'd like to be, and 2880 requests/day would be multiplied by the number of customers. It will get really big really fast.
I might be missing something, but I don't see a way around that.
If it were available I could glom all updates across all clients in the 30 second period into one giant message...
Is there a better design pattern to keep cards up-to-date with telemetry that's changing in real-time?
You can send requests to multiple users with a single batch request: instead of setting the Authorization header in the batch request, simply set the Authorization header in each sub-request.
Our Python and Java Quick Start projects have an example of using a batch request to send an update to up to 10 users. This is also mentioned in the Building Glass Services with the Google Mirror API I/O session.
Otherwise, you can check the protocol documentation in our reference guide.
As Scarygami mentioned, each sub-request will still consume quota, so the only optimization is to save on bandwidth and HTTP requests, especially if using gzip encoding.
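A rough sketch of a multi-user batch with the Python client library; the user_credentials mapping and card_body are placeholders, and the key point is that each sub-request carries its own user's authorization while the batch request itself carries none:

import logging
import httplib2
from apiclient.discovery import build
from apiclient.http import BatchHttpRequest

def on_response(request_id, response, exception):
    if exception is not None:
        logging.error('Insert for %s failed: %s', request_id, exception)

batch = BatchHttpRequest(callback=on_response)
for user_id, credentials in user_credentials.items():
    # Build a per-user service so this sub-request is authorized
    # as that user.
    authorized_http = credentials.authorize(httplib2.Http())
    service = build('mirror', 'v1', http=authorized_http)
    batch.add(service.timeline().insert(body=card_body),
              request_id=user_id)
batch.execute(http=httplib2.Http())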

Schedule a GET request with GAE

I've been researching ways to schedule a GET request with a GAE application. Specifically, I want my application to respond 1 hour after it's been requested by fetching a different URL that points to another app's API.
Are Deferred Tasks the way to handle this?
I also found that Tasks have an "eta" argument that specifies the earliest time of execution. Could this be preferred over "_countdown"?
Or should I investigate cron jobs? These GET requests won't be happening regularly, so I don't know if cron jobs are appropriate.
Thanks! Let me know if I should clarify anything.
Yes, that's a good way to do it. All you have to do is set _countdown in your deferred call, which is the number of seconds to wait before the task is executed.
Example from the docs:
deferred.defer(do_something_expensive, _countdown=3600, _queue="myqueue")
Or you can simply use the Task API, where you can set all the different parameters for when and how exactly you want the task to be executed. You can use either eta or countdown, whichever suits you best; from GAE's perspective they are exactly the same.
As long as you don't need to-the-second accuracy (say, minute accuracy is enough), I would add the request to the datastore, implementing a queue of requests. Then have a cron job run every minute looking for requests scheduled for that time period, and submit a task to perform each request. Name the task so you are unlikely to have the same task re-submitted. The task can retry a couple of times (if it errors), and then you can mark the request as completed in your queue.
This way you can handle any number of scheduled requests, you don't end up with thousands of tasks, and you can know if requests will run, when they ran, etc.
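A sketch of that cron handler; ScheduledRequest is a hypothetical model and /tasks/fetch a hypothetical worker URL:

import datetime
from google.appengine.api import taskqueue
from google.appengine.ext import db

class ScheduledRequest(db.Model):
    url = db.StringProperty()
    eta = db.DateTimeProperty()
    completed = db.BooleanProperty(default=False)

def check_due_requests():
    now = datetime.datetime.utcnow()
    due = (ScheduledRequest.all()
           .filter('completed =', False)
           .filter('eta <=', now))
    for req in due:
        try:
            # Naming the task makes a re-submission from an
            # overlapping cron run a no-op.
            taskqueue.add(name='req-%d' % req.key().id(),
                          url='/tasks/fetch',
                          params={'url': req.url})
        except (taskqueue.TaskAlreadyExistsError,
                taskqueue.TombstonedTaskError):
            pass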

Tasks, Cron jobs or Backends for an app

I'm trying to construct a non-trivial GAE app, and I'm not sure whether a cron job, tasks, backends, or a mix of all three is what I need to use, given the request time-out limit that GAE has for HTTP requests.
The distinct steps I need to do are:
1) I have upwards of 15,000 sites I need to pull data from on a regular schedule and without any user interaction. The total number of sites isn't going to be static, but they're all saved in the datastore [Table0] alongside the interval at which they're read. The interval may vary from every day to every 30 days.
2) For each site from step #1 that fits the "pull" schedule criteria, I need to fetch data from it via HTTP GET (again, it might be all of them or as few as 2 or 3 sites). Once I get the response back from the site, I parse the result and save this data into the datastore as [Table1].
3) For all of the data that was recently put into the datastore in [Table1] (it'll have a special flag), I need to issue an additional HTTP request to a 3rd-party site to do some additional processing. As soon as I receive data from this site, I store all of the relevant info into another table [Table2] in the datastore.
4) As soon as data is available and ready from step #3, I need to take all of it and perform some additional transformation and update the original table [Table1] in the datastore.
I'm not certain which of the different components I need to use to ensure that I can complete each piece of the work without exceeding the response deadline that's placed on the web requests of GAE. For requests initiated by cron jobs and tasks, I believe you're allowed 10 minutes to complete them, whereas typical user-driven requests are allowed 30 seconds.
Task queues are the best way to do this in general, but you might want to check out the App Engine Pipeline API, which is designed for exactly the sort of workflow you're talking about.
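A rough sketch with the appengine-pipeline library, assuming it's vendored into your app; the class bodies and urls list are placeholders:

import pipeline

class FetchSite(pipeline.Pipeline):
    def run(self, site_url):
        # Steps 2-4 for one site: fetch, parse, store, post-process.
        pass

class ProcessAllSites(pipeline.Pipeline):
    def run(self, site_urls):
        # Each yielded child pipeline runs in parallel with the others.
        for url in site_urls:
            yield FetchSite(url)

job = ProcessAllSites(urls)
job.start()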
GAE is a tough platform for your use-case. But, out of extreme masochism, I am attempting something similar. So here are my two cents, based on my experience so far:
Backends -- Use them for any long-running, I/O intensive tasks you may have (Web-Crawling is a good example, assuming you can defer compute-intensive processing for later).
Mapreduce API -- excellent for compute-intensive/parallel jobs such as stats collection, indexing etc. For a long time this library only had a mapper implementation, but Google recently released an in-memory shuffler that is good for jobs that fit in about 100MB.
Task Queues -- For when everything else fails :-).
Cron -- mostly to kick off periodic tasks -- which context you execute them in, is up to you.
It might be a good idea to design your backend tasks so that they can be scheduled (manually, or perhaps by querying your current quota usage) in the "Frontend" context using task queues, if you have spare Frontend CPU cycles.
I abandoned GAE before Backends came out, so can't comment on that. But, what I did a few times was:
Cron scheduled to kick off process
Cron handler invokes a task URL
the task grabs the first item (URL) from the datastore, executes the HTTP request, operates on the data, marks the URL record as having been worked on, and then invokes the task URL again.
So cron is basically waking up the task queue periodically, and the task queue runs recursively until it reaches some stopping point. A sketch of the pattern:
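(SiteRecord, process, and the /tasks/run-next-item URL below are placeholders for your own model, logic, and handler mapping.)

from google.appengine.api import taskqueue, urlfetch
from google.appengine.ext import db

class SiteRecord(db.Model):
    url = db.StringProperty()
    processed = db.BooleanProperty(default=False)

def run_next_item():
    record = SiteRecord.all().filter('processed =', False).get()
    if record is None:
        return  # stopping point: nothing left to work on
    result = urlfetch.fetch(record.url)
    process(result.content)  # placeholder for the real work
    record.processed = True
    record.put()
    # Recurse: enqueue the same task URL to handle the next item.
    taskqueue.add(url='/tasks/run-next-item')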
You can see it in action in one of my public GAE apps: https://github.com/mavenn/watchbots-gae-python.

Async urlfetch on App Engine

My app needs to do many datastore operations on each request. I'd like to run them in parallel to get better response times.
For datastore updates I'm doing batch puts, so they all happen asynchronously, which saves many milliseconds. App Engine allows up to 500 entities to be updated in parallel.
But I haven't found a built-in function that allows datastore fetches of different kinds to execute in parallel.
Since App Engine does allow urlfetch calls to run asynchronously, I created a getter URL for each kind which returns the query results as JSON-formatted text. Now my app can do async urlfetch calls to these URLs which could parallelize the datastore fetches.
This technique works well with small numbers of parallel requests, but App Engine throws errors when I attempt to run more than 5 or 10 of these urlfetch calls at the same time.
I'm only testing now, so each urlfetch is the identical query; since they work fine in small volumes but start failing with more than a handful of simultaneous requests, I'm thinking it must have something to do with the async urlfetch calls.
My questions are:
Is there a limit to the number of urlfetch.create_rpc() calls that can run asynchronously?
The synchronous urlfetch.fetch() function has a 'deadline' parameter that allows the function to wait up to 10 seconds for a response before failing. Is there any way to tell urlfetch.create_rpc() how long to wait for a response?
What do the errors shown below mean?
Is there a better server-side technique to run datastore fetches of different kinds in parallel?
File "/base/python_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 501, in get_result
return self.__get_result_hook(self)
File "/base/python_lib/versions/1/google/appengine/api/urlfetch.py", line 331, in _get_fetch_result
raise DownloadError(str(err))
InterruptedError: ('The Wait() request was interrupted by an exception from another callback:', DownloadError('ApplicationError: 5 ',))
Since App Engine allows async urlfetch calls but does not allow async datastore gets, I was trying to use urlfetch RPCs to retrieve from the datastore in parallel.
The lack of async datastore gets is an acknowledged issue:
http://code.google.com/p/googleappengine/issues/detail?id=1889
And there's now a third-party tool that allows async queries:
http://code.google.com/p/asynctools/
"asynctools is a library allowing you to execute Google App Engine API calls in parallel. API calls can be mixed together and queued up and then all are kicked off in parallel."
This is exactly what I was looking for.
While I'm afraid I can't directly answer any of the questions that you pose, I think I ought to tell you that all of your research along these lines may not lead you to a working solution for your problem.
The problem is that datastore writes take much longer than reads, so if you find a way to max out the number of reads that can happen, your code will very likely run out of time long before it is able to make the corresponding writes to all of the entities that you have read.
I would seriously consider rethinking the design of your datastore classes to reduce the number of reads and writes that need to happen, as this will quickly become a bottleneck for your application.
Have you considered using TaskQueues to do the work of queuing the requests to be executed later?
If the task returns a 4xx status it will be considered failed and will be retried later, so you could pass the error back up and have the task queue handle retrying the requests until they succeed. Also, with some experimentation with bucket sizes and rates, you can probably have the task queue slow down the requests enough that you don't max out the database.
There's also a nice wrapper (deferred.defer) which makes things even simpler - you can make a deferred call to (almost) any function in your app.
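A minimal sketch of that combination; fetch_and_store, the kind names, and the 'fetch-queue' queue (which you'd configure with a low rate in queue.yaml) are all placeholders:

from google.appengine.ext import deferred

def fetch_and_store(kind_name):
    # Run one of the per-kind fetches described above. Raising any
    # exception here makes the task queue retry the call later, at
    # the rate and bucket size configured for the queue.
    pass

for kind in ['Article', 'Comment', 'Author']:
    deferred.defer(fetch_and_store, kind, _queue='fetch-queue')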
