Passing (large binary) data to appengine task - google-app-engine

I have a task endpoint that needs to process data (say, a file larger than 1MB) uploaded from a frontend request. However, I don't think I can pass the data from the frontend request via TaskOptions.Builder, as I will get the "Task size too large" error.
I need some kind of "temporary" data store for the uploaded data, that can be deleted once the task has successfully processed it.
Option A: Store the uploaded data in memcache and pass the key to the task. This is likely to work most of the time, except when the data is evicted BEFORE the task is processed. If that can be resolved, it sounds like a great solution.
Option B: Store the data in datastore (an Entity created just for this purpose). Pass the id to the task. The task is responsible for deleting the entity when it is done.
Option C: Use the Blobstore service. This, IMHO, is similar in concept to Option B.
At the moment, I'm thinking Option B is the most feasible way.
I'd appreciate any advice on the best way to handle this situation.

If you are storing data larger than 1MB, you must use the blobstore. (Yes, you can segment the data in the datastore, but it's not worth the work.) There are two things to look out for, however. Make sure that you write the data to the blobstore in chunks of less than 1MB. Also, since tasks can be retried, they should be idempotent: a task should not fail if the requested blobstore key no longer exists, because a previous run may already have deleted it.
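To illustrate the write side, here is a rough sketch using the (since-deprecated) Files API, in Python for brevity even though the question is about Java; the handler path, chunk size, and task URL are made-up assumptions:
from google.appengine.api import files, taskqueue
from google.appengine.ext import webapp
CHUNK = 900 * 1024  # stay safely under the ~1MB-per-call write limit
class UploadHandler(webapp.RequestHandler):
    def post(self):
        data = self.request.body  # raw upload from the frontend
        # Create a blobstore-backed file and write it in small chunks.
        file_name = files.blobstore.create(mime_type='application/octet-stream')
        with files.open(file_name, 'a') as f:
            for i in range(0, len(data), CHUNK):
                f.write(data[i:i + CHUNK])
        files.finalize(file_name)
        blob_key = files.blobstore.get_blob_key(file_name)
        # Pass only the key to the task, never the data itself.
        taskqueue.add(url='/tasks/process-upload',
                      params={'blob_key': str(blob_key)})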

Option B doesn't work either, because the maximum entity size is also 1MB, the same limit as for a task.
Option A can't give any guarantee: values can be evicted from memcache at any time, possibly before the expiration deadline you set for the value.
IMHO, Option C is the best fit for your needs.
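To sketch the worker side of Option C (again in Python, with hypothetical names), written so that a retried task does not fail when an earlier run has already deleted the blob:
from google.appengine.ext import blobstore, webapp
class ProcessUploadHandler(webapp.RequestHandler):
    def post(self):
        blob_key = blobstore.BlobKey(self.request.get('blob_key'))
        if blobstore.BlobInfo.get(blob_key) is None:
            # A previous run already processed and deleted this blob;
            # return 200 so the task queue does not keep retrying.
            return
        data = blobstore.BlobReader(blob_key).read()
        process(data)  # your actual processing goes here (hypothetical helper)
        blobstore.delete(blob_key)  # clean up once processing succeeds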

Related

What is a good way to send large data sets to a client through API requests?

A client's system will connect to our system via an API for a data pull. For now this data will be stored in a data mart, with roughly 50,000 records per request.
I would like to know the most efficient way of delivering the payload which originates in a SQL Azure database.
The API will be RESTful. After the request is received, I was thinking that the payload would be retrieved from the database, converted to JSON, and GZIP-encoded/transferred over HTTP back to the client.
I'm concerned about the processing this may take with many clients connected, each pulling a lot of data.
Would it be best to just return the straight results in clear text to the client?
Suggestions welcome.
-- UPDATE --
To clarify, this is not a web client that is connecting. The connection is made by another application to receive a one-time, daily data dump, so no pagination.
The data consists primarily of text with one binary field.
First of all: do not optimize prematurely! That means: don't sacrifice the simplicity and maintainability of your code for a gain you don't even know you need.
Let's see. 50,000 records does not really say anything without specifying the size of a record. I would advise you to start from a basic implementation and optimize when needed. So try this:
Implement a simple JSON response with those 50,000 records and try to call it from the consumer app. Measure the size of the data and the response time, and evaluate carefully whether this is really a problem for a once-a-day operation.
If yes, turn on compression for that JSON response. This is usually a HUGE change with JSON because of all the repetitive text. One tip here: set the content type header to "application/javascript", since Azure has dynamic compression enabled by default for this content type. Again, try it and evaluate whether the size of the data or the response time is still a problem.
If it is still a problem, maybe it is time for some serialization optimization after all, but I would strongly recommend something standard and proven here (no custom CSV mess), for example Google Protocol Buffers: https://code.google.com/p/protobuf-net/
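As a rough illustration of the compression step (in Python purely to show the ratio; the record layout and the 50,000 count are invented to mimic the scenario):
import gzip
import json
import StringIO
# Fake a repetitive payload of 50,000 small records (invented fields).
records = [{'id': i, 'name': 'customer %d' % i, 'status': 'active'}
           for i in range(50000)]
payload = json.dumps(records)
buf = StringIO.StringIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(payload)
print 'raw bytes: %d' % len(payload)             # repetitive JSON text...
print 'gzipped bytes: %d' % len(buf.getvalue())  # ...shrinks dramatically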
This is a bit long for a comment, so ...
The best method may well be one of those "it depends" answers.
Is just the database on Azure, or is your entire hosting on Azure? I've never done any production work on Azure myself.
What are you trying to optimize for: total round-trip response time, total server CPU time, or perhaps something else?
For example, if your database server is on Azure but your web server is local, perhaps you can simply optimize the database request and depend on scaling via multiple web servers if needed.
If the data changes with each request, you should never compress it if you are trying to optimize server CPU load, but you should compress it if you are trying to optimize bandwidth usage; either can be your bottleneck / expensive resource.
For 50K records, even JSON might be a bit verbose. If your data is a single table, you might see significant savings by using something like CSV (including the first row as a record header, for a sanity check if nothing else). If your result comes from joining multiple tables, i.e., it is hierarchical, using JSON is recommended simply to avoid the complexity of rolling your own hierarchical representation.
Are you using SSL on your web server? If so, SSL could be your bottleneck (unless this is handled by other hardware).
What is the nature of the data you are sending? Is it mostly text, numbers, images? Text usually compresses well, numbers less so, and images poorly (usually). Since you suggest JSON, I would expect that you have little if any binary data, though.
If you compress JSON, it can be a very efficient format, since the repeated field names mostly compress out of your result. XML likewise (but less so, since the tags come in pairs).
ADDED
If you know what the client will be fetching beforehand and can prepare the packet data in advance, by all means do so (unless storing the prepared data is an issue). You could run this at off-peak hours, create it as a static .gz file, and let IIS serve it directly when needed. Your API could simply be in two parts: 1) retrieve a list of static .gz files available to the client; 2) confirm processing of said files so you can delete them.
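A loose sketch of that pre-generation idea (Python only for illustration; the export directory, file naming, and the helper functions are all invented):
import gzip
import json
import os
EXPORT_DIR = '/var/exports'  # a directory your web server serves statically (hypothetical)
def build_daily_dump(client_id, rows):
    # Run off-peak: rows is the day's data for this client, already fetched.
    path = os.path.join(EXPORT_DIR, '%s-daily.json.gz' % client_id)
    with gzip.open(path, 'wb') as gz:
        gz.write(json.dumps(rows))
    return path
def list_available_dumps(client_id):
    # API part 1: tell the client which files are ready for download.
    return [name for name in os.listdir(EXPORT_DIR)
            if name.startswith(str(client_id))]
def confirm_processed(name):
    # API part 2: the client confirms processing, so the file can be deleted.
    os.remove(os.path.join(EXPORT_DIR, name))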
Presumably you know that JSON & XML are not as fragile as CSV, i.e., adding or deleting fields in your API is usually simple. So, if you can compress the files, you should definitely use JSON or XML; XML is easier for some clients to parse, and to be honest, if you use Json.NET or similar tools you can generate either one from the same set of definitions and information, so it is nice to be flexible. Personally, I like Json.NET quite a lot: simple and fast.
Normally what happens with such large requests is pagination, so the JSON response includes a URL to request the next lot of information.
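For example (field names and URL shape invented), each page can simply carry a link to the next one:
import json
def page_response(records, offset, limit):
    # Return one page of results plus a pointer to the next page, if any.
    page = {'items': records[offset:offset + limit]}
    if offset + limit < len(records):
        page['next'] = '/api/records?offset=%d&limit=%d' % (offset + limit, limit)
    return json.dumps(page)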
Now the next question is: what is your client? E.g. a browser or a behind-the-scenes application?
If it is a browser there are limitations as shown here:
http://www.ziggytech.net/technology/web-development/how-big-is-too-big-for-json/
If it is an application, then your current approach of 50,000 records in a single JSON call would be acceptable; the only thing you need to watch is the load on the DB pulling the records, especially if you have many clients.
If you are willing to use a third-party library, you can try Heavy-HTTP, which solves this problem out of the box. (I'm the author of the library.)

How to handle caching and data-store sync in GAE

I am writing a small appengine application and I want to start using Datastore.
My app has some users, and each user is a complicated Java class.
Users might swap some "point" objects between them, so I need the data to be available and fast.
My question is quite generic:
How should I handle the data caching?
Storing the data entirely in the Datastore and fetching it in every call sounds slow in runtime.
On the other hand, holding the data in a static Java class sounds tricky, because every now and then the server resets and the data is erased.
If I had a main loop, like in a regular console application, I would probably have saved the data two or three times a day at predefined hours.
How should I manage my code so that it saves the state to the datastore every now and then, and in this way does not lose any data?
If the number of write operations is smaller than the number of read operations, it is good to try "write-through caching" with the memcache module.
https://developers.google.com/appengine/docs/java/memcache/
Basically, write-through caching works like this:
For every write operation, the program updates the datastore, then backs up the data to memcache.
For every read operation, the program first tries to read from memcache. If it finds the required data, it returns it; otherwise it fetches the data from the datastore and backs up the result to memcache.
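A minimal Python sketch of that flow (the question is in Java, but the pattern is the same; the UserRecord model and key scheme are invented):
from google.appengine.api import memcache
from google.appengine.ext import db
class UserRecord(db.Model):
    points = db.IntegerProperty(default=0)
def save_points(user_id, points):
    # Write-through: update the datastore first, then back the value up to memcache.
    UserRecord(key_name=user_id, points=points).put()
    memcache.set('points:' + user_id, points)
def load_points(user_id):
    # Read from memcache when possible, fall back to the datastore on a miss.
    points = memcache.get('points:' + user_id)
    if points is None:
        record = UserRecord.get_by_key_name(user_id)
        if record is not None:
            points = record.points
            memcache.set('points:' + user_id, points)
    return points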
As #lucemia noted, you should use memcache.
If you use objectify, you can set up caching declaratively, without writing any cache-handling code.

JDO Batch PUT from Memcache

In an effort to reduce the number of datastore PUTs I am consuming, I wish to use the memcache much more frequently. The idea is to store entities in the memcache for n minutes before writing all entities to the datastore and clearing the cache.
I have two questions:
Is there a way to batch PUT every entity in the cache? I believe makePersistentAll() is not really batch saving but rather saving each individually, is it not?
Secondly, is there a "callback" function you can place on entities as you place them in the memcache? I.e., if I add an entity to the cache (with a 2-minute expiration delta), can I tell App Engine to save the entity to the datastore when it is evicted?
Thanks!
makePersistentAll does indeed do a batch PUT, which the log ought to tell you clearly enough.
There's no way to fetch the entire contents of memcache in App Engine. In any case, this is a bad idea - items can be evicted from memcache at any time, including between when you inserted them and when you tried to write the data to the datastore. Instead, use memcache as a read cache, and write data to the datastore immediately (using batch operations when possible).
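For example, a batch write on the Python side looks like this (makePersistentAll is the JDO equivalent; the Score model is invented):
from google.appengine.ext import db
class Score(db.Model):
    value = db.IntegerProperty()
def save_scores(values):
    # One batch PUT (a single RPC) instead of one put() per entity.
    db.put([Score(value=v) for v in values])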

Best Practice for temporary query result storage in Google App Engine

I have a query that takes a bit of time (it involves some computations and accessing a third-party). Currently, the user makes an HTTP request to initiate the query (it returns immediately). GAE puts the task in a queue and executes it. After execution, the task stores the results in a static object. The user makes another HTTP request later to retrieve the results.
Is there a best practice way to implement something like this? Would the results be better stored in the DataStore?
the task stores the results in a static object
How are you making sure that the subsequent request from same user hits the same instance so that it can access the static object?
A better way to do it would certainly be to store it in memcache (prone to misses) and/or the datastore. Keep in mind that, with the new pricing model, datastore operations are going to cost more.
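One way to sketch that (Python; the JobResult model and job_id scheme are invented): the task stores the result under a job id, and the follow-up request checks memcache first, then the datastore, so it works on any instance:
from google.appengine.api import memcache
from google.appengine.ext import db
class JobResult(db.Model):
    payload = db.TextProperty()
def store_result(job_id, payload):
    # Called at the end of the task: a durable copy plus a fast cached copy.
    JobResult(key_name=job_id, payload=payload).put()
    memcache.set('result:' + job_id, payload, time=3600)
def fetch_result(job_id):
    # Called by the user's later request, on whatever instance serves it.
    payload = memcache.get('result:' + job_id)
    if payload is None:
        entity = JobResult.get_by_key_name(job_id)
        payload = entity.payload if entity else None
    return payload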

When to use a certain type of persistence in Google App Engine?

First of all I'll explain the question. By persistence, I mean storing data beyond the execution of a single request. It might not be the best question title, so feel free to edit it.
The way I see it, there are three types of persistence in GAE, each one "closer" to the request itself:
The datastore
This is where all data is most likely to be based. It may go into the higher layers of persistence temporarily, but in the end, this is where the data really is. Unfortunately, querying the datastore repeatedly is slow and uses a lot of resources.
Use when...
storing data that should be stored for an indefinite amount of time.
Avoid using when...
getting data that is queried often but rarely updated.
memcache
This is a highly complex caching engine that stores the data in memory and makes sure all users read from/write to the same cache. It's a much faster way to get/set data on a key→value basis than using the datastore. Unfortunately, data can only stay in the memory for so long, and there is no guarantee that it will stay for as long as you tell it to; the data may disappear at any time if memory is needed elsewhere.
Use when...
you need to get data more often than you need to update it. Even when data needs to be updated often, it can have its uses (if a few missed updates are considered okay), by setting up a task queue to persist data from the memcache to the datastore.
Avoid using when...
data needs to be updated often and has to be up-to-date when fetched.
Global variables
This isn't an official method of persisting data, but it works. However, it's the least reliable method, and since it has no data synchronization across servers, persisted data may show up differently for different users (but from what I've found, the server rarely changes for the same user.) Theoretically, this should be the method that has the least overhead in getting/setting values, however, and could have its uses.
Use when...
hell freezes over? I don't know... I haven't enough knowledge about what goes on behind the scenes to actually rely on this method. Discuss!
Avoid using when...
you rely on the data being the same across servers.
Cookies
If the data is user-specific, it can be efficient to store it as a cookie in the user's browser. There are some pitfalls to watch out for though:
Security – the user can meddle with cookies, and malicious people could potentially do the same. To make sure that the contents are unreadable and unchangeable to all, the cookie can be encrypted using the PyCrypto library which is available on GAE.
Performance – since cookies are sent with every request (even images), it can add to the bandwidth being used, and slow down requests. One solution is to use another domain for static content, so the browser won't send the cookie for that content.
When should the different types of persistence be used? How can they be combined to reduce/even out the amount of resources being spent?
Datastore
Use the datastore to hold any long living information. The datastore should be used like you would use a normal database to hold data that will be used in your site/application.
MemCache
Use this to access data a lot quicker than trying to access the datastore. MemCache can return data really quickly and can be used for any data that needs to span multiple calls from users. It is normally data that was originally in the datastore and then moved to the memcache.
from google.appengine.api import memcache
def get_data():
    # Try memcache first.
    data = memcache.get("key")
    if data is not None:
        return data
    # Cache miss: get the data from the datastore and cache it for 60 seconds.
    data = query_for_data()  # your datastore query (placeholder)
    memcache.add("key", data, 60)
    return data
Memcache will drop the item when it goes out of date; you set the expiration in the last parameter of the add() call shown above.
Global Variables
I wouldn't use these at all, since they can't span instances. In GAE a request can be served by a fresh instance (at least in Python it can). If you want to use global variables, I would store the needed data in memcache instead.
Your post is a good summary of the 3 major options. You mostly have answered the question already. However, if you are currently building an app and stressing over whether or not you should memcache something, try this:
Write your app using the datastore for everything that needs to outlive more than one request.
Once your app (or some usable subset) is working, run some functional tests or simulations to see where the slow spots (or high quota usage) are.
Find the slowest or most inefficient request path, and figure out how to make it faster: either by using memcache, by altering your data structures so you can do gets instead of queries (see the sketch below), or possibly by storing something in a global instance variable*
goto 2 until you're satisfied.
*Things that might be good for a "global" variable would be something that is relatively expensive to create/fetch, that a substantial portion of your requests will use, and that does not need to be consistent across requests/users.
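To illustrate the "gets instead of queries" point from step 3 (the Profile model and key scheme are invented): if you can derive a key name from something you already know, a get is cheaper and faster than a query:
from google.appengine.ext import db
class Profile(db.Model):
    email = db.StringProperty()
def lookup_by_query(email):
    # Query: needs an index scan to locate the entity.
    return Profile.all().filter('email =', email).get()
def lookup_by_key(email):
    # Get: if profiles are stored with key_name=email, fetch directly by key.
    return Profile.get_by_key_name(email)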
I use a global variable to speed up JSON conversion. Before I convert my data structure to JSON, I hash it and check whether the JSON is already available. For my app this gives quite a speedup, as the pure Python implementation is quite slow.
Global variables
To complement AutomatedTester's answer, and also to reply to his further question about how to share information between GETs without memcache or the datastore, below is a quick illustration of how to use global variables:
if 'i' not in globals():
    i = 0
def main():
    global i
    i += 1
    print 'Status: 200'
    print 'Content-type: text/plain\n'
    print i
if __name__ == '__main__':
    main()
Calling this script multiple times will give you 1, 2, 3... Of course, as mentioned earlier by Blixt, you should not count on this trick too much ('i' can switch back to zero at any time), but it can be useful for storing user-specific information in a dictionary, session data for instance.
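For instance, a module-level dictionary used as a best-effort, per-instance cache (names invented; the whole point is that it may vanish whenever the instance is recycled):
# Best-effort per-instance cache; treat it purely as an optimization, never as storage.
if 'user_sessions' not in globals():
    user_sessions = {}
def remember(user_id, data):
    user_sessions[user_id] = data
def recall(user_id):
    return user_sessions.get(user_id)  # may legitimately be None after a restart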
