appengine toy program hitting quota limits for reads and writes - google-app-engine

Let's say I wanted to make an appengine application that stores a 50,000 word dictionary and also equivalent dictionaries in 10 other languages of similar size.
I got this working locally on my dev server, but when I went to load the first dictionary into the real app server I immediately went over my writes-per-day quota. I had no idea how many dictionary entries actually made it into the datastore. So, 24 hours later, I tried to bulk-download the dictionary to see how many entries I had, but doing that I hit the reads-per-day quota and got nothing back for my trouble. I tried enabling billing with a daily maximum of $1.00, hit that quota through the bulkloader too, and still got no data for my trouble or my $1.00.
Anyway, so I looked at my datastore viewer and it shows that each of my dictionary words required 8 writes to the datastore.
So, does this mean that this sort of application is inappropriate for appengine? Should I not be trying to store a dictionary in there? Is there a smarter way to do this? For instance, could I store the dictionary as a file in the blobstore somehow and then work with it programmatically from there?
Thank you for any suggestions

It's likely that you'll be reading much less than writing, so the problem is getting the data in, not reading it back.
So all you need to do with your current configuration is slow down the write rate. Then, presumably, you'll be getting each word by its ID (the word itself, I hope!), so reads will be fast and small, exactly as you want.
You could do this: chop your source data into one file per letter. Upload those files with your application and create a task that reads each file in turn and slowly writes that data to the datastore. When a task completes, its last action is to kick off the next task.
It might take a week to complete, but once it's there it'll be a lot more handy than an opaque blob you have to fetch from the blobstore, reading N words for every single one you are actually interested in, and then unpacking and processing for every lookup.
You can also use the bulkloader to upload data, not just download it!
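A rough sketch of that chained-task idea in Python (the Word model, load_letter, the words/ directory, the tab-separated file format and the batch size are all assumptions, and the deferred library has to be enabled in app.yaml):

    import string
    from google.appengine.ext import db
    from google.appengine.ext import deferred   # needs "builtins: - deferred: on" in app.yaml

    class Word(db.Model):
        # key_name is the word itself, so lookups are cheap gets by key.
        definition = db.TextProperty()

    def load_letter(letter_index=0):
        letters = string.ascii_lowercase
        if letter_index >= len(letters):
            return  # finished the whole dictionary
        batch = []
        for line in open('words/%s.txt' % letters[letter_index]):
            word, _, definition = line.strip().partition('\t')
            batch.append(Word(key_name=word, definition=definition))
            if len(batch) >= 100:
                db.put(batch)
                batch = []
        if batch:
            db.put(batch)
        # Last action: chain the next letter, with a delay to spread out the writes.
        deferred.defer(load_letter, letter_index + 1, _countdown=60)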

Related

Google appengine, least expensive way to run heavy datastore write cron job?

I have a Google appengine application, written in Go, that has a cron process which runs once a day at 3am. This process looks at all of the changes that have happened to my data during the day and stores some metadata about what happened. My users can run reports on this metadata to see trends over several months. The process does around 10-20 million datastore writes every night. It all works just fine, but since I started running it I have noticed a significant increase in my monthly bill from Google (from around $50/month to around $400/month).
I have just set up a very basic task queue that this runs in, and I have not changed the default settings at all. Is there a better way I could be running this process at night that could save me money? I have never messed around with backends (which are now deprecated) or the modules API, and I know they've changed a lot of this stuff recently, so I'm not sure where to start looking. Any advice would be greatly appreciated.
Look at your instances at 3am. It might be that GAE spins up a lot of them to handle the job. You could configure your job to run with less parallelism so it takes longer but perhaps needs only one instance.
However, if your database writes are indeed the biggest factor this won't make a big impact.
You can try looking at your data models and indexes. Remember that each indexed field costs 2 writes extra, so see if you can remove indexes from some fields if you don't need them.
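The question's app is in Go, where the equivalent is the datastore:",noindex" struct tag; in the Python db API the same idea looks roughly like this (hypothetical model, keeping indexed only the field you actually query on):

    from google.appengine.ext import db

    class ChangeRecord(db.Model):                   # hypothetical model
        day = db.DateProperty()                     # queried on, so keep it indexed
        summary = db.StringProperty(indexed=False)  # skips the 2 extra index writes
        payload = db.TextProperty()                 # Text/Blob properties are never indexed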
One improvement you can make is to batch your write operations; you can use memcache for this (pay for the dedicated kind, since it's more reliable). Write the updates to memcache and, once you've accumulated about 900 KB, flush them to the datastore. This reduces the number of datastore writes a lot, especially if your metadata entries are small.

Google app engine - Sudden Increase of Datastore Read Operations

I'm maintaining a blog app (blog.wokanxing.info, it's in Chinese) for myself, built on Google app engine. It's been two or three years since the first deployment, and I've never hit a quota issue because of its simplicity and small visit count.
However, since early last month, I've noticed that from time to time the app reports a 500 server error, and the admin panel shows a mysteriously fast consumption of the free datastore read operations quota. Within a single hour about 10% of the free read quota (~5k ops) is consumed, but I count only a dozen requests that involve datastore reads, 30 tops, which would mean an average of 150 to 200 read ops per request, which sounds impossible to me.
I haven't committed any change to my codebase for months, and I'm not seeing any change in the datastore or quota policy either. It also confuses me how such consumption is even possible: I use memcache a lot, which leaves the front page as the biggest consumer, and it fetches the first posts using Post.all.order('-date').fetch(10, offset). Other requests merely fetch a single model using Post.get_by_key_name and iterate over post.comment_set.
Sorry for my poor English, but can anyone give me some clues? Thanks.
From the Admin console, check your logs.
Do not check for errors only; look at all types of messages in the log.
Look for requests made by robots/web crawlers. In most cases you can detect such "users" by the words "robot" or "bot" in the user agent (well, if they are honest...).
The first thing you can do is edit your robots.txt file. For more detail, read How to identify web-crawler?. GAE's help also covers serving a robots.txt file.
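A minimal robots.txt sketch (the bot name and the path are placeholders); on GAE it can simply be served as a static file:

    # Hypothetical example: block one misbehaving crawler entirely,
    # and keep well-behaved crawlers off the expensive dynamic pages.
    User-agent: BadBot
    Disallow: /

    User-agent: *
    Disallow: /comment/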
If that fails, try to identify the IP addresses used by the bot or bots. Using the GAE Admin console, put those addresses on the blacklist and check your quota consumption again.

Avoiding exceeding soft private memory limits when url-fetching large documents

I need to run regularly scheduled tasks that fetch relatively large XML documents (5 MB) and process them.
A problem I am currently experiencing is that I hit the memory limits of the application instance, and the instance is terminated while my task is running.
I did some rough measurements:
The task is usually scheduled onto an instance that already uses 40-50 MB of memory.
URL-fetching the 5 MB text file increases instance memory usage to 65-75 MB.
Decoding the fetched text into Unicode increases memory usage to 95-105 MB.
Passing the Unicode string to the lxml parser and accessing its root node increases instance memory usage to about 120-150 MB.
During actual processing of the document (converting XML nodes to datastore models, etc.) the instance is terminated.
I could avoid the third step and save some memory by passing the encoded text directly to the lxml parser, but specifying an encoding for the lxml parser has caused me some problems on GAE.
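For reference, lxml will parse the raw bytes directly and honour the encoding declared in the XML prolog, so the intermediate Unicode copy never has to exist; a minimal sketch (the URL is a placeholder):

    from google.appengine.api import urlfetch
    from lxml import etree

    FEED_URL = 'http://example.com/big-document.xml'   # placeholder URL
    result = urlfetch.fetch(FEED_URL, deadline=60)      # allow time for the 5 MB fetch
    root = etree.fromstring(result.content)             # bytes in, Element out; no decode step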
I could probably use the MapReduce library for this job, but is it really worthwhile for a 5 MB file?
Another option could be to split the task into several tasks.
Also, I could probably save the file to the blobstore and then process it by reading it line by line from the blobstore? As a side note, it would be convenient if the URLFetch service allowed reading the response "on demand", to simplify processing of large documents.
So generally speaking what is the most convenient way to perform such kind of work?
Thank you!
Is this on a frontend or a backend instance? It looks like a job for a backend instance to me.
Have you considered using different instance types?
frontend types
backend types
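If the backend route fits, a backend is declared in a backends.yaml next to app.yaml; a minimal sketch (the name is hypothetical, and backends have since been deprecated in favour of modules), where class B4 gives 512 MB of RAM versus a default frontend's 128 MB:

    backends:
    - name: xml-worker     # hypothetical backend name, targeted from the task queue
      class: B4            # 512 MB of RAM
      options: dynamic     # spin up on demand, spin down when idle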

Passing (large binary) data to appengine task

I have a task endpoint that needs to process data (say, a >1 MB file) uploaded from a frontend request. However, I do not think I can pass the data from the frontend request via TaskOptions.Builder, as I will get the "Task size too large" error.
I need some kind of "temporary" data store for the uploaded data, that can be deleted once the task has successfully processed it.
Option A: Store uploaded data in memcache, pass the key to the task. This is likely going to work most of the time, except when the data is evicted BEFORE the task is processed. If this can be resolved, sounds like a great solution.
Option B: Store the data in datastore (an Entity created just for this purpose). Pass the id to the task. The task is responsible for deleting the entity when it is done.
Option C: Use the Blobstore service. This, IMHO, is similar in concept to Option B.
At the moment, I'm thinking option B is the most feasible way.
I'd appreciate any advice on the best way to handle these situations.
If you are storing data larger than 1 MB, you must use the blobstore. (Yes, you can segment the data across datastore entities, but it's not worth the work.) There are two things to look out for, however. Make sure that you write the data to the blobstore in chunks of less than 1 MB. Also, since task queue tasks can be executed more than once, your task should not fail if the requested blobstore key no longer exists; a previous attempt may have deleted it already.
Option B doesn't work either, because the maximum entity size is also 1 MB, the same limit as for a task.
Option A can't give any guarantees: values can be evicted from memcache at any time, possibly before the expiration deadline you set for the value.
IMHO, Option C is the best fit for your needs.
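A sketch of Option C's write side (the question uses Java's TaskOptions.Builder, but the idea is shown here with the Python Files API, which is itself now deprecated; the function name and handler URL are made up):

    from google.appengine.api import files
    from google.appengine.api import taskqueue

    CHUNK = 900 * 1024   # stay safely under the 1 MB-per-call limit

    def stash_and_enqueue(data):
        """Write the uploaded bytes to the blobstore, then hand only the key to a task."""
        file_name = files.blobstore.create(mime_type='application/octet-stream')
        with files.open(file_name, 'a') as f:
            for i in range(0, len(data), CHUNK):
                f.write(data[i:i + CHUNK])
        files.finalize(file_name)
        blob_key = files.blobstore.get_blob_key(file_name)
        # The task deletes the blob once it has processed it successfully.
        taskqueue.add(url='/tasks/process', params={'key': str(blob_key)})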

What's the best way to send pictures to a browser when they have to be stored as blobs in a database?

I have an existing database containing some pictures in blob fields. For a web application I have to display them.
What's the best way to do that, considering stress on the server, maintenance, and coding effort?
I can think of the following:
"Cache" the blobs to external files and send the files to the browser.
Read them directly from the database every time they're requested.
Some additional facts:
I cannot change the database to get rid of the blobs altogether and only save file references in the database (like in the good ol' Access days), because the database is used by another application which actually requires the blobs.
The images change rarely, i.e. if an image is in the database it mostly stays that way forever.
There'll be many read accesses to the pictures; 10-100 pictures will be displayed per view (depending on the user's settings).
The pictures are relatively small, < 500 KB.
I would suggest a combination of your two ideas: the first time an item is requested, read it from the database, but afterwards make sure it is cached by something like Squid so you don't have to retrieve it from the database every time it's requested.
one important thing is to use proper HTTP cache control ... that is, setting expiration dates properly, responding to HEAD requests correctly (not all platforms/webservers allow that) ...
caching those blobs to the file system makes sense to me ... even more so if the DB is running on another server ... but even if not, I think a simple file access is a lot cheaper than piping the data through a local socket ... if you did cache the DB to the file system, you could most probably configure any webserver to do good cache control for you ... if that's not already the case, you should maybe request a field that indicates the last update of each image, to make your cache more efficient ...
greetz
back2dos
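To make the cache-control point concrete, here is a framework-agnostic Python sketch of the header logic (simplified: it compares If-Modified-Since by string equality and assumes the "last update" field suggested above exists):

    import email.utils
    import time

    ONE_WEEK = 7 * 24 * 3600

    def image_cache_headers(last_updated, if_modified_since=None):
        """Return (status, headers) for serving a blob image with sane caching."""
        last_mod = email.utils.formatdate(last_updated, usegmt=True)
        if if_modified_since == last_mod:
            return 304, {}   # the browser's copy is still current
        return 200, {
            'Last-Modified': last_mod,
            'Cache-Control': 'public, max-age=%d' % ONE_WEEK,
            'Expires': email.utils.formatdate(time.time() + ONE_WEEK, usegmt=True),
        }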
