I have a GAE Golang app that should be able to handle hundreds of concurrent requests, and for each requests, I do some work on the input and then store it in the datastore.
Using the task queue (appengine/delay lib) I am getting pretty good performance, but it still seems very inefficient to perform single-row inserts for each request (even though the inserts are deferred using task queue).
If this was not app engine, I would probably append the output a file, and every once in a while I would batch load the file into the DB using a cron job / some other kind of scheduled service.
So my questions are:
Is there an equivalent scheme I can implement on app engine? I was
thinking - perhaps I should write some of the rows to memecache, and
then every couple of seconds I will bulk load all of the rows from
there and purge the cache.
Is this really needed? Can the datastore
handle thousands of concurrent writes - a write per http request my
app is getting?
Depends really on your setup. Are you using ancestor queries? If so then your are limited to 1 write per second PER ancestor (and all children, grand children). The datastore has a natural queue so if you try and write too quickly it will queue it. It only becomes an issue if you are writing too many way too quickly. You can read some best practices here.
If you think you will be going over that limit use a pull queues with async multi puts. You would put each entity in the queue. With a backed module (10 minute timeouts) you can pull in the entries in batches (10-50-100...) and do a put_async on them in batches. It will handle putting them in at the proper speed. While its working you can queue up the next batch. Just be wary of the timeout.
Related
Good day,
I am running a back-end to an application as an app engine (Java).
Using endpoints, I receive requests. The problem is, there is something big I need to compute, but I need fast response times for the front end. So as a solution I want to precompute something, and store it a dedicated the memcache.
The way I did this, is by adding in a static block, and then running a deferred task on the default queue. Is there a better way to have something calculated on startup?
Now, this deferred task performs a large amount of datastore operations. Sometimes, they time out. So I created a system where it retries on a timeout until it succeeds. However, when I start up the app engine, it immediately creates two of the deferred task. It also keeps retrying the tasks when they fail, despite the fact that I set DeferredTaskContext.setDoNotRetry(true);.
Honestly, the deferred tasks feel very finicky.
I just want to run a method that takes >5 minutes (probably longer as the data set grows). I want to run this method on startup, and afterwards on a regular basis. How would you model this? My first thought was a cron job but they are limited in time. I would need a cron job that runs a deferred task, hope they don't pile up somehow or spawn duplicates or start retrying.
Thanks for the help and good day.
Dries
Your datastore operations should never time out. You need to fix this - most likely, by using cursors and setting the right batch size for your large queries.
You can perform initialization of objects on instance startup - check if an object is available, if not - do the calculations.
Remember to store the results of your calculations in the datastore (in addition to Memcache) as Memcache is volatile. This way you don't have to recalculate everything a few seconds after the first calculation was completed if a Memcache object was dropped for any reason.
Deferred tasks can be scheduled to perform after a specified delay. So instead of using a cron job, you can create a task to be executed after 1 hour (for example). This task, when it completes its own calculations, can create another task to be excited after an hour, and so on.
I am using Objectify in my google cloud endpoints module , My endpoint project handles most of my datastore read and write ops , but i wanted to know if it is an efficient design practice to use Task queues to wrap a read or write operation on the datastore in google app engine .
All the data necessary for a task execution has to be written somewhere, and the App Engine persists this data in a task queue backed by the same Datastore. Unless your write operation involves number crunching, URL fetching, external API calls, updates of hundreds on entities, or some other expensive logic, there is no advantage to wrapping a write call in a task.
Wrapping read calls in tasks is impossible in most cases as you lose an ability to return this data in the same call.
Consider to use write-behind-cache if you want to speed up your writes. There's a little chance that you will lose your data, but you will dramatically speed up the write speed (as seen by user).
The idea is to write entity only into memcache first, so user will not wait for actual datastore write, and then pick up that memcached entity by task queue/cron and write it into datastore.
is there a way to determine when a set of Google App Engine tasks (and child tasks they spawn) have all completed?
Let's say that I have 100 tasks to execute and 10 of those spawn 10 child tasks each. That's 200 tasks. Let's also say that those child tasks might spawn more tasks, recursively, etc...
Is there a way to determine when all tasks have completed? I tried using the app engine pipeline API, but it doesn't look like it's going to work out for my particular use case, even though it is a great API.
My use case is that I want to make a whole bunch of rate limited URL fetch calls while concurrently writing to a blob. At the end of all the URL fetch calls, I want to finalize the blob.
I found the solution with the pipeline API, but it does so much writing to the datastore that it wouldn't be cost effective for me with how often I need to run the pipeline.
There's no way around writing to a persistent storage medium of some sort, and the datastore is the only game in town. You could write your own server to track completions using a backend, but that's an awful lot of overhead for a simple task. Using the pipeline API is your best bet.
We have an affiliate system which counts millions of banner Impressions/Clicks per day.
Currently it writes to SQL every Impression/Click that occurs in real time on each request.
Web application serves these requests.
We are facing two problems:
If we have a lot of concurrent requests per second, the SQL is
starting to work very hard to insert the Impressons/Clicks data and
as a result lead to problem #2.
If SQL is slow at the moment, the requests are being accumulated and
are waiting in the queue on web server. As a result we have a
slowness on a web application and requests are not being processed.
Design we thought of in high level:
We are now considering changing the design by taking out the writing to SQL logic out of web application (write it to some local storage instead) and making a stand alone service which will read from local storage and eventually write the aggregated Impressions/Clicks data (not in real time) to SQL in background.
Our constraints:
10 web servers (load balanced)
1 SQL server
What do you think of suggested design?
Would you use NoSQL as local storage for each web server?
Suggest your alternative.
Your problem seems to be that your front-end code is synchronusly blocking while waiting for the back-end code to update the database.
Decouple front-end and back-end, e.g. by putting a queue inbetween where the front-end can write to the queue with low latency and high throughput. The back-end then can take its time to process the queued data into their destinations.
It may or may not be necessary to make the queue restartable (i.e. not losing data after a crash). Depending on this, you have various options:
In-memory queue, speedy but not crash-proof.
Database queue, makes sense if writing the raw request data to a simple data structure is faster than writing the final data into its target data structures.
Renundant queues, to cover for crashes.
I'm with Bernd, but I'm not sure about using a queue specifically.
All you need is something asynchronous that you can call; that way the act of logging the impression is pretty much redundant.
I'm trying to construct a non-trivial GAE app and I'm not sure if a cron job, tasks, backends or a mix of all is what I need to use based on the request time-out limit that GAE has for HTTP requests.
The distinct steps I need to do are:
1) I have upwards of 15,000 sites I need to pull data from at a regular schedule and without any user interaction. The total number of sites isn't going to static but they're all saved in the datastore [Table0] along side the interval at which they're read at. The interval may vary as regular as every day to every 30 days.
2) For each site from step #1 that fits the "pull" schedule criteria, I need to fetch data from it via HTTP GET (again, it might be all of them or as few as 2 or 3 sites). Once I get the response back from the site, parse the result and save this data into the datastore as [Table1].
3) For all of the data that was recently put into the datastore in [Table1] (they'll have a special flag), I need to issue additional HTTP request to a 3rd party site to do some additional processing. As soon as I receive data from this site, I store all of the relevant info into another table [Table2] in the datastore.
4) As soon as data is available and ready from step #3, I need to take all of it and perform some additional transformation and update the original table [Table1] in the datastore.
I'm not certain which of the different components I need to use to ensure that I can complete each piece of the work without exceeding the response deadline that's placed on the web requests of GAE. For requests initiated by cron jobs and tasks, I believe you're allowed 10 mins to complete it, whereas typical user-driven requests are allowed 30 seconds.
Task queues are the best way to do this in general, but you might want to check out the App Engine Pipeline API, which is designed for exactly the sort of workflow you're talking about.
GAE is a tough platform for your use-case. But, out of extreme masochism, I am attempting something similar. So here are my two cents, based on my experience so far:
Backends -- Use them for any long-running, I/O intensive tasks you may have (Web-Crawling is a good example, assuming you can defer compute-intensive processing for later).
Mapreduce API -- excellent for compute-intensive/parallel jobs such as stats collection, indexing etc. Until recently, this library only had a mapper implementation, but recently Google also released an in-memory Shuffler that is good for jobs that fit in about 100MB.
Task Queues -- For when everything else fails :-).
Cron -- mostly to kick off periodic tasks -- which context you execute them in, is up to you.
It might be a good idea to design your backend tasks so that they can be scheduled (manually, or perhaps by querying your current quota usage) in the "Frontend" context using task queues, if you have spare Frontend CPU cycles.
I abandoned GAE before Backends came out, so can't comment on that. But, what I did a few times was:
Cron scheduled to kick off process
Cron handler invokes a task URL
task grabs first item (URL) from datastore, executes HTTP request, operates on data, updates the URL record as having worked on it and the invokes the task URL again.
So cron is basically waking up taskqueue periodically and taskqueue runs recursively until it reaches some stopping point.
You can see it in action one of my public GAE apps - https://github.com/mavenn/watchbots-gae-python.