How to create a large number of entities in Cloud Datastore - database

My requirement is to create a large number of entities in Google Cloud Datastore. I have CSV files, and combined, the number of entities can be around 50k. I tried the following:
1. Read a CSV file line by line and create the entities in the datastore.
Issues: It works well, but it timed out and cannot create all the entities in one go.
2. Uploaded all the files to Blobstore and read them into the datastore.
Issues: I tried the Mapper function to read the CSV files uploaded to Blobstore and create entities in the datastore. The issue I have is that the mapper does not work if the file size grows larger than 2 MB. I also tried simply reading the files in a servlet, but again ran into the timeout issue.
I am looking for a way to create the above large number (50k+) of entities in the datastore all in one go.

Number of entities isn't the issue here (50K is relatively trivial). Finishing your request within the deadline is the issue.
It is unclear from your question where you are processing your CSVs, so I am guessing it is part of a user request - which means you have a 60 second deadline for task completion.
Task Queues
I would suggest you look into using Task Queues: when you upload a CSV that needs processing, you push it onto a queue for background processing.
When working with Task Queues, the tasks themselves still have a deadline, but one that is larger than 60 seconds (10 minutes when automatically scaled). You should read more about deadlines in the docs to make sure you understand how to handle them, including catching the DeadlineExceededError so that you can save where you are up to in a CSV and resume from that position when the task is retried.
Caveat on catching DeadlineExceededError
Warning: The DeadlineExceededError can potentially be raised from anywhere in your program, including finally blocks, so it could leave your program in an invalid state. This can cause deadlocks or unexpected errors in threaded code (including the built-in threading library), because locks may not be released. Note that (unlike in Java) the runtime may not terminate the process, so this could cause problems for future requests to the same instance. To be safe, you should not rely on the DeadlineExceededError, and instead ensure that your requests complete well before the time limit.
If you are concerned about the above, and cannot ensure your task completes within the 10 min deadline, you have 2 options:
Switch to a manually scaled instance, which gives you a 24-hour deadline.
Ensure your task saves progress and returns an error well before the 10-minute deadline so that it can be resumed correctly without having to catch the error (see the sketch below).
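For illustration, here is a minimal sketch of that second option in Python (2.7 standard runtime; the same pattern applies in Java). The handler URL, the Record model, the column mapping, and the batch sizes are assumptions, not taken from the question:

```python
import csv
import time

import webapp2
from google.appengine.api import taskqueue
from google.appengine.ext import blobstore, ndb

BATCH_SIZE = 500       # rows written per datastore batch put
SAFETY_MARGIN = 540    # seconds; stop well before the 10-minute task deadline


class Record(ndb.Model):
    # Hypothetical entity; map the properties to your CSV columns.
    col_a = ndb.StringProperty()
    col_b = ndb.StringProperty()


class ImportCsvHandler(webapp2.RequestHandler):
    """Push-queue task that imports a Blobstore CSV in resumable chunks."""

    def post(self):
        start = time.time()
        blob_key = self.request.get('blob_key')
        start_row = int(self.request.get('start_row', 0))

        reader = csv.reader(blobstore.BlobReader(blob_key))
        batch = []
        for row_number, row in enumerate(reader):
            if row_number < start_row:
                continue  # already imported on a previous attempt
            batch.append(Record(col_a=row[0], col_b=row[1]))
            if len(batch) >= BATCH_SIZE:
                ndb.put_multi(batch)
                batch = []
                if time.time() - start > SAFETY_MARGIN:
                    # Save progress by re-enqueuing instead of waiting
                    # for DeadlineExceededError to be raised.
                    taskqueue.add(url='/tasks/import-csv',
                                  params={'blob_key': blob_key,
                                          'start_row': row_number + 1})
                    return
        if batch:
            ndb.put_multi(batch)


app = webapp2.WSGIApplication([('/tasks/import-csv', ImportCsvHandler)])
```

The upload handler would enqueue the first task with taskqueue.add(url='/tasks/import-csv', params={'blob_key': str(blob_key)}), and the task keeps handing itself the next start row until the whole file is imported.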

Related

How to avoid Google App Engine push queue increased error rates when using named tasks?

While reading the documentation for Google App Engine push queues in the Java 8 standard environment, I came across the following information regarding named tasks:
Note that de-duplication logic introduces significant performance overhead, resulting in increased latencies and potentially increased error rates associated with named tasks.
I would like to utilize the de-duplication logic in a production environment, however, I am concerned about the potentially increased error rates. What is the cause of the increased error rates using named tasks and how can I effectively avoid these issues? Also, when naming the tasks I would use the random 32 character UID of a user as a prefix, therefore the names would not be sequential.
Data de-duplication is a technique for eliminating duplicate copies of repeating data. It increases the work Cloud Tasks needs to do in order to detect possible duplicates and dispatch your request only once.
If you accidentally add the same task to your queue multiple times, the request will still only be dispatched once, thanks to de-duplication.
In conclusion, when you are using Cloud Tasks with named tasks, you run the risk of higher latencies, which can in turn cause more errors, such as timeouts.
To avoid this kind of error, please also bear in mind the de-duplication window when deleting tasks, as stated here:
The time period during which adding a task with the same name as a recently deleted task will cause the service to reject it with an error. This is the length of time that task de-duplication remains in effect after a task is deleted.
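To make the trade-off concrete, here is a minimal sketch in Python (the question targets Java 8, but the behaviour is analogous); the handler URL and the naming scheme are assumptions:

```python
from google.appengine.api import taskqueue


def enqueue_once(user_uid, payload):
    # Hypothetical naming scheme: the random 32-character user UID as a
    # prefix keeps task names non-sequential.
    task_name = '%s-report' % user_uid
    try:
        taskqueue.add(name=task_name,
                      url='/tasks/report',
                      params=payload)
    except taskqueue.TaskAlreadyExistsError:
        pass  # de-duplicated: an identical named task is already queued
    except taskqueue.TombstonedTaskError:
        pass  # a task with this name ran or was deleted within the
              # de-duplication window described above
```

Both exceptions are expected outcomes of de-duplication rather than failures, so the caller should handle them explicitly instead of treating them as errors.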

Google app engine API: Running large tasks

Good day,
I am running the back-end of an application on App Engine (Java).
Using endpoints, I receive requests. The problem is, there is something big I need to compute, but I need fast response times for the front end. So as a solution I want to precompute something and store it in a dedicated memcache.
The way I did this is by adding a static block and then running a deferred task on the default queue. Is there a better way to have something calculated on startup?
Now, this deferred task performs a large number of datastore operations. Sometimes they time out. So I created a system where it retries on a timeout until it succeeds. However, when I start up the app engine, it immediately creates two of the deferred tasks. It also keeps retrying the tasks when they fail, despite the fact that I set DeferredTaskContext.setDoNotRetry(true);.
Honestly, the deferred tasks feel very finicky.
I just want to run a method that takes >5 minutes (probably longer as the data set grows). I want to run this method on startup, and afterwards on a regular basis. How would you model this? My first thought was a cron job, but they are limited in time. I would need a cron job that runs a deferred task, and hope they don't pile up somehow, spawn duplicates, or start retrying.
Thanks for the help and good day.
Dries
Your datastore operations should never time out. You need to fix this - most likely, by using cursors and setting the right batch size for your large queries.
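For example, a cursor-based batching sketch in Python NDB (the question is Java, but the idea carries over; MyEntity and handle are placeholders):

```python
from google.appengine.ext import ndb


class MyEntity(ndb.Model):
    # Placeholder model for whatever the deferred task is reading.
    value = ndb.IntegerProperty()


def handle(entity):
    # Placeholder for the per-entity part of the large computation.
    pass


def process_in_batches(batch_size=200):
    # Page through the query with a cursor so no single fetch is
    # large enough to time out.
    cursor = None
    while True:
        results, cursor, more = MyEntity.query().fetch_page(
            batch_size, start_cursor=cursor)
        for entity in results:
            handle(entity)
        if not more:
            break
```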
You can perform initialization of objects on instance startup: check whether the object is available and, if not, do the calculations.
Remember to store the results of your calculations in the datastore (in addition to Memcache) as Memcache is volatile. This way you don't have to recalculate everything a few seconds after the first calculation was completed if a Memcache object was dropped for any reason.
Deferred tasks can be scheduled to run after a specified delay. So instead of using a cron job, you can create a task to be executed after 1 hour (for example). This task, when it completes its own calculations, can create another task to be executed after an hour, and so on (a sketch of this pattern follows).
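A sketch of that chaining pattern in Python (the function names, key, and cache key are placeholders; in the question's Java code the equivalent would be a DeferredTask enqueued with a countdown):

```python
from google.appengine.api import memcache
from google.appengine.ext import deferred, ndb


class CachedResult(ndb.Model):
    # Durable copy of the computed value, so a memcache eviction does
    # not force an immediate recalculation.
    value = ndb.JsonProperty()


def recompute_expensive_value():
    # Placeholder for the long datastore computation described above.
    return {}


def refresh_cache():
    value = recompute_expensive_value()
    CachedResult(id='singleton', value=value).put()
    memcache.set('expensive-value', value)
    # Chain the next run an hour later instead of using a cron job.
    deferred.defer(refresh_cache, _countdown=3600)
```

Kicking refresh_cache off once on startup (for example from a warmup request) keeps the chain going afterwards.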

Golang on App Engine Datastore - Using PutMulti to Improve Performance

I have a GAE Golang app that should be able to handle hundreds of concurrent requests, and for each requests, I do some work on the input and then store it in the datastore.
Using the task queue (appengine/delay lib) I am getting pretty good performance, but it still seems very inefficient to perform single-row inserts for each request (even though the inserts are deferred using task queue).
If this was not App Engine, I would probably append the output to a file, and every once in a while I would batch-load the file into the DB using a cron job / some other kind of scheduled service.
So my questions are:
Is there an equivalent scheme I can implement on App Engine? I was thinking that perhaps I should write some of the rows to memcache, and then every couple of seconds bulk-load all of the rows from there and purge the cache.
Is this really needed? Can the datastore handle thousands of concurrent writes - a write per HTTP request my app is getting?
It really depends on your setup. Are you using ancestor queries? If so, then you are limited to 1 write per second per entity group (the ancestor and all of its children and grandchildren). The datastore has a natural queue, so if you try to write too quickly it will queue the writes. It only becomes an issue if you are writing far too many, far too quickly. You can read some best practices here.
If you think you will be going over that limit, use a pull queue with async multi-puts. You would put each entity in the queue. With a backend module (10-minute timeouts) you can pull in the entries in batches (10, 50, 100, ...) and do a put_async on them in batches. It will handle putting them in at the proper speed. While it's working you can queue up the next batch. Just be wary of the timeout.
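A rough sketch of that pull-queue pattern in Python (the question is Go, but the flow is the same; the queue name and Row model are assumptions, and the queue must be declared with mode: pull in queue.yaml):

```python
import json

from google.appengine.api import taskqueue
from google.appengine.ext import ndb


class Row(ndb.Model):
    data = ndb.JsonProperty()


def enqueue(payload):
    # Called from the request handler: cheap, no datastore write yet.
    taskqueue.Queue('rows-pull').add(
        taskqueue.Task(payload=json.dumps(payload), method='PULL'))


def drain(batch_size=100):
    # Backend worker: lease a batch, write it with one async multi-put,
    # then delete the leased tasks.
    queue = taskqueue.Queue('rows-pull')
    tasks = queue.lease_tasks(lease_seconds=60, max_tasks=batch_size)
    if not tasks:
        return
    futures = ndb.put_multi_async(
        [Row(data=json.loads(t.payload)) for t in tasks])
    ndb.Future.wait_all(futures)
    queue.delete_tasks(tasks)
```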

DeadlineExceededError in GAE + BQ despite several steps taken to avoid it

I have several BigQuery queries that each take around 10-30 seconds to run, which I have been trying to execute from Google App Engine. At one or more places in the call stack, an HTTP request is being killed with a DeadlineExceededError. Sometimes the DeadlineExceededError (unsure which kind) is raised as is, and sometimes it is translated to an HTTPException.
Following leads found in different SO posts, I have taken various steps to avoid the timeout:
Run the query in a task that is added to a GAE TaskQueue, setting the task_age_limit to 10m. (1)
Pass a timeoutMs flag to getQueryResults (called on a job object in Google's Python API) using a value of 599 * 1000 ~ 10 minutes. (2)
Just before the call to getQueryResults, call urlfetch.set_default_fetch_deadline(60), every time, in an attempt to ensure that the setting is local to the thread that is making the call. (3, 4, 5)
A gist of the relevant part of a typical stack trace can be found here. In a typical task execution, there will be a number of failures and then finally, perhaps, a success.
This answer seems to be saying that a urlfetch call will not be allowed to exceed 60 seconds on GAE, in any context (including a task). I doubt the queries are exceeding the hard limit in my case, so I'm probably missing an important step. Has anyone run into a similar situation and figured out what was going on?
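For reference, a minimal sketch of how the steps listed above are wired together (Python; bq_service is assumed to be a google-api-python-client BigQuery service object, and the queue name is hypothetical):

```python
from google.appengine.api import urlfetch

# queue.yaml (step 1):
#   queue:
#   - name: bq-queue
#     retry_parameters:
#       task_age_limit: 10m


def fetch_results(bq_service, project_id, job_id):
    # Step 3: set the urlfetch deadline just before the call, in the
    # thread that makes it.
    urlfetch.set_default_fetch_deadline(60)
    # Step 2: ask BigQuery to hold the HTTP request for up to ~10 minutes.
    # As noted above, urlfetch itself appears to cap the request at 60
    # seconds regardless of timeoutMs, which may be where the conflict lies.
    return bq_service.jobs().getQueryResults(
        projectId=project_id, jobId=job_id,
        timeoutMs=599 * 1000).execute()
```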

Burst of processing power with TaskQueues?

I've got a situation where I want to make 1000 different queries to the datastore, do some calculations on the results of each individual query (to get 1000 separate results), and return the list of results.
I would like the list of results to be returned as the response from the same 30-second user request that started the calculation, for better client-side performance. Hah!
I have a bold plan.
Each of these operations individually will usually have no problem finishing in under a second, none of them need to write to the same entity group as any other, and none of them need any information from any of the other queries. Might it be possible to start 1000 independent tasks, each taking on one of these queries, doing its calculations, and storing the result in some sort of temporary collection of entities? The original request could wait 10 seconds, and then do a single query for the results from the datastore (maybe they all set a unique value I can query on). Any results that aren't in yet would be noticed at the client end, and the client could just ask for those values again in another ten seconds.
The questions I hope experienced appengineers can answer are:
Is this ludicrous? If so, is it ludicrous for any number of tasks? Would 50 at once be reasonable?
I won't run into datastore contention if I'm reading the same entity 20 times a second, right? That contention stuff is all for writing?
Is there an easier way to get a response from a task?
Yep, sounds pretty ludicrous :)
You shouldn't rely on the Task Queue to operate like that. You can't rely on 1000 tasks being spawned that quickly (although they most likely will be).
Why not use the Channel API to wait for your response? Your solution then becomes:
Client sends request to Server
Server spawns N tasks to do your calculations and responds to the Client with a Channel API token
Client listens to the Channel using the token
Once all the tasks are finished, the Server pushes the response to the Client via the Channel
This would avoid any timeout issues that would very likely arise from time to time due to tasks not executing as fast as you would like, or for some other reason.
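A rough sketch of that flow in Python (the handler names and the per-query work are placeholders):

```python
import json

from google.appengine.api import channel
from google.appengine.ext import deferred


def execute_and_calculate(query):
    # Placeholder for the per-query datastore work and calculation.
    return None


def start_calculations(client_id, queries):
    # Steps 1-2: create the channel and spawn one task per query.
    token = channel.create_channel(client_id)
    for index, query in enumerate(queries):
        deferred.defer(run_one_query, client_id, index, query)
    return token  # returned to the client, which opens the channel with it


def run_one_query(client_id, index, query):
    # Step 4: push each result back to the client over the channel.
    result = execute_and_calculate(query)
    channel.send_message(client_id, json.dumps({'index': index,
                                                'result': result}))
```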
The Task Queue doesn't provide firm guarantees on when a task will execute - the ETA (which defaults to the current time) is the earliest time at which it will execute, but if the queue is backed up, or there are no instances available to execute the task, it could execute much later.
One option would be to use Datastore Plus / NDB, which allows you to execute queries in parallel. 1000 queries is going to be very expensive, however, no matter how you execute them.
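For instance, with NDB you can issue the queries asynchronously and gather the results afterwards (the model and the filter are placeholders):

```python
from google.appengine.ext import ndb


class Item(ndb.Model):
    category = ndb.StringProperty()
    value = ndb.IntegerProperty()


def run_parallel(categories):
    # Kick off all the queries without blocking...
    futures = [Item.query(Item.category == c).fetch_async(limit=100)
               for c in categories]
    # ...then collect each result once it is ready.
    return [f.get_result() for f in futures]
```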
Another option, as @Chris suggests, is to use the task queue with the Channel API, so you can notify the user asynchronously when the queries complete.
