Synchronicity and the Datastore in Google App Engine - google-app-engine

I seem to be having a consistency problem with some of my data; I'm writing a unit test to see if a certain model has been placed in the datastore. The test fails unless I put a 5 second sleep before the return of the storing function.
I've been reading about asynchronous functions in GAE, thinking that perhaps I need something along the lines of a promise so that the function will wait before returning until the data has been placed into the datastore. However, all the documentation on asynchronous versions of functions in GAE seems to imply that its non-async functions already sort of act like promises in that way.
What does it mean for a function like put() to return? It seems to not mean that the data has been appropriately stored. Is there a way to wait until the data has been stored?
EDIT: My problem wasn't simply dealing with consistency, but that I was unsure of whether the problem was a consistency issue at all, and wanted instead to ask specifically about how the return of a call to put() relates to what is happening under the hood of GAE.
I think this question is similar to that listed, but is still useful to remain up because it approaches the consistency issue from a different perspective. If other people need to find this information, but aren't entirely sure of the phrasing as I was, or follow a similar train of thought as me, they may be able to reach the information through this question. It's also written more explicitly, with less domain specific terminology.
That being said, I do see the issue in terms of end-goal informational content; I would understand if it's taken down.

https://cloud.google.com/appengine/docs/java/datastore/#Java_Datastore_writes_and_data_visibility
Data writes happen in two stages, commit and apply. Commit records the transactions to a majority of replicas, and apply does two things in parallel: 1) writes the data and 2) writes the indexes.
Your unit test query may be executing on a replica that has a stale version of the data. The write operation returns immediately after the commit phase, but the apply phase happens asynchronously. Ancestor queries and lookups by key are strongly consistent, however, so try testing by getting the object by its key.
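To make the commit/apply split concrete, here is a toy in-memory model in Python. It is only a sketch of the behaviour described above, not the Datastore API: `ToyDatastore`, `apply_pending` and the `user:1` key are all made up for illustration, and the real apply phase runs on its own rather than being triggered by hand.

```python
class ToyDatastore:
    """In-memory stand-in that separates the commit and apply phases."""

    def __init__(self):
        self._committed = {}  # key -> entity; durable once put() returns
        self._index = {}      # key -> entity; what non-ancestor queries see
        self._pending = []    # keys committed but not yet applied

    def put(self, key, entity):
        # put() returns as soon as the write is committed; apply is deferred.
        self._committed[key] = entity
        self._pending.append(key)

    def get(self, key):
        # A lookup by key sees every committed write.
        return self._committed.get(key)

    def query(self, predicate):
        # A query reads the index, which may lag behind the commit log.
        return [e for e in self._index.values() if predicate(e)]

    def apply_pending(self):
        # Stands in for the asynchronous apply phase finishing.
        for key in self._pending:
            self._index[key] = self._committed[key]
        self._pending.clear()


store = ToyDatastore()
store.put("user:1", {"name": "ada"})

by_key = store.get("user:1")                          # sees the write
stale = store.query(lambda e: e["name"] == "ada")     # index not updated yet
store.apply_pending()
fresh = store.query(lambda e: e["name"] == "ada")     # now visible
```

In this model a unit test that asserts on `get` passes right after `put`, while one that asserts on `query` only passes once the apply phase has run, which matches the symptom of needing a sleep.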

Related

Prevent redundant CRUD operations in multi-container pod

If I have multiple identical containers deployed simultaneously, and each contains a job to periodically create an artifact and save to a database, and what they save is deterministic, how should I go about preventing redundant operations?
Should I check the key in the database to see if it exists first, and if it doesn't, begin the saving operation? The artifact creation process is lengthy, so it's quite likely that one container may check the DB, see that it hasn't been saved to yet, and start the artifact creation process ... in the meantime, the other container may do the same.
I realize that having multiple clones of the same container is good for preventing downtime / keeping the application robust, but how should you deal with side effects?
This is a pretty open-ended question, so there isn't going to be one definitive answer without knowing the exact specifics of your situation.
Generally speaking in situations like this you should try to make the action that is being performed idempotent if possible, thus removing the issues if multiple requests are sent to perform the same action.
The question I would be asking myself is whether or not your architecture and technology stack are suitable for this task. Not every activity needs to be performed in Kubernetes.
Would a Kubernetes CronJob be more suitable for this?
What about using a message queue?
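One way to get close to idempotent behaviour is to have each container claim the key atomically before it starts the expensive work, instead of checking and then writing. The sketch below uses Python's built-in sqlite3 purely as a stand-in for whatever database you have; the `artifact_claims` table, the `try_claim` helper and the worker names are hypothetical.

```python
import sqlite3


def try_claim(conn, artifact_key, worker_id):
    """Return True only for the first worker that claims artifact_key."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO artifact_claims (artifact_key, worker_id) "
            "VALUES (?, ?)",
            (artifact_key, worker_id),
        )
    # The primary key makes the insert a no-op if the key is already claimed.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE artifact_claims ("
    "artifact_key TEXT PRIMARY KEY, worker_id TEXT NOT NULL)"
)

first = try_claim(conn, "report-2024-01", "container-a")   # wins the claim
second = try_claim(conn, "report-2024-01", "container-b")  # skips the work
```

Only the container that gets `True` goes on to build the artifact. A real setup would probably also need an expiry on claims so that a container that crashes mid-build does not block the key forever.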

How to get stale data from the cache with Objectify?

I have been trying to investigate what behaviours that may come from getting a DeadlineExceededException when using Objectify with caching turned on.
My experiments so far have been like this: 1) Store an entity or entities, 2) then sleep for most of the remaining execution time and 3) make some updates in an infinite loop until the request is aborted. 4) Whether the cache is in sync with successful writes to the datastore is checked in a separate request.
"Some updates" means changing a lot (50) of strings in an object, then writing it back. I have also tried making updates to several objects in a transaction to test if I can get some inconsistent results when loading the entities again. So far, after thousands of tests, I have not got a single inconsistent entity from the cache.
So can I somehow provoke a load of a presumably cached entity to be inconsistent with the entity in the datastore?
There are a lot of possible reasons for this. If you're making changes in a single request, you're probably seeing the session cache in operation:
https://github.com/objectify/objectify/wiki/Caching
If you're making queries in many requests, you may be seeing the results of eventual consistency:
https://cloud.google.com/datastore/docs/articles/balancing-strong-and-eventual-consistency-with-google-cloud-datastore/
Perhaps you are seeing session cache contamination because you don't have the ObjectifyFilter installed? Recent versions of Objectify give you a nasty warning if you don't, but maybe you're running an old version?

Why do major DB vendors not provide truly asynchronous APIs?

I work with Oracle and MySQL, and I struggle to understand why the APIs are not written such that I can issue a call, go away and do something else, and then come back and pick it up later (e.g. as with NIO). Instead, I am forced to dedicate a thread to waiting for data. It seems that the SQL interfaces are the only place where sync IO is still forced, which means tying up a thread waiting for the DB.
Can anybody explain the reasons for this? Is there something fundamental that makes this difficult?
It would be great to be able to use 1-2 threads to manage my DB query issue and result fetch, rather than use worker threads to retrieve data.
I do note that there are two experimental attempts (eg: adbcj) at implementing an async API but none seem to be ready for Production use.
Database servers should be able to handle thousands of clients. To provide an asynchronous interface, the DB server would need to keep the result set from the query in memory so you can pick it up at a later stage. It would quickly run out of resources.
A considerable problem with async is that many libraries use thread-local storage for transactions.
For example, in Java much of the JDBC specification relies on synchronous behavior to achieve a single thread per transaction. That is, you write your transaction in procedural order.
To do it right, transactions would have to be done through callbacks, but they are not. I know of only node.js that does this, but it's unclear if it's really async.
Of course, even if you do go async, I'm not sure it will really improve performance, as the database itself is probably doing the work synchronously.
There are lots of ways to avoid thread over-population in (Java):
Is asynchronous jdbc call possible?
Personally to get around this issue I use a Message Bus like RabbitMQ.
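As a rough illustration of keeping the thread count bounded, the sketch below hands blocking calls to a small fixed pool and awaits them from an event loop. It is in Python rather than Java, `blocking_query` is a made-up stand-in for a synchronous driver call, and the pool size of 2 is arbitrary. It does not make the driver itself asynchronous; it only confines the blocking to a couple of threads.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_query(query_id):
    # Stand-in for a synchronous driver call that blocks its thread.
    time.sleep(0.05)
    return f"rows for query {query_id}"


async def run_queries(query_ids, pool):
    loop = asyncio.get_running_loop()
    # Each call occupies a pool thread; the event loop thread stays free.
    futures = [loop.run_in_executor(pool, blocking_query, q) for q in query_ids]
    # gather() returns results in the order the futures were passed in.
    return await asyncio.gather(*futures)


# Four queries share two worker threads instead of taking one thread each.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = asyncio.run(run_queries([1, 2, 3, 4], pool))
```

The trade-off is the one raised above: queries beyond the pool size simply wait in a queue, so this caps thread usage without reducing the time the database spends on each call.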

Sharing transactional space between two connections

There's an app that starts a transaction on SQL Server 2008 and moves some data around. Then, while the transaction is still not committed, the app prints out some labels. It is very important that the transaction is not committed until printing succeeded; if a printing error occurs, everything is rolled back.
Now, the printing engine a) has grown quite huge and complex, and b) is eventually required from lots of places. It has therefore been decided to separate the engine and make it a service.
Yes, it is possible to pass all data required for printing from the client app to that server so that the server only prints and is not concerned about databases. However, that would mean leaving piles of code and label templates in each application that requires printing; effectively, very little separation will then occur. On the contrary, it would be extremely efficient (and easier for me to write and maintain) to just pass the IDs of what is required to the service, which would then go to the database and get the data. All formats and layouts will be centralized and apps will only ask for 5 delivery notes from print job 12345.
Now, this is not going to happen as the transaction is still not committed at the moment of printing. The service would not be able to read the data, and using READ UNCOMMITTED is not quite an option.
I was going to use the good old sp_bindsession to join the two sessions, app's and service's, but then it is suddenly deprecated and to be removed from future releases. The help suggests I use MARS or distributed transactions instead, but I can't see how they would help.
Any advice?
My gut feeling is that attempting to share a transaction between two processes in this way is not a good idea.
My approach would be either to pass all data to the service, or to investigate alternatives to keeping the transaction open for the duration of the printing. Would a simpler mechanism (such as an IsPrinted flag for each record) not suffice?
Failing that, the easiest way I can see of doing this would be to have the printing service pass all of its SQL requests back through to the originating process so that they can be executed in the context of the original transaction.
Only sp_getbindtoken/sp_bindsession can do what you ask, and it is deprecated and will be removed.
In theory you should use short transactions, represent the 'printing' state as a committed state, and have compensating actions if the print fails. Also, if the printing engine is exposed as a service, it should be autonomous and receive as a message all the data it needs to print (like label templates). I understand this is easy for me to say but may be a major undertaking on the product.
For the moment I think your best bet is to use the session binding tokens. Although I have to call out that leaving transactions open for the duration of physical operations (printing) is a very bad practice.
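The short-transaction approach with a compensating action could look roughly like this. It is a sketch in Python with sqlite3 standing in for SQL Server; the `print_jobs` table, the status values and the `print_labels` callback are all hypothetical names, not anything from the original app.

```python
import sqlite3


def print_with_compensation(conn, job_id, print_labels):
    """Commit a PRINTING state first, then print outside any transaction."""
    with conn:  # short transaction 1: record that printing has started
        conn.execute(
            "UPDATE print_jobs SET status = 'PRINTING' WHERE id = ?", (job_id,)
        )
    try:
        print_labels(job_id)  # slow physical operation, no transaction open
    except Exception:
        with conn:  # compensating action: undo the committed state change
            conn.execute(
                "UPDATE print_jobs SET status = 'PENDING' WHERE id = ?", (job_id,)
            )
        return False
    with conn:  # short transaction 2: record success
        conn.execute(
            "UPDATE print_jobs SET status = 'PRINTED' WHERE id = ?", (job_id,)
        )
    return True


def status_of(conn, job_id):
    row = conn.execute(
        "SELECT status FROM print_jobs WHERE id = ?", (job_id,)
    ).fetchone()
    return row[0]


def failing_printer(job_id):
    raise RuntimeError("printer jam")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE print_jobs (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.executemany("INSERT INTO print_jobs VALUES (?, 'PENDING')", [(1,), (2,)])
conn.commit()

ok = print_with_compensation(conn, 1, lambda job_id: None)
failed = print_with_compensation(conn, 2, failing_printer)
```

Because the PRINTING state is committed, a separate print service can read the data without READ UNCOMMITTED or session binding. The cost is that the "move data" step has to be undone explicitly rather than by a rollback.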

Tips for optimising Database and POST request performance

I am developing an application which involves multiple user interactivity in real time. It basically involves lots of AJAX POST/GET requests from each user to the server - which in turn translates to database reads and writes. The real time result returned from the server is used to update the client side front end.
I know optimisation is quite a tricky, specialised area, but what advice would you give me to get maximum speed of operation here - speed is of paramount importance, but currently some of these POST requests take 20-30 seconds to return.
One way I have thought about optimising it is to club POST requests and send them out to the server as a group of 8-10, instead of firing individual requests. I am not currently using caching on the database side, and don't really have much knowledge of what it is, or whether it will be beneficial in this case.
Also, do the AJAX POST and GET requests incur the same overhead in terms of speed?
Rather than continuously hitting the database, cache frequently used data items (with an expiry time based upon how infrequently the data changes).
Can you reduce your communication with the server by caching some data client side?
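A cache with an expiry time can be very small. The following is a minimal Python sketch, not tied to any framework; `TTLCache`, `load_from_db` and the 30 second expiry are made-up for illustration, and the injectable clock is only there so the expiry can be demonstrated without waiting.

```python
import time


class TTLCache:
    """Tiny cache whose entries expire ttl_seconds after being stored."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no database hit
        value = loader(key)  # missing or expired: go to the database
        self._entries[key] = (now + self._ttl, value)
        return value


db_hits = []


def load_from_db(key):
    # Stand-in for a real query; records each time the database is hit.
    db_hits.append(key)
    return f"value-for-{key}"


fake_now = [0.0]
cache = TTLCache(ttl_seconds=30, clock=lambda: fake_now[0])

a = cache.get_or_load("scores", load_from_db)  # miss: loads from the database
b = cache.get_or_load("scores", load_from_db)  # hit: served from the cache
fake_now[0] = 31.0
c = cache.get_or_load("scores", load_from_db)  # expired: loads again
```

Pick the expiry per item based on how stale the data is allowed to be; for real-time updates that may be a second or two, which can still remove most of the repeated reads.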
The purpose of GET is as its name implies - to GET information. It is intended to be used when you are reading information to display on the page. Browsers will cache the result from a GET request and if the same GET request is made again then they will display the cached result rather than rerunning the entire request. This is not a flaw in the browser processing but is deliberately designed to work that way so as to make GET calls more efficient when the calls are used for their intended purpose. A GET call is retrieving data to display in the page and data is not expected to be changed on the server by such a call and so re-requesting the same data should be expected to obtain the same result.
The POST method is intended to be used where you are updating information on the server. Such a call is expected to make changes to the data stored on the server and the results returned from two identical POST calls may very well be completely different from one another since the initial values before the second POST call will be different from the initial values before the first call because the first call will have updated at least some of those values. A POST call will therefore always obtain the response from the server rather than keeping a cached copy of the prior response.
Ref.
The optimization tricks you'd use are generally the same tricks you'd use for a normal website, just with a faster turn around time. Some things you can look into doing are:
Prefetch GET requests that have high odds of being loaded by the user
Use a caching layer in between as Mitch Wheat suggests. Depending on your technology platform, you can look into memcache, it's quite common and there are libraries for just about everything
Look at denormalizing data that is going to be queried at a very high frequency. Assuming that reads are more common than writes, you should get a decent performance boost if you move the workload to the write portion of the data access (as opposed to adding database load via joins)
Use delayed inserts to give priority to writes and let the database server optimize the batching
Make sure you have intelligent indexes on the table and figure out what benefit they're providing. If you're rebuilding the indexes very frequently due to a high write:read ratio, you may want to scale back the queries
Look at retrieving data in more general queries and filtering the data when it makes it to the business layer of the application. MySQL (for instance) uses a very specific query cache that matches against a specific query. It might make sense to pull all results for a given set, even if you're only going to be displaying x%.
For writes, look at running asynchronous queries to the database if it's possible within your system. Data synchronization doesn't have to be instantaneous, it just needs to appear that way (most of the time)
Cache common pages on disk/memory in a fully formatted state so that the server doesn't have to do much processing of them
All in all, there are lots of things you can do (and they generally come down to general development practices on a more bite sized scale).
The common tuning tricks would be:
- use more indexing
- use less indexing
- use more or less caching on filesystem, database, application, or content
- provide more bandwidth or more cpu power or more memory on any of your components
- minimize the overhead in any kind of communication
Of course an alternative would be to:
0 develop a set of tests, preferably automated, that can determine whether your application works correctly.
1 measure the 'speed' of your application.
2 determine how fast it has to become
3 identify the source of the performance problems:
typical problems are: network throughput, file i/o, latency, locking issues, insufficient memory, cpu
4 fix the problem
5 make sure it is actually faster
6 make sure it is still working correctly (hence the tests above)
7 return to 1
Have you tried profiling your app?
Not sure what framework you're using (if any), but frankly from your questions I doubt you have the technical skill yet to just eyeball this and figure out where things are slowing down.
Bluntly put, you should not be messing around with complicated ways to try to solve your problem, because you don't really understand what the problem is. You're more likely to make it worse than better by doing so.
What I would recommend you do is time every step. Most likely you'll find that either
you've got one or two really long running bits or
you're running a shitton of queries because of an n+1 error or the like
When you find what's going wrong, fix it. If you don't know how, post again. ;-)
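If there is no profiler to hand, timing every step can be done with a few lines. This is a rough Python sketch; the step names and the `time.sleep` calls are placeholders for your own request handling, and `timed` is a hypothetical helper.

```python
import time
from contextlib import contextmanager

timings = {}  # step name -> total seconds spent in that step


@contextmanager
def timed(step):
    start = time.perf_counter()
    try:
        yield
    finally:
        # Accumulate, so a step that runs many times shows its total cost.
        timings[step] = timings.get(step, 0.0) + time.perf_counter() - start


def handle_request():
    with timed("parse"):
        time.sleep(0.01)      # placeholder for reading the POST body
    for _ in range(3):
        with timed("query"):
            time.sleep(0.02)  # placeholder for one database query
    with timed("render"):
        time.sleep(0.01)      # placeholder for building the response


handle_request()
slowest = max(timings, key=timings.get)
```

Because the totals accumulate, a step that is individually quick but runs in a loop (the n+1 case) still shows up as the largest entry.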
