How to get stale data from the cache with Objectify? - google-app-engine

I have been trying to investigate what behaviours may result from a DeadlineExceededException when using Objectify with caching turned on.
My experiments so far have been like this: 1) Store an entity or entities, 2) then sleep for most of the remaining execution time and 3) make some updates in an infinite loop until the request is aborted. 4) A separate request then checks whether the cache is in sync with the successful writes to the datastore.
"Some updates" means changing a lot (50) of strings in an object, then writing it back. I have also tried making updates to several objects in a transaction to test if I can get some inconsistent results when loading the entities again. So far, after thousands of tests, I have not got a single inconsistent entity from the cache.
So can I somehow provoke a load of a presumably cached entity to be inconsistent with the entity in the datastore?

There are a lot of possible reasons for this. If you're making changes in a single request, you're probably seeing the session cache in operation:
https://github.com/objectify/objectify/wiki/Caching
If you're making queries in many requests, you may be seeing the results of eventual consistency:
https://cloud.google.com/datastore/docs/articles/balancing-strong-and-eventual-consistency-with-google-cloud-datastore/
Perhaps you are seeing session cache contamination because you don't have the ObjectifyFilter installed? Recent versions of Objectify give you a nasty warning if you don't, but maybe you're running an old version?

Related

How can I combine similar tasks to reduce total workload?

I use App Engine, but the following problem could very well occur in any server application:
My application uses memcache to cache both large (~50 KB) and small (~0.5 KB) JSON documents which aggregate information which is expensive to refresh from the datastore. These JSON documents can change often, but the changes are sparse in the document (i.e., one item out of hundreds may change at a time). Currently, the application invalidates an entire document if something changes, and then will lazily re-create it later when it needs it. However, I want to move to a more efficient design which updates whatever particular value changed in the JSON document directly from the cache.
One particular concern is contention from multiple tasks / request handlers updating the same document, but I have ways to detect this issue and mitigate it. However, my main concern is that it's possible that there could be rapid changes to a set of documents within a small period of time coming from different request handlers, and I don't want to have to edit the JSON document in the cache separately for each one. For example, it's possible that 10 small changes affecting the same set of 20 documents of 50 KB each could be triggered in less than a minute.
So this is my problem: What would be an effective solution to combine these changes together? In my old solution, although it is expensive to re-create an entire document when a small item changes, the benefit at least is that it does it lazily when it needs it (which could be a while later). However, to update the JSON document with a small change seems to require that it be done immediately (not lazily). That is, unless I come up with a complex solution that lazily applies a set of changes to the document later on. I'm hoping for something efficient but not too complicated.
Thanks.
Pull queue. Everyone using GAE should watch this video:
http://www.youtube.com/watch?v=AM0ZPO7-lcE
When a call comes in, update memcache and do an async_add to your task pull queue. You could likely run a process that handles thousands of updates each minute without a lot of overhead (i.e. instance issues). You still have an issue should memcache get purged prior to your updates, but that is not too hard to work around. HTH. -stevep
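The batching idea can be sketched in plain Python, without any App Engine APIs. This is only an illustration under assumptions: `ChangeBuffer`, `record` and `flush` are hypothetical names, a dict stands in for memcache, and the buffer stands in for whatever a pull-queue worker would lease and process in one pass.

```python
import json
from collections import defaultdict


class ChangeBuffer:
    """Collects small item-level changes per document until they are flushed."""

    def __init__(self):
        self._pending = defaultdict(dict)  # doc_id -> {item_key: new_value}

    def record(self, doc_id, item_key, value):
        # A later change to the same item overwrites the earlier one,
        # so rapid repeated edits collapse into a single write.
        self._pending[doc_id][item_key] = value

    def flush(self, cache):
        """Apply all pending changes, parsing and rewriting each document once."""
        touched = 0
        for doc_id, changes in self._pending.items():
            raw = cache.get(doc_id)
            if raw is None:
                continue  # evicted from the cache: leave it to be rebuilt lazily
            doc = json.loads(raw)
            doc.update(changes)
            cache[doc_id] = json.dumps(doc)
            touched += 1
        self._pending.clear()
        return touched


# Hypothetical usage: a dict plays the role of memcache.
cache = {"doc1": json.dumps({"a": 1, "b": 2})}
buffer = ChangeBuffer()
buffer.record("doc1", "a", 10)
buffer.record("doc1", "a", 11)    # supersedes the previous change to "a"
buffer.record("doc1", "b", 3)
buffer.record("evicted", "x", 1)  # document no longer cached
touched = buffer.flush(cache)
```

With this shape, ten small changes to the same document cost one parse and one write instead of ten, and a purged document simply falls back to the old lazy rebuild.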

Write/Read with High Replication Datastore + NDB

So I have been reading a lot of documentation on HRD and NDB lately, yet I still have some doubts regarding how NDB caches things.
Example case:
Imagine a case where a user writes data and the app needs to fetch it immediately after the write. E.g. A user creates a "Group" (similar to a Facebook/Linkedin group) and is redirected to the group immediately after creating it. (For now, I'm creating a group without assigning it an ancestor)
Result:
When testing this sort of functionality locally (having enabled high replication), the immediate fetch of the newly created group fails. A NoneType is returned.
Question:
Having gone through the High Replication docs and Google IO videos, I understand that there is a higher write latency, however, shouldn't NDB caching take care of this? I.e. A write is cached, and then asynchronously actually written on disk, therefore, an immediate read would be reading from cache and thus there should be no problem. Do I need to enforce some other settings?
Pretty sure you are running into the HRD feature where queries are "eventually consistent". NDB's caching has nothing to do with this behavior.
I suspect it might be because of the redirect that the NoneType is returned.
https://developers.google.com/appengine/docs/python/ndb/cache#incontext
The in-context cache persists only for the duration of a single incoming HTTP request and is "visible" only to the code that handles that request. It's fast; this cache lives in memory. When an NDB function writes to the Datastore, it also writes to the in-context cache. When an NDB function reads an entity, it checks the in-context cache first. If the entity is found there, no Datastore interaction takes place.
Queries do not look up values in any cache. However, query results are written back to the in-context cache if the cache policy says so (but never to Memcache).
So you are writing the value to the cache, redirecting it and the read then fails because the HTTP request on the redirect is a different one and so the cache is different.
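To make the request-scoped behaviour concrete, here is a toy model in plain Python. It does not use NDB at all: `Datastore` and `RequestContext` are hypothetical stand-ins, and the only thing the sketch shows is that a cache created per request starts out empty on the redirect.

```python
class Datastore:
    """Stand-in for the backing store, shared by all requests."""

    def __init__(self):
        self.rows = {}
        self.reads = 0  # counts reads that actually reach the store

    def get(self, key):
        self.reads += 1
        return self.rows.get(key)


class RequestContext:
    """Stand-in for one HTTP request with its own in-context cache."""

    def __init__(self, store):
        self.store = store
        self.cache = {}

    def put(self, key, value):
        self.store.rows[key] = value
        self.cache[key] = value  # a write also populates this request's cache

    def get(self, key):
        if key in self.cache:
            return self.cache[key]  # no datastore interaction
        value = self.store.get(key)
        if value is not None:
            self.cache[key] = value
        return value


store = Datastore()

create_request = RequestContext(store)
create_request.put("group-1", {"name": "Hikers"})
create_request.get("group-1")             # served from the in-context cache
reads_after_create = store.reads

redirect_request = RequestContext(store)  # new request, so a new empty cache
group = redirect_request.get("group-1")   # has to go to the datastore
reads_after_redirect = store.reads
```

In this toy the read by key on the second request still succeeds because it goes to the store itself; what the second request cannot do is benefit from anything the first request cached.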
I'm reaching the limit of my knowledge here but I'd suggest initially that you try the create in a transaction and redirect when complete/success.
https://developers.google.com/appengine/docs/python/ndb/transactions
Also when you put the group model into the datastore you'll get a key back. Can you pass that key (via urlsafe for example) to the redirect? Then you'll be guaranteed to retrieve the data, as you have its explicit key. You can't have its key if it's not in the datastore, after all.
Also I'd suggest trying it as is on the production server, sometimes behaviours can be very different locally and on production.

Hibernate HQL only hits session cache

I am having some trouble understanding where an HQL query gets the information from. My project is using different threads and each thread reads/writes to the database. Threads do not share Session objects, instead I am using a HibernateUtil class which creates sessions for me.
Until recently, I would only close a session after writing but not after reading. Changes to objects would be immediately seen in the database but when reading on other threads (different Session object than the one used for writing) I would get stale information. Reading and writing happened always on different threads which means different Session objects and different session caches.
I always thought that by using HQL instead of Criteria, I would always target the database (or second level cache) and not the session cache but while debugging my code, it was made clear to me that the HQL was looking for the object in the session cache and retrieved an old outdated object.
Was I wrong in assuming that HQL always targets the database? Or at least the second level cache?
PS: I am using only one SessionFactory object.
Hibernate has different concepts of caching - entity caches, and query caches. Entity caching is what the session cache (and the 2nd level cache, if enabled) does.
Assuming query caching is not enabled (which it's not, by default), then your HQL would have been executed against the database. This would have returned the IDs of the entities that match the query. If those entities were already in the session cache, then Hibernate would have returned those, rather than rebuilding them from the database. If your session has stale copies of them (because another session has updated the database), then that's the problem you have.
I would advise against using long-lived sessions, mainly for that reason. You should limit the lifespan of the session to the specific unit of work that you're trying to do, and then close it. There's little or no performance penalty to doing this (assuming you use a database connection pool). Alternatively, to make sure you don't get stale entities, you can call Session.clear(), but you may end up with unexpected performance side-effects.
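The stale-read mechanism described above can be modelled in a few lines of Python. This is a toy, not Hibernate's API: `Database`, `Session`, `query_all` and `identity_map` are hypothetical names chosen for the illustration, and the query is reduced to "return every row".

```python
class Database:
    """Stand-in for the shared database."""

    def __init__(self):
        self.rows = {}  # id -> dict of column values


class Session:
    """Toy session cache: one object per id for the session's lifetime."""

    def __init__(self, db):
        self.db = db
        self.identity_map = {}

    def query_all(self):
        result = []
        # The query itself always runs against the database to find matching ids...
        for row_id in sorted(self.db.rows):
            # ...but an entity already held by the session is reused, not rebuilt.
            if row_id not in self.identity_map:
                self.identity_map[row_id] = dict(self.db.rows[row_id])
            result.append(self.identity_map[row_id])
        return result

    def clear(self):
        self.identity_map.clear()


db = Database()
db.rows[1] = {"name": "old"}

reader = Session(db)                   # long-lived session on the reading thread
first = reader.query_all()[0]["name"]  # loads and caches the entity

db.rows[1] = {"name": "new"}           # another session commits an update

stale = reader.query_all()[0]["name"]  # the query runs, the cached object is returned
reader.clear()
fresh = reader.query_all()[0]["name"]  # rebuilt from the database after clear()
```

A short-lived session per unit of work gets the same effect as `clear()` without having to remember to call it.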

writing then reading entity does not fetch entity from datastore

I am having the following problem. I am now using the low-level google datastore API rather than JDO, that way I should be in a better position to see exactly what is happening in my code. I am writing an entity to the datastore and shortly thereafter reading it from the datastore using Jetty and eclipse. Sometimes the written entity is not being read. This would be a real problem if it were to happen in production code. I am using the 2.0 RC2 API.
I have tried this several times, sometimes the entity is retrieved from the datastore and sometimes it is not. I am doing a simple query on the datastore just after committing a write transaction. (If I run the code through the debugger things run slow enough that the entity has a chance of being read back on the second pass).
Any help with this issue would be greatly appreciated,
Regards,
The development server has the same consistency guarantees as the High Replication datastore on the live server. A "global" query uses an index that is only guaranteed to be eventually consistent with writes. To perform a query with strongly consistent guarantees, the query must be limited to an entity group, using an "ancestor" key.
A typical technique is to group data specific to a single user in a group, so the user can see changes to queries limited to the user's group with strong consistency guarantees. Another technique is to use fancier client logic to update the client's local view as soon as the change is submitted, so the user sees the change in the UI immediately while the update to the global index is in progress.
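The difference between a global query and an ancestor query can be illustrated with a toy model in Python. It uses no datastore API: `ToyDatastore` and all its methods are hypothetical, and the real datastore's index machinery is reduced to a list of writes that the global index has not yet caught up with.

```python
class ToyDatastore:
    """Toy model: committed entities are visible at once, the global index lags."""

    def __init__(self):
        self.entities = {}      # key path -> entity
        self.global_index = {}  # what global (non-ancestor) queries can see
        self._unindexed = []    # committed keys not yet in the global index

    def put(self, key, entity):
        self.entities[key] = entity
        self._unindexed.append(key)

    def apply_index_updates(self):
        # Stands in for the background work that lets global queries catch up.
        for key in self._unindexed:
            self.global_index[key] = self.entities[key]
        self._unindexed.clear()

    def global_query(self, kind):
        return [e for k, e in self.global_index.items() if k[-1][0] == kind]

    def ancestor_query(self, ancestor, kind):
        # Limited to one entity group, so it sees every committed write.
        return [e for k, e in self.entities.items()
                if k[:len(ancestor)] == ancestor and k[-1][0] == kind]


user = (("User", "alice"),)  # key path of the entity group's root
ds = ToyDatastore()
ds.put(user + (("Note", 1),), {"text": "hello"})

before = ds.global_query("Note")          # index has not caught up yet
scoped = ds.ancestor_query(user, "Note")  # sees the note immediately
ds.apply_index_updates()
after = ds.global_query("Note")           # global query now sees it too
```

Running a write and a global query back to back corresponds to reading `before`; stepping through in a debugger gives the index time to catch up, which matches the behaviour in the question.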
See the docs on queries and transactions.

Tips for optimising Database and POST request performance

I am developing an application which involves multiple user interactivity in real time. It basically involves lots of AJAX POST/GET requests from each user to the server - which in turn translates to database reads and writes. The real time result returned from the server is used to update the client side front end.
I know optimisation is quite a tricky, specialised area, but what advice would you give me to get maximum speed of operation here - speed is of paramount importance, but currently some of these POST requests take 20-30 seconds to return.
One way I have thought about optimising it is to club POST requests and send them out to the server as a group 8-10, instead of firing individual requests. I am not currently using caching in the database side, and don't really have too much knowledge on what it is, and whether it will be beneficial in this case.
Also, do the AJAX POST and GET requests incur the same overhead in terms of speed?
Rather than continuously hitting the database, cache frequently used data items (with an expiry time based upon how infrequently the data changes).
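As a rough sketch of caching with an expiry time, assuming a single process and an in-memory dict (the names `TTLCache`, `get_or_load` and `load_user` are made up for the example; a shared cache such as memcache would replace the dict in a real deployment):

```python
import time


class TTLCache:
    """Caches loaded values and reloads them once they are older than ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so that expiry can be tested without sleeping
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no database hit
        value = loader(key)  # missing or expired: go to the database
        self._entries[key] = (now + self.ttl, value)
        return value


# Hypothetical usage: load_user stands in for a database read.
db_hits = []


def load_user(key):
    db_hits.append(key)
    return {"id": key}


cache = TTLCache(ttl_seconds=30)
cache.get_or_load("u1", load_user)
cache.get_or_load("u1", load_user)  # served from the cache, no second hit
```

The TTL is the knob mentioned above: data that changes rarely can be given a long expiry, data that changes often a short one.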
Can you reduce your communication with the server by caching some data client side?
The purpose of GET is as its name implies - to GET information. It is intended to be used when you are reading information to display on the page. Browsers will cache the result from a GET request and if the same GET request is made again then they will display the cached result rather than rerunning the entire request. This is not a flaw in the browser processing but is deliberately designed to work that way so as to make GET calls more efficient when the calls are used for their intended purpose. A GET call is retrieving data to display in the page and data is not expected to be changed on the server by such a call and so re-requesting the same data should be expected to obtain the same result.
The POST method is intended to be used where you are updating information on the server. Such a call is expected to make changes to the data stored on the server and the results returned from two identical POST calls may very well be completely different from one another since the initial values before the second POST call will be different from the initial values before the first call because the first call will have updated at least some of those values. A POST call will therefore always obtain the response from the server rather than keeping a cached copy of the prior response.
Ref.
The optimization tricks you'd use are generally the same tricks you'd use for a normal website, just with a faster turn around time. Some things you can look into doing are:
Prefetch GET requests that have high odds of being loaded by the user
Use a caching layer in between as Mitch Wheat suggests. Depending on your technology platform, you can look into memcache, it's quite common and there are libraries for just about everything
Look at denormalizing data that is going to be queried at a very high frequency. Assuming that reads are more common than writes, you should get a decent performance boost if you move the workload to the write portion of the data access (as opposed to adding database load via joins)
Use delayed inserts to give priority to writes and let the database server optimize the batching
Make sure you have intelligent indexes on the table and figure out what benefit they're providing. If you're rebuilding the indexes very frequently due to a high write:read ratio, you may want to scale back the queries
Look at retrieving data in more general queries and filtering the data when it makes it to the business layer of the application. MySQL (for instance) uses a very specific query cache that matches against a specific query. It might make sense to pull all results for a given set, even if you're only going to be displaying x%.
For writes, look at running asynchronous queries to the database if it's possible within your system. Data synchronization doesn't have to be instantaneous, it just needs to appear that way (most of the time)
Cache common pages on disk/memory in a fully formatted state so that the server doesn't have to do much processing of them
All in all, there are lots of things you can do (and they generally come down to general development practices on a more bite sized scale).
The common tuning tricks would be:
- use more indexing
- use less indexing
- use more or less caching on filesystem, database, application, or content
- provide more bandwidth or more cpu power or more memory on any of your components
- minimize the overhead in any kind of communication
Of course an alternative would be to:
0 develop a set of tests, preferably automated, that can determine whether your application works correctly.
1 measure the 'speed' of your application.
2 determine how fast it has to become
3 identify the source of the performance problems:
typical problems are: network throughput, file i/o, latency, locking issues, insufficient memory, cpu
4 fix the problem
5 make sure it is actually faster
6 make sure it is still working correctly (hence the tests above)
7 return to 1
Have you tried profiling your app?
Not sure what framework you're using (if any), but frankly from your questions I doubt you have the technical skill yet to just eyeball this and figure out where things are slowing down.
Bluntly put, you should not be messing around with complicated ways to try to solve your problem, because you don't really understand what the problem is. You're more likely to make it worse than better by doing so.
What I would recommend you do is time every step. Most likely you'll find that either
you've got one or two really long running bits or
you're running a shitton of queries because of an n+1 error or the like
When you find what's going wrong, fix it. If you don't know how, post again. ;-)
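A minimal way to time every step, sketched in Python since the framework is unknown. The `timed` helper, the `timings` dict and the step names are all made up for the example, and the `sleep` call only simulates a slow query.

```python
import time
from contextlib import contextmanager

timings = {}  # step name -> total seconds spent in that step


@contextmanager
def timed(step):
    """Adds the wall-clock time spent inside the with-block to timings[step]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[step] = timings.get(step, 0.0) + elapsed


def handle_request():
    with timed("load"):
        time.sleep(0.05)  # simulated slow database call
    with timed("render"):
        sum(range(1000))  # simulated cheap step


handle_request()
slowest = max(timings, key=timings.get)  # the step to look at first
```

Because the times accumulate per step name, wrapping a query that runs inside a loop also exposes the n+1 case: the step's total grows with the number of iterations.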
