So I have been reading a lot of documentation on HRD and NDB lately, yet I still have some doubts regarding how NDB caches things.
Example case:
Imagine a case where a users writes data and the app needs to fetch it immediately after the write. E.g. A user creates a "Group" (similar to a Facebook/Linkedin group) and is redirected to the group immediately after creating it. (For now, I'm creating a group without assigning it an ancestor)
Result:
When testing this sort of functionality locally (having enabled high replication), the immediate fetch of the newly created group fails. A NoneType is returned.
Question:
Having gone through the High Replication docs and Google IO videos, I understand that there is a higher write latency, however, shouldn't NDB caching take care of this? I.e. A write is cached, and then asynchronously actually written on disk, therefore, an immediate read would be reading from cache and thus there should be no problem. Do I need to enforce some other settings?
Pretty sure you are running into the HRD feature where queries are "eventually consistent". NDB's caching has nothing to do with this behavior.
I suspect it might be because of the redirect that the NoneType is returned.
https://developers.google.com/appengine/docs/python/ndb/cache#incontext
The in-context cache persists only for the duration of a single incoming HTTP request and is "visible" only to the code that handles that request. It's fast; this cache lives in memory. When an NDB function writes to the Datastore, it also writes to the in-context cache. When an NDB function reads an entity, it checks the in-context cache first. If the entity is found there, no Datastore interaction takes place.
Queries do not look up values in any cache. However, query results are written back to the in-context cache if the cache policy says so (but never to Memcache).
So you are writing the value to the cache, redirecting it and the read then fails because the HTTP request on the redirect is a different one and so the cache is different.
I'm reaching the limit of my knowledge here but I'd suggest initially that you try the create in a transaction and redirect when complete/success.
https://developers.google.com/appengine/docs/python/ndb/transactions
Also when you put the group model into the datastore you'll get a key back. Can you pass that key (via urlsafe for example) to the redirect and then you'll be guaranteed to retrieve the data as you have it's explicit key? Can't have it's key if it's not in the datastore after all.
Also I'd suggest trying it as is on the production server, sometimes behaviours can be very different locally and on production.
Related
I have been trying to investigate what behaviours that may come from getting a DeadlineExceededException when using Objectify with caching turned on.
My experiments so far have been like this: 1) Store an entity or entities, 2) then sleep for most of the remaining execution time and 3) make some updates in an infite loop till the request is aborted. 4) Whether the cache is in sync with successful writes to the datastore is checked in a separate request.
"Some updates" means changing a lot (50) of strings in an object, then writing it back. I have also tried making updates to several objects in a transaction to test if I can get some inconsistent results when loading the entities again. So far, after thousands of tests, I have not got a single inconsistent entity from the cache.
So can I somehow provoke a load of a presumably cached entity to be inconsistent with the entity in the datastore?
There are a lot of possible reasons for this. If you're making changes in a single request, you're probably seeing the session cache in operation:
https://github.com/objectify/objectify/wiki/Caching
If you're making queries in many requests, you may be seeing the results of eventual consistency:
https://cloud.google.com/datastore/docs/articles/balancing-strong-and-eventual-consistency-with-google-cloud-datastore/
Perhaps you are seeing session cache contamination because you don't have the ObjectifyFilter installed? Recent versions of Objectify give you a nasty warning if you don't, but maybe you're running an old version?
Is it guarantied that if I put entity inside ndb transaction and then read it (in the same ndb txn), I'll get recently written entity?
Here https://cloud.google.com/appengine/docs/java/datastore/transactions#Java_Isolation_and_consistency is mentioned:
Unlike with most databases, queries and gets inside a Datastore
transaction do not see the results of previous writes inside that
transaction. Specifically, if an entity is modified or deleted within
a transaction, a query or get returns the original version of the
entity as of the beginning of the transaction, or nothing if the
entity did not exist then.
So bare Google Datastore (without ndb) wouldn't return recently written entity.
But ndb caches entities (https://cloud.google.com/appengine/docs/python/ndb/transactions):
Transaction behavior and NDB's caching behavior can combine to confuse
you if you don't know what's going on. If you modify an entity inside
a transaction but have not yet committed the transaction, then NDB's
context cache has the modified value but the underlying datastore
still has the unmodified value.
To be a bit more clear I wrote a piece of code which works fine:
#ndb.transactional()
def test():
new_entity = MyModel()
key = new_entity.put()
assert key.get() is not None
test()
And my question is how is reliable ndb caching inside ndb transaction?
Your test will fail if you turn off caching:
class MyModel(ndb.Model):
_use_cache = False
But it's work with context caching enabled.
Just like it's explained in the docs.
It does not makes any sense to write and get the same entity inside of transaction anyway.
NDB Caching article (https://cloud.google.com/appengine/docs/python/ndb/cache#incontext) says:
The in-context cache persists only for the duration of a single
thread. This means that each incoming HTTP request is given a new
in-context cache and and is "visible" only to the code that handles
that request. If your application spawns any additional threads while
handling a request, those threads will also have a new, separate
in-context cache.
The in-context cache is fast; this cache lives in memory. When an NDB
function writes to the Datastore, it also writes to the in-context
cache. When an NDB function reads an entity, it checks the in-context
cache first. If the entity is found there, no Datastore interaction
takes place.
So it is guaranteed that in-context cache will be alive during ndb transaction (if you don't turn off caching deliberately).
I am writing a small appengine application and I want to start using Datastore.
My app has some users and each user is a complicated JAVA class.
Users might swap some "point" objects between them, so I need the data to be available and fast.
My question is quite generic:
How should I handle the data caching?
Storing the data entirely in the Datastore and fetching it in every call sounds slow in runtime.
On the other hand, holding the data in a static JAVA class sounds tricky, because every now and then the server resets and data is erased.
If I had a main loop, like in a regular console application I would've probably saved the data twice-three times a day on predefined hours of the day.
How should I manage my code in such a way that it would save the status in the datastore every now and then and this way will not loose any data.
if the number of write operations are less than read operations,
it is good to try "Write-through caching" with memcache module.
https://developers.google.com/appengine/docs/java/memcache/
basically the write-through caching works like this:
for every write operations, program update the datastore, then backup the data to memcache
for every read operations, program will try to read from memcache. if it can found required data, then it returns, otherwise it will fetch the data from datastore and backup the results to memcache.
As #lucemia noted, you should use memcache.
If you use objectify, you can set-up caching declaratively, without writing and cache handling code.
I am having the following problem. I am now using the low-level
google datastore API rather than JDO, that way I should be in a
better position to see exactly what is happening in my code. I am
writing an entity to the datastore and shortly thereafter reading it
from the datastore using Jetty and eclipse. Sometimes the written
entity is not being read. This would be a real problem if it were to
happen in production code. I am using the 2.0 RC2 API.
I have tried this several times, sometimes the entity is retrieved
from the datastore and sometimes it is not. I am doing a simple
query on the datastore just after committing a write transaction.
(If I run the code through the debugger things run slow enough
that the entity has a chance of being read back on the second pass).
Any help with this issue would be greatly appreciated,
Regards,
The development server has the same consistency guarantees as the High Replication datastore on the live server. A "global" query uses an index that is only guaranteed to be eventually consistent with writes. To perform a query with strongly consistent guarantees, the query must be limited to an entity group, using an "ancestor" key.
A typical technique is to group data specific to a single user in a group, so the user can see changes to queries limited to the user's group with strong consistency guarantees. Another technique is to use fancier client logic to update the client's local view as soon as the change is submitted, so the user sees the change in the UI immediately while the update to the global index is in progress.
See the docs on queries and transactions.
First of all I'll explain the question. By persistence, I mean storing data beyond the execution of a single request. It might not be the best question title, so feel free to edit it.
The way I see it, there are three types of persistence in GAE, each one "closer" to the request itself:
The datastore
This is where all data is most likely to be based. It may go into the higher layers of persistence temporarily, but in the end, this is where the data really is. Unfortunately, querying the datastore repeatedly is slow and uses a lot of resources.
Use when...
storing data that should be stored for an indefinite amount of time.
Avoid using when...
getting data that is queried often but rarely updated.
memcache
This is a highly complex caching engine that stores the data in memory and makes sure all users read from/write to the same cache. It's a much faster way to get/set data on a key→value basis than using the datastore. Unfortunately, data can only stay in the memory for so long, and there is no guarantee that it will stay for as long as you tell it to; the data may disappear at any time if memory is needed elsewhere.
Use when...
you need to get data more often than you need to update it. Even when data needs to be updated often, it can have its uses (if a few missed updates are considered okay), by setting up a task queue to persist data from the memcache to the datastore.
Avoid using when...
data needs to be updated often and has to be up-to-date when fetched.
Global variables
This isn't an official method of persisting data, but it works. However, it's the least reliable method, and since it has no data synchronization across servers, persisted data may show up differently for different users (but from what I've found, the server rarely changes for the same user.) Theoretically, this should be the method that has the least overhead in getting/setting values, however, and could have its uses.
Use when...
hell freezes over? I don't know... I haven't enough knowledge about what goes on behind the scenes to actually rely on this method. Discuss!
Avoid using when...
you rely on the data being the same across servers.
Cookies
If the data is user-specific, it can be efficient to store it as a cookie in the user's browser. There are some pitfalls to watch out for though:
Security – the user can meddle with cookies, and malicious people could potentially do the same. To make sure that the contents are unreadable and unchangeable to all, the cookie can be encrypted using the PyCrypto library which is available on GAE.
Performance – since cookies are sent with every request (even images), it can add to the bandwidth being used, and slow down requests. One solution is to use another domain for static content, so the browser won't send the cookie for that content.
When should the different types of persistence be used? How can they be combined to reduce/even out the amount of resources being spent?
Datastore
Use the datastore to hold any long living information. The datastore should be used like you would use a normal database to hold data that will be used in your site/application.
MemCache
Use this to access data a lot quicker than trying to access the datastore. MemCache can return data really quickly and can be used for any data that needs to span multiple calls from users. It is normally data that was originally in the datastore and then moved to the memcache.
def get_data():
data = memcache.get("key")
if data is not None:
return data
else:
data = self.query_for_data() #get data from the datastore
memcache.add("key", data, 60)
return data
The memcache will flush itself when the item is out of date. You set this in the last param of the add shown above.
Global Variables
I wouldn't use these at all since they can't span instances. In GAE a request creates a new instance, well in python it does. If you want to use Global variables I would store the data needed in the memcache.
Your post is a good summary of the 3 major options. You mostly have answered the question already. However, if you are currently building an app and stressing over whether or not you should memcache something, try this:
Write your app using the datastore for everything that needs to outlive more than one request.
Once your app (or some usable subset) is working, run some functional tests or simulations to see where the slow spots (or high quota usage) are.
Find the most slow or inefficient request path, and figure out how to make that faster (either by using memcache, or altering your datastructures so you can do gets instead of queries, or possibly storing something in a global instance variable*)
goto 2 until you're satisfied.
*Things that might be good for a "global" variable would be something that is relatively expensive to create/fetch, that a substantial portion of your requests will use, and that does not need to be consistent across requests/users.
I use global variable to speed up json conversion. Before I convert my data structure to json, I hash it and check if the json if already available. For my app this gives quite a speedup as the pure python implementation is quite slow.
Global variables
To complement AutomatedTester's answer, and also reply his further question about how to share information between GETs without memcache or datastore, below a quick illustration of how to use global variables:
if 'i' not in globals():
i = 0
def main():
global i
i += 1
print 'Status: 200'
print 'Content-type: text/plain\n'
print i
if __name__ == '__main__':
main()
Calling this script multiple times will give you 1, 2, 3... Of course as mentioned earlier by Blixt you should not count on this trick too much ('i' can sometimes switch back to zero) but it can be useful to store user-specific information in a dictionary, session data for instance.