DjangoAppEngine and Eventual Consistency Problems on the High Replication Datastore

I am using djangoappengine and I think I have run into some problems with the way it handles eventual consistency on the High Replication Datastore.
First, entity groups are not even implemented in djangoappengine.
Second, I think that when you do a djangoappengine get, the underlying App Engine system is doing an App Engine query, which is only eventually consistent. Therefore, you cannot even assume consistency when fetching by key.
Assuming those two statements are true (and I think they are), how does one build an app of any complexity using djangoappengine on the High Replication Datastore? Every time you save a value and then try to get it back, there is no guarantee that you will read what you just wrote.

Take a look at djangoappengine/db/compiler.py:get_matching_pk().
If you do a Django model .get() by pk, it translates to a Google App Engine Get().
Otherwise it translates to a query. There's room for improvement here. Submit a fix?
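For illustration, a minimal sketch of how the two paths look from the Django side, assuming a standard djangoappengine setup (the model and field names are hypothetical):

    from django.db import models

    class Article(models.Model):   # hypothetical model for illustration
        title = models.CharField(max_length=100)

    def lookup_examples(pk_value):
        # A .get() filtered only on the primary key is translated by
        # get_matching_pk() into a datastore Get(), which is strongly
        # consistent on the HR datastore.
        by_key = Article.objects.get(pk=pk_value)

        # Any other filter becomes a datastore query, which is only
        # eventually consistent on the HR datastore.
        by_title = list(Article.objects.filter(title='hello'))
        return by_key, by_title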

I don't really know about djangoappengine, but an App Engine query that includes only the key is considered a keys-only query, and you will always get consistent results.

No matter what system you put on top of the App Engine models, it's still true that when you save an entity to the datastore you get a key. When you look up an entity by its key in the HR datastore, you are guaranteed to get the most recent result.
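A minimal sketch of that guarantee using the low-level db API (the Greeting entity is hypothetical):

    from google.appengine.ext import db

    class Greeting(db.Model):          # hypothetical entity
        content = db.StringProperty()

    # put() returns the entity's key; a Get() by that key is strongly
    # consistent on the HR datastore.
    key = Greeting(content='hi').put()
    fresh = db.get(key)                # guaranteed to see the write above

    # A (non-ancestor) query over the same kind is only eventually
    # consistent; it may not yet include the entity just written.
    maybe_stale = Greeting.all().filter('content =', 'hi').fetch(10)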

Related

Does GAE datastore internally use memcache?

As you can see from the attached screenshot, the datastore asks memcache to delete an entry inside a put(). What's that?
At least ndb's datastore caching includes memcache. The pattern you observed could be explained by this section of the docs:
Memcache does not support transactions. Thus, an update meant to be applied to both the Datastore and memcache might be made to only one of the two. To maintain consistency in such cases (possibly at the expense of performance), the updated entity is deleted from memcache and then written to the Datastore.
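A short ndb sketch of where those caches sit and how to opt out, if the behavior is unwanted (the Account entity is hypothetical):

    from google.appengine.ext import ndb

    class Account(ndb.Model):        # hypothetical entity
        balance = ndb.IntegerProperty()

    # By default ndb uses an in-process (context) cache plus memcache.
    # Because memcache has no transactions, put() deletes the memcache
    # entry and then writes the datastore -- the delete in the screenshot.
    Account(balance=100).put()

    # Caching can be disabled per class or per request context:
    Account._use_memcache = False                  # class-level opt-out
    ndb.get_context().set_memcache_policy(False)   # context-level opt-out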

Write limit per entity group in Google Cloud Datastore

I am new to Google Cloud Datastore. I have read that there is a write limit of 1 write per second per entity group. Does that mean that the main "guestbook" tutorial on App Engine cannot scale to thousands of very active users?
Indeed.
The tutorial is just a showcase. The writes-per-second limitation is due to the strong consistency guarantee for entities in the same group (i.e. sharing an ancestor). This limit can be exceeded at the price of trading strong consistency for eventual consistency, meaning all datastore queries will show the same information at some point, but not necessarily immediately. This is due to App Engine's distributed design.
Please have a look at https://cloud.google.com/appengine/articles/scaling/contention for ways to avoid datastore contention issues. Hope it helps.
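A minimal ndb sketch of the trade-off described above (kind and field names are hypothetical):

    from google.appengine.ext import ndb

    class Greeting(ndb.Model):       # hypothetical entity
        content = ndb.StringProperty()

    guestbook = ndb.Key('Guestbook', 'main')

    # One shared ancestor: strongly consistent ancestor queries, but the
    # whole group is limited to roughly 1 write per second.
    Greeting(parent=guestbook, content='hi').put()

    # No ancestor: each greeting is its own entity group, so writes scale
    # out, but queries over them are only eventually consistent.
    Greeting(content='hi').put()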
Yes, I think it does mean that.
It might not be a problem if the greetings are all added to different guestbooks, but quickly adding Greetings to the same Guestbook is definitely not going to scale. In practice, though, you can often achieve considerably more than 1 write per second on a single group; you just shouldn't count on it.
Perhaps you could work around this by using a task queue to add Greetings (see the sketch after this answer), but that might be overkill.
That guestbook tutorial is not a very good example in general. You shouldn't put logic in your JSPs like that example does (you probably shouldn't use JSPs at all). It's also not very practical to use the datastore at such a low level. Just use Objectify.
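A sketch of the task-queue workaround mentioned above, in Python terms, assuming the Python runtime's deferred library (the function and its arguments are hypothetical):

    from google.appengine.ext import deferred

    def add_greeting(guestbook_id, content):
        # ... perform the contended datastore write here ...
        pass

    # Enqueue the write instead of doing it inline; the task queue
    # retries on contention and smooths out bursts of writes against a
    # single entity group.
    deferred.defer(add_greeting, 'main', 'Hello!')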

Objectify: adding the @Cache annotation on existing data and removing @Parent

Two questions about updating my domain diagram:
1) I am new to GAE and have just deployed my first application based on Objectify, only to discover that soon after my first users came in I had gone through the datastore read quota limit.
I had not put much thought into server-side caching until now. I thought Objectify's session cache would do the job for me, but now I realise I need to use the global memcache.
According to Objectify's docs, I have to use Objectify's @Cache annotation on every entity that is accessed by key (and not by query).
However, I am concerned about the side effects this will have on data that I have already stored in the datastore.
2) I also realize now that I am using @Parent too much. There are a couple of entities where using @Parent has no benefit (and it has some drawbacks, due to the datastore limiting write operations on entities belonging to the same root).
If I go ahead and remove the @Parent annotation from the entities of my domain where it is no longer needed, will it have side effects on the already persisted entities?
Thanks!
For Objectify: the global cache is enabled by default, but you must still annotate your entity classes with @Cache.
@Parent is important if you need consistent results and want to avoid eventual consistency. Removing the ancestor will have a side effect on the already stored data, because the key will change. You will need a migration plan.
But most of all, the free quotas are quite reasonable, so if you already run into quota errors with your first users, then I would suggest installing Appstats and actually measuring the real underlying cause, i.e. which action(s) are responsible for the bulk of the operations, and working on those. Much better than a general approach.
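To see why removing the ancestor changes the key, here is the same mechanics sketched in Python ndb terms (the underlying datastore concept is identical for Objectify's Java entities; all names are hypothetical):

    from google.appengine.ext import ndb

    class Order(ndb.Model):      # hypothetical entity
        total = ndb.IntegerProperty()

    customer = ndb.Key('Customer', 42)

    with_parent = Order(parent=customer, id=7)
    print(with_parent.key)       # Key('Customer', 42, 'Order', 7)

    without_parent = Order(id=7)
    print(without_parent.key)    # Key('Order', 7) -- a different key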

Complex Google App Engine Search

A couple quick questions related to GAE search and datastore:
(1) Why is it that I can apply inequality filters to more than one property using the search service, but to at most one property when querying the datastore? It seems odd that this limitation would exist in one service but not the other.
(2) I intend to use google app engine search to query very many objects (thousands or hundreds of thousands, maybe more). I plan to be doing many inequalities, for example: "time created" before x, "price" greater than y, "rating" less than z, "latitude" between a and b, "longitude" between c and d etc. This seems like a lot of filters and potentially expensive. Is App Engine Search an appropriate solution for this?
Thanks so much.
1) The SearchService basically gives you an API to perform the sorts of things you can't do with the datastore. If you could do them with the datastore, you wouldn't really need the SearchService. While that's not a very satisfying answer, many of the operations that are common in a traditional RDBMS were not really even possible before the Search API was available.
2) is a bit harder. Currently the Search API doesn't handle failure conditions very well; usually you'll get a SearchServiceException without a meaningful message. The team seems to have been improving this over the last year or so, although fixes in this space have been coming very slowly.
From the tickets I've raised, failures are usually a result of queries running too long, which in practice means queries that are too complex. You can actually tune queries quite a lot through combinations of the query string and the parameters you apply to your search request. The downside is that it's all a total black box; I haven't seen any guides or tools on optimising queries. When they fail, they just fail.
The App Engine Search API is designed to solve the problems you describe; whether it does in your case may be hard to determine. You could set up some sample queries and deploy to a test environment to see if it even basically works for your typical set of data. I would expect it to work fine for the example you gave; I have successfully run similar searches in large-scale production environments.
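As a starting point, a sketch of a multi-inequality query with the Python Search API (the index and field names are hypothetical):

    from google.appengine.api import search

    index = search.Index(name='listings')   # hypothetical index

    # Several inequality filters in one query string -- something a
    # single datastore query cannot express:
    query = search.Query(
        query_string='price > 10 AND rating < 3 AND '
                     'latitude > 40.0 AND latitude < 41.0')

    for doc in index.search(query):
        print(doc.doc_id)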

Amazon Cloudsearch (or Solr, ElasticSearch) best practice for result contents?

I have read that it is best practice to only return an ID when querying for results, and then populate metadata from the database. Is this true? I am worried about performance.
In my opinion, it is almost always best to store and return the fewest fields possible — preferably just the ID, unless you explicitly need a feature such as highlighting.
Storing a lot of data in your index can have a negative impact on your search performance as your index grows. There is no data that loads faster than no data. Plus, looking up objects by their IDs should be a very cheap operation in your primary data store of choice.
Most importantly, if your application is using an ORM to interact with its data store, then the sheer utility of reusing all your domain modeling consistently throughout your application would be hard to overstate.
Returning values straight from your search engine can be useful. But, short of using the search engine as a primary data store, I would need a very compelling reason to fragment my domain logic by foregoing an ORM.
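A minimal sketch of the ids-only pattern described above (the search and database clients are hypothetical stand-ins, not any specific library's API):

    def search_then_hydrate(search_index, db, text, limit=20):
        # 1. Keys-only search: the index stores nothing but the id.
        ids = search_index.search(text, limit=limit)   # e.g. ['42', '7']

        # 2. One batched primary-store lookup instead of N point reads,
        #    e.g. SELECT ... WHERE id IN (...) or an ORM in_() filter.
        rows = db.fetch_by_ids(ids)

        # 3. Re-apply the search engine's ranking, dropping ids deleted
        #    from the database since they were indexed.
        by_id = {row['id']: row for row in rows}
        return [by_id[i] for i in ids if i in by_id]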
IMO, being able to retrieve the search results and the data within a single call would be a huge boost to performance compared with getting just the IDs and then making a DB call to retrieve the metadata for them.
Also, Solr/ES provide built-in caching, so responses to subsequent queries would be faster; for the DB you may have to add a caching solution or some other option yourself.
This all depends on your specific scenario.
In some cases, what you say might be true. For instance, Etsy does exactly that (or at least used to): their rationale was that they had a very capable MySQL cluster, they knew very well how to manage it, and it was very fast, so Solr returning only the id was enough for them.
But you might be in a totally different scenario, and maybe calling the DB will take longer than storing everything you need in Solr and hitting just Solr.
In my experience, Solr performs badly at retrieving results when you either have highlighting on, or the fields you retrieve are very large and the network serialization/deserialization overhead increases. If that is the case, you might be better off asynchronously retrieving those fields from the DB.
