Does Objectify cache the Query.fetchKeys results in memcache?

Does Objectify cache the Query.fetchKeys results in memcache? - google-app-engine

I have limited seed-data in an entity, for which I want to fetch all keys (unique strings) frequently and whole entity not that frequent.
If I fetch the keys using Query.fetchKeys, does Objectify cache the results in memcache or hit datastore everytime for Query.fetchKeys results?

Query.fetchKeys() is a method from a very old version of Objectify.
But in answer to your question, all 'queries' (that is, anything besides get-by-key) must pass through to the datastore. Only the datastore knows what satisfies a query.

Related

Does GAE datastore internally use memcache?

As you can see from the attached screenshot, the datastore asks memcache to delete an entry inside a put(). What's that?

At least the ndb datastore caches include memcache:
The pattern you observed could be explained in this section:
Memcache does not support transactions. Thus, an update meant to be
applied to both the Datastore and memcache might be made to only one
of the two. To maintain consistency in such cases (possibly at the
expense of performance), the updated entity is deleted from memcache
and then written to the Datastore.

how to keep memcache and datastore in sync

suppose I have million users registered with my app. now there's a new user, and I want to show him who all in his contacts have this app installed. A user can have many contacts, let's say 500. now if I go to get an entity for each contact from datastore then it's very time and money consuming. memcache is a good option, but I've to keep it in sync for that Kind. I can get dedicated memcache for such a large data, but how do I sync it? my logic would be, if it's not there in memcache, assume that that contact is not registered with this app. A backend module with manual scaling can be used to keep both in sync. But I don't know how good this design is. Any help will be appreciated.

This is not how memcache is designed to be used. You should never rely on memcache. Keys can drop at any time. Therefore, in your case, you can never be sure if a contact exists or not.
I don't know what your problem with datastore is? Datastore is designed to read data very fast - take advantage of it.
When new users install your app, create a lookup entity with the phone number as the key. You don't necessarily need any other properties. Something like this:
Entity contactLookup = new Entity("ContactLookup", "somePhoneNumber");
datastore.put(contactLookup);
That will keep a log of who's got the app installed.
Then, to check which of your users contacts are already using your app, you can create an array of keys out of the phone numbers from the users address book (with their permission of course!), and perform a batch get. Something like this:
Set<Key> keys = new HashSet<Key>();
for (String phoneNumber : phoneNumbers)
keys.add(KeyFactory.createKey("ContactLookup", phoneNumber));
Map<Key, Entity> entities = datastore.get(keys);
Now, entities will be those contacts that have your app installed.
You may need to batch the keys to reduce load. The python api does this for you, but not sure about the java apis. But even if your users has 500 contacts, it's only 5 queries (assuming batches of 100).
Side note: you may want to consider hashing phone numbers for storage.

Memcache is a good option to reduce costs and improve performance, but you should not assume that it is always available. Even a dedicated Memcache may fail or an individual record can be evicted. Besides, all this synchronization logic will be very complicated and error-prone.
You can use Memcache to indicate if a contact is registered with the app, in which case you do not have to check the datastore for that contact. But I would recommend checking all contacts not found in Memcache in the Datastore.
Verifying if a record is present in a datastore is fast and inexpensive. You can use .get(java.lang.Iterable<Key> keys) method to retrieve the entire list with a single datastore call.
You can further improve performance by creating an entity with no properties for registered users. This way there will be no overhead in retrieving these entities.

Since you don't use python and therefore don't have access to NDB, the suggestion would be to, when you add a user, add him to memcache and create an async query (or a task queue job) to push the same data to your datastore. Like that memcache gets pushed first, and then eventually the datastore follows. They'll always be in sync.
Then all you need to do is to first query your memcache when you do "gets" (because memcache is always in sync since you push there first), and if memcache returns empty (being volatile and whatnot), then query the actual datastore to "re fill" memcache

Objectify: adding the #Cache annotation on existing data and removing #Parent

Two questions about updating my domain diagram:
1) I am new to GAE and have just deployed my first application based on Objectify. Just to discover than soon after my first users came in I had soon gone through the datastore read quota limit:
I had not until now put too much thought on server side caching. I thought Objectify's session cache would do the job for me. But now I realise I need use the global memcache.
According to Objectify's doc, I have to use Objectify's #Cache annotation on every entity that is accessed by key (and not by query).
However I am concerned about the side effects this will have on data that I have already stored in datastore.
2) I also realize now that I am using #Parent too much. There are a couple entities were using #Parent has no benefit (and it has some drawbacks due the datastore limiting write operations on entities belonging to the same root).
If I go ahead and remove the #Parent annotation from the entities of my domain where it no longer is needed, will it have side effects on the already persited entities?
Thanks!

For objectify : the global cache is enabled by default, however you
must still annotate your entity classes with #Cache.
#Parent is
important if you need consistent result, and avoid eventual
consistency. Removing the ancestor will have a side effect on the already stored data as the key will change. You will need a migration plan.
But most of all, the free quota are quite reasonable, so if you already run into quota errors with your first user, then I would suggest installing appstats and actually measure what is the real underlying cause i.e. what action(s) are responsible for the bulk of the operations and work on those. Much better than a general approach.

Regularly updated data and the Search API

I have an application which requires very flexible searching functionality. As part of this, users will need have the ability to do full-text searching of a number of text fields but also filter by a number of numeric fields which record data which is updated on a regular basis (at times more than once or twice a minute). This data is stored in an NDB datastore.
I am currently using the Search API to create document objects and indexes to search the text-data and I am aware that I can also add numeric values to these documents for indexing. However, with the dynamic nature of these numeric fields I would be constantly updating (deleting and recreating) the documents for the search API index. Even if I allowed the search API to use the older data for a period it would still need to be updated a few times a day. To me, this doesn't seem like an efficient way to store this data for searching, particularly given the number of search queries will be considerably less than the number of updates to the data.
Is there an effective way I can deal with this dynamic data that is more efficient than having to be constantly revising the search documents?
My only thoughts on the idea is to implement a two-step process where the results of a full-text search are then either used in a query against the NDB datastore or manually filtered using Python. Neither seems ideal, but I'm out of ideas. Thanks in advance for any assistance.

It is true that the Search API's documents can include numeric data, and can easily be updated, but as you say, if you're doing a lot of updates, it could be non-optimal to be modifying the documents so frequently.
One design you might consider would store the numeric data in Datastore entities, but make heavy use of a cache as well-- either memcache or a backend in-memory cache. Cross-reference the docs and their associated entities (that is, design the entities to include a field with the associated doc id, and the docs to include a field with the associated entity key). If your application domain is such that the doc id and the datastore entity key name can be the same string, then this is even more straightforward.
Then, in the cache, index the numeric field information by doc id. This would let you efficiently fetch the associated numeric information for the docs retrieved by your queries. You'd of course need to manage the cache on updates to the datastore entities.
This could work well as long as the size of your cache does not need to be prohibitively large.
If your doc id and associated entity key name can be the same string, then I think you may be able to leverage ndb's caching support to do much of this.

DjangoAppEngine and Eventual Consistency Problems on the High Replication Datastore

I am using djangoappengine and I think have run into some problems with the way it handles eventual consistency on the high application datastore.
First, entity groups are not even implemented in djangoappengine.
Second, I think that when you do a djangoappengine get, the underlying app engine system is doing an app engine query, which are only eventually consistent. Therefore, you cannot even assume consistency using keys.
Assuming those two statements are true (and I think they are), how does one build an app of any complexity using djangoappengine on the high replication datastore? Every time you save a value and then try to get the same value, there is no guarantee that it will be the same.

Take a look in djangoappengine/db/compiler.py:get_matching_pk()
If you do a djangomodel.get() by the pk, it'll translate to a Google App Engine Get().
Otherwise it'll translate to a query. There's room for improvement here. Submit a fix?

Don't really know about djangoappengine but an appengine query if it includes only key is considered a key only query and you will always get consistent results.

No matter what the system you put on top of the AppEngine models, it's still true that when you save it to the datastore you get a key. When you look up an entity via its key in the HR datastore, you are guaranteed to get the most recent results.