Best approach for caching lists of objects in memcache - google-app-engine

Our Google AppEngine Java app involves caching recent users that have requested information from the server.
The current working solution is that we store the users information in a list, which is then cached.
When we need a recent user we simply grab one from this list.
The list of recent users is not vital to our app working, and if it's dropped from the cache it's simply rebuilt as users continue to request from the server.
What I want to know is: Can I do this a better way?
With the current approach there is only a certain number of users we can store before the list gets too large for memcache (we are currently limiting the list to 1000 and dropping the oldest when we insert a new one). Also, the list needs to be updated very frequently, which involves retrieving the full list from memcache just to add a single user.
Having each user stored in the cache separately would be beneficial to us, as we require recent users to expire after 30 minutes. At the moment expiring them is a manual task we perform to make sure the list does not include expired users.
What is the best approach for this scenario? If it's storing the users separately in cache, what's the best approach to keeping track of the user so we can retrieve it?

You could keep just "pointers" in the memcache list, which you can use to build individual memcache keys for user entities stored separately in memcache. This makes the list's memcache footprint a lot smaller and easier to deal with.
If the user entities have parents then the pointers would have to be their keys, which are unique and can therefore be used as memcache keys as well (or their urlsafe versions if needed).
But if the user entities don't have parents (i.e. they're root entities in their entity groups) then you can use their datastore key IDs as pointers - typically shorter than the keys. Better yet, if the IDs are numerical you can even store them as numbers, not as strings. The IDs are unique for those entities, but they might not be unique enough to serve as memcache keys, so you may need to add a prefix/suffix to make the respective memcache keys unique (to your app).
When you need a user entity's data, you first obtain the "pointer" from the list, build the user entity's memcache key, and retrieve the entity with that key.
This, of course, assumes you do have reasons to keep that list in place. If the list itself is not mandatory all you need is just the recipe to obtain the (unique) memcache keys for each of your entities.
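As a rough illustration of that recipe, here is a minimal sketch using the App Engine MemcacheService, assuming a serializable user object; the RecentUserCache class, the "recentUser:" prefix and the pointer-list key are illustrative names, not anything from the original question:

import com.google.appengine.api.memcache.Expiration;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class RecentUserCache {
    private static final String USER_KEY_PREFIX = "recentUser:"; // app-unique prefix for per-user keys
    private static final String POINTER_LIST_KEY = "recentUserIds";
    private static final int EXPIRATION_SECONDS = 30 * 60;       // expire each user after 30 minutes

    private final MemcacheService memcache = MemcacheServiceFactory.getMemcacheService();

    // Cache one user under its own key with a 30-minute expiration,
    // and record only its id in the (much smaller) pointer list.
    public void cacheUser(long userId, Serializable user) {
        memcache.put(USER_KEY_PREFIX + userId, user, Expiration.byDeltaSeconds(EXPIRATION_SECONDS));

        @SuppressWarnings("unchecked")
        List<Long> ids = (List<Long>) memcache.get(POINTER_LIST_KEY);
        if (ids == null) {
            ids = new ArrayList<Long>();
        }
        if (!ids.contains(userId)) {
            ids.add(userId);
            memcache.put(POINTER_LIST_KEY, ids);
        }
    }

    // Resolve a pointer back to the cached user; null means it expired or was evicted.
    public Serializable getUser(long userId) {
        return (Serializable) memcache.get(USER_KEY_PREFIX + userId);
    }
}

If several requests can update the pointer list concurrently, the read-modify-write above is better done with getIdentifiable/putIfUntouched instead of plain get/put.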

If you use NDB caching, it will take care of memcache for you. Simply request the users with the key using ndb.Key(Model, id).get() or Model.get_by_id(id), where id is the User id.

Related

Maximum concurrent operations on couchbase single bucket

I'm building a server for customers where each customer needs access to a database for serving his/her clients.
My thought was to assign each customer a dedicated bucket, but I've just found out that a single Couchbase cluster is recommended to serve at most 10 buckets. Now I don't know whether sharing a single bucket across my customers, using their ID combined with the name of the document collection they create as a namespace in the document type, will hurt performance for all customers due to heavy operations by each customer's clients on a single bucket.
I would also appreciate suggestions for any database platform that can handle this kind of project at scale without the performance of one customer affecting the others.
If you expect the system to be heavily loaded, the activities of one user will affect the activities of another user whether they are sharing a single bucket or operating in separate buckets. There are only so many cycles to go around, so if one user is placing a heavy load on the system, the other users will definitely feel it. If you absolutely want the users completely isolated, you need to set up separate clusters for each of them.
If you are ok with the load from one user affecting the load from another, your plan for having users sharing a bucket by adding user ids to each document sounds workable. Just make sure you are using a separator that can not be part of the user id, so you can unambiguously separate the user id from the document id.
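For instance, a tiny helper along these lines keeps the key construction in one place (the "::" separator and the class name are assumptions, not anything prescribed by Couchbase):

// Illustrative helper for composing namespaced Couchbase document keys.
// Pick a separator that can never appear in a customer id.
public final class DocumentKeys {
    private static final String SEPARATOR = "::";

    public static String keyFor(String customerId, String collectionName, String docId) {
        if (customerId.contains(SEPARATOR)) {
            throw new IllegalArgumentException("customer id must not contain " + SEPARATOR);
        }
        return customerId + SEPARATOR + collectionName + SEPARATOR + docId;
    }

    private DocumentKeys() {}
}

If every read and write goes through a helper like this, documents belonging to different customers can never collide inside the shared bucket.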
Also be aware that while Couchbase supports multiple buckets, it tends to run best with just one. Buckets are distinctly heavyweight structures.

Riak backend choice: bitcask vs leveldb

I'm planning to use Riak as a backend for a service that stores user session data. The main key used to retrieve data (binary blob) is named UUID and actually is a uuid, but sometimes the data might be retrieved using one or two other keys (e.g. user's email).
The natural option would be to pick the leveldb backend, with the possibility of using secondary indexes for such a scenario, but since secondary index searches are not very common (around 10%-20% of lookups), I was wondering if it wouldn't be better to have a separate "indexes" bucket where a mapping such as email->uuid would be stored.
In that scenario, when looking up via a "secondary" index, I would first look up the uuid in the "indexes" bucket, and then read the data normally using the primary key.
Knowing that bitcask is much more predictable when it comes to latency and possibly faster, would you recommend such a design, or shall I stick to leveldb and secondary indexes?
I think both scenarios would work. One way to choose between them is whether you need expiration. I guess you'll want expiration for user sessions. If that's the case, then I would go with the second scenario, as bitcask offers a very good, fully customizable expiration feature.
If you go down that path, you'll have to clean up the metadata bucket (in eleveldb) that you use for secondary indexes. That can be done easily by also keeping an index of the last modification time of the metadata keys. Then you run a batch job that does a 2i query to fetch old metadata entries and deletes them. Make sure you use the latest Riak, which supports aggressive deletion and reclaiming of disk space in leveldb.
That said, maybe you can have everything in bitcask, and avoid secondary indexes altogether. Consider this data design:
one "data" bucket: keys are uuid, value is the session
one "mapping_email" bucket: keys are email, values are uuid
one "mapping_otherstuff" bucket: same for other properties
This works fine if:
most of the time you let your data expire, which means you have no bookkeeping to do
you don't have too many mappings, as it's cumbersome to add more
you are ready to properly implement a client library that manages the 3 buckets, for instance when creating / updating / deleting values (see the sketch below)
You could start with that, because it's easier on the administration, bookkeeping, batch-creation (none), and performance (secondary index queries can be expensive).
Then later on if you need, you can add the leveldb route. Make sure you use multi_backend from the start.
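To make the "client library managing the 3 buckets" idea concrete, here is a rough sketch; the KeyValueStore interface is hypothetical and just stands in for whichever Riak client you use:

// Hypothetical thin abstraction over the Riak client; the bucket names
// mirror the design above ("data", "mapping_email", ...).
public interface KeyValueStore {
    void put(String bucket, String key, byte[] value);
    byte[] get(String bucket, String key);
    void delete(String bucket, String key);
}

public class SessionStore {
    private static final String DATA_BUCKET = "data";
    private static final String EMAIL_BUCKET = "mapping_email";

    private final KeyValueStore store;

    public SessionStore(KeyValueStore store) {
        this.store = store;
    }

    // Write the session under its uuid and register the email -> uuid mapping.
    public void saveSession(String uuid, String email, byte[] sessionBlob) {
        store.put(DATA_BUCKET, uuid, sessionBlob);
        store.put(EMAIL_BUCKET, email, uuid.getBytes(java.nio.charset.StandardCharsets.UTF_8));
    }

    // Primary lookup by uuid.
    public byte[] getByUuid(String uuid) {
        return store.get(DATA_BUCKET, uuid);
    }

    // "Secondary" lookup: resolve the uuid via the mapping bucket, then read the data.
    public byte[] getByEmail(String email) {
        byte[] uuidBytes = store.get(EMAIL_BUCKET, email);
        if (uuidBytes == null) {
            return null; // mapping expired or never existed
        }
        return getByUuid(new String(uuidBytes, java.nio.charset.StandardCharsets.UTF_8));
    }
}

With bitcask expiration enabled on all the buckets, both the sessions and the mappings age out on their own, which is what removes the bookkeeping.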

how to keep memcache and datastore in sync

Suppose I have a million users registered with my app. Now there's a new user, and I want to show him which of his contacts have the app installed. A user can have many contacts, let's say 500. If I fetch an entity for each contact from the datastore, it's very time and money consuming. Memcache is a good option, but I have to keep it in sync with the datastore for that kind. I can get a dedicated memcache for such a large amount of data, but how do I sync it? My logic would be: if a contact is not in memcache, assume that contact is not registered with the app. A backend module with manual scaling could be used to keep the two in sync. But I don't know how good this design is. Any help will be appreciated.
This is not how memcache is designed to be used. You should never rely on memcache. Keys can drop at any time. Therefore, in your case, you can never be sure if a contact exists or not.
I'm not sure what your problem with the datastore is. Datastore is designed to read data very fast - take advantage of it.
When new users install your app, create a lookup entity with the phone number as the key. You don't necessarily need any other properties. Something like this:
// Datastore service handle (typically obtained once per request)
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
Entity contactLookup = new Entity("ContactLookup", "somePhoneNumber");
datastore.put(contactLookup);
That will keep a log of who's got the app installed.
Then, to check which of your users contacts are already using your app, you can create an array of keys out of the phone numbers from the users address book (with their permission of course!), and perform a batch get. Something like this:
// Build a key per phone number in the user's address book
Set<Key> keys = new HashSet<Key>();
for (String phoneNumber : phoneNumbers) {
    keys.add(KeyFactory.createKey("ContactLookup", phoneNumber));
}
// Batch get: only phone numbers with a ContactLookup entity come back
Map<Key, Entity> entities = datastore.get(keys);
Now, entities will be those contacts that have your app installed.
You may need to batch the keys to reduce load. The Python API does this for you, but I'm not sure about the Java APIs. Even if your user has 500 contacts, it's only 5 queries (assuming batches of 100).
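If batching does become necessary with the Java low-level API, a rough sketch (reusing the keys and datastore variables from above, with an assumed batch size of 100) could look like this:

// Split the keys into chunks of at most 100 and issue one batch get per chunk.
List<Key> allKeys = new ArrayList<Key>(keys);
Map<Key, Entity> results = new HashMap<Key, Entity>();
for (int i = 0; i < allKeys.size(); i += 100) {
    // subList gives a view of up to 100 keys; datastore.get accepts any Iterable<Key>
    results.putAll(datastore.get(allKeys.subList(i, Math.min(i + 100, allKeys.size()))));
}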
Side note: you may want to consider hashing phone numbers for storage.
Memcache is a good option to reduce costs and improve performance, but you should not assume that it is always available. Even a dedicated Memcache may fail or an individual record can be evicted. Besides, all this synchronization logic will be very complicated and error-prone.
You can use Memcache to indicate if a contact is registered with the app, in which case you do not have to check the datastore for that contact. But I would recommend checking all contacts not found in Memcache in the Datastore.
Verifying if a record is present in a datastore is fast and inexpensive. You can use .get(java.lang.Iterable<Key> keys) method to retrieve the entire list with a single datastore call.
You can further improve performance by creating an entity with no properties for registered users. This way there will be no overhead in retrieving these entities.
Since you don't use Python and therefore don't have access to NDB, the suggestion would be: when you add a user, add him to memcache and issue an async write (or a task queue job) to push the same data to the datastore. That way memcache is updated first, and the datastore eventually follows, so the two stay in sync.
Then all you need to do on reads is query memcache first (since you always push there first), and if memcache comes back empty (it is volatile, after all), query the actual datastore and refill memcache.
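A minimal sketch of that cache-first flow, assuming the phone-number lookup entities from the earlier answer (the ContactRegistry class and key scheme are illustrative):

import com.google.appengine.api.datastore.AsyncDatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

public class ContactRegistry {
    private static final String KIND = "ContactLookup";

    private final MemcacheService memcache = MemcacheServiceFactory.getMemcacheService();
    private final AsyncDatastoreService asyncDatastore = DatastoreServiceFactory.getAsyncDatastoreService();

    // Write to memcache first, then let the datastore catch up asynchronously.
    public void register(String phoneNumber) {
        memcache.put(KIND + ":" + phoneNumber, Boolean.TRUE);
        asyncDatastore.put(new Entity(KIND, phoneNumber)); // returns a Future; fire-and-forget here
    }

    // Read-through: trust memcache if the key is present, otherwise fall back
    // to the datastore and refill the cache.
    public boolean isRegistered(String phoneNumber) {
        if (memcache.contains(KIND + ":" + phoneNumber)) {
            return true;
        }
        Key key = KeyFactory.createKey(KIND, phoneNumber);
        try {
            DatastoreServiceFactory.getDatastoreService().get(key);
            memcache.put(KIND + ":" + phoneNumber, Boolean.TRUE);
            return true;
        } catch (EntityNotFoundException e) {
            return false;
        }
    }
}

Note that a fire-and-forget async put can still lose the datastore write if the instance dies, which is why the other answers recommend treating the datastore, not memcache, as the source of truth.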

Regularly updated data and the Search API

I have an application which requires very flexible searching functionality. As part of this, users will need have the ability to do full-text searching of a number of text fields but also filter by a number of numeric fields which record data which is updated on a regular basis (at times more than once or twice a minute). This data is stored in an NDB datastore.
I am currently using the Search API to create document objects and indexes to search the text-data and I am aware that I can also add numeric values to these documents for indexing. However, with the dynamic nature of these numeric fields I would be constantly updating (deleting and recreating) the documents for the search API index. Even if I allowed the search API to use the older data for a period it would still need to be updated a few times a day. To me, this doesn't seem like an efficient way to store this data for searching, particularly given the number of search queries will be considerably less than the number of updates to the data.
Is there an effective way I can deal with this dynamic data that is more efficient than having to be constantly revising the search documents?
My only thought is to implement a two-step process where the results of a full-text search are then either used in a query against the NDB datastore or manually filtered using Python. Neither seems ideal, but I'm out of ideas. Thanks in advance for any assistance.
It is true that the Search API's documents can include numeric data, and can easily be updated, but as you say, if you're doing a lot of updates, it could be non-optimal to be modifying the documents so frequently.
One design you might consider would store the numeric data in Datastore entities, but make heavy use of a cache as well-- either memcache or a backend in-memory cache. Cross-reference the docs and their associated entities (that is, design the entities to include a field with the associated doc id, and the docs to include a field with the associated entity key). If your application domain is such that the doc id and the datastore entity key name can be the same string, then this is even more straightforward.
Then, in the cache, index the numeric field information by doc id. This would let you efficiently fetch the associated numeric information for the docs retrieved by your queries. You'd of course need to manage the cache on updates to the datastore entities.
This could work well as long as the size of your cache does not need to be prohibitively large.
If your doc id and associated entity key name can be the same string, then I think you may be able to leverage ndb's caching support to do much of this.

What are best practices for handling ids in web services?

We have two separate systems communicating via a web service. Call them front-end and back-end. A lot of the processing involves updating lists in the back-end. For example, the front-end needs to update a specific person. Currently, we are designing the back-end where we are making the decision on what the interface should be. We will need the actual database ids to update the underlying database, but we also see where propagating database ids to our consumers could be a bad idea.
What are some alternatives to forcing the clients (i.e. the front-end) to send ids back to the web service to update a particular entity? The other reason we are trying to avoid ids is that the front-end often saves these changes to be sent at a later date. This would require the front-ends to save our ids in their system, which also seems like a bad idea.
We have considered the following:
1) Send database ids back to front-end; they would have to send these back to process the change
2) Send hashed ids (based off of database ids) back to the front-end; they would have to send these back to process the change.
3) Do not force the clients to send ids at all but have them send the original entity and new entity and "match" to our entity in the database. Their original entity would have to match our saved entity. We would also have to define what constitutes a match between our entity and their new entity.
The only reasonable way for the front-end would be to somehow identify persons in the DB.
Matching the full entity is unreliable and not obvious; to return a hashed ID to the front-end you either need to receive the non-hashed ID from the front-end first, or perform some reversible "hashing" (more like "encrypting") of the IDs, so in any case there would be some person identifier.
IMHO it does not matter whether it will be a database ID or some piece of data (encrypted database ID) from which the ID could be extracted. Why do you think that consumers knowing the database ID would be a bad idea? I don't see any problem as long as every person belongs to a single consumer.
If there is a many-to-many relation between persons (objects in the DB) and consumers, then you may "encrypt" (in the broad sense) the object id so that the encryption is consumer-dependent. For example, in communication with a consumer you can use the ID of the link entry (between object and consumer) in the DB.
If sending IDs to consumers seems like a bad idea to you because a consumer could enumerate all the IDs one by one, you can avoid this problem by using GUIDs instead of integer auto-incremented IDs.
PS: As for your comment, consider using e.g. a GUID as the object ID. The ID is part of the data, not part of the schema, so it will be preserved when migrating between databases. Such an ID won't contain sensitive information either, so it is perfectly safe to reveal it to a consumer (or anyone else). If you want to prevent the creation of two different persons with the same SSN, just add a UNIQUE key on your SSN field, but do not use the SSN as part of the ID, as such an approach has many serious disadvantages, the inability to reveal the ID being the least of them.
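A minimal sketch of that idea (Person and publicId are illustrative names, not anything from the original question):

import java.util.UUID;

public class Person {
    // Assigned once at creation; independent of any internal auto-increment id,
    // so it survives database migrations and reveals nothing about record counts.
    private final String publicId = UUID.randomUUID().toString();

    // Keep the SSN unique with a UNIQUE constraint, but never use it as the id.
    private String ssn;

    public String getPublicId() {
        return publicId;
    }
}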
From my point of view the id of a record does not convey any sensitive information to anyone.
As a result there is no problem transmitting database ids to the front-end (or in general).
The only concern would be database consistency issues, but I cannot see any.
Additionally, performance-wise it is much better, since you don't need to query the database on attributes to find the database id.
Also, if you send a hash of the id, you cannot extract the id from the hash.
You would have to find an id in the database that matches the hash, and that is not good IMO.
So:
we also see where propagating database ids to our consumers could be a
bad idea.
I don't see it. If you could explain why you think it is a bad idea, maybe there would be a discussion.
