Suppose I have a million users registered with my app. Now there's a new user, and I want to show him which of his contacts already have the app installed. A user can have many contacts, say 500. If I fetch an entity from the datastore for each contact, it's very time- and money-consuming. Memcache is a good option, but I'd have to keep it in sync for that kind. I can get dedicated memcache for such a large dataset, but how do I sync it? My logic would be: if a contact isn't in memcache, assume that contact is not registered with this app. A backend module with manual scaling could be used to keep the two in sync, but I don't know how good this design is. Any help will be appreciated.
This is not how memcache is designed to be used. You should never rely on memcache: keys can be evicted at any time, so in your case you could never be sure whether a contact exists or not.
I'm not sure what your problem with the datastore is. The datastore is designed to read data very fast, so take advantage of it.
When a new user installs your app, create a lookup entity with the phone number as the key. You don't necessarily need any other properties. Something like this:
// Datastore service handle (com.google.appengine.api.datastore):
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
Entity contactLookup = new Entity("ContactLookup", "somePhoneNumber");
datastore.put(contactLookup);
That will keep a log of who's got the app installed.
Then, to check which of your user's contacts are already using your app, you can build a set of keys from the phone numbers in the user's address book (with their permission, of course!) and perform a batch get. Something like this:
Set<Key> keys = new HashSet<Key>();
for (String phoneNumber : phoneNumbers) {
    keys.add(KeyFactory.createKey("ContactLookup", phoneNumber));
}
// Batch get: any key missing from the result map is not registered.
Map<Key, Entity> entities = datastore.get(keys);
Now, entities will be those contacts that have your app installed.
You may need to batch the keys to reduce the load per call. The Python API does this for you, but I'm not sure whether the Java API does. Even if your user has 500 contacts, that's only 5 calls (assuming batches of 100).
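If you do need to batch by hand, here's a minimal sketch, assuming the keys set and datastore handle from above and a batch size of 100:

// Split the key set into batches of 100 and merge the results.
Map<Key, Entity> found = new HashMap<Key, Entity>();
List<Key> keyList = new ArrayList<Key>(keys);
for (int i = 0; i < keyList.size(); i += 100) {
    List<Key> batch = keyList.subList(i, Math.min(i + 100, keyList.size()));
    found.putAll(datastore.get(batch));
}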
Side note: you may want to consider hashing phone numbers for storage.
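For example, a minimal hashing sketch using java.security.MessageDigest (phoneNumber is assumed to be an already-normalized string; getInstance throws NoSuchAlgorithmException):

MessageDigest digest = MessageDigest.getInstance("SHA-256");
byte[] hash = digest.digest(phoneNumber.getBytes(StandardCharsets.UTF_8));
// Hex-encode the digest and use it as the lookup entity's key name.
String hashedKey = new BigInteger(1, hash).toString(16);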
Memcache is a good option to reduce costs and improve performance, but you should not assume it is always available. Even a dedicated Memcache may fail, or an individual record may be evicted. Besides, all this synchronization logic would be very complicated and error-prone.
You can use Memcache to indicate whether a contact is registered with the app, in which case you do not have to check the datastore for that contact. But I would recommend checking the Datastore for all contacts not found in Memcache.
Verifying whether a record is present in the datastore is fast and inexpensive. You can use the .get(java.lang.Iterable<Key> keys) method to retrieve the entire list with a single datastore call.
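A minimal sketch of that memcache-first lookup with a datastore fallback, assuming phone numbers double as memcache keys and as the "ContactLookup" key names from the earlier answer:

MemcacheService cache = MemcacheServiceFactory.getMemcacheService();
Map<String, Object> cached = cache.getAll(phoneNumbers);  // cheap first pass
Set<Key> missing = new HashSet<Key>();
for (String phone : phoneNumbers) {
    if (!cached.containsKey(phone)) {
        missing.add(KeyFactory.createKey("ContactLookup", phone));
    }
}
// Authoritative check for everything memcache didn't know about.
Map<Key, Entity> fromDatastore = datastore.get(missing);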
You can further improve performance by creating an entity with no properties for registered users. This way there will be no overhead in retrieving these entities.
Since you don't use Python and therefore don't have access to NDB, the suggestion would be: when you add a user, add him to memcache and enqueue an async operation (or a task queue job) to push the same data to your datastore. That way memcache is written first, and the datastore eventually follows, keeping the two in sync.
Then all you need to do is query memcache first on your "gets" (memcache is written first, so it is normally current), and if memcache comes back empty (it is volatile, after all), query the actual datastore and re-fill memcache.
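A minimal sketch of that write path, assuming a hypothetical /tasks/persistUser handler that performs the actual datastore put:

MemcacheService cache = MemcacheServiceFactory.getMemcacheService();
cache.put(phoneNumber, Boolean.TRUE);  // memcache is written first
Queue queue = QueueFactory.getDefaultQueue();
queue.add(TaskOptions.Builder.withUrl("/tasks/persistUser")
        .param("phone", phoneNumber));  // datastore follows asynchronously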
I'm new to Firebase, and I was researching it to see if it fits our needs. It has everything we need except an offline database. I know it can cache changes while the user is offline and then sync them once the user is back online, but that is not what I'm talking about.
As Firebase is costly, we want our free users to be able to use the app only offline, with their data never synced to the cloud whether they are online or not, and to enable sync only for subscribed users.
One solution which we have not yet put much thought into is to use an offline DB like SQLite and:
a) when the user subscribes move the data to firebase
b) if the user cancels the subscription move the data to SQLite
but this solution needs coding the same thing in two completely different ways: extra code for migrating from SQLite to Firebase and from Firebase to SQLite. Is there a better solution that uses the Firestore database and still provides complete offline database functionality?
In my opinion, your solution can work. But there are some situations that you should take into consideration.
a) when the user subscribes move the data to Firebase.
As I understand it, it's one or the other. In this case, when a user subscribes, you should lock the local SQLite database for writes until all the data has been written to the Firebase servers. Only when that operation completes should you allow the user to write data in the cloud.
Why is this needed? To always have consistent data.
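A minimal sketch of that migration on Android, assuming a hypothetical notes table mirrored by a Firestore collection of the same name (helper and unlockLocalWrites are hypothetical; note that a WriteBatch is limited to 500 operations):

SQLiteDatabase db = helper.getReadableDatabase();
FirebaseFirestore firestore = FirebaseFirestore.getInstance();
WriteBatch batch = firestore.batch();
Cursor cursor = db.query("notes", null, null, null, null, null, null);
while (cursor.moveToNext()) {
    Map<String, Object> data = new HashMap<>();
    data.put("text", cursor.getString(cursor.getColumnIndexOrThrow("text")));
    batch.set(firestore.collection("notes").document(), data);
}
cursor.close();
// Only lift the local write lock once the data is safely on the server.
batch.commit().addOnSuccessListener(v -> unlockLocalWrites());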
b) if the user cancels the subscription move the data to SQLite.
If the user cancels the subscription, you might use almost the same mechanism as above, but first clear the local database, as it will contain outdated data, and then copy all the data from Firestore into the SQLite database.
Edit:
You can also take into consideration copying the data from the local cache rather than getting it from the Firebase servers. This will imply no costs.
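A minimal sketch of that cache read, again against the hypothetical notes collection; Source.CACHE serves the query from Firestore's local persistence layer instead of the server:

FirebaseFirestore.getInstance().collection("notes")
        .get(Source.CACHE)
        .addOnSuccessListener(snapshot -> {
            for (DocumentSnapshot doc : snapshot.getDocuments()) {
                ContentValues values = new ContentValues();
                values.put("text", doc.getString("text"));
                db.insert("notes", null, values);  // db: writable SQLite handle
            }
        });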
but this solution needs coding the same thing in two completely different ways.
That's correct, but since both the offline and online databases share the same fields, the operation might not be as complicated as you think. Simply attach a real-time listener to a property within the user object, most likely called "subscribed", which holds a value of true/false, and take action accordingly when it changes.
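A minimal sketch of such a listener, assuming a users collection and a boolean subscribed field (startSync and stopSyncAndCopyToSQLite are hypothetical):

DocumentReference userRef = FirebaseFirestore.getInstance()
        .collection("users").document(uid);
userRef.addSnapshotListener((snapshot, error) -> {
    if (error != null || snapshot == null) return;
    if (Boolean.TRUE.equals(snapshot.getBoolean("subscribed"))) {
        startSync();                 // begin pushing local data to Firestore
    } else {
        stopSyncAndCopyToSQLite();   // reverse migration for lapsed users
    }
});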
I'm building a server for customers where each customer needs access to a database for serving his/her clients.
So my thought was to assign each customer a dedicated bucket, only to find out that a single Couchbase cluster serves a recommended maximum of 10 buckets. Now I don't know whether sharing a single bucket across my customers, using their ID combined with the name of the document collection they create as a namespace in the document key, will affect the performance of all customers when one customer's clients put heavy load on the single bucket.
I would also appreciate suggestions for any database platform that can handle this kind of project at scale without the performance of one customer affecting the others.
If you expect the system to be heavily loaded, the activities of one user will affect the activities of another whether they share a single bucket or operate in separate buckets. There are only so many cycles to go around, so if one user places a heavy load on the system, the other users will definitely feel it. If you absolutely want the users completely isolated, you need to set up a separate cluster for each of them.
If you are okay with the load from one user affecting another, your plan of having users share a bucket by adding user IDs to each document sounds workable. Just make sure you use a separator that cannot appear in a user ID, so you can unambiguously separate the user ID from the document ID.
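For instance, a minimal key-construction sketch with "::" as the separator (assuming user IDs are alphanumeric and can never contain "::"):

// e.g. "customer42::invoices::2024-001"
String documentKey = userId + "::" + collectionName + "::" + documentName;
// Splitting is unambiguous because "::" cannot occur inside userId:
String[] parts = documentKey.split("::", 3);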
Also be aware that while Couchbase supports multiple buckets, it tends to run best with just one. Buckets are distinctly heavyweight structures.
Our Google AppEngine Java app involves caching recent users that have requested information from the server.
The current working solution is that we store the users' information in a list, which is then cached.
When we need a recent user we simply grab one from this list.
The list of recent users is not vital to our app working, and if it's dropped from the cache it's simply rebuilt as users continue to request from the server.
What I want to know is: Can I do this a better way?
With the current approach there is only a certain number of users we can store before the list gets too large for memcache (we currently cap the list at 1000 and drop the oldest when we insert a new user). Also, the list needs updating very frequently, which means retrieving the full list from memcache just to add a single user.
Having each user stored in the cache separately would be beneficial, as we need recent users to expire after 30 minutes. At the moment, removing expired users from the list is a manual task we perform ourselves.
What is the best approach for this scenario? If it's storing the users separately in cache, what's the best approach to keeping track of the user so we can retrieve it?
You could keep just "pointers" in the memcached list, which you can use to build individual memcache keys for user entities stored separately in memcache. This makes the list's memcache footprint a lot smaller and easier to deal with.
If the user entities have parents, the pointers would have to be their full keys, which are unique and so can be used as memcache keys as well (their urlsafe versions, if needed).
But if the user entities don't have parents (i.e., they're root entities in their entity groups), you can use their datastore key IDs as pointers, which are typically shorter than full keys. Better yet, if the IDs are numerical you can store them as numbers rather than strings. The IDs are unique for those entities, but they might not be unique enough to serve as memcache keys, so you may need to add a prefix/suffix to make the respective memcache keys unique to your app.
When you need a user entity's data, you first obtain the "pointer" from the list, build the user entity's memcache key, and retrieve the entity with that key.
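A minimal sketch of the per-entry caching and lookup, with a 30-minute TTL on each entry (the "user:" prefix is a hypothetical namespace):

MemcacheService cache = MemcacheServiceFactory.getMemcacheService();
// Cache each recent user under its own key with a 30-minute expiration:
cache.put("user:" + userId, userEntity, Expiration.byDeltaSeconds(30 * 60));
// Later, resolve a pointer from the list back to the entity:
Entity recent = (Entity) cache.get("user:" + userId);
if (recent == null) {
    // Evicted or expired: fall back to the datastore and re-cache.
}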
This, of course, assumes you do have reasons to keep that list in place. If the list itself is not mandatory all you need is just the recipe to obtain the (unique) memcache keys for each of your entities.
If you use NDB caching, it will take care of memcache for you. Simply request the users by key using ndb.Key(Model, id).get() or Model.get_by_id(id), where id is the user's ID.
I am using djangoappengine, and I think I have run into some problems with the way it handles eventual consistency on the high replication datastore.
First, entity groups are not even implemented in djangoappengine.
Second, I think that when you do a djangoappengine get, the underlying App Engine system performs an App Engine query, and queries are only eventually consistent. Therefore, you cannot assume consistency even when fetching by key.
Assuming those two statements are true (and I think they are), how does one build an app of any complexity using djangoappengine on the high replication datastore? Every time you save a value and then try to get the same value, there is no guarantee that it will be the same.
Take a look at djangoappengine/db/compiler.py:get_matching_pk().
If you do a djangomodel.get() by the pk, it'll translate to a Google App Engine Get().
Otherwise it'll translate to a query. There's room for improvement here. Submit a fix?
I don't really know about djangoappengine, but an App Engine query that includes only the key is considered a key-only query, and you will always get consistent results.
No matter what system you put on top of the App Engine models, it's still true that when you save an entity to the datastore you get a key, and when you look up an entity via its key in the HR datastore, you are guaranteed to get the most recent data.
I would like to develop an App Engine application where users can store data. The problem is that the user data can get really large. So I would like to set quotas on individual users, or check how much datastore space they use.
Is there an easy way to do that? Or do I have to count the bytes of the stored strings myself?
The most efficient way is to keep track of how many bytes are stored for each user, and store that count in an entity in the datastore.
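A minimal sketch of such a counter, assuming a hypothetical "QuotaUsage" kind keyed by user ID and updated transactionally:

Transaction txn = datastore.beginTransaction();
try {
    Key key = KeyFactory.createKey("QuotaUsage", userId);
    Entity usage;
    try {
        usage = datastore.get(txn, key);
    } catch (EntityNotFoundException e) {
        usage = new Entity("QuotaUsage", userId);  // first write for this user
        usage.setProperty("bytes", 0L);
    }
    usage.setProperty("bytes", (Long) usage.getProperty("bytes") + newBytes);
    datastore.put(txn, usage);
    txn.commit();
} finally {
    if (txn.isActive()) txn.rollback();
}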
Alternatively, if you are storing the data (or at least a reference to it) in the datastore, you could fetch everything belonging to a user and calculate how much space is used.
So yes, you have to do it yourself. The datastore statistics report size only for the entire datastore and per entity kind, and they're only updated roughly every 24 hours, so even if they could be used, they aren't as useful as tracking it in code.