I would like to develop an App Engine application where users can store data. The problem is that the user data can get really large. So I would like to set quotas for individual users, or at least check how much space in the datastore each of them uses.
Is there an easy way to do that? Or do I have to count the bytes of the stored strings myself?
The most efficient way is to keep track of how many bytes are stored for each user, and store it in an entity in the datastore.
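A minimal sketch of that bookkeeping, assuming string values measured by their UTF-8 length. The class name `UserQuotaTracker` and its methods are hypothetical, and the in-memory map only stands in for a per-user counter entity in the datastore. The real stored size also includes key and property overhead, so treat the count as an approximation.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: tracks bytes stored per user and enforces a quota.
// The map is a stand-in for a per-user counter entity kept in the datastore.
final class UserQuotaTracker {
    private final long quotaBytes;
    private final Map<String, Long> usedBytesByUser = new HashMap<>();

    UserQuotaTracker(long quotaBytes) {
        this.quotaBytes = quotaBytes;
    }

    // Approximate size of a string value: the length of its UTF-8 encoding.
    static long sizeOf(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    // Records the write if it fits within the quota; returns false otherwise.
    synchronized boolean tryStore(String userId, String value) {
        long updated = usedBytes(userId) + sizeOf(value);
        if (updated > quotaBytes) {
            return false;
        }
        usedBytesByUser.put(userId, updated);
        return true;
    }

    // Call when a previously stored value is deleted.
    synchronized void release(String userId, String value) {
        long remaining = Math.max(0L, usedBytes(userId) - sizeOf(value));
        usedBytesByUser.put(userId, remaining);
    }

    synchronized long usedBytes(String userId) {
        return usedBytesByUser.getOrDefault(userId, 0L);
    }
}
```

In a real app the counter update would have to happen in the same transaction as the write it accounts for, otherwise the two can drift apart.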
Alternatively, if you are storing the data (or at least a reference to it) in the datastore, you could fetch all entities belonging to a user and calculate how much space they use.
So yes, you have to do it yourself. The datastore statistics report the total size only for the entire datastore and for each entity kind, not per user. Besides, they are only updated every ~24 hours, so even if they could be used, they would be less useful than tracking it by hand.
I am developing an application where the user can request data based on their current location. For example, I have a big pool of local stores all over the world, and when the user runs a query, only the stores near their location should be searched for a match. It is the same as in Tinder: there are a lot of people in the database, but one user can only see people around their location. How must the database be structured? I guess that just querying the whole database pool, possibly millions of entries, to find people that match your geolocation is bad practice. What architecture do similar applications use? Thanks for any tips.
I would say querying the whole database pool and letting the database take care of optimization is actually good practice. However, you need to be careful in defining your table so that these queries will be efficient.
In particular, you'll want a spatial two-dimensional index on the column. PostGIS is great for this: http://revenant.ca/www/postgis/workshop/indexing.html
Then, when you query, use ST_DWithin.
If you are interested in learning what this actually looks like when laid out in memory or on disk, start with R-tree indexes: https://en.wikipedia.org/wiki/R-tree
My project mainly utilizes two different tools from Google:
Natural Language API to analyze and save entities and syntax of hundreds of docs
Datastore to store each document along with its data retrieved from the Google NL API
I absolutely must save the tokens and entities otherwise I would have to call the Google NL API each and every time I work with a given document.
However, when a document is over a thousand words (which happens extremely often), I cannot save it in Google Datastore.
I receive a 400 error saying the entity is too big. It's around 2 to 5 MB depending on the topic.
I also get "The value of property 'tokens' is longer than 1048487 bytes" when I try to store it as a Blob.
I tried serializing the array and turning it into a Blob, but got the same result.
Any way I could make this work without migrating to MongoDB?
I can suggest three options:
Use compression (tactical). Text data usually compresses very well, so you may start to fit within the 1 MB limit.
Use sharding. Split the data across multiple entities and store/read them together. Join on read as needed.
Migrate to blob storage, for example https://cloud.google.com/storage/
There are other options, but these three are probably the easiest to implement.
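The first two options can be combined. Below is a rough sketch, assuming the tokens have already been serialized to a string; the class `BlobSharder` and its method names are hypothetical, and the actual datastore reads and writes of each chunk are left out. It gzips the text, then splits the compressed bytes into chunks no larger than a chosen limit so that each chunk could go into its own entity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: compress a large text value and shard it into chunks.
final class BlobSharder {
    // Gzip the UTF-8 bytes of the text.
    static byte[] compress(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Reverse of compress().
    static String decompress(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Split into ordered chunks of at most maxChunkBytes each.
    static List<byte[]> split(byte[] data, int maxChunkBytes) {
        if (maxChunkBytes <= 0) {
            throw new IllegalArgumentException("maxChunkBytes must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        for (int start = 0; start < data.length; start += maxChunkBytes) {
            int end = Math.min(data.length, start + maxChunkBytes);
            chunks.add(Arrays.copyOfRange(data, start, end));
        }
        return chunks;
    }

    // Concatenate chunks in order; this is the "join on read" step.
    static byte[] join(List<byte[]> chunks) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            out.write(chunk, 0, chunk.length);
        }
        return out.toByteArray();
    }
}
```

Each chunk would be stored with its index (for example in the key name) so the chunks can be fetched together and joined in the right order.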
I'm developing an App Engine app that lets users keep a diary.
Now, I noticed that I can check all data in datastore through Developers Console.
This is not good for a diary app for privacy.
So I want to know how to make datastore private to prevent me from checking users' data.
Please help me.
This is a little bit tricky since the code can read the data in the datastore and so, by definition, anyone who can update the running code can also read the data in the datastore; however, there are ways that you can at least make it more difficult to inadvertently examine the data (though accessing the data will still be technically possible for you or any of the owners to do). The simplest way is to encrypt the data before storing it within the datastore model objects (and decrypting it when you read the data from the model objects); however, this will make indexed fields no longer work if you do that (you will need to decide whether that content really needs to be indexable or whether it is worthwhile to add manual indexing).
If you want data to not be readable by you at all, then you will need to encrypt/decrypt the data with a key that is only available to your application while the user is interacting with it (e.g. encrypting the data in the client that communicates with your server); however, you need to be aware that this will make any sort of indexing or background processing of the data impossible.
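A minimal sketch of the encrypt-before-store idea, using the standard `javax.crypto` API with AES-GCM. The class `DiaryCrypto` and its methods are hypothetical, and where the key lives (server configuration versus only on the client) is exactly the design decision described above; the sketch does not settle it.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical helper: encrypts diary text before it is put into an entity.
final class DiaryCrypto {
    private static final int IV_BYTES = 12;   // nonce size commonly used with GCM
    private static final int TAG_BITS = 128;  // authentication tag length
    private static final SecureRandom RANDOM = new SecureRandom();

    static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns IV followed by ciphertext; this is the value to store.
    static byte[] encrypt(SecretKey key, String plaintext) {
        try {
            byte[] iv = new byte[IV_BYTES];
            RANDOM.nextBytes(iv); // fresh IV for every message
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] ciphertext = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
            byte[] stored = new byte[iv.length + ciphertext.length];
            System.arraycopy(iv, 0, stored, 0, iv.length);
            System.arraycopy(ciphertext, 0, stored, iv.length, ciphertext.length);
            return stored;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Splits off the IV and decrypts; fails if the key or data is wrong.
    static String decrypt(SecretKey key, byte[] stored) {
        try {
            byte[] iv = Arrays.copyOfRange(stored, 0, IV_BYTES);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] plain = cipher.doFinal(stored, IV_BYTES, stored.length - IV_BYTES);
            return new String(plain, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because every call uses a fresh random IV, the same text encrypts to different bytes each time, which is why such a field can no longer be indexed or queried by value.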
The only way to prevent you from viewing data in the datastore is to remove you from the developers of the app. A developer can always extract data if they want to, either by looking at it directly in the Datastore viewer or by writing code that reads or forwards this data.
Suppose I have a million users registered with my app. Now there's a new user, and I want to show them which of their contacts have this app installed. A user can have many contacts, let's say 500. If I fetch an entity for each contact from the datastore, it is very time and money consuming. Memcache is a good option, but I have to keep it in sync for that Kind. I can get dedicated Memcache for such a large amount of data, but how do I sync it? My logic would be: if it's not in Memcache, assume that the contact is not registered with this app. A backend module with manual scaling can be used to keep both in sync. But I don't know how good this design is. Any help will be appreciated.
This is not how memcache is designed to be used. You should never rely on memcache. Keys can drop at any time. Therefore, in your case, you can never be sure if a contact exists or not.
I don't know what your problem with datastore is? Datastore is designed to read data very fast - take advantage of it.
When new users install your app, create a lookup entity with the phone number as the key. You don't necessarily need any other properties. Something like this:
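// The phone number is the key name; no properties are needed.
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
Entity contactLookup = new Entity("ContactLookup", "somePhoneNumber");
datastore.put(contactLookup);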
That will keep a log of who's got the app installed.
Then, to check which of your user's contacts are already using your app, you can create a set of keys out of the phone numbers from the user's address book (with their permission, of course!) and perform a batch get. Something like this:
// Build one key per phone number, then fetch them all in a single batch get.
Set<Key> keys = new HashSet<Key>();
for (String phoneNumber : phoneNumbers) {
    keys.add(KeyFactory.createKey("ContactLookup", phoneNumber));
}
Map<Key, Entity> entities = datastore.get(keys);
Now, entities will contain only those contacts that have your app installed.
You may need to batch the keys to reduce load. The Python API does this for you, but I'm not sure about the Java APIs. Even if a user has 500 contacts, that is only 5 batch gets (assuming batches of 100).
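If batching does turn out to be necessary on the Java side, the split itself is simple. A small sketch, where `KeyBatches.partition` is a hypothetical helper and the batch size of 100 is only the figure assumed above:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a list (e.g. of keys) into fixed-size batches.
final class KeyBatches {
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(items.size(), start + batchSize);
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }
}
```

Each resulting batch would then be passed to one batch get, and the returned maps merged.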
Side note: you may want to consider hashing phone numbers for storage.
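One way this could look, as a sketch only: normalize the number, then use its SHA-256 hex digest as the key name instead of the raw number. `PhoneNumberHasher` and its normalization rule are assumptions for illustration. Note that phone numbers come from a small space, so a plain unsalted hash can be brute-forced; it hides numbers from casual inspection rather than from a determined attacker.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: derives a stable key name from a phone number.
final class PhoneNumberHasher {
    // Assumed normalization: keep only digits and '+', so formatting
    // differences between address books do not change the hash.
    static String normalize(String raw) {
        return raw.replaceAll("[^0-9+]", "");
    }

    // Lowercase hex SHA-256 of the normalized number.
    static String hash(String rawPhoneNumber) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(
                    normalize(rawPhoneNumber).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(bytes.length * 2);
            for (byte b : bytes) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

The same function has to be applied both when registering a user and when building lookup keys from an address book, otherwise the keys will not match.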
Memcache is a good option to reduce costs and improve performance, but you should not assume that it is always available. Even a dedicated Memcache may fail or an individual record can be evicted. Besides, all this synchronization logic will be very complicated and error-prone.
You can use Memcache to indicate if a contact is registered with the app, in which case you do not have to check the datastore for that contact. But I would recommend checking all contacts not found in Memcache in the Datastore.
Verifying if a record is present in a datastore is fast and inexpensive. You can use .get(java.lang.Iterable<Key> keys) method to retrieve the entire list with a single datastore call.
You can further improve performance by creating an entity with no properties for registered users. This way there will be no overhead in retrieving these entities.
Since you don't use Python and therefore don't have access to NDB, my suggestion would be this: when you add a user, add them to Memcache and start an async write (or a task queue job) to push the same data to your datastore. That way Memcache is written first, and the datastore eventually follows, so the two stay in sync.
Then all you need to do is query Memcache first when you do "gets" (because you always push there first), and if Memcache returns nothing (being volatile and whatnot), query the actual datastore and refill Memcache.
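The read path described here is a read-through cache. A rough sketch with plain Java collections standing in for Memcache and the datastore; the class `RegistrationLookup`, its fields, and its methods are all hypothetical, and the real writes to the datastore would be asynchronous as described above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical read-through lookup: cache first, backing store on a miss.
final class RegistrationLookup {
    private final Map<String, Boolean> cache = new HashMap<>(); // stand-in for Memcache
    private final Set<String> store;                            // stand-in for the datastore
    int storeReads = 0;                                         // counts fallbacks, for illustration

    RegistrationLookup(Set<String> store) {
        this.store = store;
    }

    // Write path: update the cache and the store together.
    void register(String phoneNumber) {
        cache.put(phoneNumber, Boolean.TRUE);
        store.add(phoneNumber);
    }

    // Read path: a cache miss never means "not registered"; it only
    // means the store has to be consulted and the cache refilled.
    boolean isRegistered(String phoneNumber) {
        Boolean cached = cache.get(phoneNumber);
        if (cached != null) {
            return cached;
        }
        storeReads++;
        boolean registered = store.contains(phoneNumber);
        cache.put(phoneNumber, registered);
        return registered;
    }

    // Simulates Memcache dropping a key.
    void evict(String phoneNumber) {
        cache.remove(phoneNumber);
    }
}
```

The important property is that an evicted key costs one extra store read but never produces a wrong answer.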
I'm thinking about writing an application that will have to store a small number of records per user (<300) but will hopefully have a lot of users (>>1000).
I did some research on platforms that allow starting small and scaling when needed, and ended up with App Engine, but I'm not sure if it is the right tool for this, especially the datastore.
How will I get it to scale if I have a User entity and a Message entity and store all users and messages in those entities? I think the number of records in the entities will grow really big, and filtering, e.g. for all messages of a user, will get expensive. Is that a problem, or will Google handle that? Do I have to introduce multitenancy and create a namespace for each user so that I only see the records that relate to that user? Is there a limit on the number of namespaces? What would be the right approach for modeling the data in the datastore?
I do not really have a clue how to handle the App Engine datastore, or whether it's the right tool for me.
The App Engine datastore is explicitly designed to handle this kind of scalability. Queries execute in time proportional to the number of records returned, so fetching all a user's messages will take the same time regardless of how many users there are in the system.
I think with that kind of numbers you are probably OK in terms of scalability. Anywhere from 300,000 to millions of records is easily handled by any serious datastore.
It is not advisable to think about scaling during the infancy of your project. Your first step should always be to build an app or product and launch it. Scaling comes afterwards. Most of the apps and products launched these days never make it to the level where they need to scale. And if you do launch a website, product, or app that gets hit by a large amount of traffic and you need to scale, then rejoice, because you've made it to that level! But how to get to that level should always be the first question.
I'm not trying to demoralize you, but rather trying to help you focus where you should. Thanks for reading, and good luck with your app! May you need to scale. And as Toby said, even the most basic App Engine configuration is good enough to handle a couple of hundred thousand records.