Choosing the right model for storing and querying data? - google-app-engine

I am working on my first GAE project using Java and the datastore, and this is my first try with a NoSQL database. Like a lot of people, I have problems understanding the right model to use. So far I've figured out two models, and I need help choosing the right one.
All the data is represented in two classes, User.class and Word.class.
User: a couple of strings with user data (username, email, ...)
Word: two strings
Which is better:
1. Search among 10,000,000 Word entities for the 100 I need. For instance, every Word entity has a string property owner, and I query (owner = 'John').
2. In User.class I add a property List<Word> and a method getWords() that returns the list of words. So I query among 1,000 users for the one I need and then call getWords(), which returns the List<Word> with the 100 I need.
Which one uses fewer resources? Or am I going the wrong way with this?

The answer is to use Appstats and find out:
To keep your application fast, you need to know: Is your application making unnecessary RPC calls? Should it be caching
data instead of making repeated RPC calls to get the same data? Will
your application perform better if multiple requests are executed in
parallel rather than serially?
Run some tests, try it both ways, and see what Appstats says.
But I'd say that your option 2 is better, simply because you don't need to search millions of entities. But who knows for sure? The trouble is that "resources" are a dozen different things in App Engine: CPU, datastore reads, datastore writes, and so on.
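For the Python runtime, turning Appstats on is just a small appengine_config.py hook; a minimal sketch follows (the question is Java, where the equivalent is the Appstats servlet filter configured in web.xml, but the idea is the same):

# appengine_config.py: wrap every request so Appstats records the RPC calls it makes.
from google.appengine.ext.appstats import recording

def webapp_add_wsgi_middleware(app):
    return recording.appstats_wsgi_middleware(app)

With the appstats builtin also enabled in app.yaml, the recorded timings and RPC counts then show up under /_ah/stats, which is where you compare the two approaches.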

For your User class, set a unique ID for each user (such as a username or email address). For the Word class, set the parent of each Word entity to the corresponding User.
Then, if you want to look up the words of a specific user, you do an ancestor query for all words belonging to that user.
By setting an ID for each user, you can get that user by ID instead of doing an additional query (see the sketch below the links).
More info on ancestor queries:
https://developers.google.com/appengine/docs/java/datastore/queries#Ancestor_Queries
More info on IDs:
https://developers.google.com/appengine/docs/java/datastore/entities#Kinds_and_Identifiers
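For illustration, a minimal ndb sketch of that layout (the question is Java, but the structure is the same; the kind and property names here are made up):

from google.appengine.ext import ndb

class User(ndb.Model):
    # The key id is the username, so a user can be fetched without a query.
    email = ndb.StringProperty()

class Word(ndb.Model):
    text = ndb.StringProperty()
    meaning = ndb.StringProperty()

user_key = ndb.Key(User, 'John')   # get-by-id, no query needed
user = user_key.get()

Word(parent=user_key, text='hello', meaning='greeting').put()

# Ancestor query: only the words belonging to this user are considered.
words = Word.query(ancestor=user_key).fetch(100)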

It really depends on the queries you're using. I assume that you want to find all the words given a certain owner.
Most likely, option 2 would be cheaper, since you only need to fetch the user entity instead of running a query.
Option 2 will be a bit more work on your part, though, since you'll need to manually keep the list synchronized with the instances of Word.
Off the top of my head I can think of 2 problems with #2, which may or may not apply to you:
A. If you want to find all the owners given a certain word, you'll need to keep that list of words indexed. This affects your costs. If you mostly find words by owner, and rarely find owners by words, it'll still make sense to do it this way. However, if your search pattern flips around and you're searching for owners by words a lot, this may be the wrong design. As you see, you need to design the models based on the queries you will be using.
B. Entities are limited to 1MB, and there's a limit on the number of indexed properties per entity (5000, I think?). Those two limits cap the number of words you can store in your list, so make sure you won't need more words per user than that. Method 1 allows you unlimited words per user. (A sketch of the indexing trade-off from point A follows.)
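To make the trade-off concrete, a hedged ndb sketch of option 2 (property and class names are invented for illustration):

from google.appengine.ext import ndb

class User(ndb.Model):
    # Unindexed list: cheap writes, but you cannot ask "which users have word X".
    words = ndb.StringProperty(repeated=True, indexed=False)

class UserIndexed(ndb.Model):
    # Indexed list: one index entry (and one write op) per word, bounded by the
    # per-entity limits mentioned above, but "owners by word" is a plain query.
    words = ndb.StringProperty(repeated=True)

owners = UserIndexed.query(UserIndexed.words == 'hello').fetch(100)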

Related

Finding unique products (never seen before by a user) in a datastore sorted by a dynamically changing value (i.e. product rating)

I've been trying to solve this problem for a week and couldn't come up with any solutions in all my research, so I thought I'd ask you all.
I have a "Product" table and a "productSent" table; here's a quick schema to help explain:
class Product(ndb.Model):
    name = ndb.StringProperty()
    rating = ndb.IntegerProperty()

class productSent(ndb.Model):  # the key name here is md5(Product Key + UUID)
    pId = ndb.KeyProperty(kind=Product)
    uuId = ndb.KeyProperty(kind=userData)
    action = ndb.StringProperty()
    date = ndb.DateTimeProperty(auto_now_add=True)
My goal is to show users the highest rated product that they've never seen before--fast. So to keep track of the products users have seen, I use the productSent table. I created this table instead of using Cursors because every time the rating order changes, there's a possibility that the cursor skips the new higher ranking product. An example: assume the user has seen products 1-24 in the db. Next, 5 users liked product #25, making it the #10 product in the database--I'm worried that the product will never be shown again to the user (and possibly mess things up on a higher scale).
The problem with the way I'm doing it right now is that, once the user has blown past the first 1,000 products, query performance really starts to slow down, because I'm literally pulling 1,000+ results, checking whether they've been sent by querying against the productSent table (doing a key-name lookup to speed things up), and looping until 15 new ones have been found.
One solution I thought of was to add a repeated property (listProperty) to the Product table of all the users who have seen a product. Or if I don't want to have inequality filters I could put a repeated property of all the users who haven't seen a product. That way when I query I can dynamically take those out. But I'm afraid of what happens when I have 1,000+ users:
a) I'll go through the roof on the limit of repeated properties in one entity.
b) The index size will increase storage costs.
Has anyone dealt with this problem before (I'm sure someone has!) Any tips on the best way to structure it?
Update:
Okay, so I had another idea. In order to minimize the changes that take place when a rating (number of likes) changes, I could have a secondary column that only has 3 possible values: positive, neutral, negative, and sort by that. Of course, items that have a rating of 0 and get a 'like' (making them positive) would still have a chance of being out of order or skipped by the cursor, but it'd be less likely. What do y'all think?
Sounds like the inverse, productNotSent, would work well here. Every time you add a new product, you would add a new productNotSent entity for each user. When the user wants to see the highest-rated product they have not seen, you only have to query over the productNotSent entities that match that user. If you put the rating directly on the productNotSent entity, you can speed the query up even more, since you will only have to query against one model.
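A hedged ndb sketch of that idea (model and property names are illustrative, and user_key is assumed to be the userData key; the equality filter plus sort order needs a composite index on uuId and rating):

from google.appengine.ext import ndb

class productNotSent(ndb.Model):
    uuId = ndb.KeyProperty(kind='userData')
    pId = ndb.KeyProperty(kind='Product')
    # Denormalized copy of the product's rating, so a single query on this
    # model is enough; it has to be refreshed when the rating changes.
    rating = ndb.IntegerProperty()

# Highest-rated products this user has not seen yet.
unseen = (productNotSent.query(productNotSent.uuId == user_key)
          .order(-productNotSent.rating)
          .fetch(15))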
Another idea would be to limit the number of productNotSent entities per user, so each user only has ~100 of these entities at a time. This would mean your query would be constant for each user, regardless of the number of products or users you have. The creation of new productNotSent entities would become more complex, though: you'd need a cron job or something that "tops up" a user's collection of productNotSent entities when they use some up. You would also want to double-check that products rated higher than those already within the user's set of productNotSent entities get pushed in there. These are a little more difficult and will require some design trade-offs.
Hope this helps!
I do not know your expected volumes and exact issues (I only did a quick perusal of your question), but you may consider using JSON TextProperty storage as part of your plan. Create dictionaries/lists and store them in records by json.dumps()ing them to a TextProperty. When the client calls, simply send the TextProperties to the client, and figure everything out on the client side once you JSON.parse() them. We have done some very large array/object processing in JS this way, and it is very fast (particularly with indexed arrays). When the user clicks on something, send a transaction back to update their record. Set up some pull or push queue processes to handle your overall product listing updates, major customer record updates, etc.
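A rough sketch of that pattern, assuming a per-user feed entity keyed by the user id (all names here are made up):

import json
from google.appengine.ext import ndb

class UserFeed(ndb.Model):
    # JSON-encoded list of lightweight product records,
    # e.g. [{"id": 123, "rating": 42}, ...]; not indexed, not queryable.
    payload = ndb.TextProperty()

def save_feed(user_id, products):
    UserFeed(id=user_id, payload=json.dumps(products)).put()

def load_feed(user_id):
    feed = UserFeed.get_by_id(user_id)
    return json.loads(feed.payload) if feed else []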
One downside is higher bandwidth going out of your app, but I think this cost will be minimal given the potential processing savings on GAE. If you structure this right, you may be able to use get_by_id() to replace all or most of your planned indices and queries. We have found json.loads() and json.dumps() to be very fast inside the app, but we only use simple dictionary/list structures. This approach should, though, be a great deal cheaper than your planned use of queries. The other potential issue is that very large objects may run into soft memory limits; be sure that your JSON objects are fairly simple and lightweight to avoid this (e.g. do not include product descriptions, sub-objects, etc. in the JSON item, just the basics such as product number). HTH, -stevep

"Connected" vs "disconnected" data model at GAE: how should I proceed in this case?

It's better to start with an example to illustrate this case. Let's say we have a User class, and it should have a list of Post.
The first thought is to create this list inside the User class, but analyzing the use cases we find out that most of the time we want to retrieve the user without its posts and retrieve the posts without the user. However, we need the user ID to retrieve posts. So the other way to create the data model is to not have the association, but to create Post indexed by User ID.
In terms of cost, what are the pros and cons of both implementations?
See the billing page, in particular the section on the datastore operations:
https://developers.google.com/appengine/docs/billing
Datastore read costs grow per entity.
Datastore write costs grow per indexed property.
The first method will be much cheaper since it only operates on one User entity, and there's no indexing required.
However, cost probably isn't your sole deciding factor. Entities are limited to 1MB each, so if you're storing your posts within your User entity, you'll likely run into a wall. Time to read/write entities also depend on size, so large entities will take longer to read/write.
My previous answer assumed you were actually storing a list of Post objects within your User entity. It sounds like you're asking about the case where User and Post are both entities, and the User stores a list of keys to the Posts.
The main benefit of the first case (User with a list of keys to Post entities) is that it lets you fetch Posts in a consistent manner. After getting the User object, you can read the list of Post keys and fetch them individually. Datastore get-by-key operations are consistent. Depending on how you issue the get operations (e.g. if you just use a for loop), this may be slower than a query.
There's a possible, very minor other benefit: as long as you don't index your Post list in your User, you can update your User relatively inexpensively this way. As an extreme example, if your User adds 5 Posts at once, you can add them all to the list and then write the User once with one write operation. This isn't really all that great, since you probably have to write your Post entities anyway, but it's one less index write op per entity.
There is still the limit on the size of the User entity, so your list will have a maximum size. There's also a maximum on the number of index entries per entity, so if you index the list, that could be a limit (but that would make the User entity more expensive to write too).
From a read perspective, the first case is not optimal.
The second case works better from a read perspective: it makes it easier to get Posts if you have the User id, but you pay the index write ops when you write your Post. If you don't write Posts often, this is better. Note that queries are eventually consistent.
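For reference, a hedged ndb sketch of the two layouts (class and property names are invented, and user is assumed to be an already-fetched User entity):

from google.appengine.ext import ndb

class Post(ndb.Model):
    body = ndb.TextProperty()

# Case 1: the User stores a list of Post keys -- strongly consistent
# get-by-key reads, no index writes, but capped by the 1MB entity limit.
class User(ndb.Model):
    post_keys = ndb.KeyProperty(kind=Post, repeated=True, indexed=False)

posts_case1 = ndb.get_multi(user.post_keys)

# Case 2: each Post carries an indexed reference to its User -- extra index
# writes per Post, and the query is eventually consistent.
class IndexedPost(ndb.Model):
    body = ndb.TextProperty()
    user = ndb.KeyProperty(kind=User)

posts_case2 = IndexedPost.query(IndexedPost.user == user.key).fetch(20)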

Advanced database queries - appengine datastore

I have a fairly simple application (like a CRM) which has a lot of contacts and associated tags.
A user can search by giving a lot of criteria (search-items), such as:
updated_time in last 10 days
tags in xxx
tags not in xxx
first_name starts with xxx
first_name not in 'Smith'
I understand indexing and how filters (not in) cannot work on more than one property.
For me, since most of the time reporting is done in a cron job, I can iterate through all records and process them. However, I would like to know the most optimized route for doing it.
I am hoping that instead of querying 'ALL', I can get close to a query which can run within the App Engine design limits and then manually match the rest of the items in the query.
One way of doing it is to start with the first search-item, get a count, add the next search-item, get a count again, and so on. At the point where it bails out, I process those records against the rest of the search-items manually.
The questions are:
Is there a way to know beforehand whether a query is valid, programmatically, without doing a count?
How do you determine the best set of search-items which do not collide (e.g. not-in does not work across many filters)?
The only way I see is to run all the equality filters as one query, take the first inequality or 'in' filter, execute it, and just iterate over the resulting entities.
Is there a library which can help me ;)
I understand indexing and how filters (not in) cannot work on more than one property.
This is not strictly true. You may create a "composite index", which allows you to filter on multiple fields. These consume additional storage and write operations.
You may also build your own equivalent of a composite index by generating a "composite field" that you can query against.
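As an illustration of the "composite field" idea, a hedged ndb sketch (the model and field names are invented):

from google.appengine.ext import ndb

class Contact(ndb.Model):
    first_name = ndb.StringProperty()
    tag = ndb.StringProperty()
    # Hand-rolled "composite field": one indexed property combining both
    # values, so a single equality filter stands in for a composite index.
    name_tag = ndb.ComputedProperty(
        lambda self: '%s|%s' % (self.first_name, self.tag))

# Equivalent to filtering on first_name AND tag at once:
results = Contact.query(Contact.name_tag == 'Smith|customer').fetch(20)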
Is there a way before hand to know if a query is valid programatically w/o doing a count
I'm not sure I understand what kind of validity you're referring to.
How do you determine the best of search-items in a set which do not collide (like not-in does not work on many filters) etc.
A "not in" filter is not trivial. One way is to create two arrays (repeated fields). One with all the tagged entries and one with not all the tags. This would allow you to easily find all the entities with and without the tag. The only issue is that once you create a new tag, you have to sweep across the entities adding a "not in" entry for all the entities.

Google App Engine storing as list vs JSON

I have a model called User, and a user has a property relatedUsers, which, in its general format, is an array of integers. Now, there will be times when I want to check if a certain number exists in a User's relatedUsers array. I see two ways of doing this:
1. Use a standard Python list with indexed values (or maybe not) and just run an IN query and see if that number is in there.
2. Having the key to that User, get back the value for the property relatedUsers, which is an array in JSON string format. Decode the string, and check if the number is in there.
Which one is more efficient? Would option 1 cost more reads than option 2? And would option 1's writes cost more than option 2's, since indexing each value costs a write? What if I don't index -- which solution would be better then?
Here are your costs vs. capability, option by option:
1. Putting the values in an indexed list will be far more expensive. You incur the cost of one write for each value in the list, which can explode depending on how many friends your users have. This cost explosion can be even worse if you have certain kinds of composite indexes. The good side is that you get to run queries on this information: you can query for a list of users who are friends with a particular user, for example.
2. No extra index or write costs here. The problem is that you lose querying functionality.
If you know you're only ever going to check against the current user's list of friends, by all means go with option 2. Otherwise you might have to look at your design a little more carefully.
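A brief sketch of the membership check under each option (both properties are shown on one model purely for comparison; names follow the question):

import json
from google.appengine.ext import ndb

class User(ndb.Model):
    # Option 1: indexed repeated property -- queryable, one write op per value.
    relatedUsers = ndb.IntegerProperty(repeated=True)
    # Option 2: JSON blob -- no index writes, but no queries either.
    relatedUsersJson = ndb.TextProperty()

# Option 1: the check can run as a query, and it also answers
# "which users are related to 42?" without fetching everyone.
some_users = User.query(User.relatedUsers == 42).fetch(10)

# Option 2: fetch the entity by key, decode, and check in memory.
user = ndb.Key(User, 'some-id').get()
is_related = 42 in json.loads(user.relatedUsersJson or '[]')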

List of keys or separate model?

I'm building an app with users and their activities. Now I'm thinking of the best way of setting up the datastore models. Which one is fastest/preferred, and why?
A
class User(db.Model):
    activities = db.ListProperty(db.Key)
    ...

class Activity(db.Model):
    ...

activities = db.get(user.activities)
or
B
class User(db.Model):
    ...

class Activity(db.Model):
    owner = db.ReferenceProperty(reference_class=User)
    ...

activities = Activity.all().filter('owner =', user)
If a given activity can only have a single owner, definitely use a ReferenceProperty.
It's what ReferenceProperty is designed for.
It automatically sets up back-references for you, which can be handy since it gives you a bi-directional link, unlike the ListProperty, which is a uni-directional link (see the sketch after this list).
It enforces that the thing being linked to is the proper type/class.
It enforces that only a single user is linked to a given activity.
It lets you automatically fetch the linked objects without having to write an explicit query, if you so desire.
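A hedged sketch of the automatic back-reference (collection_name overrides the default name, which would be activity_set; user is assumed to be an already-fetched User entity):

from google.appengine.ext import db

class User(db.Model):
    name = db.StringProperty()

class Activity(db.Model):
    owner = db.ReferenceProperty(User, collection_name='activities')

# Forward link: activity.owner is the User entity.
# Back-reference: user.activities is a ready-made query over that user's activities.
recent = user.activities.fetch(10)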
I'm guessing the difference is going to be marginal and will likely depend more on your application than some concrete difference in read/write times based on your models.
I would say use the first option if you're going to use info from every activity a user has done each time you fetch a user. In other words, if almost everything a user does on your application coincides with a large subset of their activities, then it makes sense to always have the activities available.
Use option B if you don't need the activities all of the time. This will result in a separate request on the data store whenever you need to use the activity, but it will also make the requests smaller. Making an extra request likely adds more overhead than making bigger requests.
All of that being said, I would be surprised if you had a noticeable difference between these two approaches. The area where you're going to get much more noticeable performance improvements is by using memcache.
I don't know about the performance difference, I suspect it'll be similar. When it comes to perf, things are hard to control with the GAE datastore. If all your queries happen to hit the same tablet (bigtable server), that could limit your perf more than the query itself.
The big difference is that A would be cheaper than B. Since you have a list of activities you want, you don't need to write an index for every activity object you write. If activities are written a lot, your savings add up.
Since you have the activity key, you also have the ability to do a highly-consistent get() rather than an eventually consistent filter()
On the flip side, you won't be able to do backwards references, like look up an owner given an activity. Your ListProperty can also cause you to hit your maximum entity size - there will eventually be a hard limit on the number of activities per user. If you went with B, you can have a huge number of activities per user.
Edit: I forgot, you can have backwards references if you index your ListProperty, but then writing your User object gets expensive, and the limit on the number of indexed properties would limit the size of your list. So even though it's possible, B is still preferable if you need backwards references.
A will be a good deal faster because it works purely with keys. Looking up objects by key goes straight to the data node in BigTable, whereas B requires a lookup in the indexes first, which is slower (and costs will go up with the number of Activity entities).
If you never need to test for ownership, you can modify A to not index the key list. This is definitely the cheapest and most efficient route. However, as I understand it, if you later need to index them app engine cannot retroactively update indices on the key list. So only disable the index if you're certain you'll never need it.
How about C: setting each Activity's parent to the user key? Then you can fetch a user's activities with Activity.query(ancestor=user.key).
That way you don't need additional keys/properties, and it's a good way to group your entities for the HR datastore.
