Data Modeling - modeling an Append-only list in NDB - google-app-engine

I'm trying to make a general purpose data structure. Essentially, it will be an append-only list of updates that clients can subscribe to. Clients can also send updates.
I'm curious for suggestions on how to implement this. I could have an ndb.Model, 'Update', that contains the data and an index, or I could use a StructuredProperty with repeated=True on the main entity. I could also just store a list of keys somehow and keep the actual update data in a separate, not-strongly-linked structure.
I'm not sure how repeated properties work - does appending to the list of them (via the Python API) have to rewrite them all?
I'm also worried about consistency. Since multiple clients might be sending updates, I don't want them to overwrite each other and lose an update, or somehow end up with two updates with the same index.

The problem is that there's a maximum total size for each entity in the datastore (1 MB).
So any single entity that accumulates updates (whether it stores the data directly or collects keys) will eventually run out of space (I'm not sure how the limit applies to structured properties, however).
Why not have an 'Update' model, as you say? In a simple version, each update a client sends creates and saves a new entity. If you record the save time as a field in the model, you can sort by time when you query for them (presumably there is an upper limit on what you fetch anyway, at some level).
That way you also don't have to worry about simultaneous client updates overwriting each other; the datastore will take care of that for you. And you don't need to worry about what "index" each update has been assigned; IDs are generated automatically.
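The one-entity-per-update design can be sketched in plain Python (the model name Update, the field names, and the dict standing in for the datastore are all assumptions, not actual NDB code): each update gets its own auto-generated id, and the feed is recovered by sorting on the creation timestamp, so there is no shared counter for concurrent clients to clash on.

```python
import datetime
import uuid

# dict standing in for the datastore: key -> entity
updates = {}

def save_update(data, created=None):
    # uuid stands in for the datastore's auto-assigned entity id;
    # concurrent writers never collide on a shared index.
    key = uuid.uuid4().hex
    updates[key] = {'data': data,
                    'created': created or datetime.datetime.utcnow()}
    return key

def recent_updates():
    # the equivalent of: Update.query().order(Update.created)
    ordered = sorted(updates.values(), key=lambda u: u['created'])
    return [u['data'] for u in ordered]

save_update('first', datetime.datetime(2020, 1, 1))
save_update('second', datetime.datetime(2020, 1, 2))
print(recent_updates())  # ['first', 'second']
```

The point of the sketch is that ordering comes from the timestamp at query time, not from an index assigned at write time.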
As that might be costly in datastore reads, you could implement a version that stores updates in a repeated property on a single entity, moving to a new entity after N entries are stored, but then you'd have to wrap each append in a transaction to be sure multiple updates don't clash, and so on.
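The "move to a new model after N entries" idea can be sketched in plain Python (BUCKET_SIZE and the list-of-lists are stand-ins for entities, not datastore code; in the real thing each append would run inside a transaction):

```python
BUCKET_SIZE = 3  # assumed capacity; in practice bounded by the 1 MB entity limit

def append(buckets, update):
    """Append to the newest bucket, starting a fresh one when full."""
    if not buckets or len(buckets[-1]) >= BUCKET_SIZE:
        buckets.append([])  # "move to a new model after N entries"
    buckets[-1].append(update)

buckets = []
for i in range(7):
    append(buckets, i)
print(buckets)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Reading the whole feed then costs one get per bucket instead of one per update, which is the trade being made against transactional complexity.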
You can also cache the query that generates the results and invalidate the cache only when a new update is saved. Look at NDB too, as it provides some automatic caching (though not for queries).

Related

What's the most efficient way of reading relations out of a pouchdb database

I'm using PouchDB in an Electron app. The data was stored in a Postgres database before being moved to PouchDB. In some cases it wasn't hard to figure out how to structure the data in a document fashion.
My main concern is regarding relations. For example:
I have the data type Project, and projects have many events. Right now I have a field called project_id on each event. So when I want to get the events for the project with ID 'project/1' I'll do:
_db.allDocs({
  include_docs: true,
  startkey: 'event',
  endkey: 'event\uffff'
}).then(function (response) {
  var filtered = _.filter(response['rows'], function (row) {
    return row['doc']['project_id'] == 'project/1';
  });
  var result = filtered.map(function (row) {
    return row['doc'];
  });
});
I've read that allDocs is the most performant API, but is having a view more convenient in this case?
On the other hand, when I show a list with all the projects, each project needs to show the number of events it has. In this scenario it looks like I would have to run allDocs again, with include_docs: false, in order to count the number of events each project has.
Does having a view improve this situation?
On the other hand, I'm thinking of keeping an array with all the event ids on the Project document, so I can easily count how many events it has. In that case, should I use allDocs? Is there a way to pass an array of ids to allDocs? Or would it be better to loop over that array and call get(id) for each id?
Is this other way more performant than the first one?
Thanks!
Good question! There are many ways to handle relationships in PouchDB. And like many NoSQL databases, each will give you a tradeoff of performance vs. convenience.
The system you describe is not terribly performant. Basically you are fetching every single event in the database (O(n)) and then filtering in-memory. If you have many events, then n will be large, meaning it will be very very slow.
You have a few options here. All of them are better than your current system:
1. Linked (aka joined) documents in map/reduce. I.e. in your map function, you would emit() the project _id for each event. This creates a secondary index on whatever you put as the key in the emit() function.
2. relational-pouch, which is a plugin that works by using prefixed _ids and running allDocs() with startkey and endkey for each one. So it would do one allDocs() to fetch the project, then a second allDocs() to fetch the events for that project.
3. Entirely separate databases, e.g. new PouchDB('projects') and new PouchDB('events').
(Roughly, these are listed in order of least performant to most performant.)
#1 is more performant than the system you describe, although it's still not terribly fast, because it requires creating a secondary index, and then after that will essentially do an allDocs() on the secondary index database as well as on the original database (to fetch the linked docs). So basically you are running allDocs() three times under the hood – one of which is on whatever you emitted as the key, which it seems like you don't need, so it would just be wasted.
#2 is much better, because under the hood it runs two fast allDocs() queries - one to fetch the project, and another to fetch the events. It also doesn't require creating a secondary index; it can use the free _id index.
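Why the prefixed-_id approach is fast can be sketched in plain Python (the ids below are made up): allDocs() with startkey/endkey is a range scan over the already-sorted _id index, i.e. a binary search plus a contiguous slice, rather than an O(n) scan-and-filter over every document.

```python
from bisect import bisect_left, bisect_right

# the sorted _id index, as the database maintains it for free
ids = sorted(['event_1_a', 'event_1_b', 'event_2_c',
              'project_1', 'project_2'])

def prefix_scan(sorted_ids, prefix):
    # equivalent of allDocs({startkey: prefix, endkey: prefix + '\uffff'})
    lo = bisect_left(sorted_ids, prefix)
    hi = bisect_right(sorted_ids, prefix + '\uffff')
    return sorted_ids[lo:hi]

print(prefix_scan(ids, 'event_1_'))  # ['event_1_a', 'event_1_b']
```

Because every id sharing the prefix sits in one contiguous run of the index, the cost depends on the number of matches, not on the total number of documents.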
#3 also requires two allDocs() calls. So why is it the fastest? Well, interestingly it's because of how IndexedDB orders read/write operations under the hood. Let's say you are writing to both 'projects' and 'events'. What IndexedDB will do is to serialize those two writes, because it can't be sure that the two aren't going to modify the same documents. (When it comes to reads, though, the two queries can run concurrently in either case – in Chrome, at least. I believe Firefox will actually serialize the reads.) So basically if you have two completely separate PouchDBs, representing two completely separate IndexedDBs, then both reads and writes can be done concurrently.
Of course, in the case of a parent-child relationship, you can't know the child IDs in advance, so you have to fetch the parent anyway and then fetch the children. So in that case, there is no performance difference between #2 and #3.
In your case, I would say the best choice is probably #2. It's a nice compromise between perf and convenience, especially since the relational-pouch plugin already does the work for you.

Is Couchbase an ordered key-value store?

Are documents in Couchbase stored in key order? In other words, would they allow efficient queries for retrieving all documents with keys falling in a certain range? In particular I need to know if this is true for Couchbase lite.
Query efficiency is correlated with the construction of the views that are added to the server.
Couchbase/Couchbase Lite only stores the indexes specified and generated by the programmer in these views. As Couchbase rebalances, it moves documents between nodes, so it seems impractical for key order to be guaranteed or consistent.
(Few databases/datastores guarantee document or row ordering on disk, as indexes provide this functionality more cheaply.)
Couchbase document retrieval is performed via map/reduce queries in views:
A view creates an index on the data according to the defined format and structure. The view consists of specific fields and information extracted from the objects in Couchbase. Views create indexes on your information that enables search and select operations on the data.
source: views intro
A view is created by iterating over every single document within the Couchbase bucket and outputting the specified information. The resulting index is stored for future use and updated with new data stored when the view is accessed. The process is incremental and therefore has a low ongoing impact on performance. Creating a new view on an existing large dataset may take a long time to build but updates to the data are quick.
source: Views Basics
and finally, the section on Translating SQL to map/reduce may be helpful:
In general, for each WHERE clause you need to include the corresponding field in the key of the generated view, and then use the key, keys or startkey / endkey combinations to indicate the data you want to select.
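That WHERE-clause-into-map-function translation can be sketched in plain Python (the doc fields and values are made up): the map function emits (key, value) pairs for each qualifying document, the emitted pairs are kept sorted by key, and startkey/endkey select a range of that sorted index.

```python
docs = [
    {'id': 'beer_1', 'type': 'beer', 'abv': 5.2},
    {'id': 'beer_2', 'type': 'beer', 'abv': 7.9},
    {'id': 'brewery_1', 'type': 'brewery'},
]

def map_fn(doc):
    # WHERE type = 'beer', keyed on abv for range selection
    if doc['type'] == 'beer':
        yield (doc['abv'], doc['id'])

# the stored view: all emitted pairs, sorted by key
index = sorted(kv for doc in docs for kv in map_fn(doc))

def query(startkey, endkey):
    return [value for key, value in index if startkey <= key <= endkey]

print(query(5.0, 6.0))  # ['beer_1']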
In conclusion, Couchbase views constantly update their indexes to ensure optimal query performance. Couchbase Lite queries similarly, though its indexing mechanics differ slightly from the server's:
View indexes are updated on demand when queried. So after a document changes, the next query made to a view will cause that view's map function to be called on the doc's new contents, updating the view index. (But remember that you shouldn't write any code that makes assumptions about when map functions are called.)
How to improve your view indexing: The main thing you have control over is the performance of your map function, both how long it takes to run and how many objects it allocates. Try profiling your app while the view is indexing and see if a lot of time is spent in the map function; if so, optimize it. See if you can short-circuit the map function and give up early if the document isn't a type that will produce any rows. Also see if you could emit less data. (If you're emitting the entire document as a value, don't.)
from Couchbase Lite - View

GAE Datastore: Normalization?

"Normalization" here is not meant in the general relational-database sense; it's specific to this context.
I have received reports from a User. The data in these reports was generated roughly at the same time, making the timestamp the same for all reports gathered in one request.
I'm still pretty new to the datastore, and I know you can query on properties and that you have to grab the ancestor entity's key to traverse down... so I'm wondering which option is better performance- and write/read-wise.
Should I do:
Option 1:
User (Entity, ancestor of ReportBundle): general user information properties
ReportBundle (Entity, ancestor of Report): timestamp
Report (Entity): general data properties
Option 2:
User (Entity, ancestor of Report): insert general user information properties
Report (Entity): timestamp property AND general data properties
Go with option 2:
you save the time of reading and writing an additional entity.
You also save datastore operations (which in the end saves money).
As I see from your options, you need to check the timestamp property anyhow, so putting it inside the Report object is fine;
it also keeps your code less complex and more maintainable.
As mentioned by Chris and in the comments, using the datastore means thinking denormalized.
It's better to store data twice than to run complex queries; the goal of your data design should be to fetch entities by ID.
Doing so will also cut down the number of indexes you need. This is important to know:
the reason the number of indexes is limited is denormalization itself.
For each index you create, the datastore creates a table behind the scenes that holds the data in the order your index specifies. So when you use indexes, your data is already stored more than once. The good thing about this behavior is that writes are fast, because the datastore can write to all the index tables in parallel; reads are fast too, because the data is already stored in the right order for your index.
Knowing this, and if only these 2 options are available, option 2 would be the better one.
We have lots of very denormalized models because of the inability to do JOINs.
You should think about how you are going to process the data, if you might expect request timeouts.

List of keys or separate model?

I'm building an app with users and their activities. Now I'm thinking of the best way of setting up the datastore models. Which one is fastest/preferred, and why?
A
class User(db.Model):
    activities = db.ListProperty(db.Key)
    ...

class Activity(db.Model):
    ...

activities = db.get(user.activities)
or
B
class User(db.Model):
    ...

class Activity(db.Model):
    owner = db.ReferenceProperty(reference_class=User)
    ...

activities = Activity.all().filter('owner =', user)
If a given activity can only have a single owner, definitely use a ReferenceProperty.
- It's what ReferenceProperty is designed for.
- It automatically sets up back-references for you, which can be handy since it gives you a bi-directional link (unlike the ListProperty, which is a uni-directional link).
- It enforces that the thing being linked to is the proper type/class.
- It enforces that only a single user is linked to a given activity.
- It lets you fetch the linked objects without having to write an explicit query, if you so desire.
I'm guessing the difference is going to be marginal and will likely depend more on your application than some concrete difference in read/write times based on your models.
I would say use the first option if you're going to use info from every activity a user has done each time you fetch a user. In other words, if almost everything a user does on your application coincides with a large subset of their activities, then it makes sense to always have the activities available.
Use option B if you don't need the activities all of the time. This will result in a separate request on the data store whenever you need to use the activity, but it will also make the requests smaller. Making an extra request likely adds more overhead than making bigger requests.
All of that being said, I would be surprised if you had a noticeable difference between these two approaches. The area where you're going to get much more noticeable performance improvements is by using memcache.
I don't know about the performance difference, I suspect it'll be similar. When it comes to perf, things are hard to control with the GAE datastore. If all your queries happen to hit the same tablet (bigtable server), that could limit your perf more than the query itself.
The big difference is that A would be cheaper than B. Since you already hold the list of activity keys you want, you don't need an index entry written for every Activity object you put. If activities are written a lot, the savings add up.
Since you have the activity keys, you also have the ability to do a strongly consistent get() rather than an eventually consistent filter().
On the flip side, you won't be able to do backwards references, like look up an owner given an activity. Your ListProperty can also cause you to hit your maximum entity size - there will eventually be a hard limit on the number of activities per user. If you went with B, you can have a huge number of activities per user.
Edit: I forgot, you can have backwards references if you index your ListProperty, but then writing your User object gets expensive, and the limit on the number of indexed properties caps the size of your list. So even though it's possible, B is still preferable if you need backwards references.
A will be a good deal faster because it works purely with keys. Looking up objects by key goes straight to the data node in BigTable, whereas B requires a lookup in the indexes first, which is slower (and costs will go up with the number of Activity entities).
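The difference between the two lookups can be sketched in plain Python (the entity contents are made up): option A fetches straight from the key space, while option B must first scan the owner index to discover the keys, then fetch the entities.

```python
# the entity table: key -> entity
entities = {
    'act1': {'owner': 'u1', 'what': 'ran'},
    'act2': {'owner': 'u1', 'what': 'swam'},
    'act3': {'owner': 'u2', 'what': 'hid'},
}
# the index behind the owner property: (owner, key) pairs in sorted order
owner_index = sorted((e['owner'], key) for key, e in entities.items())

def option_a(key_list):
    # db.get(user.activities): direct fetch by key, no index involved
    return [entities[k] for k in key_list]

def option_b(owner):
    # Activity.all().filter('owner =', user): scan the index for
    # matching keys, then fetch the entities they point at
    keys = [k for o, k in owner_index if o == owner]
    return [entities[k] for k in keys]

print(option_a(['act1', 'act2']) == option_b('u1'))  # True
```

Both return the same activities; B just pays for the extra index pass, and that index must be maintained on every Activity write.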
If you never need to test for ownership, you can modify A to not index the key list. This is definitely the cheapest and most efficient route. However, as I understand it, if you later need to index them app engine cannot retroactively update indices on the key list. So only disable the index if you're certain you'll never need it.
How about C: setting each Activity's parent to the user's key? Then you can fetch a user's activities with Activity.query(ancestor=user.key).
That way you don't need additional keys/properties, and it's a good way to group your entities for the HR datastore.

Sencha Touch - Is creating Stores often expensive?

Is it an expensive operation to create Ext.data.Store objects? I ask because I quite often create stores just to retrieve data once.
It will depend on the quantity of data you're retrieving and how you use it in your application.
You need to weigh up the overhead of calling data from your data source more than once against the overhead of storing it on the page and using it client-side.
Using stores just to retrieve data once isn't really a problem, as a store is just a client-side collection of data. There isn't much weight to them.
It may also be worth knowing that if you're using ExtJS4 and you're talking about retrieving a single data item, rather than a collection of items, you can create a single 'model' and interact with that rather than a store, which would be a lighter solution.
