This might sound like an obvious question, but I'm new to CouchDB, so I thought it was worthwhile asking in case there is something about CouchDB's structure that changes the situation that I didn't know about. For reasons out of my control, I have to build a queue-like structure out of CouchDB. For simplicity's sake, let's say I'm queueing IDs for jobs to be executed later. Note that there will be no duplicates.
I'm trying to figure out what the best way to structure this is. As I currently see it, I have a few options:
Store the queue items as entries in a queue database with the IDs as _id, and store the dequeued items in a similar dequeued database with the IDs as the _id. Each record in each database wouldn't have any information other than the (mandatory) _id and _rev.
Have a single queueing database, and that database will contain one record with _id = 'queue' and one record with _id = 'dequeued'. Within each of the two records, there will be an arbitrary number of keys, each of which will be an ID for the jobs to be executed (or that were already executed). The values associated in the database with the keys will be irrelevant, possibly just a Boolean.
Have a single queueing database, and within that database, have a single record called queue. Within that record, have two keys: queue and dequeued. Each of those keys will have as its associated value an arbitrary-length list of job execution IDs.
1 is slightly less desirable because it requires two databases, and 2 strikes me as a poor choice because it requires loading the entire list of queued or dequeued items in order to read a list item or make any changes. However, 3 is nice in that it allows for the whole list of IDs to be an ordered list rather than key/value pairs, which makes it easier to pick a random item from the list to be the next job to be executed, since I don't actually need to know any key names (since there are none).
I'm looking for whichever provides the best performance. Any thoughts on this?
Update
For people reading this question in the future, I've built my CouchDB queuing module, CouchQueue, a work in progress.
You can install it with npm install couchqueue.
Take a look (and please comment, send pull requests, etc.) on GitHub.
Use one document per element in the queue, and keep one queue database.
I recommend a field to order the elements, for example .created_at with a timestamp in ISO 8601 format.
You can toggle an element's visibility with a .visible flag.
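For example, a queue element document might end up looking like this (the _id and payload fields here are just illustrative):
{
  "_id": "job-0001",
  "created_at": "2013-05-06T14:02:17.452Z",
  "visible": true,
  "payload": "the job ID you are queueing"
}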
I recommend a map/reduce view, something like this
function (doc) {
  if (doc.visible) {
    emit(doc.created_at, doc)
  }
}
Now you can query this view, either oldest-first, or newest-first (?descending=true). You can mark an element complete by updating it, setting visible = false.
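For illustration, a minimal sketch of that flow over CouchDB's HTTP API from Node 18+ (which has a global fetch); the database, design document, and view names are assumptions, not anything prescribed above:
// Assumed names: database "queue", design doc "queue", the view above saved as "visible"
const COUCH = 'http://localhost:5984/queue';

async function takeOldestElement() {
  // Default ascending order on created_at gives oldest-first; add ?descending=true for newest-first
  const res = await fetch(COUCH + '/_design/queue/_view/visible?limit=1');
  const { rows } = await res.json();
  if (rows.length === 0) return null;          // queue is empty

  const doc = rows[0].value;                   // the view emits the whole doc
  doc.visible = false;                         // mark the element complete

  // The write only succeeds if doc._rev is still current; a 409 means another
  // worker claimed the element first, so retry and take the next one
  const put = await fetch(COUCH + '/' + encodeURIComponent(doc._id), {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(doc),
  });
  if (put.status === 409) return takeOldestElement();
  return doc;
}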
I wrote a CouchDB queue, CQS, which mirrors the Amazon SQS API. It is similar to what I describe, except that there is also a "checked-out" state a message can be in, where it is not visible in the queue for a timeout period. I have used CQS in production for about two years, with hundreds of millions of updates.
I suggest using separate documents for each queue entry; it will allow you to avoid conflicts.
If you just need a queue with the interface push(), pop(), top() for inserting an element and taking one, then the solution may be very simple (if you want a list with next(), or access to the n-th element, it gets more complicated). For a scheduling algorithm with a linear order (like FIFO or FILO) you can implement push() as insertion of a new document:
{ type: "queue", inserted: CURRENT_TIME, ... }
top() as the map:
function (doc) {
  if (doc.type == "queue" && doc.inserted) {
    emit(doc.inserted, doc);
  }
}
and the reduce as an aggregation (e.g. max for FILO, min for FIFO).
For pop() you can ask the view for the top() and then delete that document.
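A rough sketch of pop() along those lines, querying the map above with the reduce disabled; the database, design document, and view names here are made up:
// Assumed names: database "jobs", design doc "scheduler", the map above saved as view "by_inserted"
const DB = 'http://localhost:5984/jobs';

async function pop() {
  // FIFO: take the single oldest entry (ascending by `inserted`); use ?descending=true for FILO
  const res = await fetch(DB + '/_design/scheduler/_view/by_inserted?limit=1&reduce=false');
  const { rows } = await res.json();
  if (rows.length === 0) return null;          // nothing queued

  const doc = rows[0].value;
  // Delete the entry we just read; a 409 means another worker popped it first
  // (the concurrency caveat below), so just try again
  const del = await fetch(DB + '/' + encodeURIComponent(doc._id) + '?rev=' + doc._rev, {
    method: 'DELETE',
  });
  if (del.status === 409) return pop();
  return doc;
}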
Map/reduce has to be deterministic, so if you want to choose a random element you can make the reduce depend on the pseudo-random _id (chosen by the server).
I expect two problems:
Mind the concurrency: two processes can ask for the same document with top(); the first will delete the document as part of its pop(), and the second will then try to fetch an already-deleted document.
CouchDB never really deletes a document, it only marks it as deleted. Adding and deleting a document for each push()/pop() will grow the database, so you will have to reuse the documents somehow. Perhaps you have some pool of tasks which are inserted, removed, or reordered in the queue; then you can add queued: true to the task document (see the example below) instead of creating separate documents with type: "queue".
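For example, a reusable task document might simply carry the flag (the shape here is illustrative):
{
  "_id": "task-42",
  "type": "task",
  "queued": true,
  "inserted": "2013-05-06T14:02:17.452Z",
  "payload": { "job": "whatever needs to run" }
}
The map function then emits only documents with doc.queued set, and pop() flips the flag to false instead of deleting anything.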
Related
Just a question regarding NoSQL DBs. As far as I know, operations are done by the app/website outside the DB. For instance, if I need to add a value to a list, I need to:
download the initial list
add the new value in the list on my device
upload the whole updated list.
In the end, a lot of data travels (twice the initial list) with no added value.
Is there any way to ask the DB directly to perform simple operations like this?
db.collection("collection_key").document("document_key").add("mylist", value)
Or simply increment a field?
Same for knowing the number of documents in a collection: do I need to download the whole set of documents just to get the count?
A couple of different answers:
In Firestore, many intrinsic operations can be done with "FieldValues", such as increment/decrement (by a supplied value, so really add/subtract), as well as array unions, field deletes, etc. Just search the documentation for FieldValue. Whether this is true for NoSQL in general, I can't say.
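For example, with the Firestore Web SDK the read-modify-write round trip goes away entirely. This is only a minimal sketch; the field names "counter" and "mylist" are made up:
import { initializeApp } from "firebase/app";
import { getFirestore, doc, updateDoc, increment, arrayUnion } from "firebase/firestore";

const app = initializeApp({ /* your Firebase config */ });
const db = getFirestore(app);

// Both operations are applied server-side; nothing is downloaded first.
// updateDoc returns a Promise, so await it (or .then it) in real code.
updateDoc(doc(db, "collection_key", "document_key"), {
  counter: increment(1),            // add 1 to a numeric field
  mylist: arrayUnion("new value")   // append to an array field without fetching it
});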
Knowing the number of documents, on the other hand, is not trivially done in Firestore - but frankly, I can't think of any situations other than artificially contrived examples where you would need to know. It's easy enough to set up ways to "count" documents as you create/delete them, and keep that count separately, if for some reason you find yourself needing it.
Or were you just trying to generically put down NoSQL as a concept?
I'm using PouchDB in an Electron app. The data was stored in a Postgres database before moving to PouchDB. In some cases it wasn't hard to figure out how to structure the data in a document fashion.
My main concern is regarding relations. For example:
I have the data type Project, and a Project has many Events. Right now I have a field called project_id on each event, so when I want to get the events for the project with ID 'project/1' I do:
_db.allDocs({
  include_docs: true,
  startkey: 'event',
  endkey: 'event\uffff'
}).then(function (response) {
  filtered = _.filter(response['rows'], function (row) {
    return row['doc']['project_id'] == 'project/1';
  });
  result = filtered.map(function (row) {
    return row['doc'];
  });
});
I've read that allDocs is the most performant API, but is having a view more convenient in this case?
On the other hand, when I show a list with all the projects, each project needs to show the number of events it has. In this scenario it looks like I would have to run allDocs again, with include_docs: false, in order to count the number of events each project has.
Does having a view improve this situation?
I'm also thinking of having an array with all the event IDs on the Project document so I can easily count how many events it has. In that case should I use allDocs? Is there a way of passing an array of IDs to allDocs? Or would it be better to loop over that array and call get(id) for each id?
Is this other way more performant than the first one?
Thanks!
Good question! There are many ways to handle relationships in PouchDB. And like many NoSQL databases, each will give you a tradeoff of performance vs. convenience.
The system you describe is not terribly performant. Basically you are fetching every single event in the database (O(n)) and then filtering in-memory. If you have many events, then n will be large, meaning it will be very very slow.
You have a few options here. All of them are better than your current system:
Linked (aka joined) documents in map/reduce. I.e. in your map function, you would emit() the project _ids for each event. This creates a secondary index on whatever you put as the key in the emit() function.
relational-pouch, which is a plugin that works by using prefixed _ids and running allDocs() with startkey and endkey for each one. So it would do one allDocs() to fetch the project, then a second allDocs() to fetch the events for that project.
Entirely separate databases, e.g. new PouchDB('projects') and new PouchDB('events')
(Roughly, these are listed in order of least performant to most performant.)
#1 is more performant than the system you describe, although it's still not terribly fast, because it requires creating a secondary index, and then after that will essentially do an allDocs() on the secondary index database as well as on the original database (to fetch the linked docs). So basically you are running allDocs() three times under the hood – one of which is on whatever you emitted as the key, which it seems like you don't need, so it would just be wasted.
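To make #1 concrete, here is a rough sketch of that secondary index on the project id (a temporary map function is used for brevity; in practice you would save it in a design document):
// Assuming a PouchDB instance like the one in the question
var PouchDB = require('pouchdb');
var db = new PouchDB('mydb');

// Index events by the project they belong to; emit() is provided by PouchDB
// when it evaluates the map function
function byProject(doc) {
  if (doc.project_id) {
    emit(doc.project_id);          // key = project id
  }
}

// Fetch every event belonging to project/1 via the secondary index
db.query(byProject, { key: 'project/1', include_docs: true }).then(function (res) {
  var events = res.rows.map(function (row) { return row.doc; });
  console.log(events.length + ' events for project/1');
});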
#2 is much better, because under the hood it runs two fast allDocs() queries - one to fetch the project, and another to fetch the events. It also doesn't require creating a secondary index; it can use the free _id index.
#3 also requires two allDocs() calls. So why is it the fastest? Well, interestingly it's because of how IndexedDB orders read/write operations under the hood. Let's say you are writing to both 'projects' and 'events'. What IndexedDB will do is to serialize those two writes, because it can't be sure that the two aren't going to modify the same documents. (When it comes to reads, though, the two queries can run concurrently in either case – in Chrome, at least. I believe Firefox will actually serialize the reads.) So basically if you have two completely separate PouchDBs, representing two completely separate IndexedDBs, then both reads and writes can be done concurrently.
Of course, in the case of a parent-child relationship, you can't know the child IDs in advance, so you have to fetch the parent anyway and then fetch the children. So in that case, there is no performance difference between #2 and #3.
In your case, I would say the best choice is probably #2. It's a nice compromise between perf and convenience, especially since the relational-pouch plugin already does the work for you.
I'm trying to make a general purpose data structure. Essentially, it will be an append-only list of updates that clients can subscribe to. Clients can also send updates.
I'm curious for suggestions on how to implement this. I could have an ndb.Model 'Update' that contains the data and an index, or I could use a StructuredProperty with repeated=True on the main entity. I could also just store a list of keys somehow and then keep the actual update data in a not-strongly-linked structure.
I'm not sure how the repeated properties work - does appending to the list of them (via the Python API) have to rewrite them all?
I'm also worried about consistency. Since multiple clients might be sending updates, I don't want them to overwrite each other and lose an update, or somehow end up with two updates with the same index.
The problem is that you have a maximum total size for each entity in the datastore.
So any single entity that accumulates updates (storing the data directly or by collecting keys) will eventually run out of space (I'm not sure how the limit applies with regard to structured properties, however).
Why not have a model "Update", as you say? A simple version would be to have each provided update create and save a new entity. If you store the save date as a field in the entity, you can sort by time when you query (presumably there is an upper limit anyway at some level).
Also, that way you don't have to worry about simultaneous client updates overwriting each other; the datastore will worry about that for you. And you don't need to worry about what "index" they've been assigned; it's done automatically.
As that might be costly in datastore reads, I'm sure you could implement a version that uses repeated properties in a single entity, moving to a new entity after N keys are stored, but then you'd have to wrap it in a transaction to be sure multiple updates don't clash, and so on.
You can also cache the query generating the results and invalidate it only when a new update is saved. Look at NDB too, as it provides some automatic caching (though not for queries).
I'm building an app with users and their activities. Now I'm thinking of the best way of setting up the datastore models. Which one is fastest/preferred, and why?
A
class User(db.Model):
    activities = db.ListProperty(db.Key)
    ...

class Activity(db.Model):
    ...

activities = db.get(user.activities)
or
B
class User(db.Model):
    ...

class Activity(db.Model):
    owner = db.ReferenceProperty(reference_class=User)
    ...

activities = Activity.all().filter('owner =', user)
If a given activity can only have a single owner, definitely use a ReferenceProperty.
It's what ReferenceProperty is designed for
It'll automatically set up back-references for you, which can be handy since it gives you a bi-directional link (unlike the ListProperty which is a uni-directional link)
It enforces that the thing being linked to is the proper type/class
It enforces that only a single user is linked to a given activity
It lets you automatically fetch the linked objects without having to write an explicit query, if you so desire
I'm guessing the difference is going to be marginal and will likely depend more on your application than some concrete difference in read/write times based on your models.
I would say use the first option if you're going to use info from every activity a user has done each time you fetch a user. In other words, if almost everything a user does on your application coincides with a large subset of their activities, then it makes sense to always have the activities available.
Use option B if you don't need the activities all of the time. This will result in a separate request on the data store whenever you need to use the activity, but it will also make the requests smaller. Making an extra request likely adds more overhead than making bigger requests.
All of that being said, I would be surprised if you had a noticeable difference between these two approaches. The area where you're going to get much more noticeable performance improvements is by using memcache.
I don't know about the performance difference; I suspect it'll be similar. When it comes to performance, things are hard to control with the GAE datastore. If all your queries happen to hit the same tablet (Bigtable server), that could limit your performance more than the query itself.
The big difference is that A would be cheaper than B. Since you have a list of activities you want, you don't need to write an index for every activity object you write. If activities are written a lot, your savings add up.
Since you have the activity keys, you also have the ability to do a strongly consistent get() rather than an eventually consistent filter()
On the flip side, you won't be able to do backwards references, like look up an owner given an activity. Your ListProperty can also cause you to hit your maximum entity size - there will eventually be a hard limit on the number of activities per user. If you went with B, you can have a huge number of activities per user.
Edit: I forgot, you can have backwards references if you index your ListProperty, but then writing your User object would get expensive, and the limit on the number of indexed properties would cap the size of your list. So even though it's possible, B is still preferable if you need backwards references.
A will be a good deal faster because it is working purely with keys. Looking up objects with just keys goes straight to the data node in BigTable, whereas B requires a lookup on the indices first which is slower (and costs will go up with the number of Activity entities).
If you never need to test for ownership, you can modify A to not index the key list. This is definitely the cheapest and most efficient route. However, as I understand it, if you later need to index them, App Engine cannot retroactively update indexes on the key list. So only disable the index if you're certain you'll never need it.
How about C: setting the Activity's parent to the user's key? Then you can fetch a user's activities with Activity.query(ancestor=user.key).
That way you don't need additional keys/properties, and it's a good way to group your entities for the HR datastore.
I need to do transactions (begin, commit or rollback), locks (select for update).
How can I do it in a document model db?
Edit:
The case is this:
I want to run an auctions site.
And I'm thinking about how to handle direct purchases as well.
In a direct purchase I have to decrement the quantity field in the item record, but only if the quantity is greater than zero. That is why I need locks and transactions.
I don't know how to address that without locks and/or transactions.
Can I solve this with CouchDB?
No. CouchDB uses an "optimistic concurrency" model. In the simplest terms, this just means that you send a document version along with your update, and CouchDB rejects the change if the current document version doesn't match what you've sent.
It's deceptively simple, really. You can reframe many normal transaction based scenarios for CouchDB. You do need to sort of throw out your RDBMS domain knowledge when learning CouchDB, though. It's helpful to approach problems from a higher level, rather than attempting to mold Couch to a SQL based world.
Keeping track of inventory
The problem you outlined is primarily an inventory issue. If you have a document describing an item, and it includes a field for "quantity available", you can handle concurrency issues like this:
Retrieve the document, take note of the _rev property that CouchDB sends along
Decrement the quantity field, if it's greater than zero
Send the updated document back, using the _rev property
If the _rev matches the currently stored number, be done!
If there's a conflict (when _rev doesn't match), retrieve the newest document version
In this instance, there are two possible failure scenarios to think about. If the most recent document version has a quantity of 0, you handle it just like you would in a RDBMS and alert the user that they can't actually buy what they wanted to purchase. If the most recent document version has a quantity greater than 0, you simply repeat the operation with the updated data, and start back at the beginning. This forces you to do a bit more work than an RDBMS would, and could get a little annoying if there are frequent, conflicting updates.
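A rough sketch of that retry flow against CouchDB's HTTP API (the database name and fields are illustrative, and Node 18+ is assumed for the global fetch):
const STORE = 'http://localhost:5984/shop';   // assumed database name

async function buyOne(itemId) {
  while (true) {
    const doc = await (await fetch(STORE + '/' + itemId)).json();   // carries its _rev
    if (doc.quantity <= 0) return false;       // sold out, tell the user

    doc.quantity -= 1;
    const res = await fetch(STORE + '/' + itemId, {
      method: 'PUT',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(doc),               // _rev travels with the doc
    });
    if (res.ok) return true;                   // our _rev matched, purchase done
    if (res.status !== 409) throw new Error('unexpected status ' + res.status);
    // 409 conflict: someone updated the doc first; loop and retry with fresh data
  }
}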
Now, the answer I just gave presupposes that you're going to do things in CouchDB in much the same way that you would in an RDBMS. I might approach this problem a bit differently:
I'd start with a "master product" document that includes all the descriptor data (name, picture, description, price, etc). Then I'd add an "inventory ticket" document for each specific instance, with fields for product_key and claimed_by. If you're selling a model of hammer, and have 20 of them to sell, you might have documents with keys like hammer-1, hammer-2, etc, to represent each available hammer.
Then, I'd create a view that gives me a list of available hammers, with a reduce function that lets me see a "total". These are completely off the cuff, but should give you an idea of what a working view would look like.
Map
function (doc) {
  if (doc.type == 'inventory_ticket' && doc.claimed_by == null) {
    emit(doc.product_key, { 'inventory_ticket': doc._id, '_rev': doc._rev });
  }
}
This gives me a list of available "tickets", by product key. I could grab a group of these when someone wants to buy a hammer, then iterate through sending updates (using the id and _rev) until I successfully claim one (previously claimed tickets will result in an update error).
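Very roughly, that claiming loop could look like this (database, design document, and view names are assumptions):
const SHOP = 'http://localhost:5984/shop';

async function claimTicket(productKey, userId) {
  // Grab a handful of candidate tickets for this product from the view above
  const view = await fetch(SHOP + '/_design/shop/_view/available_tickets'
      + '?limit=10&key=' + encodeURIComponent(JSON.stringify(productKey)));
  const { rows } = await view.json();

  for (const row of rows) {
    // Fetch the ticket fresh, mark it claimed, and try to write it back
    const ticket = await (await fetch(SHOP + '/' + row.value.inventory_ticket)).json();
    if (ticket.claimed_by) continue;           // taken since the view was built
    ticket.claimed_by = userId;
    const put = await fetch(SHOP + '/' + ticket._id, {
      method: 'PUT',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(ticket),
    });
    if (put.ok) return ticket;                 // claimed it
    // 409 conflict: someone else claimed this ticket first, move on to the next one
  }
  return null;                                 // nothing available right now
}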
Reduce
function (keys, values, rereduce) {
  // On rereduce the values are partial counts, so sum them instead of counting rows
  if (rereduce) return values.reduce(function (a, b) { return a + b; }, 0);
  return values.length;
}
This reduce function simply returns the total number of unclaimed inventory_ticket items, so you can tell how many "hammers" are available for purchase.
Caveats
This solution represents roughly 3.5 minutes of total thinking for the particular problem you've presented. There may be better ways of doing this! That said, it does substantially reduce conflicting updates, and cuts down on the need to respond to a conflict with a new update. Under this model, you won't have multiple users attempting to change data in the primary product entry. At the very worst, you'll have multiple users attempting to claim a single ticket, and if you've grabbed several of those from your view, you simply move on to the next ticket and try again.
Reference: https://wiki.apache.org/couchdb/Frequently_asked_questions#How_do_I_use_transactions_with_CouchDB.3F
Expanding on MrKurt's answer: for lots of scenarios you don't need to have stock tickets redeemed in order. Instead of selecting the first ticket, you can select randomly from the remaining tickets. Given a large number of tickets and a large number of concurrent requests, you will get much less contention on those tickets than if everyone tries to get the first ticket.
A design pattern for RESTful transactions is to create a "tension" in the system. For the popular example use case of a bank account transfer, you must ensure you update the totals for both involved accounts:
Create a transaction document "transfer USD 10 from account 11223 to account 88733". This creates the tension in the system.
To resolve any tension, scan all transaction documents and:
If the source account is not updated yet, update the source account (-10 USD)
If the source account was updated but the transaction document does not show this, update the transaction document (e.g. set a flag "sourcedone" in the document)
If the target account is not updated yet, update the target account (+10 USD)
If the target account was updated but the transaction document does not show this, update the transaction document
If both accounts have been updated you can delete the transaction document or keep it for auditing.
The scanning for tension should be done in a backend process across all "tension documents", to keep the periods of tension in the system short. In the above example there will be a short, anticipated inconsistency while the first account has been updated but the second has not yet. This must be taken into account the same way you deal with eventual consistency if your CouchDB is distributed.
Another possible implementation avoids the need for transactions completely: just store the tension documents and evaluate the state of your system by evaluating every involved tension document. In the example above, this would mean that the total for an account is determined only as the sum of the values in the transaction documents where that account is involved. In CouchDB you can model this very nicely as a map/reduce view.
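As a sketch, a tension document and the balance view could look like this (the document shape and field names are made up for illustration):
// A tension (transfer) document
{
  "_id": "txn-0001",
  "type": "transfer",
  "from": "11223",
  "to": "88733",
  "amount": 10
}

// Map: one row per affected account, negative for the source, positive for
// the target; pair this with the built-in _sum reduce
function (doc) {
  if (doc.type === 'transfer') {
    emit(doc.from, -doc.amount);
    emit(doc.to, doc.amount);
  }
}
// Querying the view with ?group=true then returns the balance per account.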
No, CouchDB is not generally suitable for transactional applications because it doesn't support atomic operations in a clustered/replicated environment.
CouchDB sacrificed transactional capability in favor of scalability. In order to have atomic operations you need a central coordination system, which limits your scalability.
If you can guarantee you only have one CouchDB instance, or that everyone modifying a particular document connects to the same CouchDB instance, then you could use the conflict detection system to create a sort of atomicity using the methods described above. But if you later scale up to a cluster or use a hosted service like Cloudant, it will break down and you'll have to redo that part of the system.
So, my suggestion would be to use something other than CouchDB for your account balances, it will be much easier that way.
As a response to the OP's problem, Couch is probably not the best choice here. Using views is a great way to keep track of inventory, but clamping to 0 is more or less impossible. The problem is the race condition between reading the result of a view, deciding you're OK to use a "hammer-1" item, and then writing a doc to use it. There's no atomic way to write the doc that uses the hammer only if the view says there are > 0 hammer-1's. If 100 users all query the view at the same time and see 1 hammer-1, they can all write a doc to use a hammer-1, resulting in -99 hammer-1's. In practice the race window will be fairly small, really small if your DB is running on localhost. But once you scale and have an off-site DB server or cluster, the problem will get much more noticeable. Regardless, it's unacceptable to have a race condition of that sort in a critical, money-related system.
An update to MrKurt's response (it may just be dated, or he may have been unaware of some CouchDB features)
A view is a good way to handle things like balances / inventories in CouchDB.
You don't need to emit the docid and rev in a view. You get both of those for free when you retrieve view results. Emitting them - especially in a verbose format like a dictionary - will just grow your view unnecessarily large.
A simple view for tracking inventory balances should look more like this (also off the top of my head)
function (doc) {
  if (doc.InventoryChange != undefined) {
    for (var product_key in doc.InventoryChange) {
      // Emit the change amount so the _sum reduce yields the running total
      emit(product_key, doc.InventoryChange[product_key]);
    }
  }
}
And the reduce function is even simpler:
_sum
This uses a built in reduce function that just sums the values of all rows with matching keys.
In this view, any doc can have a member "InventoryChange" that maps product keys to a change in their total inventory. For example:
{
  "_id": "abc123",
  "InventoryChange": {
    "hammer_1234": 10,
    "saw_4321": 25
  }
}
Would add 10 hammer_1234's and 25 saw_4321's.
{
  "_id": "def456",
  "InventoryChange": {
    "hammer_1234": -5
  }
}
Would burn 5 hammers from the inventory.
With this model, you're never updating any data, only appending. This means there's no opportunity for update conflicts. All the transactional issues of updating data go away :)
Another nice thing about this model is that ANY document in the DB can both add and subtract items from the inventory. These documents can have all kinds of other data in them. You might have a "Shipment" document with a bunch of data about the date and time received, warehouse, receiving employee etc. and as long as that doc defines an InventoryChange, it'll update the inventory. As could a "Sale" doc, and a "DamagedItem" doc etc. Looking at each document, they read very clearly. And the view handles all the hard work.
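To read the current balance of one product you would then query the view grouped by key. Here is a small sketch, where the database, design document, and view names are assumptions:
// Assumed names: database "inventory", design doc "stock", view "balance"
// (the map above plus the built-in _sum reduce)
fetch('http://localhost:5984/inventory/_design/stock/_view/balance'
    + '?group=true&key=' + encodeURIComponent('"hammer_1234"'))
  .then(function (res) { return res.json(); })
  .then(function (body) {
    // body.rows -> [ { key: "hammer_1234", value: 5 } ]   (10 received minus 5 burned)
  });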
Actually, you can in a way. Have a look at the HTTP Document API and scroll down to the heading "Modify Multiple Documents With a Single Request".
Basically you can create/update/delete a bunch of documents in a single post request to URI /{dbname}/_bulk_docs and they will either all succeed or all fail. The document does caution that this behaviour may change in the future, though.
EDIT: As predicted, as of version 0.9 bulk docs no longer work this way.
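For reference, the request shape itself hasn't changed, only the all-or-nothing semantics: each document in the batch now succeeds or fails on its own. A minimal sketch ("mydb" and the docs are placeholders):
fetch('http://localhost:5984/mydb/_bulk_docs', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    docs: [
      { _id: 'job-1', status: 'queued' },                 // create
      { _id: 'job-2', _rev: '1-abc123', status: 'done' }, // update
      { _id: 'job-3', _rev: '1-def456', _deleted: true }  // delete
    ]
  })
});
// The response is an array with one { id, rev } or { id, error } entry per doc.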
Just use a lightweight solution such as SQLite for the transactions, and when a transaction completes successfully, replicate it and mark it as replicated in SQLite.
SQLite table
txn_id        txn_attribute1   txn_attribute2   ...   txn_status
dhwdhwu$sg1   x                y                ...   added/replicated
You can also delete the transactions which are replicated successfully.