I'm investigating the suitability of ArangoDB for a specific use case:
I have a relatively high number of root documents.
Each document represents the top of a hierarchy.
The hierarchies are traversed as graphs.
Links between levels in the hierarchy are established through a combination of arrays of embedded documents and arrays of IDs that point to other documents.
I need to be able to push IDs onto arrays and delete IDs from them.
I need to be able to add and remove embedded documents.
My questions:
Is ArangoDB able to update embedded documents without updating the entire container document?
Does it have a mechanism to address individual items in arrays, so that pushing an item onto the end or deleting an item can be done efficiently (i.e. without degrading to something like O(n))?
I have looked in the documentation and searched online, but couldn't find clear answers to these questions.
To answer your questions:
1) There is no in-place updating of documents in ArangoDB. When updating a document, ArangoDB stores a new, updated version of the original document. The new version is self-contained, meaning it contains the entire (updated) container document. The old version is still kept around because other currently running operations may reference it; outdated versions of documents are eventually deleted.
2) As follows from answer 1, pushing a value onto an array or deleting a value from an array builds a new self-contained version of the entire document. Pushing or deleting an array value therefore takes as long as constructing the entire document, which is proportional to the document's size (i.e. the more array values, the longer it takes).
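That said, you can at least express the push as a single server-side operation, so the array never makes a round trip to the client. A minimal sketch with arangojs, assuming a hypothetical "hierarchies" collection with a "childIds" array:

    const { Database, aql } = require("arangojs");

    const db = new Database({ url: "http://localhost:8529" });

    // PUSH still rewrites the whole document server-side (see answer 1),
    // but the array itself never travels over the wire.
    async function pushChildId(docKey, childId) {
      await db.query(aql`
        FOR doc IN hierarchies
          FILTER doc._key == ${docKey}
          UPDATE doc WITH { childIds: PUSH(doc.childIds, ${childId}) } IN hierarchies
      `);
    }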
Related
Just a question regarding NoSQL DBs. As far as I know, operations are done by the app/website outside the DB. For instance, if I need to add a value to a list, I need to:
download the initial list,
add the new value to the list on my device,
upload the whole updated list.
In the end, a lot of data travels (twice the initial list) with no added value.
Is there any way to request directly the DB for simple operations like this?
db.collection("collection_key").document("document_key").add("mylist", value)
Or simply increment a field?
The same goes for knowing the number of documents in a collection: is it necessary to download the whole set of documents just to get the count?
A couple of different answers:
In Firestore, many intrinsic operations can be done with "FieldValue" operations, such as increment/decrement (by a supplied value, so really add/subtract), array unions, field deletes, etc. Just search the documentation for FieldValue. Whether this is true for NoSQL in general, I can't say.
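For illustration, a minimal sketch with the Firebase v8-style JavaScript SDK, reusing the hypothetical names from the question:

    const firebase = require("firebase/app");
    require("firebase/firestore");

    const doc = firebase.firestore()
      .collection("collection_key")
      .doc("document_key");

    // Both operations are applied server-side; the list/counter never
    // has to travel to the client.
    doc.update({
      mylist: firebase.firestore.FieldValue.arrayUnion("value"), // append if absent
      counter: firebase.firestore.FieldValue.increment(1),       // atomic add
    });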
Knowing the number of documents, on the other hand, is not trivially done in Firestore - but frankly, I can't think of any situations other than artificially contrived examples where you would need to know. It's easy enough to set up a way to "count" documents as you create/delete them, and keep that count separately, if for some reason you find yourself needing it.
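One hedged sketch of such a counter, kept in a hypothetical meta/stats document and bumped atomically alongside each create:

    // The "meta"/"stats" and "items" names are hypothetical.
    const stats = firebase.firestore().collection("meta").doc("stats");

    async function createItem(data) {
      const batch = firebase.firestore().batch();
      batch.set(firebase.firestore().collection("items").doc(), data);
      batch.update(stats, { itemCount: firebase.firestore.FieldValue.increment(1) });
      await batch.commit(); // both writes succeed or fail together
    }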
Or were you just trying to generically put down NoSQL as a concept?
I'm currently trying to figure out if Solr is the right tool for me. I have the following setup:
There is the primary document type "blog". Then there are two additional document types "user" and "category". Both of these are parents of the "blog" document type.
Now when searching the "blog" documents, I not only want to search those fields (e.g. title and content), but also the parent fields (user>name and category>name).
Of course, I could just flatten everything down to a single document for Solr, which would ease the search a lot. The downside, though, is that when e.g. a user updates their name, I have to run through all of their blog posts and update those documents in Solr, instead of just updating a single document.
This becomes even worse when the user has another parent, on which I need to search as well.
Do you have any recommendations for handling this use case? Maybe my Google-fu is just not good enough, but what I found (block joins, etc.) doesn't seem to do the trick.
The most performant and by far the easiest solution is to flatten everything into a single document. It turns out that these relations aren't updated as often as people think, and that searches are performed far more often than the documents are updated. And even when a value that is identical across a large set of documents changes, reindexing from the most recent documents (for a blog) and working backwards will appear rather performant to most users. This assumes that you actually have to search those values and don't merely need them for display - in which case you could look them up from secondary storage when rendering an item (and store only the never-changing ID in the document).
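To make the flattening concrete, here is a hedged sketch of what a denormalized document might look like, posted to Solr's JSON update API (the field names and the "blogs" core are hypothetical):

    async function reindexDocs(updatedDocs) {
      await fetch("http://localhost:8983/solr/blogs/update?commit=true", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(updatedDocs),
      });
    }

    // When a user renames themselves, re-post their blog documents with
    // the new user_name, most recent first.
    reindexDocs([{
      id: "post-123",
      title: "My post title",
      content: "Post body text",
      user_name: "alice",     // denormalized from the parent user
      category_name: "tech",  // denormalized from the parent category
    }]);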
Another option is to divide this into a multi-search problem: one collection for blog posts, one for users and one for categories. You then search each collection for the relevant data and merge the results in your search model. You can also use Streaming Expressions to hand off most of this processing to the Solr cluster.
The reason I always recommend flattening where possible is that most features in Solr (and Lucene) are written for a flat document structure, which lets you fully leverage everything that's available. Since Lucene is by design a flat document store, most other features require special care to work with block joins and parent/child relationships, and you end up experimenting a lot to get the queries and feature set you want (if that's possible at all). If the documents are flat, it just works.
I’m building what can be treated as a slideshow app with CouchDB/PouchDB: each “slide” is its own Couch document, and slides can be reordered or deleted, and new slides can be added in between existing slides or at the beginning or end of the slideshow. A slideshow could grow from one to ≲10,000 slides, so I am sensitive to space- and time-efficiency.
I made the slide creation/editing functionality first, completely underestimating how tricky it is to keep track of slide ordering. This is hard because the order of each slide-document is completely independent of the slide-doc itself, i.e., it’s not something I can sort by time or some number contained in the document. I see numerous questions on StackOverflow about how to keep track of ordering in relational databases:
Efficient way to store reorderable items in a database
What would be the best way to store records order in SQL
How can I reorder rows in sql database
Storing item positions (for ordering) in a database efficiently
How to keep ordering of records in a database table
Linked List in SQL
but all of these involve one of the following:
using a floating-point secondary key for reordering/creation/deletion, with periodic normalization of the indexes (i.e., imagine two documents with order-indexes 1.0 and 2.0; a third document inserted between them gets key 1.5, a fourth gets 1.25, …, until after ~31 documents inserted in between you run into floating-point accuracy problems);
a linked list approach where a slide-document has a previous and next field containing the primary key of the documents on either side of it;
a very straightforward approach of updating all documents for each document reordering/insertion/deletion.
None of these are appropriate for CouchDB: #1 incurs a huge amount of incidental complexity in SQL or CouchDB. #2 is unreliable due to the lack of atomic transactions (CouchDB might update the previous document with its new next, but another client might have updated the new next document in the meantime, so updating the new next document will fail with a 409 conflict, leaving your linked list in an inconsistent state). For the same reason, #3 is completely unworkable.
One CouchDB-oriented approach I’m evaluating would create a document that just contains the ordering of the slides: it might contain a primary-key-to-order-number hash object as well as an array that converts order-number-to-primary-key, and just update this object when slides are reordered/inserted/deleted. The downside to this is that Couch will keep a copy of this potentially large document for every order change (reorder/insert/delete)—CouchDB doesn’t support compacting just a single document, and I don’t want to run compaction on my entire database since I love preserving the history of each slide-document. Another downside is that after thousands of slides, each change to ordering involves transmitting the entire object (hundreds of kilobytes) from PouchDB/client to Couch.
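For concreteness, a minimal sketch of that ordering-document approach in PouchDB (the "ordering" _id and "slideIds" field are hypothetical names):

    const PouchDB = require("pouchdb");
    const db = new PouchDB("slideshows");

    async function moveSlide(fromIdx, toIdx) {
      const ordering = await db.get("ordering"); // { _id, _rev, slideIds: [...] }
      const [moved] = ordering.slideIds.splice(fromIdx, 1);
      ordering.slideIds.splice(toIdx, 0, moved);
      await db.put(ordering); // rewrites the whole doc; a 409 means fetch & retry
    }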
A tweak to this approach would be to make a second database just to hold this ordering document and turn on auto-compaction on it. It’ll be more work to keep track of two database connections, and I’ll eventually have to put a lot of data down the wire, but I’ll have a robust way to order documents in CouchDB.
So my questions are: how do CouchDB people usually store the order of documents? And can more experienced CouchDB people see any flaws in my approach outlined above?
Thanks to a tip by @LynHeadley, I wound up writing a library that can subdivide the lexicographical interval between strings: Mudder.js. This lets me insert and move documents around in CouchDB indefinitely, by creating new keys at will, without the overhead of a secondary document to store the ordering. I think this is the right way to solve this problem!
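To illustrate the core idea (a simplified sketch, not Mudder.js's actual API): given two keys with room between them, you can always mint a new key that sorts strictly in between, here over the alphabet a–z:

    // Assumes lo < hi and that some string strictly between them exists.
    function midpoint(lo, hi) {
      let mid = "";
      for (let i = 0; ; i++) {
        const a = i < lo.length ? lo.charCodeAt(i) : 96;  // 96 = just below "a"
        const b = i < hi.length ? hi.charCodeAt(i) : 123; // 123 = just above "z"
        if (b - a > 1) return mid + String.fromCharCode(Math.floor((a + b) / 2));
        mid += String.fromCharCode(a === 96 ? 97 : a);    // no room yet; copy and carry on
      }
    }

    midpoint("a", "b"); // "am", which sorts between "a" and "b" with room on both sides

Unlike floating-point keys, strings can grow a character whenever needed, so you never run out of precision; keys just get gradually longer under pathological insertion orders.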
Based on what I've read, I would choose the "ordering document" approach (i.e. a slideshow document that holds an array of IDs, one per slide document). This is really straightforward and accomplishes the use case, so I wouldn't let these concerns get in the way of clean/intuitive code.
You are right that this document can grow potentially very large, compounded by the write-heavy nature of that specific document. This is why compaction exists and is the solution here, so you should not fight against CouchDB on this point.
It is a common misconception that you can use CouchDB's revision history to keep a comprehensive history of your database. The revisions are merely there to aid in write concurrency; CouchDB is not a full version-control system.
CouchDB has auto-compaction enabled by default, and without it your database would grow in size unchecked. You should therefore abandon the idea of tracking document history this way and adopt another, safer alternative. (A list of those alternatives is beyond the scope of this answer.)
Does anyone know of any databases (SQL or NoSQL) that have native support for position based indexes?
To clarify: on many occasions I've had to maintain a position-based collection, where the order or position is maintained by an external entity (a user, an external service, etc). By "maintained" I mean the order of the items in the collection changes quite often but is not based on any data field in the records; the order is completely arbitrary as far as the service maintaining the collection is concerned. The service needs to provide an interface that allows CRUD operations by position (insert after position X, delete at position Y, etc) as well as position manipulation (move from position X to position Y).
I'm aware there are workarounds to achieve this - I've implemented many myself - but this seems like a pretty fundamental way to want to index data, so I can't help but feel there must be an off-the-shelf solution out there.
The only thing I've seen that comes close is Redis's List data type, which, while ordered by position, is pretty limited (compared to a table with multiple indexes), and Redis is better suited as a cache than as a persistent data store.
Finally, I'm asking because I have a requirement for user-ordered collections that could contain tens of thousands of records.
In case it helps anyone, the best approximation of this I've found so far is to implement a linked-list structure in a graph database (like Neo4j). Maintaining the item links is considerably easier than maintaining a position column (especially if you only need next links, i.e. the list isn't doubly linked): there is no need to leave holes, re-index, etc; you only have to move pointers (or relations). Performance is pretty good, but reads slow down linearly when accessing items towards the end of the list by position, as you have to scan (SKIP) the whole list from start to end.
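A hedged sketch of the pointer swap with the official neo4j-driver (the Item label, NEXT relationship and id property are hypothetical):

    const neo4j = require("neo4j-driver");
    const driver = neo4j.driver("bolt://localhost:7687",
                                neo4j.auth.basic("neo4j", "password"));

    // Insert a new item after prevId: only two pointers change, and no
    // positions need re-numbering.
    async function insertAfter(prevId, newId) {
      const session = driver.session();
      try {
        await session.run(
          `MATCH (prev:Item {id: $prevId})
           OPTIONAL MATCH (prev)-[r:NEXT]->(next)
           DELETE r
           CREATE (prev)-[:NEXT]->(new:Item {id: $newId})
           FOREACH (n IN CASE WHEN next IS NULL THEN [] ELSE [next] END |
             CREATE (new)-[:NEXT]->(n))`,
          { prevId, newId }
        );
      } finally {
        await session.close();
      }
    }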
I'm coding a new NoSQL database, and had what I thought was a novel idea (for me anyways) regarding the hashing mechanism used to locate nodes for a given key.
I'm using object keys that incorporate a timestamp. A hash will be used to determine the node(s) holding the data. Pretty common so far.
The (possible) twist lies in that a map will record the times at which nodes have been added to the cluster. That way I can determine for any given object which nodes were present in the cluster when that object was added (and therefore which nodes hold the object's data).
I'm thinking that this way, growing the cluster won't require any data to be transferred. Objects always live on the same node... forever.
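To make the idea concrete, a minimal sketch under stated assumptions (the join-history shape, node names, and the plain modulo hash are all illustrative):

    // Membership snapshots, sorted by join time.
    const joinHistory = [
      { since: 0,          nodes: ["n1", "n2"] },
      { since: 1700000000, nodes: ["n1", "n2", "n3"] }, // n3 joined later
    ];

    function nodeForKey(key, timestamp) {
      // Latest membership snapshot no newer than the object's timestamp.
      const snapshot = [...joinHistory].reverse().find(h => h.since <= timestamp);
      const hash = [...key].reduce((h, c) => (h * 31 + c.charCodeAt(0)) >>> 0, 0);
      return snapshot.nodes[hash % snapshot.nodes.length];
    }

    // An object written before n3 joined keeps resolving to n1 or n2, forever.
    nodeForKey("user:42", 1600000000);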
Has anyone tried something like this? Any potential problems that anyone can foresee?