Maintain versions of a tree - database

I have a tree data structure with up to 1000 nodes at a given level (max depth of 8-9 levels).
I need to maintain versions of this entire tree. A version is created after some processing happens. Between versions, the data in the nodes may change (in not more than 100 or so nodes).
Until now, I was cloning the entire tree for each new version, but the space consumption is huge after a few versions. I cannot simply delete the previous version records, since I need to keep track of the changes.
What's the optimal way to store these versions in the database? (If not the database, is there an alternative?)

This is not a very straightforward problem, but it is a solved problem. In general, data structures that remember their history are called persistent data structures.
The linked Wikipedia page has an example of a persistent tree that you should look at.
The path copying approach is fairly simple to implement, though its performance is not the best achievable.
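
To make path copying concrete, here is a minimal TypeScript sketch (the names are mine, not from any library): an update allocates new nodes only along the root-to-key path and shares every untouched subtree with the previous version.

// A persistent (immutable) binary search tree using path copying.
interface TreeNode {
  readonly key: number;
  readonly value: string;
  readonly left: TreeNode | null;
  readonly right: TreeNode | null;
}

function insert(root: TreeNode | null, key: number, value: string): TreeNode {
  if (root === null) return { key, value, left: null, right: null };
  if (key < root.key) {
    // copy this node; root.right is shared unchanged
    return { ...root, left: insert(root.left, key, value) };
  } else if (key > root.key) {
    return { ...root, right: insert(root.right, key, value) };
  }
  return { ...root, value }; // replace the value, share both subtrees
}

// Each version is just a root pointer; an update costs O(depth) new nodes.
const v0 = insert(null, 5, "five");
const v1 = insert(v0, 3, "three"); // v0 is still fully intact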

Keep each unique node permanently frozen in one table (once you insert a node there, never edit or delete it). If you need to change a node even slightly, insert the modified node as a new row. Then keep track of your tree versions with foreign keys into the node table. Since unchanged nodes are shared, each version should require only a trivial amount of extra space.
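
A sketch of how the two tables could relate (all names are invented for illustration):

// Sketch: the node table is append-only, and a version is nothing
// more than a pointer into it.
interface NodeRow {
  readonly id: number;                  // primary key, frozen once inserted
  readonly payload: string;             // the node's data
  readonly childIds: readonly number[]; // foreign keys into this same table
}

interface VersionRow {
  readonly versionId: number;
  readonly rootId: number;              // foreign key to NodeRow.id
}

// "Changing" a node means inserting a modified copy of it, plus new
// copies of its ancestors so the new version's root can reach it;
// every untouched subtree is shared between versions.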

Store each "version" as a map between the nodes that are changed and the old/new values.
You can reconstruct any previous version by inverting the sequence of operations.
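
A minimal sketch of that representation, assuming node values are strings (the shape of Change is an assumption):

// One version step = the set of node-level changes it introduced.
interface Change {
  nodeId: number;
  oldValue: string | null; // null = node did not exist before
  newValue: string | null; // null = node was deleted
}
type VersionDiff = Change[];

// Roll the current state back across one version step by
// re-applying the old values (the inverse of the step).
function revert(state: Map<number, string>, diff: VersionDiff): void {
  for (const c of diff) {
    if (c.oldValue === null) state.delete(c.nodeId);
    else state.set(c.nodeId, c.oldValue);
  }
}

// To reach version k from the current version n, call revert()
// with the diffs for steps n, n-1, ..., k+1 in that order.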

A possible practical solution might be:
If you need to be able to restore an old version: serialize the new tree and the previous tree into XML, diff the new XML against the previous XML, and store the difference in the database. (For a Java solution I found http://diffxml.sourceforge.net and XMLUnit, but it has to be checked whether they can compute the difference between the new XML and the previous XML in a way that permits easily restoring the previous XML from the new one.)
Each time an old version is needed, successively take the differences from the database, from the most recent to the most remote, and apply them in this order to the (XML-serialized) current tree, arriving at the XML form of the old version of the tree.
If you do not need to reconstruct old versions, then simply use XMLUnit to compute the differences and store their serializations in the database.
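
A sketch of that reconstruction loop in TypeScript; applyXmlPatch stands in for whatever the chosen diff library provides and is purely hypothetical:

// diffs[0] is the most recent patch, diffs[diffs.length - 1] the most
// remote. Each patch turns a tree's XML into the previous version's XML.
declare function applyXmlPatch(xml: string, patch: string): string; // hypothetical

function reconstruct(currentXml: string, diffs: string[], stepsBack: number): string {
  let xml = currentXml;
  for (let i = 0; i < stepsBack; i++) {
    xml = applyXmlPatch(xml, diffs[i]); // newest to oldest
  }
  return xml;
}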

Since you care about previous versions of the tree and space is your primary concern, and assuming that from one version to the next the trees are not entirely different, you can store only the difference between the trees. How to do that is entirely up to you:
- you can do an in-/pre-/post-order traversal of the tree (assuming it's binary) and come up with a scheme for getting from the location of one difference to the next, as sketched below;
- or use a linked list to store only the differences, plus some logic to rebuild the trees.
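
For example, a parallel preorder walk (a sketch; the type names are illustrative) that records only the positions where two versions of a binary tree disagree:

interface BTree {
  value: string;
  left?: BTree;
  right?: BTree;
}

interface Difference {
  path: string;       // e.g. "LRL" = left, right, left from the root
  oldValue?: string;  // undefined = node absent in the old tree
  newValue?: string;  // undefined = node absent in the new tree
}

// Walk both trees in preorder and record only the nodes that differ.
function diff(oldT?: BTree, newT?: BTree, path = "", out: Difference[] = []): Difference[] {
  if (!oldT && !newT) return out;
  if (!oldT || !newT || oldT.value !== newT.value) {
    out.push({ path, oldValue: oldT?.value, newValue: newT?.value });
  }
  diff(oldT?.left, newT?.left, path + "L", out);
  diff(oldT?.right, newT?.right, path + "R", out);
  return out;
}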

Related

How do databases "update" records?

As far as I can tell, it is not really possible to "update" a single portion of a file. One must overwrite the entire thing or simply append. A database, however, usually has update functionality. How would one design a database to not append - because that causes tombstones - but rather update?
Files can be overwritten; it just can be a bit of a tedious process. You have to know the byte offset of whatever you want to update, and seek the file pointer to that index before starting to write.
Databases are easier to update because they are a combination of many data structures (linked lists, trees, heaps, etc.) that all contain specific data and can be iterated through. For these data structures you just need to know which node you need to update, navigate to it, and overwrite the data.
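
For example, in Node.js you can overwrite bytes in place at a known offset (the file name and offset here are made up):

import { openSync, writeSync, closeSync } from "node:fs";

// Overwrite bytes in place: seek to a known offset, write, done.
// The record being replaced must not change length, or you will
// clobber whatever follows it (which is exactly why many databases
// prefer fixed-size pages or append-only logs).
const fd = openSync("records.db", "r+"); // open existing file for read/write
const updated = Buffer.from("new-value");
const offset = 4096;                      // byte index of the record
writeSync(fd, updated, 0, updated.length, offset);
closeSync(fd);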

Arbitrary document ordering in CouchDB/PouchDB

I’m building what can be treated as a slideshow app with CouchDB/PouchDB: each “slide” is its own Couch document, and slides can be reordered or deleted, and new slides can be added in between existing slides or at the beginning or end of the slideshow. A slideshow could grow from one to ≲10,000 slides, so I am sensitive to space- and time-efficiency.
I made the slide creation/editing functionality first, completely underestimating how tricky it is to keep track of slide ordering. This is hard because the order of each slide-document is completely independent of the slide-doc itself, i.e., it’s not something I can sort by time or some number contained in the document. I see numerous questions on StackOverflow about how to keep track of ordering in relational databases:
Efficient way to store reorderable items in a database
What would be the best way to store records order in SQL
How can I reorder rows in sql database
Storing item positions (for ordering) in a database efficiently
How to keep ordering of records in a database table
Linked List in SQL
but all these involve either
using a floating-point secondary key for reordering/creation/deletion, with periodic normalization of indexes (i.e., imagine two documents are order-index 1.0 and 2.0, then a third document in between gets key 1.5, then a fourth gets 1.25, …, until ~31 docs are inserted in between and you get floating-point accuracy problems);
a linked list approach where a slide-document has a previous and next field containing the primary key of the documents on either side of it;
a very straightforward approach of updating all documents for each document reordering/insertion/deletion.
None of these are appropriate for CouchDB: #1 incurs a huge amount of incidental complexity in SQL or CouchDB. #2 is unreliable due to lack of atomic transactions (CouchDB might update the previous document with its new next but another client might have updated the new next document meanwhile, so updating the new next document will fail with 409, and your linked list is left in an inconsistent state). For the same reason, #3 is completely unworkable.
One CouchDB-oriented approach I’m evaluating would create a document that just contains the ordering of the slides: it might contain a primary-key-to-order-number hash object as well as an array that converts order-number-to-primary-key, and just update this object when slides are reordered/inserted/deleted. The downside to this is that Couch will keep a copy of this potentially large document for every order change (reorder/insert/delete)—CouchDB doesn’t support compacting just a single document, and I don’t want to run compaction on my entire database since I love preserving the history of each slide-document. Another downside is that after thousands of slides, each change to ordering involves transmitting the entire object (hundreds of kilobytes) from PouchDB/client to Couch.
A tweak to this approach would be to make a second database just to hold this ordering document and turn on auto-compaction on it. It’ll be more work to keep track of two database connections, and I’ll eventually have to put a lot of data down the wire, but I’ll have a robust way to order documents in CouchDB.
So my questions are: how do CouchDB people usually store the order of documents? And can more experienced CouchDB people see any flaws in my approach outlined above?
Thanks to a tip by @LynHeadley, I wound up writing a library that can subdivide the lexicographical interval between strings: Mudder.js. This allows me to insert and reorder documents in CouchDB indefinitely, by creating new keys at will, without the overhead of a secondary document to store the ordering. I think this is the right way to solve this problem!
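
The core idea, as a simplified TypeScript sketch (this is not Mudder.js's actual algorithm or API), is that between any two strings there is always room for another:

const MIN = "a".charCodeAt(0); // 97
const MAX = "z".charCodeAt(0); // 122

// Return a key strictly between a and b (assumes a < b). Simplified:
// real libraries such as Mudder.js also handle edge cases like keys
// ending in the minimum character.
function keyBetween(a: string, b: string): string {
  let prefix = "";
  for (let i = 0; ; i++) {
    const ca = i < a.length ? a.charCodeAt(i) : MIN - 1; // below "a" when exhausted
    const cb = i < b.length ? b.charCodeAt(i) : MAX + 1; // above "z" when exhausted
    if (cb - ca > 1) {
      // room for a character strictly between the two
      return prefix + String.fromCharCode(Math.floor((ca + cb) / 2));
    }
    prefix += String.fromCharCode(Math.max(ca, MIN)); // carry this position along
  }
}

const k1 = keyBetween("a", "b");   // "am"  : "a" < "am" < "b"
const k2 = keyBetween("am", "an"); // "amm" : "am" < "amm" < "an"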
Based on what I've read, I would choose the "ordering document" approach (i.e., a slideshow document that holds an array of ids, one per slide document). This is really straightforward and accomplishes the use case, so I wouldn't let the concerns above get in the way of clean, intuitive code.
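
For example, the ordering document might look like this (field names are illustrative):

// One ordering document per slideshow; _id/_rev as usual in CouchDB.
interface SlideshowDoc {
  _id: string;
  _rev: string;
  slideIds: string[]; // slide order = array order; entries are slide doc _ids
}

// Reordering, inserting, or deleting a slide is then a single atomic
// update of this one document, so no multi-document consistency
// problems can arise.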
You are right that this document can grow very large, compounded by the write-heavy nature of that specific document. This is exactly why compaction exists, and it is the solution here, so you should not fight CouchDB on this point.
It is a common misconception that you can use CouchDB's revision history to keep a comprehensive history of your data. Revisions are merely there to aid in write concurrency, not to serve as a full version-control system.
CouchDB has auto-compaction enabled by default, and without it your database will grow in size unchecked. You should therefore abandon the idea of tracking document history through revisions and adopt another, safer alternative (a list of which is beyond the scope of this answer).

Data structure for large set of stepwise/incremental data and method to store it

I don't know if I don't know the correct terms or if what I'm looking for simply isn't a common structure, so please bear with me as I try to describe what I am looking for.
Right now I have a sorted set. It changes over time with simple modifications: a (k,v) pair is inserted or deleted, or the value of a specific key changes.
No actions are or ever will be executed on more than a single key.
What I need is a way to store each incremental version of the data set and have it be mapped to a point in time. I will need to access any portion of it quickly and be able to generate the exact sorted set that existed at that time, and how it changed over the time period.
It is not feasible to store the full sorted set after each mutation, as each copy is about 10 KB of data and there will be approximately 2-3 mutations per second on average. This is a personal project, so writing 2.5 GB of data per set (times 10-20 sets) per day is cost-prohibitive.
Now I have come up with a solution, and here lies my question: does this solution have a name? Is there a better way to do it?
If I have an initial dataset Orders, the next iteration of the data could be written as Orders + (K,V): instead of storing the entire set twice, I store the actual set once, and the second time as a reference plus the mutation.
Then, if I wanted to access Orders[n], I would iterate from Orders[0] to Orders[n], applying each mutation, and thereby regenerate the set as it existed at that point in time.
There is a big problem with this, however. I need to be able to quickly access any point in the data (roughly 250,000 iterations per day, times months or years), so it is not practical to compute the set from scratch when n is large. The obvious solution is to cache the resulting set at some interval, so that a given data point need not be recomputed all the way back to Orders[0], but only back to, say, Orders[1,500,000] to find the set that existed at Orders[1,500,100].
If I decided this was a good way to structure the data, how often should I cache results?
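
A sketch of the checkpoint-plus-deltas layout described above; the interval of 10,000 is an arbitrary illustration:

type SortedSet = Map<string, number>;

interface Mutation {
  key: string;
  value: number | null; // null = deletion
}

const CHECKPOINT_EVERY = 10_000;                   // tuning knob: space vs. rebuild time
const checkpoints = new Map<number, SortedSet>();  // index -> full snapshot
const log: Mutation[] = [];                        // one entry per iteration

// Rebuild the set as it existed after mutation n: start from the
// nearest earlier checkpoint, then replay at most CHECKPOINT_EVERY
// deltas. Assumes a snapshot was stored at index 0 and at every
// multiple of CHECKPOINT_EVERY.
function setAt(n: number): SortedSet {
  const base = n - (n % CHECKPOINT_EVERY);
  const state = new Map(checkpoints.get(base)!); // copy the snapshot
  for (let i = base; i < n; i++) {
    const m = log[i];
    if (m.value === null) state.delete(m.key);
    else state.set(m.key, m.value);
  }
  return state;
}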
Does something like this exist? In my research, a lot of sources said to use linked lists or binary trees. I don't need a tree, as my data is 100% continuous, not branching. So if I used a linked list, my confusion lies in how to actually store the data; this is where I am completely stuck. What would be the best database and schema to store this in? (I could use any system at this point, though having a Node.js wrapper would be ideal, as that is what serves the data to the front end.) Or would writing binary data work better?
Even something as simple as an actual term for what I'm looking for or an alternative data structure to research would be helpful. Thanks!
This sounds like an excellent use case for a persistent binary search tree. A persistent data structure is one where after performing an operation on the structure, you get back two structures - the one before the change and the one after the change. Crucially, the internal representations of the two structures share memory, so if you have a 10KB collection, it takes much less than 20KB to store the before and after snapshots.
Since you want a key/value store, a persistent binary search tree might be your best bet. Like a normal BST, all operations run in O(log n) time. You can then store an array of all the snapshots, giving you O(1) access to any time slice you want.
Hope this helps!
The data structures you are talking about are typically known as "persistent data structures" or sometimes "immutable data structures."

Novel idea for NoSQL node mapping based on time based ids?

I'm coding a new NoSQL database, and had what I thought was a novel idea (for me anyways) regarding the hashing mechanism used to locate nodes for a given key.
I'm using object keys that incorporate a timestamp. A hash will be used to determine the node(s) holding the data. Pretty common so far.
The (possible) twist lies in that a map will record the times at which nodes have been added to the cluster. That way I can determine for any given object which nodes were present in the cluster when that object was added (and therefore which nodes hold the object's data).
I'm thinking that in this way, growing the cluster won't require any data to be transferred. Objects always live on the same node, forever.
Has anyone tried something like this? Any potential problems that anyone can foresee?
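
A minimal sketch of the lookup being described (all names invented for illustration; assumes the epoch list is non-empty and sorted by start time):

// Membership log: which nodes were in the cluster as of each instant.
interface Epoch {
  since: number;   // ms timestamp when this membership took effect
  nodes: string[]; // node ids present from `since` onward
}

const epochs: Epoch[] = []; // append-only, sorted by `since`

// An object's key embeds its creation timestamp, so the node that
// owns it is decided by the membership as of that timestamp and
// never changes afterwards, even as the cluster grows.
function nodeFor(keyTimestamp: number, keyHash: number): string {
  // latest epoch that began at or before the key's timestamp
  let e = epochs[0];
  for (const epoch of epochs) {
    if (epoch.since <= keyTimestamp) e = epoch;
    else break;
  }
  return e.nodes[keyHash % e.nodes.length];
}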

Storing Composite Patterns (Hierarchical Data) in Database

What are 'best-practices' for saving Composite patterns in a Relational Database?
We have been using Modified Preorder Tree Traversal. Building the whole tree is very quick with it, but inserting or deleting nodes is very slow (all left and right values need to be adjusted). Querying the direct children of a node is also not easy and is slow.
Another thing we noticed is that you really have to make sure the tree doesn't get messy. You need transaction locks, otherwise the left and right values can get corrupted, and fixing a corrupted left/right tree is not an easy job.
Modified Preorder Tree Traversal does work very well, however; I was just wondering whether there are better alternatives.
While finding all descendants of a row with MPTT is fast, finding all children can be slow. However, you should be able to fix that by adding a parent_id field to your table that records (yes, redundantly) the parent of the row. Then the search becomes:
SELECT *
FROM tbl
WHERE parent_id = z
Yes, parent_id contains redundant information, potentially denormalizing your table -- but since any insert/update/delete already requires global changes, keeping parent_id up-to-date isn't much extra to pay. You could alternatively use a level field that records the vertical level of the row, although that is in fact more likely to change under certain types of transformations (e.g. moving a subtree to a different point in the tree).
The plain old link-to-parent representation (i.e. just having parent_id and no left_pos or right_pos) is of course faster for insert/update-heavy workloads, but the only queries it can answer efficiently are "Find the parent of X" and "Find the children of X." Most workloads involve much more reading than writing, so usually MPTT is faster overall -- but perhaps in your case you need to consider moving ("back") to link-to-parent?
The best way I have heard of to store hierarchical data in a database is to use a string attribute whose content is the list of ancestor ids separated by, say, colons.
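
This is often called a "materialized path". A quick sketch of how such path strings behave (names illustrative):

// Materialized path: each row stores its full ancestry as a string.
interface Row { id: number; path: string; } // e.g. path = "1:4:9"

const rows: Row[] = [
  { id: 1, path: "1" },
  { id: 4, path: "1:4" },
  { id: 9, path: "1:4:9" },
  { id: 7, path: "1:7" },
];

// All descendants of node 4 = rows whose path starts with "1:4:".
// In SQL this becomes: WHERE path LIKE '1:4:%'
const descendantsOf4 = rows.filter(r => r.path.startsWith("1:4:"));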
