I'm developing an application for Google App Engine which uses BigTable for its datastore.
It's an application about writing a story collaboratively. It's a very simple hobby project that I'm working on just for fun. It's open source and you can see it here: http://story.multifarce.com/
The idea is that anyone can write a paragraph, which then needs to be validated by two other people. A story can also be branched at any paragraph, so that another version of the story can continue in another direction.
Imagine the following tree structure (reconstructed here in ASCII from the story lines listed below):

        2
       / \
      7   5
     / \   \
    2   6   9
       / \   \
      5   11  4
Every number would be a paragraph. I want to be able to select all the paragraphs in every unique story line. Basically, those unique story lines are (2, 7, 2); (2, 7, 6, 5); (2, 7, 6, 11) and (2, 5, 9, 4). Ignore that the node "2" appears twice, I just took a tree structure diagram from Wikipedia.
I also made a diagram of a proposed solution: https://docs.google.com/drawings/edit?id=1fdUISIjGVBvIKMSCjtE4xFNZxiE08AoqvJSLQbxN6pc&hl=en
How can I set up a structure that is performance-efficient both for writing and, most importantly, for reading?
There are a number of well known ways to represent trees in databases; each of them have their pros and cons. Here are the most common:
Adjacency list, where each node stores the ID of its parent.
Materialized path, which is the strategy Keyur describes. This is also the approach used by entity groups (e.g., parent entities) in App Engine. It's also more or less what you're describing in your update.
Nested sets, where each node has 'left' and 'right' IDs, such that all child nodes are contained in that range.
Adjacency lists augmented with a root ID.
Each of these has its own advantages and disadvantages. Adjacency lists are simple, and cheap to update, but require multiple queries to retrieve a subtree (one for each parent node). Augmented adjacency lists make it possible to retrieve an entire tree by storing the ID of the root node in every record.
Materialized paths are easy to implement and cheap to update, and permit querying arbitrary subtrees, but impose increasing overhead for deep trees.
Nested sets are tougher to implement, and require updating, on average, half the nodes each time you make an insertion. They allow you to query arbitrary subtrees, without the increasing key length issue materialized path has.
In your specific case, though, it seems like you don't actually need a tree structure at all: each story, branched off an original though it may be, stands alone. What I would suggest is having a 'Story' model, which contains a list of keys of its paragraphs (e.g., in Python, a db.ListProperty(db.Key)). To render a story, you fetch the Story, then do a batch fetch for all the Paragraphs. To branch a story, simply duplicate the Story entry, leaving the references to Paragraphs unchanged.
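A minimal sketch of that design, using Python and the old google.appengine.ext.db API (the model definitions here are illustrative, not taken from the actual project):

from google.appengine.ext import db

class Paragraph(db.Model):
    text = db.TextProperty()

class Story(db.Model):
    # Ordered list of Paragraph keys; the Paragraph entities are shared
    # between a story and any stories branched from it.
    paragraphs = db.ListProperty(db.Key)

def render_story(story_key):
    story = db.get(story_key)
    return db.get(story.paragraphs)  # one batch fetch, already in story order

def branch_story(story_key, at_index):
    story = db.get(story_key)
    # Duplicate the Story entry up to the branch point; the Paragraph
    # entities themselves are not copied, only their keys.
    branch = Story(paragraphs=story.paragraphs[:at_index])
    branch.put()
    return branch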
One solution I can think of: the path to the node is also the key of that node. So the key of node 11 is "2/7/6/11". You can traverse the path by a simple key lookup of all keys along it: "2/7/6/11", "2/7/6", "2/7", "2".
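A quick sketch of that ancestor traversal (plain Python, illustrative only):

def path_keys(key):
    # '2/7/6/11' -> ['2/7/6/11', '2/7/6', '2/7', '2'];
    # these can then be fetched in a single batch key lookup.
    parts = key.split('/')
    return ['/'.join(parts[:i]) for i in range(len(parts), 0, -1)]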
Related
I’m building what can be treated as a slideshow app with CouchDB/PouchDB: each “slide” is its own Couch document, and slides can be reordered or deleted, and new slides can be added in between existing slides or at the beginning or end of the slideshow. A slideshow could grow from one to ≲10,000 slides, so I am sensitive to space- and time-efficiency.
I made the slide creation/editing functionality first, completely underestimating how tricky it is to keep track of slide ordering. This is hard because the order of each slide-document is completely independent of the slide-doc itself, i.e., it’s not something I can sort by time or some number contained in the document. I see numerous questions on StackOverflow about how to keep track of ordering in relational databases:
Efficient way to store reorderable items in a database
What would be the best way to store records order in SQL
How can I reorder rows in sql database
Storing item positions (for ordering) in a database efficiently
How to keep ordering of records in a database table
Linked List in SQL
but all of these involve one of the following:
using a floating-point secondary key for reordering/creation/deletion, with periodic normalization of indexes (i.e., imagine two documents are order-index 1.0 and 2.0, then a third document in between gets key 1.5, then a fourth gets 1.25, …, until ~31 docs are inserted in between and you get floating-point accuracy problems);
a linked list approach where a slide-document has a previous and next field containing the primary key of the documents on either side of it;
a very straightforward approach of updating all documents for each document reordering/insertion/deletion.
None of these are appropriate for CouchDB: #1 incurs a huge amount of incidental complexity in SQL or CouchDB. #2 is unreliable due to lack of atomic transactions (CouchDB might update the previous document with its new next but another client might have updated the new next document meanwhile, so updating the new next document will fail with 409, and your linked list is left in an inconsistent state). For the same reason, #3 is completely unworkable.
One CouchDB-oriented approach I’m evaluating would create a document that just contains the ordering of the slides: it might contain a primary-key-to-order-number hash object as well as an array that converts order-number-to-primary-key, and just update this object when slides are reordered/inserted/deleted. The downside to this is that Couch will keep a copy of this potentially large document for every order change (reorder/insert/delete)—CouchDB doesn’t support compacting just a single document, and I don’t want to run compaction on my entire database since I love preserving the history of each slide-document. Another downside is that after thousands of slides, each change to ordering involves transmitting the entire object (hundreds of kilobytes) from PouchDB/client to Couch.
A tweak to this approach would be to make a second database just to hold this ordering document and turn on auto-compaction on it. It’ll be more work to keep track of two database connections, and I’ll eventually have to put a lot of data down the wire, but I’ll have a robust way to order documents in CouchDB.
So my questions are: how do CouchDB people usually store the order of documents? And can more experienced CouchDB people see any flaws in my approach outlined above?
Thanks to a tip by @LynHeadley, I wound up writing a library that could subdivide the lexicographical interval between strings: Mudder.js. This allows me to infinitely insert and move around documents in CouchDB, by creating new keys at will, without any overhead of a secondary document to store the ordering. I think this is the right way to solve this problem!
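The core trick is generating a new key that sorts lexicographically between two existing keys. This is not Mudder.js's actual algorithm, just a minimal Python sketch of the idea; it assumes keys over the lowercase alphabet, and it never generates keys ending in 'a', which guarantees a usable gap always remains:

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
BASE = len(ALPHABET)

def between(a, b=None):
    # Return a key strictly between a and b; b=None means 'no upper bound',
    # a='' means 'no lower bound'. Assumes a < b.
    out = []
    i = 0
    while True:
        ca = ALPHABET.index(a[i]) if i < len(a) else 0                       # pad a with the low symbol
        cb = ALPHABET.index(b[i]) if b is not None and i < len(b) else BASE  # pad b high
        if cb - ca > 1:
            out.append(ALPHABET[(ca + cb) // 2])  # room between: take the midpoint, done
            return "".join(out)
        out.append(ALPHABET[ca])                  # no room yet: copy and go one symbol deeper
        i += 1

between("a", "c")   # -> 'b'
between("an", "b")  # -> 'at'
between("b", None)  # -> 'n' (append after the last key)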
Based on what I've read, I would choose the "ordering document" approach (i.e., a slideshow document that holds an array of IDs, one per slide document). This is really straightforward and accomplishes the use case, so I wouldn't let these concerns get in the way of clean/intuitive code.
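As a rough sketch, an update against CouchDB's HTTP API could look like this (Python with the requests library; the database name and the slide_order field are made up for illustration):

import requests

DB = "http://localhost:5984/slideshows"  # hypothetical database URL

def move_slide(show_id, slide_id, new_index):
    # Fetch the current ordering document; its _rev is needed for the update.
    doc = requests.get(DB + "/" + show_id).json()
    order = doc["slide_order"]           # array of slide-document IDs
    order.remove(slide_id)
    order.insert(new_index, slide_id)
    # PUT it back. CouchDB answers 409 if another client updated the
    # document first; in that case, simply re-fetch and retry.
    resp = requests.put(DB + "/" + show_id, json=doc)
    if resp.status_code == 409:
        move_slide(show_id, slide_id, new_index)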
You are right that this document can grow potentially very large, compounded by the write-heavy nature of that specific document. This is why compaction exists and is the solution here, so you should not fight against CouchDB on this point.
It is a common misconception that you can use CouchDB's revision history to keep a comprehensive history of your database. The revisions are merely there to aid in write concurrency, not to serve as a full version control system.
CouchDB has auto-compaction enabled by default, and without it your database will grow in size unchecked. Thus, you should abandon the idea of tracking document history this way and instead adopt another, safer alternative. (A list of these alternatives is beyond the scope of this answer.)
In this graphic, we are looking up employee_id 123 and subsidiary_id 20 in a B-tree (from a tutorial on database indexes). There are two leaf nodes branching off the tree. Is this purely for demonstration, or is there something I'm missing? I would think the only leaf node that needs to be checked is the top one, since it has employee_id max 123 and subsidiary_id max 27.
The diagram itself isn't showing the actions of that particular search, rather it's showing a localised portion of the tree, so I wouldn't read too much into it for the specific query.
You're absolutely correct that, when searching for the key 123-20, you never need to follow the link to the second leaf node (either through the hierarchical link from the left, or the sequential link from above).
However (and this is hard to tell without seeing the source material), it's quite likely that this diagram may be used for something else as well.
The fact that it shows links between consecutive leaf nodes means that it would be quite easy to use the index lookup to locate a specific entry, then sequentially process them.
By that I mean a query like "give me every record with an employee ID of 123", or "give me all records with an employee ID between 123 and 456", or "give me all records in employeeID/subsidiaryID order".
All those queries would entail finding a specific record using the hierarchical path (though the final one may have a quicker path direct to the first record), then following the sequential path for subsequent records.
In addition, the fact that the subsidiary IDs of 20 are both red means it would be an ideal opportunity to educate the reader that the employee-subsidiary index is not necessarily the best one for all queries. In other words, a query like "give me all records from subsidiary 20" would be served better by another index (one containing simply the subsidiary ID).
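To make that last point concrete, here is a small sqlite sketch (in Python; the table and column names follow the diagram, the rest is illustrative) of which queries the composite index serves and which one wants its own index:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (employee_id INT, subsidiary_id INT, name TEXT)")
# The composite index from the diagram: employee_id first, subsidiary_id second.
con.execute("CREATE INDEX idx_emp_sub ON employees (employee_id, subsidiary_id)")

# Served well by idx_emp_sub: exact lookups and leading-column range scans.
con.execute("SELECT * FROM employees WHERE employee_id = 123 AND subsidiary_id = 20")
con.execute("SELECT * FROM employees WHERE employee_id BETWEEN 123 AND 456")

# Not served well by it: filtering on the second column alone generally
# means scanning, so a dedicated index is the better fit.
con.execute("CREATE INDEX idx_sub ON employees (subsidiary_id)")
con.execute("SELECT * FROM employees WHERE subsidiary_id = 20")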
That'd be my best guess; it would be worthwhile looking at the tutorial to see whether that diagram is used for anything else.
Of course, it could be that the person putting together the tutorial couldn't be bothered creating a new graphic so simply used one from a different question, or an earlier iteration of the tutorial :-) I've been guilty of that before.
I've been trying to learn programming for a while. I've studied Java and Python, and I'm comfortable with their syntax. Recently, I wanted to use what I've learnt to code a tangible piece of software from the ground up.
I want to implement a database engine, sort of a NoSQL database. I've put together a small document, sort of a specification to follow throughout my adventure of coding it. But all I know is a bunch of keywords. I don't know where to start.
Can someone help me figure out how to gather the knowledge I need for this kind of work, and in what order to learn things? I have searched for documents, but I feel like I'll end up finding unrelated/erroneous content or starting from the wrong point, because implementing a complete database engine seems to be a truly complicated task.
I want to make clear that I'd prefer theses, whitepapers, and (e)books to the source code of other projects, because when I've asked this kind of question before, the answers were usually of the form "read project X's source code". I'm not at the level of comfortably reading and understanding source code.
First, you may have a look at the answers to How to write a simple database engine. While it focuses on a SQL engine, there is still a lot of good material in the answers.
Otherwise, a good project tutorial is Implementation of a B-Tree Database Class. The example code is in C++, but the description of what is done and why is probably what you'll want to look at anyway.
Also, there is Designing and Implementing Structured Storage (Database Engine) over at MSDN. Plenty of information there to help you in your learning project.
Because the accepted answer only offers (good) links to other resources, I thought I'd share my experience writing webdb, a small experimental database for browsers. I also invite you to read the source code; it's pretty small, and you should be able to read through it and get a basic understanding of what it's doing in a couple of hours. Warning: I am a n00b at this, and since writing it I have learned a lot more and can see that I did some things wrong. It can help you get started though.
The basics: BTree
I started out by adapting an AVL tree to suit my needs. An AVL tree is a kind of self-balancing binary search tree. You store the key K and related data (if any) in a node; all items with key < K go in the left subtree and all items with key > K in the right subtree. You can use an array to store the data items if you want to support non-unique keys.
This tree will give you the basics: Create, Update, Delete and a way to quickly get an item by key, or all items with key < x, or with key between x and y etc. It can serve as the index for our table.
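webdb itself is JavaScript, but the idea of the index tree looks roughly like this Python sketch (a plain, unbalanced binary search tree with unique keys; the AVL rebalancing is omitted for brevity):

class Node:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None

class Tree:
    def __init__(self):
        self.root = None

    def insert(self, key, value):
        def _ins(node):
            if node is None:
                return Node(key, value)
            if key < node.key:
                node.left = _ins(node.left)
            elif key > node.key:
                node.right = _ins(node.right)
            else:
                node.value = value  # unique keys: overwrite in place
            return node
        self.root = _ins(self.root)

    def search_range(self, lo, hi):
        # All (key, value) pairs with lo <= key <= hi, in key order.
        out = []
        def _walk(node):
            if node is None:
                return
            if node.key > lo:
                _walk(node.left)
            if lo <= node.key <= hi:
                out.append((node.key, node.value))
            if node.key < hi:
                _walk(node.right)
        _walk(self.root)
        return out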
A schema
As a next step I wrote code that lets the client code define a schema. Methods like createTable() etc. Schemas are typically associated with SQL, but even no-SQL sort-of has a schema; they usually require you to mark the ID field and any other fields you want to search on. You can make your schema as fancy as you want, but you typically want to model at least which column(s) serve as primary key and which fields will be searched on frequently and need an index.
Creating a data structure to store a table
I decided to use the tree I created in the first step to store my items. These were simple JS objects. Having defined which field contains the PK, I could simply insert the item into the tree using that field's value as the key. This gives me quick lookup by ID (range).
Next I added another tree for every column that needs an index. In these trees I did not store the full record, but only the key. So to fetch a customer by last name, I would first use the index on last name to get the ID, then the primary key index to get the actual record. The reason I did not just store the (reference to the) actual object is because it makes set operations a little bit simpler (see next step)
Querying
Now that we have a table with indexes for PK and search fields, we can implement querying. I did not take this very far as it becomes complicated quickly, but you can get some nice functionality with just some basics. WebDB does not implement joins; all queries operate only on a single table. But once you understand this you see a pretty clear (though long and winding) path to doing joins and other complicated stuff as well.
In WebDB, to get all customers with firstName = 'John' and city = 'New York' (assuming those are two search fields), you would write something like:
var webDb = ...
var johnsFromNY = webDb.customers.get({
  firstName: 'John',
  city: 'New York'
})
To solve it, we first do two lookups: we get the set X of all IDs of customers named 'John' and we get the set Y of all IDs of customers from New York. We then perform an intersection on these two sets to get all IDs of customers that are both named 'John' AND from New York. We then run through our set of resulting IDs, getting the actual record for each one and adding it to the result array.
Using the set operators like union and intersection we can perform AND and OR searches. I only implemented AND.
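In Python-ish terms, the lookup described above is essentially this (a sketch: plain dicts stand in for the index trees, by_id for the primary-key tree):

def johns_from_ny(by_first_name, by_city, by_id):
    # Each index maps a field value to the IDs of matching records.
    x = set(by_first_name.get('John', ()))   # IDs of customers named John
    y = set(by_city.get('New York', ()))     # IDs of customers from New York
    matching = x & y                         # AND = intersection (OR would be x | y)
    return [by_id[i] for i in matching]      # fetch the actual records by PK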
Doing joins would (I think) involve creating temporary tables in memory, then populating them as the query runs with the joined results, then applying the query criteria to the temp table. I never got there. I attempted some syncing logic next but that was too ambitious and it went downhill from there :)
I'm starting a project and I'm in the design phase: i.e., I haven't decided yet which DB framework I'm going to use. I'm going to have code that creates a "forest"-like structure; that is, many trees, where each tree is standard: nodes and edges. After the code creates these trees I want to save them in the DB (and eventually pull them back out).
The naive approach to representing the data in the db is a relational db with two tables: nodes and edges. That is, the nodes table will have a node id, node data, etc.. And the edges table will be a mapping of node id to node id.
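For concreteness, that naive schema might look like this (a sqlite sketch in Python; names are illustrative):

import sqlite3

con = sqlite3.connect("forest.db")
con.executescript("""
CREATE TABLE nodes (
    node_id INTEGER PRIMARY KEY,
    data    TEXT
);
CREATE TABLE edges (
    parent_id INTEGER REFERENCES nodes(node_id),
    child_id  INTEGER REFERENCES nodes(node_id),
    PRIMARY KEY (parent_id, child_id)
);
""")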
Is there a better approach? Or given the (limited) assumptions I'm giving this is the best approach? How about if we add an assumption that the trees are relatively small - is it better to save the whole tree as a blob in the db? Which type of db should I use in that case? Please comment on speed/scalability.
Thanks
I showed a solution similar to your nodes & edges tables, in my answer to the StackOverflow question: What is the most efficient/elegant way to parse a flat table into a tree? I call this solution "Closure Table".
I did a presentation on different methods of storing and using trees in SQL, Models for Hierarchical Data with SQL and PHP. I demonstrated that with the right indexes (depending on the queries you need to run), the Closure Table design can have very good performance, even over large collections of edges (about 500K edges in my demo).
I also covered the design in my book, SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.
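For reference, here is a minimal sketch of the Closure Table design (Python with sqlite; I've kept it simpler than the presented version, e.g. the optional path-length column is omitted):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, data TEXT);
-- One row per ancestor/descendant pair, including each node paired with itself.
CREATE TABLE tree_paths (
    ancestor   INTEGER,
    descendant INTEGER,
    PRIMARY KEY (ancestor, descendant)
);
""")

def insert_node(node_id, data, parent_id=None):
    con.execute("INSERT INTO nodes VALUES (?, ?)", (node_id, data))
    con.execute("INSERT INTO tree_paths VALUES (?, ?)", (node_id, node_id))
    if parent_id is not None:
        # The new node descends from every ancestor of its parent.
        con.execute("""INSERT INTO tree_paths (ancestor, descendant)
                       SELECT ancestor, ? FROM tree_paths WHERE descendant = ?""",
                    (node_id, parent_id))

# The whole subtree under node 1, in one indexed query:
rows = con.execute("""SELECT n.* FROM nodes n
                      JOIN tree_paths t ON n.node_id = t.descendant
                      WHERE t.ancestor = 1""").fetchall()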
Be sure to use some sort of low-level coding for the entity being treed, to prevent looping. The entity might be a part, subject, folder, etc.
With an Entity file and an Entity-Xref file you can traverse either of, say, two relationships between the two files: a parent relation and a child relation.
A level is the level at which an entity is found in a tree. A low-level code for an entity is the lowest level at which that entity is found in any tree anywhere. Before making an entity a child, check its low-level code to make sure the addition cannot create a loop; after being added as a child, an entity will sit at least one level lower.
What are 'best-practices' for saving Composite patterns in a Relational Database?
We have been using Modified Preorder Tree Traversal. It is very quick for building the whole tree, but very slow for inserting or deleting nodes (all left and right values need to be adjusted). Querying the children of a node is also not easy and very slow.
Another thing we noticed is that you really have to make sure the tree doesn't get messy. You need transaction locks, otherwise the left and right values can get corrupt, and fixing a corrupt left right tree is not an easy job.
Modified Preorder Tree Traversal does work very well, but I was wondering whether there are better alternatives.
While finding all descendants of a row with MPTT is fast, finding all children can be slow. However, you should be able to fix that by adding a parent_id field to your table that records (yes, redundantly) the parent of the row. Then the search becomes:
SELECT *
FROM tbl
WHERE parent_id = z
Yes, parent_id contains redundant information, potentially denormalizing your table -- but since any insert/update/delete already requires global changes, keeping parent_id up-to-date isn't much extra to pay. You could alternatively use a level field that records the vertical level of the row, although that is in fact more likely to change under certain types of transformations (e.g. moving a subtree to a different point in the tree).
The plain old link-to-parent representation (i.e. just having parent_id and no left_pos or right_pos), is of course faster for insert/update-heavy workloads, but the only queries it can answer efficiently are "Find the parent of X" and "Find the children of X." Most workloads involve much more reading than writing, so usually MPTT is faster overall -- but perhaps in your case you need to consider moving ("back") to link-to-parent?
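A small sketch of the hybrid table described above (Python with sqlite; names are illustrative, and lft/rgt are the MPTT left and right values):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE tbl (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER,   -- redundant, but makes child lookups trivial
    lft       INTEGER,
    rgt       INTEGER,
    name      TEXT
)""")

def children(z_id):
    # Served directly by the redundant parent_id column.
    return con.execute("SELECT * FROM tbl WHERE parent_id = ?", (z_id,)).fetchall()

def descendants(z_lft, z_rgt):
    # The classic MPTT range query: everything nested inside z's (lft, rgt).
    return con.execute("SELECT * FROM tbl WHERE lft > ? AND rgt < ?",
                       (z_lft, z_rgt)).fetchall()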
The best way I have heard of to store hierarchical data in a database is to use a string attribute where the content is the list of parents separated by, say, colons.
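A sketch of how that looks in practice (Python with sqlite; the schema is illustrative), where fetching a whole subtree becomes a prefix match:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, path TEXT, data TEXT)")
con.executemany("INSERT INTO nodes VALUES (?, ?, ?)",
                [(1, "2", "root"), (2, "2:7", "child"), (3, "2:7:6", "grandchild")])

# All descendants of the node with path "2:7": a simple prefix match
# (an index on path can serve this, depending on the database's LIKE rules).
rows = con.execute("SELECT * FROM nodes WHERE path LIKE '2:7:%'").fetchall()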