I'm coding a new NoSQL database, and had what I thought was a novel idea (for me anyways) regarding the hashing mechanism used to locate nodes for a given key.
I'm using object keys that incorporate a timestamp. A hash will be used to determine the node(s) holding the data. Pretty common so far.
The (possible) twist lies in that a map will record the times at which nodes have been added to the cluster. That way I can determine for any given object which nodes were present in the cluster when that object was added (and therefore which nodes hold the object's data).
I'm thinking that in this way growing the cluster wont require any data to be transferred. Objects always live on the same node...for ever.
Has anyone tried something like this? Any potential problems that anyone can foresee?
Related
I have been reading a bit about tries, and how they are a good structure for typeahead designs. Aside from the trie, you usually also have a key/value pair for nodes and pre-computed top-n suggestions to improve response times.
Usually, from what I've gathered, it is ideal to keep them in memory for fast searches, such as what was suggested in this question: Scrabble word finder: building a trie, storing a trie, using a trie?. However, what if your Trie is too big and you have to shard it somehow? (e.g. perhaps a big e-commerce website).
The key/value pair for pre-computed suggestions can be obviously implemented in a key/value store (either kept in memory, like memcached/redis or in a database, and horizontally scalled as needed), but what is the best way to store a trie if it can't fit in memory? Should it be done at all, or should distributed systems each hold part of the trie in memory, while also replicating it so that it is not lost?
Alternatively, a search service (e.g. Solr or Elasticsearch) could be used to produce search suggestions/auto-complete, but I'm not sure whether the performance is up to par for this particular use-case. The advantage of the Trie is that you can pre-compute top-N suggestions based on its structure, leaving the search service to handle actual search on the website.
I know there are off-the-shelf solutions for this, but I'm mostly interesting in learning how to re-invent the wheel on this one, or at least catch a glimpse of the best practices if one wants to broach this topic.
What are your thoughts?
Edit: I also saw this post: https://medium.com/#prefixyteam/how-we-built-prefixy-a-scalable-prefix-search-service-for-powering-autocomplete-c20f98e2eff1, which basically covers the use of Redis as primary data store for a skip list with mongodb for LRU prefixes. Seems an OK approach, but I would still want to learn if there are other viable/better approaches.
I was reading 'System Design Interview: An Insider's Guide' by Alex Yu and he briefly covers this topic.
Trie DB. Trie DB is the persistent storage. Two options are available to store the data:
Document store: Since a new trie is built weekly, we can periodically take a snapshot of it, serialize it, and store the serialized data in the database. Document stores like MongoDB [4] are good fits for serialized data.
Key-value store: A trie can be represented in a hash table form [4] by applying the following logic:
• Every prefix in the trie is mapped to a key in a hash table.
• Data on each trie node is mapped to a value in a hash table.
Figure 13-10 shows the mapping between the trie and hash table.
The numbers on each prefix node represent the frequency of searches for that specific prefix/word.
He then suggests possibly scaling the storage by sharding the data by each alphabet (or groups of alphabets, like a-m, n-z). But this would unevenly distribute the data since there are more words that start with 'a' than with 'z'.
So he recommends using some type of shard manager where it keeps track of the query frequency and assigns a shard based on that. So if there are twice the amount of queries for 's' as opposed to 'z' and 'x' combined, two shards can be used, one for 's', and another for 'z' and 'x'.
Just a question regarding NoSQL DB. As far as I know, operations are done by the app/website outside the DB. For instance, if I need to add an value to a list, I need to
download the intial list
add the new value in the list on my device
upload the whole updated list.
At the end, a lot of data is travelling (twice the initial list) with no added value.
Is there any way to request directly the DB for simple operations like this?
db.collection("collection_key").document("document_key").add("mylist", value)
Or simply increment a field?
Same for knowing the number of documents in a collection: is it needed to download the whole set of document to get the number ?
Couple different answers:
In Firestore, many intrinsic operations can be done "FieldValues", such as increment/decrement (by supplied value, so really Add/subtract). Also array unions, field deletes, etc. Just search the documentation for FieldValue. Whether this is true for NoSQL in general, I can't say.
Knowing the number of documents, on the other hand. is not trivially done in Firestore - but frankly, I can't think of any situations other than artificially contrived examples where you would need to know. Easy enough to setup ways to "count" documents as you create/delete them, and keep that separately, if for some reason you find yourself needing it.
Or were you just trying to generically put down NoSQL as a concept?
I’m building what can be treated as a slideshow app with CouchDB/PouchDB: each “slide” is its own Couch document, and slides can be reordered or deleted, and new slides can be added in between existing slides or at the beginning or end of the slideshow. A slideshow could grow from one to ≲10,000 slides, so I am sensitive to space- and time-efficiency.
I made the slide creation/editing functionality first, completely underestimating how tricky it is to keep track of slide ordering. This is hard because the order of each slide-document is completely independent of the slide-doc itself, i.e., it’s not something I can sort by time or some number contained in the document. I see numerous questions on StackOverflow about how to keep track of ordering in relational databases:
Efficient way to store reorderable items in a database
What would be the best way to store records order in SQL
How can I reorder rows in sql database
Storing item positions (for ordering) in a database efficiently
How to keep ordering of records in a database table
Linked List in SQL
but all these involve either
using a floating-point secondary key for reordering/creation/deletion, with periodic normalization of indexes (i.e., imagine two documents are order-index 1.0 and 2.0, then a third document in between gets key 1.5, then a fourth gets 1.25, …, until ~31 docs are inserted in between and you get floating-point accuracy problems);
a linked list approach where a slide-document has a previous and next field containing the primary key of the documents on either side of it;
a very straightforward approach of updating all documents for each document reordering/insertion/deletion.
None of these are appropriate for CouchDB: #1 incurs a huge amount of incidental complexity in SQL or CouchDB. #2 is unreliable due to lack of atomic transactions (CouchDB might update the previous document with its new next but another client might have updated the new next document meanwhile, so updating the new next document will fail with 409, and your linked list is left in an inconsistent state). For the same reason, #3 is completely unworkable.
One CouchDB-oriented approach I’m evaluating would create a document that just contains the ordering of the slides: it might contain a primary-key-to-order-number hash object as well as an array that converts order-number-to-primary-key, and just update this object when slides are reordered/inserted/deleted. The downside to this is that Couch will keep a copy of this potentially large document for every order change (reorder/insert/delete)—CouchDB doesn’t support compacting just a single document, and I don’t want to run compaction on my entire database since I love preserving the history of each slide-document. Another downside is that after thousands of slides, each change to ordering involves transmitting the entire object (hundreds of kilobytes) from PouchDB/client to Couch.
A tweak to this approach would be to make a second database just to hold this ordering document and turn on auto-compaction on it. It’ll be more work to keep track of two database connections, and I’ll eventually have to put a lot of data down the wire, but I’ll have a robust way to order documents in CouchDB.
So my questions are: how do CouchDB people usually store the order of documents? And can more experienced CouchDB people see any flaws in my approach outlined above?
Thanks to a tip by #LynHeadley, I wound up writing a library that could subdivide the lexicographical interval between strings: Mudder.js. This allows me to infinitely insert and move around documents in CouchDB, by creating new keys at will, without any overhead of a secondary document to store the ordering. I think this is the right way to solve this problem!
Based on what I've read, I would choose the "ordering document" approach. (ie: slideshow document that has an array of ids for each slide document) This is really straightforward and accomplishes the use-case, so I wouldn't let these concerns get in the way of clean/intuitive code.
You are right that this document can grow potentially very large, compounded by the write-heavy nature of that specific document. This is why compaction exists and is the solution here, so you should not fight against CouchDB on this point.
It is a common misconception that you can use CouchDB's revision history to keep a comprehensive history to your database. The revisions are merely there to aid in write concurrency, not as a full version control system.
CouchDB has auto-compaction enabled by default, and without it your database will grow in size unchecked. Thus, you should abandon the idea of tracking document history using this approach, and instead adopt another, safer alternative. (a list of these alternatives is beyond the scope of this answer)
Does anyone know of any databases (SQL or NoSQL) that have native support for position based indexes?
To clarify, on many occasions I've had the need to maintain a position based collection, where the order or position is maintained by an external entity (user, external service, etc). By maintained I mean the order of the items in the collection will be changed quite often but are not based on any data fields in the record, the order is completely arbitrary as far as the service maintaining the collection is concerned. The service needs to provide an interface that allows CRUD functions by position (Insert after Pos X, Delete at Pos Y, etc) as well as manipulating the position (move from pos X to pos Y).
I'm aware there are workaround ways that you can achieve this, I've implemented many myself but this seems like a pretty fundamental way to want to index data. So I can't help but feel there must be an off the shelf solution out there for this.
The only thing I've seen that comes close to this is Redis's List data type, which while it's ordered by position, is pretty limited (compared to a table with multiple indexes) and Redis is more suited as a Cache rather than a persistent data store.
Finally I'm asking this as I've got a requirement that needs user ordered collections that could contain 10,000's of records.
In case it helps anyone, the best approximation of this I've found so far is to implement a Linked List structure in a Graph Database (like Neo4J). Maintaining the item links is considerably easier than maintaining a position column (especially if you only need next links, i.e. not doubly linked). It's easier as there is no need to leave holes, re-index, etc, you only have to move pointers (or relations). The performance is pretty good but reads slow down linearly if you're trying to access items towards the end of the list by position, as you have to scan (SKIP) the whole list start to end.
I don't know if I don't know the correct terms or if what I'm looking for simply isn't a common structure, so please bear with me as I try to describe what I am looking for.
Right now I have a sorted set. It changes over time with simple modifications. A (k,v) pair is inserted, deleted, or the value of a specific key may change.
No actions are or ever will be executed on more than a single key.
What I need is a way to store each incremental version of the data set and have it be mapped to a point in time. I will need to access any portion of it quickly and be able to generate the exact sorted set that existed at that time, and how it changed over the time period.
It is not feasible to store the actual sorted sets after each mutation themselves as it is about 10kb of data and will have approximately 2-3 mutations per second on average. This is a personal project so writing 2.5 gigabytes of data per set (times 10-20 sets) per day is cost prohibitive.
Now I have come up with a solution - and here lies my question, does the solution I've come up with have a term? Is there a better way to do it?
If I have an initial dataset Orders, the next iteration of data could be written as Orders + (K,V) then instead of storing the entire set twice, I simply store the actual set once, and then the second time it is stored as a reference + the mutation.
Then if I wanted to access Orders[n] I would iterative Orders[0] -> Order[n] applying the mutation and I would generate the set as it existed in time.
There is a big problem with this however. I need to be able to quickly access any range of data - roughly 250,000 iterations per day * months or years - so it is not practical to calculate the set from 0 -> n when n is large. The obvious solution here is to at some interval cache the resulting set and instead of a given data point recursively being calculated all the way back to Orders[0] it would only need to calculate back to Orders[1,500,000] to find the set which existed at Orders[1,500,100].
If I decided this was a good way to structure the data, how often should I cache results?
Does something like this exist? In my research a lot of sources said to use linked lists or binary trees. I don't need a tree as my data is 100% continuous, not branching. So if I used a linked list my confusion lies in actually storing the data. And this is where I am completely stuck. What would be the best database & database schema to store this? (could use any system at this point, though having a node.js wrapper would be ideal as that is what is serving the data to the front-end) Or would writing binary data work better?
Even something as simple as an actual term for what I'm looking for or an alternative data structure to research would be helpful. Thanks!
This sounds like an excellent use case for a persistent binary search tree. A persistent data structure is one where after performing an operation on the structure, you get back two structures - the one before the change and the one after the change. Crucially, the internal representations of the two structures share memory, so if you have a 10KB collection, it takes much less than 20KB to store the before and after snapshots.
Since you want a key/value store, a persistent binary search tree might be your best bet. Like a normal BST, all operations run in O(log n) time. You can then store an array of all the snapshots, giving you O(1) access to any time slice you want.
Hope this helps!
The data structures you are talking about are typically known as "persistent data structures" or sometimes "immutable data structures."