What is the correct way of creating a document that can be queried by any of its elements? - database

I'm working on a project in which I want all documents in a "pool" to be returned by searching for any element in the pool.
So, for instance, let's say we have 3 pools, each with varying documents labeled by letter:
Pool 1: A, B, C
Pool 2: D
Pool 3: E, F, G, H
When I search for A, I want to get A, B, and C. When I search for C, I also want to get A, B, and C.
If I add a document I, and it satisfies criteria for Pool 1 and 2, then Pools 1 and 2 should be merged, and any search for any A, B, C, D, I should return all of them.
I know how to do this inefficiently (create a new document with each element as key, then update all documents on each insertion), but I was wondering if there was a better way?
Thanks in advance

I think that with something as abstract as data, particularly database documents, a good visualization helps with conceptualizing the problem. Try viewing this problem as maintaining a set of trees of depth at most 1. Specifically, each document is a leaf, and the "rules" that determine which documents are part of a "pool" form the root (i.e. the root is the subset of labels that a leaf may have).
Now, what you're saying you want to do is to be able to add a new leaf. If this leaf is able to connect to more than one root, then those roots should be merged, which means updating what the root is and pointing every leaf from the affected trees to this new root.
Otherwise, what you end up with is the need to jump around from the new leaf to each of the roots it connects to and then to every other leaf. But each other leaf could potentially also be connected to other roots, which means you could be jumping around like this an arbitrary number of times. This is a non-ideal situation.
In order for this query to be efficient, you need to decide what these "roots" are going to be and update those accordingly. You may, for instance, decide to keep a "pool" document and merge these "pools" together as needed, e.g. by having a labels field that is an array of labels to be included in the pool. Merging is then just a matter of merging the arrays themselves. Alternatively, you could use a common ObjectId (not necessarily attached to any particular document) and use this value as a sort of "pseudo root node" in place of having documents. There are a number of options you could explore. In general, however, you should try to reduce any examining of field values for individual documents down to a single value check (e.g. don't keep arrays of other "related" labels in each document!).
Regardless of your approach, keep these tree structures in mind, consider what it means to traverse the nodes in terms of MongoDB queries, and determine how you want to traverse the nodes in order to 1) ensure that the number of "hops" you need between nodes is a constant-time operation, and 2) ensure that you can efficiently and reliably merge those roots without risk of data loss.
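To make the "pool document" idea above concrete, here is a minimal in-memory sketch in Python (the class and method names are hypothetical, not any MongoDB API): every label points at exactly one pool id, so a search is a single value check, and adding a document that connects several pools re-points the affected labels at one surviving pool.

```python
class PoolIndex:
    """Each label maps to one pool id; merging pools rewrites those pointers once."""

    def __init__(self):
        self.pool_of = {}   # label -> pool id
        self.members = {}   # pool id -> set of labels
        self.next_id = 0

    def add(self, label, related_labels):
        # Find every existing pool the new document connects to.
        pools = {self.pool_of[l] for l in related_labels if l in self.pool_of}
        if not pools:
            pid = self.next_id
            self.next_id += 1
            self.members[pid] = set()
        else:
            # Merge all connected pools into one surviving pool.
            pid = min(pools)
            for other in pools - {pid}:
                for l in self.members.pop(other):
                    self.pool_of[l] = pid
                    self.members[pid].add(l)
        self.pool_of[label] = pid
        self.members[pid].add(label)

    def search(self, label):
        # One lookup to find the root, one to list the leaves.
        return sorted(self.members[self.pool_of[label]])
```

In a real collection, `pool_of` would be a single indexed field on each document and `members` a query on that field; the point of the sketch is that the merge touches only the documents in the affected pools, and a search never hops leaf-to-leaf.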
Finally, if you're finding that your update queries are too slow, then you're probably running into indexing problems. With proper indexes, updates on collections with even millions of documents should complete quickly. Additionally, if you're not doing a multi update and are instead running an individual update for each document, your updates will slow to a crawl: each one pays its own O(n) search time plus network round-trip overhead.

Related

Database Position Index

Does anyone know of any databases (SQL or NoSQL) that have native support for position-based indexes?
To clarify: on many occasions I've had the need to maintain a position-based collection, where the order or position is maintained by an external entity (user, external service, etc.). By "maintained" I mean the order of the items in the collection will change quite often, but is not based on any data fields in the record; the order is completely arbitrary as far as the service maintaining the collection is concerned. The service needs to provide an interface that allows CRUD operations by position (insert after position X, delete at position Y, etc.) as well as manipulating the position (move from position X to position Y).
I'm aware there are workarounds to achieve this (I've implemented several myself), but this seems like a fairly fundamental way to want to index data, so I can't help but feel there must be an off-the-shelf solution out there.
The only thing I've seen that comes close is Redis's List data type, which, while ordered by position, is pretty limited compared to a table with multiple indexes, and Redis is better suited as a cache than as a persistent data store.
Finally, I'm asking because I have a requirement for user-ordered collections that could contain tens of thousands of records.
In case it helps anyone, the best approximation of this I've found so far is to implement a linked-list structure in a graph database (like Neo4j). Maintaining the item links is considerably easier than maintaining a position column (especially if you only need next links, i.e. a singly linked list). It's easier because there is no need to leave holes, re-index, etc.; you only have to move pointers (relations). The performance is pretty good, but reads slow down linearly when accessing items toward the end of the list by position, since you have to scan (SKIP) the whole list from the start.
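To illustrate that trade-off, here is a small in-memory Python sketch of the singly linked approach (class and method names are invented for illustration): inserts and deletes only move pointers, but reading by position means walking from the head, which is the linear slowdown described above.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None


class PosList:
    """Singly linked list: positional writes move pointers, positional reads scan."""

    def __init__(self):
        self.head = None

    def _node_at(self, pos):
        # O(pos) walk from the head -- the equivalent of SKIP in a graph query.
        node = self.head
        for _ in range(pos):
            node = node.next
        return node

    def insert(self, pos, value):
        """Insert value so that it ends up at index pos."""
        new = Node(value)
        if pos == 0:
            new.next, self.head = self.head, new
        else:
            prev = self._node_at(pos - 1)
            new.next, prev.next = prev.next, new

    def delete(self, pos):
        if pos == 0:
            self.head = self.head.next
        else:
            prev = self._node_at(pos - 1)
            prev.next = prev.next.next

    def to_list(self):
        out, node = [], self.head
        while node:
            out.append(node.value)
            node = node.next
        return out
```

The write path never renumbers anything, which is exactly why the graph-database version is easy to maintain; the cost is that `_node_at` is linear in the position.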

Suitability of ArangoDB for heavily updated arrays / embedded documents

I'm investigating the suitability of ArangoDB for a specific use case:
I have a relatively high number of root documents.
Each document represents the top of a hierarchy.
The hierarchies are traversed as graphs.
The link between each level in the hierarchy is established via a combination of arrays of embedded documents and via IDs in arrays that point to documents.
I need to be able to push IDs onto & delete from arrays.
I need to be able to add / remove embedded documents.
My questions:
Is ArangoDB able to update embedded documents without updating the entire container document?
Does it have a mechanism to address individual items in arrays, for the purpose of pushing an item onto the end / deleting an item in an efficient manner (i.e. without degrading to something like O(n))?
I have looked in the documentation and searched online, but couldn't find clear answers to these questions.
To answer your questions:
1) There is no in-place updating of documents in ArangoDB. When updating a document, ArangoDB will store a new, updated version of the original document. The new version is self-contained, meaning it contains the entire (updated) container document. The old version of the document is still kept around because other currently running operations may reference it. Outdated versions of documents will eventually be deleted.
2) As follows from answer 1, pushing a value into an array or deleting a value from an array will build a new, self-contained version of the entire document. That means pushing/deleting an array value takes as long as constructing the entire document, which is proportional to the document's size (i.e. the more array values, the longer it will take).

how to remember multiple indexes in a buffer to later access them for modification one by one...keeping optimization in mind

I have a scenario where I have to set a few records' field values to a constant and then later access those records one by one, sequentially.
The records can be random records.
I don't want to use a linked list, as it would be costly, and I don't want to traverse the whole buffer.
Please give me some idea of how to do that.
When you say "set a few records' field values to a constant", is this like a key to the record? And by "later access them one by one", do you mean recalling them with some key? "One by one sequentially" and "don't want to traverse the whole buffer" seem to conflict, as sequential access sounds a lot like traversal.
But I digress. If you in fact do have a key (and it's a number), you could use some sort of Hash Table to organize your records. One basic implementation might be an array of linked lists, where you mod the key down into the array's range, then add it to the list there. This might increase performance assuming you have a good distribution of keys (your records spread across the array well).
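A minimal Python sketch of that idea (all names here are made up for illustration): keys are modded into a fixed-size bucket array, and colliding keys chain within a bucket, so a lookup only searches one bucket rather than the whole table.

```python
BUCKETS = 8


class RecordTable:
    """Array of buckets; a record's numeric key is modded into the array."""

    def __init__(self):
        self.buckets = [[] for _ in range(BUCKETS)]

    def put(self, key, record):
        # Chain within the bucket; duplicates of the same key are not checked here.
        self.buckets[key % BUCKETS].append((key, record))

    def get(self, key):
        # Only one bucket is searched, not the whole table.
        for k, rec in self.buckets[key % BUCKETS]:
            if k == key:
                return rec
        return None
```

With a good key distribution, each bucket stays short and lookups approach constant time; with a bad distribution, everything lands in one bucket and you are back to a plain list scan.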
Another data structure to look into might be a B-Tree or a binary search tree, which can access nodes in logarithmic time.
However, overall I agree with the commenters that over-optimizing is usually not a good idea.

C Database Design, Sortable by Multiple Fields

If memory is not an issue for my particular application (entry, lookup, and sort speed being the priorities), what kind of data structure/concept would be the best option for a multi-field rankings table?
For example, let's say I want to create a Hall of Fame for a game, sortable by top score (independent of username), by username (with all scores by the same user grouped together before ranking users by their highest scores), or by level reached (independent of score or name). In this example, if I order a linked list, vector, or any other sequential data structure by each player's top score, it makes searching on the other fields (like level, or non-highest scores) iterative: I'd have to iterate across all entries looking for the level, or for a specific score range, unless I conceive some other way to keep the data sorted as I enter it.
The question is whether there is a more efficient (albeit complicated and memory-consumptive) method or database structure in C/C++ that might be primed for this kind of multi-field sort. Linked lists seem fine for simple score rankings, and I could even organize a hashtable by hashing on a single field (player name, or level reached) to sort by a single field, but then the other fields take O(N) to find, worse to sort. With just three fields, I wonder if there is a way (like sets or secondary lists) to prevent iterating in certain pre-desired sorts that we know beforehand.
Do it the same way databases do it: using index structures. You have your main data as a number of records (structs), perhaps ordered according to one of your sorting criteria. Then you have index structures, each one ordered according to one of your other sorting criteria, but these index structures don't contain copies of all the data, just pointers to the main data records. (Think "index" like the index in a book, with page numbers "pointing" into the main data body.)
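A toy Python version of that layout (field names invented for illustration): the main records exist exactly once, and each "index" is just a differently ordered list of references to them, not copies.

```python
# Main data: each record exists exactly once.
records = [
    {"name": "ann", "score": 310, "level": 4},
    {"name": "bob", "score": 120, "level": 7},
    {"name": "cat", "score": 560, "level": 2},
]

# Index structures: orderings of *references* into records, not copies.
by_score = sorted(records, key=lambda r: r["score"], reverse=True)
by_level = sorted(records, key=lambda r: r["level"], reverse=True)

# An update made through one index is visible everywhere,
# because every index shares the same underlying record objects.
by_score[0]["level"] = 9
```

In C you would hold `struct record *` pointers instead of Python references, but the shape is the same: one array of records plus one ordered pointer array per sort criterion.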
Using ordered linked list for your index structures will give you a fast and simple way to go through the records in order, but it will be slow if you need to search for a given value, and similarly slow when inserting new data.
Hash tables will have fast search and insertion, but (with normal hash tables) won't help you with ordering at all.
So I suggest some sort of tree structure. Balanced binary trees (look for AVL trees) work well in main memory.
But don't forget the option to use an actual database! Database managers such as MySQL and SQLite can be linked with your program, without a separate server, and let you do all your sorting and indexing very easily, using SQL embedded in your program. It will probably execute a bit slower than if you hand-craft your own main-memory data structures, or if you use main-memory data structures from a library, but it might be easier to code, and you won't need to write separate code to save the data on disk.
So, you already know how to store your data and keep it sorted with respect to a single field. Assuming the values of the fields for a single entry are independent, the only way you'll be able to get what you want is to keep three different lists (using the data structure of your choice), each sorted by a different field. You'll use roughly three times the pointer memory of a single list.
As for what data structure each of the lists should be, a binary max heap is effective. Insertion is O(lg N), and the top entry is available in O(1); producing all entries in sorted order via repeated pops costs O(N lg N). If in some of these list copies the entries need to be sub-sorted by another field, just account for that in the comparison function.
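For the heap option, Python's standard `heapq` module is a min-heap, so one common trick (shown here as a sketch) is to store negated scores to get max-heap behavior:

```python
import heapq

scores = [310, 120, 560, 90]

# heapq is a min-heap; negate values to simulate a max-heap.
heap = [-s for s in scores]
heapq.heapify(heap)          # O(N) to build

top = -heap[0]               # peek at the maximum in O(1)

# Repeated pops yield the scores in descending order: O(N lg N) total.
ranked = [-heapq.heappop(heap) for _ in range(len(heap))]
```

For a real leaderboard entry you would heapify tuples like `(-score, name)` so ties break deterministically and the record travels with its key.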

How to maintain an ordered table with Core Data (or SQL) with insertions/deletions?

This question is in the context of Core Data, but if I am not mistaken, it applies equally well to a more general SQL case.
I want to maintain an ordered table using Core Data, with the possibility for the user to:
reorder rows
insert new lines anywhere
delete any existing line
What's the best data model to do that? I can see two ways:
1) Model it as an array: I add an int position property to my entity
2) Model it as a linked list: I add two one-to-one relations, next and previous from my entity to itself
1) makes it easy to sort, but painful to insert or delete as you then have to update the position of all objects that come after
2) makes it easy to insert or delete, but very difficult to sort. In fact, I don't think I know how to express a Sort Descriptor (SQL ORDER BY clause) for that case.
Now I can imagine a variation on 1):
3) add an int ordering property to the entity, but instead of counting one by one, have it count 100 by 100 (for example). Then inserting is as simple as picking any number between the ordering values of the previous and next existing objects. The expensive renumbering only has to occur once the gaps have been filled. Making that property a float rather than an int makes it even better: it's almost always possible to find a new float midway between two floats.
Am I on the right track with solution 3), or is there something smarter?
If the ordering is arbitrary i.e. not inherent in the data being modeled, then you have no choice but to add an attribute or relationship to maintain the order.
I would advise the linked list, since it is easiest to maintain. I'm not sure what you mean by a linked list being difficult to sort, since you most likely won't be sorting on an arbitrary order anyway. Instead, you will just fetch the topmost instance and walk your way down.
Ordering by a divisible float attribute is a good idea. You can create a nearly infinite number of intermediate indexes just by subtracting the lower existing index from the higher one, dividing the result by two, and adding that result to the lower index.
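That arithmetic is tiny, but a short Python sketch makes the insertion pattern clear (the row names are invented for illustration):

```python
def midpoint(lo, hi):
    """Ordering value for a row inserted between two neighbors."""
    return lo + (hi - lo) / 2.0


# Existing rows with spaced-out ordering values.
order = {"first": 100.0, "second": 200.0}

# Inserting between them needs no renumbering of other rows.
order["inserted"] = midpoint(order["first"], order["second"])  # 150.0

rows = sorted(order, key=order.get)
```

One caveat worth noting: repeatedly inserting at the same spot halves the gap each time, so after enough insertions float precision runs out and you still need an occasional renumbering pass.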
You can also combine a divisible index with a linked list if you need an ordering for tables or the like. The linked list would make it easy to find existing indexes, and the divisible index would make it easy to sort if you needed to.
Core Data resists this kind of ordering because it is usually unnecessary. You don't want to add something to the data model unless it is necessary to simulate the real-world object, event, or condition that the model describes. Usually, ordering/sorting is not inherent to the model but merely needed by the UI/view. In that case, the sorting logic belongs in the controller between the model and the view.
Think carefully before adding ordering to model when you may not need it.
Since iOS 5 you can (and should) use NSOrderedSet and its mutable subclass; see the Core Data Release Notes for OS X v10.7 and iOS 5.0.
See the accepted answer at How can I preserve an ordered list in Core Data.
