ArangoDB - Is indexing better than having more collections?

I have 3 types of entity:
Subjects
Topics
Tasks
In each subject there are topics and tasks. Topics can depend on each other. (Of course, a topic that belongs to subject sj1 can only depend on another topic that also belongs to sj1.)
Between tasks and topics there are connections (which must also stay within the same subject) that express the fact that to solve a certain task we need to be aware of certain topics.
So a task can require multiple topics, and a topic can be required by multiple tasks (an N<--->M connection).
What would be the best solution to store?
Solution 1:
Have 3 collections, one for each type of entity.
In tasks and topics, have an index on a subject identifier attribute.
Have an edge collection for storing the [N]<-->[M] connections between topics and tasks. (A sketch follows the sidenote below.)
Solution 2:
Have 1 collection for the subjects.
For each subject, have one topics collection and one tasks collection. The connection between subjects and tasks/topics can be based on a prefix in the collection names. (I.e. for the chemistry subject we have chemistry_tasks and chemistry_topics collections.)
For each subject, have an edge collection for connections between the tasks and topics, and another edge collection for connections among topics. (I.e. chemistry_topics_tasks_connections and chemistry_topics_connections.)
This way, if I want to search among the topics or tasks of a subject, I don't need to pre-filter them via the subject identifier index; I immediately get the collection that contains all of my data. Moreover, I don't have the overhead of an index entry for each document in tasks and topics.
On the other hand, this results in a mess of collections.
Sidenote: there will be at most 50 subjects, but the number of tasks and topics is unbounded.
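For reference, solution 1 could be wired up roughly like this with python-arango (the database, collection, and attribute names below are illustrative, not part of the question):

```python
# Sketch of solution 1: one collection per entity type, a subject
# index on tasks/topics, and edge collections for the relations.
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("school", username="root", password="")

subjects = db.create_collection("subjects")
topics = db.create_collection("topics")
tasks = db.create_collection("tasks")

# topic -> topic dependencies, task -> topic requirements
topic_deps = db.create_collection("topic_dependencies", edge=True)
task_reqs = db.create_collection("task_requirements", edge=True)

# Index the subject identifier so per-subject filtering stays cheap.
topics.add_persistent_index(fields=["subject"])
tasks.add_persistent_index(fields=["subject"])
```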

In your terms, "awareness" is generated through the graph, which requires no extra indexing to work at its best. ArangoDB automatically creates special "_key" and "_from/_to" indexes, which it uses for graph traversal.
But as for indexing, that's all about search performance - indexes are added based on the data you want to find. It really comes down to how you want to search:
one collection with multiple entity types or
multiple collections segregated by entity type.
There is no penalty for having large collections, and a graph can link documents within a single collection - it doesn't need them to be segregated. Also, you can have multiple edge collections and/or multiple document collections. These are some of the concepts that challenge those of us who, like me, come from a traditional RDBMS - "schemaless" or "multi-model" databases kinda turn normalization on its ear.
Personally, I choose to build fairly large collections based on the data source (I import data from external sources). Each collection contains documents of multiple object/data schemas, identified by an objType attribute. The benefit here is that you can search all documents in the collection on a single field (or even an index with multiple fields, like title + objType), very quickly reducing the set of documents to iterate/traverse - this is usually where the real performance gains are made.
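A minimal sketch of that pattern, assuming python-arango and made-up names (a content collection whose documents carry an objType attribute):

```python
# One large collection; a combined persistent index on objType + title
# narrows the candidate set in a single indexed lookup.
from arango import ArangoClient

db = ArangoClient(hosts="http://localhost:8529").db(
    "mydb", username="root", password=""
)
content = db.create_collection("content")
content.add_persistent_index(fields=["objType", "title"])

cursor = db.aql.execute(
    "FOR d IN content FILTER d.objType == @t AND d.title == @ti RETURN d",
    bind_vars={"t": "topic", "ti": "Stoichiometry"},
)
print([doc["_key"] for doc in cursor])
```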
So... I guess I recommend solution #3?

Related

Solr documents with multiple parents

I'm currently trying to figure out if Solr is the right tool for me. I have the following setup:
There is the primary document type "blog". Then there are two additional document types "user" and "category". Both of these are parents of the "blog" document type.
Now when searching the "blog" documents, I not only want to search those fields (e.g. title and content), but also the parent fields (user>name and category>name).
Of course, I could just flatten that down to a single document for Solr, which would ease the search a lot. The downside to this is though, that when e.g. a user updates their name, I have to run through all blog posts of them and update the documents for that in Solr, instead of just updating a single document.
This becomes even worse when the user has another parent, on which I need to search as well.
Do you have any recommendations about how to handle this use case? Maybe my Google foo is just not good enough, but what I found (block joins, etc.) doesn't seem to do the trick.
The absolutely most performant and easiest solution is to flatten everything into a single document. It turns out that these relations aren't updated as often as people think, and searches are performed far more often than the documents are updated. And even if one of the values that is identical across a large set of documents changes, reindexing from the most recent documents (for a blog) and then going backwards will appear rather performant to most users. This assumes that you actually have to search on the values and don't just need them for display - in that case you could look them up from secondary storage when displaying an item (and just store the never-changing id in the document).
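Here is a minimal sketch of the flattened approach with pysolr; the field names (user_id, user_name, created, ...) are invented for illustration:

```python
# Flattened blog document: parent fields are denormalized copies.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/blog", always_commit=True)
solr.add([{
    "id": "blog-42",
    "title": "On flat documents",
    "content": "...",
    "user_id": "u7",        # immutable id: safe to store forever
    "user_name": "alice",   # denormalized: reindex blog docs on rename
    "category_name": "search",
}])

# When a user renames, reindex their posts newest-first so the most
# visible documents become consistent soonest.
for doc in solr.search("user_id:u7", sort="created desc", rows=1000):
    doc.pop("_version_", None)  # avoid optimistic-concurrency conflicts
    doc["user_name"] = "alice2"
    solr.add([doc])
```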
Another option is to divide this into a multi-search problem: one collection for blog posts, one collection for users and one collection for categories. You then search through each of the collections for the relevant data and merge it in your search model. You can also use Streaming Expressions to hand off most of this processing to the Solr cluster for you.
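The client-side merge variant could look roughly like this (two pysolr cores and field names assumed; the category merge is omitted for brevity):

```python
# Query each collection separately, then join in the search model.
import pysolr

blogs = pysolr.Solr("http://localhost:8983/solr/blog")
users = pysolr.Solr("http://localhost:8983/solr/user")

hits = list(blogs.search("content:lucene", rows=20))
user_ids = {h["user_id"] for h in hits}
if user_ids:
    q = "id:(" + " OR ".join(user_ids) + ")"
    by_id = {u["id"]: u for u in users.search(q, rows=len(user_ids))}
    for h in hits:
        h["user"] = by_id.get(h["user_id"])  # merged result object
```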
The reason why I always recommend flattening if possible is that most features in Solr (and Lucene) are written for a flat document structure and let you fully leverage everything that is available. Since Lucene is by design a flat document store, most features require special care to support block joins and parent/child relationships, and you end up experimenting a lot to get the queries and feature set you want (if it is possible at all). If the documents are flat, it just works.

Mongodb large array or query

My question is related to mongo's ability to handle huge arrays.
I would like to send a push notification to all subscribers of a topic when the topic is updated. Assume a topic can have a million subscribers.
Will it be efficient to hold a huge array in the topic document that holds all the user ids subscribed to it? Or is the conservative way better - hold an array of subscribed topics for each user, and then query the users collection to find the subscribers of a specific topic?
Edit:
I would hold an array of subscribed topics in the user collection anyway (for views and edits)
Primary Assumption: Topic-related and person-related metadata is stored in different collections and the collection being discussed here is utilized only to keep track of topic subscribers.
Storing subscribers as a list/array associated with a topic identifier as the document key (meaning an indexed field) makes for an efficient structure. Once you have a topic of interest, you can look up the subscriber list by topic identifier. Here, as @Saleem rightly pointed out, you need to be wary of large subscriber lists pushing documents past the 16 MB document size limit. But instead of complicating the design with a separate collection to handle this (as @Saleem suggested), you can simply split the subscriber list into as many parts as required to stay under 16 MB and create multiple documents for a topic in the same collection. Since the topic identifier is an indexed field, lookup time will not suffer: 16 MB accommodates a very large number of subscriber identifiers, so the number of splits required should be fairly low, if any are needed at all.
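A sketch of that split with pymongo - the "bucket" pattern. The cap below is an arbitrary stand-in for whatever count keeps one document safely under 16 MB:

```python
# Subscriber lists bucketed into capped documents per topic. The filter
# only matches a bucket with room left; upsert=True opens a new bucket
# when every existing one for the topic is full.
from pymongo import MongoClient

col = MongoClient()["app"]["topic_subscribers"]
col.create_index("topic_id")

BUCKET_CAP = 100_000  # sized to stay well under the 16 MB limit

def subscribe(topic_id, user_id):
    col.update_one(
        {"topic_id": topic_id, "count": {"$lt": BUCKET_CAP}},
        {"$push": {"subscribers": user_id}, "$inc": {"count": 1}},
        upsert=True,
    )

def subscribers(topic_id):
    # Indexed lookup returns every bucket for the topic.
    for bucket in col.find({"topic_id": topic_id}):
        yield from bucket["subscribers"]
```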
The other structure you suggested, where a subscriber identifier is the document key with all their subscribed topics in the document, is intuitively not as efficient for a large dataset. It would involve looking up all subscribers of the topic at hand; if subscribed topics are stored as a list/array (the likely choice), this query involves an $in clause, which is slower than an indexed field lookup, even for small topic lists over a significantly large user base.
If your array is very big and the cumulative size of the document exceeds 16 MB, then split it into another collection. You can keep topics in one collection and all of their subscribers in a separate collection that references the topic collection.

Graph database modeling: multiple edges are better than single edges with properties?

This is for a project that will map metadata. There are many more nodes but this particular one became a debate in the team.
Which model would yield the best query performance? Or does it not matter?
Option 1
Permission metadata is explicit as edges between nodes.
Option 2
Permission metadata is inside the properties of the edge.
Option 3
???
Let me comment for ArangoDB here, being one of its developers.
There is a third possibility, namely to have a single vertex collection and multiple edge collections for the different access methods. You would then "officially" have 3 graphs that share the same vertex set.
I would expect that this is better in performance, because each access type would only have to deal with a single type of edge and access would be fast.
Obviously it all depends on your queries. My statement holds for queries like "what are all the Entities a Person can update?" or "who can select this Entity?".
I could imagine that your standard query is more "Can this person delete that Entity?" or "Which access rights does this person have for that Entity?".
These two questions are probably not efficient with any of the approaches suggested, because as far as I see, all of them would then require a search, either in the outgoing edges of the Person or in the incoming edges of the Entity.
What would be needed here is a kind of "vertex-centric index", that is, an index that can be used for the set of outgoing or incoming edges of a given vertex. If you, for example, used your option 2 (or indeed 1, this does not matter so much) and had a sorted index on all edges, sorted first by Person and then by Entity, then finding the (probably singleton) set of edges from a given Person to a given Entity would be a lookup with time complexity O(log(#edges)).
We at ArangoDB are currently busy adding this feature, which will appear in one of the next two releases.
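A sketch of the multi-edge-collection layout with python-arango and AQL (all names invented): each access type gets its own edge collection, so a traversal for one right never has to filter on edge properties.

```python
# One vertex collection shared by three access graphs.
from arango import ArangoClient

db = ArangoClient(hosts="http://localhost:8529").db(
    "perms", username="root", password=""
)
db.create_collection("nodes")
for name in ("can_select", "can_update", "can_delete"):
    db.create_collection(name, edge=True)

# "What are all the Entities this Person can update?" touches only
# the can_update edges - no per-edge property checks needed.
cursor = db.aql.execute(
    "FOR v IN 1..1 OUTBOUND @person can_update RETURN v",
    bind_vars={"person": "nodes/alice"},
)
print([v["_key"] for v in cursor])
```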
I can only speak for Neo4j here:
I don't know that it would matter much, but definitely benchmark! Both relationships and properties are stored as linked lists, so Neo4j will still need to traverse them either way. But if you have many relationships between Person and Entity nodes, then putting the permissions in properties starts to become more attractive.
I recommend checking out the free O'Reilly book Graph Databases to learn more about the internals of Neo4j. But benchmarks will always be the gold standard.

Solr / rdbms, where to store additonal data

What would be considered best practice when you need additional data about facet results?
I.e. I need a friendly name / image / meta keywords / description / and more for product categories (when faceting on categories).
include it in the document? (can lead to looots of duplication)
introduce category as a new index in Solr (or fake it via a doctype=category field in Solr)
use an RDBMS to look up the additional data using a SELECT ... WHERE IN (..category facet result ids..)
Thanks,
Remco
Use a fast NoSQL DB that fits your data.
BTW Lucene, which is Solr's underlying layer, is in fact also a NoSQL-type storage facility.
If I were you, I'd use MongoDB. That's the first DB that came to mind, since you need binary data, and they practically invented BSON, which is now a widespread means of transferring binary data in a JSON-like fashion.
If your data structure is more graph-shaped (like a social network), check out Neo4j, which has blindingly fast graph traversal algorithms.
A relational DB can reliably enforce the "category is a first-class entity" thing. You would need referential integrity: a product may not belong to a category that doesn't exist, and a deleted category must not leave its child categories lying around. A normalized RDB can enforce referential integrity through the schema; a NoSQL DB must rely on client-side code (which you must write) to enforce it.
Let's see how "a product's category must exist" and "subcategories' parents must exist" are done:
RDB: The table that assigns categories to products (an m:n relation) is keyed to the product and category with an ON DELETE CASCADE. If a category is deleted, a product simply cannot keep that category. A category that links up to another category as a child: the relevant field gets an ON DELETE CASCADE, so if a parent is deleted, its children cannot exist. This entire method is declarative ("it is declared thus"); all the complexity lives in the data, and we don't need no stinking code to do it for us. You can model a DB as naturally as you understand its real-world implications. (See the sketch below.)
Document-store-type NoSQL: You need to write code to do everything. "A category is deleted" is a use case, and you need to find the products that have that category and update each one. You have to write code for each use case, and the same goes for managing subcategories. The data model may be incredibly stupid, but its real-world implications must be modeled in the code - and it is tougher to reason in code and control flow than in data structures.
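To make the declarative version concrete, here is a small sqlite3 sketch of both rules (the schema is illustrative):

```python
# Referential integrity lives in the schema: deleting a category takes
# its subcategories and its product links with it - no application code.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite needs this opt-in
db.executescript("""
CREATE TABLE category (
    id INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES category(id) ON DELETE CASCADE
);
CREATE TABLE product (id INTEGER PRIMARY KEY);
CREATE TABLE product_category (  -- the m:n relation
    product_id  INTEGER REFERENCES product(id)  ON DELETE CASCADE,
    category_id INTEGER REFERENCES category(id) ON DELETE CASCADE
);
""")
db.execute("INSERT INTO product VALUES (10)")
db.execute("INSERT INTO category VALUES (1, NULL), (2, 1)")
db.execute("INSERT INTO product_category VALUES (10, 2)")

db.execute("DELETE FROM category WHERE id = 1")  # delete the parent
print(db.execute("SELECT COUNT(*) FROM category").fetchone())          # (0,)
print(db.execute("SELECT COUNT(*) FROM product_category").fetchone())  # (0,)
```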
Do you really have performance needs that require NoSQL databases?
So use an RDBMS to manage your data. Then use the DataImportHandler or client-side code to insert/update denormalized entities for searching. If most requests to your site can be expressed as Solr queries, great!
As for expressing hierarchical faceting in Solr, see 'Ways to do hierarchical faceting in Solr?'.
I would think about 2 alternatives:
1.) Store the information for every document without indexing it (to keep the index as small as possible). The point is that I would not store the image inside Lucene/Solr - only a file pointer.
2.) Store the additional data in an RDBMS or a NoSQL store (like MongoDB) to look up, as you wrote.
My favorite is the 2nd one, because a database is the traditional and most optimized way of storing data.
But finally it depends on your system: keep in mind that you need time to connect to the database, search through the data and send the additional information back to the application.
So it could be faster to store everything in Lucene.
Probably a small performance test would be useful.
Maybe I am wrong, but if you are on Solr trunk you could benefit from Solr join support; this would allow you to index several entities with relations among them while enforcing conditions on both.

Django: efficient database search

I need an efficient way to search through my models to find specific users; here's a list:
User - list of users, their names, etc.
Events - table of events for all users, on when they're not available
Skills - many-to-many relationship with the User, a User could have a lot of skills
Contracts - many-to-one with User, a User could work on multiple contracts, each with a rating (if completed)
... etc.
So I got a lot of tables linked to the User table. I need to search for a set of users fitting certain criteria; for example, he's available from next Thurs through Fri, has x/y/z skills, and has received an average 4 rating on all his completed contracts.
Is there some way to do this search efficiently while minimizing the # of times I hit the database? Sorry if this is a very newb question.
Thanks!
Not sure if this method will solve your issue for all 4 cases, but it should at least help with the first one - querying user data efficiently.
I usually find using the values or values_list query functions faster, because they slim down the SELECT part of the actual SQL, so you get results back sooner. See the Django docs regarding this.
Also worth mentioning: starting with the current dev version, values and values_list can query across any type of relationship, including many-to-one.
And finally you might find in_bulk useful. For a complex query, you can fetch just the ids of some models first using values or values_list, and then use in_bulk to get the model instances quickly. See the Django docs about that.
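A rough sketch of how those pieces could combine for the criteria in the question (the app, model, and field names - skills__name, contracts__rating - are guesses at a plausible schema, and the availability check against Events is left out):

```python
# Narrow to ids with slim values_list queries, then fetch instances once.
from django.db.models import Avg
from myapp.models import User  # hypothetical app and model

skilled_ids = (
    User.objects.filter(skills__name="x")
    .filter(skills__name="y")  # chained filters: must have ALL the skills
    .values_list("id", flat=True)
)

rated_ids = (
    User.objects.filter(id__in=skilled_ids)
    .annotate(avg_rating=Avg("contracts__rating"))
    .filter(avg_rating__gte=4)
    .values_list("id", flat=True)
)

# One final query; in_bulk returns a {pk: instance} dict.
candidates = User.objects.in_bulk(list(rated_ids))
```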
