How could I express the following design?
There are two entities: user and group
Group can have users and other groups
User can't have other users or groups
Efficiently query any group and everything it contains
There are conceptually no depth limits (in practice the current hardware dictates one, e.g. 5 for query speed)
I need to use NoSQL and also be able to cache this data (Redis for example, which is NoSQL itself).
---
My current idea:
Every group is a single unit and only contains the IDs of its children (users and groups). Then I query all the children by ID. If some of these also have children, I make another roundtrip, and so on.
As you can imagine, this solution requires multiple queries and the amount increases with every "level of deepness". The good news is that I query all these items by ID which should be extremely fast.
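The level-by-level lookup described above can be sketched roughly as follows. This is only an illustration: the `STORE` dict stands in for the NoSQL store or cache, and `fetch_many` is a hypothetical helper representing one batched get-by-ID roundtrip, not a real client API.

```python
# Hypothetical stand-in for the key/value store: ID -> record.
# Groups hold only the IDs of their children; users have no children.
STORE = {
    "g1": {"type": "group", "children": ["u1", "g2"]},
    "g2": {"type": "group", "children": ["u2", "g3"]},
    "g3": {"type": "group", "children": ["u3"]},
    "u1": {"type": "user"},
    "u2": {"type": "user"},
    "u3": {"type": "user"},
}


def fetch_many(ids):
    """Simulate one roundtrip that fetches several records by ID."""
    return {i: STORE[i] for i in ids if i in STORE}


def expand_group(group_id):
    """Return (IDs of everything inside the group, number of roundtrips)."""
    visited = {group_id}  # also guards against accidental cycles
    frontier = [group_id]
    roundtrips = 0
    while frontier:
        records = fetch_many(frontier)  # one batched query per level
        roundtrips += 1
        next_frontier = []
        for record in records.values():
            for child_id in record.get("children", []):
                if child_id not in visited:
                    visited.add(child_id)
                    next_frontier.append(child_id)
        frontier = next_frontier
    return visited - {group_id}, roundtrips


contained, trips = expand_group("g1")
print(sorted(contained), trips)
```

With this sample data the number of roundtrips equals the number of levels visited, which is the cost the question is worried about.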
Can anyone suggest a better way?
I would use a graph database, as they are very powerful when dealing with this kind of query.
Bear in mind that you won't be able to query the "parents" of a node though.
You could use Neo4j for this. They have a community edition that is free. https://neo4j.com/
I am currently working with Java Spring and Postgres.
I have a query on a table, many filters can be applied to the query and each filter needs many joins.
This query is very slow, due to the number of joins that must be performed, also because there are many elements in the table.
Foreign keys and indexes are correctly created.
I know one approach could be to keep duplicate information to avoid doing the joins. By this I mean creating a new table called infoSearch and keeping it updated via triggers. At the time of the query, perform search operations on said table. This way I would do just one join.
But I have some doubts:
What is the best approach in Postgres for storing a list of items flat?
I know there is a json datatype. Could I use it to hold the information needed for the search and query it with jsonPath? Is this performant with lists?
I also greatly appreciate any advice on another approach that can be used to fix this.
Is there any software that can be used to make this more efficient?
I'm wondering whether it would be more performant to move to another style of database, such as a graph database. At this point the only problem I have is with this specific table; the rest of the application uses simple queries that fit a relational database very well.
Is there any scaling guideline, based on ratios and the number of items, for choosing which kind of database to use?
Denormalization is a tried and true way to speed up queries/reports/searching processes for relational databases. It uses a standard time vs space tradeoff to reduce the time of query, at the cost of duplicating the data and increasing write/insert time.
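A minimal sketch of the trigger-maintained search table idea, under stated assumptions: SQLite (via Python's `sqlite3`) stands in for Postgres so the example is self-contained, and the `item`, `tag`, `item_tag` and `info_search` tables are hypothetical. In Postgres the trigger body would live in a trigger function, so the syntax differs, but the time-versus-space tradeoff is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE tag  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE item_tag (item_id INTEGER, tag_id INTEGER);

-- Denormalized search table: one flat row per (item, tag) pair.
CREATE TABLE info_search (item_id INTEGER, item_name TEXT, tag_name TEXT);
CREATE INDEX info_search_tag ON info_search (tag_name);

-- Keep the flat table in sync on every write (the extra write cost).
CREATE TRIGGER item_tag_after_insert AFTER INSERT ON item_tag
BEGIN
    INSERT INTO info_search (item_id, item_name, tag_name)
    SELECT i.id, i.name, t.name
    FROM item i, tag t
    WHERE i.id = NEW.item_id AND t.id = NEW.tag_id;
END;

CREATE TRIGGER item_tag_after_delete AFTER DELETE ON item_tag
BEGIN
    DELETE FROM info_search
    WHERE item_id = OLD.item_id
      AND tag_name = (SELECT name FROM tag WHERE id = OLD.tag_id);
END;
""")

conn.executemany("INSERT INTO item VALUES (?, ?)", [(1, "laptop"), (2, "phone")])
conn.executemany("INSERT INTO tag VALUES (?, ?)", [(1, "electronics"), (2, "portable")])
conn.executemany("INSERT INTO item_tag VALUES (?, ?)", [(1, 1), (1, 2), (2, 1)])


def search_by_tag(tag_name):
    """Search the flat table only: no joins at query time."""
    rows = conn.execute(
        "SELECT item_name FROM info_search WHERE tag_name = ? ORDER BY item_name",
        (tag_name,),
    )
    return [row[0] for row in rows]


print(search_by_tag("electronics"))
```

A real setup would also need triggers for renames of items and tags, which this sketch leaves out.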
There are third party tools that are specifically designed for this use-case, including search tools (like ElasticSearch, Solr, etc) and other document-centric databases. Graph databases are probably not useful in this context. They are focused on traversing relationships, not broad searches.
I have three entities: user, post and comment. A user may have multiple posts and a post may have multiple comments.
I know I can add ancestor relations like this:
user(Grand Parent) post(parent) comment(child)
I'm a little bit confused about ancestors. I read in the documentation and in searches that ancestors are used for transactions, that all ancestors are in the same entity group, and that entity groups are stored on the same datastore node, which makes them less scalable. Is this right?
Is creating user as parent of posts and post as parent of comments a good thing?
Instead, we could add one extra property to the post entity, such as user_id as shown in the example, and filter by it.
Which is better/more scalable: filter posts by ancestors or add an extra property user_id in the post Entity and filter by it?
I know both approaches can get the same results but I want to know which one is better in performance and scalability?
Sorry, I'm new in datastore.
Update 11/4/2017
A large number of users are using this app. It is quite possible that more than one post is created per second. A single user cannot create more than one post per second, but multiple users together may. The documentation describes a maximum entity group write rate of 1/s. Is it still possible to use ancestors?
The same goes for comments. Multiple users can add comments in the same entity group, so more than one comment per second is quite possible.
Are ancestor queries faster?
I read in many places that ancestor queries are much faster than others.
As far as I know, they are fast because an ancestor relation creates an entity group and stores the related data on the same node, so it takes less time to read from a single node than from multiple nodes.
For example: if a post is stored on a node in Asia and a comment is stored on a node in Europe, and I want to get posts and comments together, the datastore API needs to fetch from two nodes to complete the request, which makes it slow. If instead I create an ancestor relation, and thus an entity group, performance is better.
But what if I don't need the post and comment data at the same time, for example if posts are shown on one web page and comments on another? In this scenario the datastore API needs to fetch from only one node at a time, so it does not matter whether the data is saved on a single node or on multiple nodes. Can ancestors still make queries faster in this case?
Yes, you are correct: all ancestry-related entities are in the same entity group, which raises 2 scalability issues: data contention and the maximum entity group write rate of 1/s. See the somewhat related question "Is there an Entity Group Max Size?"
There are advantages of using ancestries and some may be willing to sacrifice scalability for them (see What would be the purpose of putting all datastore entities in a single group?), but IMHO not for your kind of app: I think you'll agree that it's not really critical to see every new user/post/comment in random searches immediately after it is created (i.e. strong consistency) - the fact that it eventually appears is IMHO good enough.
Simply having no ancestry at all and adding additional model properties (entity keys or even just entity key IDs for entities which never have ancestors) to allow cross-referencing entities is the more scalable approach and IMHO fits well with your app.
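To make the difference concrete, here is a toy model (plain Python, not the Datastore API) of how key paths map to entity groups. It relies only on the fact that an entity group is a root entity plus all of its descendants; the kinds, IDs and the `user_id`/`post_id` properties are illustrative assumptions.

```python
from collections import Counter


def entity_group(key_path):
    """An entity's group is identified by the root element of its key path."""
    return key_path[0]


# With ancestry: user -> post -> comment, so every key starts at the user.
with_ancestors = [
    (("User", 1), ("Post", 10)),
    (("User", 1), ("Post", 11)),
    (("User", 1), ("Post", 10), ("Comment", 100)),
    (("User", 1), ("Post", 10), ("Comment", 101)),
]

# Without ancestry: every entity is its own root and carries
# cross-reference properties instead of a parent.
without_ancestors = [
    {"key": (("Post", 10),), "user_id": 1},
    {"key": (("Post", 11),), "user_id": 1},
    {"key": (("Comment", 100),), "post_id": 10, "user_id": 2},
    {"key": (("Comment", 101),), "post_id": 10, "user_id": 3},
]


def writes_per_group(key_paths):
    """Count how many writes land in each entity group."""
    return Counter(entity_group(path) for path in key_paths)


grouped = writes_per_group(with_ancestors)
flat = writes_per_group(entity["key"] for entity in without_ancestors)
print(grouped)
print(flat)
```

In the first layout all four writes compete for the single group rooted at the user, which is where the 1 write/s limit bites; in the second layout each write goes to its own group.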
I think the question to ask is: Are you expecting:
Users to create posts more than once per second (I doubt it :)
People to comment on a post more than once per second (could happen)
If not, then ancestor queries will be faster than normal queries, so it depends on your use case. I'd go for query speed unless you know you will have thousands of comments on posts.
I would like to know whether it is worth using a graph database specifically for relationships.
I intend to use a relational database for storing entities like "User", "Page", "Comment", "Post", etc.
But in most cases of a typical social-graph workload I need deep traversals, which relational databases handle poorly because they involve slow joins.
Example: Comment -(made_in)-> Post -(made_in)-> Page etc...
I'm thinking of doing something like this:
Example:
User id: 1
Query: Get all followers of user_id 1
Query Neo4j for all outgoing edges named "follows" for the user node with id 1
With a list of ids query them on the Users table:
SELECT *
FROM users
WHERE user_id IN (ids)
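The two-step lookup above could look roughly like this. It is only a sketch: the `FOLLOWERS` dict is a hypothetical stand-in for the Neo4j query, SQLite stands in for the relational database, and the sample rows are made up.

```python
import sqlite3

# Hypothetical stand-in for the graph query: user ID -> follower IDs.
FOLLOWERS = {1: [2, 3, 5], 2: [1]}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "Ann"), (2, "Bob"), (3, "Cid"), (4, "Dee"), (5, "Eve")],
)


def followers_of(user_id):
    """Step 1: get IDs from the graph. Step 2: load the rows by ID."""
    ids = FOLLOWERS.get(user_id, [])
    if not ids:
        return []  # avoid an empty IN () clause
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT user_id, name FROM users "
        f"WHERE user_id IN ({placeholders}) ORDER BY user_id",
        ids,
    )
    return rows.fetchall()


print(followers_of(1))
```

Each call costs one hop to the graph store plus one hop to the relational store, and the IN list grows with the number of followers.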
Is this slow?
I have seen this question Is it a good idea to use MySQL and Neo4j together?, but still cannot understand why the correct answer says that that is not a good idea.
Thanks
Using Neo4j is a great choice of technologies for an application like yours, that requires deep traversals. The reason it's a good choice is two-fold: one is that the Cypher language makes such queries very easy. The second is that deep traversals happen very quickly, because of the way the data is structured in the database.
In order to reap both of these benefits, you will want to have both the relationships and the people (as nodes) in the graph. Then you'll be able to do a friend-of-friends query as follows:
START john=node:node_auto_index(name = 'John')
MATCH john-[:friend]->()-[:friend]->fof
RETURN john, fof
and a friend-of-friend-of-friend query as follows:
START john=node:node_auto_index(name = 'John')
MATCH john-[:friend]->()-[:friend]->()-[:friend]->fofof
RETURN john, fofof
...and so on. (Same idea for posts and comments, just replace the name.)
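For readers less familiar with Cypher, the fixed-depth patterns above amount to following the friend edges a fixed number of times. A rough Python equivalent over an in-memory adjacency list is shown below; the `FRIENDS` data is made up, and unlike Cypher, which returns one row per matching path, this sketch returns a de-duplicated set.

```python
# Hypothetical friend graph: name -> names this person lists as friends.
FRIENDS = {
    "John": ["Sara", "Joe"],
    "Sara": ["Maria"],
    "Joe": ["Steve", "Maria"],
    "Maria": ["Alex"],
    "Steve": [],
    "Alex": [],
}


def friends_at_depth(graph, start, depth):
    """Names reached by following exactly `depth` friend edges from `start`."""
    frontier = {start}
    for _ in range(depth):
        # Each step only touches the neighbours of the current frontier.
        frontier = {
            friend for person in frontier for friend in graph.get(person, [])
        }
    return frontier


print(sorted(friends_at_depth(FRIENDS, "John", 2)))  # friends of friends
print(sorted(friends_at_depth(FRIENDS, "John", 3)))  # one level deeper
```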
Using Neo4j alongside MySQL is fine, but I wouldn't do it in this particular way, because the code will be much more complex, and you'll lose too much time hopping between Neo4j and MySQL.
Best of luck!
Philip
In general, the more databases/systems/layers you've got, the more complex the overall setup and operating will be.
Think about all those tasks like synchronization, export/import, backup/archive etc. which become quite expensive if your database(s) grow in size.
People use polyglot persistence only if the benefits of having dedicated and specialized databases outweigh the drawbacks of having to cope with multiple data stores. For example, this can be the case if you have a large number of data items (activity or transaction logs, for example), each related to a user. It would probably make no sense to store all the information in a graph database if you're only interested in the connections between the data items. So you would be better off storing only the relations in the graph (where the nodes hold just a pointer into the other database), and the data per item in a K/V store or the like.
For your example use case, I would go only for one database, namely Neo4j, because it's a graph.
As the other answers indicate, using Neo4j as your single data store is preferable. However, in some cases there might not be much choice in the matter because you already have another database behind your product. I would just like to add that if this is the case, running Neo4j as your secondary database does work (the product I work on operates in this mode). You do have to work extra hard at figuring out what functionality you expect from Neo4j, what kind of data it needs, how to keep the data in sync, and what the consequences are of results that are not always real time. Most of our use cases can work with near-real-time results, so we are fine, but that may not be the case for your product. Still, to me, using Neo4j in this mode is preferable to running without it.
We are able to produce a lot of graphy-great stuff as a result of it.
I need an efficient way to search through my models to find a specific User, here's a list,
User - list of users, their names, etc.
Events - table of events for all users, on when they're not available
Skills - many-to-many relationship with the User, a User could have a lot of skills
Contracts - many-to-one with User, a User could work on multiple contracts, each with a rating (if completed)
... etc.
So I got a lot of tables linked to the User table. I need to search for a set of users fitting certain criteria; for example, he's available from next Thurs through Fri, has x/y/z skills, and has received an average 4 rating on all his completed contracts.
Is there some way to do this search efficiently while minimizing the # of times I hit the database? Sorry if this is a very newb question.
Thanks!
I'm not sure this method will solve your issue for all 4 cases, but it should at least help with the first one: querying user data efficiently.
I usually find the values or values_list query functions faster because they slim down the SELECT part of the actual SQL, so you get results faster. Django docs regarding this.
It's also worth mentioning that, starting with the new dev version, values and values_list can query across any type of relationship, including many_to_one.
Finally, you might find in_bulk useful as well. For a complex query, you might first query just the IDs of some models using values or values_list and then use in_bulk to get the model instances faster. Django docs about that.
I have a bunch of items in my program that all belong to a specific category. I'd like to return only the items that belong to that category. The problem is that categories can have parent categories. For example, let's say there's a category "Stuff" with the child category "Food" with the child category "Fruit". I have the items, Apple, Pear, Chocolate, and Computer.
If I want to display all of the fruits, it's easy to do a database query with a "WHERE item.category = FRUIT_ID" clause. However, if I want all foods to be included, I need a way to get the fruits in there, too.
I know that some databases, like Oracle, have a notion of recursive queries, and that might be the right solution, but I don't have a lot of experiences with hierarchical data and am looking for general suggestions. Assume I have unlimited control over the database schema, the category tree only goes maybe 5 categories deep maximum, and I need it to be as ridiculously fast as possible.
Have a look at the nested set model - it's not perfect (it's very slow to update), but in some situations (hierarchical queries), it's a great representation, especially for problems like yours.
There's a whole book full of design strategies for representing trees in SQL. It's worth looking at just for the sheer clever points.
Assuming your category tree is small enough to be cached, you might be better off keeping the category tree in memory and having a function over that tree that generates a list of the category IDs below a given category.
Then when you query the database, you just use an IN clause with the list of child IDs.
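A minimal sketch of that idea, assuming a hypothetical `item` table with a `category_id` column and using SQLite so it runs standalone. The category rows mirror the Stuff / Food / Fruit example from the question; the list passed to IN also includes the requested category itself so that items filed directly under it are returned.

```python
import sqlite3
from collections import defaultdict

# Hypothetical category rows, cached in memory: (id, parent_id, name).
CATEGORIES = [(1, None, "Stuff"), (2, 1, "Food"), (3, 2, "Fruit")]

children = defaultdict(list)
for cat_id, parent_id, _name in CATEGORIES:
    children[parent_id].append(cat_id)


def subtree_ids(root_id):
    """Return root_id plus the IDs of every category below it."""
    result, stack = [], [root_id]
    while stack:
        current = stack.pop()
        result.append(current)
        stack.extend(children.get(current, []))
    return result


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (name TEXT, category_id INTEGER)")
conn.executemany(
    "INSERT INTO item VALUES (?, ?)",
    [("Apple", 3), ("Pear", 3), ("Chocolate", 2), ("Computer", 1)],
)


def items_in(category_id):
    """One database query, using an IN clause built from the cached tree."""
    ids = subtree_ids(category_id)
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT name FROM item WHERE category_id IN ({placeholders}) "
        "ORDER BY name",
        ids,
    )
    return [row[0] for row in rows]


print(items_in(2))  # everything filed under Food, including Fruit
```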
One possible solution is to separate the hierarchy from the actual categorization. For instance, an apple could be categorized as both a fruit and a food. The categorization has no knowledge that a fruit is a food, but you could define that somewhere else. Then, your query would be as simple as where category='food'.
Alternatively, you could go through the hierarchy before building your query and it would require something like where category='food' or category='fruit'.
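One way to read the first suggestion is to expand each item's category into all of its ancestor categories at write time, so the read side is a flat match. A small in-memory sketch, with the `PARENT` mapping and item data as assumptions taken from the question's example:

```python
# Hierarchy kept separately from the categorization: child -> parent.
PARENT = {"fruit": "food", "food": "stuff", "stuff": None}

# The most specific category each item was filed under.
ITEM_CATEGORY = {
    "apple": "fruit",
    "pear": "fruit",
    "chocolate": "food",
    "computer": "stuff",
}


def with_ancestors(category):
    """Return the category followed by all of its ancestors."""
    chain = []
    while category is not None:
        chain.append(category)
        category = PARENT[category]
    return chain


# Done once at write time: label every item with all applicable categories.
labels = {
    item: set(with_ancestors(category))
    for item, category in ITEM_CATEGORY.items()
}


def items_labelled(category):
    """Read side: a flat membership test, no hierarchy walk needed."""
    return sorted(item for item, cats in labels.items() if category in cats)


print(items_labelled("food"))
```

The price is more rows per item and a re-labelling step whenever the hierarchy itself changes.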
I think your database schema is quite fine, but the implementation of this search really depends on your specific RDBMS. A lot of them have ways to perform this sort of recursion. One example I can think of is SQL Server's support of Common Table Expressions which are lightning fast alternatives to those nasty cursors.
If you specify which RDBMS you're using, you might get more specific answers.
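As an illustration of the recursive approach, here is a sketch of a recursive common table expression, run against SQLite so that it is self-contained. The `category` and `item` tables are hypothetical, and the exact syntax varies by RDBMS (SQL Server, for instance, writes `WITH` without the `RECURSIVE` keyword).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
CREATE TABLE item (name TEXT, category_id INTEGER);
INSERT INTO category VALUES (1, NULL, 'Stuff'), (2, 1, 'Food'), (3, 2, 'Fruit');
INSERT INTO item VALUES ('Apple', 3), ('Pear', 3), ('Chocolate', 2), ('Computer', 1);
""")

# The CTE starts at the requested category and repeatedly adds child rows
# until no new categories are found.
QUERY = """
WITH RECURSIVE subtree(id) AS (
    SELECT id FROM category WHERE id = ?
    UNION ALL
    SELECT c.id FROM category c JOIN subtree s ON c.parent_id = s.id
)
SELECT name FROM item
WHERE category_id IN (SELECT id FROM subtree)
ORDER BY name
"""


def items_under(category_id):
    """Return the items in a category and all of its descendants."""
    return [row[0] for row in conn.execute(QUERY, (category_id,))]


print(items_under(2))
```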