Graph database query - database

I have undertaken a project that tracks shops from where a user can buy an item. I have decided to use Neo4j as the database.
This is a part of the database that I have in mind: There is a node for each shop, and each of these nodes has child-nodes that store items available in that store.
Consider this scenario: Now a particular user (who always goes to one particular shop to buy all his items) wants to know alternative shops from where he can get all (or maximum) number of the items he wants to purchase.
The problem is that an exhaustive search of all the shops and all their items, followed by their matching will take a lot of space/time. Is there any procedure/algorithm that could help me solve this problem with minimum space/time cost?
P.S.: Please note that I would like to stick with Neo4j only, because it solves many of the other database problems very efficiently.

Your use case is actually a perfect fit for a graph database. I would recommend modelling your items as primary nodes and connecting them to the stores that stock them.
Index your store nodes using the indexing service. That gives you a quick lookup for a store, and any particular item is then one traversal away. Getting all other stores for an item is also just a matter of edge traversals at that point.
Hope this helps.
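To illustrate why item-first modelling avoids an exhaustive scan, here is a minimal sketch in plain Python rather than Neo4j. The names `build_item_index` and `rank_alternative_shops` and the sample data are made up for illustration; in Neo4j the same idea would be expressed as traversals from item nodes to store nodes.

```python
from collections import Counter, defaultdict

def build_item_index(stock):
    """Invert {shop: items} into {item: shops}, so items act as the primary nodes."""
    index = defaultdict(set)
    for shop, items in stock.items():
        for item in items:
            index[item].add(shop)
    return index

def rank_alternative_shops(index, wanted, usual_shop):
    """Count, per shop, how many wanted items it stocks; best coverage first."""
    coverage = Counter()
    for item in wanted:
        # Only shops that stock this item are visited, never the whole catalogue.
        for shop in index.get(item, ()):
            if shop != usual_shop:
                coverage[shop] += 1
    # Sort by coverage (descending), then by shop name for a stable order.
    return sorted(coverage.items(), key=lambda pair: (-pair[1], pair[0]))

# Hypothetical sample data.
stock = {
    "ShopA": {"milk", "bread", "eggs"},
    "ShopB": {"milk", "eggs"},
    "ShopC": {"bread", "milk", "eggs", "tea"},
}
index = build_item_index(stock)
ranking = rank_alternative_shops(index, {"milk", "bread", "eggs"}, "ShopA")
print(ranking)
```

The work is proportional to the number of (wanted item, stocking shop) pairs, not to the total number of shops and items.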

Related

Data organisation in graph database

This is more of a logical question rather than technical one. I am asking for data organisation guidance for my requirements. Please keep in mind that I am willing to use a graph database for this purpose (though I am pretty new at that). So guidance in graph database context would be much appreciated.
Let me provide an overview of the scenario. There are two entities in the app, User and House. A user can own a house or rent a house. If a user rents a house, the time period for which the user has rented it should be recorded. A user may rent the same house for different periods.
Demo Dataset:
A (User) -owns-> H1, H2, H3 (House) - one-liner for brevity
X -rents-> H2 (start=DATE1, end=DATE2)
Y -rents-> H2 (start=DATE3, end=DATE4)
X -rents-> H2 (start=DATE5, end=DATE6) - user rents same house again
I am assuming that User and House would be nodes and owns and rents would be edges. Rent period would be properties of rents edges. Please point out if there is any better way.
Questions:
Is this possible in graph database in general to have multiple edges of same type between two nodes? Should I keep just one edge for rent of a specific user to specific house and add periods? Or should I maintain multiple edges for multiple periods?
Is it possible to query for something like: "fetch all the houses that were empty for a period of 3 months"? This should fetch the houses that have a gap of 3 months between consecutive end and next start dates in rents. These houses may not be empty now.
I have checked neo4j, cayley, dgraph etc. Which may be better with this scenario?
Any guidance of how I should keep the data with relationships would be much appreciated. Have a nice day.
I think this may be solved, but I would just add that TerminusDB may be useful to assess as part of your process. The reason that I say this is that you are:
TerminusDB uses an expressive logical schema language that allows anything that is logically expressible. So you could have multiple edges of same type between two nodes. However, data modeling is something of an art - as your question suggests - so it will depend on your domain. (I always think that 'deal' could be an edge or a node depending on context - you can participate in a deal or you could strike a deal with another party).
As TerminusDB is a native revision control database, time bound queries can be relatively straightforward. You can get a delta, or a series of deltas, between two events.
There may be a better answer than this, but I am posting my experience with graphs for the given requirement in case it is of any help to you.
I think a graph DB is a good fit for your requirement. To answer your questions:
It is mostly a matter of designing your graph model to suit the purpose, and I think you can have multiple rent edges with different periods from a user node to a house node.
That way you can maintain the history, and you can later delete the older/expired period edges if you want.
To avoid duplicates, make sure an edge is created between the nodes (user and house) only if the period slot is free.
You can add logic to the query while creating the edge between nodes.
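As a rough sketch of that check, here is the overlap logic in plain Python rather than Cypher. The function names and the tuple layout `(user, house, start, end)` are hypothetical, and I am assuming a slot counts as taken when any existing rent on the same house overlaps the new period.

```python
from datetime import date

def overlaps(start_a, end_a, start_b, end_b):
    """Two closed date ranges overlap when each starts on or before the other ends."""
    return start_a <= end_b and start_b <= end_a

def add_rent(rents, user, house, start, end):
    """Append a rent edge only if the house is free for the whole period."""
    if start > end:
        raise ValueError("start must not be after end")
    for _, other_house, other_start, other_end in rents:
        if other_house == house and overlaps(start, end, other_start, other_end):
            return False  # slot taken, no edge is created
    rents.append((user, house, start, end))
    return True

rents = []
first = add_rent(rents, "X", "H2", date(2020, 1, 1), date(2020, 3, 31))
clash = add_rent(rents, "Y", "H2", date(2020, 3, 15), date(2020, 6, 30))
again = add_rent(rents, "X", "H2", date(2020, 7, 1), date(2020, 9, 30))
print(first, clash, again, len(rents))
```

The second call is rejected because it overlaps the first period, while the same user renting the same house again for a later period is accepted as a separate edge.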
With the given demo data set, here is the sample graph I have created based on the scenario you have described.
http://console.neo4j.org/?id=bxu3sp
Click on the console link above and run the Cypher query below in the query window at the bottom:
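MATCH (user:User)-[rent:RENT]->(house:House)
WITH house, rent
ORDER BY rent.startDate
WITH house, collect(rent) AS rents
// Pair each rental with the next one; stop one short of the end so index+1 stays in range
UNWIND range(0, size(rents)-2) AS index
WITH house, rents, index
WHERE duration.inDays(date(rents[index].endDate), date(rents[index+1].startDate)).days > 30
// DISTINCT avoids returning a house once per qualifying gap
RETURN DISTINCT house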
This returns the houses that had no tenant for more than the given number of days (30 in this query) between two consecutive rentals.
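The same gap logic can be sketched outside the database in plain Python, which may help when checking the query's results. The function name `houses_with_gap` and the sample dates are made up for illustration.

```python
from collections import defaultdict
from datetime import date

def houses_with_gap(rents, min_gap_days):
    """Return houses with more than min_gap_days between consecutive rentals."""
    by_house = defaultdict(list)
    for house, start, end in rents:
        by_house[house].append((start, end))
    result = []
    for house, periods in by_house.items():
        periods.sort()  # order rentals by start date
        # Compare the end of each rental with the start of the next one.
        for (_, prev_end), (next_start, _) in zip(periods, periods[1:]):
            if (next_start - prev_end).days > min_gap_days:
                result.append(house)
                break
    return sorted(result)

# Hypothetical sample data: (house, start, end)
rents = [
    ("H2", date(2020, 1, 1), date(2020, 3, 31)),
    ("H2", date(2020, 8, 1), date(2020, 12, 31)),
    ("H3", date(2020, 1, 1), date(2020, 6, 30)),
    ("H3", date(2020, 7, 15), date(2020, 12, 31)),
]
print(houses_with_gap(rents, 90))
```

A house with a single rental has no consecutive pair and is never reported, which matches the "gap between consecutive end and next start dates" wording in the question.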
I'm not an expert and have never used anything other than Neo4j. In my experience so far, its documentation is really good, and it is powerful thanks to additions like Kafka integrations, GraphQL, Halin monitoring, APOC, etc.
I'd say there is a learning curve; just explore and play around with it to get yourself into the graph DB world.
Update:
If the same user rents the same house for different periods, the graph would look something like the one below. As said, you should avoid duplicates by not allowing edges with the same or overlapping periods between any user node and any house node. In this graph, I have created edges with different, non-overlapping start/end dates, so they are valid and not duplicates.

"Tinder Data Model": how to select candidates?

I have been thinking about how Tinder might have set up their data model, especially the part that selects the candidates to be shown (I'm not talking about the algorithm that determines the order, only about how to get all possible candidates in the first place). This process should only display other profiles that the current user has not already voted on.
So I could imagine this:
A table for the users (more than 40 million entries) and another one for the swipes (more than 1.5 billion new entries each day).
When selecting the candidates, one could join the two tables (+ obviously apply certain other selection criteria like the location, age range etc) and only return the users that the current user has not yet swiped for.
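To make the idea concrete, the join described above could be sketched like this with Python's built-in sqlite3 module. The table and column names (`users`, `swipes`, `swiper_id`, `target_id`) and the sample rows are my own assumptions, not anything known about Tinder's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, age INTEGER);
    CREATE TABLE swipes (swiper_id INTEGER, target_id INTEGER,
                         PRIMARY KEY (swiper_id, target_id));
    INSERT INTO users VALUES (1, 30), (2, 28), (3, 41), (4, 25);
    INSERT INTO swipes VALUES (1, 2);
""")

def candidates(conn, user_id, min_age, max_age):
    """Anti-join: users in the age range that user_id has not swiped on yet."""
    rows = conn.execute(
        """
        SELECT u.id FROM users AS u
        LEFT JOIN swipes AS s
               ON s.swiper_id = ? AND s.target_id = u.id
        WHERE s.target_id IS NULL      -- no swipe row exists for this pair
          AND u.id <> ?                -- never suggest the user to themselves
          AND u.age BETWEEN ? AND ?
        ORDER BY u.id
        """,
        (user_id, user_id, min_age, max_age),
    ).fetchall()
    return [row[0] for row in rows]

print(candidates(conn, 1, 20, 35))
```

The composite primary key on `(swiper_id, target_id)` means the lookup for one user's swipes does not have to scan the whole swipes table.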
But: does that scale? Both of those tables are rather huge, so I guess at some point you would run into problems, right?
Furthermore, I read that Tinder is using AWS DynamoDB - so not a relational model. And this makes it even harder I guess...
So my question is: do you have an idea on how Tinder accomplished this?

Graph database modeling: multiple edges are better than single edges with properties?

This is for a project that will map metadata. There are many more nodes but this particular one became a debate in the team.
Which model would yield the best query performance? Or does it not matter?
Option 1
Permission metadata is explicit as edges between nodes.
Option 2
Permission metadata is inside the properties of the edge.
Option 3
???
Let me comment for ArangoDB here, being one of its developers.
There is a third possibility, namely to have a single vertex collection and multiple edge collections for the different access methods. You would then "officially" have 3 graphs that share the same vertex set.
I would expect that this is better in performance, because each access type would only have to deal with a single type of edge and access would be fast.
Obviously it all depends on your queries. My statement holds for queries like "what are all the Entities a Person can update?" or "who can select this Entity?".
I could imagine that your standard query is more "Can this person delete that Entity?" or "Which access rights does this person have for that Entity?".
These two questions are probably not efficient with any of the approaches suggested, because as far as I see, all of them would then require a search, either in the outgoing edges of the Person or in the incoming edges of the Entity.
What would be needed here is a kind of "vertex-centric index", that is, an index that can be used for the set of outgoing or incoming edges of a given vertex. If, for example, you used your option 2 (or indeed 1, this does not matter so much) and had a sorted index on all edges, sorted first by Person and then by Entity, then finding the (probably singleton) set of edges from a given Person to a given Entity would be a lookup with time complexity O(log(#edges)).
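To show what such a sorted edge index buys you, here is a minimal sketch in plain Python using the standard bisect module. It is only an illustration of the lookup, not how ArangoDB implements it, and the function names and sample edges are made up.

```python
from bisect import bisect_left

def build_edge_index(edges):
    """Sort (person, entity, access) edges so edges for one pair are adjacent."""
    return sorted(edges)

def access_rights(index, person, entity):
    """Binary-search to the first edge for (person, entity), then read the short run."""
    # A 2-tuple sorts before every 3-tuple that shares its prefix,
    # so bisect_left lands on the first matching edge in O(log(#edges)).
    pos = bisect_left(index, (person, entity))
    rights = []
    while pos < len(index) and index[pos][:2] == (person, entity):
        rights.append(index[pos][2])
        pos += 1
    return rights

# Hypothetical sample edges.
edges = [
    ("bob", "doc1", "select"),
    ("alice", "doc2", "select"),
    ("alice", "doc1", "update"),
    ("alice", "doc1", "select"),
]
index = build_edge_index(edges)
print(access_rights(index, "alice", "doc1"))
```

Both "Can this person delete that Entity?" and "Which access rights does this person have for that Entity?" reduce to this one lookup plus a scan over the few matching edges.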
We at ArangoDB are currently busy adding this feature, which will appear in one of the next two releases.
I can only speak for Neo4j here:
I don't know that it would matter much, but definitely benchmark! Both relationships and properties are stored as linked lists, so it will still need to traverse them. But if you have more relationships between Person and Entity nodes then putting them in properties starts to become more attractive.
I recommend checking out the free O'Reilly book Graph Databases to learn more about the internals of Neo4j. But benchmarks will always be the gold standard.

Second Order Relationship in Graph Database

I'm creating an app which is quite relationship heavy. One of the features of the site is a recommendation feature, where users can rate things for others. For this, it seems like a graph DB would be ideal, so I am planning on using Neo4j, alongside Ruby.
This all seems fairly straightforward; however, I would like to include a feature where users can rate a specific relationship. For example, a user could recommend a hotdog in a specific restaurant, etc. The only ways I can think of doing this with a graph DB are either to add a 'joining node' between the two nodes, connecting all three, or to add lists of properties to the relationship (i.e. adding hotdog_5 to the user-restaurant relationship). Obviously the rating could just be added to the hotdog-restaurant relationship, but then you wouldn't be able to trace the users that rated it, to prevent them from rating more than once.
Any thoughts on the problem would be appreciated.
You may want to retrieve all the comments from a user, or the comments about hotdogs in all restaurants, or all the comments about all types of food in a restaurant, so I would recommend doing it like this:
1. user-[:write]->comment
2. comment-[:about]->hotdog
3. comment-[:concern]->restaurant
4. restaurant-[:serve]->hotdog
I am not sure about the last one; it may be redundant given 2 and 3. It depends a lot on the queries you'll run.
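As a rough sketch of the comment-as-joining-node idea, here it is in plain Python rather than Neo4j. The class `RatingGraph` and its method names are hypothetical; the dictionary keys mirror the edge names from the list above.

```python
class RatingGraph:
    """Each comment is a joining node linked to a user, a food and a restaurant."""

    def __init__(self):
        self.comments = []
        self._seen = set()  # (user, food, restaurant) triples already rated

    def add_comment(self, user, food, restaurant, rating):
        """Create the comment node unless this user already rated this pairing."""
        key = (user, food, restaurant)
        if key in self._seen:
            return False
        self._seen.add(key)
        self.comments.append(
            {"write": user, "about": food, "concern": restaurant, "rating": rating}
        )
        return True

    def ratings_for(self, food, restaurant):
        """All ratings of one food in one restaurant, across users."""
        return [
            c["rating"]
            for c in self.comments
            if c["about"] == food and c["concern"] == restaurant
        ]

graph = RatingGraph()
ok = graph.add_comment("ann", "hotdog", "Joe's", 5)
dup = graph.add_comment("ann", "hotdog", "Joe's", 1)
other = graph.add_comment("bob", "hotdog", "Joe's", 3)
print(ok, dup, other, graph.ratings_for("hotdog", "Joe's"))
```

Because the comment carries the link back to its author, the "one rating per user" rule can be enforced, which is what the property-on-relationship approach in the question could not do.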

Django: efficient database search

I need an efficient way to search through my models to find a specific User, here's a list,
User - list of users, their names, etc.
Events - table of events for all users, on when they're not available
Skills - many-to-many relationship with the User, a User could have a lot of skills
Contracts - many-to-one with User, a User could work on multiple contracts, each with a rating (if completed)
... etc.
So I got a lot of tables linked to the User table. I need to search for a set of users fitting certain criteria; for example, he's available from next Thurs through Fri, has x/y/z skills, and has received an average 4 rating on all his completed contracts.
Is there some way to do this search efficiently while minimizing the # of times I hit the database? Sorry if this is a very newb question.
Thanks!
I am not sure this method will solve your issue for all 4 cases, but it should at least help with the first one: querying user data efficiently.
I usually find the values or values_list query functions faster, because they slim down the SELECT part of the actual SQL and therefore return results faster. Django docs regarding this.
It is also worth mentioning that, starting with the new dev version, values and values_list can query across any type of relationship, including many_to_one.
And finally, you might find in_bulk useful too. For a complex query, you might try querying the ids of some models first using values or values_list, and then use in_bulk to get the model instances faster. Django docs about that.
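To show the shape of that two-step pattern without needing a Django project, here is a sketch using Python's built-in sqlite3 module instead of the Django ORM. The first function plays the role of values_list (ids only) and the second plays the role of in_bulk (one query, results keyed by id). The table names, function names and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_user (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE contract (user_id INTEGER, rating INTEGER);
    INSERT INTO app_user VALUES (1, 'ann'), (2, 'bo'), (3, 'cy');
    INSERT INTO contract VALUES (1, 5), (1, 4), (2, 3), (3, 4);
""")

def matching_user_ids(conn, min_avg_rating):
    """Step 1: select only the ids of users whose average rating qualifies."""
    rows = conn.execute(
        "SELECT user_id FROM contract GROUP BY user_id "
        "HAVING AVG(rating) >= ? ORDER BY user_id",
        (min_avg_rating,),
    ).fetchall()
    return [row[0] for row in rows]

def users_in_bulk(conn, ids):
    """Step 2: fetch all matching rows in a single query, keyed by id."""
    if not ids:
        return {}
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, name FROM app_user WHERE id IN ({placeholders})", ids
    ).fetchall()
    return {row[0]: row[1] for row in rows}

ids = matching_user_ids(conn, 4)
print(users_in_bulk(conn, ids))
```

However many users match, this hits the database exactly twice, which is the point of combining a slim id query with a bulk fetch.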

Resources