Data organisation in graph database - database

This is more of a logical question rather than technical one. I am asking for data organisation guidance for my requirements. Please keep in mind that I am willing to use a graph database for this purpose (though I am pretty new at that). So guidance in graph database context would be much appreciated.
Let me provide an overview of the scenario. There are two entities in the app, User and House. User can owns a house or rents a house. If an user rents a house, there should be time period mentioned for which the user has rented the house. An user may rent same house for different periods.
Demo Dataset:
A (User) -owns-> H1, H2, H3 (House) - one-liner for brevity
X -rents-> H2 (start=DATE1, end=DATE2)
Y -rents-> H2 (start=DATE3, end=DATE4)
X -rents-> H2 (start=DATE5, end=DATE6) - user rents same house again
I am assuming that User and House would be nodes and owns and rents would be edges. Rent period would be properties of rents edges. Please point out if there is any better way.
Questions:
Is this possible in graph database in general to have multiple edges of same type between two nodes? Should I keep just one edge for rent of a specific user to specific house and add periods? Or should I maintain multiple edges for multiple periods?
Is it possible to query for something like: "fetch all the houses that were empty for a period of 3 months"? This should fetch the houses that have a gap of 3 months between consecutive end and next start dates in rents. These houses may not be empty now.
I have checked neo4j, cayley, dgraph etc. Which may be better with this scenario?
Any guidance of how I should keep the data with relationships would be much appreciated. Have a nice day.

I think this may be solved, but I would just add that TerminusDB may be useful to assess as part of your process. The reason that I sat this is that you are:
TerminusDB uses an expressive logical schema language that allows anything that is logically expressible. So you could have multiple edges of same type between two nodes. However, data modeling is something of an art - as your question suggests - so it will depend on your domain. (I always think that 'deal' could be an edge or a node depending on context - you can participate in a deal or you could strike a deal with another party).
As TerminusDB is a native revision control database, time bound queries can be relatively straightforward. You can get a delta, or a series of deltas, between two events.

There could be a better answer than this, still posting my experience with graph on the given requirement if this of any help to you.
I think it is the best fit for the graph DB for your requirement and to answer your questions.
It is more of designing your graph model to suit the purpose and I think you can have multiple rent edges with different periods from node user to node house.
Which way you can maintain the history and you can later delete the older/expired period edges if you want.
[Just to avoid duplicates] Assume here you need to make sure the edge would be created between nodes (user & house) only if the period slot is free.
You can add logic to the query while creating the edge between nodes.
With the given demo data set, here is the sample graph I have created based on the scenario you have described.
http://console.neo4j.org/?id=bxu3sp
Click on the above console link and you can run the below cypher query in the query window at the bottom
MATCH (user:User)-[rent:RENT]->(house:House)
WITH house, rent
ORDER BY rent.startDate
WITH collect(rent) as rents, house
UNWIND range(0, size(rents)-1) as index
WITH rents, index, house
WHERE duration.inDays(date(rents[index].endDate), date(rents[index+1].startDate)).days > 30
RETURN house
This would get the list of houses that was with no allocation for a given period range.
I'm not an expert and never used other than neo4j and so far with my experience on neo4j, documentation is really good and it is powerful with additions like Kafka integrations, GraphQL, Halin monitoring, APOC, etc.
I'd say it is a learning curve, just explore and play around with it to get yourself into the graph DB world.
Update:
In case of the same user renting the same house for different periods then the graph would look something like below as said you should avoid creating duplicates by not allowing edges for the same/overlapping window period between any user node and any house node. here in this graph, I have created edges for the different and non-overlapping start/end date so which is valid and not a duplicate.

Related

Using a graph database to store and retrieve sorted users by personality scores

I am interested in storing a set of users that have personality scores.
I would like to get them to be more connected (closer?) to each other based on formulas that are applied to their scores. The more similar the users are, the more connected or closer to each other they are (like in a cluster). The closest nodes are to one-another, the more similar they are.
I currently do this over multiple steps (some in SQL and other in code) from a relational database.
Most posts out there and documentation seems to focus on how to get started and what the advantages are at a high level compared to relational databases.
I am wondering if Graph databases are better suited for this and would do most of the heavy lifting out of the box or more natively. Any details are greatly appreciated.
You could consider modeling it like this:
Where a vertex type/label named Score_range was introduced, together with the label User(with property score).
User vertices are connected to Score_range vertex like User with score: 101 is connected to Score_range(vertexID=100) which stands for [100, 110).
Thus, those vertices with closer score are more connected/clusterred in this graph, and in your applicaiton, you need to make connection changes when the score are recaculated/changed to the graph database.
Then, either to run cluster algorithm(i.e. Louvain) on the whole graph or graph query to find path between any two user nodes(i.e. FIND PATH in Nebula Graph, an opensource distributed graph database speaks opencypher), the closeness will be reflected.
But, I think due to this connection/closness is actually numerical/sortable, simply handling this closeness relationship may not need a graph database from the context you already provided.
PS. I drew a picture of a graph in the above schema:

"Tinder Data Model": how to select candidates?

I have been thinking about of how Tinder might have setup their data model - especially the part to select the candidates to be shown (I'm not talking about the algorithm that determines the order, but only how to get all possible candidates in the first place). This process should only display other profile, that the current user did not already vote on.
So I could imagine this:
A table for the Users (>40mio entries), and another one for the swipes (>1,5 billion new entries each day).
When selecting the candidates, one could join the two tables (+ obviously apply certain other selection criteria like the location, age range etc) and only return the users that the current user has not yet swiped for.
But: does that scale? Both of those tables are rather huge - so I guess at some point you would run into problem, right?
Furthermore, I read that Tinder is using AWS DynamoDB - so not a relational model. And this makes it even harder I guess...
So my question is: do you have an idea on how Tinder accomplished this?

Graph database modeling: multiple edges are better than single edges with properties?

This is for a project that will map metadata. There are many more nodes but this particular one became a debate in the team.
Which model would yield the best query performance? Or it does not matter?
Option 1
Permission metadata is explicit as edges between nodes.
Option 2
Permission metadata is inside the properties of the edge.
Option 3
???
Let me comment for ArangoDB here, being one of its developers.
There is a third possibility, namely to have a single vertex collections and multiple edge collections for the different access methods. You would then "officially" have 3 graphs that share the same vertex set.
I would expect that this is better in performance, because each access type would only have to deal with a single type of edge and access would be fast.
Obviously it all depends on your queries. My statement holds for queries like "what are all the Entities a Person can update?" or "who can select this Entity?".
I could imagine that your standard query is more "Can this person delete that Entity?" or "Which access rights does this person have for that Entity?".
These two questions are probably not efficient with any of the approaches suggested, because as far as I see, all of them would then require a search, either in the outgoing edges of the Person or in the incoming edges of the Entity.
What would be needed here are a kind of "vertex centric indices", that is an index that can be used for the set of outgoing or incoming edges of a given vertex. If you, for example would use your option 2 (or indeed 1, this does not matter so much), and have a sorted index on all edges that is sorted first by Person and then by Entity. Then it is a lookup with time complexity O(log(#edges)) to find the (probably singleton) set of edges from a given Person to a given Entity.
We at ArangoDB are currently busy to add this feature, which will appear in one of the next two releases.
I can only speak for Neo4j here:
I don't know that it would matter much, but definitely benchmark! Both relationships and properties are stored as linked lists, so it will still need to traverse them. But if you have more relationships between Person and Entity nodes then putting them in properties starts to become more attractive.
I recommend checking out the free O'Reilly book Graph Databases to learn more about the internals of Neo4j. But benchmarks will always be the gold standard.

Graph database query

I have undertaken a project that tracks shops from where a user can buy an item. I have decided to use Neo4j as the database.
This is a part of the database that I have in mind: There is a node for each shop, and each of these nodes has child-nodes that store items available in that store.
Consider this scenario: Now a particular user (who always goes to one particular shop to buy all his items) wants to know alternative shops from where he can get all (or maximum) number of the items he wants to purchase.
The problem is that an exhaustive search of all the shops and all their items, followed by their matching will take a lot of space/time. Is there any procedure/algorithm that could help me solve this problem with minimum space/time cost?
P.S.: Please note that I would like to stick with Neo4j only, because it solves many of the other database problems very efficiently.
your use case is actually perfect for a graph db. Could I recommend you implement your items as primary nodes and connect them to your stores?
Index your store nodes using Indexing Service. That will give you quick lookup for store and then any particular item is one traversal away. Getting all other stores for an item will also just be edge traversals at that point.
Hope this helps.

Serializing data into a single text field - denormalization gone too far?

I'd love some opinions on whether this database design I'm currently pursuing is sound or not.
Lets assume I'm building a table called "Home", this table has a text field called "rooms". In this field is the serialized data for a set of rooms that this house has. My first instinct was to, of course, normalize this data into a separate "Rooms" table. However, due to some frustrating experiences with overly normalized databases in the past, I stopped to ask myself a few questions:
Will I ever need to find a specific room?
Will I ever need to update an individual room?
Will any Home records ever share Room records?
The answer to each of these questions is "no". Room records are all unique to each Home. Queries will never need to be performed to find out how many Homes in the database have bathrooms, for instance. Data will always be pulled from the perspective of the Home. The number of bedrooms and bathrooms will be explicitly stored on the Home record for searching.
So instead of having to constantly join Rooms, I wondered what would be the harm in serializing this data and just popping it into a text field.
This makes a lot of sense to me, but I'm hoping for a sanity check. Thanks for any input!
A pragmatic answer...
a) probability that you might want to decompose it in the future
b) benefit of not doing so now
c) cost of changing the schema later on.
If a * c > b then you should decompose now.
Well, you might not have a need TODAY to query to find out things like:
What is the average number of bathrooms in a home in Ohio?
Where do homes have more bedrooms? The East Coast or the West Coast?
How does house price correlate with the size of the master bedroom? What would be the average dollar value return of increasing the master bedroom size by 30%?
etc, etc.
You will be in a much better position in the future if you design your foundation correctly to begin with... no matter how enticing the short-cut may seem right now.
Plus, with a separate ROOMS table, you will be able to add additional room fields that make sense later (like width/height, color, floor level, etc.) which would all be very hard if the data were just globbed into a single field.
People will want to query in unexpected ways, like:
I have bad knees. Can you list houses with the master bedroom and master bathroom on the first floor?
In general, having a ROOMS table will just make your application more powerful, and easier to use.
Hey, I get what you're saying about "overly normalized data". We've all been there, and it DOES bite. However, having a ROOMS table in a database with housing info isn't being "overly normalized". It's just building the app the right way.
In addition to what others have said about doing the right thing, I would like to add a comment about performance.
Since you will be storing the serialized room data as a column in table Home, the row size will increase significantly. This will result in worse performance for all other queries.
Well, you say that room records are unique, but you can't enforce that. So you have no way to know this for sure in your current design: all your code should be perfect in representing this.
"constantly joining" isn't that hard to do, but if it is, you can always make a View for that, and you're done.

Resources