Context
I am designing a data model for a node-based system used to perform tasks. The system includes node, plug and edge objects.
A node is an object which performs an action. You can think of nodes as being like a program or executable. The functionality of the node may be altered via data passed through from other nodes.
Data is passed from one node to another via a connection. A connection between two nodes is called an edge.
Nodes are connected using plugs. Each node has a list of plugs which determine the input and output for the node. You can think of plugs as being like the arguments to a program or executable.
The relationship between nodes and plugs is a one-to-many relationship. So a node can have many plugs but a plug can only have one node. In this case I will store a reference to the node on each plug. Edges are really just an association between two plugs. Below is an example of how I imagine the data is stored:
The node table:
|-------------|-----|-------|
| PRIMARY_KEY | ID | TYPE |
|-------------|-----|-------|
| NODE.1 | 1 | NODE |
|-------------|-----|-------|
| NODE.2 | 2 | NODE |
|-------------|-----|-------|
The plug table:
|-------------|-----|-------|---------|
| PRIMARY_KEY | ID | TYPE | NODE |
|-------------|-----|-------|---------|
| PLUG.1 | 1 | PLUG | NODE.1 |
|-------------|-----|-------|---------|
| PLUG.2 | 2 | PLUG | NODE.2 |
|-------------|-----|-------|---------|
| PLUG.3 | 3 | PLUG | NODE.2 |
|-------------|-----|-------|---------|
The edge table:
|-------------|-----|-------|----------|----------|
| PRIMARY_KEY | ID | TYPE | SRC_PLUG | DST_PLUG |
|-------------|-----|-------|----------|----------|
| EDGE.1 | 1 | EDGE | PLUG.1 | PLUG.2 |
|-------------|-----|-------|----------|----------|
| EDGE.2 | 1 | EDGE | PLUG.1 | PLUG.3 |
|-------------|-----|-------|----------|----------|
Question
Assuming this is not completely wrong, my question is about how I would construct a node object from the data. It seems to me that a node is useless without the plugs which are associated with it. This suggests we must find all the plugs associated with the node at the time we create the node. Where and how is this information usually stored? In other words, how does the process used to create the node know to do the query for associated plugs?
All suggestions are much appreciated.
It sounds like plugs are children of nodes and cannot exist until the node is created, unless the Node property of Plug can be null. In that case you could pass one or more edges to the Node creator, and the node plugs would be the distinct set of destination plugs from them.
Having said that, it seems backwards in your example to create the plugs first, then the edges, then the nodes. I would think the object which performs the action (node) would be created first and dictate the destination plugs it requires. Edges would be defined last and would be more mutable over the lifetime of the application as different connections are created. It feels more natural to define and create a node and its associated plugs together.
I'm not sure I understand the ID column of the Edge table or its relationship to PRIMARY_KEY or ID of another object.
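As a rough illustration of creating a node together with its plugs, here is a minimal Python sketch using the standard sqlite3 module. The table and column names (node, plug, primary_key) and the Node, Plug and load_node names are assumptions made up for the example, loosely following the tables in the question; the point is only that the node loader itself runs the query for plugs via the NODE reference stored on each plug.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Plug:
    key: str
    node_key: str


@dataclass
class Node:
    key: str
    plugs: list = field(default_factory=list)


def make_demo_db():
    # Hypothetical schema mirroring the node and plug tables in the question.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE node (primary_key TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE plug ("
        " primary_key TEXT PRIMARY KEY,"
        " node TEXT NOT NULL REFERENCES node(primary_key))"
    )
    conn.executemany("INSERT INTO node VALUES (?)", [("NODE.1",), ("NODE.2",)])
    conn.executemany(
        "INSERT INTO plug VALUES (?, ?)",
        [("PLUG.1", "NODE.1"), ("PLUG.2", "NODE.2"), ("PLUG.3", "NODE.2")],
    )
    return conn


def load_node(conn, node_key):
    # The loader owns the knowledge that a node needs its plugs: it reads the
    # node row, then every plug row whose node column points back at it.
    row = conn.execute(
        "SELECT primary_key FROM node WHERE primary_key = ?", (node_key,)
    ).fetchone()
    if row is None:
        raise KeyError(node_key)
    plug_rows = conn.execute(
        "SELECT primary_key, node FROM plug WHERE node = ? ORDER BY primary_key",
        (node_key,),
    )
    return Node(key=row[0], plugs=[Plug(k, n) for k, n in plug_rows])
```

Calling load_node(make_demo_db(), "NODE.2") would give a node carrying PLUG.2 and PLUG.3. Edges could be loaded in a later pass, once all nodes and plugs exist.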
Related
I am designing a way to store history of a graph in a graph database. I have the following in mind:
The history of a node, say Vertex_A, is maintained by creating another history node, say History_Vertex_A. Whenever Vertex_A is modified, a new version node, say Vertex_A_Ver_X, is created. This new node stores the old values of the changed data. A new edge is created between the history node and the version node. The following diagram depicts this idea. Is there a better way to store the history of a vertex/node in a graph database?
+------------------+
| Vertex_A (Ver N) |
+---------+--------+
|
+-----------v-----------+
| Edge_Vertex_A_History |
+-----------+-----------+
|
+---------v--------+
| History_Vertex_A |
+---------+--------+
|
+---------------------+----------+----------------+----------------------+
| | | |
+------v-------+ +------v-------+ +--------v-------+ +-------v--------+
| Edge_A_Ver_0 | | Edge_A_Ver_1 | | Edge_A_Ver_N-2 | | Edge_A_Ver_N-1 |
+------+-------+ +------+-------+ +--------+-------+ +-------+--------+
| | | |
+--------v---------+ +--------v---------+ +----------v---------+ +---------v----------+
| Vertex_A (Ver 0) | | Vertex_A (Ver 1) | .... | Vertex_A (Ver N-2) | | Vertex_A (Ver N-1) |
+------------------+ +------------------+ +--------------------+ +--------------------+
Now, say I have the following relation. Vertex_A is connected to Vertex_B via edge Edge_AB.
+----------+ +---------+ +----------+
| Vertex_A +------> Edge_AB +-------> Vertex_B |
+----------+ +---------+ +----------+
I can store the history of vertices as per the above design, but I cannot use that same idea to store the history of edges, edge Edge_AB in this case. This is because it won't be possible to have an edge connecting it to its corresponding history vertex: an edge can only connect two vertices, not an edge and a vertex. So what is the best way to store the history of an edge in a graph database?
Your approach works across different graph databases.
One more approach, which we use with NebulaGraph, is to leverage the rank concept in its edge definition.
In NebulaGraph, one instance of an edge is identified by [src, dst, edge_type, rank], where the rank is an int that represents things like a transaction_id, timestamp, version, or whatever distinguishes multiple edges of the same edge type between two vertices.
Note that the rank field can be omitted, in which case its value is 0, so if you do not use it the mental model is the same as in other graph databases.
With rank, we can easily design the versioning of edges here. But how can we design the versioning of vertices then? Our approach is to introduce an edge whose dst vertex is the vertex itself, and to put the properties that can differ between versions of the vertex on this edge, where the rank is the version.
ref:
https://docs.nebula-graph.io/3.2.1/1.introduction/2.data-model/
https://github.com/vesoft-inc/nebula
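To make the rank idea concrete, here is a small in-memory Python sketch. It is not NebulaGraph's API or query language; VersionedGraph, its methods and the "vertex_version" edge type are hypothetical names for illustration. It only models the rule described above: an edge instance is keyed by (src, dst, edge_type, rank), and vertex versions are stored on a self-loop edge whose rank is the version.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedGraph:
    # Maps (src, dst, edge_type, rank) to the properties of that edge instance.
    edges: dict = field(default_factory=dict)

    def add_edge_version(self, src, dst, edge_type, props, rank=0):
        # rank defaults to 0, matching the "rank omitted" case.
        self.edges[(src, dst, edge_type, rank)] = dict(props)

    def edge_history(self, src, dst, edge_type):
        # All versions of one logical edge, ordered by rank (oldest first).
        versions = [
            (rank, props)
            for (s, d, t, rank), props in self.edges.items()
            if (s, d, t) == (src, dst, edge_type)
        ]
        return sorted(versions, key=lambda item: item[0])

    def add_vertex_version(self, vertex, props, version):
        # A vertex version is a self-loop edge; "vertex_version" is a made-up
        # edge type name used only in this sketch.
        self.add_edge_version(vertex, vertex, "vertex_version", props, rank=version)

    def vertex_history(self, vertex):
        return self.edge_history(vertex, vertex, "vertex_version")


graph = VersionedGraph()
graph.add_edge_version("Vertex_A", "Vertex_B", "Edge_AB", {"weight": 1}, rank=1)
graph.add_edge_version("Vertex_A", "Vertex_B", "Edge_AB", {"weight": 5}, rank=2)
graph.add_vertex_version("Vertex_A", {"name": "old"}, version=0)
graph.add_vertex_version("Vertex_A", {"name": "new"}, version=1)
```

Here the ranks are plain version counters; timestamps would work the same way as long as they are integers.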
Let's say that I have the following SQL table where each value has a reference to the previous one:
ChainedTable
+------------------+--------------------------------------+------------+--------------------------------------+
| SequentialNumber | GUID | CustomData | LastGUID |
+------------------+--------------------------------------+------------+--------------------------------------+
| 1 | 792c9583-12a1-4c95-93a4-3206855d284f | OtherData1 | 0 |
+------------------+--------------------------------------+------------+--------------------------------------+
| 2 | 1022ffd3-afda-4e20-9d45-eec884bc2a50 | OtherData2 | 792c9583-12a1-4c95-93a4-3206855d284f |
+------------------+--------------------------------------+------------+--------------------------------------+
| 3 | 83729ad4-2564-4146-b451-00d82585bd96 | OtherData3 | 1022ffd3-afda-4e20-9d45-eec884bc2a50 |
+------------------+--------------------------------------+------------+--------------------------------------+
| 4 | d7197e87-d7d6-4175-8172-12656043a69d | OtherData4 | 83729ad4-2564-4146-b451-00d82585bd96 |
+------------------+--------------------------------------+------------+--------------------------------------+
| 5 | c1d3d751-ef34-4079-a73c-8952f93d17db | OtherData5 | d7197e87-d7d6-4175-8172-12656043a69d |
+------------------+--------------------------------------+------------+--------------------------------------+
If I were to insert the sixth row, I would retrieve the data of the last row using a query like this:
SELECT TOP 1 SequentialNumber, GUID FROM ChainedTable ORDER BY SequentialNumber DESC;
After that selection and before the insertion of the next row, an operation outside the database will take place.
That would suffice if it were guaranteed that only one entity uses the table at a time. However, if more entities can do this same operation, there is a risk of a race condition: one entity may request the information of the last row and, before it does its insert, a second entity may request the same last row.
At first, I thought of creating a new table with a value that indicates if the table is being used or not (the value can be null or the identifier of the process that has access to the table). In that solution, the entity won't start the request of the last operation if the value indicates that the table is being used by another process. However, one of the things that can happen in this scenario is that the process using the table can die without releasing the table, blocking the whole system.
I'm sure this is a "typical" computer science problem and that there are well known solutions to implement this. Can anyone point me in the right direction, please?
I think using a transaction in SQL may solve the problem. For example, if the read of the last row and the insert of the new row happen inside one transaction that takes a lock on the table, no one else will be able to do the same transaction until the first one is completed.
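A minimal sketch of that idea, using Python's sqlite3 module as a stand-in for the real database (the question's TOP 1 syntax suggests SQL Server, where the locking statements differ). The function names and the outside_operation callback are assumptions for the example. BEGIN IMMEDIATE takes SQLite's write lock before the read, so a second writer has to wait until the first transaction commits or rolls back.

```python
import sqlite3
import uuid


def open_chain_db(path=":memory:"):
    # isolation_level=None puts the sqlite3 module in autocommit mode, so the
    # explicit BEGIN/COMMIT statements below control the transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ChainedTable ("
        " SequentialNumber INTEGER PRIMARY KEY,"
        " GUID TEXT NOT NULL UNIQUE,"
        " CustomData TEXT,"
        " LastGUID TEXT NOT NULL)"
    )
    return conn


def append_row(conn, custom_data, outside_operation=lambda last_guid: None):
    # Take the write lock first, so no other writer can read the same
    # "last row" and insert between our SELECT and our INSERT.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT SequentialNumber, GUID FROM ChainedTable"
            " ORDER BY SequentialNumber DESC LIMIT 1"
        ).fetchone()
        # "0" marks the start of the chain, as in the first row of the example.
        last_number, last_guid = row if row else (0, "0")
        # Placeholder for the work done outside the database; it runs while
        # the lock is held, so it should be short.
        outside_operation(last_guid)
        new_guid = str(uuid.uuid4())
        conn.execute(
            "INSERT INTO ChainedTable"
            " (SequentialNumber, GUID, CustomData, LastGUID)"
            " VALUES (?, ?, ?, ?)",
            (last_number + 1, new_guid, custom_data, last_guid),
        )
        conn.execute("COMMIT")
    except BaseException:
        conn.execute("ROLLBACK")
        raise
    return last_number + 1, new_guid
```

Unlike a hand-made "table in use" flag, the lock belongs to the connection: if the process dies, the database rolls the open transaction back and the lock goes away with it.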
I'm developing an application that uses a MySQL database. For history purposes, we want to store the current state and the history in the same table for performance reasons (on updates the application doesn't have the id of an entity, just a key pair, so it is easier to just insert a new row).
The table looks like this:
+------+-------+-----------+------------------------------+
| id |user_id| type |content |
+------+-------+-----------+------------------------------+
| 1 |'1-2-3'| position | *creation |
| 2 |'1-2-3'| position | *something_changed |
| 3 |'1-2-3'| device | *creation |
| 4 |'1-2-4'| position | *creation |
| 5 |'1-2-4'| device | *creation |
| 6 |'1-2-4'| device | *something_changed |
+------+-------+-----------+------------------------------+
Every entity is identified by the user_id and type "key" pair. When something changes in the entity, a new row is inserted. The current state of an entity is the row with the highest id within the group formed by user_id and type. Performance-wise, the updates should be super fast and the selects can be slower, because those are not used often.
I would like to look up best practices and other people's experiences with this method, but I don't know how to search for them. Can you help me? I'm interested in your experiences or opinions on this topic as well.
I know about Kafka and other streaming platforms, but that was sadly not an option for this.
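For reference, the "current state" lookup described above can be sketched like this. The example uses Python's sqlite3 module rather than MySQL so that it is self-contained, and the table name history is an assumption; the query itself is plain SQL (a join against the MAX(id) per user_id and type group).

```python
import sqlite3

# The rows from the table above.
ROWS = [
    (1, "1-2-3", "position", "*creation"),
    (2, "1-2-3", "position", "*something_changed"),
    (3, "1-2-3", "device", "*creation"),
    (4, "1-2-4", "position", "*creation"),
    (5, "1-2-4", "device", "*creation"),
    (6, "1-2-4", "device", "*something_changed"),
]


def make_history_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE history ("
        " id INTEGER PRIMARY KEY,"
        " user_id TEXT NOT NULL,"
        " type TEXT NOT NULL,"
        " content TEXT)"
    )
    conn.executemany("INSERT INTO history VALUES (?, ?, ?, ?)", ROWS)
    return conn


def current_states(conn):
    # For each (user_id, type) group keep only the row with the highest id.
    return conn.execute(
        "SELECT h.id, h.user_id, h.type, h.content"
        " FROM history AS h"
        " JOIN (SELECT MAX(id) AS id FROM history GROUP BY user_id, type) AS latest"
        "   ON latest.id = h.id"
        " ORDER BY h.id"
    ).fetchall()
```

With the six rows above, this keeps the rows with ids 2, 3, 4 and 6. A composite index on (user_id, type, id) would likely help the grouped lookup as the table grows.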
Hi I hope it’s ok that I write this question here. I’m currently outlining a data structure that will sit in a database where there are movies, and each movie has a lot of descriptors.
I want to be able to search through the entire database and find movie X that has attributes Y and Z and doesn’t have A, B, or C.
What I’m thinking is to store the descriptors/attributes like this:
Movie ID | Attribute | Has_Attribute
1 | Action | 0
1 | Adventure | 1
1 | Comedy | 1
2 | Action | 1
Is this the best way to store all the attributes for a record?
Presumably for every subsequent call, I would search where Action == 0 AND Comedy == 1 ... n == n_has_attribute to begin to narrow down the search.
When designing the table, you do not need to store the attributes that a movie does not have. You just need to record the attributes that a movie has. Hence, your design would look like:
Movie ID | Attribute
1 | Adventure
1 | Comedy
2 | Action
Moreover, if the number of attributes is not too large, you can define each of them as a column in the table holding a binary value:
Movie Id | Adventure | Comedy | Action
1 | 1 | 1 | 0
2 | 0 | 0 | 1
Therefore, to choose the better data structure, you need to describe the problem more precisely in terms of the number of attributes and the number of movies.
In addition, if you need to store the data in a decision tree, the split points of the nodes will be the attributes, which is closer to the second table layout than to the first design.
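As an illustration of searching the first design (one row per attribute a movie has), here is a short Python sketch using sqlite3. The table name movie_attribute, the column names and the find_movies function are assumptions for the example. It relies on each (movie, attribute) pair being stored at most once, which the primary key enforces.

```python
import sqlite3


def make_movie_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE movie_attribute ("
        " movie_id INTEGER NOT NULL,"
        " attribute TEXT NOT NULL,"
        " PRIMARY KEY (movie_id, attribute))"
    )
    conn.executemany(
        "INSERT INTO movie_attribute VALUES (?, ?)",
        [(1, "Adventure"), (1, "Comedy"), (2, "Action")],
    )
    return conn


def find_movies(conn, required, excluded):
    # A movie matches when it has every required attribute and none of the
    # excluded ones. In SQLite, "x IN (...)" evaluates to 0 or 1, so SUM()
    # counts the matching rows per movie.
    required = sorted(set(required))
    excluded = sorted(set(excluded))
    req_marks = ",".join("?" * len(required))
    exc_marks = ",".join("?" * len(excluded))
    sql = (
        "SELECT movie_id FROM movie_attribute"
        " GROUP BY movie_id"
        f" HAVING SUM(attribute IN ({req_marks})) = ?"
        f"    AND SUM(attribute IN ({exc_marks})) = 0"
        " ORDER BY movie_id"
    )
    params = [*required, len(required), *excluded]
    return [row[0] for row in conn.execute(sql, params)]
```

For example, find_movies(conn, ["Adventure", "Comedy"], ["Action"]) looks for movies that are both Adventure and Comedy but not Action. Note that a movie with no attribute rows at all never appears in this table, so it cannot be returned.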
I have got a rather complex relationship between several entities:
TeacherTable
|
TeacherClassLinkTable
|
ClassTable
|
StudentClassLinkTable
|
StudentTable
|
StudentTestResults
|
TestTable
|
TestModuleTable
This works for most things that I need to do with it but it fails when I try to find what modules are taken by a class. I am able to find out what modules have been taken by Students that are part of a class but Students can belong to multiple classes taking different modules in rare cases. So I would not necessarily get an accurate result to finding what modules are taken by a class. I therefore want to insert a new table which would be ClassModuleLinkTable. This would allow me to make that link easily, however it would form a loop in my database structure and I'm not sure whether my database would therefore remain in 3rd normal form.
TeacherTable
|
TeacherClassLinkTable
|
ClassTable----------------------------
| |
StudentClassLinkTable |
| |
StudentTable |
| |
StudentTestResults |
| |
TestTable |
| |
TestModuleTable--------------ClassModuleLinkTable
I don't think that this is a problem, and I don't actually think it's what I would call a loop or circular reference.
A circular reference is where e.g. table A has a non-nullable FK to table B, which has a non-nullable FK to table A (or the circle could be A to B to C to D to A). If both tables are empty you cannot actually add a row to either of them, as both require a reference to a row in the other. I'm not actually sure that this situation is against 3NF, but it's plainly a problem!
Your situation does not have a circular reference and so as far as I'm concerned it's fine.
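One way to check this mechanically is to treat the foreign keys as a dependency graph and ask whether an insertion order exists. The Python sketch below uses the standard graphlib module (Python 3.9+). The foreign-key directions are my assumptions about your schema (link tables and results point at the tables they link, tests point at their module), so adjust them to the real ones.

```python
from graphlib import CycleError, TopologicalSorter

# Each table maps to the tables it has a (non-nullable) foreign key to.
# These directions are assumed, not taken from the actual schema.
SCHEMA_FKS = {
    "TeacherClassLinkTable": {"TeacherTable", "ClassTable"},
    "StudentClassLinkTable": {"StudentTable", "ClassTable"},
    "StudentTestResults": {"StudentTable", "TestTable"},
    "TestTable": {"TestModuleTable"},
    "ClassModuleLinkTable": {"ClassTable", "TestModuleTable"},
}

# The genuinely circular case: A requires a row in B, and B a row in A.
CIRCULAR_FKS = {"A": {"B"}, "B": {"A"}}


def insertion_order(fk_deps):
    # Returns an order in which rows can be inserted so that every referenced
    # table is populated first, or None when the foreign keys form a cycle.
    try:
        return list(TopologicalSorter(fk_deps).static_order())
    except CycleError:
        return None


order = insertion_order(SCHEMA_FKS)
circular = insertion_order(CIRCULAR_FKS)
```

Under those assumptions an order exists for the schema with ClassModuleLinkTable added, while the A/B example has none. The diagram looks like a loop only if the direction of the references is ignored.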