I am designing a database model that is based on a risk analysis table which is similar to the following (apologies for the example quality):
Basically, the user should be able to create such tables to do risk analysis through the UI. From the example above I hope it becomes clear that there can be multiple main steps, which can have multiple sub steps, which in turn can have multiple risks, which can have multiple measures.
A quick draft of a model that would do the job is:
RiskAnalysis:
id: number
name: string
MainStep:
id: number
description: string
risk_analysis: fk to RiskAnalysis
SubStep:
id: number
description: string
main_step: fk to MainStep
Risk:
id: number
description: string
sub_step: fk to SubStep
Measure:
id: number
description: string
risk: fk to Risk
However, my concern is that this design may be terrible for performance and querying.
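For reference, just reading one analysis back in its flat (table) form under this design already takes a chain of joins. A rough sketch, using the names from the draft above (the 1 stands in for the id of the analysis being displayed):

SELECT ms.description AS main_step,
       ss.description AS sub_step,
       r.description  AS risk,
       m.description  AS measure
FROM MainStep ms
LEFT JOIN SubStep ss ON ss.main_step = ms.id   -- LEFT JOINs keep steps that
LEFT JOIN Risk r     ON r.sub_step = ss.id     -- have no risks or measures yet
LEFT JOIN Measure m  ON m.risk = r.id
WHERE ms.risk_analysis = 1;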
An alternative model would be:
RiskAnalysis:
id: number
name: string
RiskAnalysisEntry:
id: number
main_step: string
sub_step: string
risk: string
measure: string
risk_analysis: fk to RiskAnalysis
While this reduces complexity at the database level, data is duplicated, potentially resulting in more complexity in the application layer.
A minor improvement on this design may be the following:
RiskAnalysis:
id: number
name: string
RiskAnalysisEntry:
id: number
name: string
risk_analysis: fk to RiskAnalysis
main_step: fk to MainStep
sub_step: fk to SubStep
risk: fk to Risk
measure: fk to Measure
MainStep:
id: number
description: string
SubStep:
id: number
description: string
Risk:
id: number
description: string
Measure:
id: number
description: string
The improvement here would be that we only "duplicate" references (foreign keys) instead of values.
Unfortunately, I am unable to figure out which solution best fits my needs. Therefore my question is:
Would you advise against using the deeply nested model presented as the first option? And if so, what would be the most elegant way to deal with deeply nested data? Does one of my other designs qualify, or are there better options to consider?
This is a common design pattern. The usual example is a Bill of Materials.
In your case, you can just have one table.
Step
----
Step ID
Step Name
Parent Step ID
The risk analysis would have a null Parent Step ID.
The main step would have a Parent Step ID pointing to the risk analysis.
And so on, for as many levels deep as you need.
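A minimal sketch of that table and of pulling back a whole analysis in one query, assuming PostgreSQL-style recursive queries (names are illustrative):

CREATE TABLE step (
    step_id        serial PRIMARY KEY,
    step_name      text NOT NULL,
    parent_step_id integer REFERENCES step (step_id)  -- NULL for the risk analysis itself
);

-- fetch an entire analysis: the root row plus all of its descendants
WITH RECURSIVE tree AS (
    SELECT step_id, step_name, parent_step_id, 0 AS depth
    FROM step
    WHERE step_id = 1              -- the root (risk analysis) row
    UNION ALL
    SELECT s.step_id, s.step_name, s.parent_step_id, t.depth + 1
    FROM step s
    JOIN tree t ON s.parent_step_id = t.step_id
)
SELECT * FROM tree ORDER BY depth;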
Using a classic example: suppose we have an application with a Courses collection and a Students collection, where each student can participate in many courses and each course can have many participants.
We need to efficiently query all the courses that one student participates in, but we also need to query the students participating in a single course.
I know that using a relational database to handle this would be the optimal solution, but for now I just want to use one type of database, which is MongoDB. So I want to ask: could this schema design work efficiently? What are the pros and cons of using it? And which design could be better?
User: {
_id,
//...properties
}
Course: {
_id,
//...properties
}
CourseParticipate: {
_id,
userId,
courseId,
//...properties
}
CourseAdmin: {
_id,
userId,
courseId,
//...properties
}
Now, I like this design because, if in the future I get the ability to work with multiple databases, it will be easy to transfer these collections to a relational DB (or not?). I also like it because it is fast to write the data and to remove the relations between the objects, but as far as I can see it will make the read queries a little bit slower (or a lot?).
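For what it's worth, this is roughly how I picture those collections as relational tables (just a sketch, the column names are my guesses):

-- User and Course become plain tables
CREATE TABLE users   (id bigint PRIMARY KEY /* ...properties */);
CREATE TABLE courses (id bigint PRIMARY KEY /* ...properties */);

-- CourseParticipate and CourseAdmin become junction tables
CREATE TABLE course_participants (
    user_id   bigint REFERENCES users (id),
    course_id bigint REFERENCES courses (id),
    PRIMARY KEY (user_id, course_id)
);
CREATE TABLE course_admins (
    user_id   bigint REFERENCES users (id),
    course_id bigint REFERENCES courses (id),
    PRIMARY KEY (user_id, course_id)
);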
Because I have never seen this design before on the internet, I already know that there are better solutions (I hope I don't get hurtful comments and answers because I'm new).
I would also like to hear from you whether Neo4j can handle this problem or not, and which relational DBs work best alongside MongoDB.
Links to documentation and articles will be very helpful!
Thanks!
This is a case of data with a many-to-many relationship. I would think there are a few thousand students and a few hundred courses in your database.
To start with, I would use the following design, with the course details embedded in each student document as an array of sub-documents called courses.
- students collection
id:
name:
courses: [ { id: 1, name: }, { id: 5, name: }, ... ]
- courses collection
id:
name:
description:
Note, the course id and name are stored in both collections. This is duplication of data. This should be okay, as the duplicated details do not change often (or may not change at all).
Query all courses a student is enrolled into, for example: db.students.find( { name: "John" } ). This will return one student document with the matching name and all the courses (the array field). See db.collection.find.
Query all students enrolled in a particular course: db.students.find( { "courses.name": "Java Programming" } ). This will return all the student documents that have a course name matching the criterion "Java Programming". See Query an Array of Embedded Documents.
Further, you can use projection to exclude and include fields from the result.
NOTES:
You can embed student info within the courses collection, instead of embedding the courses into the students. The queries will be similar to the ones above, but you will be querying the courses collection. It depends upon your use case.
You can store just the course id field in the courses array of the students collection; this suits the case where the course name changes often. The queries will then use the aggregation $lookup (a "join" operation) to get the course details from the courses collection.
See the MongoDB documentation on Data Model Design for more information on modeling document-based data.
I'm trying to use AWS Rekognition to record "Attendance" and insert a record (date & time) in a DynamoDB table for every day that a kid shows up to school. The problem is that I'm not familiar with NoSQL, and I'm wondering what's the best possible way to create the table with this in mind.
Some of the attributes that I'm using are:
Enrollment No. (Partition Key)
First Name
Last Name
Since the date/attendance is going to be a "dynamic attribute" (whether or not the kid goes to school), I'm not sure if I should:
Create a new table for every day/week/month, with only the enrollment No. as an attribute, and have a Lambda trigger put in a timestamp when the kid is spotted, meaning the kid attended class (this will create a lot... a lot of tables, defeating the purpose of DynamoDB, I believe)
In the same table, insert the attendance as an attribute of list type (an array to which a timestamp is added every day the kid is spotted). Would this option make the item/table in DynamoDB weigh more than it should, causing it to slow down?
Any ideas on a possible way to approach this? Is there another way that's more cost- and memory-optimized?
I'm not going into the triggers, Lambda functions, or AWS Rekognition setup needed for this to work, since that's out of the scope of this post.
The simplest solution is to have one table for all attendance records using enrollmentID as the partition key and day (ISO 8601 date string, like “2019-09-27”) as the range key.
This makes it simple to add an attendance record: just insert an enrollmentID and date pair into your table. It's also simple to query when a student attended, using a variety of key condition expressions.
All attendance for student 123: enrollmentID = 123
All attendance for student 123 in a given year: enrollmentID = 123 and begins_with(day, "2019")
All attendance for student 123 in a given month: enrollmentID = 123 and begins_with(day, "2019-09")
As a bonus, you can also find all the students who attended on a given day by creating a GSI with day as the partition key.
Any additional data (such as first name, last name, etc.) can go in a separate table if you like, or in the same table with something like "info" as the sort key value instead of a real date.
It is also possible for you to use a list of Booleans to represent a year of attendance. You can use enrollmentID and year as the partition and sort keys.
{
enrollmentID: 123,
year: 2019,
attendance: [ 0, 1, 1, 1, 0, 1... ],
firstName: "John",
lastName: "Smith"
}
This is a more efficient use of storage, but it limits your query options and it’s easier to accidentally ruin your data with an off-by-one error when indexing into the attendance list.
I want to design a PostgreSQL database for my product, which needs to handle an ordered many-to-many relation. There are two solutions for that:
(Normalized DB) Create a middle (junction) table and put the order of the relations in it
(Denormalized DB) Use a denormalized database and save all the data in one table
My data model is like this:
Table 1(Exercise):
id
name
Table 2(Workout):
set of Exercises with order
Every user can create a custom workout (a list of exercises with a defined order). My problem is saving the order of the relations in the database, because a default relation does not preserve order.
As has been said in the comments, "best practice" supposes that there is exactly one best way to do things, and that is not the case. In pretty much all software design solutions, you have to trade things, in messy, unpredictable ways, which are nearly always context-dependent.
If I understand your question, your problem domain has the concept of a "user" who creates zero or more work-outs, and each work-out has one or more exercises, and the sequence in which that exercise occurs is an important attribute of the relationship between workout and exercise.
The normal way to store attributes of a many-to-many relationship is as additional columns on the joining table.
The normalized way of storing this would be something like:
user
-----
user_id
name
...
Workout
-----
workout_id
name
....
Exercise
------
exercise_id
name
....
workout_exercise
----------
workout_id
exercise_id
sequence -- this is how you capture the sequence of exercises within a workout
... -- there may be other attributes, e.g. number of repetitions, minimum duration, recommended rest period
user_workout
--------
user_id
workout_id
.... -- there may be other attributes, e.g. "active", "start date", "rating"
In terms of trade-offs, I'd expect this to scale to hundreds of millions of rows on commodity hardware without significant challenges. The queries can get moderately complex - you'll be joining 4 tables - but that's what databases are designed for. The data model describes (what I understand to be) the problem domain using separate entities etc.
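For example, listing one user's workouts with their exercises in order is a join roughly like this (a sketch using the column names above; 42 stands in for a real user_id):

SELECT w.name AS workout,
       we.sequence,
       e.name AS exercise
FROM user_workout uw
JOIN workout w           ON w.workout_id  = uw.workout_id
JOIN workout_exercise we ON we.workout_id = w.workout_id
JOIN exercise e          ON e.exercise_id = we.exercise_id
WHERE uw.user_id = 42                -- the user whose workouts we want
ORDER BY w.workout_id, we.sequence;  -- exercises come back in their defined order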
I checked some questions here, like Understanding Cassandra Data Model and Column-family concept and data model, and some articles about Cassandra, but I'm still not clear on what its data model is.
Cassandra follows a column-family data model, which is similar to a key-value data model. In a column family you have data in rows and columns, so a two-dimensional structure, and on top of that you have a grouping into column families? I suppose this is organized into column families to be able to partition the database across several nodes?
How are rows and columns grouped into column families? Why do we have column families?
For example, let's say we have a database of messages, stored as rows:
id: 123, message: {author: 'A', recipient: 'X', text: 'asd'}
id: 124, message: {author: 'B', recipient: 'X', text: 'asdf'}
id: 125, message: {author: 'C', recipient: 'Y', text: 'a'}
How and why would we organize this around column-family data model?
NOTE: Please correct or expand on example if necessary.
Kind of the wrong question. Instead of modeling around the data, model around how you're going to query the data. What do you want to read? You create your data model around that, since the storage is strict about how you can access data. Most likely the id is not the key; if you read by author or recipient, you use that as the partition key, with the unique id (use a uuid, not an auto increment) as a clustering column, i.e.:
CREATE TABLE message_by_recipient (
author text,
recipient text,
id timeuuid,
data text,
PRIMARY KEY (recipient, id)
) WITH CLUSTERING ORDER BY (id DESC)
Then to see the five newest emails to "bob"
select * from message_by_recipient where recipient = 'bob' limit 5
Using timeuuid for id will guarantee uniqueness without an auto-increment bottleneck and also provide sorting by time. You may duplicate writes on a new message, writing to multiple tables so that each read is a single lookup. If the data can get large, you may want to replace it with a uuid (type 4) and store the payload in a blob store or distributed file system (e.g. S3) keyed by that uuid. That would reduce the impact on C* and also reduce the cost of the denormalization.
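For example, if reads by author are also needed, the same data could be written to a second table keyed by author (a sketch; in practice the timeuuid is generated client side so both rows share the same id):

CREATE TABLE message_by_author (
    author text,
    recipient text,
    id timeuuid,
    data text,
    PRIMARY KEY (author, id)
) WITH CLUSTERING ORDER BY (id DESC);

-- on every new message, write the same row to both tables,
-- e.g. in a logged batch so the two stay in sync
BEGIN BATCH
    INSERT INTO message_by_recipient (author, recipient, id, data)
        VALUES ('alice', 'bob', 50554d6e-29bb-11e5-b345-feff819cdc9f, 'hi');
    INSERT INTO message_by_author (author, recipient, id, data)
        VALUES ('alice', 'bob', 50554d6e-29bb-11e5-b345-feff819cdc9f, 'hi');
APPLY BATCH;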
I have three tables in my database: Book, Series, Author.
These all have many-to-many relationships between each other.
Book - Series: many-to-many
Book - Author: many-to-many
Author - Series: many-to-many
So, this means I'll have to add three more tables in my database:
1. tableBookSeries, which will have id, idBook, idSeries
2. tableBookAuthor, which will have id, idBook, idAuthor
3. tableAuthorSeries, which will have id, idAuthor, idSeries
This means I'll have 6 tables in my database.
Is there a way to optimize this database, or is it good as it is? Can I reduce the number of tables, and would that do me any good?
In my opinion, there is no direct relation between Author and Series.
An author only writes Books, which may or may not happen to be part of a Series.
In the DB it would be something like this (using 5 tables instead of 6):
Book - Series: many-to-many - Correct
Book - Author: many-to-many - Correct
Author - Series: many-to-many - Incorrect (don't do this)
Extra tables:
tableBookSeries, which will have id, idBook, idSeries - Correct
tableBookAuthor, which will have id, idBook, idAuthor - Correct
tableAuthorSeries, which will have id, idAuthor, idSeries - Incorrect (don't do this)
Avoid a connection between Author and Series. If you need to know which series are related to an author, you can still find out by searching through their books.
Likewise, to know who wrote a Series, you can get the list of Authors via the associated Books.
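For example, both lookups go through the book link tables (a sketch using the table names above; 42 and 7 stand in for real ids):

-- Which series has author 42 contributed to?
SELECT DISTINCT bs.idSeries
FROM tableBookAuthor ba
JOIN tableBookSeries bs ON bs.idBook = ba.idBook
WHERE ba.idAuthor = 42;

-- Who are the authors of series 7?
SELECT DISTINCT ba.idAuthor
FROM tableBookSeries bs
JOIN tableBookAuthor ba ON ba.idBook = bs.idBook
WHERE bs.idSeries = 7;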
Hope I was helpful.