Need help in developing DB logic

This is a mini-project of mine - an airline reservation system - let's call the airline FlyMi. I have a database (not decided which one; a friend of mine wants to go with MongoDB). Anyhoo, this is my requirement:
I have a table which has details of the flight - flight number, schedule, etc. I'm going to use this table to perform various operations - booking, cancellation, modification.
This is where I'm stuck: for the desktop app and the web application, I'm offering an option to select seats. This means I've got to keep track of which seats are booked and which ones are not. And assume I have a UI which shows seats as Red - Booked, Green - Not Booked. And all of this for each and every flight. My question is: what do you think would be the most efficient way to track seat bookings for each flight in this airline?
Current idea: keep a table named Passenger - with details such as name, address, etc. - which keeps track of all passengers, and maintain a passenger ID such that the first 4 characters are the flight ID and the last 2 characters are the seat number they have chosen, with a random number in between (I say random because I think it is immaterial here). So, for any flight, if I have to find the number of un-booked seats, I will have to scan through every passenger, checking whether they have booked and whether they booked on that flight. I think this is really inefficient. Provide me with the most efficient logic to do this.

Don't use "smart keys".
This is a bad idea called "smart keys" or "encoding information in keys".
See this answer which contains this excerpt:
Despite it now being easy to implement a Smart Key, it is hard to recommend that you create one of your own that isn't a natural key, because they tend to eventually run into trouble, whatever their advantages: they make the database harder to refactor, impose an order which is difficult to change and may not be optimal for your queries, require a string comparison if the Smart Key includes non-numeric characters, and are less effective than a composite key in helping range-based aggregations. They also violate the basic relational guideline that every column should store atomic values.
Smart Keys also tend to outgrow their original coding constraints.
(Notice that seat locations are typically identified by smart keys, in that they are a row number plus a count across the row. But seats are also typically visibly, physically and permanently bolted into that formation. Imagine if they were labelled that way and then rearranged.)
Educate yourself about database design.
Just describe your business in the most straightforward terms. That is how relational model databases & DBMSs work.
Find enough fill-in-the-[named-]blanks sentence templates to describe your business situations:
"customer [cid] has name [firstname] [lastname]
AND customer [cid] has a phone number [phonenumber] of type [type] ..."
"customer [cid] can use credit card #[card_no]"
"seat [seatid] is at row [row] and column [column]"
"seat [seatid] is booked"
"seat [seatid] is temporarily committed to an unfinished booking"
...
For each such parameterized sentence template (aka predicate), have a base table whose column names are the names of the blanks/parameters. Each row in a table states the statement (proposition) obtained by filling in the blanks with its column values; each row not in a table states the negation of the statement obtained by filling in the blanks with its column values.
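For example, the seat templates above might become base tables like these (a minimal sketch: the flight parameter, all types, and the referenced flight and passenger tables are assumptions added so the example fits the airline question; row is renamed row_no because ROW is a reserved word in SQL):

-- "seat [seatid] is at row [row_no] and column [col_no]"
CREATE TABLE seat (
    seatid  VARCHAR(3) PRIMARY KEY,
    row_no  INT        NOT NULL,
    col_no  CHAR(1)    NOT NULL,
    UNIQUE (row_no, col_no)
);

-- "seat [seatid] on flight [flightid] is booked by passenger [passengerid]"
CREATE TABLE booking (
    flightid    VARCHAR(6) NOT NULL REFERENCES flight (flightid),
    seatid      VARCHAR(3) NOT NULL REFERENCES seat (seatid),
    passengerid INT        NOT NULL REFERENCES passenger (passengerid),
    PRIMARY KEY (flightid, seatid)  -- a seat on a flight is booked at most once
);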
Then for each table find every functional dependency (FD) that holds. (When a predicate can be expressed in the form "... AND column = F(column1, ...)", we say that the column set {column1, ...} functionally determines column, and that the FD {column1, ...} → column holds.) Then identify every candidate key (CK). (A superkey is a column set that functionally determines every column, ie that is unique, ie where each subrow of values for those columns appears in only one row of a table. A CK is a superkey that doesn't contain a smaller superkey.) Then find every join dependency (JD). (Some predicates can be expressed as "... AND ..." for some number of ANDs & "..."s. There is a JD when the table for each predicate "..." would look like what you get from taking only its columns from the original table.) Note that every FD comes with an associated (binary) JD.
Then normalize your tables to fifth normal form (5NF). This means decomposing (ie replacing a table in which a JD "... AND ..." holds by tables whose predicates are the "..."s) until each JD that holds is implied by the CKs (ie must hold whenever the JDs from the FDs from the CKs hold). (For performance reasons one can also then denormalize, by combining into base tables that aren't in 5NF.)
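As a minimal illustration (the wide table and all names here are assumptions): suppose we had instead recorded bookings in one wide table stating "passenger [passengerid] named [name] booked seat [seatid] on flight [flightid]".

-- Not in 5NF: the FD {passengerid} -> name holds, but the only CK is
-- {flightid, seatid}, so the binary JD from that FD is not implied by the CKs.
CREATE TABLE booking_wide (
    flightid    VARCHAR(6)  NOT NULL,
    seatid      VARCHAR(3)  NOT NULL,
    passengerid INT         NOT NULL,
    name        VARCHAR(80) NOT NULL,  -- depends only on passengerid
    PRIMARY KEY (flightid, seatid)
);
-- Decomposing on that JD replaces booking_wide by
--   booking (flightid, seatid, passengerid)  and  passenger (passengerid, name),
-- in each of which every JD that holds is implied by its CKs.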
See this answer and this one.
Then we query by describing the rows we want. We do this by connecting base table predicates with logical operators (ie AND, OR, NOT, FOR SOME, FOR ALL etc) and function calls to give the predicates for the tables we want and/or by connecting base table names by relation operators (ie JOIN, UNION, MINUS/EXCEPT, PROJECT/SELECT, RENAME/AS) to give the values of the tables we want and/or both (eg RESTRICT/WHERE).
The JOIN of two tables holds the rows that make a true statement from, ie has as predicate, the AND of their predicates; and the UNION the OR, the MINUS/EXCEPT the AND NOT; and that PROJECT/SELECT columns of a table puts FOR SOME all-other-columns before its predicate; and RESTRICT/WHERE puts AND condition after its predicate; and the RENAME/AS of column renames that parameter in its predicate. So a table expression corresponds to a predicate: A table (base table or query result) value contains the rows that make a true statement from its (base table's or query expression's) predicate.
See this answer.
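For instance, with the sketch tables above, the un-booked seats on a flight are the rows satisfying "seat S is at row R and column C AND NOT (seat S on flight F is booked by some passenger)" - again a sketch; :flightid is a placeholder parameter:

SELECT s.seatid, s.row_no, s.col_no
FROM seat s
WHERE NOT EXISTS (
    SELECT 1
    FROM booking b
    WHERE b.seatid = s.seatid
      AND b.flightid = :flightid
);

Wrap it in COUNT(*) for the number of un-booked seats. Either way, nothing scans every passenger, which addresses the inefficiency the question worries about.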
The same goes for constraints, which are true statements that collectively describe the application situations and database states that can arise, given the situations that can arise and the base table predicates.
See this answer.

Related

Am I Properly Normalizing this Data

I am completing normalization exercises from the web to test my abilities to normalize data. This particular problem was found at: https://cs.senecac.on.ca/~dbs201/pages/Normalization_Practice.htm (Exercise 1)
The table this problem is based on is as follows:
The unnormalized table that can be created from this table is:
To comply with First Normal Form, I have to get rid of repeating fields in the table by moving visitdate, procedure_no, and procedure_name to their own respective tables:
This also complies with 2NF and 3NF, which makes me question whether I have performed the process of normalization correctly. Please provide feedback if I did not properly move from UNF to 1NF.
As a first step you could create the following tables (assuming pet_id is unique in the table):
Pets: pet_id, pet_name, pet_type, pet_age, owner
Visits: pet_id, visit_date, procedure
Going further you could split procedure since the description is repeating:
Pets: pet_id, pet_name, pet_type, pet_age, owner
Visits: pet_id, visit_date, procedure_id
Procedures: procedure_id, description
Although there can be multiple procedures on the same visit_date for the same pet_id, I see no reason to split those further: a date could (in theory) be stored in 2 bytes, and splitting that data would create more overhead (plus an extra index).
You would also want to change pet_age to pet_birth_date since the age changes over time.
Since this is the first exercise in your list, the above will probably be more than enough.
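In SQL, the second split might look like this (a minimal sketch; the types and the Visits key are assumptions):

CREATE TABLE Pets (
    pet_id         INT PRIMARY KEY,
    pet_name       VARCHAR(40) NOT NULL,
    pet_type       VARCHAR(20) NOT NULL,
    pet_birth_date DATE        NOT NULL,  -- rather than pet_age, per the note above
    owner          VARCHAR(60) NOT NULL
);

CREATE TABLE Procedures (
    procedure_id INT PRIMARY KEY,
    description  VARCHAR(80) NOT NULL
);

-- The key allows several procedures on the same visit_date for the same pet:
CREATE TABLE Visits (
    pet_id       INT  NOT NULL REFERENCES Pets (pet_id),
    visit_date   DATE NOT NULL,
    procedure_id INT  NOT NULL REFERENCES Procedures (procedure_id),
    PRIMARY KEY (pet_id, visit_date, procedure_id)
);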
Going even further:
An owner can have multiple pets, so another table could be created:
Pet_owners: owner_id, owner_name
and then only use owner_id in the Pets table. In a real system there would be customer_id, name, address, phone, email, etc. - so that should always be in a separate table.
You could even do the same for pet_type and store the id in 1 or 2 bytes, but it all depends on the type of queries you want to do later on the data.
The question is poorly presented. Look at the last two columns. The askers do not mean that each column's type is a set. They mean that pairs of values on the same line make an element of a set. They should have had one column whose values were triplets - date, number & name. That's what they did when they used just one column (the last one) for number & name. Notice that their solution, in the PDF linked to by the page you link to, has a table that has all three of date, number & name.
But how are you supposed to know that the values should be paired? After all, if the date column gave the set of a pet's visit dates, and the procedure column gave the set of procedure numbers & names a pet ever had, then we wouldn't be supposed to take a pair of values on the same line as an element of a set. Unfortunately, you are just supposed to magically guess correctly. (A hint is that the number of dates and of number-name pairs for a pet are always the same.)
The above took the blank areas in the illustration to be there to make room for the vertical display of set-valued attributes; the portrayed table then has 4 rows. But maybe they are there because you are supposed to get a relation from this illustration by interpreting a blank subrow as representing the most recent non-blank subrow. Then the table wouldn't have any set-valued columns; the portrayed table would have 9 rows. It happens that this interpretation disagrees with the linked answer's UNF & 1NF sections.
If they weren't going to explain the table and were just relying on your guesses, it would have been clearer if they had put a visit's procedure date, number & name under one column - just as they put a procedure number & name in one column. But really, they should always tell you how to read the illustration. And really, you should always ask how to read an illustration. If you have any interpretation conventions from a related course/textbook then you should have put them in your question for us to know.
Unfortunately, "UNF" tables are almost always similarly poorly given, without any description of how they are to be interpreted. Also, "1NF" has no standard meaning, and there is no standard notion of "normalizing to 1NF".

Data Modelling: aggregating multiple tables of the same type into a single table

I have a question regarding data modelling. Suppose I have the following three student tables. Source_table1 contains A_ID as primary key and Name as an attribute. Source_table2 has B_ID as primary key and Name & Address as other attributes. Source_table3 has C_ID as primary key and Name, Address and Age as attributes. If we want to create a new table, Student Master, with all the records in it, how can we do that? And if we are creating a cross-reference table, how should we approach that problem?
Integrating data from different sources is complicated. In the end, you want to end up with something like:
student (student_id PK, name, address, source1_id, source2_id, source3_id)
However, there are some issues to resolve to get there.
Identity
How will you identify matching records in the different sources? It looks like your sources use surrogate identifiers, but those have no meaning outside the context of the source databases. What you're looking for is a suitable natural key. The only common denominator among the sources is a student's name, but names are notoriously poor identifiers.
It can be useful to actually test the data rather than assume it will or won't work. For example, a query such as:
SELECT s1.name, COUNT(*) AS amount
FROM student_source_1 s1
INNER JOIN student_source_2 s2 ON s1.name = s2.name
GROUP BY s1.name
HAVING COUNT(*) > 1
repeated for (student_source_2, student_source_3) and (student_source_1, student_source_3) should give you some insight into the size of the problem.
You could match student_source_2 and student_source_3 based on both name and address. That might give better results, or worse if the two sources have different addresses (or spellings thereof) for the same student. That brings us to our second concern:
Inconsistency
Assuming you can resolve the identity problem, you may need to deal with inconsistent data. What if sources 2 and 3 have different addresses for the same student? How do you determine the correct address?
In some cases, it could be sufficient to just map the sources without resolving inconsistencies.
Winging it in the real world
One technique I use on harder cases is to build a mapping table by hand, e.g.
student_map (student_id PK, source1_id, source2_id, source3_id)
Each of the source_id columns should have a unique constraint, and usually all 3 will be nullable. This is a first step toward the student table above.
I would start by inserting all the perfect 1-to-1 matches, then left join each of the sources with the mapping table to get the unmatched records. Having the unmatched source records side-by-side and sorted makes it easy to visually spot likely matches. It's tedious and error-prone work, but sometimes it must be done regardless. For inconsistencies I might choose the most complete/best looking source as base, and fill in the gaps from the other sources. If you can involve teachers or people who are familiar with the actual students, or present them with alternatives to choose from, by all means do so.
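A sketch of that workflow, using the student_map table above (the source column names are assumptions carried over from the question, and ID generation is DBMS-specific - an Oracle-style sequence here):

-- Seed the map with unambiguous 1-to-1 name matches between sources 1 and 2
INSERT INTO student_map (student_id, source1_id, source2_id)
SELECT student_seq.NEXTVAL, s1.a_id, s2.b_id
FROM student_source_1 s1
INNER JOIN student_source_2 s2 ON s1.name = s2.name
WHERE s1.name NOT IN (SELECT name FROM student_source_1 GROUP BY name HAVING COUNT(*) > 1)
  AND s1.name NOT IN (SELECT name FROM student_source_2 GROUP BY name HAVING COUNT(*) > 1);

-- Source-1 records still unmatched, sorted for side-by-side manual review
SELECT s1.*
FROM student_source_1 s1
LEFT JOIN student_map m ON m.source1_id = s1.a_id
WHERE m.source1_id IS NULL
ORDER BY s1.name;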
More data can be extremely useful. If the sources have social security numbers, family information, etc, these can be used to match students. I would use any number of queries to find perfect matches among various pieces of information, and insert those into the mapping table, before doing the side-by-side matching.
You may well find that a source has internal consistency problems due to poor design - e.g. multiple records for the same student. This may require fixing the source data before continuing.
A good understanding of the relational model of data is invaluable for this kind of work, since you'll be identifying candidate keys, following dependencies and encountering anomalies.

Is it a good idea to include flags in Fact table

The transactional fact table of one of the star schemas needs to answer questions like "is the first application the final application?". This is associated with one of the business processes.
Is it a good idea to keep this as part of the fact table, with a column named
IsFirstAppLastFlag?
There are not enough flags to create a separate dimension. Also, this (calculated) flag is essential in report writing. In this context, do we need to keep it in a dimension or in the fact table?
I assume junk dimensions are meant for those flags / low-cardinality columns which are not so useful on their own, so they can be kept inside one dimension?!
This will depend on your own needs, but if you like the purest view of the fact table then the answer is no: these fields should not be included in your fact table.
The fact table should include dimension keys, degenerate dimension keys, and facts.
IsStatusOne, IsStatusTwo, etc. are attributes and, as you rightly suggest, would be well suited to a junk dimension in the absence of a more suitable dimension for them to belong to; e.g. IsWeekDay would be suited to the "Date" dimension table.
You may start off with only a few "Is" attributes in your fact table, but over time you may need more and more of these attributes, and you will look back and possibly wish you had created a junk dimension.
Performance:
Interestingly, if you are using bit columns for your flags then there is little storage difference between having 8 bit flags in your fact table and having one tinyint dimension key. However, when your flags are more verbose or have multiple status values, you should use the junk dimension to improve performance on the fact table: less storage, less memory, more rows in a page, etc.
Personally, I would junk them.
That seems fine, as long as it is an attribute of the fact, not of one of the dimensions. In some cases I think you might have a slowly changing dimension in which it would be more appropriately placed.
I would be concerned that this plan might require updates on the fact table, for example if you were intending to flag that a particular fact was the most recent for a customer. If that was the case it might be better to keep a transaction number in the fact table, and a "most recent transaction number" in the dimension table, and provide an indexing method to effectively retrieve the most recent per-customer.
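A sketch of that alternative (all names here are assumptions): the fact row is written once with its transaction number, the dimension carries the per-customer pointer, and "most recent" is derived at query time:

-- fact_application holds one row per application, never updated;
-- dim_customer.most_recent_txn_no is updated as new applications arrive.
SELECT f.*
FROM fact_application f
INNER JOIN dim_customer c
    ON  c.customer_key = f.customer_key
    AND c.most_recent_txn_no = f.txn_no;  -- only each customer's latest application

An index on fact_application (customer_key, txn_no) makes the per-customer retrieval efficient, and only the small dimension row is touched when a newer application arrives.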
You can use a junk dimension.
Instead of creating several dimensions with few rows, you can create one dimension with all possible combinations of values; then you add just one foreign key to your fact table.
You can populate your junk dimension with a query like the one below.
-- every Y/N combination for three flags
WITH cteFlags AS
(
    SELECT 'N' AS Value
    UNION ALL
    SELECT 'Y'
)
SELECT
    Flag1.Value AS Flag1,
    Flag2.Value AS Flag2,
    Flag3.Value AS Flag3
FROM
    cteFlags Flag1
    CROSS JOIN cteFlags Flag2
    CROSS JOIN cteFlags Flag3
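Three yes/no flags cross-joined this way yield all 2^3 = 8 combinations; in general N flags give 2^N rows, so the junk dimension stays small even as flags accumulate, and the fact table carries just the one surrogate key.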

Best approach to avoid too many columns and complexity in database design

Inventory Items:
Paper Size
-----
A0
A1
A2
etc
Paper Weight
------------
80gsm
150gsm etc
Paper Mode
----------
Colour
B/W
Paper Type
----------
Gloss
Silk
Normal
Tab Dividers and Tab Divider Types
----------------------------------
Binding and Binding Types
-------------------------
Laminate and Laminate Types
---------------------------
All such inventory items need to be stored against an invoice table.
How do you store them in a database using proper RDBMS design?
In my opinion, each list gets a master table, and retrieval is done with JOINs. However, this may be a little bit complex, adding too many tables to the database.
This normalisation causes a bit of a problem when storing all this information against an invoice: it leads to too many columns in the invoice table.
The other way is putting all of them into one table with more columns, where each row is one combination of them (hacking the 4 lists of 4 items into some 24 records which will have a reference ID).
Which one do you think is best, and why?
Your initial idea is correct. And anyone claiming that four tables is "a little bit complex" and/or "too many tables" shouldn't be doing database work. This is what RDBMSs are designed (and tuned) to do.
Each of these 4 items is an individual property of something so they can't simply be put, as is, into a table that merges them. As you had thought, you start with:
PaperSize
PaperWeight
PaperMode
PaperType
These are lookup tables and hence should have non-auto-incrementing ID fields.
These will be used as Foreign Key fields for the main paper-based entities.
Or, if they can only exist in certain combinations, then there would need to be a relationship table to capture/manage what those valid combinations are. But those four paper "properties" would still be separate tables that Foreign Key to the relationship table.
Some people would put a separate ID field on that relationship table to uniquely identify the combination via a single value. Personally, I wouldn't do that unless there was a technical requirement, such as Replication (or some other process/feature), that required each table to have a single-field key. Instead, I would just make the PK out of the four ID fields that point to those paper "property" lookup tables. Then those four fields would still go into any paper-based entities.
At that point the main paper entity tables would look about the same as they would without the relationship table, the difference being that instead of having 4 FKs of a single ID field each (one to each of the paper "property" tables), there would be a single FK of 4 ID fields pointing back to the PK of the relationship table.
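A minimal sketch of that layout (all names and types are assumptions; TINYINT is SQL Server / MySQL, use SMALLINT elsewhere; the fixed ID values echo the point about non-auto-incrementing keys in EDIT2 below):

CREATE TABLE PaperSize (
    paper_size_id TINYINT PRIMARY KEY,      -- not auto-incrementing; mirrors an app-layer enum
    name          VARCHAR(10) NOT NULL UNIQUE
);
INSERT INTO PaperSize (paper_size_id, name) VALUES (1, 'A0'), (2, 'A1'), (3, 'A2');

-- PaperWeight, PaperMode and PaperType follow the same pattern and are assumed
-- to exist before the relationship table is created.

-- Valid combinations only:
CREATE TABLE PaperStock (
    paper_size_id   TINYINT NOT NULL REFERENCES PaperSize (paper_size_id),
    paper_weight_id TINYINT NOT NULL REFERENCES PaperWeight (paper_weight_id),
    paper_mode_id   TINYINT NOT NULL REFERENCES PaperMode (paper_mode_id),
    paper_type_id   TINYINT NOT NULL REFERENCES PaperType (paper_type_id),
    PRIMARY KEY (paper_size_id, paper_weight_id, paper_mode_id, paper_type_id)
);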
Why not jam everything into a single table? Because:
It defeats the purpose of using a Relational Database Management System to flatten out the data into a non-relational structure.
It is harder to grow that structure over time
It makes finding all paper entities of a particular property clunkier
It makes finding all paper entities of a particular property slower / less efficient
maybe other reasons?
EDIT:
Regarding the new info (e.g. the Invoice table, etc.) that wasn't in the question when I was writing the above: that should be abstracted via a Product/Inventory table that would capture these combinations. That is what I was referring to as the main paper entities. The Invoice table would simply refer to a ProductID/InventoryID (just as an example), and the Product/Inventory table would have these paper property IDs. I don't see why these properties would be in an Invoice table.
EDIT2:
Regarding the IDs of the "property" lookup tables, one reason that they should not be auto-incrementing is that their values should be taken from Enums in the app layer. These lookup tables are just a means of providing a "data dictionary" so that the database layer can have insight into what these values mean.

What is the best way to represent a constrained many-to-many relationship within a relational database?

I would like to establish a many-to-many relationship with a constraint that only one or no entity from each side of the relationship can be linked at any one time.
A good analogy to the problem is cars and parking garage spaces. There are many cars and many spaces. A space can contain one car or be empty; a car can only be in one space at a time, or no space (not parked).
We have a Cars table and a Spaces table (and possibly a linking table). Each row in the Cars table represents a unique instance of a car (with license, owner, model, etc.) and each row in the Spaces table represents a unique parking space (with garage address, floor level, row and number). What is the best way to link these tables in the database and enforce the constraint described above?
(I am using C#, NHibernate and Oracle.)
If you're not worried about history - ie you're only worried about "now" - do this:
create table parking (
    car_id   integer not null references car,
    space_id integer not null references space,
    unique (car_id),
    unique (space_id)
);
By making both the car and space references unique, you restrict each side to a maximum of one link - ie a car can be parked in at most one space, and a space can have at most one car parked in it.
In any relational database, many-to-many relationships must have a join table to represent the combinations. As provided in the answer above (but without much of the theoretical background), you cannot represent a many-to-many relationship without having a table in the middle to store all the combinations.
It was also mentioned in that solution that it only solves your problem if you don't need history. Trust me when I tell you that real-world applications almost always need to represent historical data. There are many ways to do this, but a simple method might be to create what's called a ternary relationship with an additional table. You could, in theory, create a "time" table that also links its primary key (say, a distinct timestamp) with the inherited keys of the other two source tables. This would enable you to prevent errors where two cars are located in the same parking spot at the same time. Using a time table allows you to re-use the same time data for multiple parking spots using a simple integer id.
So, your data tables might look like so
table car
car_id (integers/numbers are fastest to index)
...
table parking-space
space_id
location
table timeslot
time_id integer
begin_datetime (don't use seconds unless you must!)
end_datetime (don't use seconds unless you must!)
Now, here's where it gets fun. You add the middle table with a composite primary key that is made up of car.car_id + parking_space.space_id + time_id. There are other things you could add to optimize here, but you get the idea, I hope.
table reservation
car_id PK
space_id PK
time_id PK (it's an integer - just try to keep it as coarse-grained as possible, 30-minute increments or something; if you allow this to include seconds/milliseconds/etc. the advantages are cancelled out, because you can't re-use the same value from the time table)
(This would also be the place to store variable rates, discounts, etc. distinct to this particular account, reservation, etc.)
Now, you can reduce the amount of data because you aren't replicating the timestamp in the join table (reservation). By using an integer, you can re-use that timeslot for multiple parking spaces, but you could also apply a constraint preventing two cars from renting a given spot for the same "timeslot" for a given day/timeframe. This would also make it easier to store some history about the customers - who knows, you might want to see reports on customers who rent more often and offer them discounts or something.
By using the ternary relationship model, you are making each spot unique to a given timeslot (perhaps with some added validation rules), so the system can only store one car in one parking spot for one given time period.
By using integers as keys instead of timestamps, you are assured that the database won't need to do any heavy lifting to index the keys and sort / query against. This is a common practice in data warehousing / OLAP reporting when you have massive datasets and you need efficiency. I think it applies here as well.
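A minimal sketch of that ternary model (types are assumptions; the constraint names are illustrative):

create table timeslot (
    time_id        integer primary key,
    begin_datetime timestamp not null,   -- coarse granularity, e.g. 30-minute steps
    end_datetime   timestamp not null
);

create table reservation (
    car_id   integer not null references car,
    space_id integer not null references space,
    time_id  integer not null references timeslot,
    primary key (car_id, space_id, time_id),
    -- a space holds only one car in a given timeslot:
    constraint one_car_per_space_per_slot unique (space_id, time_id),
    -- a car occupies only one space in a given timeslot:
    constraint one_space_per_car_per_slot unique (car_id, time_id)
);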
Create a third table:
parking
--------
car_id
space_id
start_dt
end_dt
For the constraint, I guess the problem with your situation is that you need to check a complex rule against the intersection table itself. If you try this in a trigger, it will report a mutating-table error (ORA-04091 in Oracle).
One way to avoid this would be to replicate the table, and query against the replica for the constraint.
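If the rule you need is only "at most one current occupant per space, and one current space per car" (taking end_dt IS NULL to mean currently parked), Oracle can enforce that without a trigger, via function-based unique indexes. A sketch - note it says nothing about overlapping historical intervals:

-- Rows with a non-null end_dt produce all-NULL index entries, which Oracle
-- excludes from the index, so only "current" rows are constrained.
create unique index one_current_space_per_car
    on parking (case when end_dt is null then car_id end);

create unique index one_current_car_per_space
    on parking (case when end_dt is null then space_id end);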
