I have this table to normalize for a uni project. Every time I think it should just be two tables, I then think no, it should be three... I am going to throw this out to you guys' superior knowledge, as maybe you can indicate the best way it should be done and why.
Number  Type  Single rate  Double rate  Family rate
1       D     56           72
2       D     56           72
3       T     50           72
4       T     50           72
5       S     48
6       S     48
7       S     48
8       T     50           72
9       T     50           72
10      D     56           72
11      D     56           72
12      D     56           72
13      D     56           72
14      F     56           72           84
15      F     56           72           84
16      S     48
17      S     48
18      T     50           72
20      D     56           72
Many thanks to anyone who can help me see the correct way.
It is not possible to produce a correct table design unless one understands exactly what the columns mean and how the data columns depend on one another. However, here is an attempt that can be refined once you provide more information. The naming is not as good as I'd like it to be, but as I said, the purpose is not clear from the question. Anyway, this is a start that I hope will help.
Also note that normalization is not always required for all types of applications. For example, Business Intelligence may use schemas that are deliberately not fully normalized (e.g. a star schema), so the database design can depend on the nature of the application and how the data change.
Main
----
MainID int PK
MainTypeID Char(1) Example: D, T, S etc.
MainRateIntersectionID Int
MainRateIntersection
--------------------
MainRateIntersectionID int PK
MainID int
RateCategoryID int
The combination of MainID and RateCategoryID should be constrained
with a UNIQUE index.
RateCategory
------------
RateCategoryID int PK
RateCategoryText Varchar2(15) Not Null Example: Single, Family, etc.
RateValue Int Nullable
MainType
---------
MainTypeID Char(1) PK
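To make the model concrete, here is a minimal SQL sketch of those tables. It is only a starting point: the types are taken from the listing above, generic SQL syntax is used, and the constraint names are made up.

CREATE TABLE MainType (
    MainTypeID CHAR(1) PRIMARY KEY              -- e.g. 'D', 'T', 'S', 'F'
);

CREATE TABLE RateCategory (
    RateCategoryID   INT PRIMARY KEY,
    RateCategoryText VARCHAR(15) NOT NULL,      -- e.g. 'Single', 'Double', 'Family'
    RateValue        INT NULL
);

CREATE TABLE Main (
    MainID                 INT PRIMARY KEY,
    MainTypeID             CHAR(1) NOT NULL REFERENCES MainType (MainTypeID),
    MainRateIntersectionID INT NULL             -- as listed above; no FK here to avoid a circular reference
);

CREATE TABLE MainRateIntersection (
    MainRateIntersectionID INT PRIMARY KEY,
    MainID                 INT NOT NULL REFERENCES Main (MainID),
    RateCategoryID         INT NOT NULL REFERENCES RateCategory (RateCategoryID),
    CONSTRAINT UQ_Main_RateCategory UNIQUE (MainID, RateCategoryID)
);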
Edit
Based on the new information, I have revised the model. I have removed the 'artificial' IDs, since this is a training project for normalization. Artificial IDs (surrogate keys) are fine to add, but that is not your objective, as I understand it. I have added a Booking table, where a row would be inserted for each customer who makes a booking. You need to add the appropriate customer information to that table. The table you provided is more of a logical view that could be returned from a query, not a physical table to store and update in the database. Instead, the bookings table should be used.
I hope this helps.
Related
I got into databases and normalization. I am still trying to understand normalization, and I am confused about its usage. I'll try to explain it with this example.
Every day I collect data which would look like this in a single table:
TABLE: CAR_ALL

ID   DATE        CAR       LOCATION   FUEL  FUEL_USAGE  MILES   BATTERY
123  01.01.2021  Toyota    New York   40.3  3.6         79321   78
520  01.01.2021  BMW       Frankfurt  34.2  4.3         123232  30
934  01.01.2021  Mercedes  London     12.7  4.7         4321    89
123  05.01.2021  Toyota    New York   34.5  3.3         79515   77
520  05.01.2021  BMW       Frankfurt  20.1  4.6         123489  29
934  05.01.2021  Mercedes  London     43.7  5.0         4400    89
In this example I get data for thousands of cars every day. ID, CAR and LOCATION never change. All the other data can have different values daily. If I understood correctly, normalizing would make it look like this:
TABLE: CAR_CONSTANT

ID   CAR       LOCATION
123  Toyota    New York
520  BMW       Frankfurt
934  Mercedes  London
TABLE: CAR_MEASUREMENT

GUID  ID   DATE        FUEL  FUEL_USAGE  MILES   BATTERY
1     123  01.01.2021  40.3  3.6         79321   78
2     520  01.01.2021  34.2  4.3         123232  30
3     934  01.01.2021  12.7  4.7         4321    89
4     123  05.01.2021  34.5  3.3         79515   77
5     520  05.01.2021  20.1  4.6         123489  29
6     934  05.01.2021  43.7  5.0         4400    89
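For reference, a minimal DDL sketch of this split might look like the following (the column types are my assumptions based on the sample values):

CREATE TABLE CAR_CONSTANT (
    ID       INT PRIMARY KEY,       -- car identifier, e.g. 123
    CAR      VARCHAR(50) NOT NULL,  -- e.g. 'Toyota'
    LOCATION VARCHAR(50) NOT NULL   -- e.g. 'New York'
);

CREATE TABLE CAR_MEASUREMENT (
    GUID       INT PRIMARY KEY,     -- surrogate key for the measurement row
    ID         INT NOT NULL REFERENCES CAR_CONSTANT (ID),
    DATE       DATE NOT NULL,
    FUEL       DECIMAL(6,1),
    FUEL_USAGE DECIMAL(4,1),
    MILES      INT,
    BATTERY    INT
);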
I have two questions:
Does it make sense to create an extra table for DATE?
It is possible that new cars will be included through the collected data.
For every row I insert into CAR_MEASUREMENT, I would have to check whether the ID is already in CAR_CONSTANT. If it doesn't exist, I'd have to insert it.
But that means I would have to check CAR_CONSTANT thousands of times every day. Wouldn't it be more efficient to just insert the whole data as one row into CAR_ALL? I wouldn't have to check CAR_CONSTANT every time.
The benefits of normalization depend on your specific use case. I can see both pros and cons to normalizing your schema, but it's impossible to say which is better without more knowledge of your use case.
Pros:
With your schema, normalization could reduce the amount of data consumed by your DB since CAR_MEASUREMENT will probably be much larger than CAR_CONSTANT. This scales up if you are able to factor out additional data into CAR_CONSTANT.
Normalization could also improve data consistency if you ever begin tracking additional fixed data about a car, such as license plate number. You could simply update one row in CAR_CONSTANT instead of potentially thousands of rows in CAR_ALL.
A normalized data structure can make it easier to query data for a specific car. Using a LEFT JOIN, the DBMS can search the CAR_MEASUREMENT table based on the integer ID column instead of having to compare two string columns.
Cons:
As you noted, the normalized form requires an additional lookup and possible insert to CAR_CONSTANT for every addition to CAR_MEASUREMENT. Depending on how fast you are collecting this data, those extra queries could be too much overhead.
To answer your questions directly:
I would not create an extra table for just the date. The date is a part of the CAR_MEASUREMENT data and should not be separated. The only exception that I can think of to this is if you will eventually collect measurements that do not contain any car data. In that case, then it would make sense to split CAR_MEASUREMENT into separate MEASUREMENT and CAR_DATA tables with MEASUREMENT containing the date, and CAR_DATA containing just the car-specific data.
See above. If you have a use case to query data for a specific car, then the normalized form can be more efficient. If not, then the additional INSERT overhead may not be worth it.
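On the second point, the existence check does not have to be a separate round trip from your application; it can be folded into a single conditional INSERT. A rough sketch (SQL Server/PostgreSQL-style syntax; MySQL would use INSERT IGNORE or ON DUPLICATE KEY UPDATE instead):

-- Insert the fixed car data only if this ID is not known yet
INSERT INTO CAR_CONSTANT (ID, CAR, LOCATION)
SELECT 123, 'Toyota', 'New York'
WHERE NOT EXISTS (SELECT 1 FROM CAR_CONSTANT WHERE ID = 123);

-- The measurement row is inserted unconditionally
-- (7 is just the next surrogate key value for the example)
INSERT INTO CAR_MEASUREMENT (GUID, ID, DATE, FUEL, FUEL_USAGE, MILES, BATTERY)
VALUES (7, 123, '2021-01-01', 40.3, 3.6, 79321, 78);

-- Querying measurements for one car then joins on the small integer key
SELECT c.CAR, c.LOCATION, m.DATE, m.FUEL, m.FUEL_USAGE, m.MILES, m.BATTERY
FROM CAR_MEASUREMENT m
JOIN CAR_CONSTANT c ON c.ID = m.ID
WHERE c.ID = 123;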
I have one set of data with the fields StudentId, Name, Address in one dataset, which is used in one Tablix.
I also have another set of data, StudentID, Subject, Marks, in another dataset, using a Matrix to pivot in the report.
I am able to produce the report this way:
StudentID  Name   Address  Maths  Physics  Chemistry  Median
1          Mike   NJ       85     70       90         2
2          David  CA       81     85       90         1
I was calculating Median by counting the number of subject marks greater than 80.
Now how do I use the value of Median in the Tablix instead of in the Matrix?
Below is the expected output format:
StudentID  Median  Name   Address  Maths  Physics  Chemistry
1          2       Mike   NJ       85     70       90
2          3       David  CA       81     85       90
Note: I am using a Matrix to pivot the Subject column in the SSRS report. I am doing the pivot in SSRS instead of in the stored procedure because pivoting in the SP produces 40 columns that I would need to map manually. In this example I have only given 3 columns (Maths, Physics and Chemistry).
Also, please let me know whether the expected output format is at least possible.
Is there any way I can pivot the Subject columns inside the Tablix itself instead of using a separate Matrix?
Thank you.
There are typically two ways to go about an aggregation like this. If you stick with the two existing datasets, you'll have to use the Lookup or LookupSet functions to get data from the other dataset. For example, if your table/matrix uses the second dataset as its source, you would look up the Name of each student. Keep in mind that this is not efficient for large reports.
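For example, if the Tablix is bound to the marks dataset, an expression like this pulls the student's name from the other dataset (the dataset name "StudentInfo" is just a placeholder for whatever yours is called):

=Lookup(Fields!StudentID.Value, Fields!StudentID.Value, Fields!Name.Value, "StudentInfo")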
The other approach, which I would recommend, is to join these two datasets in SQL and use that as the data source for the report. This is more efficient and makes the report simpler to maintain.
It's good that you are letting the report do the pivoting for you, it works much better that way.
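As a sketch, assuming the two datasets come from tables (or procedures) named Student and StudentMarks (the names are placeholders), the combined query could look like this, with Median computed the same way you describe (count of marks greater than 80):

SELECT s.StudentID,
       s.Name,
       s.Address,
       m.Subject,
       m.Marks,
       (SELECT COUNT(*)
          FROM StudentMarks m2
         WHERE m2.StudentID = s.StudentID
           AND m2.Marks > 80) AS Median
FROM Student s
JOIN StudentMarks m ON m.StudentID = s.StudentID;

The matrix then groups on StudentID, pivots on Subject, and Median can be placed in the row group, since it repeats on every detail row for a student.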
I'm using COLUMNS_UPDATED() in a trigger to identify those columns whose values should be written to an audit table. The trigger / auditing had been working fine for multiple years. I noticed yesterday that the auditing is no longer working consistently.
I've listed the first forty columns of the table in question at the bottom for reference, along with the ORDINAL_POSITION from INFORMATION_SCHEMA.COLUMNS. The table has a total of 109 columns.
I added print COLUMNS_UPDATED() to my trigger to get some debug info.
When I update CurrentOnFleaTick, the 9th column, I see this printed:
0x0001000000000000000000000000
This is expected - the 9th column should be represented as the least significant bit of the second byte. Similarly, if I update HasAttackedAnotherAnimalExplanation I see this:
0x0000010000000000000000000000
Again, expected - the 17th column should be represented as the least significant bit of the third byte.
But... when I update HouseholdIncludesCats, I see this:
0x0000000200000000000000000000
Not expected! Where you see the 2 there should be a 1, as HouseholdIncludesCats ordinal position is 25, making it the first column represented in the fourth byte, which should be represented in the least significant bit of that byte.
I narrowed things down by updating every column between HasAttackedAnotherAnimalExplanation and HouseholdIncludesCats and found that the 'off by one' problem I'm having starts with HouseTrainedId, ordinal position 24. When updating HouseTrainedId I'm expecting
0x0000800000000000000000000000
but instead I get
0x0000000100000000000000000000
which I believe is wrong, and it is what I expect to be getting for updates to the HouseholdIncludesCats column.
I do not believe the mask should skip ahead. The mask is currently not using the most significant bit of the 3rd byte.
I did recently drop a column, but I don't have a record of its ordinal position. Based on the original code that would have created the table, I believe the ordinal position of the column that was dropped was NOT 24. (I think it was 7... It had been defined after the BreedIds.)
I'm not necessarily looking for a deep root cause determination. If there was something I could do to reset whatever internal data SQL Server uses that'd be fine. Sort of like a rebuild index idea for table metadata? Is there something like that that might fix this?
Thanks in advance for helpful answers! :)
COLUMN_NAME ORDINAL_POSITION
PetId 1
AdopterUserId 2
AdoptionDeadline 3
AgeMonths 4
AgeYears 5
BreedIds 6
Color 7
CreatedOn 8
CurrentOnFleaTick 9
CurrentOnHeartworm 10
CurrentOnVaccinations 11
FoodTypeId 12
GenderId 13
GuardianForMonths 14
GuardianForYears 15
HairCoatLength 16
HasAttackedAnotherAnimalExplanation 17
HasAttackedAnotherAnimalId 18
HasBeenReferredByShelter 19
HasHadTraining 20
HasMedicalConditions 21
HasRecentlyBittenExplanation 22
HasRecentlyBittenId 23
HouseTrainedId 24
HouseholdIncludesCats 25
HouseholdIncludesChildren5to10 26
HouseholdIncludesChildrenUnder5 27
HouseholdIncludesDogs 28
HouseholdIncludesOlderChildren 29
HouseholdIncludesOtherPets 30
HouseholdOtherPets 31
KnowsCommandDown 32
KnowsCommandPaw 33
KnowsCommandSit 34
KnowsCommandStay 35
KnowsOtherCommands 36
LastUpdatedOn 37
LastVisitedVetOn 38
ListingCodeId 39
LitterTypeClumping 40
So... I thought I had googled enough before posting this, but I guess I hadn't. I found this:
https://www.sqlservercentral.com/forums/topic/columns_updated-and-phantom-fields
Using COLUMNPROPERTY() to get the ColumnID is definitely the way to go.
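In other words, map each column name to its real column id and test that bit, instead of trusting ORDINAL_POSITION (which appears to be renumbered after a column is dropped, while the column ids keep their gap). A sketch of the check inside the trigger; the table name dbo.Pet is assumed here:

DECLARE @colId int =
    COLUMNPROPERTY(OBJECT_ID('dbo.Pet'), 'HouseholdIncludesCats', 'ColumnId');

-- COLUMNS_UPDATED() sets bit (column id - 1), counted from the least
-- significant bit of the first byte onward.
IF (SUBSTRING(COLUMNS_UPDATED(), ((@colId - 1) / 8) + 1, 1)
        & POWER(2, (@colId - 1) % 8)) > 0
BEGIN
    -- HouseholdIncludesCats was updated; write the audit row here
    PRINT 'HouseholdIncludesCats was updated';
END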
I have two database tables:
report(id, description) (key: id) and
registration(a, b, id_report) (key: (a, b));
id_report is a foreign key that references report.id.
In the table registration there is the functional dependency a -> id_report.
So the table registration is in 1NF but not in 2NF.
Despite this, I cannot find any insert/update/delete problems in the table registration. Is that possible?
Thanks
You said in a comment that you couldn't "find how problems could occur." (Emphasis added.) Here's how.
Let's say your table "registration" starts off with data like this.
a b id_report
--
1 10 13
1 11 13
1 12 13
2 27 14
2 33 14
The functional dependency a->id_report still holds. When we know the value for "a", we find one and only one value for "id_report".
But the dbms can't directly enforce that dependency. That means the dbms will permit this update statement to run without error.
update registration
set id_report = 15
where a = 1 and b = 10;
a b id_report
--
1 10 15
1 11 13
1 12 13
2 27 14
2 33 14
Now your data is broken. When we know the value for "a", we now find two values for "id_report". In the earlier table, knowing that "a" equaled 1 meant we knew that "id_report" equaled 13. We no longer know that; if "a" equals 1, id_report might be either 13 or 15.
A table can be denormalized and still not have any existing referential integrity issues.
The reason to normalize is to make it more difficult or impossible to create insert, update and delete anomalies. It is possible, but pretty hard, to manage all of the redundant data such that it remains consistent.
It's still a better idea to use a database in 3NF (or higher, if applicable) so that you aren't relying on programmers and users to keep you out of trouble. Sooner or later mistakes will happen.
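For completeness, here is a sketch of a decomposition that lets the DBMS enforce a -> id_report itself (the table name registration_report and the INT types are illustrative):

CREATE TABLE report (
    id          INT PRIMARY KEY,
    description VARCHAR(100)
);

-- a -> id_report becomes a key constraint: exactly one id_report per value of a
CREATE TABLE registration_report (
    a         INT PRIMARY KEY,
    id_report INT NOT NULL REFERENCES report (id)
);

-- registration keeps only its own key; id_report is no longer repeated here
CREATE TABLE registration (
    a INT NOT NULL REFERENCES registration_report (a),
    b INT NOT NULL,
    PRIMARY KEY (a, b)
);

With this split, the UPDATE shown above can no longer leave two different id_report values for a = 1.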
To start, here is some sample data.
Sample Input
ID Date Value
10 2012-06-01 00:01:45 20
10 2012-06-01 00:01:51 12
10 2012-06-01 00:01:56 21
10 2012-06-01 00:02:01 43
10 2012-06-01 00:02:06 12
17 2012-06-01 00:02:43 64
17 2012-06-01 00:02:47 53
17 2012-06-01 00:02:52 23
17 2012-06-01 00:02:58 45
17 2012-06-01 00:03:03 34
Desired Output
ID  First date           Last date            First value  Last value
10  2012-06-01 00:01:45  2012-06-01 00:02:06  20           12
17  2012-06-01 00:02:43  2012-06-01 00:03:03  64           34
So I am looking to get the first and last date, and the values for both, onto a single line. The ID value in my table will also have other entries at later dates, so I only want the first and last for a chain of entries. Each entry is 5 seconds apart; if two entries are further apart than that, it is a new chain.
Any suggestions?
Thanks
I'm just beginning the search process on this, but it looks like LATERAL VIEW and EXPLODE, coupled with maybe a user-defined function or two, are your friends.
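If your Hive version has windowing functions and CTEs (roughly 0.11/0.13 and later), a gaps-and-islands query is another way to sketch it, without LATERAL VIEW or EXPLODE. The table name readings and the columns id, dt, value are assumptions based on the sample data:

WITH flagged AS (
    -- a row starts a new chain if it is the first for the id
    -- or more than 5 seconds after the previous row
    SELECT id, dt, value,
           CASE WHEN LAG(dt) OVER (PARTITION BY id ORDER BY dt) IS NULL
                  OR unix_timestamp(dt)
                     - unix_timestamp(LAG(dt) OVER (PARTITION BY id ORDER BY dt)) > 5
                THEN 1 ELSE 0 END AS new_chain
    FROM readings
),
chained AS (
    -- running sum of the flags numbers the chains within each id
    SELECT id, dt, value,
           SUM(new_chain) OVER (PARTITION BY id ORDER BY dt) AS chain_no
    FROM flagged
),
ends AS (
    SELECT id, chain_no, dt,
           FIRST_VALUE(value) OVER (PARTITION BY id, chain_no ORDER BY dt) AS first_value,
           LAST_VALUE(value)  OVER (PARTITION BY id, chain_no ORDER BY dt
                                    ROWS BETWEEN UNBOUNDED PRECEDING
                                             AND UNBOUNDED FOLLOWING) AS last_value
    FROM chained
)
SELECT id,
       MIN(dt)          AS first_date,
       MAX(dt)          AS last_date,
       MIN(first_value) AS first_value,   -- constant within each chain
       MIN(last_value)  AS last_value
FROM ends
GROUP BY id, chain_no;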
I ended up creating a MapReduce job to work on the CSV files of my data instead of using Hive.
I "mapped" based on ID, then set a parameter so that if data points were further apart than 2 hours, I separated them into different chains.
In the end it was easier to hack the MapReduce code than to ponder Hive queries.