I have been trying to understand data modelling and warehousing. Given the best practice of having only one level of granularity in a table, can I have one table store both the low-granularity data and the aggregated data?
Two-table structure

Table 1

TransactionID  Transaction_Dt  ProductID  Items  Cost_Total
11111          1/1/2020        1          10     100
11111          1/1/2020        2          5      200
11111          1/1/2020        3          4      400
11111          1/1/2020        4          5      100
11111          1/1/2020        5          6      600
11111          1/1/2020        6          10     100

Table 2 (aggregated)

TransactionID  Transaction_Dt  Total_Items  Cost_Total
11111          1/1/2020        40           1500
One-table structure
Aggregate data in the same table

TransactionID  Transaction_Dt  ProductID  Items  Cost_Total  Type
11111          1/1/2020        1          10     100         ind_Item
11111          1/1/2020        2          5      200         ind_Item
11111          1/1/2020        3          4      400         ind_Item
11111          1/1/2020        4          5      100         ind_Item
11111          1/1/2020        5          6      600         ind_Item
11111          1/1/2020        6          10     100         ind_Item
**11111        1/1/2020        ALL        40     1500        all_Item**
Here we have one record for the entire transaction with the total item count and the total cost.
Can anyone help me with the cons of the second approach, where the aggregated data is stored in the same table?
Some thoughts about this:
I am not a fan of storing data at multiple levels of aggregation in one table, for the reason that @Marmite Bomber suggested: if you do a SELECT SUM and don't filter out the aggregate rows, you'll get a multiple of the answer you're looking for.
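For example (a sketch against the one-table layout above; 'transactions' is just a placeholder name for that table):

-- Forgetting the filter silently doubles the figure:
SELECT SUM(Cost_Total) FROM transactions;                           -- 3000, not 1500

-- Every query has to remember the filter to be correct:
SELECT SUM(Cost_Total) FROM transactions WHERE Type = 'ind_Item';   -- 1500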
If you do still want to put everything into one table, I'd add another column, perhaps called agg_level, and use it to indicate the aggregation level of each row. (You are already doing something like this with your 'Type' column, although 'Type' is a very ambiguous name.)
I'd recommend against changing the TransactionID value (you propose adding some asterisks to indicate that a row is an aggregate). Modifying it will make it harder to search for the one you want, and users will have to understand your notation to get the right records. If you add an agg_level column and leave the TransactionIDs in their original form, you can put an easily recognizable term in the agg_level column. For example, a record could say "raw", "Transaction Total", or "Monthly Aggregate"...
If you have to put your aggregates into your base data table, as you've shown, you should consider creating views on top of the table, each view exposing detail at only one level of aggregation. You would likely give users access only to these views, and not to the base table. That way you store everything in one table but, to users, it looks like you have multiple tables, and you needn't worry about a user accidentally writing a query that brings back doubled totals.
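For example, something like this (a sketch only; the table name and the agg_level values are placeholders):

-- One view per aggregation level; users query the views, not the base table.
CREATE VIEW transaction_detail AS
    SELECT TransactionID, Transaction_Dt, ProductID, Items, Cost_Total
    FROM transactions
    WHERE agg_level = 'raw';

CREATE VIEW transaction_totals AS
    SELECT TransactionID, Transaction_Dt, Items AS Total_Items, Cost_Total
    FROM transactions
    WHERE agg_level = 'Transaction Total';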
It's a good question, Snehasish, and it shows that you've been giving it some thought. Best of luck as you move forward!
Related
I'm having difficulty connecting a recursive/hierarchical dimension table to a fact table, as there are some concerns/issues to deal with:
The dimension table has a parent-child structure
The original table keeps growing as new entries are added
id  item_name       parent_id
1   classification  null
2   category        null
3   group           null
4   modern          1
5   modified        1
6   tools           2
7   meters          2
8   metal           3
9   plastic         3
10  lead            8
11  alloy           8
Denormalizing (flattening) this kind of table is not suitable, because every time a new entity type comes in it would change the dimension structure.
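For reference, the usual way to read such a parent-child table without flattening it is a recursive CTE, roughly like this (a sketch only; 'dimension_item' is a placeholder name for the table above, and SQL Server omits the RECURSIVE keyword):

WITH RECURSIVE hierarchy AS (
    -- anchor: top-level entries with no parent
    SELECT id, item_name, parent_id, 1 AS lvl
    FROM dimension_item
    WHERE parent_id IS NULL
    UNION ALL
    -- recursive step: attach each child to its parent
    SELECT c.id, c.item_name, c.parent_id, h.lvl + 1
    FROM dimension_item c
    JOIN hierarchy h ON c.parent_id = h.id
)
SELECT id, item_name, parent_id, lvl
FROM hierarchy;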
What is the best approach for this type of dimension?
Kindly provide an example, and show what the query statement would look like after connecting the fact and dimension tables.
I need to add column headers to an SSRS report, but they are dynamic in nature.
For example, sometimes the query will return 5 differently named columns with their data, and sometimes it will return 9 differently named columns with their data, and so on.
So how do I drag or refresh the columns in the dataset, and how do I show them dynamically in the SSRS report?
I am totally confused; I have seen many articles but have not been able to find a solution.
How do I implement this in an SSRS report? I have the query; depending on the parameters, the columns get generated. Check the sample report preview below.
It is showing dates in different columns.
In SSRS, the dataset must always return the same number of columns with the same names and data types, so you cannot do what you want directly.
You have two options.
Option 1
Normalise the data.
So instead of returning something like
SomeID  ColumnA  ColumnB  ColumnC
1       10       20       30
2       15       25       35
3       100      200      300
You need to return
SomeID  ColName    Amount
1       'ColumnA'  10
1       'ColumnB'  20
1       'ColumnC'  30
2       'ColumnA'  15
2       'ColumnB'  25
2       'ColumnC'  35
3       'ColumnA'  100
3       'ColumnB'  200
3       'ColumnC'  300
Once you have your data in this format, you can simply use a matrix in your report. Set the row group to SomeID, the column group to ColName, and the data value to Amount.
This is by far the simplest solution.
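If you are handed the wide result for a known set of columns and need to reshape it yourself, one way in T-SQL is UNPIVOT (a sketch only; dbo.MyWideTable is a placeholder for whatever currently returns the wide data, and UNPIVOT needs the column list up front):

SELECT SomeID, ColName, Amount
FROM (
    SELECT SomeID, ColumnA, ColumnB, ColumnC
    FROM dbo.MyWideTable
) AS src
UNPIVOT (
    Amount FOR ColName IN (ColumnA, ColumnB, ColumnC)
) AS unpvt;

If the wide shape is produced by a PIVOT or dynamic SQL in your current query, it is usually simpler to return the pre-pivot rows directly and let the matrix do the pivoting.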
Option 2
Deconstruct and rebuild the table in code
There are several drawbacks to this method, but if you are interested, read my answer to this question asked a few days ago:
SQL Server - SSRS - Display the content of a Table/View directly in the report (and not using table/matrix)
I am performing large-scale wind simulations to produce hourly wind patterns over a city. The result is a time series of 2-dimensional contours. Currently I am storing the results in SQLite3 database tables with the following structure:
Table: CFD
id, timestamp, velocity, cell_id
1 , 2010-01-01 08:00:00, 3.345, 1
2 , 2010-01-01 08:00:00, 2.355, 2
3 , 2010-01-01 08:00:00, 2.111, 3
4 , 2010-01-01 08:00:00, 6.432, 4
.., ..................., ....., .
1000 , 2010-01-01 09:00:00, 3.345, 1
1001 , 2010-01-01 10:00:00, 2.355, 2
1002 , 2010-01-01 11:00:00, 2.111, 3
1003 , 2010-01-01 12:00:00, 6.432, 4
.., ..................., ....., .
Actual create statement:
CREATE TABLE cfd(id INTEGER PRIMARY KEY, time DATETIME, u, cell_id integer)
CREATE INDEX idx_cell_id_cfd on cfd(cell_id)
CREATE INDEX idx_time_cfd on cfd(time)
(There are three of these tables, each for a different result variable)
where cell_id is a reference to the cell in the domain representing a location in the city. See this picture to have an idea of what it looks like at a specific timestep.
The typical query performs some kind of aggregation over the time dimension with a GROUP BY on cell_id. For example, if I want to know how often the local wind speed in each cell exceeds a threshold during a specific set of hours, I would execute
select sum(time in ('2010-01-01 08:00:00', '2010-01-01 13:00:00', '2010-01-01 14:00:00', ....................,
                    '2010-12-30 18:00:00', '2010-12-30 19:00:00', '2010-12-30 20:00:00', '2010-12-30 21:00:00')
           and u > 5.0)
from cfd
group by cell_id
The number of timestamps can vary from 100 to 8,000.
This is fine for small databases, but it gets much slower for larger ones. For example, my last database was 60 GB, with 3 tables of 222,000,000 rows each.
Is there a better way to store the data? For example:
would it make sense to create a different table for each day?
would it be better to use a separate table for the timesteps and then use a join?
is there a better way of indexing?
I have already adopted all the recommendations in this question to maximise the performance.
This particular query is hard to optimize because the sum() must be computed over all table rows. It is a better idea to filter rows with WHERE:
SELECT cell_id, count(*)
FROM cfd
WHERE time IN (...)
  AND u > 5
GROUP BY cell_id;
If possible, use a simpler expression to filter times, such as time BETWEEN a AND b.
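For example, the query could then be written as (a sketch; the date range is illustrative):

SELECT cell_id, count(*)
FROM cfd
WHERE time BETWEEN '2010-01-01 08:00:00' AND '2010-12-30 21:00:00'
  AND u > 5.0
GROUP BY cell_id;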
It might be worthwhile to use a covering index, or in this case, when all queries filter on the time, a clustered index (without additional indexes):
CREATE TABLE cfd (
cell_id INTEGER,
time DATETIME,
u,
PRIMARY KEY (cell_id, time)
) WITHOUT ROWID;
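You can check how SQLite handles this with EXPLAIN QUERY PLAN (a sketch; the date range is illustrative). Because the rows are stored in (cell_id, time) order, the GROUP BY on cell_id should not need a temporary b-tree:

EXPLAIN QUERY PLAN
SELECT cell_id, count(*)
FROM cfd
WHERE time BETWEEN '2010-06-01 00:00:00' AND '2010-06-30 23:00:00'
  AND u > 5.0
GROUP BY cell_id;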
I'm in an intro to database management course and we're learning about normalizing data (1NF, 2NF, 3NF, etc.), and I'm super confused about how to actually go about it. I've read up on this and consulted various sites and YouTube videos, but I still can't seem to get it to click. I am using Microsoft Access 2013, if that's of any help.
This is the data I'm working with.
Thanks.
Edit1: Alright, I think I have the tables set up correctly. But now I'm having trouble actually inputting data to go from one table to the next. Here's my relationship table.
On a very basic level, any repeating values in a table are candidates for normalization. Duplicated data is usually a bad idea. Say you needed to update a patient's surname - you now have to update all the occurrences in this table, and possibly many others throughout the rest of the database. Much better to store each patient's details in one place only.
This is where normalization comes in. Looking down the columns, you can see that there are repeating values for data about dentists, patients and surgeries, so we should normalize towards having tables for each of these entities, as well as the original table that contains appointments, giving you four tables in total.
Extract the entities out into their own tables, and give each row a primary (unique) key - just use an incrementing integer for now. (Edit: as suggested in the comment we could use the natural keys of PatientNo, StaffNo and SurgeryNo instead of creating surrogates.)
Then, instead of each patient's name and number appearing multiple times in the appointments table, we just reference the key of the master record in the Patient table. This is called a foreign key.
Then, do the same for Dentist and Surgery.
You will end up with tables looking something like this:
APPOINTMENT
AppointmentID DentistID PatientID AppointmentTime SurgeryID
----------------------------------------------------------------
1 1 1 12 Aug 03 10:00 1
2 1 2 ... 2
3 2 3 ... 1
4 2 3 ... 1
5 3 2 ... 2
6 3 4 ... 3
DENTIST
DentistID Name StaffNo
--------------------------------------
1 Tony Smith S1011
2 Helen Pearson S1024
3 Robin Plevin S1032
PATIENT
PatientID Name PatientNo
---------------------------------------
1 Gillian White P100
2 Jill Bell P105
3 Ian MackKay P108
4 John Walker P110
SURGERY
SurgeryID SurgeryNo
-------------------------
1 S10
2 S15
3 S13
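If you prefer to create these tables with SQL rather than the Access table designer, the declarations might look roughly like this (a sketch only; the data types, lengths, and inline REFERENCES syntax are assumptions and may need adjusting for your database):

CREATE TABLE Dentist (
    DentistID INTEGER PRIMARY KEY,
    Name      VARCHAR(50),
    StaffNo   VARCHAR(10)
);

CREATE TABLE Patient (
    PatientID INTEGER PRIMARY KEY,
    Name      VARCHAR(50),
    PatientNo VARCHAR(10)
);

CREATE TABLE Surgery (
    SurgeryID INTEGER PRIMARY KEY,
    SurgeryNo VARCHAR(10)
);

CREATE TABLE Appointment (
    AppointmentID   INTEGER PRIMARY KEY,
    DentistID       INTEGER REFERENCES Dentist(DentistID),
    PatientID       INTEGER REFERENCES Patient(PatientID),
    SurgeryID       INTEGER REFERENCES Surgery(SurgeryID),
    AppointmentTime DATETIME
);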
The first step in data modelling and normalization is to understand your data. Study it and understand the domain "objects" or tables that exist within your model. That will give you an idea of where to start. Sometimes a single table or query sample is not enough to fully understand the database, but in your case we can use the sample data and make some assumptions.
Secondly, look for repeated/redundant data. If you see copies of names, there is a good chance they are a candidate for a foreign key. Our assumption is that STAFF_NO is a primary key candidate for DENTIST, because each unique STAFF_NO correlates to a unique DENTIST_NAME, so I see a good candidate DENTIST table (STAFF_NO, DENTIST_NAME).
Example rows in some SURGERY table:
ID STAFF_NO DENTIST_NAME
1 1 Fred Sanford
2 1 Fred Sanford
3 3 Lamont Sanford
4 3 Lamont Sanford
Why store these over and over? What happens when Fred says, "But my correct name is Fred G Sanford", and you have to update your database? In the current table, you have to update the name in many rows. If you had normalized it, you'd have a single location for the name: the DENTIST table.
So I can take the unique dentists and store them in DENTIST
create table DENTIST(staff_no integer primary key, dentist_name varchar(100));
-- One possible way to populate our dentist table is to use a distinct query from surgery
insert into DENTIST
select distinct staff_no, dentist_name from surgery;
STAFF_NO DENTIST_NAME
1 Fred Sanford
3 Lamont Sanford
The SURGERY table now points to the DENTIST table:
ID STAFF_NO
1 1
2 1
3 3
4 3
And you can now create a view, VIEW_SURGERY, to join DENTIST_NAME back in and satisfy the needs of typical queries:
create view view_surgery as
select s.id, d.staff_no, d.dentist_name
from surgery s join dentist d
on s.staff_no = d.staff_no; -- join here
So now an update to DENTIST by its primary key will update a single row:
update dentist set dentist_name = 'Fred G Sanford' where staff_no = 1;
And querying the view will show the updated name for N rows:
select * from view_surgery
ID STAFF_NO DENTIST_NAME
1 1 Fred G Sanford
2 1 Fred G Sanford
3 3 Lamont Sanford
4 3 Lamont Sanford
In short, you are removing redundancy.
This is just a sample, and one way to do it. Manual normalization like this is not as common when you have modelling tools, but the point is that we can look at data, spot redundancies, factor those redundancies out into new tables, relate those new tables by foreign keys and joins, and then build views to represent the original data.
I need to selectively retrieve data from two tables that have a 1 to many relationship. A simplified example follows.
Table A is a list of events:
Id | TimeStamp | EventTypeId
--------------------------------
1 | 10:26... | 12
2 | 11:31... | 13
3 | 14:56... | 12
Table B is a list of properties for the events. Different event types have different numbers of properties. Some event types have no properties at all:
EventId | Property | Value
------------------------------
1 | 1 | dog
1 | 2 | cat
3 | 1 | mazda
3 | 2 | honda
3 | 3 | toyota
There are a number of conditions that I will apply when I retrieve the data, however they all revolve around table A. For instance, I may want only events on a certain day, or only events of a certain type.
I believe I have two options for retrieving the data:
Option 1
Perform two queries: first query table A (with a WHERE clause) and store data somewhere, then query table B (joining on table A in order to use same WHERE clause) and "fill in the blanks" in the data that I retrieved from table A.
This option requires SQL Server to perform 2 searches through table A, however the resulting 2 data sets contain no duplicate data.
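Roughly, Option 1 would be two statements along these lines (a sketch; the EventTypeId filter is just an example, and TableA/TableB stand in for Table A and Table B above):

-- Query 1: the events themselves.
SELECT Id, TimeStamp, EventTypeId
FROM TableA
WHERE EventTypeId = 12;

-- Query 2: their properties, reusing the same filter via a join back to Table A.
SELECT b.EventId, b.Property, b.Value
FROM TableB b
JOIN TableA a ON a.Id = b.EventId
WHERE a.EventTypeId = 12;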
Option 2
Perform a single query, joining table A to table B with a LEFT JOIN.
This option only requires one search of table A but the resulting data set will contain many duplicated values.
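And Option 2 would be a single statement (again just a sketch with the same placeholder names):

-- Event columns repeat once per property row; events with no
-- properties still appear, with NULLs in the property columns.
SELECT a.Id, a.TimeStamp, a.EventTypeId, b.Property, b.Value
FROM TableA a
LEFT JOIN TableB b ON b.EventId = a.Id
WHERE a.EventTypeId = 12;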
Conclusion
Is there a "correct" way to do this or do I need to try both ways and see which one is quicker?
Example:
Select E.Id,E.Name from Employee E join Dept D on E.DeptId=D.Id
and a subquery something like this -
Select E.Id,E.Name from Employee E where E.DeptId in (Select Id from Dept)
When I consider performance, which of the two queries would be faster, and why?
I would EXPECT the first query to be quicker, mainly because you have an equivalence and an explicit JOIN. In my experience IN is a very slow operator, since SQL normally evaluates it as a series of WHERE clauses separated by "OR" (WHERE x=Y OR x=Z OR ...).
As with ALL THINGS SQL though, your mileage may vary. The speed will depend a lot on indexes (do you have indexes on both ID columns? That will help a lot...) among other things.
The only REAL way to tell with 100% certainty which is faster is to turn on performance tracking (IO Statistics is especially useful) and run them both. Make sure to clear your cache between runs!
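For example, a rough comparison harness in SQL Server might look like this (a sketch; DBCC DROPCLEANBUFFERS flushes the buffer cache, so only run it on a test server):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

DBCC DROPCLEANBUFFERS;  -- clear cached data pages before the first run
SELECT E.Id, E.Name FROM Employee E JOIN Dept D ON E.DeptId = D.Id;

DBCC DROPCLEANBUFFERS;  -- and again before the second run
SELECT E.Id, E.Name FROM Employee E WHERE E.DeptId IN (SELECT Id FROM Dept);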