I got into databases and normalization. I am still trying to understand normalization and I am confused about its usage. I'll try to explain it with this example.
Every day I collect data which would look like this in a single table:
TABLE: CAR_ALL
ID  | DATE       | CAR      | LOCATION  | FUEL | FUEL_USAGE | MILES  | BATTERY
------------------------------------------------------------------------------
123 | 01.01.2021 | Toyota   | New York  | 40.3 | 3.6        | 79321  | 78
520 | 01.01.2021 | BMW      | Frankfurt | 34.2 | 4.3        | 123232 | 30
934 | 01.01.2021 | Mercedes | London    | 12.7 | 4.7        | 4321   | 89
123 | 05.01.2021 | Toyota   | New York  | 34.5 | 3.3        | 79515  | 77
520 | 05.01.2021 | BMW      | Frankfurt | 20.1 | 4.6        | 123489 | 29
934 | 05.01.2021 | Mercedes | London    | 43.7 | 5.0        | 4400   | 89
In this example I get data for thousands of cars every day. ID, CAR and LOCATION never change. All the other columns can have different values every day. If I understood correctly, normalizing would make it look like this:
TABLE: CAR_CONSTANT
ID  | CAR      | LOCATION
--------------------------
123 | Toyota   | New York
520 | BMW      | Frankfurt
934 | Mercedes | London
TABLE: CAR_MEASUREMENT
GUID | ID  | DATE       | FUEL | FUEL_USAGE | MILES  | BATTERY
---------------------------------------------------------------
1    | 123 | 01.01.2021 | 40.3 | 3.6        | 79321  | 78
2    | 520 | 01.01.2021 | 34.2 | 4.3        | 123232 | 30
3    | 934 | 01.01.2021 | 12.7 | 4.7        | 4321   | 89
4    | 123 | 05.01.2021 | 34.5 | 3.3        | 79515  | 77
5    | 520 | 05.01.2021 | 20.1 | 4.6        | 123489 | 29
6    | 934 | 05.01.2021 | 43.7 | 5.0        | 4400   | 89
I have two questions:
Does it make sense to create an extra table for DATE?
It is possible that new cars will appear in the collected data.
For every row I insert into CAR_MEASUREMENT, I would have to check whether the ID already exists in CAR_CONSTANT. If it doesn't, I'd have to insert it.
But that means I would have to check CAR_CONSTANT thousands of times every day. Wouldn't it be more efficient to just insert all the data as one row into CAR_ALL? Then I wouldn't have to check CAR_CONSTANT every time.
The benefits of normalization depend on your specific use case. I can see both pros and cons to normalizing your schema, but it's impossible to say which is better without more knowledge of your use case.
Pros:
With your schema, normalization could reduce the amount of data consumed by your DB since CAR_MEASUREMENT will probably be much larger than CAR_CONSTANT. This scales up if you are able to factor out additional data into CAR_CONSTANT.
Normalization could also improve data consistency if you ever begin tracking additional fixed data about a car, such as license plate number. You could simply update one row in CAR_CONSTANT instead of potentially thousands of rows in CAR_ALL.
A normalized data structure can make it easier to query data for a specific car. Using a LEFT JOIN, the DBMS can search through the CAR_MEASUREMENT table based on the integer ID column instead of having to compare two string columns.
Cons:
As you noted, the normalized form requires an additional lookup and possible insert to CAR_CONSTANT for every addition to CAR_MEASUREMENT. Depending on how fast you are collecting this data, those extra queries could be too much overhead.
To answer your questions directly:
I would not create an extra table for just the date. The date is a part of the CAR_MEASUREMENT data and should not be separated. The only exception that I can think of to this is if you will eventually collect measurements that do not contain any car data. In that case, then it would make sense to split CAR_MEASUREMENT into separate MEASUREMENT and CAR_DATA tables with MEASUREMENT containing the date, and CAR_DATA containing just the car-specific data.
See above. If you have a use case to query data for a specific car, then the normalized form can be more efficient. If not, then the additional INSERT overhead may not be worth it.
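One more note on the lookup-and-insert concern: many DBMSs let you fold the existence check into the insert itself, so there is no separate SELECT per row. A minimal sketch, assuming the table names above, ID as the primary key of CAR_CONSTANT, and PostgreSQL syntax (other systems have MERGE or INSERT IGNORE equivalents):

-- Insert the car's fixed attributes only if the ID is not already present.
INSERT INTO CAR_CONSTANT (ID, CAR, LOCATION)
VALUES (123, 'Toyota', 'New York')
ON CONFLICT (ID) DO NOTHING;

-- Then add the daily measurement; a foreign key from CAR_MEASUREMENT.ID
-- to CAR_CONSTANT.ID would reject orphan rows.
INSERT INTO CAR_MEASUREMENT (ID, DATE, FUEL, FUEL_USAGE, MILES, BATTERY)
VALUES (123, DATE '2021-01-01', 40.3, 3.6, 79321, 78);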
Related
All,
I am looking for some tips with regard to optimizing an ELT on a Snowflake fact table with approx. 15 billion rows.
We get approximately 30,000 rows every 35 minutes, like the ones below; we always get 3 Account Dimension Key values, i.e. Sales, COGS & MKT.
Finance_Key | Date_Key   | Department_Group_Key | Scenario_Key | Account_Key | Value | IsCurrent
--------------------------------------------------------------------------------------------------
001         | 2019-01-01 | 001                  | 0012         | SALES_001   | 100   | Y
001         | 2019-01-01 | 001                  | 0012         | COGS_001    | 300   | Y
001         | 2019-01-01 | 001                  | 0012         | MKT_001     | 200   | Y
This is then PIVOTED based on Finance Key and Date Key and loaded into another table for reporting, like the one below
Finance_Key | Date_Key   | Department_Group_Key | Scenario_Key | SALES | COGS | MKT | IsCurrent
-------------------------------------------------------------------------------------------------
001         | 2019-01-01 | 001                  | 0012         | 100   | 300  | 200 | Y
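The pivot step is roughly equivalent to conditional aggregation like this (a sketch; STG_FINANCE stands in for the actual source table name, and IsCurrent handling is omitted):

-- Collapse the three account rows per key into SALES/COGS/MKT columns.
SELECT
    Finance_Key,
    Date_Key,
    Department_Group_Key,
    Scenario_Key,
    MAX(CASE WHEN Account_Key LIKE 'SALES%' THEN Value END) AS SALES,
    MAX(CASE WHEN Account_Key LIKE 'COGS%'  THEN Value END) AS COGS,
    MAX(CASE WHEN Account_Key LIKE 'MKT%'   THEN Value END) AS MKT
FROM STG_FINANCE
GROUP BY Finance_Key, Date_Key, Department_Group_Key, Scenario_Key;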
At times an adjustment is made for just 1 Account Key.
Finance_Key | Date_Key   | Department_Group_Key | Scenario_Key | Account_Key | Value
--------------------------------------------------------------------------------------
001         | 2019-01-01 | 001                  | 0012         | SALES_001   | 50
Hence we have to do this
Finance_Key | Date_Key   | Department_Group_Key | Scenario_Key | Account_Key | Value | IsCurrent
--------------------------------------------------------------------------------------------------
001         | 2019-01-01 | 001                  | 0012         | SALES_001   | 100   | X
001         | 2019-01-01 | 001                  | 0012         | COGS_001    | 300   | Y
001         | 2019-01-01 | 001                  | 0012         | MKT_001     | 200   | Y
001         | 2019-01-01 | 001                  | 0012         | SALES_001   | 50    | Y
And the resulting value should be
Finance_Key | Date_Key   | Department_Group_Key | Scenario_Key | SALES | COGS | MKT
-------------------------------------------------------------------------------------
001         | 2019-01-01 | 001                  | 0012         | 50    | 300  | 200
However, my question is how to optimize the query that scans and updates the pivoted table of approx. 15 billion rows in Snowflake.
This is mostly about optimizing the reads and writes.
Any pointers?
Thanks
So a longer form answer:
I am going to assume that the Date_Key values being far in the past is just a feature of the example data, because if you have 15B rows, and every 35 minutes you have ~10K (30K / 3) updates to apply, and they span the whole date range of your data, then there is very little you can do to optimize it. Snowflake's Query Acceleration Service might help with the IO.
But the primary way to improve total processing time is to process less.
For example, in my old job we had IoT data whose messages could arrive up to two weeks late, and we de-duplicated all messages on load (which is effectively a similar process) as part of our pipeline error handling. We found that handling batches with a minimum date of two weeks back against the full message tables (which also had billions of rows) spent most of the time reading/writing those tables. By altering the ingestion to sideline messages older than a couple of hours and deferring their processing until "midnight", we could get all the timely points in a batch processed in under 5 minutes (we did a large amount of other processing in that interval) for every batch, and we could turn the warehouse cluster size down outside core data hours and use the saved credits to run a bigger instance at the midnight "catch-up" to bring all the day's sidelined data on board. This eventually consistent approach worked very well for our application.
So if your destination table is well clustered, so that the reads/writes touch just a fraction of the data, then this should be less of a problem for you. But given that you are asking the question, I assume this is not the case.
If the table's natural clustering is unaligned with the loading data, is that because the destination table needs to be a different shape for the critical reads it is serving? At that point you have to decide which is more cost/time sensitive. Another option is to cluster the destination table by date (again assuming the fresh data covers a small window of time) and have a materialized view/table on top of it with a different sort order, so that the rewrite is done for you by Snowflake. This is not a free lunch, but it might allow faster upsert times and faster read performance, assuming both are time sensitive.
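To make "process less" concrete, one common pattern is to pre-pivot just the current batch and MERGE it in, restricting the target by the batch's date range so Snowflake can prune micro-partitions (this assumes the table is clustered by Date_Key). A rough sketch: STG_BATCH and FACT_FINANCE_PIVOT are placeholder names, the literal date stands in for the batch's minimum Date_Key, and it overwrites values in place rather than versioning with IsCurrent:

-- Pivot the incoming batch and upsert it into the reporting table.
MERGE INTO FACT_FINANCE_PIVOT t
USING (
    SELECT Finance_Key, Date_Key, Department_Group_Key, Scenario_Key,
           MAX(CASE WHEN Account_Key LIKE 'SALES%' THEN Value END) AS SALES,
           MAX(CASE WHEN Account_Key LIKE 'COGS%'  THEN Value END) AS COGS,
           MAX(CASE WHEN Account_Key LIKE 'MKT%'   THEN Value END) AS MKT
    FROM STG_BATCH
    GROUP BY Finance_Key, Date_Key, Department_Group_Key, Scenario_Key
) s
ON  t.Finance_Key          = s.Finance_Key
AND t.Date_Key             = s.Date_Key
AND t.Department_Group_Key = s.Department_Group_Key
AND t.Scenario_Key         = s.Scenario_Key
AND t.Date_Key >= '2019-01-01'   -- placeholder: MIN(Date_Key) of the batch, to help pruning
WHEN MATCHED THEN UPDATE SET
    SALES = COALESCE(s.SALES, t.SALES),   -- keep the old value if the batch
    COGS  = COALESCE(s.COGS,  t.COGS),    -- only adjusts one account
    MKT   = COALESCE(s.MKT,   t.MKT)
WHEN NOT MATCHED THEN INSERT
    (Finance_Key, Date_Key, Department_Group_Key, Scenario_Key, SALES, COGS, MKT)
VALUES
    (s.Finance_Key, s.Date_Key, s.Department_Group_Key, s.Scenario_Key, s.SALES, s.COGS, s.MKT);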
I'm working on a shop floor Equipment Data Collection project which aims to analyze production orders historically and with real-time data (HMI close to the operator).
Current database status:
Data is extracted from different equipment (with different protocols) and placed in an SQL server with the following structure:
PROCESS table: the main table. Whenever a batch (production unit) is started, a ProcessID is created along with various information:
ProcessID | Room  | EquipmentID | BatchID | Program | Operator | Start               | End
-----------------------------------------------------------------------------------------------------------
209486    | Room1 | 1010        | 985985  | RecipeA | Jim      | 2022.04.05 13:58:02 | 2022.04.05 15:58:02
Equipment family tables: For each equipment family (mixers, ovens, etc.) a table is created in which its sensor values (humidity, temperature, speed, etc.) are collected every 5 seconds. Here is an example for the BatchID above, where ProcessID = MixID in the equipment family table dbo.Mixer:
MixID  | EquipmentID | Humidity | Temperature | Speed | DateTime
------------------------------------------------------------------------
209486 | 1010        | 2.5      | 70          | 250   | 2022.04.05 13:58:02
209486 | 1010        | 2.6      | 73          | 215   | 2022.04.05 13:58:07
....   | ....        | ....     | ....        | ....  | ....
So, the database is structured with a main PROCESS table and several equipment family tables that are created as the project develops (dbo.Mixer, dbo.Oven, etc.).
We have the following data flow: SQL Server (source) - RDS Server - Power BI.
Problems with the current status & doubts
As the project develops, 2 problems arise:
MANUAL WORK: Adding new tables and columns (to existing tables) at the source requires manual changes in the RDS server and in Power BI. Every time communication with a new piece of equipment is developed and it belongs to a new equipment family, a new table has to be created; the same happens whenever we need to introduce a new sensor into an existing table, since the sensors are columns of that table.
REAL-TIME DATA: The current architecture makes it difficult to implement real-time dashboarding.
Given these two big problems, we are currently thinking that the new system architecture should be:
SQLServer(source) - DataLake - Snowflake(DataWarehouse) - Power BI (or any application).
However, this won't solve the manual work described in 1). For that problem we are looking to restructure the source into just 2 tables: Process (unchanged) and a new Sensors table. This new table would be a narrow but very tall table with billions of timestamped readings from all the different equipment sensors (over 60 pieces of equipment), structured as follows:
dbo.Sensors:
ProcessId | EquipmentID | SensorID | SensorValue | DateTime
---------------------------------------------------------------------
209486    | 1010        | 1        | 2.5         | 2022.04.05 13:58:02
209486    | 1010        | 2        | 70          | 2022.04.05 13:58:02
209486    | 1010        | 3        | 250         | 2022.04.05 13:58:02
with a corresponding Sensor dimension table (which could be created in the data warehouse):
SensorID | EquipmentID | SensorName  | SensorUnit
--------------------------------------------------
1        | 1010        | Humidity    | %
2        | 1010        | Temperature | °C
3        | 1010        | Speed       | rpm
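For what it's worth, a wide per-equipment-family view could still be rebuilt from the tall table with a join and conditional aggregation, so adding a sensor would not change any table's shape. A sketch, assuming the columns above and a dimension table named dbo.SensorDim:

-- Rebuild a mixer-style wide view from the tall dbo.Sensors table.
SELECT
    s.ProcessId,
    s.EquipmentID,
    s.DateTime,
    MAX(CASE WHEN d.SensorName = 'Humidity'    THEN s.SensorValue END) AS Humidity,
    MAX(CASE WHEN d.SensorName = 'Temperature' THEN s.SensorValue END) AS Temperature,
    MAX(CASE WHEN d.SensorName = 'Speed'       THEN s.SensorValue END) AS Speed
FROM dbo.Sensors s
JOIN dbo.SensorDim d
  ON d.SensorID = s.SensorID
 AND d.EquipmentID = s.EquipmentID
GROUP BY s.ProcessId, s.EquipmentID, s.DateTime;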
So, would it be better to restructure the source and create this giant tall table rather than continue with the current structure? At least it would solve the problem of adding new tables or new columns.
On the other hand, the size of this table will be enormous, given that more and more equipment and sensors are continually being added.
Hoping someone might point us in the right direction.
Date       | State      | City     | Zip  | Water  | Weight
-------------------------------------------------------------
01/01/2016 | Arizona    | Chandler | 1011 | 10 ltr | 40 kg
01/04/2016 | Arizona    | Mesa     | 1012 | 20 ltr | 50 kg
06/05/2015 | Washington | Spokane  | 1013 | 30 ltr | 44 kg
06/08/2015 | Washington | Spokane  | 1013 | 30 ltr | 44 kg
What I want are complex queries: for example, the average water and weight for a given city, state or zip over a date range or month, or by any field or all fields.
I am not sure how to go about this. I have read about MapReduce, but I can't see how I would get the above output.
If you have links to examples that cover the above scenarios, that would also help.
Thanks in advance
So first we need to model your structured data in JSON. Something like this would work:
{
  "date": "2016-01-01",
  "location": "Arizona Chandler",
  "pressure": 1101,
  "water": 10,
  "weight": 40
}
Here's your data in a Cloudant database:
https://reader.cloudant.com/so37613808/_all_docs?include_docs=true
Next we'd need to create a MapReduce view to aggregate a specific field by date. A map function that creates an index whose key is the date and whose value is the water would look like this:
function(doc) {
  emit(doc.date, doc.water);
}
Every key/value pair emitted from the map function is added to an index which can be queried later in its entirety or by a range of keys (keys which in this case represent a date).
And if an average is required we would use the built-in _stats reducer. The Map and Reduce portions are expressed in a Design Document like this one: https://reader.cloudant.com/so37613808/_design/aggregate
The subsequent index allows us to get an aggregate across the whole data set with:
https://reader.cloudant.com/so37613808/_design/aggregate/_view/waterbydate
Dividing the sum by the count gives us an average.
We can use the same index to provide data grouped by keys too:
https://reader.cloudant.com/so37613808/_design/aggregate/_view/waterbydate?group=true
Or we can select a portion of the data by supplying startkey and endkey parameters:
https://reader.cloudant.com/so37613808/_design/aggregate/_view/waterbydate?startkey=%222016-01-01%22&endkey=%222016-06-03%22
See https://docs.cloudant.com/creating_views.html for more details.
I am building an OLAP cube in MS SQL Server BI Studio. I have two main tables that contain my measures and dimensions.
One table contains
Date | Keywords | Measure1
where date-keyword is the composite key.
The other table looks like
Date | Keyword | Product | Measure2 | Measure3
where date-keyword-product is the composite key.
My problem is that there can be a one-to-many relationship between date-keyword in the first table and date-keyword in the second table (as the second table has data broken down by product).
I want to be able to make queries that look something like this when filtered for a given Keyword:
                              Measure1   Measure2   Measure3
============================================================
Tuesday, January 01 2013            23         19         18
============================================================
Bike                                23
Car                                 23         16         13
Motorcycle                          23
Caravan                             23          2          4
Van                                 23          1          1
I've created dimensions for the Date and ProductType but I'm having problems creating the dimension for the Keywords. I can create a Keyword dimension that affects the measures from the second table but not the first.
Can anyone point me to any good tutorials for doing this sort of thing?
Turns out the first table had one row with all null values (a weird side effect of uploading an Excel file straight into the MS SQL Server DB). Because the value that the cube was trying to apply the dimension to was null in this one row, the whole cube build and deploy failed with no useful error messages! Grr.
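For anyone hitting the same thing, a quick sanity check before processing the cube is to look for such blank rows. A sketch in T-SQL; KeywordTable stands in for whatever the first table is actually called:

-- Find rows where every key column is NULL (e.g. a blank Excel row).
SELECT *
FROM KeywordTable
WHERE [Date] IS NULL
  AND Keywords IS NULL;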
I've seen this sort of problem a few times and am trying to decide the best way of storing ranges in a non-overlapping manner. For example, when scheduling some sort of resource that only one person can use at a time. Mostly what I've seen is something like this:
PERSON ROOM START_TIME END_TIME
Col. Mustard Library 08:00 10:00
Prof. Plum Library 10:00 12:00
What's the best way of preventing new entries from overlapping an existing schedule, say if Miss Scarlet wants to reserve the library from 11:00 to 11:30? An inline constraint won't work, and I don't think this can easily be done in a trigger. A procedure that handles all inserts and first looks for an existing conflict in the table?
Secondly, what's the best way to handle concurrency issues? Say Miss Scarlet wants the library from 13:00 to 15:00 and Mrs. White wants it from 14:00 to 16:00. The procedure from (1) would find both these schedules acceptable, but clearly taken together, they are not. The only thing I can think of is manual locking of the table or some sort of mutex.
What's a good primary key for the table above, (room, start_time)?
A fast way for cases where you have fixed time ranges: you can store all ranges in a separate table and then simply link it to the "reserves" table. This does the trick for fixed ranges; for example, if the library can only be reserved in 30-minute intervals and working hours are from 8am to 8pm, just 24 records are needed.
--Person table---------------
ID  PERSON        ROOM
1   Col. Mustard  Library
2   Prof. Plum    Library

--Timeshift table------------
ID  START_TIME  END_TIME
1   08:00       08:30
2   08:30       09:00
....
24  19:30       20:00

--Occupy table---------------
DATE              TIMESHIFT  PERSON
(TRUNC(SYSDATE))  (TS_ID)    (P_ID)
08/12/2012        4          1
08/12/2012        5          1
08/12/2012        9          2
08/12/2012        10         2
Now you make (DATE, TIMESHIFT) a PK or UK and your database-driven check is ready. It will be fast, with little data overhead. However, using the same approach for per-second granularity will not be that effective.
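A minimal sketch of that table with the unique key, assuming Oracle-style DDL and the toy tables above (if more than one room is tracked, the room would need to be part of the key too):

-- One row per (day, timeslot); the unique key is the database-driven check.
CREATE TABLE occupy (
    occupy_date   DATE    NOT NULL,   -- stored as TRUNC(SYSDATE)
    timeshift_id  NUMBER  NOT NULL,
    person_id     NUMBER  NOT NULL,
    CONSTRAINT occupy_uk    UNIQUE (occupy_date, timeshift_id),
    CONSTRAINT occupy_ts_fk FOREIGN KEY (timeshift_id) REFERENCES timeshift (id),
    CONSTRAINT occupy_p_fk  FOREIGN KEY (person_id)    REFERENCES person (id)
);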
A more universal but more complex way is to let some procedure (or trigger) check whether your range is occupied or not; it will have to check against all current records.
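For that general case the procedure also has to deal with the concurrency issue from the question, typically by serializing reservations per room before checking for overlaps. A rough Oracle PL/SQL sketch, assuming a BOOKINGS table with the columns from the question and a parent ROOMS table whose row acts as the per-room mutex:

-- Lock the room's parent row, check for overlaps, then insert.
DECLARE
    v_room      rooms.room_name%TYPE;
    v_conflicts NUMBER;
    v_start     DATE := TO_DATE('2012-12-08 11:00', 'YYYY-MM-DD HH24:MI');
    v_end       DATE := TO_DATE('2012-12-08 11:30', 'YYYY-MM-DD HH24:MI');
BEGIN
    -- Serialize reservations for this room: other sessions block here until we commit.
    SELECT room_name INTO v_room
      FROM rooms
     WHERE room_name = 'Library'
       FOR UPDATE;

    -- An existing booking overlaps if it starts before the new end
    -- and ends after the new start.
    SELECT COUNT(*) INTO v_conflicts
      FROM bookings
     WHERE room = 'Library'
       AND start_time < v_end
       AND end_time   > v_start;

    IF v_conflicts = 0 THEN
        INSERT INTO bookings (person, room, start_time, end_time)
        VALUES ('Miss Scarlet', 'Library', v_start, v_end);
    END IF;

    COMMIT;
END;
/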