I'm wondering what are best practices around recording stocks and flows. Do you store only flows, and calculate stocks? Or do you store both?
It seems like the important things to persist are the flows (for example, in a bank database, the debits and credits to an account), and the stocks (funds remaining) can be calculated from them. But if there are lots of bank accounts, and I want a table of multiple bank accounts with funds remaining, then I would have to recalculate this amount for each one. This seems quite slow.
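To make the concern concrete, here is a minimal sketch of what I mean, with purely illustrative table and column names:

    -- Each row in account_transaction is one flow (a signed credit or debit).
    SELECT a.account_id,
           a.account_name,
           COALESCE(SUM(t.amount), 0) AS funds_remaining  -- the stock, recomputed from flows
    FROM account AS a
    LEFT JOIN account_transaction AS t
           ON t.account_id = a.account_id
    GROUP BY a.account_id, a.account_name;

As written, this has to re-aggregate the transactions for every account each time the list is displayed.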
On the other hand, I thought one of the main goals of databases is to not have duplicated data.
Is there a general practice around storing stocks? Should this be a calculated field, or rather inserted by program logic?
In database design, we have Derived Data:
A table can have derived columns, which are columns whose values are computed from the values of other table columns. If all columns are derived, it is said to be a derived table.
For example:
Student Age,
Account Balance,
Number of likes or up-votes on posts and comments (as on Stack Overflow).
In this case we have 2 options with pros and cons:
Option 1: Delete the derived data and calculate it on demand
pros: we do not have any Redundancy in our database design.
cons: we have to calculate the Aggregation data (Count, Sum, Avg, ...) in most queries.
Option 2: Store the derived data instead of calculating it
pros: we have all the Aggregation data ready and do not need to calculate it.
cons: we have a little Redundancy.
cons: we must update the derived data whenever the original data changes.
Therefore, we have a trade-off between option 1 and option 2. We should estimate their costs in our application and choose one of them; a minimal sketch of each option follows below.
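As a rough sketch of option 1 (the table and column names here are just illustrative), the aggregation is simply computed at query time, for example behind a view:

    -- Option 1: no stored balance; the derived value is computed on demand.
    CREATE VIEW account_balance AS
    SELECT account_id,
           SUM(amount) AS balance  -- Aggregation computed at query time
    FROM account_transaction
    GROUP BY account_id;

Every query that needs the balance pays the aggregation cost, but nothing ever has to be kept in sync.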
First: Redundancy
In my opinion, redundancy is not the decisive factor in this trade-off, because there is not that much duplicated data: we only add an extra field (an Integer or Big Integer, for example).
Second: Cost
I think we should compare the costs of these two options:
If we delete the derived data: the cost (performance) of retrieving the Aggregation data at query time.
If we use Aggregation columns: the cost of updating the Aggregation columns whenever the original data changes (see the sketch below).
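As a rough sketch of option 2 (PostgreSQL-style trigger syntax; the table and column names are assumptions), the derived column is stored and kept in sync on every write:

    -- Option 2: store the derived value and update it whenever a flow is inserted.
    CREATE FUNCTION apply_transaction() RETURNS trigger AS $$
    BEGIN
        UPDATE account
        SET balance = balance + NEW.amount
        WHERE account_id = NEW.account_id;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER trg_apply_transaction
    AFTER INSERT ON account_transaction
    FOR EACH ROW EXECUTE FUNCTION apply_transaction();

Reads become trivial (SELECT balance FROM account), but every insert now carries the extra update, and updates or deletes of transactions need corresponding handling.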
So, how can we estimate these costs in our application? There are some evaluation parameters that are directly related to the cost:
the number of records in the original table (and any secondary table);
the number of inserts into the original table in a specified period of time;
the number of updates and deletes in the original table in a specified period of time;
the number of selects from the original data (or secondary table) that involve the Aggregation data, in a specified period of time;
and many other parameters.
Finally, for more formal approaches, I recommend reading DAX Patterns.
Related
I have just started with Data Warehouse modeling and I need help modeling a problem.
Let me give you the facts: I work with flight data (aeronautical data),
so I have two linked Excel (fact) files, one file 'order' and the other 'services'.
the 'order' file sets out a summary of each flight (orderId, departure date, arrival date, City of departure, City of arrival, total amount collected, etc.)
the 'services' file lists the services provided by flight (orderId, service name, quantity, amount / qty, etc.)
with a 1-n relationship (order-services): each order has n services
I already see some dimensions (Time, Location, etc ...). However, I would like to know how I could design my Data Warehouse, knowing that I have two fact files linked together by orderId.
I have thought about it: the star and snowflake schemas do not work in my case (since I have two fact tables), and the galaxy schema requires dimensions in common. This is where I am stuck: should I make the order table a dimension rather than a fact table, or should I instead make the services table a dimension? But both are fact tables, so I am a little confused.
How can I design my model?
First of all, realize that in a star schema it is not a problem to have multiple fact tables that are connected - see the discussion here.
So the first draft will simply follow your two fact tables with their natively provided dimensions.
Order is a fact table in one context and, in another context, a dimension table for the service table.
Depending on your expected queries, you may find it useful to denormalize some dimensions of the order table into the service table, so the service table will have the departure date, arrival date, etc. dimensions defined directly.
This will be done at the load time in the ETL job.
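As a rough sketch of that ETL step (the table and column names are just assumptions), the order's dimension keys get copied onto each service row at load time:

    -- Denormalize the order's dimensions onto every service row of the service fact.
    INSERT INTO fact_service (order_id, service_name, quantity, amount,
                              departure_date_key, arrival_date_key,
                              departure_city_key, arrival_city_key)
    SELECT s.order_id, s.service_name, s.quantity, s.amount,
           o.departure_date_key, o.arrival_date_key,
           o.departure_city_key, o.arrival_city_key
    FROM staging_service AS s
    JOIN staging_order   AS o ON o.order_id = s.order_id;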
I would be somewhat careful about denormalizing the measures from order to service - that would basically eliminate the whole order table.
There will be no problem with the measure total amount collected if this is a redundant sum of the service amounts - you may safely get rid of it.
But you will certainly need measures such as the number of flights or the number of people transported - those measures are better defined in the order fact table; you cannot simply replicate them in the N service rows of each order.
A workaround is possible: define a main service for each order and define those measures only in that row, leaving the value NULL in the other rows. This could lead to unexpected results if queried naively, e.g. for the number of flights per service.
So basically I'd start with the two fact tables and denormalize some dimensions to the services if this would help to optimize the queries.
I would start with one fact table of Services. This fact would include all of the dimensions you might associate with the Order, including a degenerate dimension of OrderId.
Once this fact is built out and some information products are consuming it, return to the Order and re-evaluate it to see if there are any reporting needs which are not being served, or questions which are difficult to answer with the Services fact.
Joining two facts together is always a bad idea. Performance is terrible. You are always better off bringing the dimensions from, in your case, Order to Services. Don't forget to include the context of the dimension in the column name and a corresponding role-playing dimension view for this context, e.g. OrderArrivalCity, OrderDepartureDate, OrderDepartureTime.
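For instance, a role-playing view over a shared date dimension could look roughly like this (the generic DimDate table and its columns are assumed here):

    -- Role-playing view: the generic date dimension seen in its "order departure" role.
    CREATE VIEW DimOrderDepartureDate AS
    SELECT date_key       AS OrderDepartureDateKey,
           full_date      AS OrderDepartureDate,
           calendar_year  AS OrderDepartureYear,
           calendar_month AS OrderDepartureMonth
    FROM DimDate;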
You can also get yourself a copy of Ralph Kimball's The Data Warehouse Toolkit.
I am trying to figure out the fastest way to access data stored in a junction object. The example below is analogous to my problem, but with a different context, because the actual dataset I am dealing with is somewhat unintuitive in its relationships.
We have 3 classes: User, Product, and Rating. User has a many-to-many relationship to Product with Rating as the junction/'through' class.
The Rating object stores the answers to several questions which are integer ratings on a scale of 1-5 (Example questions: How is the quality of the Product, how is the value of the Product, how user-friendly is the Product). For simplification assume every User rates every Product they buy.
Now here is the calculation I want to perform: For a User, calculate the average rating of all the Products they have bought (that is, the average rating from all other Users, one of which will be from this User themself). Then we can tell the user "On average, you buy products rated 3/5 for value by all customers who bought that product".
The simple and slow way is just to iterate over all of a user's review objects. If we assume that each user has bought a small (<100) number of products, and each product has n ratings, this is O(100n) = O(n).
However, I could also do the following: on the Product class, keep a counter of the number of Ratings that selected each number (e.g. how many Users rated this product 3/5 for value). If you increment that counter every time a Product is rated, then computing the average for a given Product just requires checking the 5 counters for each Rating criterion.
Is this a valid technique? Is it commonly employed/is there a name for it? It seems intuitive to me, but I don't know enough about databases to tell whether there's some fundamental flaw or not.
This is normal. It is ultimately caching: encoding of state redundantly to benefit some patterns of usage at the expense of others. Of course it's also a complexification.
Just because the RDBMS data structure is relations doesn't mean you can't rearrange how you are encoding state from some straightforward form. Eg denormalization.
(Sometimes redundant designs (including ones like yours) are called "denormalized" when they are not actually the result of denormalization and the redundancy is not the kind that denormalization causes or normalization removes; see Cross Table Dependency/Constraint in SQL Database. Indeed, one could reasonably describe your case as involving normalization without preserving FDs (functional dependencies): start with a table holding a user's id and other columns, their ratings (a relation) and its counter. Then ratings functionally determines counter, since counter = select count(*) from ratings. Decompose into user etc. + counter, i.e. table User, and user + ratings, which ungroups to table Rating.)
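To make the redundancy concrete, here is a minimal sketch of the counter idea under assumed names: a per-product tally of how many ratings chose each value, updated alongside every new Rating:

    -- Redundant tally: one row per (product, criterion, rating value).
    CREATE TABLE product_rating_count (
        product_id INT NOT NULL,
        criterion  VARCHAR(20) NOT NULL,  -- e.g. 'value', 'quality', 'usability'
        rating     INT NOT NULL,          -- 1..5
        cnt        INT NOT NULL DEFAULT 0,
        PRIMARY KEY (product_id, criterion, rating)
    );

    -- On every new Rating, increment the matching tally in the same transaction, e.g.:
    UPDATE product_rating_count
    SET cnt = cnt + 1
    WHERE product_id = 42 AND criterion = 'value' AND rating = 3;

    -- The average for a product then reads at most 5 small rows per criterion:
    SELECT SUM(rating * cnt) * 1.0 / SUM(cnt) AS avg_value_rating
    FROM product_rating_count
    WHERE product_id = 42 AND criterion = 'value';

The invalidation burden is exactly that UPDATE: forget it anywhere ratings are written (or deleted) and the tally silently drifts from select count(*).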
Do you have a suggestion as to the best term to use when googling this?
A frequent comment by me: Google many clear, concise & specific phrasings of your question/problem/goal/desiderata with various subsets of terms & tags as you may discover them with & without your specific names (of variables/databases/tables/columns/constraints/etc). Eg 'when can i store a (sum OR total) redundantly in a database'. Human phrasing, not just keywords, seems to help. Your best bet may be along the lines of optimizing SQL database designs for performance. There are entire books ('amazon isbn'), some online ('pdf'). (But maybe mostly re queries). Investigate techniques relevant to warehousing, since an OLTP database acts as an input buffer to an OLAP database, and using SQL with big data. (Eg snapshot scheduling.)
PS My calling this "caching" (so does tag caching) is (typical of me) rather abstract, to the point where there are serious-jokes that everything in CS is caching. (Googling... "There are only two hard problems in Computer Science: cache invalidation and naming things."--Phil Karlton.) (Welcome to both.)
Let's say I have an ordering system which has a table size of around 50,000 rows and grows by about 100 rows a day. Also, say once an order is placed, I need to store metrics about that order for the next 30 days and report on those metrics on a daily basis (i.e. on day 2, this order had X activations and Y deactivations).
1 table called products, which holds the details of the product listing
1 table called orders, which holds the order data and product id
1 table called metrics, which holds a date field, an order id, and the associated metrics.
If I modeled this in a star schema format, I'd design like this:
FactOrders table, which has 30 days * X orders rows and stores all metadata around the orders, product id, and metrics (each row represents the metrics of a product on a particular day).
DimProducts table, which stores the product metadata
Does my performance gain from a huge FactOrders table only needing one join to get all relevant information outweigh the fact that I increased my table size by 30x and have an incredible amount of repeated data, vs. the truly normalized model that has one extra join but much smaller tables? Or am I designing this incorrectly for a star schema format?
Do not denormalize something this small to get rid of joins. Index properly instead. Joins are not bad, joins are good. Databases are designed to use them.
Denormalizing is risky for data integrity and may not even be faster due to the much wider size of the tables. In tables this tiny, it is very unlikely that denormalizing would help.
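For instance, indexes along these lines (the column names are assumed from your description) typically keep the joined daily report cheap without widening anything:

    -- Index the join/filter paths instead of denormalizing.
    CREATE INDEX ix_orders_product_id     ON orders  (product_id);
    CREATE INDEX ix_metrics_order_id_date ON metrics (order_id, metric_date);

    -- The daily report stays a small two-join query:
    SELECT p.product_name, m.metric_date, m.activations, m.deactivations
    FROM metrics  AS m
    JOIN orders   AS o ON o.order_id   = m.order_id
    JOIN products AS p ON p.product_id = o.product_id
    WHERE m.metric_date = '2024-01-15';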
I'm part of a team architecting an Operational Data Store (ODS) database, using SQL Server 2012, that will be used by some of our analysts to do predictive modeling. The ODS will contain manufacturing production data for a single product we make.
We will have hundreds of tables in the ODS. However, we will have a single core table that will contain critical information (lifecycle info) about each item manufactured (tens of millions each year). Our product is manufactured in a manufacturing plant and spends roughly 2.5 hours moving through various processes along a production line. We want to store various individual pieces of manufacturing and post-manufacturing information in this core table. An example piece of data might be the time the product entered a particular oven.
We have a decision to make on how to architect this table. We can create a wide table (many columns) or a narrow table where most columns become rows (property name/value pairs). I have never designed or worked with a table structure that is very narrow, where columns are treated as rows in the table.
I'd like some feedback on the pros and cons of a wide table vs. a narrow table. The following might be useful in helping with this discussion:
Number of products produced each year: Several million (each of these product instances will be a row in the core table)
Will this table be queried often: Yes, very often. It will be the parent to many child tables.
Potential number of columns (or row properties): 75 to 150+
If more information would be useful, I'd be glad to provide it.
Wide tables, static properties
You are tracking a single product through a well-defined manufacturing process. This data model sounds very static, and would lend itself to a wide table with many columns that are consistently populated with data.
Narrow tables, dynamic properties
If you had many, many products with lots of variation in the manufacturing process, it would be better suited for a narrow table, where you could easily add new properties for tracking.
Difficult to query a narrow table
However, even simple querying of a narrow table can be extremely difficult. For example, what if you needed to sort the data by a certain property when that property is shuffled amongst 100+ other property rows? How would you get all the rows together to form a single "record" and then sort the record groups within your result set?
Flat tables simpler to query
Depending on how you need to view and analyze the data, you may find yourself constantly using pivot or crosstab queries. If that's the case, then why not flatten out the storage table to begin with?
Or do both
Another option is to do both: Store the data narrowly, and use a transformation process to flatten it out for ease of reporting. That way you can quickly begin tracking new properties (just by adding rows), and then you can work on getting your reporting tables and transformation process updated to utilize the new data.
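As a sketch of that transformation (the property names here are placeholders), conditional aggregation flattens the narrow rows into one wide row per manufactured item:

    -- Flatten narrow property rows into one wide reporting row per item.
    INSERT INTO item_flat (item_id, oven_entry_time, oven_exit_time)
    SELECT item_id,
           MAX(CASE WHEN property_name = 'oven_entry_time' THEN property_value END) AS oven_entry_time,
           MAX(CASE WHEN property_name = 'oven_exit_time'  THEN property_value END) AS oven_exit_time
    FROM item_property
    GROUP BY item_id;

Adding a new tracked property is then just new rows in item_property, plus one more CASE branch when the reporting table catches up.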
How wide is too wide? Well, there can be several problems with wide tables.
One problem is that wide tables tend to deviate from the rules for normalizing data. This in turn can result in tricky update problems where you have to be careful to prevent the database from entering a self-contradictory state. There's no particular answer to how wide is too wide here. Just apply the normalization rules, and you'll end up decomposing the table.
However, some databases are not built with normalization as the guiding principle. In particular, consider fact tables in star schemas. There are times when some of the columns are determined by some subset of the FKs, and this can violate 3NF or even 2NF. Keeping fact tables skinny is still important in star schemas, but it's for a different reason, namely speed. Sometimes a fact table can be made skinnier by pushing data out to one of the dimension tables. Sometimes you can decompose a star into two or more related stars.
Your case sounds like the second reason given above, even though your design probably isn't a star schema. Still, star schema design principles might help you improve your design.