Okay, context:
I have a system that needs to produce monthly, weekly, and daily reports.
Architecture A:
3 tables:
1) Monthly reports
2) Weekly reports
3) Daily reports
Architecture B:
1 table:
1) Reports: With extra column report_type, with values: "monthly", "weekly", "daily".
Which one would be more performant and why?
The common method I use for this is two tables, similar to your approach B. One table would be as you describe, with the report data and an extra column, but instead of hard-coding the values, that column would hold an id pointing to a reference table. The reference table would then hold the names of those values. This setup lets you easily reference the intervals from other tables should you need that later on, and it also makes name updates much more efficient: changing the name of, say, "Monthly" to "Month" would require one update here, versus n updates if you stored the string in your report table.
Sample structure:

Reports
report_data | interval_id
------------+------------
xxxx        | 1

Intervals
interval_id | name
------------+--------
1           | Monthly
As a side note, you would rarely want to take your first approach, approach A, because of how it limits changing the interval type of entered data. If all of a sudden you want to change half of your Daily entries to Weekly entries, you need to do n/2 deletes and n/2 inserts, which is fairly costly, especially once you start introducing indexes. In general, tables should describe types of data (i.e. Reports) and columns should describe that type (i.e. how often a report happens).
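A minimal sketch of that two-table setup, assuming generic SQL and purely illustrative table, column, and value names:

-- Reference table holding the interval names
CREATE TABLE report_interval (
    interval_id INT PRIMARY KEY,
    name        VARCHAR(20) NOT NULL   -- 'Monthly', 'Weekly', 'Daily'
);

-- Report table referencing the interval by id
CREATE TABLE report (
    report_id   INT PRIMARY KEY,
    report_data VARCHAR(4000),
    interval_id INT NOT NULL REFERENCES report_interval (interval_id)
);

-- Renaming 'Monthly' to 'Month' is then a single-row update:
UPDATE report_interval SET name = 'Month' WHERE name = 'Monthly';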
I have a bunch of tables which refer to some number of other tables (zero, one, two or more).
My example tables might contain following columns:
Id | StatementTable1Id | StatementTable2Id | Value
where StatementTable1 will contain following columns:
Id | Name | Label
I wish to get all possible combinations and join all of them.
I found this link very useful (a query which produces information about dependencies).
I would imagine my code as follows:
Prepare a list of tables which I wish to query.
Run the query from that link for all my tables and save the results into a temporary table.
Check the maximum number of dependent tables. Prepare a query template - for example, if the maximum number of dependent tables is two:
SELECT
    Id,
    '%Table1Name%' AS Table1Name,
    '%StatementLabelTable1%' AS StatementLabelTable1,
    '%Table2Name%' AS Table2Name,
    '%StatementLabelTable2%' AS StatementLabelTable2,
    Value
Use a cursor - for each dependent table, replace the appropriate part of the template with the dependent table's name and the labels of the elements within it.
When all dependent tables have been used, replace all remaining placeholders with empty strings.
Add "UNION ALL" and proceed to the next table.
Run query
Could you tell me if there's any easier or better way?
What you've listed there sounds like what you'll need to do if you don't know the column details ahead of time. There's likely going to be some trial and error to get the details correct, but it's a good plan to start with.
That being said, why on earth would you want to do such a thing? It sounds like you need to narrow down your requirements to the data that is actually needed. Otherwise, as you add data to your database, this query and the resulting data set will quickly become unwieldy (these are the kinds of data sets that turn into daily "door-stop reports": no one uses them, but no one remembers why the report was created, so they keep running it and just use it as a door-stop).
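For what it's worth, a rough T-SQL sketch of the placeholder-replacement step might look like the following; the table name, labels, and single-pass substitution are assumptions for illustration, not a working solution:

-- Illustrative only: fill the known placeholders, blank out the rest,
-- then execute the assembled dynamic SQL.
DECLARE @sql NVARCHAR(MAX) = N'
SELECT Id,
       ''%Table1Name%''           AS Table1Name,
       ''%StatementLabelTable1%'' AS StatementLabelTable1,
       ''%Table2Name%''           AS Table2Name,
       ''%StatementLabelTable2%'' AS StatementLabelTable2,
       Value
FROM MyExampleTable';

SET @sql = REPLACE(@sql, '%Table1Name%', 'StatementTable1');
SET @sql = REPLACE(@sql, '%StatementLabelTable1%', 'SomeLabel');
-- No second dependent table in this pass, so blank those columns out:
SET @sql = REPLACE(@sql, '%Table2Name%', '');
SET @sql = REPLACE(@sql, '%StatementLabelTable2%', '');

EXEC sp_executesql @sql;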
This is a theoretical question which I ask because of a request that has come my way recently. I own the support of a master operational data store which maintains a set of data tables (with the master data), along with a set of lookup tables (which contain a list of reference codes along with their descriptions). There has recently been a push from the downstream applications to unite the two structures (data and lookup values) logically in the presentation layer, so that it is easier for them to find out whether there have been updates in the overall data.
While the request is understandable, my first thought is that it should be implemented at the interface level rather than at the source. Combining the two tables logically (last_update_date) at the ODS level is close to de-normalizing the data and seems contrary to the idea of keeping lookups and data separate.
That said, I cannot think of any reason why it should not be done at the ODS level apart from the fact that it does not "seem" right... Does anyone have any thoughts on why such an approach should or should not be followed?
I am listing an example here for simplicity's sake.
Data table
ID | Name | Emp_typ_cd | Last_update_date
---+------+------------+-----------------
1  | X    | E1         | 2014-08-01
2  | Y    | E2         | 2014-08-01
Code table
Emp_typ_cd | Emp_typ_desc | Last_Update_date
-----------+--------------+-----------------
E1         | Employee_1   | 2014-08-23
E2         | Employee_2   | 2013-09-01
The downstream request is to represent the data as
Data view
ID | Name | Emp_typ_cd | Last_update_date
---+------+------------+-----------------
1  | X    | E1         | 2014-08-23
2  | Y    | E2         | 2014-08-01
or
Data view
ID | Name | Emp_typ_cd | Emp_typ_desc | Last_update_date
---+------+------------+--------------+-----------------
1  | X    | E1         | Employee_1   | 2014-08-23
2  | Y    | E2         | Employee_2   | 2014-08-01
You are correct, it is denormalizing the database because someone wants to see the data in a certain way. The side effects, as you know, are that you are duplicating data, reducing flexibility, increasing table size, storing dissimilar objects together, etc. You are also correct that their problem should be solved somewhere, or somehow, else. They won't get what they want if they change the database the way they want to change it. If they want to make it "easier for them to find out if there have been updates in the overall data" but they duplicate massive amounts of it, they're just opening themselves up to errors. In your example, the Emp_typ_cd update date must be changed for every employee with that employee type code. A good update statement will do that, but still, instead of updating a single row in the lookup table you're updating every single employee that has that employee type.
We use lookup tables all the time. We can add a new value to a lookup table, add employees to the database with an fk to that new attribute, and any report that joins on that table now has the ID, Value, Sort Order, etc. Let's say we add 'Veteran' to lu_Work_Experience. We add an employee with the Veteran fk_Id, and now any existing query that joins on lu_Work_Experience has that value. Those queries can sort Work Experience alphabetically or by our pre-defined sort order.
There is a valid reason for flattening your data structure though, and that is speed. If you're running a very large report, it will be faster with no joins (and good indexing). If the business knows it's going to run a very large report many times and is worried about end-user wait times, then it is a good idea to build a single table for that one report. We do it all the time for calculated measures. If we know that a certain analytic report will have a ton of aggregation and joins, we pre-aggregate the data into the data store. That being said, we don't do that very often in SQL because we use cubes for analytics.
So why use lookup tables in a database? Logical separation of data: an employee has an employee code, but it does NOT have a date of when that employee code was updated. To reduce duplicate data. To minimize design complexity. To avoid building a table for a specific report and then having to build a different table for a different report, even if it has similar data.
Anyway, the rest of my argument would just be facts from the Database normalization Wikipedia page, so I'll skip it.
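If the downstream consumers really just need the combined picture, one way to give it to them without touching the ODS tables is a view in the interface/presentation layer. A sketch, assuming the sample tables above are named data_table and code_table:

-- Sketch only: exposes the combined shape downstream wants,
-- taking the later of the two last-update dates for each row.
CREATE VIEW v_employee_data AS
SELECT d.ID,
       d.Name,
       d.Emp_typ_cd,
       c.Emp_typ_desc,
       CASE WHEN c.Last_Update_date > d.Last_update_date
            THEN c.Last_Update_date
            ELSE d.Last_update_date
       END AS Last_update_date
FROM data_table d
JOIN code_table c ON c.Emp_typ_cd = d.Emp_typ_cd;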
We have a SQL Server table called Deals, which is a general table including all financial trade entries; the number of rows can be hundreds of thousands or even millions. Each row in Deals has a column called Product, which is of type varchar and corresponds to a trade type: bond trade, equity trade, future or option trade; some are simple and some are more complex. Now we want to search the whole table to get all deals which are not matured. Depending on the type of trade, the term "matured" has different meanings. For example, for Product type "BOND" and some others, we check that the value of the column MaturityDate in the Deals table is larger than the given date, i.e., MaturityDate > #DATE. And for the deal type "FUTURE", we require the column ValueDate in the same table to be larger than the given date AND MaturityDate to be larger than the given date. So my first draft of the query would look like this:
SELECT * FROM Deals
WHERE
(
(Product IN ('ProdA', 'ProdB') AND MaturityDate > #CD) OR
(Product IN ('ProdC', 'ProdD') AND (MaturityDate > #CD AND ValueDate > #CD)) OR
(Product IN ('ProdE', 'ProdF', SOME_OTHER_PRODUCT_TYPE) AND (MaturityDate > #CD AND ValueDate > #CD) AND SOME_OTHER_CRITERIA) OR
...
)
We have over 30 Product values and at least 8 sets of criteria (the most common criteria set applies to maybe 7 or 8 Product values, and the least common criteria sets apply to only one or two Product values). The column Product is indexed, and some columns (but not all) used in the criteria, for example MaturityDate, are also indexed. Most of the criteria sets only check column values in the Deals table, but there are indeed some criteria sets which involve a JOIN to some other tables and check the values there as well.
So now my question is: how can I optimize such a query (as a SW developer, I am really not an expert in DB and seldom write data access code)? I read somewhere that it might be a good idea to replace the OR clauses with UNION. However, when I executed queries using UNION and using OR, the former took 5 seconds on my development machine (with fewer than 100,000 rows in the Deals table) while the latter took 3 seconds. And like I said, I have limited knowledge, so I don't know if there is some other way to optimize such a query. Could anyone share some experience? Thanks!
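For reference, the UNION ALL variant mentioned above would have roughly this shape; the product names, the #CD placeholder, and the extra criteria are copied from the question and are not real values:

-- Each branch carries only the criteria for its own product group.
SELECT * FROM Deals
WHERE Product IN ('ProdA', 'ProdB')
  AND MaturityDate > #CD
UNION ALL
SELECT * FROM Deals
WHERE Product IN ('ProdC', 'ProdD')
  AND MaturityDate > #CD
  AND ValueDate > #CD
UNION ALL
SELECT * FROM Deals
WHERE Product IN ('ProdE', 'ProdF')   -- plus the other product types
  AND MaturityDate > #CD
  AND ValueDate > #CD
  -- AND <some other criteria>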
I'm writing an application that stores different types of records by user and day. These records are divided in categories.
When designing the database, we create a table User and then for each record type we create a table RecordType and a table Record.
Example:
To store data related to user events we have the following tables:
Event          EventType
-----------    ---------
UserId         Id
EventTypeId    Name
Value
Day
Our boss pointed out (with some reason) that we're gonna store a lot of rows (Users * Days) and suggested an idea that seems a little crazy to me: create a table with a column for each day of the year, like so:
EventTypeId | UserId | Year | 1 | 2 | 3 | 4 | ... | 365 | 366
This way we only have 1 row per user per year, but we're gonna get pretty big rows.
Since most ORMs (we're going with rails3 for this project) use select * to get the database records, aren't we optimizing something to "deoptimize" another?
What are the community's thoughts about this?
This is a violation of First Normal Form. It's an example of repeating groups across columns.
Example of why this is bad: Write a query to find which day a given event occurred. You'll need a WHERE clause with 366 terms, separated by OR. This is tedious to write, and impossible to index.
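To make that concrete, asking "on which day did a given value occur?" against the wide table would look something like this; the table name EventByDay and the columns day1 through day366 are hypothetical stand-ins for the layout above:

-- With one column per day, every day column has to be checked individually.
SELECT EventTypeId, UserId, Year
FROM EventByDay
WHERE day1 = :value OR day2 = :value OR day3 = :value
   -- ...and so on, up to day366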
Relational databases are designed to work well even if you have a lot of rows. Say you have 10000 users, and on average every user generates 10 events every day. After 10 years, you will have 10000*366*10*10 rows, or 366,000,000 rows. That's a fairly large database, but not uncommon.
If you design your indexes carefully to match the queries you run against this data, you should be able to have good performance for a long time. You should also have a strategy for partitioning or archiving old data.
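As a sketch of what matching indexes to your queries might mean for the normalized Event table from the question (index names and column order are assumptions that depend on the queries you actually run):

-- Typical access paths: one user's events over a range of days,
-- and all events of a given type on a given day.
CREATE INDEX ix_event_user_day ON Event (UserId, Day, EventTypeId);
CREATE INDEX ix_event_type_day ON Event (EventTypeId, Day);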
That breaks the database normal form principles.
http://databases.about.com/od/specificproducts/a/normalization.htm
If it's applicable, why don't you replace the day columns with a DateTime column in your Event table with a default value (GetDate(), if we are talking about SQL Server)?
Then you could group by date...
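A sketch of that suggestion, assuming SQL Server syntax and an Event table like the one in the question:

-- Store an event timestamp with a default instead of a column per day...
ALTER TABLE Event ADD EventDate DATETIME NOT NULL DEFAULT (GETDATE());

-- ...then aggregate per calendar day when you need daily figures.
SELECT UserId, EventTypeId, CAST(EventDate AS DATE) AS EventDay, COUNT(*) AS Events
FROM Event
GROUP BY UserId, EventTypeId, CAST(EventDate AS DATE);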
I wouldn't do it. As long as you take the time to index the table appropriately, the database server should work well with tables that have lots of rows. If it's significantly slowing down your database performance, I'd start by making sure your queries aren't forcing a lot of full table scans.
Some other potential problems I see:
It probably will hurt ORM performance.
It's going to create maintainability problems down the road. You probably don't want to be working with objects that have 366 fields, one for every day of the year, so there's probably going to be a lot of boilerplate glue code to keep track of them.
Any query that wants to search against a range of dates is going to be an unholy mess.
It could be even more wasteful of space. These rows are big, and the number of rows you have to create for each customer is going to be the sum of the maximum number of times each different kind of event happened in a single day. Unless the rate at which all of these events happens is very constant and regular, those rows are likely to be mostly empty.
If anything, I'd suggest sharding the table based on some other column instead if you really do need to get the table size down. Perhaps by UserId or year?
I am developing a tool which may have more than a million rows of data to fill in.
Currently I have designed a single table with 36 columns. My question is: do I need to divide these into multiple tables, or keep a single one?
If single, what are the advantages and disadvantages?
If multiple, what are the advantages and disadvantages?
And what engine should I use for speed?
My concern is a large database which will have at least 50,000 queries per day.
Any help?
Yes, you should normalize your database. A general rule of thumb is that if a column that isn't a foreign key contains duplicate values, the table should be normalized.
Normalization involves splitting your database into tables, and helps to:
Avoid modification anomalies.
Minimize impact of changes to the data structure.
Make the data model more informative.
There is plenty of information about normalization on Wikipedia.
If you have a serious amount of data and don't normalize, you will eventually come to a point where you will need to redesign your database, and this is incredibly hard to do retrospectively, as it will involve not only changing any code that accesses the database, but also migrating all existing data to the new design.
There are cases where it might be better to avoid normalization for performance reasons, but you should have a good understanding of normalization before making this decision.
First and foremost, ask yourself whether you are repeating fields or attributes of fields. Does your one table contain relationships or attributes that should be separated? Follow third normal form... we need more info to help, but generally speaking, one table with thirty-six columns smells like a db fart.
If you want to store a million rows of the same kind, go for it. Any decent database will cope even with much bigger tables.
Design your database to best fit the data (as seen from your application), get it up, and optimize later. You will probably find that performance is not a problem.
You should model your database according to the data you want to store. This is called "normalization": essentially, each piece of information should only be stored once; otherwise a table cell should point to another row or table containing the value. If, for example, you have a table containing phone numbers, and one column contains the area code, you will likely have more than one phone number with the same value in that column. Once this happens, you should set up a new table for area codes and link to its entries by referencing the primary key of the row the desired area code is stored in.
So instead of
id | area code | number
---+-----------+---------
1 | 510 | 555-1234
2 | 510 | 555-1235
3 | 215 | 555-1236
4 | 215 | 555-1237
you would have
id | area code
---+-----------
1  | 510
2  | 215

id | number   | area code
---+----------+-----------
1  | 555-1234 | 1
2  | 555-1235 | 1
3  | 555-1236 | 2
4  | 555-1237 | 2
The more occurrences of the same value you have, the more likely you are to save memory and get quicker performance if you organize your data this way, especially when you're handling string values or binary data. Also, if an area code changes, all you need to do is update a single cell instead of having to perform an update operation on the whole table.
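In SQL, that layout would look roughly like this (a sketch with illustrative names and types):

-- Area codes stored once, referenced by id.
CREATE TABLE area_code (
    id        INT PRIMARY KEY,
    area_code CHAR(3) NOT NULL
);

CREATE TABLE phone_number (
    id           INT PRIMARY KEY,
    number       VARCHAR(12) NOT NULL,
    area_code_id INT NOT NULL REFERENCES area_code (id)
);

-- Changing an area code touches one row instead of every phone number:
UPDATE area_code SET area_code = '341' WHERE area_code = '510';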
Try this tutorial.
Correlation does not imply causation.
Just because shitloads of columns usually indicate a bad design, doesn't mean that a shitload of columns is a bad design.
If you have a normalized model, you store whatever number of columns you need in a single table.
It depends!
Does that one table contain a single 'entity'? i.e., are all 36 columns attributes of a single thing, or are there several 'things' mixed together?
If mixed, then you should normalise (separate into distinct entities with relationships between them). You should aim for at least Third Normal Form (3NF).
A best practice is to normalise as much as you can; if you later identify a performance problem, then denormalise as little as you can.