To use or not to use computed columns for performance and maintainability - sql-server

I have a table where I am storing a startingDate in a DateTime column.
Once I have the startingDate value, I am supposed to calculate the
number_of_days,
number_of_weeks,
number_of_months and
number_of_years
all from the startingDate to the current date.
If you are going to use these values in two or more places in the application and you care about the application's response time, would you rather make the calculations in a view or create computed columns for each so you can query the table directly?

Computed columns are easy to maintain and provide an ideal solution to your problem – I have used such a solution recently. However, be aware that the values are calculated when they are requested (when they are SELECTed), not when the row is INSERTed into the table – so performance might still be an issue. This might be acceptable if you can off-load work from the application server to the database server. Views are likewise evaluated when they are queried (unless they are materialised), so again there will be an overhead at runtime, but again it is on the database server, not the application server.
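A minimal sketch of the computed-column option, assuming a hypothetical table dbo.MyTable holding the startingDate column from the question (because GETDATE() is non-deterministic, these columns cannot be PERSISTED or indexed):

ALTER TABLE dbo.MyTable      -- hypothetical table name
    ADD number_of_days   AS DATEDIFF(DAY,   startingDate, GETDATE()),
        number_of_weeks  AS DATEDIFF(WEEK,  startingDate, GETDATE()),
        number_of_months AS DATEDIFF(MONTH, startingDate, GETDATE()),
        number_of_years  AS DATEDIFF(YEAR,  startingDate, GETDATE());
        -- note: DATEDIFF counts calendar boundaries crossed, which may be close enough here

After this, SELECT number_of_days FROM dbo.MyTable works like any other column, with the DATEDIFF evaluated at query time.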

Like nearly everything: it depends.
As #RedX suggests, there is probably not much of a performance difference either way, so it becomes a question of how you will use them. To me this is more of a feel thing.
Using them in more than one place wouldn't necessarily drive me immediately to either a view or computed columns. If I only use them in a few places or in low-volume code paths, I might calculate them in-line in those places or use a CTE. But if they are in widespread or heavy use, I would agree with a view or computed columns.
You would also want them in a view or computed columns if you want them available via ORM tools.
Am I using those "computed columns" individually in places, or am I using them in sets? If using them in sets, I probably want a view of the table that includes them all (see the sketch below).
When I need them, do I usually want them associated with data from a particular other table? If so, that would suggest a view.
Am I basing updates to the original table on those computed values? If so, then I want computed columns, to avoid joining to the view in those cases.
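For the view route, a minimal sketch along these lines (dbo.MyTable and its Id column are assumptions, not taken from the question):

CREATE VIEW dbo.StartingDateAges
AS
SELECT t.Id,
       t.startingDate,
       DATEDIFF(DAY,   t.startingDate, GETDATE()) AS number_of_days,
       DATEDIFF(WEEK,  t.startingDate, GETDATE()) AS number_of_weeks,
       DATEDIFF(MONTH, t.startingDate, GETDATE()) AS number_of_months,
       DATEDIFF(YEAR,  t.startingDate, GETDATE()) AS number_of_years
FROM dbo.MyTable AS t;       -- hypothetical table name

Either way the arithmetic runs when the row is read; the choice is mostly about where you want the definition to live.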

Calculated columns may seem like an easy solution at first, but I have seen companies have trouble with them: when they try to do ETL with real-time Change Data Capture (CDC) using tools like Attunity, the tool will not recognize the calculated columns, since the values are not stored permanently. So there are some issues. Also, if the columns will be retrieved many, many times by users, you will save time in the long run by putting that logic in the ETL tool or a procedure and writing the values to the database once, instead of calculating them for each request.

Related

Creating an Efficient (Dynamic) Data Source to Support Custom Application Grid Views

In the application I am working on, we have data grids that have the capability to display custom views of the data. As a point of reference, we modeled this feature using the concept of views as they exist in SharePoint.
The custom views should have the following capabilities:
- Be able to define which subset of columns (of those that are available) should be displayed in the view.
- Be able to define one or more filters for retrieving data. These filters are not constrained to use only the columns that are in the result set but must use one of the available columns. Standard logical conditions and operators apply to these filters. For example, ColumnA Equals Value1 or ColumnB >= Value2.
- Be able to define a set of columns that the data will be sorted by. This set of columns can be one or more columns from the set of columns that will be returned in the result set.
- Be able to define a set of columns that the data will be grouped by. This set of columns can be one or more columns from the set of columns that will be returned in the result set.
I have application code that will dynamically generate the necessary SQL to retrieve the appropriate set of data. However, it appears to perform poorly. When I run across a poorly performing query, my first thought is to determine where indexes might help. The problem here is that I won't necessarily know which indexes need to be created as the underlying query could retrieve data in many different ways.
Essentially, the SQL that is currently being used does the following:
1. Creates a temporary table variable to hold the filtered data. This table contains a column for each column that should be returned in the result set.
2. Inserts data that matches the filter into the table variable.
3. Queries the table variable to determine the total number of rows of data.
4. If requested, determines the grouping values of the data in the table variable using the specified grouping columns.
5. Returns the requested page of the requested page size of data from the table variable, sorted by any specified sort columns.
My question is: what are some ways that I might improve this process? For example, one idea I had was to have my table variable contain only the columns of data that are used to group and sort, and then join in the source table at the end to get the rest of the displayed data. I am not sure whether this would make any difference, which is the reason for this post.
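A rough sketch of that join-back idea, under assumed names (dbo.SourceTable, Id, SortCol1 and the paging variables are all hypothetical):

DECLARE @PageNumber int = 1, @PageSize int = 50;

-- keep only the key and the sort column(s) in the table variable
DECLARE @Keys TABLE (Id int PRIMARY KEY, SortCol1 datetime);

INSERT INTO @Keys (Id, SortCol1)
SELECT Id, SortCol1
FROM dbo.SourceTable
WHERE 1 = 1;                               -- dynamically generated filter goes here

SELECT COUNT(*) AS TotalRows FROM @Keys;   -- total row count for the grid

-- join back to the source table only for the rows on the requested page
SELECT s.*
FROM @Keys AS k
JOIN dbo.SourceTable AS s ON s.Id = k.Id
ORDER BY k.SortCol1
OFFSET (@PageNumber - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;

OFFSET/FETCH is available from SQL Server 2012 onwards and in Azure SQL Database, so it fits the version constraint below.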
I need to support versions 2014, 2016 and 2017 of SQL Server in addition to SQL Azure. Essentially, I will not be able to use a specific feature of an edition of SQL Server unless that feature is available in all of the aforementioned platforms.
(This is not really an "answer" - I just can't add comments yet because my reputation score isn't high enough.)
I think your general approach is fine - essentially you are building a GUI generator for SQL. However, a few things:
This type of feature is best suited to a warehouse or a read-only replica database. Do not build this on a live production transactional database. There are permutations you haven't thought of that your users will find, and they will kill your database (this is also true from a warehouse standpoint, but a warehouse usually doesn't have the same response-time expectations as a transactional database).
The method you described for paging is not efficient from a database standpoint. You are essentially querying, filtering, grouping, and sorting the same exact dataset multiple times just to cherry-pick a few rows each time. If you have the data cached, that might be OK, but you shouldn't make that assumption. If you have the know-how, figure out how to snapshot the entire final data set with an extra column that keeps the data physically sorted in the order the user requested; that way you can quickly query the snapshot for your paging (see the sketch after these points).
If you have a Repository/DAL layer, design your solution so that in the future certain combinations of tables/columns can use hardcoded queries or stored procedures. Certain queries will inevitably pop up that cause performance issues, and you may have to build a custom solution for those specific queries to get performance that your dynamic SQL can't deliver.
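A rough sketch of that snapshot idea (the table names, sort column, and page boundaries are all hypothetical):

-- materialise the fully filtered and sorted result once, with an explicit sort position
SELECT ROW_NUMBER() OVER (ORDER BY s.SortCol1) AS SortOrder,
       s.*
INTO dbo.GridSnapshot_12345                -- e.g. one snapshot per user request
FROM dbo.SourceTable AS s
WHERE 1 = 1;                               -- dynamically generated filter goes here

-- each page is then a cheap range query on SortOrder
SELECT *
FROM dbo.GridSnapshot_12345
WHERE SortOrder BETWEEN 51 AND 100         -- page 2 at 50 rows per page
ORDER BY SortOrder;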

Keeping Data Fresh via Azure Data Factory

What is the proper way to keep data slices refreshed? Imagine I have a table with various columns, but importantly a DATE_CREATED and a DATE_MODIFIED column.
If my data slicing strategy is based on DATE_CREATED, I could periodically reprocess old slices. This follows the ADF guidance of "repeatability". I don't think ADF has a way of doing this automatically, but I could externally trigger the refresh via the API (I'm guessing). This seems like perhaps the most correct way, but given that ADF doesn't seem to support it as a feature, it makes me feel like there's a better way of doing it. It also seems mildly wasteful.
If my data slicing strategy is based on DATE_MODIFIED, I run into issues with the ADF activities not being repeatable. An old slice, when refreshed, would give different results because rows that were within the window may have moved to a different window. On the other hand, the latest slice would always catch rows that have changed. The other issue is preventing row duplication. The pre-activity clean up actions would need to somehow be able to clean up records in the destination table prior to the copy. Or some type of UPSERT method must be employed.
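A minimal sketch of such an UPSERT, assuming the slice is first copied into a hypothetical staging table (every table and column name here other than DATE_CREATED and DATE_MODIFIED is an assumption):

MERGE dbo.DestinationTable AS tgt
USING dbo.StagingTable AS src
    ON tgt.Id = src.Id
WHEN MATCHED THEN
    UPDATE SET tgt.SomeColumn    = src.SomeColumn,
               tgt.DATE_MODIFIED = src.DATE_MODIFIED
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, SomeColumn, DATE_CREATED, DATE_MODIFIED)
    VALUES (src.Id, src.SomeColumn, src.DATE_CREATED, src.DATE_MODIFIED);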
The final option is to TRUNCATE the destination table every day. This is fine for smaller tables but has its own downsides: (1) we're not really "slicing" at all anymore, this is just scorched earth; (2) any time any slice is being processed, all downstream slices from all dates are in danger of failing, because the table is being blown away; (3) it's practically impossible if your table has any respectable amount of data in it.
No option seems excellent but the first option seems better. Looking for advice from someone who has solved this problem or is experienced with ADF.

Understanding metadata in Postgres?

I'm currently writing some code for one of my classes involving distributed and parallel database processing. I'm doing horizontal fragmentation on some data and am required to keep track of the different pieces of data.
The professor recommends storing "metadata" to keep track of some basic computations. Is this as simple as creating another table and storing some basic information, or is there a much more efficient way of doing this?
Example:
I need to track ranges for min/max values of every table in my database. Should I store that information in an entirely new table or is there a better way of achieving this?
Yes, you should store min/max in a different table. Depending on your application, you might need more than one of those kinds of tables.
Each insert, update, or delete statement can change either or both of those values. Think about how you want to handle that. (Triggers, probably.)
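A minimal sketch of that trigger approach in PostgreSQL, using a made-up table measurements with a numeric value column (all names here are illustrative, not from the question):

-- the tracked table and the summary ("metadata") table
CREATE TABLE measurements (value numeric);

CREATE TABLE table_stats (
    table_name text PRIMARY KEY,
    min_value  numeric,
    max_value  numeric
);

-- recompute the min/max after every data change (simple, and fine for class-sized data)
CREATE OR REPLACE FUNCTION refresh_measurements_stats() RETURNS trigger AS $$
BEGIN
    INSERT INTO table_stats (table_name, min_value, max_value)
    SELECT 'measurements', MIN(value), MAX(value) FROM measurements
    ON CONFLICT (table_name)
    DO UPDATE SET min_value = EXCLUDED.min_value,
                  max_value = EXCLUDED.max_value;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER measurements_stats_refresh
AFTER INSERT OR UPDATE OR DELETE ON measurements
FOR EACH STATEMENT EXECUTE PROCEDURE refresh_measurements_stats();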
Terminology
Metadata just means "data about other data", and min/max values for one or more columns in each table are arguably data about other data. But I've never seen such data called metadata; it's always called summary or aggregate data.
I think you'll find that when most DBAs and database developers say "metadata", they're talking about system tables or the information_schema views that are built on top of system tables.

Too many columns in a single preference db table?

I have an application that is essentially built out of many smaller applications. Each application has its own individual preferences, but all of them share the same 5 preferences, for example, whether the application is displayed in the nav, whether it is public, whether reports should be generated, etc.
All of these common preferences need to be known by any page in the web app because the navigation is constructed from them. So originally I put all these preferences in a single table. However, as the number of applications grows (10 now, eventually around 30), the number of columns will end up being around 150-200 total. Most of these columns are just booleans, but it still worries me having that many columns in one table. On the other hand, if I were to split them apart into separate tables (preferences per app), I'd have to join them all together anyway every time I need to see the preferences, so why not just leave them all together?
In the application I can break the preferences into smaller objects so they are easier to work with, but from a db perspective they are a single entity. Is it better to leave them in one giant table, or break them apart into smaller ones but force many joins every time they are requested?
Which database engine are you using? Normally you will find recommendations about the number of columns per table for your DB engine, mostly row-size limitations, which should keep you safe.
Other options and suggestions include:
- Assign a bit per config key in an integer, and use a bitwise AND to read only the key you are interested in at a given point in time: a single value read from the DB and one quick logical operation for each read of a config key (see the sketch after this list).
- Cache the preferences in memory for fewer round trips to the DB servers. Depending on the frequency of changes, you may also have to clear the cached preference when it is updated.
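A minimal sketch of that bit-per-key idea in T-SQL (the flag names and values are made up):

DECLARE @ShowInNav  int = 1,               -- bit 0
        @IsPublic   int = 2,               -- bit 1
        @RunReports int = 4;               -- bit 2

-- a stored preference value of 5 means ShowInNav + RunReports
DECLARE @AppPreferences int = 5;

-- bitwise AND tests a single flag without touching the others
SELECT CASE WHEN @AppPreferences & @IsPublic  = @IsPublic  THEN 1 ELSE 0 END AS IsPublic,
       CASE WHEN @AppPreferences & @ShowInNav = @ShowInNav THEN 1 ELSE 0 END AS ShowInNav;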
Why not turn the columns into rows and use something like this:
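Something along these lines, where the exact columns are an assumption rather than part of the original answer:

CREATE TABLE SETTING (
    SETTING_ID   int          PRIMARY KEY,
    SETTING_NAME varchar(100) NOT NULL,     -- what the setting means
    DESCRIPTION  varchar(255) NULL
);

CREATE TABLE APP_SETTING (
    APP_ID        int          NOT NULL,    -- which application the value belongs to
    SETTING_ID    int          NOT NULL REFERENCES SETTING (SETTING_ID),
    SETTING_VALUE varchar(255) NULL,        -- the value of the setting
    PRIMARY KEY (APP_ID, SETTING_ID)
);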
This is a typical approach for maintaining lists of settings values.
The APP_SETTING table contains the value of the setting. The SETTING table gives you the context of what the setting is.
There are ways of extending this to add information such as which settings apply to which applications and whether or not the possible values for a particular setting are constrained to a specific list.
Well, CommonPreferences and ApplicationPreferences would certainly make sense, and perhaps even segregating them in code (two queries instead of a join).
After that, a table per application will make more sense.
Another way is going down the route suggested by Joel Brown.
A third would be, instead of having an individual column or row per setting, to stuff all the non-common ones into an XML snippet, or serialise them from a preferences class.
Which decision you make revolves around how your application does (or could) use the data.
If you go down the settings-table approach, getting an application's settings as a row will be, erm, painful. Go down the XML-snippet route and querying for a setting across applications will be even more painful than several joins.
There is no way to say from here what you should compromise on. I think I'd go for CommonPreferences first and see where I was at after that.

Is this a "correct" database design?

I'm working with the new version of a third party application. In this version, the database structure is changed, they say "to improve performance".
The old version of the DB had a general structure like this:
TABLE ENTITY
(
ENTITY_ID,
STANDARD_PROPERTY_1,
STANDARD_PROPERTY_2,
STANDARD_PROPERTY_3,
...
)
TABLE ENTITY_PROPERTIES
(
ENTITY_ID,
PROPERTY_KEY,
PROPERTY_VALUE
)
so we had a main table with fields for the basic properties and a separate table to manage custom properties added by the user.
The new version of the DB instead has a structure like this:
TABLE ENTITY
(
ENTITY_ID,
STANDARD_PROPERTY_1,
STANDARD_PROPERTY_2,
STANDARD_PROPERTY_3,
...
)
TABLE ENTITY_PROPERTIES_n
(
ENTITY_ID_n,
CUSTOM_PROPERTY_1,
CUSTOM_PROPERTY_2,
CUSTOM_PROPERTY_3,
...
)
So now, when the user adds a custom property, a new column is added to the current ENTITY_PROPERTIES table until the maximum number of columns (managed by the application) is reached, then a new table is created.
So, my question is: is this a correct way to design a DB structure? Is this the only way to "increase performance"? The old structure required many joins or sub-selects, but this structure doesn't seem very smart (or even correct) to me...
I have seen this done before on the assumed (often unproven) "expense" of joining - it is basically turning a row-heavy data table into a column-heavy table. They ran into their own limitation, as you imply, by creating new tables when they run out of columns.
I completely disagree with it.
Personally, I would stick with the old structure and re-evaluate the performance issues. That isn't to say the old way is the correct way, it is just marginally better than the "improvement" in my opinion, and removes the need to do large scale re-engineering of database tables and DAL code.
These tables strike me as largely static... caching would be an even better performance improvement without mutilating the database and one I would look at doing first. Do the "expensive" fetch once and stick it in memory somewhere, then forget about your troubles (note, I am making light of the need to manage the Cache, but static data is one of the easiest to manage).
Or, wait for the day you run into the maximum number of tables per database :-)
Others have suggested completely different stores. This is a perfectly viable possibility and if I didn't have an existing database structure I would be considering it too. That said, I see no reason why this structure can't fit into an RDBMS. I have seen it done on almost all large scale apps I have worked on. Interestingly enough, they all went down a similar route and all were mostly "successful" implementations.
No, it's not. It's terrible.
until the max number of column (handled by application) is reached,
then a new table is created.
This sentence says it all. Under no circumstances should an application dynamically create tables. The "old" approach isn't ideal either, but since you have the requirement to let users add custom properties, it has to be something like this.
Consider this:
You lose all type safety, as you have to store every value in the PROPERTY_VALUE column.
Depending on your users, you could have them change the schema beforehand and then let them run some kind of database update batch job, so at least all the properties would be declared with the right datatype. Also, you could lose the entity_id/key thing.
Check out this: http://en.wikipedia.org/wiki/Inner-platform_effect. This certainly reeks of it.
Maybe a RDBMS isn't the right thing for your app. Consider using a key/value based store like MongoDB or another NoSQL database. (http://nosql-database.org/)
From what I know of databases (but I'm certainly not the most experienced), it seems quite a bad idea to do that in your database. If you already know the maximum number of custom properties a user might have, I'd say you'd better set the table's number of columns to that value.
Then again, I'm not an expert, but making new columns on the fly isn't the kind of operation databases like. It's going to bring you more trouble than anything.
If I were you, I'd either fix the number of custom properties, or stick with the old system.
I believe creating a new table for each entity to store properties is a bad design, as you could end up bulking up the database with tables. The only pro to the second method is that you are not traversing all of the redundant rows that do not apply to the entity selected. However, using indexes on the original ENTITY_PROPERTIES table could help greatly with performance (see the sketch below).
I would personally stick with your initial design, apply indexes, and let the database engine determine the best methods for selecting the data, rather than separating each entity's properties into a new table.
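For instance, a minimal sketch of such an index on the tables from the question (the index name is arbitrary):

CREATE INDEX IX_ENTITY_PROPERTIES_KEY
    ON ENTITY_PROPERTIES (ENTITY_ID, PROPERTY_KEY);

With this in place, looking up a given property for a given entity becomes an index seek rather than a scan of all the property rows.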
There is no "correct" way to design a database - I'm not aware of a universally recognized set of standards other than the famous "normal form" theory; many database designs ignore this standard for performance reasons.
There are ways of evaluating database designs though - performance, maintainability, intelligibility, etc. Quite often, you have to trade these against each other; that's what your change seems to be doing - trading maintainability and intelligibility against performance.
So, the best way to find out if that was a good trade off is to see if the performance gains have materialized. The best way to find that out is to create the proposed schema, load it with a representative dataset, and write queries you will need to run in production.
I'm guessing that the new design will not be perceivably faster for queries like "find STANDARD_PROPERTY_1 from ENTITY where STANDARD_PROPERTY_1 = 'banana'".
I'm guessing it will not be perceivably faster when retrieving all properties for a given entity; in fact it might be slightly slower, because instead of a single join to ENTITY_PROPERTIES, the new design requires joins to several tables. You will be returning "sparse" results - presumably, not all entities will have values in the property_n columns in all ENTITY_PROPERTIES_n tables.
Where the new design may be significantly faster is when you need a compound where clause on custom properties. For instance, finding an entity where custom property 1 is true, custom property 2 is banana, and custom property 3 is not in ('kylie', 'pussycat dolls', 'giraffe') is (probably) faster when you can specify columns in the ENTITY_PROPERTIES_n tables instead of rows in the ENTITY_PROPERTIES table. Probably.
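To illustrate with the tables from the question (the property names and values in the predicates are made up), the same compound filter in both designs might look like this:

-- old design: one join to ENTITY_PROPERTIES per custom property tested
SELECT e.ENTITY_ID
FROM ENTITY AS e
JOIN ENTITY_PROPERTIES AS p1
  ON p1.ENTITY_ID = e.ENTITY_ID AND p1.PROPERTY_KEY = 'custom1' AND p1.PROPERTY_VALUE = 'true'
JOIN ENTITY_PROPERTIES AS p2
  ON p2.ENTITY_ID = e.ENTITY_ID AND p2.PROPERTY_KEY = 'custom2' AND p2.PROPERTY_VALUE = 'banana';

-- new design: plain predicates against one of the ENTITY_PROPERTIES_n tables
SELECT e.ENTITY_ID
FROM ENTITY AS e
JOIN ENTITY_PROPERTIES_1 AS p
  ON p.ENTITY_ID_1 = e.ENTITY_ID
WHERE p.CUSTOM_PROPERTY_1 = 'true'
  AND p.CUSTOM_PROPERTY_2 = 'banana';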
As for maintainability - yuck. Your database access code now needs to be far smarter, knowing which table holds which property and how many columns are too many. The likelihood of introducing bugs is high - there are more moving parts, and I can't think of any obvious unit tests to make sure that the database access logic is working.
Intelligibility is another concern - this solution is not in most developers' toolbox, it's not an industry-standard pattern. The old solution is pretty widely known - commonly referred to as "entity-attribute-value". This becomes a major issue on long-lived projects where you can't guarantee that the original development team will hang around.

Resources