Stored Procedure or Calculated Columns - sql-server

This is a question about best practice and best performance.
I have inherited a database that contains data for turbine engines. I have found 20 data points that are calculated from several fields from the turbine. In the past, a view was created to pull data for some turbines and calculate some of the 20 data points. Then other views were created for the same turbines but different data points, and still others for different turbines and data points. So the same equations are used over and over.
I want to consolidate all of the equations (20 data points) into one place. My debate is between creating a user-defined function that does all 20 calculations and creating them as computed columns in the table. A function would calculate all 20 for each turbine even though I might only need 2 or 3 for a view. A computed column would only be calculated when the view pulls it.

The answer is probably "it depends".
The factors when making this determination include:
Is the column deterministic? (i.e. can you persist it or not)
How often is data inserted into the table?
How often is data retrieved from the table?
The trade-offs for computed columns, and specifically persisted computed columns, are similar to those you weigh when considering an index on your table. Persisted columns increase the time an insert takes, but make retrieval faster. At the other end, computed columns that aren't persisted, or a function, give you faster inserts but slower retrieval.
The end solution will likely depend on how the table is used (how often writes and reads occur), which is something you would need to determine.
Personally, I wouldn't use a function for the columns. I'd rather persist them, or write a view or computed columns that produce them, depending on how the table is used.
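As a rough illustration of keeping the equations in one place, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for SQL Server. The table, the column names and both formulas are invented for the example and are not taken from the original database. A base view defines each calculation once, and a narrower view selects only the columns it needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical raw readings; all names here are assumptions.
    CREATE TABLE turbine_reading (
        turbine_id  INTEGER PRIMARY KEY,
        inlet_temp  REAL NOT NULL,
        outlet_temp REAL NOT NULL,
        fuel_flow   REAL NOT NULL
    );

    -- Every equation is written exactly once, in this view.
    CREATE VIEW turbine_calc AS
    SELECT turbine_id,
           outlet_temp - inlet_temp               AS temp_rise,
           (outlet_temp - inlet_temp) / fuel_flow AS rise_per_fuel
    FROM turbine_reading;

    -- A narrower view reuses the definition instead of repeating the formula.
    CREATE VIEW turbine_temp_rise AS
    SELECT turbine_id, temp_rise FROM turbine_calc;
""")

conn.execute("INSERT INTO turbine_reading VALUES (1, 300.0, 900.0, 2.0)")

rows = conn.execute(
    "SELECT turbine_id, temp_rise FROM turbine_temp_rise"
).fetchall()
full = conn.execute(
    "SELECT rise_per_fuel FROM turbine_calc WHERE turbine_id = 1"
).fetchone()
print(rows, full)
```

In SQL Server the same formulas could instead be declared as computed columns on the table, and marked PERSISTED when they are deterministic, which moves the cost from read time to write time.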

Related

Performance of Column Family in Cassandra DB

I have a table where my queries will be purely based on id and created_time. It has 50 other columns, which will also be queried purely by id and created_time. I can design it in two ways:
Either multiple small tables with 5 columns each, covering all 50 parameters
Or a single table with all 50 columns, with id and created_at as the primary key
Which will be better? My row count will grow tremendously, so should I worry about the width of the column family while modelling?
Actually, you need small tables to decrease the load on a single table, and you should also try to keep tables query-based. If your query reads all 50 columns, then you can proceed with a single table. But if each query fetches only part of the data, then you should maintain small query-based tables, which will distribute the data more evenly across the nodes, or maintain multiple partitions as Alex suggested (but then you cannot run range-based queries).
This really depends on how you structure your partition key and on the distribution of data inside each partition. CQL has some limits, such as a maximum of 2 billion cells per partition, but that is a theoretical limit. There are also practical limits, such as not having partitions bigger than 100 MB (DSE has recommendations in the planning guide).
If you'll always search by id and created_time, and never do range queries on created_time, then you may even use a composite partition key comprising both, which will distribute data more evenly across the cluster. Otherwise make sure that you don't have too much data inside partitions.
Or you can add another piece to the partition key. For example, people sometimes add a truncated date-time, such as the time rounded to the hour or to the day, but this will affect your queries. It really depends on them.
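To make the truncated date-time idea concrete, here is a small Python sketch. It only shows how a bucket value could be derived on the client side; the function names and the choice of an hourly bucket are assumptions for illustration, not part of any Cassandra API.

```python
from datetime import datetime, timezone


def hour_bucket(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def partition_key(row_id: str, created_time: datetime) -> tuple:
    # Hypothetical composite partition key: (id, hour bucket).
    return (row_id, hour_bucket(created_time))


t1 = datetime(2018, 5, 1, 10, 5, 30, tzinfo=timezone.utc)
t2 = datetime(2018, 5, 1, 10, 59, 59, tzinfo=timezone.utc)
t3 = datetime(2018, 5, 1, 11, 0, 0, tzinfo=timezone.utc)

# Rows written within the same hour share a partition ...
same_partition = partition_key("sensor-1", t1) == partition_key("sensor-1", t2)
# ... while the next hour starts a new one.
next_partition = partition_key("sensor-1", t1) == partition_key("sensor-1", t3)
print(same_partition, next_partition)
```

The cost is on the read side: every query has to supply the bucket as well as the id, so reading a span of several hours means addressing one partition per bucket.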
Sort of in line with what Alex mentions, the determining factor here is going to be the size of your various partitions (which is an extension of the size of your columns).
Practically speaking, you can have problems going both ways - partitions that are too narrow can be as problematic as partitions that are too wide, so this is the type of thing you may want to try benchmarking and seeing which works best. I suspect for normal data models (staying away from the pathological edge cases), either will work just fine, and you won't see a meaningful difference (assuming 3.11).
In 3.11.x, Cassandra does a better job of skipping unrequested values than in 3.0.x, so if you do choose to join it all in one table, do consider using 3.11.2 or whatever the latest available release is in the 3.11 (or newer) branch.

Can denormalization become inefficient?

We have a table that we denormalized because there was a big risk that the joins would be too slow for the amount of data our users have. So we created 10 info columns (INFO0, INFO1... INFO9). Most of the time, only the first 2-3 columns are used; the others are null.
But now we need to add two more types of info with 10 columns each (for a total of 20 new columns). The tricky part is that our design makes it impossible for users to use all 30 denormalized columns. At any time, they will be able to use a maximum of 10 on each row. Moreover, we may need to add even more denormalized columns, but we will never be able to use more than 10 on each row.
I know it is not a good design, but we don't really have a choice. So my question is: can this design become inefficient? Can having a lot of columns with null values slow down my queries? If yes, can it become a big deal?
Yes, it could. You don't say what database you're using or what data type the extra columns are, but adding more columns increases the 'width' of your table, which means more logical reads are needed to retrieve the same number of records, and more reads mean slower queries. So what you gain by denormalisation may eventually be lost by adding too many columns, but the extent of this will depend on your database design.
If it does affect performance, an intermediate solution could be to split the table vertically, placing infrequently referenced columns in a second table.

SQL Server: Many columns in a table vs Fewer columns in two tables

I have a database table (called Fields) which has about 35 columns. 11 of them always contain the same constant values across about 300.000 rows and act as metadata.
The downside of this structure is that, when I need to update the values of those 11 columns, I need to update all 300.000 rows.
I could move all the common data in a different table, and update it only one time, in one place, instead of 300.000 places.
However, if I do it like this, when I display the fields I need to INNER JOIN the two tables, which I know makes the SELECT statement slower.
I must say that updating the columns occurs more rarely than reading (displaying) the data.
How do you suggest I store the data in the database to obtain the best performance?
"I could move all the common data in a different table, and update it only one time, in one place, instead of 300.000 places."
I.e. sane database design and standard normalization.
This is not about "many empty fields", it is brutally about tons of redundant data. These are constants that you should have isolated in a separate table. This may also make things faster: it allows the database to use memory more efficiently, because your database is a lot smaller.
I would suggest to go with a separate table unless you've concealed something significant (of course it would be better to try and measure, but I suspect you already know it).
You can actually get faster selects as well: joining a small table would be cheaper than fetching the same data 300000 times.
This is a classic example of denormalized design. Sometimes, denormalization is done for (SELECT) performance, and always in a deliberate, measurable way. Have you actually measured whether you gain any performance by it?
If your data fits into cache, and/or the JOIN is unusually expensive [1], then there may well be some performance benefit from avoiding the JOIN. However, the denormalized data is larger and will push at the limits of your cache sooner, increasing the I/O and likely reversing any gains you may have reaped from avoiding the JOIN - you might actually lose performance.
And of course, getting the incorrect data is useless, no matter how quickly you can do it. The denormalization makes your database less resilient to data inconsistencies [2], and the performance difference would have to be pretty dramatic to justify this risk.
[1] Which doesn't look to be the case here.
[2] E.g. have you considered what happens in a concurrent environment where one application might modify existing rows and the other application inserts a new row but with old values (since the first application hasn't committed yet so there is no way for the second application to know that there was a change)?
The best way is to separate the data and form a second table with those 11 columns, call it something like a MASTER DATA TABLE, and give it a primary key.
This primary key can then be referenced as a foreign key from the 300,000 rows in the first table.
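The separate-table layout the answers describe can be sketched as follows, again using Python's built-in sqlite3 as a stand-in for SQL Server. The table and column names (field_meta, owner, units) are made up for the example, and only two of the 11 constant columns are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical master table holding the shared constants once.
    CREATE TABLE field_meta (
        meta_id INTEGER PRIMARY KEY,
        owner   TEXT NOT NULL,
        units   TEXT NOT NULL
    );
    -- The wide table keeps only a foreign key to the constants.
    CREATE TABLE fields (
        field_id INTEGER PRIMARY KEY,
        meta_id  INTEGER NOT NULL REFERENCES field_meta(meta_id),
        value    REAL NOT NULL
    );
""")

conn.execute("INSERT INTO field_meta VALUES (1, 'team-a', 'kPa')")
conn.executemany(
    "INSERT INTO fields (meta_id, value) VALUES (1, ?)",
    [(float(i),) for i in range(1000)],
)

# Changing the metadata touches one row, not every row in fields.
cur = conn.execute("UPDATE field_meta SET units = 'bar' WHERE meta_id = 1")
updated_rows = cur.rowcount

# Reads pick the constants up through the join.
distinct_units = conn.execute("""
    SELECT DISTINCT m.units
    FROM fields AS f
    INNER JOIN field_meta AS m ON m.meta_id = f.meta_id
""").fetchall()
print(updated_rows, distinct_units)
```

Whether the join is actually slower than reading the duplicated columns is something to measure on the real data, as the answers above suggest.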

Database Implementation Help : Time-Series data

This is the re-submission of my previous question:
I have a collection of ordered time-series data (stock minute price information). My current database structure using PostgreSQL is below:
symbol_table - where I keep the list of symbols, with symbol_id as the primary key (serial).
time_table, date_table - time/date values are stored there. time_id/date_id are the primary keys (serial/serial).
My main minute_table contains the minute pricing information, where
date_id|time_id|symbol_id form the primary key (and are also foreign keys to the corresponding tables)
Using this main minute_table I'm performing different statistical analyses and keeping the results in separate tables, like one_minute_std, where one-minute standard deviation measures are kept.
Every night I'm updating the tables with the current price information from the last day's closing prices.
With the current implementation my tables contain all the symbols with around 50m records each.
Primary keys are indexed.
If I want to query for all the symbols where closing price > x and one_minute_std > 2 and one_minute_std < 4 for a specific date, the search takes about 3-4 minutes.
To speed up the process I was thinking of separating each symbol into its own table, but I'm not 100% sure if this is a 'proper' way of doing it.
Could you advise me on how I can speed up the query process?
It sounds like you want a combination of approaches.
First, you should look into table partitioning. This stores a single table across multiple storage units ("files"), but still gives you the flexibility of a single table. (Here is the PostgreSQL documentation: http://www.postgresql.org/docs/current/interactive/ddl-partitioning.html).
You would want to partition either by day or by ticker symbol. My first reaction would be by time (day/week/month), since that is the unit of updates. However, if your analyses are only by a single ticker and often span multiple days, then there is an argument for using that instead.
After partitioning, you may want to consider indexes. However, I suspect that partitioning will solve your performance problems.
Since your updates are at night, you should fold your summarization process in with the updates. For instance, one_minute_std should be calculated during this process. You might find it best to load the nightly data into a temporary table, do the calculation for summaries such as one_minute_std, and then load the data into the final partitioned table scheme.
With so many rows that have so few columns, you are probably better off with a good partitioning scheme than an indexing scheme. In particular, indexes have a space overhead, and the smaller the record in each row, the more that using the index incurs an overhead comparable to scanning the entire table.
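The nightly flow described above (load into a staging table, compute the summaries, then move the rows into the final table) might look roughly like this. The sketch uses Python's built-in sqlite3 rather than PostgreSQL, leaves partitioning out, and uses simplified, made-up table names. The question does not define one_minute_std, so a population standard deviation of each symbol's closing prices stands in for it here.

```python
import sqlite3
from statistics import pstdev

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified tables; the real schema uses date_id/time_id.
    CREATE TABLE staging_minute (symbol_id INTEGER, minute_no INTEGER, close REAL);
    CREATE TABLE minute_table (
        symbol_id INTEGER, minute_no INTEGER, close REAL,
        PRIMARY KEY (symbol_id, minute_no)
    );
    CREATE TABLE daily_close_std (symbol_id INTEGER PRIMARY KEY, close_std REAL);
""")

# Step 1: load the nightly batch into the staging table.
nightly_rows = [(1, 0, 10.0), (1, 1, 12.0), (1, 2, 14.0), (2, 0, 5.0), (2, 1, 5.0)]
conn.executemany("INSERT INTO staging_minute VALUES (?, ?, ?)", nightly_rows)

# Step 2: compute the summary per symbol from the staged rows only.
closes_by_symbol = {}
for symbol_id, close in conn.execute("SELECT symbol_id, close FROM staging_minute"):
    closes_by_symbol.setdefault(symbol_id, []).append(close)
summaries = [(s, pstdev(v)) for s, v in closes_by_symbol.items()]

# Step 3: publish summaries and rows together, then clear the staging table.
with conn:
    conn.executemany("INSERT INTO daily_close_std VALUES (?, ?)", summaries)
    conn.execute("INSERT INTO minute_table SELECT * FROM staging_minute")
    conn.execute("DELETE FROM staging_minute")

std_by_symbol = dict(conn.execute("SELECT symbol_id, close_std FROM daily_close_std"))
print(std_by_symbol)
```

The point of the staging step is that the summary only has to scan one night's rows, not the whole history.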

Best choice for a huge database table which contains only integers (have to use SUM() or AVG() )

I'm currently using a MySQL table for an online game under LAMP.
One of the tables is huge (soon millions of rows) and contains only integers (IDs, timestamps, booleans, scores).
I did everything to never have to JOIN on this table. However, I'm worried about the scalability. I'm thinking about moving this single table to another, faster database system.
I use intermediary tables to calculate the scores but in some cases, I have to use SUM() or AVG() directly on some filtered rowsets of this table.
In your opinion, what is the best database choice for this table?
My requirements/specs:
This table contains only integers (around 15 columns)
I need to filter by certain columns
I'd like to have UNIQUE KEYS
It would be nice to have "INSERT ... ON DUPLICATE KEY UPDATE" but I suppose my scripts can manage it by themselves.
I have to use SUM() or AVG()
Thanks
Just make sure you have the correct indexes in place, and selecting should be quick.
Millions of rows in a table isn't huge. You shouldn't expect any problems in selecting, filtering or upserting data if you index on relevant keys as @Tom-Squires suggests.
Aggregate queries (sum and avg) may pose a problem though. The reason is that they have to read every row they aggregate, which can mean a full table scan and thus multiple fetches of data from disk to memory. A couple of methods to increase their speed:
If your data changes infrequently then caching those query results in your code is probably a good solution.
If it changes frequently then the quickest way to improve their performance is probably to ensure that your database engine keeps the table in memory. A quick calculation of expected size: 15 columns x 8 bytes x millions of rows ≈ hundreds of MB, which is not really an issue (unless you're on a shared host). If your RDBMS does not support tuning this for a specific table, then simply put it in a different database schema. That shouldn't be a problem since you're not doing any joins on this table. Most engines will allow you to tune that.
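The size estimate can be worked through in a few lines of Python. This is a back-of-the-envelope sketch that assumes every one of the 15 columns is stored as an 8-byte integer, and it ignores per-row overhead and indexes, so real tables will be somewhat larger.

```python
COLUMNS = 15
BYTES_PER_VALUE = 8  # assumption: every column is a 64-bit integer


def raw_table_bytes(row_count: int) -> int:
    """Raw data size only; excludes row headers and index storage."""
    return COLUMNS * BYTES_PER_VALUE * row_count


for row_count in (1_000_000, 5_000_000, 10_000_000):
    megabytes = raw_table_bytes(row_count) / 1e6
    print(f"{row_count:>10,} rows ~ {megabytes:,.0f} MB")
```

At one million rows this comes to 120 MB of raw data, which matches the "hundreds of MB" order of magnitude given above for a few million rows.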

Resources