Trying to design a column that should sum values from another table - database

sorry if the title is not very clear. I will try to explain now:
I have two tables: table A and table B. The relation between them is one (for A table) to many (for B table). So, it's something like master-detail situation.
I have a column 'Amount' in table B that is obviously decimal and a column 'TotalAmount' in table A.
I'm trying to figure out how to keep the value in table A up to date. My suggestion is to make a view based on table A with an aggregate query summing the Amounts from table B. Of course with the proper indexes ...
But my team-mate suggests updating the value in table A every time we change something in table B from our application.
I wonder what the best solution here would be. Maybe a third variant?
Some clarification ... We expect these tables to be the fastest-growing tables in our database, and table B will grow much, much faster than table A. The most frequent operation on table B will be insert ... and almost nothing else. The most frequent operation on table A will be select ... but not only that.

I see a number of options:
Use an insert trigger on table B and update table A there, as your team-mate suggests. This will keep table A as up to date as possible.
Have a scheduled job that updates table A every x minutes (x = whatever makes sense for your application).
When updating table B, do an update on table A in your application logic. This may not work out if you update table B in many places.

If you have a single place in your app where you insert new rows into table B, then the simplest solution is to send an UPDATE A SET TotalAmount = TotalAmount + ? WHERE ID = ? and pass the values you just used to insert into B. Make sure you wrap both queries (the insert and the update) in a transaction, so that either both happen or neither does.
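For example, a minimal sketch of that pattern in SQL Server-style syntax (the foreign key column B.A_ID, the parameter names and the sample values are assumptions, not from the question):

-- Placeholder values; in practice these come from the application.
DECLARE @AId INT = 42, @Amount DECIMAL(18, 2) = 9.99;

BEGIN TRANSACTION;

-- Insert the detail row ...
INSERT INTO B (A_ID, Amount) VALUES (@AId, @Amount);

-- ... and bump the running total in the same unit of work,
-- so the two can never drift apart.
UPDATE A
SET TotalAmount = TotalAmount + @Amount
WHERE ID = @AId;

COMMIT TRANSACTION;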
If that's not simple, then your next option is a database trigger. Read your database's docs on how to create them. Basically, a trigger is a small piece of code that gets executed when something happens in the DB (in your case, when someone inserts data into table B).
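In SQL Server syntax, such a trigger might look roughly like this (table and column names are assumptions, and other databases use different syntax):

CREATE TRIGGER trg_B_Insert ON B
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Add the newly inserted amounts to the matching rows in A.
    UPDATE a
    SET a.TotalAmount = a.TotalAmount + i.SumAmount
    FROM A AS a
    JOIN (SELECT A_ID, SUM(Amount) AS SumAmount FROM inserted GROUP BY A_ID) AS i
        ON i.A_ID = a.ID;
END;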
The view is another option, but it may cause performance problems during selects which you'll find hard to resolve. Try a "materialized view" or a "computed column" instead (but these can cause performance problems when you insert/remove rows).
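As a rough illustration, the plain-view variant the question describes could look like this (again with assumed names; how well it performs depends heavily on an index on B.A_ID):

CREATE VIEW A_WithTotal
AS
SELECT a.ID, COALESCE(SUM(b.Amount), 0) AS TotalAmount
FROM A AS a
LEFT JOIN B AS b ON b.A_ID = a.ID
GROUP BY a.ID;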

If this value is going to change a lot, you're better off using a view: it is definitely the safer implementation. But even better would be using triggers (if your database supports them).
I would guess your team-mate suggests updating the value on each insert because he thinks that you will need the value quite often, and that recalculating it each time might lead to a slow-down. If that is the case:
Your database should take care of the caching, so this probably won't be an issue.
If it is, nonetheless, you can add that feature at a later stage - this way you can make sure your application works otherwise and will have a much easier time debugging that cache column.

I would definitely recommend using a trigger over using application logic, as that ensures that the database keeps the value up-to-date, rather than relying on all callers. However, from a design point of view, I would be wary of storing generated data in the same table as non-generated data -- I believe it's important to maintain a clear separation, so people don't get confused between what data they should be maintaining and what will be maintained for them.
However, in general, prefer views to triggers -- that way you don't have to worry about maintaining the value at all. Profile to determine whether performance is an issue. In Postgres, I believe you could even create an index on the computed values, so the database wouldn't have to look at the detail table.
The third way, recalculating periodically, will be much slower than triggers and probably slower than a view. That it's not appropriate for your use anyway is the icing on the cake :).

Related

Techniques for populating large databases when testing

We are running 3 Oracle databases with 100+ tables in each. In order to test our code we are looking into alternatives for populating them with test data. The tools we have found so far are DBSetup and DBUnit.
The problem we are having with these tools is all the manual work needed to specify data. For example, if I am to test table D, I am also required to populate tables A, B and C with data. I don't care what that data is; I only care about the data in table D. The reason I also have to populate A, B and C is the consistency checks on the foreign keys in table D.
My question is: how is this type of problem usually handled? Is this a sign of a database that is badly designed from a testability point of view?
If the database is strictly for testing purposes, I don't see anything stopping you from dropping the consistency checks (FKs etc.), populating the data, testing it, truncating the tables, and re-adding the consistency checks later.
Other alternatives that I can think of:
1. Copy the table structure (columns etc.) and do your testing there.
2. Alter all FKs to be "deferrable initially deferred", which basically postpones the consistency checking until you commit the transaction (see the sketch after this list).
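A minimal sketch of option 2 with hypothetical table and constraint names; as far as I know, Oracle does not let you change deferrability of an existing constraint in place, so the FK is dropped and re-created:

ALTER TABLE d DROP CONSTRAINT fk_d_a;

ALTER TABLE d ADD CONSTRAINT fk_d_a
    FOREIGN KEY (a_id) REFERENCES a (id)
    DEFERRABLE INITIALLY DEFERRED;

-- Inside a transaction the check now only runs at COMMIT, so child rows
-- can be inserted before their parents exist:
INSERT INTO d (id, a_id) VALUES (1, 999);   -- parent 999 does not exist yet
INSERT INTO a (id) VALUES (999);
COMMIT;                                     -- constraint is validated here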
We stopped using DBUnit and started to populate the DB programmatically. This way you can create one factory method that creates your entity with all its dependencies, and then you can call this one method as many times as needed.
I would avoid removing constraints during testing. If the code you test creates or modifies the data, you may get false positive results without the constraints. The same goes for performance.
If you have to populate 100 tables just to insert anything into table X, then yes, I would say there might be some problem with the design. Think about whether you can better divide your domain into smaller aggregate roots. But if you only need a few other tables, then I would say there is nothing wrong with it.

How can I store temporary, self-destroying data into a t-sql table?

Suppose I have a table which contains relevant information. However, the data is only relevant for, let's say, 30 minutes.
After that it's just database junk, so I need to get rid of it asap.
If I wanted, I could clean this table periodically by setting an expiration datetime for each record individually and deleting expired records through a job or something. This is my #1 option, and it's what will be done unless someone convinces me otherwise.
But I think this solution may be problematic. What if someone stops the job from running and no one notices? I'm looking for something like a built-in way to insert temporary data into a table, or a table that holds "volatile" data itself, in a way that it automagically removes data x amount of time after its insertion.
And last but not least, if there's no built-in way to do that, could I implement this functionality in SQL Server 2008 (or 2012, we will be migrating soon) myself? If so, could someone give me directions as to what to look for to implement something like it?
(Sorry if the formatting ends up bad, first time using a smartphone to post on SO)
As another answer indicated, TRUNCATE TABLE is a fast way to remove the contents of a table, but it's aggressive; it will completely empty the table. Also, there are restrictions on its use; among others, it can't be used on tables which "are referenced by a FOREIGN KEY constraint".
Any more targeted removal of rows will require a DELETE statement with a WHERE clause. Having an index on relevant criteria fields (such as the insertion date) will improve performance of the deletion and might be a good idea (depending on its effect on INSERT and UPDATE statements).
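For example (the table name, the InsertedAt column and the 30-minute window are assumptions based on the question), the statement the job or insert path would run could be as simple as:

DELETE FROM dbo.TempData
WHERE InsertedAt < DATEADD(MINUTE, -30, SYSUTCDATETIME());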
You will need something to "trigger" the DELETE statement (or TRUNCATE statement). As you've suggested, a SQL Server Agent job is an obvious choice, but you are worried about the job being disabled or removed. Any solution will be vulnerable to someone removing your work, but there are more obscure ways to trigger an activity than a job. You could embed the deletion into the insertion process-- either in whatever stored procedure or application code you have, or as an actual table trigger. Both of those methods increase the time required for an INSERT and, because they are not handled out of band by the SQL Server Agent, will require your users to wait slightly longer. If you have the right indexes and the table is reasonably-sized, that might be an acceptable trade-off.
There isn't any other capability that I'm aware of for SQL Server to just start deleting data. There isn't automatic data retention policy enforcement.
See @Yuriy's comment; it's relevant.
If you really need to implement it DB side....
TRUNCATE TABLE is a fast way to get rid of records.
If all you need is ONE table that you fill with data, use, and dispose of ASAP, you can consider truncating a (permanent) "CACHE_TEMP" table.
The scenario becomes more complicated if you are running concurrent threads/jobs and each is handling its own data.
If the data only exists for a single "job"/context, you can consider using #TEMP tables. They are volatile by nature and may be what you are looking for.
You could also use table variables; they are a bit more volatile than temporary tables, but it depends on things you haven't posted, so I cannot say which is really better.
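For illustration, with hypothetical column names, the two volatile options look like this:

-- Temporary table: visible to the current session and dropped automatically
-- when the session ends (or when explicitly dropped).
CREATE TABLE #WorkingSet (Id INT PRIMARY KEY, Payload NVARCHAR(200));

-- Table variable: scoped to the batch/procedure and gone when it finishes.
DECLARE @WorkingSet TABLE (Id INT PRIMARY KEY, Payload NVARCHAR(200));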

SQL Server: Many columns in a table vs Fewer columns in two tables

I have a database table (called Fields) which has about 35 columns. 11 of them always contain the same constant values in each of the roughly 300,000 rows - they act as metadata.
The downside of this structure is that when I need to update those 11 columns' values, I need to go and update all 300,000 rows.
I could move all the common data into a different table, and update it only one time, in one place, instead of in 300,000 places.
However, if I do it like this, when I display the fields, I need to create INNER JOINs between the two tables, which I know makes the SELECT statement slower.
I must say that updating the columns occurs more rarely than reading (displaying) the data.
How would you suggest I store the data in the database to obtain the best performance?
"I could move all the common data into a different table, and update it only one time, in one place, instead of in 300,000 places."
I.e. sane database design and standard normalization.
This is not about "many empty fields"; it is brutally about tons of redundant data. Constants you should have isolated. Separate table. This may also make things faster - it allows the database to use memory more efficiently because your database is a lot smaller.
I would suggest going with a separate table unless there is something significant you haven't mentioned (of course it would be better to try and measure, but I suspect you already know the answer).
You can actually get faster selects as well: joining a small table would be cheaper than fetching the same data 300,000 times.
This is a classic example of denormalized design. Sometimes, denormalization is done for (SELECT) performance, and always in a deliberate, measurable way. Have you actually measured whether you gain any performance by it?
If your data fits into cache, and/or the JOIN is unusually expensive [1], then there may well be some performance benefit from avoiding the JOIN. However, the denormalized data is larger and will push at the limits of your cache sooner, increasing the I/O and likely reversing any gains you may have reaped from avoiding the JOIN - you might actually lose performance.
And of course, getting the incorrect data is useless, no matter how quickly you can do it. The denormalization makes your database less resilient to data inconsistencies [2], and the performance difference would have to be pretty dramatic to justify this risk.
[1] Which doesn't look to be the case here.
[2] E.g. have you considered what happens in a concurrent environment where one application might modify existing rows and the other application inserts a new row but with old values (since the first application hasn't committed yet so there is no way for the second application to know that there was a change)?
The best way is to separate the data and form a second table with those 11 columns; call it a MASTER DATA TABLE, which will have a primary key.
This primary key can be referenced as a foreign key in those 300,000 rows in the first table.
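A minimal sketch of that split, with hypothetical names for the moved columns (the new column on Fields is added as NULLable so existing rows can be backfilled before it is made mandatory):

CREATE TABLE FieldMetadata (
    MetadataId INT PRIMARY KEY,
    SurveyName NVARCHAR(100) NOT NULL,
    SurveyYear INT NOT NULL
    -- ... the rest of the 11 shared columns
);

ALTER TABLE Fields
    ADD MetadataId INT NULL
        CONSTRAINT FK_Fields_FieldMetadata REFERENCES FieldMetadata (MetadataId);

-- Reading the data back needs a join, but an update of the shared values now
-- touches a single row in FieldMetadata instead of 300,000 rows in Fields.
SELECT f.*, m.SurveyName, m.SurveyYear
FROM Fields AS f
JOIN FieldMetadata AS m ON m.MetadataId = f.MetadataId;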

How to deal with extremely wide tables in SQL server?

I have encountered the following dilemma several times and would be interested to hear how others have addressed this issue or if there is a canonical way that the situation can be addressed.
In some domains, one is naturally led to consider very wide tables. Take, for instance, time series surveys that evolve over many years. Such surveys can have hundreds, if not thousands, of variables. Typically, though, there are probably only a few thousand or tens of thousands of rows. It is absolutely natural to consider such a result set as a table where each variable corresponds to a column in the table; however, in SQL Server at least, one is limited to 1024 (non-sparse) columns.
The obvious workarounds are to
Distribute each record over multiple tables
Stuff the data into a single table with columns of say, ResponseId, VariableName, ResponseValue
Number 2, I think, is very bad for a number of reasons (difficult to query, suboptimal storage, etc.), so really the first choice is the only viable option I see. This choice can perhaps be improved by grouping columns that are likely to be queried together into the same table - but one can't really know this until the database is actually being used.
So, my basic question is: Are there better ways to handle this situation?
You might want to put a view in front of the tables to make them appear as if they are a single table. The upside is that you can rearrange the storage later without queries needing to change. The downside is that a modification through the view can only touch one of the base tables at a time. If necessary, you could mitigate this with stored procedures for frequently used modifications. Based on your use case of time series surveys, it sounds like inserts and selects are far more frequent than updates or deletes, so this might be a viable way to stay flexible without forcing your clients to update if you need to rearrange things later.
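For instance, assuming the record is split into two hypothetical tables keyed by ResponseId, the view could look like this:

CREATE VIEW dbo.SurveyResponse
AS
SELECT a.ResponseId, a.Var001, a.Var002, b.Var900, b.Var901
FROM dbo.SurveyResponse_Part1 AS a
JOIN dbo.SurveyResponse_Part2 AS b
    ON b.ResponseId = a.ResponseId;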
Hmmm, it really depends on what you do with it. If you want to keep the table as wide as it is (possibly this is for OLAP or a data warehouse), I would just use proper indexes. Based on the columns that are selected most often, I could also use covering indexes; based on the rows that are searched most often, I could also use filtered indexes. If there are, let's say, billions of records in the table, you could partition the table as well. If you do want to store the data over multiple tables, definitely use proper normalization techniques, probably up to 3NF or 3.5NF, to divide the big table into smaller tables. I would use your first method, normalization, just because it makes more sense to me that way.
This is an old topic, but something we are currently working on resolving. Neither of the above answers really gives as many benefits as the solution we feel we have found.
We previously believed that having wide tables wasn't really a problem. Having spent time analysing this we have seen the light and realise that costs on inserts/updates are indeed getting out of control.
As John states above, the solution is really to create a VIEW to provide your application with a consistent schema. One of the challenges in any redesign may be, as in our case, that you have thousands or millions of lines of code referencing an old wide table and you may want to provide backwards compatibility.
Views can also be used for UPDATEs and INSERTs, as John alludes to. However, a problem we found initially was this: take the example of myWideTable, which may have hundreds of columns, and split it into myWideTable_a with columns a, b and c and myWideTable_b with columns x, y and z. An insert into the view which only sets column a will only insert a record into myWideTable_a.
This causes a problem when you later want to update your record and set myWideTable.z, as this will fail.
The solution we're adopting, and performance testing, is to have an INSTEAD OF trigger on the view's insert that always inserts into both split tables, so that we can continue to update or read from the view with impunity.
The question as to whether using this trigger on inserts provides more overhead than a wide table is still open, but it is clear that it will improve subsequent writes to columns in each split table.
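A rough sketch of that INSTEAD OF trigger, assuming the view myWideTable exposes the columns of myWideTable_a (id, a, b, c) and myWideTable_b (id, x, y, z):

CREATE TRIGGER trg_myWideTable_Insert
ON dbo.myWideTable
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Always create a row in both split tables, so a later UPDATE of any
    -- column through the view finds a row to update.
    INSERT INTO dbo.myWideTable_a (id, a, b, c)
    SELECT id, a, b, c FROM inserted;

    INSERT INTO dbo.myWideTable_b (id, x, y, z)
    SELECT id, x, y, z FROM inserted;
END;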

Voting System: Should I use a SQL Trigger or more Code?

I'm building a voting system whereby each vote is captured in a votes table with the UserID and the DateTime of the vote along with an int of either 1 or -1.
I'm also keeping a running total of TotalVotes in the table that contains the item that the user actually voted on. This way I'm not constantly running a query to SUM the Vote table.
My question is kind of a pros/cons question when it comes to updating the TotalVotes field. With regards to code manageability, putting an additional update method in the application makes it easy to troubleshoot and find any potential problems. But if this application grows significantly in its user base, this could potentially cause a lot of additional SQL calls from the app to the DB. Using a trigger keeps it "all in the SQL family", so to speak, and should add a little performance boost, as well as keep a mundane activity out of the code base.
I understand that "premature optimization" could be called on this specific question, but since I haven't built it yet, I might as well try to figure out the better approach right out of the gate.
Personally I'm leaning towards a trigger. Please give me your thoughts/reasoning.
Another option is to create view on the votes table aggregating the votes as TotalVotes.
Then index the view.
The magic of the SQL Server optimizer (Enterprise Edition only, I think) is that when it sees queries of SUM(voteColumn) it will pick that value up from the index on the view of the same data, which is amazing when you consider you're not referencing the view directly in your query!
If you don't have Enterprise Edition, you could query for the total votes on the view rather than the table, and then take advantage of the index.
Indexes are essentially denormalization of your data that the optimizer is aware of. You create or drop them as required, and let the optimizer figure it out (no code changes required). Once you start down the path of your own hand-crafted denormalization, you'll have that baked into your code for years to come.
Check out Improving performance with indexed views
There are some specific criteria that must be met to get indexed views working. Here's a sample based on a guess of your data model:
create database indexdemo
go
create table votes(id int identity primary key, ItemToVoteOn int, vote int not null)
go
CREATE VIEW dbo.VoteCount WITH SCHEMABINDING AS
select ItemToVoteOn, SUM(vote) as TotalVotes, COUNT_BIG(*) as CountOfVotes from dbo.votes group by ItemToVoteOn
go
CREATE UNIQUE CLUSTERED INDEX VoteCount_IndexedView ON dbo.VoteCount(itemtovoteon)
go
insert into votes (ItemToVoteOn, vote) values (1, 1)
insert into votes (ItemToVoteOn, vote) values (1, 1)
insert into votes (ItemToVoteOn, vote) values (2, 1)
insert into votes (ItemToVoteOn, vote) values (2, 1)
insert into votes (ItemToVoteOn, vote) values (2, 1)
go
select ItemToVoteOn, SUM(vote) as TotalVotes from dbo.votes group by ItemToVoteOn
And this query (which doesn't reference the view or, by extension, its index) produces an execution plan in which the index on the view is used. Of course, you can drop the index at any time (and gain back insert performance).
And one last word. Until you're up and running, there is really no way to know whether any sort of denormalization will in fact help overall throughput. By using indexes you can create them, measure whether they help or hurt, and then keep or drop them as required. It's the only kind of denormalization for performance that is safe to do.
I'd suggest you build a stored procedure that does both the vote insert and the update of the total votes. Then your application only has to know how to record a vote, but the logic for exactly what happens when you call it is still contained in one place (the stored procedure, rather than an ad-hoc update query and a separate trigger).
It also means that later on, if you want to remove the update to total votes, all you have to change is the procedure, by commenting out the update part.
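A rough sketch of such a procedure, assuming hypothetical Votes and Items tables with a TotalVotes column (error handling kept minimal):

CREATE PROCEDURE dbo.RecordVote
    @ItemId INT,
    @UserId INT,
    @Vote   INT   -- expected to be 1 or -1
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;   -- roll everything back if either statement fails

    BEGIN TRANSACTION;

    INSERT INTO dbo.Votes (ItemId, UserId, VotedAt, Vote)
    VALUES (@ItemId, @UserId, SYSUTCDATETIME(), @Vote);

    UPDATE dbo.Items
    SET TotalVotes = TotalVotes + @Vote
    WHERE ItemId = @ItemId;

    COMMIT TRANSACTION;
END;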
The real premature optimization here is saving the total in a table instead of just summing the data as needed. Do you really need to denormalize the data for performance?
If you do not need to denormalize the data then you do not need to write a trigger.
I've done the trigger method for years and was always happier for it. So, as they say, "come on in, the water's fine." However, I usually do it when there are many many tables involved, not just one.
The pros/cons are well known. Materializing the value is a "pay me now" decision, you pay a little more on the insert to get faster reads. This is the way to go if and only if you want a read in 5 milliseconds instead of 500 milliseconds.
PRO: The TotalVotes will always be instantly available with one read.
PRO: You don't have to worry about code path, the code that makes the insert is much simpler. Multiplied over many tables on larger apps this is a big deal for maintainability.
CON: For each INSERT you also pay with an additional UPDATE. It takes a lot more inserts/second than most people think before you ever notice this.
CON: For many tables, manually coding triggers can get tricky. I recommend a code generator, but since the only one I know about is one I wrote, recommending it would get me into self-promotion territory. If you have only one table, just code it manually.
CON: To ensure complete correctness, it should not be possible to issue an UPDATE from a console or code to modify TotalVotes. This means it is more complicated. The trigger should execute as a special superuser that is not normally used. A second trigger on the parent table fires on UPDATE and prevents changes to TotalVotes unless the user making the update is that special super-user.
Hope this gives you enough to decide.
My first gut instinct would be to write a UDF to perform the SUM operation and make TotalVotes a computed column based on that UDF.
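For what it's worth, a sketch of that idea with hypothetical names; note that because the function reads from another table, the computed column cannot be persisted or indexed, so the SUM runs on every read:

CREATE FUNCTION dbo.fn_TotalVotes (@ItemId INT)
RETURNS INT
AS
BEGIN
    -- Sum the votes for one item; returns 0 when there are none yet.
    RETURN (SELECT ISNULL(SUM(Vote), 0) FROM dbo.Votes WHERE ItemId = @ItemId);
END;
GO

ALTER TABLE dbo.Items
    ADD TotalVotes AS dbo.fn_TotalVotes(ItemId);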

Resources