If my program were to hit the database with multiple updates, would it be better to pull the tables into a DataSet, change the values, and then send them back to the database? Does anyone know which is more expensive?
No matter what, the database needs to perform all those updates based on the edits you made to the local DataSet. As I understand it, that will be just as expensive as updating sequentially. The only advantage is that it's easier to iterate over a DataSet than to pull and push one result after another.
What will be expensive is all the workaround code for dealing with the exceptions that can occur because you chose "which costs less" over "which is simplest". That is premature optimization.
It depends on the size of the DataSet. If your DataSet is too large, it isn't worth it. Otherwise, it might be a good approach. However, nothing prevents you from doing multiple updates in a batch even without using a DataSet. You could write a stored procedure with an XML parameter that does the batch updates for you.
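As a rough illustration of batching without a DataSet, here is a minimal Python sketch that sends several updates in a single transaction. It uses the standard-library sqlite3 module instead of SQL Server, and the items table, its columns, and the apply_updates helper are hypothetical names made up for the example.

```python
import sqlite3

def apply_updates(conn, updates):
    """Apply many (new_price, item_id) pairs inside one transaction."""
    # The connection context manager commits on success and rolls back
    # on an exception, so the batch is applied all-or-nothing.
    with conn:
        conn.executemany("UPDATE items SET price = ? WHERE id = ?", updates)

# Hypothetical table and data, only for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO items (id, price) VALUES (?, ?)",
                 [(1, 1.0), (2, 2.0), (3, 3.0)])
conn.commit()

apply_updates(conn, [(10.0, 1), (20.0, 2)])
prices = [row[0] for row in conn.execute("SELECT price FROM items ORDER BY id")]
```

The database still performs every update, as the first answer points out; what the batch saves is per-statement round trips and per-statement commits.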
Greeting,
Recently I've started to work on an application where 8 different modules use the same table at some point in the workflow. This table has an Instead-Of trigger, which is 5,000 lines long (where the first 500 and last 500 lines are common to all modules, and then each module has its own 500 lines of code).
Since the number of modules is going to grow and I want to keep things as clear (and separate) as possible, I was wondering: is there some sort of best practice for splitting a trigger into stored procedures, or should I leave it all in one place?
P.S. Are there going to be any performance penalties for calling procedures from the trigger and passing 15+ parameters to them?
Bearing in mind that the inserted and deleted pseudo-tables are only accessible from within trigger code, and that they can contain multiple rows, you're facing two choices:
Process the rows in inserted and deleted in a RBAR [1] fashion, to be able to pass scalar parameters to the stored procedures, or,
Copy all of the data from inserted and deleted into table variables that are then passed to the procedures as appropriate.
I'd expect either approach to impose some [2] performance overhead, just from the copying.
That being said, it sounds like too much is happening inside the triggers themselves - does all of this code have to be part of the same transaction that performed the DML statement? If not, consider using some form of queue (a table of requests or Service Broker, say) in which to place information on work to perform, and then process the data later - if you use Service Broker, you could have it inspect a shared message and then send appropriate messages to dedicated endpoints for each of your modules.
[1] Row By Agonizing Row - using either a cursor or something else that simulates one to access each row in turn - usually frowned upon in a set-based language like SQL.
[2] How much is impossible to know without getting into the specifics of your code and probably trying all possible approaches and measuring the result.
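The "table of requests" variant of the queue idea could look roughly like the following Python sketch. It uses sqlite3 rather than SQL Server, an AFTER INSERT trigger rather than an Instead-Of trigger (SQLite triggers also fire per row), and the orders and work_queue tables and the two handlers are hypothetical stand-ins for the real modules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, module TEXT, amount REAL);
CREATE TABLE work_queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    module TEXT,
    order_id INTEGER,
    done INTEGER DEFAULT 0
);
-- The trigger stays tiny: it only records that work is pending.
CREATE TRIGGER orders_enqueue AFTER INSERT ON orders
BEGIN
    INSERT INTO work_queue (module, order_id) VALUES (NEW.module, NEW.id);
END;
""")

processed = []

# Hypothetical per-module handlers; real ones would hold each module's logic.
def handle_billing(order_id):
    processed.append(("billing", order_id))

def handle_shipping(order_id):
    processed.append(("shipping", order_id))

HANDLERS = {"billing": handle_billing, "shipping": handle_shipping}

def drain_queue(conn):
    """Process pending requests outside the original DML transaction."""
    rows = conn.execute(
        "SELECT id, module, order_id FROM work_queue WHERE done = 0 ORDER BY id"
    ).fetchall()
    with conn:
        for queue_id, module, order_id in rows:
            HANDLERS[module](order_id)
            conn.execute("UPDATE work_queue SET done = 1 WHERE id = ?", (queue_id,))
    return len(rows)

with conn:
    conn.executemany("INSERT INTO orders (id, module, amount) VALUES (?, ?, ?)",
                     [(1, "billing", 9.5), (2, "shipping", 3.0)])
handled = drain_queue(conn)
```

The trade-off is that the per-module work no longer runs in the same transaction as the insert, which is only acceptable if the answer to the question above is "no".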
I don't think there is a meaningful performance penalty in this case.
Anyway, it is bad practice to write it all inside the trigger (when it is 5,000 lines long...).
I think the main consideration is maintainability, which will be much better if you split it into several SPs.
How would a single BLOB column in SQL Server compare, performance-wise, to ~20 REAL columns (20 x 32-bit floats)?
I remember Martin Fowler recommending using BLOBs for persisting large object graphs (in Patterns of Enterprise Application Architecture) to remove multiple joins in queries, but does it make sense to do something like this for a table with 20 fixed columns (which are never used in queries)?
This table is updated really often, around 100 times per second, and INSERT statements get rather large with all the columns specified in the query.
I presume the first answer is going to be "profile it yourself", but I'd like to know if someone already has experience with this stuff.
Typically you should not do this unless you have found that it is critical to meeting your performance requirements.
If you store everything in one blob, you need to rewrite every stored blob in the database whenever you change the object structure (like adding or removing a column). If you keep multiple columns, your future database refactorings and deployments will be much easier.
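To make the maintenance cost concrete, here is a small Python sketch of what packing 20 floats into one blob might look like, and what adding a 21st value would involve. The little-endian layout and the pack_v1, unpack_v1 and migrate_v1_to_v2 helpers are assumptions for illustration, not a format the question specifies.

```python
import struct

# Assumed layout: 20 little-endian 32-bit floats, 80 bytes in total.
RECORD_V1 = struct.Struct("<20f")
# A later version with one extra value appended.
RECORD_V2 = struct.Struct("<21f")

def pack_v1(values):
    """Serialize exactly 20 floats into one blob."""
    return RECORD_V1.pack(*values)

def unpack_v1(blob):
    """Deserialize a version-1 blob back into a list of floats."""
    return list(RECORD_V1.unpack(blob))

def migrate_v1_to_v2(blob, default=0.0):
    """Rewrite one stored blob for the new layout.

    With real columns this would be a single ALTER TABLE; with blobs,
    every existing row has to be read, unpacked and written back.
    """
    return RECORD_V2.pack(*unpack_v1(blob), default)

values = [float(i) for i in range(20)]
blob = pack_v1(values)
upgraded = migrate_v1_to_v2(blob)
```

The blob does shorten the INSERT statement, but the schema now lives in application code and every reader has to agree on it.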
I can't fully speak to the performance of the SELECT; you'll need to test that. But I highly doubt it will cause any performance issues there, because you wouldn't be reading any more data than before. However, with regard to the INSERT, you should see a performance gain (of what size I'm unsure), because there will likely not be any statistical indexes to update. Of course that depends on a lot of settings, but I'm just throwing my opinion out there. This question is pretty subjective, and not nearly enough information is available to truly tell you whether you will see performance issues surrounding the change.
Now, in practice I'm going to say: leave it be unless you're seeing real performance issues. Further, if you are seeing real performance issues, analyze those before choosing this type of solution; there are probably other ways to fix them.
On sites like SO, I'm sure it's absolutely necessary to store as much aggregated data as possible to avoid performing all those complex queries/calculations on every page load. For instance, storing a running tally of the vote count for each question/answer, or storing the number of answers for each question, or the number of times a question has been viewed so that these queries don't need to be performed as often.
But does doing this go against db normalization, or any other standards/best-practices? And what is the best way to do this, e.g., should every table have another table for aggregated data, should it be stored in the same table it represents, when should the aggregated data be updated?
Thanks
Storing aggregated data is not itself a violation of any Normal Form. Normalization is concerned only with redundancies due to functional dependencies, multi-valued dependencies and join dependencies. It doesn't deal with any other kinds of redundancy.
The phrase to remember is "Normalize till it hurts, Denormalize till it works"
It means: normalise all your domain relationships (to at least Third Normal Form (3NF)). If you measure there is a lack of performance, then investigate (and measure) whether denormalisation will provide performance benefits.
So, yes. Storing aggregated data 'goes against' normalisation.
There is no 'one best way' to denormalise; it depends what you are doing with the data.
Denormalisation should be treated the same way as premature optimisation: don't do it unless you have measured a performance problem.
Too much normalization will hurt performance, so in the real world you have to find your balance.
I've handled a situation like this in two ways.
1) Using DB2, I used an MQT (Materialized Query Table), which works like a view except that it is populated by a query and you can schedule how often it refreshes, e.g. every 5 minutes. That table then stored the count values.
2) In the software package itself, I set information like that as a system variable. So in Apache you can set a system-wide variable and refresh it every 5 minutes. Then it's somewhat accurate, but you're only running your "count(*)" query once every five minutes. You can have a daemon run it or have it driven by page requests.
I used a wrapper class to do it, so it's been a while, but I think in PHP it was as simple as:
$_SERVER['report_page_count'] = array('timeout'=>1234569783, 'count'=>15);
Nonetheless, however you store that single value it saves you from running it with every request.
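The same "refresh every five minutes, driven by page requests" idea can be sketched in Python as follows. The CachedCount class, its parameters and the expensive_count stand-in are hypothetical; the injectable clock exists only so the expiry can be exercised without waiting.

```python
import time

class CachedCount:
    """Recompute an expensive value at most once per ttl_seconds."""

    def __init__(self, compute, ttl_seconds=300, clock=time.monotonic):
        self._compute = compute
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None  # None means "never computed yet"

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Stale or missing: run the expensive query once and remember it.
            self._value = self._compute()
            self._expires_at = now + self._ttl
        return self._value

calls = []

def expensive_count():
    # Stand-in for a "SELECT COUNT(*) ..." query.
    calls.append(1)
    return 15

fake_now = [0.0]
counter = CachedCount(expensive_count, ttl_seconds=300, clock=lambda: fake_now[0])

first = counter.get()    # computes
second = counter.get()   # served from the cached value
fake_now[0] = 301.0      # pretend just over five minutes have passed
third = counter.get()    # computes again
```

In a multi-process web server each process would hold its own copy, so a shared store would be needed for a truly system-wide value.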
I have a large data set, which I want to query. The query does not change, but the underlying data does. From what I read, I could construct a "view" and query it. I also read that CouchDB knows how to update the view when the data changes, so I assume querying the view again would still be fast.
My questions are: do I understand CouchDB's views correctly? I don't need any other feature of CouchDB, I don't even need SQL; all I want is a fast, fixed query over changing data. Could I use something else? If I used, say, good old MySQL, would it really be slower than CouchDB (read: in the above scenario, how would various DBs approximately perform)?
Your assessment is completely correct. Enjoy!
The only performance trick worth mentioning is that you may see a boost if you emit() all of the data you need from the view and never use the ?include_docs feature, because include_docs causes CouchDB to go back into the main database and retrieve the original doc that caused that view row. In other words, you can emit() everything you need into your view index (more space but faster), or you can use the reference back to the original document (less space but slower.)
I don't think anyone can answer your question given the information you have provided.
Indexes in a relational database are analogous to CouchDB views. In both cases, they store a pre-sorted instance of the data and the database keeps that instance in sync with the canonical data. Both types of database transparently use the index/view to speed up subsequent queries of the form that the index/view was designed for.
Without indexes/views, queries must scan the whole collection of n records and they execute in O(n) time. When a query benefits from an index/view, it executes in O(log n) time.
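The difference between the two curves can be illustrated with a toy Python sketch: a linear scan over unsorted-access records versus a binary search over a pre-sorted list of keys. This is only an analogy for what an index or view does (real databases typically use B-trees), and the record data and helper names are made up.

```python
from bisect import bisect_left

def scan_lookup(records, key):
    """Full scan: examine records one by one until the key is found."""
    steps = 0
    for k, value in records:
        steps += 1
        if k == key:
            return value, steps
    return None, steps

def index_lookup(sorted_keys, values, key):
    """Binary search over a pre-sorted key list, O(log n) comparisons."""
    i = bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return values[i]
    return None

# 1000 hypothetical records with even keys 0, 2, ..., 1998.
records = [(k, f"doc-{k}") for k in range(0, 2000, 2)]
sorted_keys = [k for k, _ in records]
values = [v for _, v in records]

scan_value, scan_steps = scan_lookup(records, 1998)    # touches all 1000 records
index_value = index_lookup(sorted_keys, values, 1998)  # about 10 comparisons
missing = index_lookup(sorted_keys, values, 1999)      # key not present
```

Keeping sorted_keys in sync as records change is the cost the database pays on writes in exchange for the faster reads.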
But that's speaking very broadly of the performance curve with respect to the volume of data. A given database could have such speedy performance in certain cases that it out-performs another product no matter what. It's hard to make generalizations that brand X is always faster than brand Y. The only way to be sure about a specific case is to try that case in both databases and measure the performance.
I have an interesting dilemma. I have a very expensive query that involves doing several full table scans and expensive joins, as well as calling out to a scalar UDF that calculates some geospatial data.
The end result is a resultset that contains data that is presented to the user. However, I can't return everything I want to show the user in one call, because I subdivide the original resultset into pages and just return a specified page, and I also need to take the original entire dataset, and apply group by's and joins etc to calculate related aggregate data.
Long story short, in order to bind all of the data I need to the UI, this expensive query needs to be called about 5-6 times.
So, I started thinking about how I could calculate this expensive query once, and then each subsequent call could somehow pull against a cached result set.
I hit upon the idea of abstracting the query into a stored procedure that would take in a CacheID (Guid) as a nullable parameter.
This sproc would insert the resultset into a cache table using the cacheID to uniquely identify this specific resultset.
This allows sprocs that need to work on this resultset to pass in a cacheID from a previous query and it is a simple SELECT statement to retrieve the data (with a single WHERE clause on the cacheID).
Then, using a periodic SQL job, flush out the cache table.
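A stripped-down Python sketch of this scheme, using sqlite3 in place of SQL Server stored procedures, might look like this. The places and result_cache tables, the stand-in "expensive" query and all function names are hypothetical.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE places (id INTEGER PRIMARY KEY, region TEXT, score REAL);
CREATE TABLE result_cache (
    cache_id TEXT, row_num INTEGER,
    place_id INTEGER, region TEXT, score REAL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (cache_id, row_num)
);
""")
conn.executemany("INSERT INTO places VALUES (?, ?, ?)",
                 [(i, "north" if i % 2 else "south", float(i)) for i in range(1, 11)])
conn.commit()

def run_expensive_query(conn, cache_id=None):
    """Return a cache_id, running the expensive query only when none is given."""
    if cache_id is not None:
        return cache_id  # caller already has a cached result set
    cache_id = str(uuid.uuid4())
    with conn:
        # Stand-in for the real scans, joins and UDF calls.
        rows = conn.execute(
            "SELECT id, region, score FROM places ORDER BY score DESC").fetchall()
        conn.executemany(
            "INSERT INTO result_cache (cache_id, row_num, place_id, region, score) "
            "VALUES (?, ?, ?, ?, ?)",
            [(cache_id, n, *row) for n, row in enumerate(rows)])
    return cache_id

def get_page(conn, cache_id, page, page_size):
    """Read one page from the cached result set."""
    return conn.execute(
        "SELECT place_id FROM result_cache WHERE cache_id = ? "
        "ORDER BY row_num LIMIT ? OFFSET ?",
        (cache_id, page_size, page * page_size)).fetchall()

def region_totals(conn, cache_id):
    """Aggregate over the same cached rows without rerunning the query."""
    return dict(conn.execute(
        "SELECT region, COUNT(*) FROM result_cache WHERE cache_id = ? GROUP BY region",
        (cache_id,)).fetchall())

def flush_cache(conn, max_age_seconds):
    """Periodic cleanup: delete cached rows older than max_age_seconds."""
    with conn:
        conn.execute(
            "DELETE FROM result_cache WHERE created_at <= datetime('now', ?)",
            (f"-{max_age_seconds} seconds",))

cache_id = run_expensive_query(conn)
first_page = get_page(conn, cache_id, 0, 3)
totals = region_totals(conn, run_expensive_query(conn, cache_id))
flush_cache(conn, 3600)
remaining = conn.execute("SELECT COUNT(*) FROM result_cache").fetchone()[0]
```

Every cached result set is written once and read several times, so the composite key on (cache_id, row_num) carries most of the read load in this sketch.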
This works great, and really speeds things up on zero load testing. However, I am concerned that this technique may cause an issue under load with massive amounts of reads and writes against the cache table.
So, long story short, am I crazy? Or is this a good idea?
Obviously I need to be worried about lock contention, and index fragmentation, but anything else to be concerned about?
I have done that before, especially when I did not have the luxury of editing the application. I think it's a valid approach sometimes, but in general having a cache or distributed cache in the application is preferred, because it reduces the load on the DB more and scales better.
The tricky thing with the naive "just do it in the application" solution is that many times you have multiple applications interacting with the DB, which can put you in a bind if you have no application messaging bus (or something like memcached), because it can be expensive to have one cache per application.
Obviously, for your problem the ideal solution is to be able to do the paging in a cheaper manner, and not need to churn through ALL the data just to get page N. But sometimes it's not possible. Keep in mind that streaming data out of the DB can be cheaper than streaming data out of the DB back into the same DB. You could introduce a new service that is responsible for executing these long queries and then have your main application talk to the DB via the service.
Your tempdb could balloon like crazy under load, so I would watch that. It might be easier to put the expensive joins in a view and index the view than trying to cache the table for every user.