improve database querying in ms sql - sql-server

what's a fast way to query large amounts of data (between 10.000 - 100.000, it will get bigger in the future ... maybe 1.000.000+) spread across multiple tables (20+) that involves left joins, functions (sum, max, count,etc.)?
my solution would be to make one table that contains all the data i need and have triggers that update this table whenever one of the other tables gets updated. i know that trigger aren't really recommended, but this way i take the load off the querying. or do one big update every night.
i've also tried with views, but once it starts involving left joins and calculations it's way too slow and times out.

Since your question is too general, here's a general answer...
The path you're taking right now is optimizing a single query/single issue. Sure, it might solve the issue you have right now, but it's usually not very good in the long run (not to mention the cumulative cost of maintainance of such a thing).
The common path to take is to create an 'analytics' database - the real-time copy of your production database that you're going to query for all your reports. This analytics database can eventually be even a full blown DWH, but you're probably going to start with a simple real-time replication (or replicate nightly or whatever) and work from there...
As I said, the question/problem is too broad to be answered in a couple of paragraphs, these only some of the guidelines...

Need a bit more details, but I can already suggest this:
Use "with(nolock)", this will slightly improve the speed.
Reference: Effect of NOLOCK hint in SELECT statements
Use Indexing for your table fields for fetching data fast.
Reference: sql query to select millions record very fast

Related

How to Pre Stage Queries on SQL Server?

I've looked for tips of how to speed up sql intensive application and found this particularly useful link.
In point 6 he says
Do pre-stage data This is one of my favorite topics because it's an old technique that's often overlooked. If you have a report or a
procedure (or better yet, a set of them) that will do similar joins to
large tables, it can be a benefit for you to pre-stage the data by
joining the tables ahead of time and persisting them into a table. Now
the reports can run against that pre-staged table and avoid the large
join.
You're not always able to use this technique, but when you can, you'll
find it is an excellent way to save server resources.
Note that many developers get around this join problem by
concentrating on the query itself and creating a view-only around the
join so that they don't have to type the join conditions again and
again. But the problem with this approach is that the query still runs
for every report that needs it. By pre-staging the data, you run the
join just once (say, 10 minutes before the reports) and everyone else
avoids the big join. I can't tell you how much I love this technique;
in most environments, there are popular tables that get joined all the
time, so there's no reason why they can't be pre-staged
From what I understood you join the tables once and several SQL Queries can "benefit" from it. That looks extremely interesting for the application I'm working at.
The thing is, I've been looking for pre staging data around and couldn't find anything that seems to be related to that technique. Maybe I'm missing a few keywords ?
I'd like to know how to use the described technique within SQL Server. The link says it's an old technique so it shouldn't be a problem that I'm using SQL Server 2008.
What I would like is the following: I have several SELECT queries that run in a row. All of them join the same 7-8 tables and they're all really heavy, that impacts performance. So I'm thinking of joining them, run the queries and them drop this intermediate table. How/Can it be done ?
If your query meets the requirements for an indexed view then you can just create such an object materialising the result of your query. This means that it will always be up-to-date.
Otherwise you would need to write code to materialise it yourself. Either eagerly on a schedule or potentially on first request and then cached for some amount of time that you deem acceptable.
The second approach is not rocket science and can be done with a TRUNCATE ... INSERT ... SELECT for moderate amounts of data or perhaps an ALTER TABLE ... SWITCH if the data is large and there are concurrent queries that might be accessing the cached result set.
Regarding your edit it seems like you just need to create a #temp table and insert the results of the join into it though. Then reference the #temp table for the several SELECTs and drop it. There is no guarantee that this will improve performance though and insufficient details in the question to even hazard a guess.

Any suggestions for identifying what indexes need to be created?

I'm in a situation where I have to improve the performance of about 75 stored procedures (created by someone else) used for reporting. The first part of my solution was creating about 6 denormalized tables that will be used for the bulk of the reporting. Now that I've created the tables I have the somewhat daunting task of determining what Indexes I should create to best improve the performance of these stored procs.
I'm curious to see if anyone has any suggestions for finding what columns would make sense to include in the indexes? I've contemplated using Profiler/DTA, or possibly fasioning some sort of query like the one below to figure out the popular columns.
SELECT name, Count(so.name) as hits, so.xtype
from syscomments as sc
INNER JOIN sysobjects so ON sc.id=so.id
WHERE sc.text like '%ColumnNamme%'
AND xtype = 'P'
Group by name,so.xtype
ORDER BY hits desc
Let me know if you have any ideas that would help me not have to dig through these 75 procs by hand.
Also, inserts are only performed on this DB once per day so insert performance is not a huge concern for me.
Any suggestions for identifying what indexes need to be created?
Yes! Ask Sql Server to tell you.
Sql Server automatically keeps statistics for what indexes it can use to improve performance. This is already going on in the background for you. See this link:
http://msdn.microsoft.com/en-us/library/ms345417.aspx
Try running a query like this (taken right from msdn):
SELECT mig.*, statement AS table_name,
column_id, column_name, column_usage
FROM sys.dm_db_missing_index_details AS mid
CROSS APPLY sys.dm_db_missing_index_columns (mid.index_handle)
INNER JOIN sys.dm_db_missing_index_groups AS mig ON mig.index_handle = mid.index_handle
ORDER BY mig.index_group_handle, mig.index_handle, column_id;
Just be careful. I've seen people take the missing index views as Gospel, and use them to push out a bunch of indexes they don't really need. Indexes have costs, in terms of upkeep at insert, update, and delete time, as well as disk space and memory use. To make real, accurate use of this information you want to profile actual execution times of your key procedures both before and after any changes, to make sure the benefits of an index (singly or cumulative) aren't outweighed by the costs.
If you know all of the activity is coming from the 75 stored procedures then I would use profiler to track which stored procedures take the longest and are called the most. Once you know which ones are then look at those procs and see what columns are being used most often in the Where clause and JOIN ON sections. Most likely, those are the columns you will want to put non-clustered indexes on. If a set of columns are often times used together then there is a good chance you will want to make 1 non-clustered index for the group. You can have many non-clustered indexes on a table (250) but you probably don't want to put more than a handful on it. I think you will find the data is being searched and joined on the same columns over and over. Remember the 80/20 rule. You will probably get 80% of your speed increases in the first 20% of the work you do. There will be a point where you get very little speed increase for the added indexes, that is when you want to stop.
I concur with bechbd - use a good sample of your database traffic (by running a server trace on a production system during real office hours, to get the best snapshot), and let the Database Tuning Advisor analyze that sampling.
I agree with you - don't blindly rely on everything the Database Tuning Advisor tells you to do - it's just a recommendation, but the DTA can't take everything into account. Sure - by adding indices you can speed up querying - but you'll slow down inserts and updates at the same time.
Also - to really find out if something helps, you need to implement it, measure again, and compare - that's really the only reliable way. There are just too many variables and unknowns involved.
And of course, you can use the DTA to fine-tune a single query to perform outrageously well - but that might neglect the fact that this query is only ever called one per week, or that by tuning this one query and adding an index, you hurt other queries.
Index tuning is always a balance, a tradeoff, and a trial-and-error kind of game - it's not an exact science with a formula and a recipe book to strictly determine what you need.
You can use SQL Server profiler in SSMS to see what and how your tables are being called then using the Database Tuning Tool in profiler to at least start you down the correct path. I know most DBA's will probably scream at me for recommending this but for us non-DBA types such as myself it at least gives us a starting point.
If this is strictly a reporting database and you need performance, consider moving to a data warehouse design. A star or snowflake schema will outperform even a denormalized relational design when it comes to reporting.

Oracle Multiple Schemas Aggregate Real Time View

All,
Looking for some guidance on an Oracle design decision I am currently trying to evaluate:
The problem
I have data in three separate schemas on the same oracle db server. I am looking to build an application that will show data from all three schemas, however the data that is shown will be based on real time sorting and prioritisation rules that is applied to the data globally (i.e.: based on the priority weightings applied I may pull back data from any one of the three schemas).
Tentative Solution
Create a VIEW in the DB which maintains logical links to the relevant columns in the three schemas, write a stored procedure which accepts parameterised priority weightings. The application subsequently calls the stored procedure to select the ‘prioritised’ row from the view and then queries the associated schema directly for additional data based on the row returned.
I have concerns over performance where the data is being sorted/ prioritised upon each query being performed but cannot see a way around this as the prioritisation rules will change often. We are talking of data sets in the region of 2-3 million rows per schema.
Does anyone have alternative suggestions on how to provide an aggregated and sorted view over the data?
Querying from multiple schemas (or even multiple databases) is not really a big deal, even inside the same query. Just prepend the table name with the schema you are interested in, as in
SELECT SOMETHING
FROM
SCHEMA1.SOME_TABLE ST1, SCHEMA2.SOME_TABLE ST2
WHERE ST1.PK_FIELD = ST2.PK_FIELD
If performance becomes a problem, then that is a big topic... optimal query plans, indexes, and your method of database connection can all come into play. One thing that comes to mind is that if it does not have to be realtime, then you could use materialized views (aka "snapshots") to cache the data in a single place. Then you could query that with reasonable performance.
Just set the snapshots to refresh at an interval appropriate to your needs.
It doesn't matter that the data is from 3 schemas, really. What's important to know is how frequently the data will change, how often the criteria will change, and how frequently it will be queried.
If there is a finite set of criteria (that is, the data will be viewed in a limited number of ways) which only change every few days and it will be queried like crazy, you should probably look at materialized views.
If the criteria is nearly infinite, then there's no point making materialized views since they won't likely be reused. The same holds true if the criteria itself changes extremely frequently, the data in a materialized view wouldn't help in this case either.
The other question that's unanswered is how often the source data is updated, and how important is it to have the newest information. Frequently updated source day can either mean a materialized view will get "stale" for some duration or you may be spending a lot of time refreshing the materialized views unnecessarily to keep the data "fresh".
Honestly, 2-3 million records isn't a lot for Oracle anymore, given sufficient hardware. I would probably benchmark simple dynamic queries first before attempting fancy (materialized) view.
As others have said, querying a couple of million rows in Oracle is not really a problem, but then that depends on how often you are doing it - every tenth of a second may cause some load on the db server!
Without more details of your business requirements and a good model of your data its always difficult to provide good performance ideas. It usually comes down to coming up with a theory, then trying it against your database and accessing if it is "fast enough".
It may also be worth you taking a step back and asking yourself how accurate the results need to be. Does the business really need exact values for this query or are good estimates acceptable
Tom Kyte (of Ask Tom fame) always has some interesting ideas (and actual facts) in these areas. This article describes generating a proper dynamic search query - but Tom points out that when you query Google it never tries to get the exact number of hits for a query - it gives you a guess. If you can apply a good estimate then you can really improve query performance times

Limits on number of Rows in a SQL Server Table

Are there any hard limits on the number of rows in a table in a sql server table? I am under the impression the only limit is based around physical storage.
At what point does performance significantly degrade, if at all, on tables with and without an index. Are there any common practicies for very large tables?
To give a little domain knowledge, we are considering usage of an Audit table which will log changes to fields for all tables in a database and are wondering what types of walls we might run up against.
You are correct that the number of rows is limited by your available storage.
It is hard to give any numbers as it very much depends on your server hardware, configuration, and how efficient your queries are.
For example, a simple select statement will run faster and show less degradation than a Full Text or Proximity search as the number of rows grows.
BrianV is correct. It's hard to give a rule because it varies drastically based on how you will use the table, how it's indexed, the actual columns in the table, etc.
As to common practices... for very large tables you might consider partitioning. This could be especially useful if you find that for your log you typically only care about changes in the last 1 month (or 1 day, 1 week, 1 year, whatever). You could then archive off older portions of the data so that it's available if absolutely needed, but won't be in the way since you will almost never actually need it.
Another thing to consider is to have a separate change log table for each of your actual tables if you aren't already planning to do that. Using a single log table makes it VERY difficult to work with. You usually have to log the information in a free-form text field which is difficult to query and process against. Also, it's difficult to look at data if you have a row for each column that has been changed because you have to do a lot of joins to look at changes that occur at the same time side by side.
In addition to all the above, which are great reccomendations I thought I would give a bit more context on the index/performance point.
As mentioned above, it is not possible to give a performance number as depending on the quality and number of your indexes the performance will differ. It is also dependent on what operations you want to optimize. Do you need to optimize inserts? or are you more concerned about query response?
If you are truly concerned about insert speed, partitioning, as well a VERY careful index consideration is going to be key.
The separate table reccomendation of Tom H is also a good idea.
With audit tables another approach is to archive the data once a month (or week depending on how much data you put in it) or so. That way if you need to recreate some recent changes with the fresh data, it can be done against smaller tables and thus more quickly (recovering from audit tables is almost always an urgent task I've found!). But you still have the data avialable in case you ever need to go back farther in time.

How to gain performance when maintaining historical and current data?

I want to maintain last ten years of stock market data in a single table. Certain analysis need only data of the last one month data. When I do this short term analysis it takes a long time to complete the operation.
To overcome this I created another table to hold current year data alone. When I perform the analysis from this table it 20 times faster than the previous one.
Now my question is:
Is this the right way to have a separate table for this kind of problem. (Or we use separate database instead of table)
If I have separate table Is there any way to update the secondary table automatically.
Or we can use anything like dematerialized view or something like that to gain performance.
Note: I'm using Postgresql database.
You want table partitioning. This will automatically split the data between multiple tables, and will in general work much better than doing it by hand.
I'm working on near the exact same issue.
Table partitioning is definitely the way to go here. I would segment by more than year though, it would give you a greater degree of control. Just set up your partitions and then constrain them by months (or some other date). In your postgresql.conf you'll need to turn constraint_exclusion=on to really get the benefit. The additional benefit here is that you can only index the exact tables you really want to pull information from. If you're batch importing large amounts of data into this table, you may get slightly better results a Rule vs a Trigger and for partitioning, I find rules easier to maintain. But for smaller transactions, triggers are much faster. The postgresql manual has a great section on partitioning via inheritance.
I'm not sure about PostgreSQL, but I can confirm that you are on the right track. When dealing with large data volumes partitioning data into multiple tables and then using some kind of query generator to build your queries is absolutely the right way to go. This approach is well established in Data Warehousing, and specifically in your case stock market data.
However, I'm curious why do you need to update your historical data? If you're dealing with stock splits, it's common to implement that using a seperate multiplier table that is used in conjunction with the raw historical data to give an accurate price/share.
it is perfectly sensible to use separate table for historical records. It's much more problematic with separate database, as it's not simple to write cross-database queries
automatic updates - it's a tool for cronjob
you can use partial indexes for such things - they do wonderful job
Frankly, you should check your execution plans and try fixing your queries or indexing before taking more radical steps.
Indexing comes at very little cost (unless you do a lot of insertions) and your existing code will be faster (if you index properly) without modifying it.
Other measures such as partioning come after that...

Resources