Breaking join in Case Of Large Tables


I just need a general approach.
Suppose we are joining 7 tables, of which 5 are huge and 2 are small.
If we execute the query in one go, it takes a very long time.
So what is the basic approach for breaking up the query?
Each table has only one join condition to the next table.
Is there any approach that can help us optimize the joins and get a better execution plan?

Related

SQL Server query improvement suggestion, aggregation of large amount of data in a query

My BI developer wrote a query that took 14 hours to run and I'm trying to help him out. At a high level, it's a query that explores the financial transactions of the past 15 years and breaks them down by quarter.
I'm sharing the recommendations I already gave him here, but I would like to know if you have any suggestions on where we can explore and research further to improve performance, with answers such as: "perhaps you may want to look at snapshot.."
His query consists of:
Multiple nested views, meaning one view selects from another view, and so on.
Some views join three tables, each with around 100 - 200 million rows.
Certain views use subselect queries.
Here are my recommendations so far:
Do not use nested views to produce the query. Instead of views, create new tables for each of them, because the data (financial transaction data) is not dynamic and won't change. In my experience nested views aren't good for performance.
Do not use subqueries; use JOIN whenever possible.
Make sure he creates nonclustered indexes wherever appropriate.
Do not use temp tables when there is this much data.
Try to use WITH (NOLOCK) on all tables used in the JOIN.
Find a common query and convert it into a stored procedure.
When joining those three large tables (100 - 200 million rows), try to limit the amount of data at the JOIN instead of in the WHERE clause. For example, instead of SELECT * FROM tableA JOIN tableB WHERE ..., use SELECT * FROM tableA JOIN tableB ON ... AND tableA.date BETWEEN range. This will give less data when joining with other tables later in the query.
The problem is that the data he has to work with is huge. I wonder whether query tuning can only do so much, because at the end of the day you still have to process all that data in your query. Perhaps the next step is to think about how to prepare the data and store it in smaller tables first, such as CostQ1_2010, CostQ2_2020, etc., and then write the query against those tables.

You have given us very little information to go on. Tolstoy wrote, "All happy families are alike; each unhappy family is unhappy in its own way." That's also true of SQL queries, especially big BI queries.
I'll risk some general answers.
With tables of the size you mention, your query surely contains date-range WHERE filters like transaction_date >= something AND transaction_date < anotherthing. In general, a useful report covers a year out of a decade of transactions. So make sure you have the right indexes to do index range scans where possible. SSMS, if you choose the Include Actual Execution Plan feature, sometimes suggests indexes.
Learn to read execution plans.
Read about covering indexes. They can sometimes make a big difference.
Use the statement SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED before starting this kind of long-running historical BI query. You'll get less interference between the BI query and other activity on the database, at the cost of possibly reading uncommitted (dirty) rows.
It may make sense to preload some denormalized tables from the views used in the BI query.
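As a rough illustration of the date-range and covering-index advice above, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a self-contained stand-in for SQL Server, and the table txn, its columns, and the index name are all hypothetical. It asks the engine for its plan before and after adding an index that contains every column the query needs.

```python
import sqlite3

# ASSUMPTION: SQLite stands in for SQL Server; table and index names are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE txn (id INTEGER PRIMARY KEY, transaction_date TEXT, amount REAL, note TEXT)"
)
# Ten years of synthetic transactions, dated 2010 through 2019.
rows = [(f"20{10 + i % 10:02d}-{1 + i % 12:02d}-15", float(i), "x") for i in range(5000)]
conn.executemany("INSERT INTO txn (transaction_date, amount, note) VALUES (?, ?, ?)", rows)

# A typical report: one year out of the decade, filtered with a half-open date range.
QUERY = """
SELECT transaction_date, SUM(amount)
FROM txn
WHERE transaction_date >= '2015-01-01' AND transaction_date < '2016-01-01'
GROUP BY transaction_date
"""

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

plan_before = plan(QUERY)  # no usable index yet, so the whole table is scanned
conn.execute("CREATE INDEX ix_txn_date_amount ON txn (transaction_date, amount)")
plan_after = plan(QUERY)   # range search answered from the index alone

print(plan_before)
print(plan_after)
```

In SQL Server the comparable construct would be a nonclustered index on the date column with the extra columns added through INCLUDE; the actual execution plan is where you would check whether it is used.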

Updating a table with INSERT vs UPDATE (performance vs clearness)

I update my big table incrementally. My big table is the result of a join of many tables. There are so many joins of so many tables in my query that it has become maze-like, and I ran into a problem while adding another join to get another column into the table.
Do you recommend making the query even more complex by adding another join, all in order to do a single insert into the table? Or do you recommend splitting the query into two queries, one insert and then one update, where the second query just updates the column based on a query?
I could imagine that one insert of all columns could be faster than a small insert of a few columns followed by an update of the remaining columns. But the number of tables in a join has its limits, at least in terms of clarity. What are the best practices here?

Have you heard of the MERGE statement? Handles inserts/updates/deletes in the same query. Aaron Bertrand has an article talking about the pitfalls of it but overall, it helps simplify when you have to do any of those operations: http://www.mssqltips.com/sqlservertip/3074/use-caution-with-sql-servers-merge-statement/
As for your question, maintenance-wise and index tuning wise, it would be much more advantageous to break those two operations out. Adding another join to a table does not make a query complex. That is how an RDBMS is supposed to function. Well commented code and concise naming standards should make any joins clear to someone reading the query.
If adding another join complicates the update/insert process, then yes, break them out. If it is just a matter of deciphering what the query is doing but the operation is fine, give it some context with comments.
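To make the two options concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the real database (an assumption made only so the example runs on its own). The tables src_a, src_b, big_one_step and big_two_step are hypothetical. It loads the same target once with a single INSERT that carries the extra join, and once with an INSERT followed by an UPDATE, then reads both back for comparison.

```python
import sqlite3

# ASSUMPTION: SQLite stands in for the real DBMS; all table names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src_a (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE src_b (id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE big_one_step (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE big_two_step (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
INSERT INTO src_a VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
INSERT INTO src_b VALUES (1, 'Oslo'), (3, 'Lima');
""")

# Option 1: one INSERT whose SELECT includes the additional join.
conn.execute("""
INSERT INTO big_one_step (id, name, city)
SELECT a.id, a.name, b.city
FROM src_a a LEFT JOIN src_b b ON b.id = a.id
""")

# Option 2: a simpler INSERT, then a separate UPDATE that fills the extra column.
conn.execute("INSERT INTO big_two_step (id, name) SELECT id, name FROM src_a")
conn.execute("""
UPDATE big_two_step
SET city = (SELECT b.city FROM src_b b WHERE b.id = big_two_step.id)
""")

one_step = conn.execute("SELECT * FROM big_one_step ORDER BY id").fetchall()
two_step = conn.execute("SELECT * FROM big_two_step ORDER BY id").fetchall()
print(one_step == two_step)
```

Both routes end with the same rows; which one is cheaper on a large table is something to measure, since the second route touches every target row twice.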
Asking for best practices on something like this can tend to have people say 'it depends'. It is kind of a broad subject and can be handled multiple ways, depending on what your situation is like. Someone with a small row count that isn't expecting it to get large could get away with a somewhat expensive operation. Someone with a large row count (read millions-billions) would most likely be looking for a new job if they brought the server to its knees all the time.

Is it better to use one complex query or several simpler ones?

Which option is better:
Writing a very complex query with a large number of joins, or
Writing 2 queries one after the other, applying the result set of the first query to the second.

Generally, one query is better than two, because the optimizer has more information to work with and may be able to produce a more efficient query plan than for either query separately. Additionally, using two (or more) queries typically means you'll be running the second query multiple times, and the DBMS might have to generate the query plan for the query repeatedly (but not if you prepare the statement and pass the parameters as placeholders when the query is (re)executed). A single query also means fewer back-and-forth exchanges between the program and the DBMS. If your DBMS is on a server on the other side of the world (or country), this can be a big factor.
Arguing against combining the two queries, you might end up shipping a lot of repetitive data between the DBMS and the application. If each of 10,000 rows in table T1 is joined with an average of 30 rows from table T2 (so there are 300,000 rows returned in total), then you might be shipping a lot of data repeatedly back to the client. If the row size of (the relevant projection of) T1 is relatively small and the data from T2 is relatively large, then this doesn't matter. If the data from T1 is large and the data from T2 is small, then this may matter; measure before deciding.
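The trade-off above can be put into numbers with a small sketch. The row counts (10,000 rows in T1, 30 matching T2 rows each) come from the text; the byte widths are hypothetical, and the model ignores protocol overhead and the join key that the second query would also need to return.

```python
def bytes_shipped(t1_rows, t2_per_t1, t1_width, t2_width):
    """Return (joined, separate) payload sizes in bytes.

    joined:   one query returning T1 columns and T2 columns on every joined row
    separate: T1 rows fetched once, then the matching T2 rows fetched on their own
    """
    joined_rows = t1_rows * t2_per_t1
    joined = joined_rows * (t1_width + t2_width)             # T1 columns repeat per match
    separate = t1_rows * t1_width + joined_rows * t2_width   # each row shipped once
    return joined, separate

# Narrow T1, wide T2: the repetition hardly matters.
narrow_t1 = bytes_shipped(10_000, 30, t1_width=20, t2_width=2_000)
# Wide T1, narrow T2: the joined result ships the T1 data 30 times over.
wide_t1 = bytes_shipped(10_000, 30, t1_width=2_000, t2_width=20)

for label, (joined, separate) in (("narrow T1", narrow_t1), ("wide T1", wide_t1)):
    print(f"{label}: joined={joined:,} separate={separate:,} ratio={joined / separate:.1f}")
```

With these assumed widths the joined query ships about the same volume in the first case and more than twenty times as much in the second, which is the situation where measuring before deciding pays off.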

When I was a junior DB person I once worked for a year in a marketing dept where I had so much free time I did each task 2 or 3 different ways. I made a habit of writing one mega-select that grabbed everything in one go and comparing it to a script that built interim tables of selected primary keys and then once I had the correct keys went and got the data values.
In almost every case the second method was faster. The cases where it wasn't were when dealing with a small number of small tables. Where it was most noticeably faster was, of course, with large tables and multiple joins.
I got into the habit of selecting the required primary keys from tableA, selecting the required primary keys from tableB, etc., then joining them and selecting the final set of primary keys. I then used the selected primary keys to go back to the tables and get the data values.
As a DBA I now understand that this method resulted in less purging of the data cache and played nicer with others using the DB (as mentioned by Amir Raminfar).
It does, however, require the use of temporary tables, which some places / DBAs don't like (unfairly, in my mind).
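The keys-first habit described above might look roughly like this. It is a minimal sketch with Python's built-in sqlite3 module standing in for the real server, and the customers/orders schema and the keys_* temp tables are hypothetical. It runs the one-shot "mega-select" and the staged version side by side so the results can be compared.

```python
import sqlite3

# ASSUMPTION: SQLite stands in for the real DBMS; schema and names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                     order_year INTEGER, total REAL);
""")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(i, "EU" if i % 2 else "US", f"customer {i}") for i in range(1, 101)],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(i, 1 + i % 100, 2009 + i % 3, i * 1.5) for i in range(1, 1001)],
)

# Approach 1: one select that filters and fetches everything in one go.
mega = conn.execute("""
SELECT o.order_id, c.name, o.total
FROM orders o JOIN customers c ON c.customer_id = o.customer_id
WHERE c.region = 'EU' AND o.order_year = 2010
ORDER BY o.order_id
""").fetchall()

# Approach 2: collect only the qualifying keys per table, join the key sets,
# then go back to the base tables for the data values.
conn.executescript("""
CREATE TEMP TABLE keys_customers AS
    SELECT customer_id FROM customers WHERE region = 'EU';
CREATE TEMP TABLE keys_orders AS
    SELECT order_id, customer_id FROM orders WHERE order_year = 2010;
CREATE TEMP TABLE keys_final AS
    SELECT ko.order_id, ko.customer_id
    FROM keys_orders ko JOIN keys_customers kc ON kc.customer_id = ko.customer_id;
""")
staged = conn.execute("""
SELECT o.order_id, c.name, o.total
FROM keys_final k
JOIN orders o ON o.order_id = k.order_id
JOIN customers c ON c.customer_id = k.customer_id
ORDER BY o.order_id
""").fetchall()

print(len(mega), staged == mega)
```

The sketch only shows that the two routes return the same rows. Whether the staged route is faster depends on the engine, the indexes and the data volume, so it needs to be timed on the real tables.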

It depends a lot on the actual query and the actual database, e.g. SQL Server, Oracle, MySQL.

At large companies, they prefer option 2 because option 1 will hog the database CPU. This results in all other connections being slow and everything becoming a bottleneck. That being said, it all depends on your data and the amount you are joining. If you are joining 10000 rows to 1000 rows, then in the worst case, where every row matches every other row, you are going to get back 10000 x 1000 records.
Possible duplicate MySQL JOIN Abuse? How bad can it get?

Assuming "better" means "faster", you can easily test these scenarios in a JUnit test. Note that a determining factor that you may not be able to get from a unit test is network latency. If the database sits right next to the machine where you run the unit test, you may see no difference in performance that can be attributed to the network. If your production servers are in another town, country, or continent from the database, network traffic becomes more of a bottleneck. You do not want to go back and forth across the wire; you more likely want to make one round trip and get everything at once.
Again, it all depends :)

It could depend on many things:
the indexes you have set up,
how many tables,
what the actual query is,
how big the data set is,
what the underlying DB is,
what table engine you are using.
The best thing to do would probably be to test both methods on a variety of test data and see which one bottlenecks.
If you are using MySQL (Oracle has a similar EXPLAIN PLAN FOR statement), you can use
EXPLAIN SELECT .....
and it will give you a lot of info on how it will execute the query, and therefore how you can improve it.

Your first gut feeling on this SqlServer design question

We have 2 tables. One holds measurements, the other one holds timestamps (one for every minute).
Every measurement holds a FK to a timestamp.
We have 8M (million) measurements and 2M timestamps.
We are creating a report database via replication, and my first solution was this: when a new measurement is received via the replication process, look up the right timestamp and add it to the measurement table.
Yes, it's duplication of data, but it is for reporting, and since we have measurements every 5 minutes and users can query for yearly data (about 105,000 measurements), we have to optimize for speed.
But a co-developer said: you don't have to do that, we'll just query with a join (on the two tables), SqlServer is so fast, you don't see the difference.
My first reaction was: a join on two tables with 8M and 2M records can't make 'no difference'.
What is your first feeling on this?
EDIT:
new measurements: 400 records per 5 minutes
EDIT 2:
maybe the question is not so clear:
The first solution is to get the data from the timestamp table and copy it to the measurement table when the measurement record is inserted.
In that case we have an extra action when the record is inserted AND an extra (duplicated) timestamp value. In this case we only query ONE table because it holds all the data.
The second solution is to join the two tables in a query.

With the proper index the join will make no difference*. My initial thought is that if the report is querying over the entire dataset, the joins might actually be faster because there are literally 6 million fewer timestamps that it has to read from the disk.
*This is just a guess based on my experience with tables with millions of records. Your results will vary based on your queries.

I'd create an Indexed View (similar to a Materialized view in Oracle) which joins the tables using appropriate indexes.

If the query just retrieves the data for the given date ranges, there will be a merge join, that is, a range scan for each of the two tables. Since the timestamp table presumably contains only the timestamp, this shouldn't be expensive.
On the other hand, if you have only one table and an index on the date column, the index itself becomes larger and more expensive to scan.
So, with properly constructed indexes and queries I wouldn't expect a significant difference in performance.
I'd suggest keeping a properly normalized design until you start having performance problems that force you to change it. And then you need to carefully analyze query plans and measure performance with different options; there are lots of things that could matter in your particular case.

Frankly, in this case your best bet is to try both solutions and see which one is better. Performance tuning is an art when you start talking about large data sets, and it is highly dependent not only on your database design but also on the hardware, whether you are using partitioning, etc. Be sure to test both getting the data out and putting the data in. Since you have so many inserts, insert speed is critical, and the index you would need on the datetime field is critical to select performance, so you really need to test this thoroughly. Don't forget to clear the cache when you test. And test multiple times and, if possible, under a typical query load.

How do I troubleshoot performance problems with an Oracle SQL statement

I have two insert statements, almost exactly the same, which run in two different schemas on the same Oracle instance. What the insert statement looks like doesn't matter - I'm looking for a troubleshooting strategy here.
Both schemas have 99% the same structure. A few columns have slightly different names, other than that they're the same. The insert statements are almost exactly the same. The explain plan on one gives a cost of 6, the explain plan on the other gives a cost of 7. The tables involved in both sets of insert statements have exactly the same indexes. Statistics have been gathered for both schemas.
One insert statement inserts 12,000 records in 5 seconds.
The other insert statement inserts 25,000 records in 4 minutes 19 seconds.
The number of records being inserted is correct. It's the vast disparity in execution times that confuses me. Given that nothing stands out in the explain plan, how would you go about determining what's causing this disparity in runtimes?
(I am using Oracle 10.2.0.4 on a Windows box).
Edit: The problem ended up being an inefficient query plan, involving a Cartesian merge join which didn't need to be done. Judicious use of index hints and a hash join hint solved the problem. It now takes 10 seconds. SQL Trace / TKPROF gave me the direction, as it showed me how many seconds each step in the plan took and how many rows were being generated. TKPROF showed me:
   Rows  Row Source Operation
-------  ---------------------------------------------------
  23690  NESTED LOOPS OUTER (cr=3310466 pr=17 pw=0 time=174881374 us)
  23690   NESTED LOOPS (cr=3310464 pr=17 pw=0 time=174478629 us)
2160900    MERGE JOIN CARTESIAN (cr=102 pr=0 pw=0 time=6491451 us)
   1470     TABLE ACCESS BY INDEX ROWID TBL1 (cr=57 pr=0 pw=0 time=23978 us)
   8820      INDEX RANGE SCAN XIF5TBL1 (cr=16 pr=0 pw=0 time=8859 us)(object id 272041)
2160900     BUFFER SORT (cr=45 pr=0 pw=0 time=4334777 us)
   1470      TABLE ACCESS BY INDEX ROWID TBL1 (cr=45 pr=0 pw=0 time=2956 us)
   8820       INDEX RANGE SCAN XIF5TBL1 (cr=10 pr=0 pw=0 time=8830 us)(object id 272041)
  23690    MAT_VIEW ACCESS BY INDEX ROWID TBL2 (cr=3310362 pr=17 pw=0 time=235116546 us)
  96565     INDEX RANGE SCAN XPK_TBL2 (cr=3219374 pr=3 pw=0 time=217869652 us)(object id 272084)
      0   TABLE ACCESS BY INDEX ROWID TBL3 (cr=2 pr=0 pw=0 time=293390 us)
      0    INDEX RANGE SCAN XIF1TBL3 (cr=2 pr=0 pw=0 time=180345 us)(object id 271983)
Notice the rows where the operations are MERGE JOIN CARTESIAN and BUFFER SORT. Things that keyed me into looking at this were the number of rows generated (over 2 million!), and the amount of time spent on each operation (compare to other operations).
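Spotting such rows by eye gets harder as plans grow, so here is a minimal sketch of how the row-source lines could be ranked mechanically. The parser and its regular expression are my own illustration, not part of TKPROF, and only a handful of the lines above are embedded as sample input. It also checks that the two 1470-row TBL1 inputs multiply to exactly the 2,160,900 rows produced by the Cartesian join.

```python
import re

# A few of the row-source lines from the TKPROF output above, copied verbatim.
ROW_SOURCE_LINES = """\
23690 NESTED LOOPS (cr=3310464 pr=17 pw=0 time=174478629 us)
2160900 MERGE JOIN CARTESIAN (cr=102 pr=0 pw=0 time=6491451 us)
1470 TABLE ACCESS BY INDEX ROWID TBL1 (cr=57 pr=0 pw=0 time=23978 us)
2160900 BUFFER SORT (cr=45 pr=0 pw=0 time=4334777 us)
96565 INDEX RANGE SCAN XPK_TBL2 (cr=3219374 pr=3 pw=0 time=217869652 us)(object id 272084)
""".splitlines()

# rows, operation text, then cr (consistent reads), pr (physical reads),
# pw (physical writes) and elapsed time in microseconds.
LINE_RE = re.compile(r"^\s*(\d+)\s+(.+?)\s+\(cr=(\d+) pr=(\d+) pw=(\d+) time=(\d+) us\)")

def parse(line):
    """Turn one row-source line into a dict; hypothetical helper for illustration."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognised row-source line: {line!r}")
    rows, operation, cr, _pr, _pw, time_us = match.groups()
    return {
        "rows": int(rows),
        "operation": operation,
        "cr": int(cr),
        "seconds": int(time_us) / 1_000_000,
    }

steps = [parse(line) for line in ROW_SOURCE_LINES]
by_rows = sorted(steps, key=lambda s: s["rows"], reverse=True)  # stable sort

for step in by_rows:
    print(f'{step["rows"]:>9,}  {step["seconds"]:>10.2f}s  {step["operation"]}')

# Every TBL1 row is paired with every TBL1 row: 1470 * 1470 = 2,160,900.
cartesian_rows = 1470 * 1470
```

Sorting by row count puts MERGE JOIN CARTESIAN and BUFFER SORT at the top, which matches what stood out in the trace.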

Use the SQL Trace facility and TKPROF.

The main culprits in insert slowdowns are indexes, constraints, and insert triggers. Do a test without as many of these as you can remove and see if it's fast. Then introduce them back one at a time and see which one is causing the problem.
I have seen systems where they drop indexes before bulk inserts and rebuild at the end -- and it's faster.
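The drop-and-rebuild pattern can be sketched as follows. This uses Python's built-in sqlite3 module as a stand-in for Oracle (an assumption made only so the example runs on its own), and the measurements table and index name are hypothetical. The point is the order of operations, not the timing.

```python
import sqlite3

# ASSUMPTION: SQLite stands in for the real DBMS; table and index names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (id INTEGER PRIMARY KEY, taken_at TEXT, value REAL)")
conn.execute("CREATE INDEX ix_measurements_taken_at ON measurements (taken_at)")

batch = [(f"2010-01-{1 + i % 28:02d}", i * 0.1) for i in range(20000)]

with conn:  # commits when the block finishes without an error
    # 1. Drop the index so it is not maintained row by row during the load.
    conn.execute("DROP INDEX ix_measurements_taken_at")
    # 2. Bulk insert into the now index-free table.
    conn.executemany("INSERT INTO measurements (taken_at, value) VALUES (?, ?)", batch)
    # 3. Rebuild the index once, over the complete data.
    conn.execute("CREATE INDEX ix_measurements_taken_at ON measurements (taken_at)")

row_count = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
index_names = [row[1] for row in conn.execute("PRAGMA index_list('measurements')")]
print(row_count, index_names)
```

Whether this beats inserting with the index in place depends on the batch size relative to the table and on what else needs the index during the load, so time both variants on realistic data.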

The first thing to realize is that, as the documentation says, the cost you see displayed is only meaningful within a single query plan. The costs of two different explain plans are not comparable. Secondly, the costs are based on an internal estimate. As hard as Oracle tries, those estimates are not accurate, particularly not when the optimizer misbehaves. Your situation suggests that there are two query plans which, according to Oracle, are very close in performance, but which in fact perform very differently.
The information that you want to look at is the actual explain plan itself. That tells you exactly how Oracle executes the query. It has a lot of technical gobbledygook, but what you really care about is knowing that it works from the most indented part out, and at each step it merges according to one of a small number of rules. That will tell you what Oracle is doing differently in your two instances.
What next? Well there are a variety of strategies to tune bad statements. The first option that I would suggest, if you're in Oracle 10g, is to try their SQL tuning advisor to see if a more detailed analysis will tell Oracle the error of its ways. It can then store that plan, and you will use the more efficient plan.
If you can't do that, or if that doesn't work, then you need to get into things like providing query hints, manual stored query outlines, and the like. That is a complex topic. This is where it helps to have a real DBA. If you don't, then you'll want to start reading the documentation, but be aware that there is a lot to learn. (Oracle also has a SQL tuning class that is, or at least used to be, very good. It isn't cheap though.)

I've put up my general list of things to check to improve performance as an answer to another question:
Favourite performance tuning tricks
... It might be helpful as a checklist, even though it's not Oracle-specific.

I agree with a previous poster that SQL Trace and tkprof are a good place to start. I also highly recommend the book Optimizing Oracle Performance, which discusses similar tools for tracing execution and analyzing the output.

SQL Trace and tkprof are only good if you have access to these tools. Most of the large companies that I do work for do not allow developers to access anything under the Oracle Unix IDs.
I believe you should be able to determine the problem by first understanding the question that is being asked and by reading the explain plans for each of the queries. Many times I find that the big difference is that there are some tables and indexes that have not been analyzed.

Another good reference that presents a general technique for query tuning is the book SQL Tuning by Dan Tow.

When the performance of a sql statement isn't as expected / desired, one of the first things I do is to check the execution plan.
The trick is to check for things that aren't as expected. For example you might find table scans where you think an index scan should be faster or vice versa.
One point where the Oracle optimizer sometimes takes a wrong turn is its estimate of how many rows a step will return. If the execution plan expects 2 rows, but you know it will be more like 2000 rows, the execution plan is bound to be less than optimal.
With two statements to compare you can obviously compare the two execution plans to see where they differ.
From this analysis, I come up with an execution plan that I think should be suited better. This is not an exact execution plan, but just some crucial changes, to the one I found, like: It should use Index X or a Hash Join instead of a nested loop.
The next thing is to figure out a way to make Oracle use that execution plan, often by using hints or creating additional indexes, sometimes by changing the SQL statement. Then of course test that the changed statement
a) still does what it is supposed to do
b) is actually faster
With b) it is very important to make sure you are testing the correct use case. A typical pitfall is the difference between returning the first row and returning the last row. Most tools show you the first results as soon as they are available, with no direct indication that there is more work to be done. But if your actual program has to process all rows before it continues to the next processing step, it is almost irrelevant when the first row appears; it is only relevant when the last row is available.
If you find a better execution plan, the final step is to make your database actually use it in the actual program. If you added an index, this will often work out of the box. Hints are an option, but can be problematic if a library creates your SQL statement; those often don't support hints. As a last resort you can save and fix execution plans for specific SQL statements. I'd avoid this approach, because it is easily forgotten, and in a year or so some poor developer will scratch her head over why the statement performs in a way that might have been appropriate with the data one year ago, but not with the current data ...

