I need to load a large amount of data from multiple data sources using SQL Server 2014.
My ETL scripts are written in T-SQL and take a long time to execute because my TempDB is full.
In your opinion, what is the best way to deal with this:
Using Commit Transactions?
Clean TempDB?
etc.
The only way to answer this question is with a very high-level, general response.
You have a few options:
Simply allocate more space to TempDB.
Optimize your ETL queries and tune your indexes.
Option 2 is often the better approach. Excessive use of TempDB indicates that inefficient sorts or joins are occurring. To resolve this, you need to analyze the actual execution plans of your ETL code. Look for the following:
Exclamation marks in your query plan. These often indicate that a join or sort operation had to spill to TempDB because the optimizer underestimated the amount of memory required. You might have statistics that need to be updated.
Look for large differences between the estimated number of rows and the actual number of rows. These can also indicate out-of-date statistics or parameter sniffing issues.
Look for sort operations. It is often possible to remove these by adding indexes to your tables.
Look for inefficient access methods, e.g. a table scan when you only need a small number of rows from a large table. These can often be resolved by adding covering indexes. Just note that table scans are often the best approach when loading data warehouses.
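Before reading plans, it can help to see how much of TempDB is in use and by whom. The following is a minimal sketch, assuming you have VIEW SERVER STATE permission; dbo.StagingOrders is a hypothetical table name, not something from the question.

```sql
-- How is tempdb space currently split? Counts are 8 KB pages, converted to MB.
SELECT SUM(unallocated_extent_page_count)       * 8 / 1024 AS free_mb,
       SUM(user_object_reserved_page_count)     * 8 / 1024 AS user_objects_mb,      -- temp tables, table variables
       SUM(internal_object_reserved_page_count) * 8 / 1024 AS internal_objects_mb,  -- sort/hash work areas, spools
       SUM(version_store_reserved_page_count)   * 8 / 1024 AS version_store_mb
FROM tempdb.sys.dm_db_file_space_usage;

-- Which sessions have allocated the most tempdb space?
-- Note: these counters only include tasks that have already completed in the session;
-- pages used by a request that is still running appear in sys.dm_db_task_space_usage.
SELECT TOP (10)
       su.session_id,
       (su.user_objects_alloc_page_count
        - su.user_objects_dealloc_page_count) * 8 / 1024     AS user_objects_mb,
       (su.internal_objects_alloc_page_count
        - su.internal_objects_dealloc_page_count) * 8 / 1024 AS internal_objects_mb
FROM sys.dm_db_session_space_usage AS su
ORDER BY internal_objects_mb DESC;

-- If row estimates in a plan look wrong, refresh statistics on the table involved
-- (hypothetical table name).
UPDATE STATISTICS dbo.StagingOrders WITH FULLSCAN;
```

High internal object usage points towards sort and hash spills, which is where the plan analysis described above is likely to pay off.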
Hope this was helpful.
Marius
We have a query in our system that has been a problem because of the number of logical reads it uses. The query is run fairly often (a few times a day), but it is a reporting query (i.e. it gathers data; it is not transactional).
After having a couple of people look at it we are mulling over a few different options.
Using OPTION (FORCE ORDER) and a few MERGE JOIN hints to get the optimizer to process the data more efficiently (at least on the data that has been tested).
Using temp tables to break up the query so the optimizer isn't dealing with one very large query, which allows it to process each part more efficiently.
We do not really have the option of doing a major schema change or anything, tuning the query is kind of the rallying point for this issue.
The query hints option is performing a little better than the other option, but both options are acceptable in terms of performance at this point.
So the question is, which would you prefer? The query hints are viewed as slightly dangerous because we are overriding the optimizer. The temp table solution needs to write out to tempdb.
In the past we have been able to see large performance gains using temp tables on our larger reporting queries but that has generally been for queries that are run less frequently than this query.
If you have exhausted optimizing via indexes and removed non-SARGable SQL, then I recommend going for the temp tables option:
Temp tables provide repeatable performance, provided they do not put excessive pressure on tempdb in terms of size growth and performance. You will need to monitor those.
SQL hints may stop being effective because of other table or index changes in the future.
Remember to clean up temp tables when you are finished.
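As an illustration of the temp table approach, here is a minimal sketch. The tables dbo.Orders and dbo.Customers and all column names are hypothetical; the real query in the question is not shown, so this only demonstrates the pattern.

```sql
-- Make the script re-runnable in the same session.
IF OBJECT_ID('tempdb..#RecentOrders') IS NOT NULL
    DROP TABLE #RecentOrders;

-- Step 1: materialize the selective part of the query into a temp table.
-- The temp table gets its own statistics, so later steps see realistic row counts.
SELECT o.OrderID, o.CustomerID, o.OrderTotal
INTO #RecentOrders
FROM dbo.Orders AS o
WHERE o.OrderDate >= DATEADD(DAY, -30, GETDATE());

-- Optional: index the temp table on the join column.
CREATE CLUSTERED INDEX IX_RecentOrders_CustomerID
    ON #RecentOrders (CustomerID);

-- Step 2: the remaining query is now small enough for the optimizer to plan well.
SELECT c.CustomerName, SUM(r.OrderTotal) AS TotalLast30Days
FROM #RecentOrders AS r
JOIN dbo.Customers AS c
    ON c.CustomerID = r.CustomerID
GROUP BY c.CustomerName;

-- Clean up explicitly rather than waiting for the session to end.
DROP TABLE #RecentOrders;
```

Whether the index in the middle step is worth its cost depends on how many rows land in the temp table, so it is worth testing with and without it.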
I was asked in an interview to describe the ways one could optimize the query Select * from TableA if it is taking a long time to execute. TableA can be any table with a large amount of data. The interviewer didn't allow me options such as selecting only a few columns or adding a "WHERE" clause; he wanted a solution for the query exactly as given.
It's really hard to know what the interviewer was looking for.
They might be relatively inexperienced and expected answers like:
"list all the columns instead of * since that's way faster!"; or,
"add an ORDER BY because that will always speed it up!"
The kinds of things an experienced person might be looking for are:
inspect the query plan, are there computed columns or other similar things taking additional resources?
revisit the requirements - do the users really need the whole table in arbitrary order?
is there a clustered index on the table; if not, is the heap full of forwarding pointers?
is there excessive fragmentation on the underlying table (and/or the index being used to satisfy the query)?
is the query being blocked?
what is the query waiting on?
is the query waiting on an external resource (e.g. crappy I/O subsystem, a memory grant, a tempdb autogrow)?
is the query parallel and suffering packet waits because the stats are out of date?
There are a lot of underlying things that may be making that query slow or that may make that query a bad choice.
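Several of the checks above can be done with dynamic management views. This is a minimal sketch, assuming SQL Server (which the mention of tempdb and forwarding pointers suggests) and a table named dbo.TableA in the current database; the session id 53 is a made-up placeholder.

```sql
-- Heap forwarding pointers and fragmentation for dbo.TableA.
-- forwarded_record_count is only populated in SAMPLED or DETAILED mode
-- (it is NULL in LIMITED mode) and only applies to heaps.
SELECT ips.index_id,
       ips.index_type_desc,
       ips.page_count,
       ips.avg_fragmentation_in_percent,
       ips.forwarded_record_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.TableA'), NULL, NULL, 'SAMPLED') AS ips;

-- While the slow query is running in another session:
-- is it blocked, and what is it waiting on?
SELECT r.session_id,
       r.status,
       r.blocking_session_id,   -- non-zero means another session is blocking it
       r.wait_type,
       r.wait_time,             -- milliseconds spent in the current wait
       r.last_wait_type
FROM sys.dm_exec_requests AS r
WHERE r.session_id = 53;        -- hypothetical: replace with the session running the query
```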
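Some databases have maintenance commands that rebuild tables to reduce fragmentation, which can improve performance for such queries.
SQLite and PostgreSQL have the command
VACUUM;
(in PostgreSQL, plain VACUUM only makes the space of dead rows reusable; VACUUM FULL is the form that rewrites the table).
MySQL has the command
OPTIMIZE TABLE table_name;
Oracle Database has no OPTIMIZE TABLE statement; the closest equivalents there are ALTER TABLE ... MOVE and ALTER TABLE ... SHRINK SPACE.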
It is expensive, as it will move around a lot of data. But in doing so, it will make the pages more balanced and this way usually shrink the total database size (some databases may however decide to add an index at this point, so it may also grow).
Since the data is stored in pages, reducing the number of pages by rebuilding the database can improve the performance even for a SELECT * FROM table; statement.
We use an Oracle 10.2.0.5 database in Production.
Optimizer is in "cost-based" mode.
Do we need to calculate statistics (DBMS_STATS package) after:
creating a new index
adding a column
creating a new table
?
Thanks
There's no short answer. It totally depends on your data and how you use it. Here are some things to consider:
As @NullUserException pointed out, statistics are automatically gathered, usually every night. That's usually good enough; in most (OLTP) environments, if you just added new objects they won't contain a lot of data before the stats are automatically gathered. The plans won't be that bad, and if the objects are new they probably won't be used much right away.
creating a new index - No. "Oracle Database now automatically collects statistics during index creation and rebuild".
adding a column - Maybe. If the column will be used in joins and predicates you probably want stats on it. If it's just used for storing and displaying data it won't really affect any plans. But, if the new column takes up a lot of space it may significantly alter the average row length, number of blocks, row chaining, etc., and the optimizer should know about that.
creating a new table - Probably. Oracle is able to compensate for missing statistics through dynamic sampling, although this often isn't good enough. Especially if the new table has a lot of data; bad statistics almost always lead to under-estimating the cardinality, which will lead to nested loops when you want hash joins. Also, even if the table data hasn't changed, you may need to gather statistics one more time to enable histograms. By default, Oracle creates histograms for skewed data, but will not enable those histograms if those columns haven't been used as a predicate. (So this applies to adding a new column as well). If you drop and re-create a table, even with the same name, Oracle will not maintain any of that column use data, and will not know that you need histograms on certain columns.
Gathering optimizer statistics is much more difficult than most people realize. At my current job, most of our performance problems are ultimately because of bad statistics. If you're trying to come up with a plan for your system you ought to read the Managing Optimizer Statistics chapter.
Update:
There's no need to gather statistics for empty objects; dynamic sampling will work just as quickly as reading stats from the data dictionary. (Based on a quick test hard parsing a large number of queries with and without stats.) If you disable dynamic sampling then there may be some weird cases where the Oracle default values lead to inaccurate plans, and you would be better off with statistics on an empty table.
I think the reason Oracle automatically gathers stats for indexes at creation time is because it doesn't cost much extra. When you create an index you have to read all the blocks in the table, so Oracle might as well calculate the number of levels, blocks, keys, etc., at the same time.
Table statistics can be more complicated, and may require multiple passes of the data. Creating an index is relatively simple compared to the arbitrary SQL that may be used as part of a create-table-as-select. It may not be possible, or efficient, to take those arbitrary SQL statements and transform them into a query that also returns the information needed to gather statistics.
Of course it wouldn't cost anything extra to gather stats for an empty table. But it doesn't gain you anything either, and it would just be misleading to anyone who looks at USER_TABLES.LAST_ANALYZED: the table would appear to be analyzed, but not with any meaningful data.
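For the cases where gathering statistics manually does make sense (a newly loaded table, or a new column used in predicates), a minimal sketch looks like this. NEW_TABLE is a hypothetical table in the current schema, and the parameter values are illustrative choices rather than recommendations.

```sql
-- Gather table, column and index statistics for one table (Oracle PL/SQL).
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,                          -- current schema
    tabname    => 'NEW_TABLE',                   -- hypothetical table name
    method_opt => 'FOR ALL COLUMNS SIZE AUTO',   -- let Oracle decide where histograms are needed
    cascade    => TRUE);                         -- also gather stats on the table's indexes
END;
/

-- Check what the optimizer now knows about the table.
SELECT table_name, num_rows, blocks, avg_row_len, last_analyzed
FROM user_tables
WHERE table_name = 'NEW_TABLE';
```

As noted above, SIZE AUTO relies on recorded column usage, so on a freshly created table a second gather after the application has run some queries may be needed before histograms appear.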
I have two tables in my database, A and B:
A: summary view of my data
B: detail view (text files containing the detailed story)
I have 1 million records in my database; 70% of the size is Table B and 30% is Table A.
I want to know whether the size of the database affects query response time.
Would it be beneficial to remove Table B and store its contents on disk, reducing the size of the database in order to improve its performance?
Absolutely, size can be a factor in DB performance! However, its effect can be mitigated through the proper use of indexes, relationships and integrity constraints. There are a number of other things that can cause performance loss, like triggers that execute in unwanted situations.
I would say removing the table is an option but I don't think should be your first choice. Make sure your database has used the things I mention above properly.
There are many databases that are exponentially larger than yours that perform very well. Also, use explain plans on your queries to make sure you are using sound syntax like the proper use of joins.
I would personally start with using explain plans. This will tell you if you are missing indexes, joins, etc. Then make the changes one at a time until you are happy with the performance.
A million records is a tiny table in database terms. Any modern database should be able to handle that without breaking a sweat; even Access can easily handle it. If you are currently experiencing performance problems, you likely have database design problems, poorly written queries, and/or inadequate hardware.
I'm trying to squeeze some extra performance out of searching through a table with many rows.
My current reasoning is that if I can throw away some of the seldom-used columns from the searched table, thereby reducing the row size, then the number of page splits and hence the I/O should drop, giving a benefit once the data starts to spill out of memory.
Any good resource detailing such effects?
Any experiences?
Thanks.
Tuning the size of a row is only a major issue if the RDBMS is performing a full table scan. If your query can select the rows using only indexes, then the row size is less important (unless you are returning a very large number of rows, where the I/O of returning the actual result is significant).
If you are doing a full table scan, or partial scans of large numbers of rows, because you have predicates that are not using indexes, then row size can be a major factor. One example I remember: on a table of the order of 100,000,000 rows, splitting the largish 'data' columns into a different table from the columns used for querying resulted in an order of magnitude performance improvement on some queries.
I would only expect this to be a major factor in a relatively small number of situations.
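The split described in that example could look roughly like the following. This is a sketch in T-SQL syntax with hypothetical table and column names; the original tables are not shown in the answer.

```sql
-- Narrow table: only the columns used for searching and filtering.
CREATE TABLE dbo.Item (
    ItemID int           NOT NULL PRIMARY KEY,
    Name   nvarchar(100) NOT NULL,
    Status tinyint       NOT NULL
);

-- Wide table: the large, seldom-read 'data' columns, one row per Item.
CREATE TABLE dbo.ItemData (
    ItemID  int           NOT NULL PRIMARY KEY
            REFERENCES dbo.Item (ItemID),
    Payload nvarchar(max) NULL
);

-- Scans for searching now touch only the narrow rows, so far fewer pages are read.
SELECT i.ItemID, i.Name
FROM dbo.Item AS i
WHERE i.Status = 1;

-- The wide data is fetched by key only for the rows that are actually needed.
SELECT i.ItemID, i.Name, d.Payload
FROM dbo.Item AS i
JOIN dbo.ItemData AS d
    ON d.ItemID = i.ItemID
WHERE i.ItemID = 42;
```

The trade-off is an extra join whenever both halves are needed, which is why this tends to help only when scans of the narrow columns dominate the workload.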
I don't know what else you have tried to increase performance, but this seems like grasping at straws to me. That doesn't mean that it isn't a valid approach. In my experience the benefit can be significant. It's just that it's usually dwarfed by other kinds of optimization.
However, what you are looking for are I/O statistics. There are several methods to gather them. A quite good introduction can be found here.
The SQL Server query optimizer is a very complex algorithm, and the decision about which index to use or what type of scan to perform depends on many factors, such as the query's output columns, the indexes available, the statistics available, the statistical distribution of your data values in the columns, the row count, and the row size.
So the only valid answer to your question is: It depends :)
Give some more information, like what kind of optimization you have already done, what the query plan looks like, etc.
Of course, when SQL Server decides to do a table scan (a clustered index scan if a clustered index is available), you can reduce I/O by reducing the row size. But in that case you would improve performance dramatically by creating an adequate index (which is de facto a separate table with a smaller row size).
If the application is transactional then look at the indexes in use on the table. Table partitioning is unlikely to be much help in this situation.
If you have something like a data warehouse and are doing aggregate queries over a lot of data then you might get some mileage from partitioning.
If you are doing a join between two large tables that are not in a 1:M relationship the query optimiser may have to resolve the predicates on each table separately and then combine relatively large intermediate result sets or run a slow operator like nested loops matching one side of the join. In this case you may get a benefit from a trigger-maintained denormalised table to do the searches. I've seen good results obtained from denormalised search tables for complex screens on a couple of large applications.
If you're interested in minimizing I/O when reading data, you need to check whether your indexes cover the query. To minimize I/O, either select only columns that are included in an index, or create indexes that cover all columns used in the query. That way the optimizer can read the data from the index alone and never has to touch the actual table rows.
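A covering index for one specific query can be sketched as follows. The table dbo.Orders and its columns are hypothetical and only serve to show the key/INCLUDE split.

```sql
-- Key column: what the query filters on.
-- INCLUDE columns: what the query returns, stored only at the leaf level of the index.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID_Covering
    ON dbo.Orders (CustomerID)
    INCLUDE (OrderDate, OrderTotal);

-- Every column this query touches is in the index, so it can be answered
-- by an index seek without any lookups into the base table.
SELECT OrderDate, OrderTotal
FROM dbo.Orders
WHERE CustomerID = 42;
```

Adding a column to the SELECT list that is not in the index would bring back key lookups (or a scan), so a covering index is tied to the shape of the query it was built for.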
If you're looking into this level of detail, maybe you should consider upgrading the hardware, changing controllers, or adding more disks so that more spindles are available to the query processor, allowing SQL Server to read more data at the same time.
SQL Server disk I/O is frequently the cause of bottlenecks in most systems. The I/O subsystem includes disks, disk controller cards, and the system bus. If disk I/O is consistently high, consider:
Move some database files to an additional disk or server.
Use a faster disk drive or a redundant array of inexpensive disks (RAID) device.
Add additional disks to a RAID array, if one already is being used.
Tune your application or database to reduce disk access operations.
Consider index coverage, better indexes, and/or normalization.
Microsoft SQL Server uses Microsoft Windows I/O calls to perform disk reads and writes. SQL Server manages when and how disk I/O is performed, but the Windows operating system performs the underlying I/O operations. Applications and systems that are I/O-bound may keep the disk constantly active.
Different disk controllers and drivers use different amounts of CPU time to perform disk I/O. Efficient controllers and drivers use less time, leaving more processing time available for user applications and increasing overall throughput.
The first thing I would do is ensure that your indexes have been rebuilt. If you are dealing with a huge amount of data and an index rebuild is not possible (from SQL Server 2005 onwards you can perform online rebuilds without locking everyone out), then ensure that your statistics are up to date (more on this later).
If your database contains representative data, then you can perform a simple measurement of the number of reads (logical and physical) that your query is using by doing the following:
SET STATISTICS IO ON
GO
-- Execute your query here
SET STATISTICS IO OFF
GO
On a well set up database server, there should be few or no physical reads (high physical reads often indicate that your server needs more RAM). How many logical reads are you doing? If this number is high, then you will need to look at creating indexes. The next step is to run the query with the estimated execution plan turned on, then rerun it (clearing the cache first) displaying the actual execution plan. If these differ, then your statistics are out of date.
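The estimated and actual plans can also be captured from a script rather than through the Management Studio buttons. This is a minimal sketch; the query against dbo.Orders is a hypothetical stand-in for your own, and the cache-clearing command affects the whole instance, so it is assumed you are on a test server.

```sql
-- Estimated plan: the query is compiled but NOT executed.
SET SHOWPLAN_XML ON;
GO
SELECT OrderID, OrderDate FROM dbo.Orders WHERE CustomerID = 42;
GO
SET SHOWPLAN_XML OFF;
GO

-- Clear cached plans so the next run compiles from scratch.
-- This flushes the plan cache for the entire instance: test servers only.
DBCC FREEPROCCACHE;
GO

-- Actual plan: the query IS executed, and the plan returned alongside the results
-- includes runtime row counts next to the estimates.
SET STATISTICS XML ON;
GO
SELECT OrderID, OrderDate FROM dbo.Orders WHERE CustomerID = 42;
GO
SET STATISTICS XML OFF;
GO
```

In the actual plan, comparing the estimated and actual row counts on each operator is usually the quickest way to spot the stale statistics mentioned above.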
I think you're going to be farther ahead using standard optimization techniques first -- check your execution plan, profiler trace, etc. and see whether you need to adjust your indexes, create statistics etc. -- before looking at the physical structure of your table.