Maintenance commands on Redshift - database

I find myself dealing with a Redshift cluster with 2 different types of tables: the ones that get fully replaced every day and the ones that receive a merge every day.
From what I understand so far, there are maintenance commands that should be run, since all of these tables have millions of rows. The 3 commands I've found so far are:
vacuum table_name;
vacuum reindex table_name;
analyze table_name;
Which of those commands should be applied in which circumstance? I'm planning to run them daily after the loads finish in the middle of the night. The reason for doing it daily is that after running some of these commands manually, there was a huge performance improvement.
After reading the documentation, I feel it's not very clear what the standard procedure should be.
All the tables have interleaved sortkeys regardless of the load type.

A quick summary of the commands, from the VACUUM documentation:
VACUUM: Sorts the specified table (or all tables in the current database) and reclaims disk space occupied by rows that were marked for deletion by previous UPDATE and DELETE operations. A full vacuum doesn't perform a reindex for interleaved tables.
VACUUM REINDEX: Analyzes the distribution of the values in interleaved sort key columns, then performs a full VACUUM operation.
ANALYZE: Updates table statistics for use by the query planner.
It is good practice to perform an ANALYZE when significant quantities of data have been loaded into a table. In fact, Amazon Redshift will automatically skip the analysis if less than 10% of data has changed, so there is little harm in running ANALYZE.
You mention that some tables get fully replaced every day. This should be done either by dropping and recreating the table, or by using TRUNCATE. Emptying a table with an unqualified DELETE is less efficient and should be avoided.
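For example, a full daily reload might look something like this (the table name, S3 path and IAM role below are placeholders; adjust the COPY options to match your actual load process):

-- Remove all rows quickly; TRUNCATE is faster than DELETE and frees the space immediately
TRUNCATE full_reload_table;
-- Reload from Amazon S3 (path and role are placeholders)
COPY full_reload_table
FROM 's3://my-bucket/daily-extract/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
FORMAT AS CSV;
-- Refresh planner statistics after the load
ANALYZE full_reload_table;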
VACUUM can take significant time. In situations where data is being appended in time-order and the table's SORTKEY is based on the time, it is not necessary to vacuum the table. This is because the table is effectively sorted already. This, however, does not apply to interleaved sorts.
Interleaved sorts are more tricky. From the sort key documentation:
An interleaved sort key gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order.
Basically, interleaved sorts use a fancy algorithm to sort the data so that queries based on any of the columns (individually or in combination) will minimize the number of data blocks that need to be read from disk. Disk access always takes the most time in a database, so minimizing disk access is the best way to speed up a database. Amazon Redshift uses zone maps to identify which blocks to read from disk, and the best way to minimize such access is to sort the data and then skip over as many blocks as possible when performing queries.
Interleaved sorts are less performant than normal sorts, but give the benefit that multiple fields are fairly well sorted. Only use interleaved sorts if you often query on many different fields. The overhead in maintaining an interleaved sort (via VACUUM REINDEX) is quite high and should only be done if the reindex effort is worth the result.
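For reference, the sort style is chosen when the table is defined. A hypothetical example (the table and column names are illustrative only):

-- Compound sort key: best when queries filter mostly on the leading column(s)
CREATE TABLE events_compound (
    event_time  TIMESTAMP,
    customer_id BIGINT,
    event_type  VARCHAR(32)
)
COMPOUND SORTKEY (event_time, customer_id);

-- Interleaved sort key: equal weight to every sort column, at a higher maintenance cost
CREATE TABLE events_interleaved (
    event_time  TIMESTAMP,
    customer_id BIGINT,
    event_type  VARCHAR(32)
)
INTERLEAVED SORTKEY (event_time, customer_id, event_type);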
So, in summary:
ANALYZE after significant data changes
VACUUM regularly if you delete data from the table
VACUUM REINDEX if you use Interleaved Sorts and significant amounts of data have changed
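Putting that together, a nightly maintenance script could look roughly like this (table names are hypothetical; a table that is fully reloaded each night generally only needs ANALYZE, while a merged table with an interleaved sort key is the candidate for VACUUM REINDEX):

-- Fully reloaded table: a load into an empty table is typically written in sorted order,
-- so usually refreshing statistics is enough
ANALYZE full_reload_table;

-- Table that receives a daily merge (updates/deletes plus inserts):
-- re-sort the interleaved sort key, reclaim space, then refresh statistics
VACUUM REINDEX merge_table;
ANALYZE merge_table;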

Related

Best options to deal with TempDB with a lot of data

I need to load data from multiple data sources, involving a large amount of data, using SQL Server 2014.
My ETL scripts are in T-SQL, and they are taking a long time to execute because my TempDB gets full.
In your opinion, what is the best way to deal with this:
Using Commit Transactions?
Clean TempDB?
etc.
The only way to answer this question is with a very high-level, general response.
You have a few options:
Simply allocate more space to TempDB.
Optimize your ETL queries and tune your indexes.
Option 2 is often the better approach. Excessive use of TempDB indicates that inefficient sorts or joins are occurring. To resolve this, you need to analyze the actual execution plans of your ETL code. Look for the following:
Exclamation marks in your query plan. These often indicate that a join or a sort operation had to spill over to TempDB because the optimizer underestimated the amount of memory required. You might have statistics that need to be updated.
Look for large differences between the estimated number of rows and the actual number of rows. This can also indicate statistics that are out of date, or parameter sniffing issues.
Look for sort operations. It is often possible to remove these by adding indexes to your tables.
Look for inefficient access methods. These can often be resolved by adding covering indexes, e.g. a table scan when you only need a small number of rows from a large table. Just note that table scans are often the best approach when loading data warehouses.
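As a starting point, something like the following shows stale statistics being refreshed and how much TempDB each session is consuming (the table name is a placeholder; the DMV shown exists from SQL Server 2005 onwards):

-- Refresh statistics on a table that shows large estimated-vs-actual row count gaps
UPDATE STATISTICS dbo.MyStagingTable WITH FULLSCAN;

-- See how much TempDB space each session is using
-- (internal objects include the work tables behind sorts, spools and hash joins)
SELECT session_id,
       user_objects_alloc_page_count,
       internal_objects_alloc_page_count
FROM   sys.dm_db_session_space_usage
ORDER  BY internal_objects_alloc_page_count DESC;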
Hope this was helpful.
Marius

Optimum number of rows in a table for creating indexes

My understanding is that creating indexes on small tables could cost more than the benefit they provide.
For example, there is no point in creating indexes on a table with fewer than 100 rows (or even 1,000 rows?)
Is there any specific number of rows as a threshold for creating indexes?
Update 1
The more I investigate, the more conflicting information I find. I might be too concerned about preserving IO write operations, since my SQL Server database is in HA synchronous-commit mode.
Point #1:
This question is very much about IO write performance. In scenarios like SQL Server HA synchronous-commit mode, the cost of an IO write is high when the database servers reside in data centers on different subnets. Adding indexes adds to that already expensive IO write cost.
Point #2:
Books Online suggests:
Indexing small tables may not be optimal because it can take the query optimizer longer to traverse the index searching for data than to perform a simple table scan. Therefore, indexes on small tables might never be used, but must still be maintained as data in the table changes.
I am not sure that adding an index to a table with only one row will ever have any benefit - or am I wrong?
Your understanding is wrong. Small tables also benefit from indexes, especially when they are used to join with bigger tables.
The cost of an index has two parts: storage space, and processing time during inserts and updates. The first is very cheap these days, so it can almost be disregarded. So your only real consideration should be tables with a lot of updates and inserts; for those, apply the proper configuration.
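For instance, even a small lookup table benefits from an index (or a primary key, which creates one) when it is joined to a large table; the object names below are purely illustrative:

-- Small lookup table: the primary key gives the optimizer a unique index to join on
CREATE TABLE dbo.Status (
    StatusId   INT         NOT NULL PRIMARY KEY,
    StatusName VARCHAR(50) NOT NULL
);

-- An index on the large table's foreign key column helps the join as well
CREATE INDEX IX_Orders_StatusId ON dbo.Orders (StatusId);

SELECT o.OrderId, s.StatusName
FROM   dbo.Orders AS o
JOIN   dbo.Status AS s ON s.StatusId = o.StatusId;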

How large can an HBase table actually grow?

Would there be any reason to split an HBase table into smaller entities, or can it grow forever (assuming available disk space)?
Background:
We have real-time data (measurements), up to, let's say, 500,000/s, which consists essentially of timestamp, value, and flags. If we distributed the values to different tables, it would also mean inserting each of the entries individually, which is a performance killer. Inserting in bulk is much faster. The question is: are there any downsides to having an HBase table of extreme size?
There could be a strong reason for splitting a table, which is avoiding RegionServer hotspotting by distributing the load across multiple RegionServers. HBase, by its nature, stores rows sequentially in one place. Rows with similar keys go to the same server (time-series data, for example). This is to facilitate better range queries. However, this starts becoming a bottleneck once your data grows too big (while your disk still has space).
In cases like the above, data will continue to go to the same RegionServer, leading to hotspotting. So we split tables manually to distribute the data uniformly across the cluster.
I don't see the point in manually splitting an HBase table; HBase does this on its own, and extremely well (the resulting splits are the HBase table regions).
HBase was made to handle extremely large data, so I like to believe that the limit depends on your hardware only (of course, some configuration settings might impact performance, such as automatic major compaction, etc.).

Oracle - Do you need to calculate statistics after creating index or adding columns?

We use an Oracle 10.2.0.5 database in Production.
Optimizer is in "cost-based" mode.
Do we need to calculate statistics (DBMS_STATS package) after:
creating a new index
adding a column
creating a new table
?
Thanks
There's no short answer. It totally depends on your data and how you use it. Here are some things to consider:
As #NullUserException pointed out, statistics are automatically gathered, usually every night. That's usually good enough; in most (OLTP) environments, if you just added new objects they won't contain a lot of data before the stats are automatically gathered. The plans won't be that bad, and if the objects are new they probably won't be used much right away.
creating a new index - No. "Oracle Database now automatically collects statistics during index creation and rebuild".
adding a column - Maybe. If the column will be used in joins and predicates you probably want stats on it. If it's just used for storing and displaying data it won't really affect any plans. But, if the new column takes up a lot of space it may significantly alter the average row length, number of blocks, row chaining, etc., and the optimizer should know about that.
creating a new table - Probably. Oracle is able to compensate for missing statistics through dynamic sampling, although this often isn't good enough. Especially if the new table has a lot of data; bad statistics almost always lead to under-estimating the cardinality, which will lead to nested loops when you want hash joins. Also, even if the table data hasn't changed, you may need to gather statistics one more time to enable histograms. By default, Oracle creates histograms for skewed data, but will not enable those histograms if those columns haven't been used as a predicate. (So this applies to adding a new column as well). If you drop and re-create a table, even with the same name, Oracle will not maintain any of that column use data, and will not know that you need histograms on certain columns.
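If you do decide to gather statistics manually after creating or bulk-loading a table, a typical call looks roughly like this (the schema and table names are placeholders; see the DBMS_STATS documentation for the full set of options):

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => 'MY_SCHEMA',                  -- placeholder schema name
    tabname          => 'MY_NEW_TABLE',               -- placeholder table name
    estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,  -- let Oracle choose the sample size
    method_opt       => 'FOR ALL COLUMNS SIZE AUTO',  -- histograms where column usage suggests them
    cascade          => TRUE                          -- also gather statistics on the table's indexes
  );
END;
/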
Gathering optimizer statistics is much more difficult than most people realize. At my current job, most of our performance problems are ultimately because of bad statistics. If you're trying to come up with a plan for your system you ought to read the Managing Optimizer Statistics chapter.
Update:
There's no need to gather statistics for empty objects; dynamic sampling will work just as quickly as reading stats from the data dictionary. (Based on a quick test hard parsing a large number of queries with and without stats.) If you disable dynamic sampling then there may be some weird cases where the Oracle default values lead to inaccurate plans, and you would be better off with statistics on an empty table.
I think the reason Oracle automatically gathers stats for indexes at creation time is because it doesn't cost much extra. When you create an index you have to read all the blocks in the table, so Oracle might as well calculate the number of levels, blocks, keys, etc., at the same time.
Table statistics can be more complicated, and may require multiple passes of the data. Creating an index is relatively simple compared to the arbitrary SQL that may be used as part of a create-table-as-select. It may not be possible, or efficient, to take those arbitrary SQL statements and transform them into a query that also returns the information needed to gather statistics.
Of course it wouldn't cost anything extra to gather stats for an empty table. But it doesn't gain you anything either, and it would just be misleading to anyone who looks at USER_TABLES.LAST_ANALYZED - the table appears to have been analyzed, but not with any meaningful data.

SQL Server 2005 - Rowsize effect on query performance?

I'm trying to squeeze some extra performance out of searching through a table with many rows.
My current reasoning is that if I can throw away some of the seldom-used columns from the searched table, thereby reducing the row size, then the number of page splits and hence the IO should drop, giving a benefit when data starts to spill out of memory.
Any good resource detailing such effects?
Any experiences?
Thanks.
Tuning the size of a row is only a major issue if the RDBMS is performing a full table scan; if your query can select the rows using only indexes, then the row size is less important (unless you are returning a very large number of rows, where the IO of returning the actual result set is significant).
If you are doing a full table scan, or partial scans of large numbers of rows because you have predicates that are not using indexes, then row size can be a major factor. One example I remember: on a table on the order of 100,000,000 rows, splitting the largish 'data' columns into a different table from the columns used for querying resulted in an order-of-magnitude performance improvement on some queries.
I would only expect this to be a major factor in a relatively small number of situations.
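A rough illustration of that kind of split, with made-up names: keep the frequently searched columns in a narrow table and move the wide, rarely read columns to a 1:1 companion table.

-- Narrow table: only the columns used in predicates and joins
CREATE TABLE dbo.OrderSearch (
    OrderId    BIGINT   NOT NULL PRIMARY KEY,
    CustomerId INT      NOT NULL,
    OrderDate  DATETIME NOT NULL
);

-- Wide 1:1 companion table: the large, seldom-read payload columns
CREATE TABLE dbo.OrderPayload (
    OrderId    BIGINT         NOT NULL PRIMARY KEY
               REFERENCES dbo.OrderSearch (OrderId),
    Notes      NVARCHAR(MAX)  NULL,
    RawPayload VARBINARY(MAX) NULL
);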
I don't know what else you have tried to increase performance; this seems like grasping at straws to me. That doesn't mean it isn't a valid approach. In my experience the benefit can be significant. It's just that it's usually dwarfed by other kinds of optimization.
However, what you are looking for are IO statistics. There are several methods of gathering them; a good introduction can be found here.
The SQL Server query plan optimizer is a very complex algorithm, and the decision about which index to use or which type of scan to perform depends on many factors, such as the query's output columns, the available indexes, the available statistics, the statistical distribution of the data values in the columns, the row count, and the row size.
So the only valid answer to your question is: It depends :)
Give some more information, like what kind of optimization you have already done, what the query plan looks like, etc.
Of course, when SQL Server decides to do a table scan (or a clustered index scan if one is available), you can reduce the IO by downsizing the row size. But in that case you would increase performance far more dramatically by creating an adequate index (which is, de facto, a separate table with a smaller row size).
If the application is transactional then look at the indexes in use on the table. Table partitioning is unlikely to be much help in this situation.
If you have something like a data warehouse and are doing aggregate queries over a lot of data then you might get some mileage from partitioning.
If you are doing a join between two large tables that are not in a 1:M relationship the query optimiser may have to resolve the predicates on each table separately and then combine relatively large intermediate result sets or run a slow operator like nested loops matching one side of the join. In this case you may get a benefit from a trigger-maintained denormalised table to do the searches. I've seen good results obtained from denormalised search tables for complex screens on a couple of large applications.
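A bare-bones sketch of such a trigger-maintained search table (all object names are hypothetical, and only the insert path is shown; updates and deletes would need the same treatment):

-- Denormalised table that pre-joins the columns the search screen needs
CREATE TABLE dbo.OrderSearchDenorm (
    OrderId      BIGINT        NOT NULL PRIMARY KEY,
    CustomerName NVARCHAR(200) NOT NULL,
    ProductName  NVARCHAR(200) NOT NULL
);
GO

-- Keep the search table in step with the base table on insert
CREATE TRIGGER trg_Orders_Search ON dbo.Orders
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.OrderSearchDenorm (OrderId, CustomerName, ProductName)
    SELECT i.OrderId, c.CustomerName, p.ProductName
    FROM   inserted AS i
    JOIN   dbo.Customers AS c ON c.CustomerId = i.CustomerId
    JOIN   dbo.Products  AS p ON p.ProductId  = i.ProductId;
END;
GO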
If you're interested in minimizing IO when reading data, you need to check whether the indexes cover the query or not. To minimize IO you should select only columns that are included in an index, or use indexes that cover all the columns used in the query; that way the optimizer will read data from the indexes and will never need to read the actual table rows.
If you're looking into this kind of detail, maybe you should also consider upgrading the hardware, changing controllers, or adding more disks to have more disk spindles available for the query processor, allowing SQL Server to read more data at the same time.
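For example, a covering index on the hypothetical query below lets the optimizer answer the query entirely from the index (SQL Server 2005 and later support the INCLUDE clause; all names are illustrative):

-- The index key matches the predicate; INCLUDE carries the remaining output columns
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId_Covering
ON dbo.Orders (CustomerId)
INCLUDE (OrderDate, TotalAmount);

-- This query can now be satisfied with an index seek, with no lookups into the base table
SELECT OrderDate, TotalAmount
FROM   dbo.Orders
WHERE  CustomerId = 42;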
SQL Server disk I/O is frequently the cause of bottlenecks in most systems. The I/O subsystem includes disks, disk controller cards, and the system bus. If disk I/O is consistently high, consider:
Move some database files to an additional disk or server.
Use a faster disk drive or a redundant array of inexpensive disks (RAID) device.
Add additional disks to a RAID array, if one already is being used.
Tune your application or database to reduce disk access operations.
Consider index coverage, better indexes, and/or normalization.
Microsoft SQL Server uses Microsoft Windows I/O calls to perform disk reads and writes. SQL Server manages when and how disk I/O is performed, but the Windows operating system performs the underlying I/O operations. Applications and systems that are I/O-bound may keep the disk constantly active.
Different disk controllers and drivers use different amounts of CPU time to perform disk I/O. Efficient controllers and drivers use less time, leaving more processing time available for user applications and increasing overall throughput.
The first thing I would do is ensure that your indexes have been rebuilt; if you are dealing with a huge amount of data and an index rebuild is not possible (from SQL Server 2005 onwards you can perform online rebuilds without locking everyone out), then ensure that your statistics are up to date (more on this later).
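For reference, an online rebuild and a statistics refresh look roughly like this (the index and table names are placeholders; ONLINE = ON requires Enterprise Edition):

-- Rebuild a fragmented index without taking long blocking locks
ALTER INDEX IX_BigTable_SearchCol ON dbo.BigTable
REBUILD WITH (ONLINE = ON);

-- If a rebuild is not feasible, at least bring the statistics up to date
UPDATE STATISTICS dbo.BigTable WITH FULLSCAN;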
If your database contains representative data, then you can perform a simple measurement of the number of reads (logical and physical) that your query is using by doing the following:
SET STATISTICS IO ON
GO
-- Execute your query here
SET STATISTICS IO OFF
GO
On a well-set-up database server, there should be few or no physical reads (a high number of physical reads often indicates that your server needs more RAM). How many logical reads are you doing? If this number is high, then you will need to look at creating indexes. The next step is to run the query with the estimated execution plan turned on, then rerun it (clearing the cache first) displaying the actual execution plan. If these differ, then your statistics are out of date.
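On a test system (not production), the caches can be cleared before the re-run like this, so the comparison between the estimated and actual plans is not skewed by cached data or plans:

-- WARNING: flushes caches server-wide; only do this on a test box
CHECKPOINT;              -- write dirty pages so the buffer pool can be fully emptied
DBCC DROPCLEANBUFFERS;   -- empty the buffer pool so physical reads are measured
DBCC FREEPROCCACHE;      -- discard cached query plans so the query is recompiled
GO
-- Re-run the query with the actual execution plan enabled and compare the
-- estimated vs. actual row counts on each operator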
I think you're going to be farther ahead using standard optimization techniques first -- check your execution plan, profiler trace, etc. and see whether you need to adjust your indexes, create statistics etc. -- before looking at the physical structure of your table.
