I am working on a badly designed database in MS SQL 2008. It has a table with "60,000,000" records and it increases by about "4,000,000" records per week.
So if I use "SQL 2008 table partitioning", will it help? If so, please suggest the steps to follow.
Couple of points you need to check before arriving at Table Partitioning Strategy
Total Records "60,000,000" records and it increases by about "4,000,000" records per week
Question - Can you Archive Remove any Data on Weekly basis. Reason is to know if you use queries for OLTP or OLAP you can think of having another machine where data is replicated and retained
Do you know statics of how queries look like, Indexes that are being used
Objective of partitioning is to reduce the number of partitioning to look for required data. Example - Suppose your query where clause filter is based on month number. Then in that case if you have parition done every month, then when you have a query for month = 5, then only that particular partition would be queried
You have to provide little more information on schema, query usage patterns, indexes
Help with what? You did not explain the problem. Are certain queries slow? Do you have problems with ETL? Do you have problems deletion of obsolete data? Do you have problems with maintenance and backups?
As a general rule table partitioning is a feature for easing ETL via fast partition switch operations and for data location administration (distribute tables across several filegroups and leverage filegroup features like piece-meal restore).
One area partitioning does not help is performance. Performance issues are addressed with indexes and the best you can hope for is on-par performance with a non-partitioned case. There is a misconception going around that partition elimination will help performance, see also Introduction to Partitioned Tables. What is usually missed is that for performance a much much better alternative is to simply move the partitioning key as the leftmost key in the clustered key, which will cause a range scan with, in the worse case, at least on-par, if not better, performance than compared with the partition elimination. Partition elimination helps queries that have to handle partitioned tables when partitioning was required due to the other reasons (ETL, fast deletes, filegroup management).
Related
I have tables have millions of partitions.
Should I reduce partition count for performance?
As my experience of spark application or hive query system, too many partition was bad for performance.
If you do not have auto clustering on the table, it will not be auto defragmented. So if you write to the table frequently with small row counts, it will be in very bad shape.
Partition count impacts compile time badly, as every partition has metadata that is load to plan/optimize the query. I would suggest doing a rebuild test (select into a new transient table) and run some comparable queries to see the different in compile time.
We have a number of table that sorting (thus auto clustering) does not make sense for as the use pattern is always full-table scan, thus we just rebuild those tables on schedules to keep the partition count down, and for us, that rebuild cost is worth the performance gain.
As with everything Snowflake you should run a test, and see how it is for you. And monitor hot spots as they can and do change.
In Snowflake, there are micro-partitions, and they are managed automatically. Therefore you do not need to worry about the number of micro-partition.
https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html#what-are-micro-partitions
It says:
Micro-partitioning is automatically performed on all Snowflake tables.
Tables are transparently partitioned using the ordering of the data as
it is inserted/loaded.
From this page, I understand that micro-partitions are managed by Snowflake, and you do not need to focus on reducing the partition count (this is the original question).
This should also help to understand the difference between clustering and micro-partitions:
https://docs.snowflake.com/en/user-guide/table-considerations.html#when-to-set-a-clustering-key
If you read the above link, you can see that it is not a must to define clustering on even large tables to get a good query performance!
As the original question about reducing the partition count, I also have to say that clustering does not always reduce the number of partitions, but it is another story.
In the Snowflake documentation, I could not find a reference to using Indexes.
Does Snowflake support Indexes and, if not, what is the alternative approach to performance tuning when using Snowflake?
Snowflake does not use indexes. This is one of the things that makes Snowflake scale so well for arbitrary queries. Instead, Snowflake calculates statistics about columns and records in files that you load, and uses those statistics to figure out what parts of what tables/records to actually load to execute a query. It also uses a columnar store file format, that lets it only read the parts of the table that contain the fields (columns) you actually use, and thus cut down on I/O on columns that you don't use in the query.
Snowflake slices big tables (gigabyte, terabyte or larger) into smaller "micro partitions." For each micro partition, it collects statistics about what value ranges each column contains. Then, it only loads micro partitions that contain values in the range needed by your query. As an example, let's say you have a column of time stamps. If your query asks for data between June 1 and July 1, then partitions that do not contain any data in this range, will not be loaded or processed, based on the statistics stored for dates in the micropartition files.
Indexes are often used for online transaction processing, because they accelerate workflows when you work with one or a few records, but when you run analytics queries on large datasets, you almost always work with large subsets of each table in your joins and aggregates. The storage mechanism, with automatic statistics, automatically accelerates such large queries, with no need for you to specify an index, or tune any kind of parameters.
Snowflake does not support indexes, though it does support "clustering" for performance improvements of I/O.
I recommend reading these links to get familiar with this:
https://docs.snowflake.net/manuals/user-guide/tables-clustering-keys.html
https://docs.snowflake.net/manuals/user-guide/tables-auto-reclustering.html
Here's a really good blog post on the topic as well:
https://www.snowflake.com/blog/automatic-query-optimization-no-tuning/
Hope this helps...Rich
No Snowflake does not have indexes. Its performance boosts come through by eliminating unnecessary scanning which it achieves my maintaining rich metadata in each of its micro partitions. For instance if you have a time filter in your query and your table is more or less sorted by time, then Snowflake can "prune" away the parts of the table that are not relevant to the query.
Having said this, Snowflake is constantly releasing new features and one such feature is its Search Optimisation Service which allows you to perform "needle in a hay stack" queries on selected columns that you enable. Not quite indexes that you can create, but something like that being used behind the scenes perhaps.
No, Snowflake doesn't support indexes. And don't let them tell you that this is an advantage.
Performance tuning can be done as described above, but is often is done with money: Pay for bigger warehouses.
Snowflake doesn't support indexes, it keeps data in micro partition or in another sense it breaks data sets in small files and format rows to column and compress them. Snowflake metadata manager in service layer will have all the information about each micro partition like which partition have which data.
Each partition will have information about itself in header like max value, min value, cardinality etc. this is much better then indexes as compare to conventional databases.
Snowflake is a columnar database with automatic micro-partitioning. Note that in SQL Server, Microsoft call their columnar storage option a column store index.
The performance gain from columnar storage on data warehouse/mart type queries is spectacular compared with their row store brethren. By storing data by column the columns can be greatly compressed allowing a huge amount of data can be held in memory.
If your predominant queries are on a naturally ordered column, such as OrderDate then it makes sense to cluster on OrderDate. You will gain a performance benefit from doing that.
Clustering isn't a catch-all performance boost. Choose your clustering unwisely and you can degrade performance for your queries.
In terms of performance tuning there are techniques you can use.
When using a dimensional model look at the most commonly used aspects of those dimensions and look to denormalise those aspects into your fact tables to reduce the number of joins.
For example, if the queries use Week, Month and Quarter then denormalise those aspects into the fact table giving you performance concerns. The affect on storage in a column store DB is far less than in a row store DB so the cost/benefit balance is much better.
Materialised views are another way of performance tuning however these come with caveats.
The range of SQL statements available to you for materialised views is far less than for other views
Not all aggregates are supported
Can only be on a single table
They work well when data doesn't change often.
If your underlying table is clustered on OrderDate then a materialised view of last months orders might not give you the desired performance benefit because partition pruning might already be doing what is needed.
If your query performance is as a result of contention with other users then spinning up another warehouse might be the answer. 2 warehouses dedicated to their tasks might be more cost effective than scaling up a single warehouse.
Primary/unique key constraints can be defined but are metadata only despite the constraint documentation describing the enforced/not enforced syntax.
Some distributed column stores do support PK and FK constraints, Vertica being an example, but most do not because the performance impact of enforcing them is too high.
** Updated Fall 2022 - thanks to Hobo's comment: Yes, via Unistore's Hybrid Tables. **
Original Response:
Neither Snowflake nor any high-performance big data / OLAP system will support [unique] indexes because these systems are MPP (Massively Parallel Processing). MPP systems load data with thousands of concurrent inserts into the same table. [Unique] Indexes are a concept from much smaller / OLTP systems. Even then many data engineers intentionally disable the [unique] indexes on OLTP systems when they approach big data scale especially as the data is inserted or frequently updated and deleted.
If you want a "non-unique index" then you can use a slew of features such as: micro-partitions, clustered tables, auto-clustering, Search Optimization Service, etc.
This Medium can give you some workarounds. How can we enforce [Unique, Primary Key, Foreign Key (UPF)] column constraints in Snowflake?
Snowflake does not support indexing natively, but it has other ways to tune performance:
Reduce queuing by setting a time-out and/or adjusting the max concurrency
Use result caching
Tackle disk spilling
Rectify row expansion by using the distinct clause, using temporary tables and checking your join order
Fix inadequate pruning by setting up data clustering
Reference: https://rockset.com/blog/what-do-i-do-when-my-snowflake-query-is-slow-part-2-solutions/ (Disclosure: I work for Rockset).
In short,
snowflake does not support indexes but a single clustering key on a each table.
Snowflake does not support indexes but if you are looking for optimization you can use search optimization service of Snowflake.
Please refer below snowflake documentation.
https://docs.snowflake.com/en/user-guide/search-optimization-service.html
Snowflake's Search Optimization Service will create indexes over all the pertinent columns in a table "out of the box" as well as other advances search features (e.g. substring and regex matching).
If you'd like optimize for specific expressions used in your queries, you can customize SOS, as well.
How can you determine if the performance gained on a SELECT by indexing a column will outweigh the performance loss on an INSERT in the same table? Is there a "tipping-point" in the size of the table when the index does more harm than good?
I have table in SQL Server 2008 with 2-3 million rows at any given time. Every time an insert is done on the table, a lookup is also done on the same table using two of its columns. I'm trying to determine if it would be beneficial to add indexes to the two columns used in the lookup.
Like everything else SQL-related, it depends:
What kind of fields are they? Varchar? Int? Datetime?
Are there other indexes on the table?
Will you need to include additional fields?
What's the clustered index?
How many rows are inserted/deleted in a transaction?
The only real way to know is to benchmark it. Put the index(es) in place and do frequent monitoring, or run a trace.
This depends on your workload and your requirements. Sometimes data is loaded once and read millions of times, but sometimes not all loaded data is ever read.
Sometimes reads or writes must complete in certain time.
case 1: If table is static and is queried heavily (eg: item table in Shopping Cart application) then indexes on the appropriate fields is highly beneficial.
case 2: If table is highly dynamic and not a lot of querying is done on a daily basis (eg: log tables used for auditing purposes) then indexes will slow down the writes.
If above two cases are the boundary cases, then to build indexes or not to build indexes on a table depends on which case above does the table in contention comes closest to.
If not leave it to the judgement of Query tuning advisor. Good luck.
I have an app, which cycles through a huge number of records in a database table and performs a number of SQL and .Net operations on records within that database (currently I am using Castle.ActiveRecord on PostgreSQL).
I added some basic btree indexes on a couple of the feilds, and as you would expect, the performance of the SQL operations increased substantially. Wanting to make the most of dbms performance I want to make some better educated choices about what I should index on all my projects.
I understand that there is a detrement to performance when doing inserts (as the database needs to update the index, as well as the data), but what suggestions and best practices should I consider with creating database indexes? How do I best select the feilds/combination of fields for a set of database indexes (rules of thumb)?
Also, how do I best select which index to use as a clustered index? And when it comes to the access method, under what conditions should I use a btree over a hash or a gist or a gin (what are they anyway?).
Some of my rules of thumb:
Index ALL primary keys (I think most RDBMS do this when the table is created).
Index ALL foreign key columns.
Create more indexes ONLY if:
Queries are slow.
You know the data volume is going to increase significantly.
Run statistics when populating a lot of data in tables.
If a query is slow, look at the execution plan and:
If the query for a table only uses a few columns, put all those columns into an index, then you can help the RDBMS to only use the index.
Don't waste resources indexing tiny tables (hundreds of records).
Index multiple columns in order from high cardinality to less. This means: first index the columns with more distinct values, followed by columns with fewer distinct values.
If a query needs to access more than 10% of the data, a full scan is normally better than an index.
Here's a slightly simplistic overview: it's certainly true that there is an overhead to data modifications due to the presence of indexes, but you ought to consider the relative number of reads and writes to the data. In general the number of reads is far higher than the number of writes, and you should take that into account when defining an indexing strategy.
When it comes to which columns to index I'v e always felt that the designer ought to know the business well enough to be able to take a very good first pass at which columns are likely to benefit. Other then that it really comes down to feedback from the programmers, full-scale testing, and system monitoring (preferably with extensive internal metrics on performance to capture long-running operations),
As #David Aldridge mentioned, the majority of databases perform many more reads than they do writes and in addition, appropriate indexes will often be utilised even when performing INSERTS (to determine the correct place to INSERT).
The critical indexes under an unknown production workload are often hard to guess/estimate, and a set of indexes should not be viewed as set once and forget. Indexes should be monitored and altered with changing workloads (that new killer report, for instance).
Nothing beats profiling; if you guess your indexes, you will often miss the really important ones.
As a general rule, if I have little idea how the database will be queried, then I will create indexes on all Foriegn Keys, profile under a workload (think UAT release) and remove those that are not being used, as well as creating important missing indexes.
Also, make sure that a scheduled index maintenance plan is also created.
We have a very large table (> 77M records and growing) runing on SQL Server 2005 64bit Standard edition and we are seeing some performance issues. There are up to a hundred thousand records added daily.
Does anyone know if there is a limit to the number of records SQL server Standard edition can handle? Should be be considering moving to Enterprise edition or are there some tricks we can use?
Additional info:
The table in question is pretty flat (14 columns), there is a clustered index with 6 fields, and two other indexes on single fields.
We added a fourth index using 3 fields that were in a select in one problem query and did not see any difference in the estimated performance (the query is part of a process that has to run in the off hours so we don't have metrics yet). These fields are part of the clustered index.
Agreeing with Marc and Unkown above ... 6 indexes in the clustered index is way too many, especially on a table that has only 14 columns. You shouldn't have more than 3 or 4, if that, I would say 1 or maybe 2. You may know that the clustered index is the actual table on the disk so when a record is inserted, the database engine must sort it and place it in it's sorted organized place on the disk. Non clustered indexes are not, they are supporting lookup 'tables'. My VLDBs are laid out on the disk (CLUSTERED INDEX) according to the 1st point below.
Reduce your clustered index to 1 or 2. The best field choices are the IDENTITY (INT), if you have one, or a date field in which the fields are being added to the database, or some other field that is a natural sort of how your data is being added to the database. The point is you are trying to keep that data at the bottom of the table ... or have it laid out on the disk in the best (90%+) way that you'll read the records out. This makes it so that there is no reorganzing going on or that it's taking one and only one hit to get the data in the right place for the best read. Be sure to put the removed fields into non-clustered indexes so you don't lose the lookup efficacy. I have NEVER put more than 4 fields on my VLDBs. If you have fields that are being update frequently and they are included in your clustered index, OUCH, that's going to reorganize the record on the disk and cause COSTLY fragmentation.
Check the fillfactor on your indexes. The larger the fill factor number (100) the more full the data pages and index pages will be. In relation to how many records you have and how many records your are inserting you will change the fillfactor # (+ or -) of your non-clustered indexes to allow for the fill space when a record is inserted. If you change your clustered index to a sequential data field, then this won't matter as much on a clustered index. Rule of thumb (IMO), 60-70 fillfactor for high writes, 70-90 for medium writes, and 90-100 for high reads/low writes. By dropping your fillfactor to 70, will mean that for every 100 records on a page, 70 records are written, which will leave free space of 30 records for new or reorganized records. Eats up more space, but it sure beats having to DEFRAG every night (see 4 below)
Make sure the statistics exist on the table. If you want to sweep the database to create statistics using the "sp_createstats 'indexonly'", then SQL Server will create all the statistics on all the indexes that the engine has accumulated as requiring statistics. Don't leave off the 'indexonly' attribute though or you'll add statistics for every field, that would then not be good.
Check the table/indexes using DBCC SHOWCONTIG to see which indexes are getting fragmented the most. I won't go into the details here, just know that you need to do it. Then based on that information, change the fillfactor up or down in relation to the changes the indexes are experiencing change and how fast (over time).
Setup a job schedule that will do online (DBCC INDEXDEFRAG) or offline (DBCC DBREINDEX) on individual indexes to defrag them. Warning: don't do DBCC DBREINDEX on this large of a table without it being during maintenance time cause it will bring the apps down ... especially on the CLUSTERED INDEX. You've been warned. Test and test this part.
Use the execution plans to see what SCANS, and FAT PIPES exist and adjust the indexes, then defrag and rewrite stored procs to get rid of those hot spots. If you see a RED object in your execution plan, it's because there are not statistics on that field. That's bad. This step is more of the "art than the science".
On off peak times, run the UPDATE STATISTICS WITH FULLSCAN to give the query engine as much information about the data distributions as you can. Otherwise do the standard UPDATE STATISTICS (with standard 10% scan) on tables during the weeknights or more often as you see fit with your observerations to make sure the engine has more information about the data distributions to retrieve the data for efficiently.
Sorry this is so long, but it's extremely important. I've only give you here minimal information but will help a ton. There's some gut feelings and observations that go in to strategies used by these points that will require your time and testing.
No need to go to Enterprise edition. I did though in order to get the features spoken of earlier with partitioning. But I did ESPECIALLY to have much better mult-threading capabilities with searching and online DEFRAGING and maintenance ... In Enterprise edition, it is much much better and more friendly with VLDBs. Standard edition doesn't handle doing DBCC INDEXDEFRAG with online databases as well.
The first thing I'd look at is indexing. If you use the execution plan generator in Management Studio, you want to see index seeks or clustered index seeks. If you see scans, particularly table scans, you should look at indexing the columns you generally search on to see if that improves your performance.
You should certainly not need to move to Enterprise edition for this.
[there is a clustered index with 6 fields, and two other indexes on single fields.]
Without knowing any details about the fields, I would try to find a way to make the clustered index smaller.
With SQL Server, all the clustered-key fields will also be included in all the non-clustered indices (as a way to do the final lookup from non-clustered index to actual data page).
If you have six fields at 8 bytes each = 48 bytes, multiply that by two more indices times 77 million rows - and you're looking at a lot of wasted space which translates into a lot
of I/O operations (and thus degrades performance).
For the clustered index, it's absolutely CRUCIAL for it to be unique, stable, and as small as possible (preferably a single INT or such).
Marc
Do you really need to have access to all 77 million records in a single table?
For example, if you only need access to the last X months worth of data, then you could consider creating an archiving strategy. This could be used to relocate data to an archive table in order to reduce the volume of data and subsequently, query time on your 'hot' table.
This approach could be implemented in the standard edition.
If you do upgrade to the Enterprise edition you can make use of table partitioning. Again depending on your data structure this can offer significant performance improvements. Partitioning can also be used to implement the strategy previously mentioned but with less administrative overhead.
Here is an excellent White paper on table partitioning in SQL Server 2005
http://msdn.microsoft.com/en-us/library/ms345146.aspx
I hope what I have detailed is clear and understandable. Please do feel to contact me directly if you require further assistance.
Cheers,
http://msdn.microsoft.com/en-us/library/ms143432.aspx
You've got some room to grow.
As far as performance issues, that's a whole other question. Caching, sharding, normalizing, indexing, query tuning, app code tuning, and so on.
Standard should be able to handle it. I would look at indexing and the queries you use with the table. You want to structure things in such a way that your inserts don't cause too many index recalcs, but your queries can still take advantage of the index to limit lookups to a small portion of the table.
Beyond that, you might consider partitioning the table. This will allow you to divide the table into several logical groups. You can do it "behind-the-scenes", so it still appears in sql server as one table even though it stored separately, or you can do it manually (create a new 'archive' or yearly table and manually move over rows). Either way, only do it after you looked at the other options first, because if you don't get that right you'll still end up having to check every partition. Also: partitioning does require Enterprise Edition, so that's another reason to save this for a last resort.
In and of itself, 77M records is not a lot for SQL Server. How are you loading the 100,000 records? is that a batch load each day? or thru some sort of OLTP application? and is that the performance issue you are having, i.e adding the data? or is it the querying that giving you the most problems?
If you are adding 100K records at a time, and the records being added are forcing the cluster-index to re-org your table, that will kill your performance quickly. More details on the table structure, indexes and type of data inserted will help.
Also, the amount of ram and the speed of your disks will make a big difference, what are you running on?
maybe these are minor nits, but....
(1) relational databases don't have FIELDS... they have COLUMNS.
(2) IDENTITY columns usually mean the data isn't normalized (or the designer was lazy). Some combination of columns MUST be unique (and those columns make up the primary key)
(3) indexing on datetime columns is usually a bad idea; CLUSTERING on datetime columns is also usually a bad idea, especially an ever-increasing datetime column, as all the inserts are contending for the same physical space on disk. Clustering on datetime columns in a read-only table where that column is part of range restrictions is often a good idea (see how the ideas conflict? who said db design wasn't an art?!)
What type of disks do you have?
You might monitor some disk counters to see if requests are queuing.
You might move this table to another drive by putting it in another filegroup. You can also to the same with the indexes.
Initially I wanted to agree with Marc. The width of your clustered index seems suspect, as it will essentially be used as the key to perform lookups on all your records. The wider the clustered index, the slower the access, generally. And a six field clustered index feels really, really suspect.
Uniqueness is not required for a clustered index. In fact, the best candidates for fields that should be in the clustered index are ones that are not unique and used in joins. For example, in a Persons table where each Person belongs to one Group and you frequently join Persons to Groups, while accessing batches of people by group, Person.group_id would be an ideal candidate, for this particular use case.