I work on databases used for Analysis workloads so I usually use a stored procedure to output final datasets into SQL Server tables that we can connect to from Tableau or SAS, etc.
We process multiple batches of data through the same system so the output dataset tables all contain a BATCH_ID column which users use to filter on the specific batch they want to analyze.
Each time a dataset is published, I delete any old data for that batch in the output table before inserting a fresh set of rows for that batch of data. For this type of workload what do you think the best indexes would be?
I'm currently using a clustered index on the BATCH_ID column because I figured that would group all the rows together resulting in efficient filtering and deletion/insertion. Will this result in a lot of index or table fragmentation over time? Keep in mind that the entire batch is deleted and re-inserted each time so there's no issue with partial updates or additions to existing batches.
Would I be better off with a typical clustered index on an identity column and a non-clustered index on batch_ID?
Related
I have a large table consisting of 4 Billion+ rows and 50 columns, most of which are either datetime or numeric except a few which are varchar.
Data will be inserted into the table on a weekly basis (about 20 million rows).
I expect queries with where clauses on some of the datetime columns, and a couple of the the varchar columns. There is no primary key in the table.
There are no indexes, nor the table is partitioned. I am using SQL Server 2016.
I understand that I need to partition or index the table, but I am not sure which approach to take or both in-fact.
Since the table is large, should I create the indexes first or should I create the partitions first? If I do create the indexes and then create the partitions, what should I do to maintain these with new data coming in weekly.
EDIT: Also, minimal updates and deletes are expected on the table
I understand that I need to partition or index the table
You need to understand what you gain from partitioning. It is not at all the case that SQL Server requires partitioning on big tables to function adequately. SQL Server scales to arbitrary tables sizes without any inherent issues.
Common benefits of partitioning are:
Mass deletion in constant time
Different storage for older partitions
Not backing up old partitions
Sometimes in special situations (e.g. columnstore), partitioning can help as a strategy to speed up queries. Normally, indexing is better for that.
Essentially, partitioning splits the table physically into multiple sub tables. Most often this has a negative effect on query plans. Indexes are perfectly capable of restricting the set of data that needs to be touched. Partitions are worse for that.
Most of the queries will be filtering on the datetime columns and on some of the varchar columns. Like, get data for a certain daterange for a certain entity. With the indexes, it will be fragmented a lot because of new inserts and rebuilding/reorganising the indexes will also consume a lot of time. I can do it but again not sure which approach.
It seems you can best solve this by indexing:
Index according to the queries you expect.
Maintain the indexes properly. This is not too hard. For example, rebuild them after the weekly load.
Since the table is large, should I create the indexes first or should I create the partitions first?
Set up that partitioning objects first. Then, create or rebuild the clustered index on the new partitioning scheme. If possible drop other indexes first and recreate them afterwards (might not work due to availability restrictions).
what should I do to maintain these with new data coming in weekly.
What concerns do you have? New data will be stored in the appropriate partitions automatically. Make sure to create new partitions before loading the data. Keep partitions ready for 2 weeks in advance. The latest partitions must always be empty to avoid costly splits.
There is no primary key in the table.
Most often this is a not a good design. Most tables should have a primary key and a clustered index. If there is no natural key use an artifical one such as a bigint identity.
You definitely can apply partitioning but my feeling is that it will not gain you what you maybe expect. But it will force you to take on additional maintenance burdens, possibly reduce performance and there is risk of making mistakes that threaten availability. Simplicity is important.
My question is about performance on SQL server tables.
Assume I have a table that has many columns, for example 30 columns, with 1 column indexed. This table has approximately 30,000 rows.
If I perform a select that selects the indexed column, and one more, for example this:
SELECT IndexedColumn, column1
FROM table
Will this be slower than performing the same select on a table that only has 2 columns, and doing a SELECT * ...
So basically, will the existence of the extra columns slow down the select query event if I am not retrieving the data from the extra columns?
There will be minor difference on the very end of the process as you don't have to print/pass the rest of information for the end client (either SSMS or other app).
When performing a read based on clustered index all of the column (without BLOB) are saved on the same page set so to read the data you have to access the same set of pages anyway.
You would see a performance increase if you would have a nonclustered index on the column list you are after as then they are saved in their own structure of data pages (so it would be less to read).
Assuming that you are using the default Clustered Index created by SQL server when defining the primary key on the table in both scenarios then no, there shouldn't be any performance difference between these two scenarios. Maybe worth just checking it out and generating an Actual Execution plan to see for yourself? -- Actually not sure above is true, as given this is rowstore, the first table wont be able to fit as many rows onto each page so will suffer more of an IO/Disk overhead when reading data.
I have a table in a production server with 350 million rows and aproximatelly 25GB size. It has a single clustered identity index.
The queries targeting this table require some missing indexes for better perfomance.
I need to delete unnecessary data (aprox 200 million rows) and then create two non-clustered indexes.
However, I have some concerns:
I need to avoid increasing the log too much
Keep the database downtime as low as possible.
Keep the identity (primary key) the same in the remaining data.
I would like to hear you opinion for the best solution to adopt.
The following is a guideline on how you might do this:
Suspend insert/update operations or start logging them explicitly (this might result in degraded performance).
Select the records to keep into a new table.
Then you have two options. If this is the only table in your universe:
Build the indexes on the new table.
Stop the system.
Rename the existing table to something else.
Rename the new table to the real table name
Turn the system back on.
If there are other tables (such as foreign key relationships):
Truncate the existing table
Insert the data into the existing table
Build the secondary indexes
Turn the system back on
Depending on your user requirements, one of the above variations is likely to work for your problem.
Note that there are other more internally intensive techniques. For instance, create a replicated database and once that is working, you have two systems and can do the clean-up work on one at a time (a method such as this would be the preferred method for a system with near 100% uptime requirements). Or create a separate table that is just right and swap the table spaces.
I have a historical data table like (Date,ItemId,Price). Normally around 60,000 records will be inserted into the table. Now, the table record amount is around 3 millions. And our query is something like select 2000 products in 3 months which is very slow in present. I already make some indexes for it , but I still want more better performance.
For this situation, how can I do can make the query faster? Table partitioning or Caching ?
Thanks
Please specify the the version of SQL Server you are using? Partitioning only works with Enterprise edition.
To improve performance, you may make use of temporary tables, i.e. create temporary table on a subset (rows and columns) of data which you require. Temporary table would be smaller than original table, further they can be indexed also if required. This subset of data stored in temp tables can also be cached thereby increasing performance.
I have a table on a MS Azure SQL DB with 60,000 rows that is starting to take longer to execute with a SELECT statement. The first column is the "ID" column which is the primary key. As of right now, there is no other indexes. The thing about this table is the rows are based on recent news articles, therefore the last rows in the table are always going to be accessed more than the older rows.
If possible, how can I tell SQL Server to start querying at the end of the table working backwards when I do a SELECT operation?
Also, what can I do with indexes to make reading from the table faster with the last rows as the priority?
Typically, the SQL Server query optimizer will choose the data access strategy based on the available indexes, data distribution statistics & query. For example, SQL Server can scan an index forward, backward, physical order & so on. The choice is determined based on many variables.
In your example, if there is a date/time column in the table then you can index and use that in your predicate(s). This will automatically enable use of that index if that is the most selective one.
Alternatively, you can partition the table based on a column and access most recent data based on the partitioning key. This is common use of partitioning with a rolling window. With this approach, the predicate in your queries will specify the partitioning column which will help the optimizer pick the correct set of partitions to scan. This will dramatically reduce the amount of data that needs to be searched since partition elimination happens before execution depending on the query plan.