Azure Database Large Table Group By Performance - sql-server

I'm looking for design and/or index recommendations for the problem listed below.
I have a couple of denormalized tables in an Azure S1 Standard (20 DTU) database. One of those tables has ~20 columns and a million rows. My application requirements need me to support sub-second (or at least close to it) querying of this table by any combination of columns in my WHERE clause, as well as sub-second (or at least close to it) querying of DISTINCT values in each column.
In order to picture the use case behind this, here is an example. Imagine you were using an HR application that allowed you to search for employees and view employee information. The employee table might have 5 columns and millions of rows. The application allows you to filter by any column, and provides an interface to allow this. Therefore, the underlying SQL queries that must be made are:
A GROUP BY (or DISTINCT) query for each column, which provides the interface with the available filter options
A general employee search query, that filters all rows by any combination of filters
In order to solve performance issues on the first set of queries, I've implemented the following:
Index columns with a large variety of values
Full-Text index columns that require string matching (So CONTAINS querying instead of LIKE)
Do not index columns with a small variety of values
In order to solve the performance issues on the second query, I've implemented the following:
Forcing the front end to use pagination, implemented using SELECT * FROM table OFFSET 0 ROWS FETCH NEXT n ROWS ONLY, and ensuring the order by column is indexed
Locally, this seemed to work fine. Unfortunately, and Azure Standard database doesn't have the same performance as my local machine, and I'm seeing issues. Specifically, the columns I am not indexing (the ones with a very small set of distinct values) are taking 30+ seconds to query for. Additionally, while the paging is initially very quick, the query takes longer and longer the higher and higher I increase the offset.
So I have two targeted questions, but any other advice or design suggestions would be most welcome:
How bad is it to index every column in the table? Know that the table does need to be updated, but the columns that I update won't actually be part of any filters or WHERE clauses. Will the indexes still need to be rebuilt on update? You can also safely assume that the table will not see any inserts/deletes, except for once a month where the entire table is truncated and rebuilt from scratch
In regards to the paging getting slower and slower the deeper I get, I've read this is expected, but the performance becomes unacceptable at a certain point. Outside of making my clustered column the sort by column, are there any other suggestions to get this working?
Thanks,
-Tim

Related

SQL Server - Inserting new data worsens query performance

We have a 4-5TB SQL Server database. The largest table is around 800 GB big containing 100 million rows. 4-5 other comparable tables are 1/3-2/3 of this size. We went through a process to create new indexes to optimize performance. While the performance certainly improved we saw that the newly inserted data was slowest to query.
It's a financial reporting application with a BI tool working on top of the database. The data is loaded overnight continuing in the late morning, though the majority of the data is loaded by 7am. Users start to query data around 8am through the BI tool and are most concerned with the latest (daily) data.
I wanted to know if newly inserted data causes indexes to go out of order. Is there anything we can do so that we get better performance on the newly inserted data than the old data. I hope I have explained the issue well here. Let me know in case of any missing information. Thanks
Edit 1
Let me describe the architecture a bit.
I have a base table (let’s call it Base) with Date,id as clustered index.
It has around 50 columns
Then we have 5 derived tables (Derived1, Derived2,...) , according to different metric types, which also have Date,Id as clustered index and foreign key constraint on the Base table.
Tables Derived1 and Derived2 have 350+ columns. Derived3,4,5 have around 100-200 columns. There is one large view created to join all the data tables due limitations of the BI tool. The date,ID are the joining columns for all the tables joining to form the view (Hence I created clustered index on those columns). The main concern is with regard to BI tool performance. The BI tool always uses the view and generally sends similar queries to the server.
There are other indexes as well on other filtering columns.
The main question remains - how to prevent performance from deteriorating.
In addition I would like to know
If NCI on Date,ID on all tables would be better bet in addition to the clustered index on date,ID.
Does it make sense to have 150 columns as included in NCI for the derived tables?
You have about a 100 million rows, increasing every day with new portions and those new portions are usually selected. I should use partitioned indexes with those numbers and not regular indexes.
Your solution within sql server would be partitioning. Take a look at sql partitioning and see if you can adopt it. Partitioning is a form of clustering where groups of data share a physical block. If you use year and month for example, all 2018-09 records will share the same physical space and easy to be found. So if you select records with those filters (and plus more) it is like the table has the size of 2018-09 records. That is not exactly accurate but its is quite like it. Be careful with data values for partitioning - opposite to standard PK clusters where each value is unique, partitioning column(s) should result a nice set of different unique combinations thus partitions.
If you cannot use partitions you have to create 'partitions' yourself using regular indexes. This will require some experiments. The basic idea is data (a number?) indicating e.g. a wave or set of waves of imported data. Like data imported today and the next e.g. 10 days will be wave '1'. Next 10 days will be '2' and so on. Filtering on the latest e.g. 10 waves, you work on the latest 100 days import effectively skip out all the rest data. Roughly, if you divided your existing 100 million rows to 100 waves and start on at wave 101 and search for waves 90 or greater then you have 10 million rows to search if SQL is put correctly to use the new index first (will do eventually)
This is a broad question especially without knowing your system. But one thing that I would try is manually update your stats on the indexes/table once you are done loading data. With tables that big, it is unlikely that you will manipulate enough rows to trigger an auto-update. Without clean stats, SQL Server won't have an accurate histogram of your data.
Next, dive into your execution plans and see what operators are the most expensive.

SQL Server detecting slow vs fast columns

I have an ASP.Net MVC application & I use PetaPoco and SQL Server.
My usecase is I want to allow a search on a table with many fields, but hide fields that are "slow" (ie) unindexed. I'm going to modify the PetaPoco T4 template to decorate this information on the columns.
I found this answer that gives you a list of tables vs indexes. My concern is it shows a lot of columns for a particular table. Is the query given in the answer reliable for my usecase ? (ie) can the columns shown be included in the where clause & it wont be slow ? I have some tables that have 40M rows. I dont want to include slow columns in the where condition.
Or is there a better way to solve this problem ?
There are no slow columns in the sense of your question. You have to distinguish between two uses of a column.
Searching. When the column appears in the WHERE, or JOIN clause, it slows down your query, if there is no index for it.
Returning in recordset. If the column appears in the SELECT clause, its content must be returned with each row, whether you need it, or not. So for queries returning many rows, each additional column to be returned means a performance penalty.
Conclusion: As you can see, the performance impact of SELECTED columns does NOT DEPEND on index, but on the number of the returned rows.
Advice: Create indexes for columns used to search and do not return unnecessary columns. Let your queries be as specific as possible in terms of both, selected columns and returned rows.
I think it will not be that simple. You can check indexed columns using the suggested approach (or similar), but the fact that a column is present in an index does not mean your query will necessarily utilize it efficiently. For example if an index is created on columns A, B and C (in that order) and you only have a 'WHERE' clause on B or C (but not on A) you will probably end up with index scan rather than index seek and your query is likely to be slower than expected.
So your check should take into account the sequence of the columns in the indices - instantly fast columns (in your situation) might probably be considered the first columns of the indices (where ic.index_column_id = 1 in the post you mentioned). Columns that are not first in the indices (i.e. ic.index_column_id > 1) will be fast as long as the first columns are also included in the filter. There are other things you might also need to take into account (e.g. cardinality), but this is important to make sure you drive index seeks rather than scans.

Handling large datasets with SQL Server

I'm looking to manage a large dataset of log files. There is an average of 1.5 million new events per month that I'm trying to keep. I've used access in the past, though it's clearly not meant for this, and managing the dataset is a nightmare, because I'm having to split the datasets into months.
For the most part, I just need to filter event types and count the number. But before I do a bunch of work on the data import side of things, I wanted to see if anyone can verify that this SQL Server is a good choice for this. Is there an entry limit I should avoid and archive entries? Is there a way of archiving entries?
The other part is that I'm entering logs from multiple sources, with this amount of entries, is it wise to put them all into the same table, or should each source have their own table, to make queries faster?
edit...
There would be no joins, and about 10 columns. Data would be filtered through a view, and I'm interested to see if the results from a select query that filter based on one or more columns would have a reasonable response time? Does creating a set of views speed things up for frequent queries?
In my experience, SQL Server is a fine choice for this, and you can definitely expect better performance from SQL Server than MS-Access, with generally more optimization methods at your disposal.
I would probably go ahead and put this stuff into SQL Server Express as you've said, hopefully installed on the best machine you can use (though you did mention only 2GB of RAM). Use one table so long as it only represents one thing (I would think a pilot's flight log and a software error log wouldn't be in the same "log" table, as an absurdly contrived example). Check your performance. If it's an issue, move forward with any number of optimization techniques available to your edition of SQL Server.
Here's how I would probably do it initially:
Create your table with a non-clustered primary key, if you use a PK on your log table -- I normally use an identity column to give me a guaranteed order of events (unlike duplicate datetimes) and show possible log insert failures (missing identities). Set a clustered index on the main datetime column (you mentioned that your're already splitting into separate tables by month, so I assume you'll query this way, too). If you have a few queries that you run on this table routinely, by all means make views of them but don't expect a speedup by simply doing so. You'll more than likely want to look at indexing your table based upon the where clauses in those queries. This is where you'll be giving SQL server the information it needs to run those queries efficiently.
If you're unable to get your desired performance through optimizing your queries, indexes, using the smallest possible datatypes (especially on your indexed columns) and running on decent hardware, it may be time to try partitioned views (which require some form of ongoing maintenance) or partitioning your table. Unfortunately, SQL Server Express may limit you on what you can do with partitioning, and you'll have to decide if you need to move to a more feature-filled edition of SQL Server. You could always test partitioning with the Enterprise evaluation or Developer editions.
Update:
For the most part, I just need to filter event types and count the number.
Since past logs don't change (sort of like past sales data), storing the past aggregate numbers is an often-used strategy in this scenario. You can create a table which simply stores your counts for each month and insert new counts once a month (or week, day, etc.) with a scheduled job of some sort. Using the clustered index on your datetime column, SQL Server could much more easily aggregate the current month's numbers from the live table and add them to the stored aggregates for displaying the current values of total counts and such.
Sounds like one table to me, that would need indexes on exactly the sets of columns you will filter. Restricting access through views is generally a good idea and ensures your indexes will actually get used.
Putting each source into their own table will require UNION in your queries later, and SQL-Server is not very good optimizing UNION-queries.
"Archiving" entries can of course be done manually, by moving entries in a date-range to another table (that can live on another disk or database), or by using "partitioning", which means you can put parts of a table (e.g. defined by date-ranges) on different disks. You have to plan for the partitions when you plan your SQL-Server installation.
Be aware that Express edition is limited to 4GB, so at 1.5 million rows per month this could be a problem.
I have a table like yours with 20M rows and little problems querying and even joining, if the indexes are used.

How to deal with billions of records in an sql server?

I have an sql server 2008 database along with 30000000000 records in one of its major tables. Now we are looking for the performance for our queries. We have done with all indexes. I found that we can split our database tables into multiple partitions, so that the data will be spread over multiple files, and it will increase the performance of the queries.
But unfortunatly this functionality is only available in the sql server enterprise edition, which is unaffordable for us.
Is there any way to opimize for the query performance? For example, the query
select * from mymajortable where date between '2000/10/10' and '2010/10/10'
takes around 15 min to retrieve around 10000 records.
A SELECT * will obviously be less efficiently served than a query that uses a covering index.
First step: examine the query plan and look for and table scans and the steps taking the most effort(%)
If you don’t already have an index on your ‘date’ column, you certainly need one (assuming sufficient selectivity). Try to reduce the columns in the select list, and if ‘sufficiently’ few, add these to the index as included columns (this can eliminate bookmark lookups into the clustered index and boost performance).
You could break your data up into separate tables (say by a date range) and combine via a view.
It is also very dependent on your hardware (# cores, RAM, I/O subsystem speed, network bandwidth)
Suggest you post your table and index definitions.
First always avoid Select * as that will cause the select to fetch all columns and if there is an index with just the columns you need you are fetching a lot of unnecessary data. Using only the exact columns you need to retrieve lets the server make better use of indexes.
Secondly, have a look on included columns for your indexes, that way often requested data can be included in the index to avoid having to fetch rows.
Third, you might try to use an int column for the date and convert the date into an int. Ints are usually more effective in range searches than dates, especially if you have time information to and if you can skip the time information the index will be smaller.
One more thing to check for is the Execution plan the server uses, you can see this in management studio if you enable show execution plan in the menu. It can indicate where the problem lies, you can see which indexes it tries to use and sometimes it will suggest new indexes to add.
It can also indicate other problems, Table Scan or Index Scan is bad as it indicates that it has to scan through the whole table or index while index seek is good.
It is a good source to understand how the server works.
If you add an index on date, you will probably speed up your query due to an index seek + key lookup instead of a clustered index scan, but if your filter on date will return too many records the index will not help you at all because the key lookup is executed for each result of the index seek. SQL server will then switch to a clustered index scan.
To get the best performance you need to create a covering index, that is, include all you columns you need in the "included columns" part of your index, but that will not help you if you use the select *
another issue with the select * approach is that you can't use the cache or the execution plans in an efficient way. If you really need all columns, make sure you specify all the columns instead of the *.
You should also fully quallify the object name to make sure your plan is reusable
you might consider creating an archive database, and move anything after, say, 10-20 years into the archive database. this should drastically speed up your primary production database but retains all of your historical data for reporting needs.
What type of queries are we talking about?
Is this a production table? If yes, look into normalizing a bit more and see if you cannot go a bit further as far as normalizing the DB.
If this is for reports, including a lot of Ad Hoc report queries, this screams data warehouse.
I would create a DW with seperate pre-processed reports which include all the calculation and aggregation you could expect.
I am a bit worried about a business model which involves dealing with BIG data but does not generate enough revenue or even attract enough venture investment to upgrade to enterprise.

What are the methods for identifying unnecessary columns within a covering index?

What methods are there for identifying superfluous columns in covering indices: columns which are never searched against, and therefore may be extracted into Includes, or even removed completely without affecting the applicability of the index?
To clarify things
The idea of a covering index is that it also includes columns which may not be searched by (used in the WHERE clause and such) but may be selected (part of the SELECT columns list).
There doesn't seem to be any easy way to assert the existence of unused colums in a covering index. I can only think of a painstaking process below:
For a representative period of time, record all queries being run on the server (or on the table desired)
Filter out (through regular expression) queries not involving the underlying table
For remaining queries, obtain the query plan; discard queries not involving the index in question
For the remaining queries, or rather for each "template" of query (many queries are same but for the search criteria values), make the list of the columns from the index that are either in select or where clause (or in JOIN...)
the columns from the index not found in that list are positively good to go.
Now, there may be a few more [columns to remove] because the process above doesn't check in which context the covering index is used (it is possible that it be used for resolving the where, but that the underlying table is still accessed as well (for example to get to columns not in the covering index...)
The above clinical approach is rather unattractive. An analytical approach may be preferable:
Find all queries "templates" that may be used in all the applications using the server. For each of these patterns, find the ones which may be using the covering index. These are (again a few holes...) queries that:
include a reference to the underlying table
do not cite in any way a column from the underlying table that is not a column in the index
do not use a search criteria from the underlying table that is more selective that the columns of the index (in their very order...)
Or... without even going to the applications: think of all the use cases, and if queries that would serve these cases would benefit of not from all columns in the index. Doing so would imply that you have a relatively good idea of the selectivity of the index, regarding its first few columns.
If you do audits of your use cases and data points, obviously anything that isn't used or caught in the audit is a candidate for deletion. If the database lacks such a thorough audit, you can save a time-window's worth of queries that hit the database by running a trace and saving it. You can analyze the trace and see what type of queries are hitting the database and from there intuit which columns can be dropped.
Trace analysis is typically used to find candidates for missing indices, but I'm guessing that it could be also used to analyze usage trends.

Resources