Indexing on a column of a huge database that can take only 3 possible values - database

I am trying to understand whether it is worth it, from a performance point of view, to create an index on a column of a huge table (about 90 million records in total).
What I am trying to achieve is fast filtering on the indexed column. The column can have only 3 possible values, and my requirement is to regularly fetch the rows matching two of those values. That comes to about 45 million records (half the table).
Does it make any sense to create an index on a column that can have only 3 possible values when you need to retrieve the rows matching two of them? Also, will creating this index improve the performance of a query with a WHERE clause on that column?
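For concreteness, this is roughly the setup being asked about; the table and column names (Orders, Status, and the values 'A'/'B'/'C') are invented here, since the question doesn't give them. One option SQL Server offers for a low-cardinality column like this is a filtered index restricted to the values that are actually queried:

    -- Hypothetical table: ~90M rows, Status takes one of 3 values ('A', 'B', 'C')
    -- A plain single-column index:
    CREATE INDEX IX_Orders_Status ON dbo.Orders (Status);

    -- A filtered index limited to the two values the queries care about,
    -- which keeps the index smaller than indexing all 90M rows:
    CREATE INDEX IX_Orders_Status_AB
        ON dbo.Orders (Status)
        WHERE Status IN ('A', 'B');

    -- The query in question:
    SELECT *                 -- selecting all columns forces lookups into the base table
    FROM dbo.Orders
    WHERE Status IN ('A', 'B');

Whether the optimizer actually uses either index depends on selectivity; when the predicate matches roughly half the table, as here, a scan of the table (or a covering index) is usually cheaper than tens of millions of key lookups.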

Related

SQL Server Include columns in non clustered index

I have a table with around 30 columns and around 2 million records in it. I am running a select query and getting a key lookup in the execution plan. Is it good practice to put all thirty columns in the INCLUDE clause, or is there another way to solve the issue? I know included columns store their data at the leaf level to satisfy the query. What is the 900 bytes limit? Does it mean that the combined length of all included columns should not exceed 900 characters?
Creating an index with the leading key being the main predicate of the query and including all the other columns will speed up the query and replace the key lookup with an index seek. BUT, and it is a big but, you are then storing the data twice, just in a different order, so a 100MB table becomes a 200MB table. If this is the single most important query of your application then you might be able to justify that.
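As a sketch of what that looks like (the table and column names below are made up, since the question doesn't give them):

    -- Col1 is the main predicate of the query; the INCLUDE list carries the other
    -- columns the query selects, so the key lookup is replaced by an index seek.
    CREATE NONCLUSTERED INDEX IX_MyTable_Col1_Covering
        ON dbo.MyTable (Col1)
        INCLUDE (Col2, Col3, Col4 /* ... and the remaining columns the query needs */);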
Brent Ozar has a general rule of thumb of '5 and 5': no more than 5 indexes per table, with no more than 5 columns per index. I say general because obviously there are situations where that doesn't apply.
If you need all 30 columns then a key lookup in the execution plan is going to be the best bet: have an index on the main predicate and let SQL Server use the clustered key stored in that index to do a key lookup for the other columns.
900 bytes is the maximum size of an index key (the key columns of an index), so whether you hit it depends on the data types of the key columns; columns in the INCLUDE clause are stored at the leaf level and do not count towards that limit.

Azure SQL DW rowstore vs columnstore

I have over 500 million records in Azure SQL Data Warehouse.
I'm trying to do some benchmarking in order to understand which way to keep the records: rowstore or columnstore.
I will not join this table with other tables; it's not an analytical fact table.
Both tables are distributed round-robin, and both of them have 17 partitions and 45 columns.
When I query to sum two columns, I expect the columnstore table to perform much better than the rowstore; however, the reality is that I get my sum result from the rowstore in around 2.5 minutes and from the columnstore in around 10 minutes. I don't use any filter or group by.
On the other hand, when I query count(*), the columnstore table performs much, much better than the rowstore.
EDIT
Though I can't share all the details because they are private, here are some of them just to give an idea of what's going on.
I run queries on smallrc and 100DWU.
The table is loaded with one CTAS and contains pre-joined information from several tables, and it is going to serve queries over a custom-defined protocol (sort/group/filter/paging) from our internal application.
The domain is gambling, and of the 45 columns 43 could be used as filters. The output set usually contains 3 to 4 columns plus two sum columns, with no more than 1000 rows per query.
I partitioned both tables monthly via EventDate, assuming each month gets a new partition. Most of my queries contain EventDate as a filter.
My rowstore table has EventDate as a clustered index, in addition to the same partitioning as the columnstore table.
Adding EventDate as a secondary index on the columnstore gave some improvement, but performance is still far behind the rowstore.
EventDate is an int and the values follow the pattern yyyyMMdd (e.g. 20180101).
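To make the layout described above concrete, here is roughly what it looks like in Azure SQL DW syntax; all object names are invented, and only EventDate, the round-robin distribution, the monthly partitioning, and the secondary index come from the description:

    CREATE TABLE dbo.FactEvents
    (
        EventDate  int NOT NULL,            -- yyyyMMdd, e.g. 20180101
        Amount1    decimal(18, 2) NOT NULL,
        Amount2    decimal(18, 2) NOT NULL
        -- ... the remaining ~42 columns
    )
    WITH
    (
        DISTRIBUTION = ROUND_ROBIN,
        CLUSTERED COLUMNSTORE INDEX,        -- the rowstore variant uses CLUSTERED INDEX (EventDate) here
        PARTITION ( EventDate RANGE RIGHT FOR VALUES
                    (20180101, 20180201, 20180301 /* ... one boundary per month */) )
    );

    -- The secondary (nonclustered) index on the filter column mentioned above:
    CREATE INDEX IX_FactEvents_EventDate ON dbo.FactEvents (EventDate);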
Every DW optimized for elasticity has 60 distributions, while the lower SKUs of DW optimized for compute also have 60 distributions.
SQL Server's columnstore creates row groups based on row count (as opposed to Parquet, for example, where row groups are created based on disk size). Row groups should ideally have 1M rows (see the link that @GregGalloway added), but a row group can get COMPRESSED if at least 100k rows are loaded into it in a single bulk load. When a row group is not compressed, it is stored in row format in delta stores (these are regular B-trees, with some metadata/access overhead since they are part of the clustered columnstore index; note that you cannot index them separately).
I assume that you have 500M rows in 60 distributions, that is 8.3M rows per distribution; assuming your partitioning is homogeneous with 17 partitions you'd have ~490k rows per partition.
When bulk loading into a partitioned table you need to be careful about the memory requirements / resource class you're loading with, since the sort iterator on top of the bulk load does not spill, so it will feed the bulk load only as many rows as it can sort in memory.
Ensure that your index has good quality. If you will only do aggregates over the table without much filtering, then 1 partition is ideal. Even if you do filter, remember that columnstore does segment elimination, so if your data is loaded in the right order you'll be fine.
You should ensure that you have at least a few million rows per partition and that you have COMPRESSED row groups to have good performance. Given your scan results you have most if not all of your columnstore data in OPEN row groups (delta stores).
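One way to check that (a rough sketch; in Azure SQL DW the per-distribution row group catalog view is sys.pdw_nodes_column_store_row_groups, and narrowing it to a single table requires joining through sys.pdw_table_mappings and sys.pdw_nodes_tables):

    -- Summarize row group state across the columnstore tables in the DW:
    -- mostly OPEN/CLOSED row groups means the data is still sitting in delta stores.
    SELECT  state_description,
            COUNT(*)        AS row_group_count,
            SUM(total_rows) AS total_rows
    FROM    sys.pdw_nodes_column_store_row_groups
    GROUP BY state_description;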
What do you mean by much better performance in the case of count(*)?
Also, were these runs cold or warm? If it's a warm run, for count(*) the columnstore might just be grabbing the row group metadata and adding up the row counts, though in both cases the compiled plans show a full table scan.

Comma separated values vs multirow approach for storing data in SQL Server based on performance and handling of data

Scenario: There are 2 tables. Table1 contains users and Table2 contains Hobbies
A user can have multiple hobbies (20-40), and the number of users is over 100k.
Approach 1: Create a UsersHobby table with column 1 as UserID and column 2 as Hobbies, and store the hobbies as comma-separated values. It reduces the number of rows: for example, if there are 100k users and each has at least 20 hobbies, the number of rows will still be 100k. But it violates normalization principles.
Approach 2: Column 1 as UserID and column 2 as Hobby, and store a new row for each hobby. In this case the total number of rows would be 2 million for 100k users, but it follows normalization principles.
Which one is the better approach considering performance and ease of handling the data?
Approach 2 would be better because of normalization and appropriate indexes. Since you have SQL Server 2012, you have the option of a nonclustered columnstore index if your inserts are infrequent and your reads are frequent. A nonclustered columnstore index internally applies compression, which makes for faster IO.
In Approach 2 you can also apply compression to get faster IO, which will be faster than the IO involved in handling comma-separated values as in Approach 1.
But if you have a frequent UI requirement that needs these comma-separated values shown in the UI, then you could still consider Approach 1; the drawbacks are that your inserts/updates will be very slow, since you need custom code to build the comma-separated string, and normal retrieval will also be very slow because you may need to split the string apart at that time.
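A minimal sketch of Approach 2 (the column and table names are adapted from the question; the index names are invented):

    -- One row per (user, hobby)
    CREATE TABLE dbo.UserHobby
    (
        UserID int          NOT NULL,
        Hobby  varchar(100) NOT NULL,
        CONSTRAINT PK_UserHobby PRIMARY KEY CLUSTERED (UserID, Hobby)
    );

    -- Supports "which users have hobby X?" lookups
    CREATE NONCLUSTERED INDEX IX_UserHobby_Hobby ON dbo.UserHobby (Hobby, UserID);

    -- Optional on SQL Server 2012: a nonclustered columnstore index for read-heavy,
    -- insert-light workloads (note: on 2012 the table becomes read-only while it exists).
    -- CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_UserHobby ON dbo.UserHobby (UserID, Hobby);

If the UI needs the comma-separated form, it can be reassembled at query time (on SQL Server 2012, typically with FOR XML PATH) rather than stored that way.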

SQLite: Performance of Rows VS Columns

I have a database table where each row (movie) has a couple of numeric tags (movie categories). Currently I put all these tags in the same column as a string, and search for them using %LIKE%, which requires a slow full table scan when I need to find all movies in a certain category.
I want to speed up searching for these tags, but the only solution I can think of is creating a second table with two integer columns. The first one contains a single category, and the second contains the rowid of the movie.
However, this will require many more inserts into the database. A row has 10 tags on average, so instead of inserting a single row, I have to insert 11 rows. Since my application does much more inserting than querying, insert performance is crucial.
Is there another way to solve this, without sacrificing insert-performance? Or is there no big difference between inserting 1 row with 10 columns VS 10 rows with 2 columns?
You'll have slightly slower insert performance because the indexes need to be updated (at the very least it'll have an index on ROWID, and you need an index on category ID to get a significant speedup). The data size itself is trivial.
However, I'd expect it to be completely dwarfed by transaction overheads (all the calls to fsync(), for one). SQLite is terrible for concurrent write-heavy loads.
If you do more inserting than querying, you may want to rethink your data structure.
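A sketch of the second table described in the question, in SQLite; the movies table and the chosen ids are assumptions for illustration. The main point for insert performance is that the movie and its tag rows go into one transaction, so the whole batch costs a single commit:

    -- One row per (movie, category)
    CREATE TABLE movie_category (
        movie_id    INTEGER NOT NULL,   -- rowid of the movie in the movies table
        category_id INTEGER NOT NULL
    );

    -- The index that turns "all movies in category X" into an index lookup
    -- instead of a LIKE '%...%' full table scan.
    CREATE INDEX idx_movie_category_category ON movie_category (category_id, movie_id);

    -- Insert one movie plus its ~10 tag rows in a single transaction:
    -- one fsync() for the whole batch instead of eleven.
    BEGIN;
    INSERT INTO movies (id, title) VALUES (42, 'Example movie');   -- id supplied by the application
    INSERT INTO movie_category (movie_id, category_id) VALUES (42, 3), (42, 7), (42, 12);
    COMMIT;

    -- Category search:
    SELECT movie_id FROM movie_category WHERE category_id = 7;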

Sql Server 2005 Indexed View

I have created a unique, clustered index on a view. The clustered index contains 5 columns (out of the 30 on this view), but a typical select using this view will want all 30 columns.
Some testing shows that querying just the 5 columns is way faster than querying all 30. Is this just natural overhead from selecting 6x as many columns, or is it because the indexed view is not storing the non-indexed columns, and therefore needs to perform some extra steps to gather the missing columns (joins on the base tables, I guess)?
If the latter, what are some steps to prevent this? Well, even if the former... what are some ways around this?
Edit: for comparison purposes, a select on the indexed view with just the 5 columns is about 10x faster than the same query on the base tables. But a select on all columns is basically equivalent in speed to the query on the base tables.
A clustered index, by definition, contains every field of every row in the table. It is basically a recreation of the table, but with the physical data pages ordered by the clustered key, with a B-tree on top to allow quick access to specific values of your clustered key(s).
Are you just pulling values or are you getting aggregate functions like MIN(), MAX(), AVG(), SUM() for the other 25 fields?
An indexed view is a copy of the data, stored (clustered) potentially (and normally) in a different way to the base table. For all purposes:
- you now have two copies of the data
- SQL Server is smart enough to see that the view and table are aliases of each other, for queries that involve only the columns in the indexed view
- if the indexed view contains all columns, it is considered a full alias and can be used (substituted) by the optimizer wherever the table is queried
- the indexed view can be used as just another index for the base table
When you select only the 5 columns from tbl (which has an indexed view ivw):
- SQL Server completely ignores your table, and just gives you data from ivw
- because the data pages are shorter (5 columns only), more records can be grabbed into memory in each page retrieval, so you get a 5x increase in speed
When you select all 30 columns - there is no way for the indexed view to be helpful. The query completely ignores the view, and just selects data from the base table.
IF you select data from all 30 columns, but the query filters on the first 4 columns of the indexed view, and the filter is very selective (it will result in a very small subset of records), then SQL Server can use the indexed view (scanning/seeking it) to quickly generate a small result set, which it can then use to JOIN back to the base table to get the rest of the data.
However, similarly to regular indexes, an index on (a,b,c,d,e), or in this case a clustered indexed view on (a,b,c,d,e), does NOT help a query that searches on (b,d,e), because those are not the leading columns of the index.
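For reference, a minimal sketch of the kind of indexed view being discussed; the view, table, and column names are invented, with 5 of the view's columns forming the clustered key:

    -- The view must be schema-bound before it can be indexed.
    CREATE VIEW dbo.vOrders
    WITH SCHEMABINDING
    AS
    SELECT OrderID, CustomerID, OrderDate, Status, Amount,
           ShipCity, ShipCountry          -- ... plus the rest of the ~30 columns
    FROM dbo.Orders;
    GO

    -- Materializes the view: only 5 columns form the clustered key, but every column
    -- in the view is stored at the leaf level of this index.
    CREATE UNIQUE CLUSTERED INDEX IX_vOrders
        ON dbo.vOrders (OrderID, CustomerID, OrderDate, Status, Amount);
    GO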
