I have created a unique, clustered index on a view. The clustered index contains 5 columns (out of the 30 on this view), but a typical select using this view will want all 30 columns.
Doing some testing shows that querying for the 5 columns is much faster than querying for all 30 columns. Is this just natural overhead from selecting 6x as many columns, or is it because the indexed view is not storing the non-indexed columns in a temp table, and therefore needs to perform some extra steps to gather the missing columns (joins on the base tables, I guess)?
If the latter, what are some steps to prevent this? Well, even if the former... what are some ways around this!
Edit: for comparison purposes, a select on the indexed view with just the 5 columns is about 10x faster than the same query on the base tables. But a select on all columns is basically equivalent in speed to the query on the base tables.
A clustered index, by definition, contains every field in every row in the table. It basically is a recreation of the table, but with the physical data pages in order by the clustered index, with b-tree sorting to allow quick access to specified values of your clustered key(s).
Are you just pulling values or are you getting aggregate functions like MIN(), MAX(), AVG(), SUM() for the other 25 fields?
An indexed view is a copy of the data, stored (clustered) potentially (and normally) in a different way to the base table. For all intents and purposes:
you now have two copies of the data
SQL Server is smart enough to see that the view and table are aliases of each other
for queries that involve only the columns in the indexed view
if the indexed view contains all columns, it is considered a full alias and can be used (substituted) by the optimizer wherever the table is queried
the indexed view can be used as just another index for the base table
When you select only the 5 columns from tbl (which has an indexed view ivw)
SQL Server completely ignores your table, and just gives you data from ivw
because the data pages are shorter (5 columns only), more records can be grabbed into memory in each page retrieval, so you get a 5x increase in speed
When you select all 30 columns - there is no way for the indexed view to be helpful. The query completely ignores the view, and just selects data from the base table.
IF you select data from all 30 columns,
but the query filters on the first 4 columns of the indexed view,*
and the filter is very selective (will result in a very small subset of records)
SQL Server can use the indexed view (scanning/seeking) to quickly generate a small result set, which it can then use to JOIN back to the base table to get the rest of the data.
However, similarly to regular indexes, an index on (a,b,c,d,e) or in this case clustered indexed view on (a,b,c,d,e) does NOT help a query that searches on (b,d,e) because they are not the first columns in the index.
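As a sketch of that last pattern (hypothetical table, view, and column names, not your actual schema; it assumes the view's clustered key (a, b, c, d, e) uniquely identifies a base-table row), the shape of a query that can benefit looks like this:

-- Seek the narrow indexed view first, then join back to the base table for the rest.
SELECT t.*                                -- all 30 columns come from the base table
FROM dbo.ivw AS v WITH (NOEXPAND)         -- use the indexed view's own index
JOIN dbo.tbl AS t
    ON  t.a = v.a AND t.b = v.b AND t.c = v.c
    AND t.d = v.d AND t.e = v.e
WHERE v.a = 1 AND v.b = 2 AND v.c = 3 AND v.d = 4;   -- selective filter on the leading key columns

On Enterprise edition the optimizer can do this substitution on its own; the explicit NOEXPAND join just makes the mechanism visible.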
Related
I am currently using ClickHouse to store a few billion rows of data each week. We use aggregated tables to fetch data; so far, so good. Now there is a need to fetch a single row from this database.
ClickHouse is not meant for such a case; even after applying some optimizations recommended by ClickHouse, a single-row select is still somewhat slow (a few seconds).
To clarify this a little more: this table is indexed by columns a, b, c, and d and also partitioned monthly (the table has some more columns). A new service has to query this table but only knows a, b, and z (a UUID column). However, the average response is between 3 and 10 seconds over 10 billion rows.
I have an opportunity to add an extra data store layer so that I can store the data into an extra database for this need.
Now the actual question: What would be the best database for a case where we only need to read a single row out of billions?
P.S:
Due to storage and network cost, we can't use Redis
We can't add more columns to the select query to optimize the query
Cassandra?
You can use an additional table and a materialized view to emulate an inverted index.
This additional table should be sorted by z and contain the primary key columns (a, b, c, d) of the main table.
Then query the main table like
select ... from main_table
where (a, b, c, d) in
    (select a, b, c, d from additional_table where z = ...)
  and z = ...
additional_table can be automatically filled by the materialized view from the main_table.
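For illustration, a minimal sketch in ClickHouse SQL (column types and the table/view names are assumptions; main_table is your existing table):

-- Lookup table sorted by z, so a point query on z is a fast primary-key seek.
CREATE TABLE additional_table
(
    z UUID,
    a UInt64,
    b UInt64,
    c UInt64,
    d UInt64
)
ENGINE = MergeTree
ORDER BY z;

-- Materialized view that keeps additional_table filled on every insert into main_table.
CREATE MATERIALIZED VIEW additional_table_mv TO additional_table AS
SELECT z, a, b, c, d
FROM main_table;

Note that a materialized view only captures rows inserted after it is created, so existing data would need a one-off INSERT INTO additional_table SELECT z, a, b, c, d FROM main_table backfill.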
My question is about performance of SQL Server tables.
Assume I have a table that has many columns, for example 30 columns, with 1 column indexed. This table has approximately 30,000 rows.
If I perform a select that selects the indexed column, and one more, for example this:
SELECT IndexedColumn, column1
FROM table
Will this be slower than performing the same select on a table that only has 2 columns, and doing a SELECT * ...
So basically, will the existence of the extra columns slow down the select query even if I am not retrieving the data from the extra columns?
There will be a minor difference at the very end of the process, as you don't have to print/pass the rest of the information to the end client (either SSMS or another app).
When performing a read based on the clustered index, all of the columns (excluding BLOBs) are stored in the same set of pages, so to read the data you have to access the same pages anyway.
You would see a performance increase if you had a nonclustered index on the column list you are after, as those columns would then be saved in their own structure of data pages (so there would be less to read).
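For illustration, a sketch of such a nonclustered (covering) index for the example query above; the table and index names are placeholders:

-- Key on the indexed column, with column1 stored at the leaf level of the index.
CREATE NONCLUSTERED INDEX IX_MyTable_IndexedColumn_covering
    ON dbo.MyTable (IndexedColumn)
    INCLUDE (column1);

-- The query can now be answered from this narrow index alone,
-- without touching the wide clustered-index pages:
SELECT IndexedColumn, column1
FROM dbo.MyTable;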
Assuming that you are using the default clustered index created by SQL Server when defining the primary key on the table in both scenarios, then no, there shouldn't be any performance difference between these two scenarios. Maybe worth just checking it out and generating an actual execution plan to see for yourself? -- Actually, I'm not sure the above is true: given this is rowstore, the wider table won't be able to fit as many rows onto each page, so it will suffer more IO/disk overhead when reading data.
After I created the indexed view, I tried disabling all the indexes in the base tables, including the indexes on the foreign key columns (the constraints are still there), and the query plan for the view stays the same.
It is just like magic to me that the indexed view is able to optimize the query so much even without the base tables being indexed. Even without any additional index on the view, SQL Server is able to do an index scan on the primary key (clustered) index of the indexed view and retrieve data something like 1000 times faster than using the base tables.
Something like SELECT * FROM MyView WITH(NOEXPAND) WHERE NotIndexedColumn = 5 ORDER BY NotIndexedColumn
So the first two questions are:
Is there any benefit to index base tables of indexed view?
What is SQL Server doing when it does an index scan on the PK while the filter is on a non-indexed column?
Then I noticed that if I use full-text search + ORDER BY, I see a Table Spool (Eager Spool) in the query plan with a cost of about 95%.
Query looks like SELECT ID FROM View WITH(NOEXPAND) WHERE CONTAINS(IndexedColumn, '"SomeText*"') ORDER BY IndexedColumn
Question n° 3:
Is there any index I could add to get rid of that operation?
It's important to understand that an indexed view is a "materialized view": its results are stored on disk.
So the speedup you are seeing comes from the fact that the result of the view's query is already stored on disk.
To answer your questions:
1) Is there any benefit to index base tables of indexed view?
This is situational. If your view is flattening out data or computing many aggregate columns, then an indexed view is better than the base table. If you are just using your indexed view like this
SELECT * FROM foo WHERE createdDate > getDate()
then probably not.
But if you are doing something like
SELECT id, SUM(price) FROM x GROUP BY id
then the indexed view would probably be better. Granted, in your case you are doing a more complex query with joins and other advanced options.
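For illustration, a hedged sketch of such an aggregating indexed view (hypothetical names; SQL Server requires SCHEMABINDING, two-part base-table names, and COUNT_BIG(*) alongside GROUP BY, and here price is assumed NOT NULL):

CREATE VIEW dbo.v_price_summary
WITH SCHEMABINDING
AS
SELECT id,
       SUM(price)   AS total_price,
       COUNT_BIG(*) AS row_count      -- mandatory in an indexed view that uses GROUP BY
FROM dbo.x
GROUP BY id;
GO

-- Materializes the aggregate so the GROUP BY is no longer computed at query time.
CREATE UNIQUE CLUSTERED INDEX IX_v_price_summary
    ON dbo.v_price_summary (id);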
2) What is SQL Server doing when it does an index scan on the PK while the filter is on a non-indexed column?
First we need to understand how clustered indexes are stored. The index is stored in a B-tree, so when you search on a clustered index SQL Server walks the tree to find all values that match your criteria. How your indexes are set up (i.e. covering vs. non-covering) and how your nonclustered indexes are defined determines what the pages and extents look like. Without more knowledge of the table structure, I can't help you understand what the scan is actually doing.
3) Is there any index I could add to get rid of that operation?
Just because something accounts for 95% of the query's cost doesn't make it a bad thing. The costs have to add up to 100%, so no matter what you do, something will always take up a large percentage. What you need to check is the number of IO reads and how much time the query itself takes.
To determine this, you need to understand that SQL Server caches the data pages it reads. With this in mind, a query can take a long time the first time, but afterward, since the data itself is cached, it will be much quicker. It all depends on the frequency of the query and how your system is set up.
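For example, to see the reads and timings for the query from the question (standard session settings; DBCC DROPCLEANBUFFERS clears the buffer cache and should only be used on a test server):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Optional, test servers only: start from a cold cache.
-- DBCC DROPCLEANBUFFERS;

SELECT ID
FROM MyView WITH (NOEXPAND)
WHERE CONTAINS(IndexedColumn, '"SomeText*"')
ORDER BY IndexedColumn;

-- The Messages tab then reports logical/physical reads per table
-- and the CPU/elapsed time of the statement.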
For a more in-depth read on indexed view
I have one poorly performing procedure with a couple of queries in it.
I have identified a few queries that scan a temp table. I decided to add an index on the temp table to avoid the table scan. I noticed that multiple columns of the temp table are used in the WHERE clause. However, I am not sure whether I should include all the columns in a single index (a composite index) or create multiple indexes with one column each to gain the maximum performance.
Database is DB2
This all depends greatly on your queries and the data in your table. As a rule of thumb, you should include only the columns that greatly reduce the number of result rows.
If the WHERE clause on the first limiting column already drops, for instance, 90% of the rows, and the next one would only filter out a few hundred more, it is not worth the resources to include it in the index. Always keep in mind that the database engine works with the first column of a composite index first, and then proceeds to the next ones. If your queries don't filter on the leading column(s) of the index, the index won't help and may even slow your queries down.
Also, if you have a lot of data and indexing several columns seems worth it, you might in some cases want separate indexes so that intra-query parallelism can do its work. It is possible that running parallel index lookups on several CPUs performs better, if your server has capacity to spare.
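For illustration, a sketch in DB2 SQL with made-up table and column names: a declared global temporary table with a composite index on the columns most commonly filtered together:

-- Declared temp table (lives in the SESSION schema).
DECLARE GLOBAL TEMPORARY TABLE SESSION.TEMP_ORDERS (
    CUSTOMER_ID INTEGER,
    ORDER_DATE  DATE,
    STATUS      CHAR(1)
) ON COMMIT PRESERVE ROWS NOT LOGGED;

-- Composite index: the leading column should be the most selective, most-used filter.
CREATE INDEX SESSION.IX_TEMP_ORDERS
    ON SESSION.TEMP_ORDERS (CUSTOMER_ID, ORDER_DATE);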
In the case of MySQL, it can use multiple-column indexes for queries that test all the columns in the index, or queries that test just the first column, the first two columns, the first three columns, and so on.
If you specify the columns in the right order in the index definition, a single composite index can speed up several kinds of queries on the same table.
Let's say that you have an index nameIdx (last_name, first_name) created on table test.
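For instance, such an index could be created like this (assuming table test already exists with those columns):

-- Composite index: last_name is the leading column.
ALTER TABLE test ADD INDEX nameIdx (last_name, first_name);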
Therefore, the nameIdx index is used for lookups in the following queries:
SELECT * FROM test WHERE last_name='Widenius';
SELECT * FROM test
WHERE last_name='Widenius' AND first_name='Michael';
SELECT * FROM test
WHERE last_name='Widenius'
AND (first_name='Michael' OR first_name='Monty');
whereas nameIdx is not used for lookups in the following queries:
SELECT * FROM test WHERE first_name='Michael';
SELECT * FROM test
WHERE last_name='Widenius' OR first_name='Michael';
For more detail, refer to the MySQL documentation on multiple-column indexes.
In summary: if your WHERE clause uses the columns in the order given in the index definition (from left to right), a composite index is better than single-column indexes.
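To verify which index a query actually uses, you can look at the key column in EXPLAIN output, using the example table above:

EXPLAIN SELECT * FROM test WHERE last_name = 'Widenius';
-- key should show nameIdx (the composite index is usable)

EXPLAIN SELECT * FROM test WHERE first_name = 'Michael';
-- key should be NULL (the leading column is missing, so the index is skipped)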
I created an indexed view (with a unique clustered index on Table1_ID) using T-SQL like this:
SELECT Table1_ID, COUNT_BIG(*) AS Table2TotalCount
FROM Table2 INNER JOIN Table1 INNER JOIN ...
WHERE Table2_DeletedMark = 0 AND ...
GROUP BY Table1_ID
Also, after creating the view, we created a unique clustered index on column Table1_ID.
So the view consists of two columns:
Table1_ID
Table2TotalCount
The T-SQL for creating the view is heavy because of the GROUP BY and the several million rows in Table2.
But when I run a query against the view like
Select Table2TotalCount from MyView where Table1_ID = k
it executes fast and without overhead for the server.
Also, the T-SQL that creates the view has many conditions in the WHERE clause on Table2 columns. And if I change Table2_DeletedMark to 1 and run the query
Select Table2TotalCount from MyView where Table1_ID = k
again, I get correct results (Table2TotalCount decreased by 1).
So our questions are:
1. Why did the query execution time decrease so much when we used the indexed view, compared to not using the view (even when we ran DBCC DROPCLEANBUFFERS before executing the query against the view)?
2. After changing Table2_DeletedMark, the view is immediately recalculated and we get correct results, but what is the process behind this? We can't imagine that SQL Server re-executes the T-SQL the view was generated from every time we change any value of the 10+ columns involved in the view definition, because that would be too heavy.
We understand that it should be enough to run a simple query to recalculate the values, depending on which column values we change.
But how does SQL Server figure this out?
An indexed view is materialized, i.e. the rows it contains (computed from the tables it depends on) are physically stored on disk, much like a "system-computed" table that's always kept up to date whenever its underlying tables change. This is done by adding the clustered index: the leaf pages of the clustered index on a SQL Server table (or view) are the data pages, really.
Columns in an indexed view can be indexed with non-clustered indexes, too, and thus you can improve query performance even more. The down side is: since the rows are stored, you need disk space (and some data is duplicated, obviously).
A normal view on the other hand is just a fragment of SQL that will be executed to compute the results - based on what you select from that view. There's no physical representation of that view, there are no rows stored for a regular view - they need to be joined together from the base tables as needed.
Why do you think there are so many bizarre rules on what's allowed in indexed views, and what the base tables are allowed to do? It's so that the SQL engine can immediately know "If I'm touching this row, it potentially affects the result of this view - let's see, this row no longer fits the view criteria, but I insisted on having a COUNT_BIG(*), so I can just decrement that value by one"
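As a minimal sketch of that mechanism (simplified to a single base table with a hypothetical schema, not the poster's actual definition): the combination of SCHEMABINDING, COUNT_BIG(*), and a unique clustered index on the grouping key is exactly what lets the engine maintain the stored rows incrementally.

CREATE VIEW dbo.Table2Counts
WITH SCHEMABINDING
AS
SELECT Table1_ID, COUNT_BIG(*) AS Table2TotalCount
FROM dbo.Table2
WHERE Table2_DeletedMark = 0
GROUP BY Table1_ID;
GO

CREATE UNIQUE CLUSTERED INDEX IX_Table2Counts
    ON dbo.Table2Counts (Table1_ID);

-- When a Table2 row's Table2_DeletedMark flips from 0 to 1, that row no longer
-- matches the view's WHERE clause, so SQL Server seeks the single materialized
-- row for its Table1_ID via the clustered index and decrements Table2TotalCount;
-- nothing is re-aggregated from scratch.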